exit: complete synchronize_group_exit

[PATCH 01/17] signal: Make SIGKILL during coredumps an explicit special case

Posted by Eric W. Biederman 1 year, 7 months ago


Simplify the code that allows SIGKILL during coredumps to terminate
the coredump.  As far as I can tell I have avoided breaking this
case by dumb luck.

Historically with all of the other threads stopping in exit_mm the
wants_signal loop in complete_signal would find the dumper task and
then complete_signal would wake the dumper task with signal_wake_up.

After moving the coredump_task_exit above the setting of PF_EXITING in
commit 92307383082d ("coredump: Don't perform any cleanups before
dumping core") wants_signal will consider all of the threads in a
multi-threaded process for waking up, not just the core dumping task.

Luckily complete_signal short circuits SIGKILL during a coredump marks
every thread with SIGKILL and signal_wake_up.  This code is arguably
buggy however as it tries to skip creating a group exit when is already
present, and it fails that a coredump is in progress.

Ever since commit 06af8679449d ("coredump: Limit what can interrupt
coredumps") was added, dump_interrupted needs not just TIF_SIGPENDING
set on the dumper task but also SIGKILL set in it's pending bitmap.
This means that if the code is ever fixed not to short-circuit and
kill a process after it has already been killed the special case
for SIGKILL during a coredump will be broken.

Sort all of this out by making the coredump special case more special.
Perform all of the work in prepare_signal and leave the rest of the
signal delivery path out of it.

In prepare_signal when the process coredumping is sent SIGKILL find
the task performing the coredump and use sigaddset and signal_wake_up
to ensure that task reports fatal_signal_pending.

Return false from prepare_signal to tell the rest of the signal
delivery path to ignore the signal.

Remove the "signal->core_state || !(signal->flags &&
SIGNAL_GROUP_EXIT)" test from complete_signal as signal delivery after
process exit does not reach complete_signal.

I have tested this and verified I did not break SIGKILL during
coredumps by accident (before or after this change).  I actually
thought I had and I had to figure out what I had misread that kept
SIGKILL during coredumps working.

v1: https://lkml.kernel.org/r/20211213225350.27481-1-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/signal.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 1f9dd41c04be..e3662fff919a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -907,8 +907,12 @@ static bool prepare_signal(int sig, struct task_struct *p, bool force)
 	sigset_t flush;
 
 	if (signal->flags & SIGNAL_GROUP_EXIT) {
-		if (signal->core_state)
-			return sig == SIGKILL;
+		if (signal->core_state && (sig == SIGKILL)) {
+			struct task_struct *dumper =
+				signal->core_state->dumper.task;
+			sigaddset(&dumper->pending.signal, SIGKILL);
+			signal_wake_up(dumper, 1);
+		}
 		/*
 		 * The process is in the middle of dying, drop the signal.
 		 */
@@ -1033,7 +1037,6 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
 	 * then start taking the whole group down immediately.
 	 */
 	if (sig_fatal(p, sig) &&
-	    (signal->core_state || !(signal->flags & SIGNAL_GROUP_EXIT)) &&
 	    !sigismember(&t->real_blocked, sig) &&
 	    (sig == SIGKILL || !p->ptrace)) {
 		/*
-- 
2.41.0

Re: [PATCH 01/17] signal: Make SIGKILL during coredumps an explicit special case

Posted by Oleg Nesterov 1 year, 7 months ago

Hi Eric,

I'll _try_ to read this (nontrivial) changes this week. To be honest,
right now I don't really understand your goals after the quick glance...

So far I have only looked at this simple 1/17 and it doesn't look right
to me.

On 06/18, Eric W. Biederman wrote:
>
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -907,8 +907,12 @@ static bool prepare_signal(int sig, struct task_struct *p, bool force)
>  	sigset_t flush;
>  
>  	if (signal->flags & SIGNAL_GROUP_EXIT) {
> -		if (signal->core_state)
> -			return sig == SIGKILL;
> +		if (signal->core_state && (sig == SIGKILL)) {
> +			struct task_struct *dumper =
> +				signal->core_state->dumper.task;
> +			sigaddset(&dumper->pending.signal, SIGKILL);
> +			signal_wake_up(dumper, 1);
> +		}

and after that it returns false so __send_signal_locked/send_sigqueue simply
return. This means:

	- the caller will wrongly report TRACE_SIGNAL_IGNORED

	- complete_signal() won't be called, so signal->group_exit_code
	  won't be updated.

	  coredump_finish() won't change it too so the process will exit
	  with group_exit_code == signr /* coredumping signal */.

	  Yes, the fix is obvious and trivial...

Oleg.

Re: [PATCH 01/17] signal: Make SIGKILL during coredumps an explicit special case

Posted by Eric W. Biederman 1 year, 7 months ago

Oleg Nesterov <oleg@redhat.com> writes:

> Hi Eric,
>
> I'll _try_ to read this (nontrivial) changes this week. To be honest,
> right now I don't really understand your goals after the quick glance...
>
> So far I have only looked at this simple 1/17 and it doesn't look right
> to me.

It might be worth applying them all on a branch and just looking at the
end result.

The interactions between all of the ways a process can exit wind up
being different, but being clean.

> On 06/18, Eric W. Biederman wrote:
>>
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -907,8 +907,12 @@ static bool prepare_signal(int sig, struct task_struct *p, bool force)
>>  	sigset_t flush;
>>  
>>  	if (signal->flags & SIGNAL_GROUP_EXIT) {
>> -		if (signal->core_state)
>> -			return sig == SIGKILL;
>> +		if (signal->core_state && (sig == SIGKILL)) {
>> +			struct task_struct *dumper =
>> +				signal->core_state->dumper.task;
>> +			sigaddset(&dumper->pending.signal, SIGKILL);
>> +			signal_wake_up(dumper, 1);
>> +		}
>
> and after that it returns false so __send_signal_locked/send_sigqueue simply
> return. This means:
>
> 	- the caller will wrongly report TRACE_SIGNAL_IGNORED

Fair.

>
> 	- complete_signal() won't be called, so signal->group_exit_code
> 	  won't be updated.
>
> 	  coredump_finish() won't change it too so the process will exit
> 	  with group_exit_code == signr /* coredumping signal */.
>
> 	  Yes, the fix is obvious and trivial...

The signal handling from the coredump is arguably correct.  The process
has already exited, and gotten an exit code.

But I really don't care about the exit_code either way.  I just want to
make ``killing'' a dead process while it core dumps independent of
complete_signal.

That ``killing'' of a dead process is a completely special case.

Eric

Re: [PATCH 01/17] signal: Make SIGKILL during coredumps an explicit special case

Posted by Oleg Nesterov 1 year, 7 months ago

On 06/19, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > Hi Eric,
> >
> > I'll _try_ to read this (nontrivial) changes this week. To be honest,
> > right now I don't really understand your goals after the quick glance...
> >
> > So far I have only looked at this simple 1/17 and it doesn't look right
> > to me.
>
> It might be worth applying them all on a branch and just looking at the
> end result.

Perhaps. Say, the next 2/17 patch. I'd say it is very difficult to understand
the purpose unless you read the next patches. OK, at least the change log
mentions "in preparation".

> > 	- complete_signal() won't be called, so signal->group_exit_code
> > 	  won't be updated.
> >
> > 	  coredump_finish() won't change it too so the process will exit
> > 	  with group_exit_code == signr /* coredumping signal */.
> >
> > 	  Yes, the fix is obvious and trivial...
>
> The signal handling from the coredump is arguably correct.  The process
> has already exited, and gotten an exit code.

And zap_process() sets roup_exit_code = signr. But,

> But I really don't care about the exit_code either way.  I just want to
> make ``killing'' a dead process while it core dumps independent of
> complete_signal.
>
> That ``killing'' of a dead process is a completely special case.

Sorry I fail to understand...

If the coredumping process is killed by SIGKILL, it should exit with
group_exit_code = SIGKILL, right? At least this is what we have now.

Oleg.

Re: [PATCH 01/17] signal: Make SIGKILL during coredumps an explicit special case

Posted by Eric W. Biederman 1 year, 7 months ago

Oleg Nesterov <oleg@redhat.com> writes:

> On 06/19, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <oleg@redhat.com> writes:
>>
>> > Hi Eric,
>> >
>> > I'll _try_ to read this (nontrivial) changes this week. To be honest,
>> > right now I don't really understand your goals after the quick glance...
>> >
>> > So far I have only looked at this simple 1/17 and it doesn't look right
>> > to me.
>>
>> It might be worth applying them all on a branch and just looking at the
>> end result.
>
> Perhaps. Say, the next 2/17 patch. I'd say it is very difficult to understand
> the purpose unless you read the next patches. OK, at least the change log
> mentions "in preparation".
>
>> > 	- complete_signal() won't be called, so signal->group_exit_code
>> > 	  won't be updated.
>> >
>> > 	  coredump_finish() won't change it too so the process will exit
>> > 	  with group_exit_code == signr /* coredumping signal */.
>> >
>> > 	  Yes, the fix is obvious and trivial...
>>
>> The signal handling from the coredump is arguably correct.  The process
>> has already exited, and gotten an exit code.
>
> And zap_process() sets roup_exit_code = signr. But,
>
>> But I really don't care about the exit_code either way.  I just want to
>> make ``killing'' a dead process while it core dumps independent of
>> complete_signal.
>>
>> That ``killing'' of a dead process is a completely special case.
>
> Sorry I fail to understand...
>
> If the coredumping process is killed by SIGKILL, it should exit with
> group_exit_code = SIGKILL, right? At least this is what we have now.

In general when a fatal signal is sent:
- It is short circuit delivered.
- If SIGNAL_GROUP_EXIT is not set
   SIGNAL_GROUP_EXIT is set
   signal->group_exit_code is set

Under those rules group_exit_code should not be updated.  Frankly no
signals should even be processed (except to be queued) after a fatal
signal comes in.

There is an issue that short circuit delivery does not work on coredump
signals (because of the way the coredump code works).  So it winds up
being zap_threads that tests if SIGNAL_GROUP_EXIT is already set and
zap_process that sets SIGNAL_GROUP_EXIT.  Essentially the logic remains
the same, and importantly no later signal is able to set
group_exit_code.  Or really have any effect because the signal sent was
fatal.

Except except except when the kernel in the context of the userspace
process is writing a coredump for that userspace process.  Then we allow
the kernel to be sent SIGKILL to stop it's coredumping activities
because sometimes it can block indefinitely otherwise.

Which is why I call handling that SIGKILL after a coredump has
begun and SIGNAL_GROUP_EXIT is already set a completely special case.

We might have to change group_exit_code to SIGKILL in that special case,
if someone in userspace cares.  However I expect no one cares.

Further if adding support for SIGKILL during a coredump were being added
from scratch.  The semantics I would choose would be for that SIGKILL
and indeed all of the coredumping activity would be invisible to
userspace except for the delay to make it happen.  Otherwise a coredump
which every occasionally gets it's return code changed could introduce
heisenbugs.

But none of this is documented in the change description and at a bare
minimum this change of behavior should be before such code is merged.

Eric

Re: [PATCH 01/17] signal: Make SIGKILL during coredumps an explicit special case

Posted by Oleg Nesterov 1 year, 7 months ago

Another case when I can hardly understand your reply...

This patch adds a minor user visible change, that was my point.

If you say that the new behaviour is better / more consistent -
I won't really argue, "I expect no one cares" below is probably
true. In my opinion group_exit_code = SIGKILL makes more sense
in this special case, but again, I won't insist.

But then this change should be mentioned and explained in the
changelog, agree?

As for "zap_threads that tests if SIGNAL_GROUP_EXIT is already set",
this is another thing but probably I misundertood you. It is not that
zap_threads/zap_process do not set ->group_exit_code in this case,
in this case do_coredump() will be aborted.

And to remind, zap_threads() used to set SIGNAL_GROUP_COREDUMP, not
SIGNAL_GROUP_EXIT. Because to me the coredumping process is not exiting
yet, it tries to handle the coredumping signal. That is why I prefer
group_exit_code = SIGKILL if it is killed during the dump. But this is
slightly offtopic today.

Oleg.

On 06/21, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > On 06/19, Eric W. Biederman wrote:
> >>
> >> Oleg Nesterov <oleg@redhat.com> writes:
> >>
> >> > Hi Eric,
> >> >
> >> > I'll _try_ to read this (nontrivial) changes this week. To be honest,
> >> > right now I don't really understand your goals after the quick glance...
> >> >
> >> > So far I have only looked at this simple 1/17 and it doesn't look right
> >> > to me.
> >>
> >> It might be worth applying them all on a branch and just looking at the
> >> end result.
> >
> > Perhaps. Say, the next 2/17 patch. I'd say it is very difficult to understand
> > the purpose unless you read the next patches. OK, at least the change log
> > mentions "in preparation".
> >
> >> > 	- complete_signal() won't be called, so signal->group_exit_code
> >> > 	  won't be updated.
> >> >
> >> > 	  coredump_finish() won't change it too so the process will exit
> >> > 	  with group_exit_code == signr /* coredumping signal */.
> >> >
> >> > 	  Yes, the fix is obvious and trivial...
> >>
> >> The signal handling from the coredump is arguably correct.  The process
> >> has already exited, and gotten an exit code.
> >
> > And zap_process() sets roup_exit_code = signr. But,
> >
> >> But I really don't care about the exit_code either way.  I just want to
> >> make ``killing'' a dead process while it core dumps independent of
> >> complete_signal.
> >>
> >> That ``killing'' of a dead process is a completely special case.
> >
> > Sorry I fail to understand...
> >
> > If the coredumping process is killed by SIGKILL, it should exit with
> > group_exit_code = SIGKILL, right? At least this is what we have now.
>
> In general when a fatal signal is sent:
> - It is short circuit delivered.
> - If SIGNAL_GROUP_EXIT is not set
>    SIGNAL_GROUP_EXIT is set
>    signal->group_exit_code is set
>
> Under those rules group_exit_code should not be updated.  Frankly no
> signals should even be processed (except to be queued) after a fatal
> signal comes in.
>
> There is an issue that short circuit delivery does not work on coredump
> signals (because of the way the coredump code works).  So it winds up
> being zap_threads that tests if SIGNAL_GROUP_EXIT is already set and
> zap_process that sets SIGNAL_GROUP_EXIT.  Essentially the logic remains
> the same, and importantly no later signal is able to set
> group_exit_code.  Or really have any effect because the signal sent was
> fatal.
>
> Except except except when the kernel in the context of the userspace
> process is writing a coredump for that userspace process.  Then we allow
> the kernel to be sent SIGKILL to stop it's coredumping activities
> because sometimes it can block indefinitely otherwise.
>
> Which is why I call handling that SIGKILL after a coredump has
> begun and SIGNAL_GROUP_EXIT is already set a completely special case.
>
> We might have to change group_exit_code to SIGKILL in that special case,
> if someone in userspace cares.  However I expect no one cares.
>
> Further if adding support for SIGKILL during a coredump were being added
> from scratch.  The semantics I would choose would be for that SIGKILL
> and indeed all of the coredumping activity would be invisible to
> userspace except for the delay to make it happen.  Otherwise a coredump
> which every occasionally gets it's return code changed could introduce
> heisenbugs.
>
> But none of this is documented in the change description and at a bare
> minimum this change of behavior should be before such code is merged.
>
> Eric
>

Re: [PATCH 01/17] signal: Make SIGKILL during coredumps an explicit special case

Posted by Eric W. Biederman 1 year, 7 months ago

Oleg Nesterov <oleg@redhat.com> writes:

> Another case when I can hardly understand your reply...
>
> This patch adds a minor user visible change, that was my point.
>
> If you say that the new behaviour is better / more consistent -
> I won't really argue, "I expect no one cares" below is probably
> true. In my opinion group_exit_code = SIGKILL makes more sense
> in this special case, but again, I won't insist.
>
> But then this change should be mentioned and explained in the
> changelog, agree?

I very much agree.  It was an oversight and bug not to have included
that in the change description.

> As for "zap_threads that tests if SIGNAL_GROUP_EXIT is already set",
> this is another thing but probably I misundertood you. It is not that
> zap_threads/zap_process do not set ->group_exit_code in this case,
> in this case do_coredump() will be aborted.
>
> And to remind, zap_threads() used to set SIGNAL_GROUP_COREDUMP, not
> SIGNAL_GROUP_EXIT. Because to me the coredumping process is not exiting
> yet, it tries to handle the coredumping signal. That is why I prefer
> group_exit_code = SIGKILL if it is killed during the dump. But this is
> slightly offtopic today.

Slightly.

A major goal of this set of changes is to unify all of the process
teardown in complete_signal, do_group_exit, and zap_process into a
single subroutine for consistency.

When a coredump is not generated the code for dumpable signals and
other fatal signals should be the same.  Including short circuit
delivery.  It isn't today.

My rougher in progress patchset that follows this one makes, teaches
get_signal to dequeue signals that have been processed with short
circuit delivery and makes it so that do_coredump is just a little
bit of extra code that runs.  With the net result that all of the code
is simpler and easier to reason about.

Messing with the coredump code today is a real pain because of io_uring
and those funny interactions.

Eric