Simplify the code that allows SIGKILL during coredumps to terminate
the coredump. As far as I can tell I have avoided breaking this
case by dumb luck.
Historically with all of the other threads stopping in exit_mm the
wants_signal loop in complete_signal would find the dumper task and
then complete_signal would wake the dumper task with signal_wake_up.
After moving the coredump_task_exit above the setting of PF_EXITING in
commit 92307383082d ("coredump: Don't perform any cleanups before
dumping core") wants_signal will consider all of the threads in a
multi-threaded process for waking up, not just the core dumping task.
Luckily complete_signal short circuits SIGKILL during a coredump marks
every thread with SIGKILL and signal_wake_up. This code is arguably
buggy however as it tries to skip creating a group exit when is already
present, and it fails that a coredump is in progress.
Ever since commit 06af8679449d ("coredump: Limit what can interrupt
coredumps") was added, dump_interrupted needs not just TIF_SIGPENDING
set on the dumper task but also SIGKILL set in it's pending bitmap.
This means that if the code is ever fixed not to short-circuit and
kill a process after it has already been killed the special case
for SIGKILL during a coredump will be broken.
Sort all of this out by making the coredump special case more special.
Perform all of the work in prepare_signal and leave the rest of the
signal delivery path out of it.
In prepare_signal when the process coredumping is sent SIGKILL find
the task performing the coredump and use sigaddset and signal_wake_up
to ensure that task reports fatal_signal_pending.
Return false from prepare_signal to tell the rest of the signal
delivery path to ignore the signal.
Remove the "signal->core_state || !(signal->flags &&
SIGNAL_GROUP_EXIT)" test from complete_signal as signal delivery after
process exit does not reach complete_signal.
I have tested this and verified I did not break SIGKILL during
coredumps by accident (before or after this change). I actually
thought I had and I had to figure out what I had misread that kept
SIGKILL during coredumps working.
v1: https://lkml.kernel.org/r/20211213225350.27481-1-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
kernel/signal.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/kernel/signal.c b/kernel/signal.c
index 1f9dd41c04be..e3662fff919a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -907,8 +907,12 @@ static bool prepare_signal(int sig, struct task_struct *p, bool force)
sigset_t flush;
if (signal->flags & SIGNAL_GROUP_EXIT) {
- if (signal->core_state)
- return sig == SIGKILL;
+ if (signal->core_state && (sig == SIGKILL)) {
+ struct task_struct *dumper =
+ signal->core_state->dumper.task;
+ sigaddset(&dumper->pending.signal, SIGKILL);
+ signal_wake_up(dumper, 1);
+ }
/*
* The process is in the middle of dying, drop the signal.
*/
@@ -1033,7 +1037,6 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
* then start taking the whole group down immediately.
*/
if (sig_fatal(p, sig) &&
- (signal->core_state || !(signal->flags & SIGNAL_GROUP_EXIT)) &&
!sigismember(&t->real_blocked, sig) &&
(sig == SIGKILL || !p->ptrace)) {
/*
--
2.41.0
Hi Eric,
I'll _try_ to read this (nontrivial) changes this week. To be honest,
right now I don't really understand your goals after the quick glance...
So far I have only looked at this simple 1/17 and it doesn't look right
to me.
On 06/18, Eric W. Biederman wrote:
>
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -907,8 +907,12 @@ static bool prepare_signal(int sig, struct task_struct *p, bool force)
> sigset_t flush;
>
> if (signal->flags & SIGNAL_GROUP_EXIT) {
> - if (signal->core_state)
> - return sig == SIGKILL;
> + if (signal->core_state && (sig == SIGKILL)) {
> + struct task_struct *dumper =
> + signal->core_state->dumper.task;
> + sigaddset(&dumper->pending.signal, SIGKILL);
> + signal_wake_up(dumper, 1);
> + }
and after that it returns false so __send_signal_locked/send_sigqueue simply
return. This means:
- the caller will wrongly report TRACE_SIGNAL_IGNORED
- complete_signal() won't be called, so signal->group_exit_code
won't be updated.
coredump_finish() won't change it too so the process will exit
with group_exit_code == signr /* coredumping signal */.
Yes, the fix is obvious and trivial...
Oleg.
Oleg Nesterov <oleg@redhat.com> writes:
> Hi Eric,
>
> I'll _try_ to read this (nontrivial) changes this week. To be honest,
> right now I don't really understand your goals after the quick glance...
>
> So far I have only looked at this simple 1/17 and it doesn't look right
> to me.
It might be worth applying them all on a branch and just looking at the
end result.
The interactions between all of the ways a process can exit wind up
being different, but being clean.
> On 06/18, Eric W. Biederman wrote:
>>
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -907,8 +907,12 @@ static bool prepare_signal(int sig, struct task_struct *p, bool force)
>> sigset_t flush;
>>
>> if (signal->flags & SIGNAL_GROUP_EXIT) {
>> - if (signal->core_state)
>> - return sig == SIGKILL;
>> + if (signal->core_state && (sig == SIGKILL)) {
>> + struct task_struct *dumper =
>> + signal->core_state->dumper.task;
>> + sigaddset(&dumper->pending.signal, SIGKILL);
>> + signal_wake_up(dumper, 1);
>> + }
>
> and after that it returns false so __send_signal_locked/send_sigqueue simply
> return. This means:
>
> - the caller will wrongly report TRACE_SIGNAL_IGNORED
Fair.
>
> - complete_signal() won't be called, so signal->group_exit_code
> won't be updated.
>
> coredump_finish() won't change it too so the process will exit
> with group_exit_code == signr /* coredumping signal */.
>
> Yes, the fix is obvious and trivial...
The signal handling from the coredump is arguably correct. The process
has already exited, and gotten an exit code.
But I really don't care about the exit_code either way. I just want to
make ``killing'' a dead process while it core dumps independent of
complete_signal.
That ``killing'' of a dead process is a completely special case.
Eric
On 06/19, Eric W. Biederman wrote: > > Oleg Nesterov <oleg@redhat.com> writes: > > > Hi Eric, > > > > I'll _try_ to read this (nontrivial) changes this week. To be honest, > > right now I don't really understand your goals after the quick glance... > > > > So far I have only looked at this simple 1/17 and it doesn't look right > > to me. > > It might be worth applying them all on a branch and just looking at the > end result. Perhaps. Say, the next 2/17 patch. I'd say it is very difficult to understand the purpose unless you read the next patches. OK, at least the change log mentions "in preparation". > > - complete_signal() won't be called, so signal->group_exit_code > > won't be updated. > > > > coredump_finish() won't change it too so the process will exit > > with group_exit_code == signr /* coredumping signal */. > > > > Yes, the fix is obvious and trivial... > > The signal handling from the coredump is arguably correct. The process > has already exited, and gotten an exit code. And zap_process() sets roup_exit_code = signr. But, > But I really don't care about the exit_code either way. I just want to > make ``killing'' a dead process while it core dumps independent of > complete_signal. > > That ``killing'' of a dead process is a completely special case. Sorry I fail to understand... If the coredumping process is killed by SIGKILL, it should exit with group_exit_code = SIGKILL, right? At least this is what we have now. Oleg.
Oleg Nesterov <oleg@redhat.com> writes: > On 06/19, Eric W. Biederman wrote: >> >> Oleg Nesterov <oleg@redhat.com> writes: >> >> > Hi Eric, >> > >> > I'll _try_ to read this (nontrivial) changes this week. To be honest, >> > right now I don't really understand your goals after the quick glance... >> > >> > So far I have only looked at this simple 1/17 and it doesn't look right >> > to me. >> >> It might be worth applying them all on a branch and just looking at the >> end result. > > Perhaps. Say, the next 2/17 patch. I'd say it is very difficult to understand > the purpose unless you read the next patches. OK, at least the change log > mentions "in preparation". > >> > - complete_signal() won't be called, so signal->group_exit_code >> > won't be updated. >> > >> > coredump_finish() won't change it too so the process will exit >> > with group_exit_code == signr /* coredumping signal */. >> > >> > Yes, the fix is obvious and trivial... >> >> The signal handling from the coredump is arguably correct. The process >> has already exited, and gotten an exit code. > > And zap_process() sets roup_exit_code = signr. But, > >> But I really don't care about the exit_code either way. I just want to >> make ``killing'' a dead process while it core dumps independent of >> complete_signal. >> >> That ``killing'' of a dead process is a completely special case. > > Sorry I fail to understand... > > If the coredumping process is killed by SIGKILL, it should exit with > group_exit_code = SIGKILL, right? At least this is what we have now. In general when a fatal signal is sent: - It is short circuit delivered. - If SIGNAL_GROUP_EXIT is not set SIGNAL_GROUP_EXIT is set signal->group_exit_code is set Under those rules group_exit_code should not be updated. Frankly no signals should even be processed (except to be queued) after a fatal signal comes in. There is an issue that short circuit delivery does not work on coredump signals (because of the way the coredump code works). So it winds up being zap_threads that tests if SIGNAL_GROUP_EXIT is already set and zap_process that sets SIGNAL_GROUP_EXIT. Essentially the logic remains the same, and importantly no later signal is able to set group_exit_code. Or really have any effect because the signal sent was fatal. Except except except when the kernel in the context of the userspace process is writing a coredump for that userspace process. Then we allow the kernel to be sent SIGKILL to stop it's coredumping activities because sometimes it can block indefinitely otherwise. Which is why I call handling that SIGKILL after a coredump has begun and SIGNAL_GROUP_EXIT is already set a completely special case. We might have to change group_exit_code to SIGKILL in that special case, if someone in userspace cares. However I expect no one cares. Further if adding support for SIGKILL during a coredump were being added from scratch. The semantics I would choose would be for that SIGKILL and indeed all of the coredumping activity would be invisible to userspace except for the delay to make it happen. Otherwise a coredump which every occasionally gets it's return code changed could introduce heisenbugs. But none of this is documented in the change description and at a bare minimum this change of behavior should be before such code is merged. Eric
Another case when I can hardly understand your reply... This patch adds a minor user visible change, that was my point. If you say that the new behaviour is better / more consistent - I won't really argue, "I expect no one cares" below is probably true. In my opinion group_exit_code = SIGKILL makes more sense in this special case, but again, I won't insist. But then this change should be mentioned and explained in the changelog, agree? As for "zap_threads that tests if SIGNAL_GROUP_EXIT is already set", this is another thing but probably I misundertood you. It is not that zap_threads/zap_process do not set ->group_exit_code in this case, in this case do_coredump() will be aborted. And to remind, zap_threads() used to set SIGNAL_GROUP_COREDUMP, not SIGNAL_GROUP_EXIT. Because to me the coredumping process is not exiting yet, it tries to handle the coredumping signal. That is why I prefer group_exit_code = SIGKILL if it is killed during the dump. But this is slightly offtopic today. Oleg. On 06/21, Eric W. Biederman wrote: > > Oleg Nesterov <oleg@redhat.com> writes: > > > On 06/19, Eric W. Biederman wrote: > >> > >> Oleg Nesterov <oleg@redhat.com> writes: > >> > >> > Hi Eric, > >> > > >> > I'll _try_ to read this (nontrivial) changes this week. To be honest, > >> > right now I don't really understand your goals after the quick glance... > >> > > >> > So far I have only looked at this simple 1/17 and it doesn't look right > >> > to me. > >> > >> It might be worth applying them all on a branch and just looking at the > >> end result. > > > > Perhaps. Say, the next 2/17 patch. I'd say it is very difficult to understand > > the purpose unless you read the next patches. OK, at least the change log > > mentions "in preparation". > > > >> > - complete_signal() won't be called, so signal->group_exit_code > >> > won't be updated. > >> > > >> > coredump_finish() won't change it too so the process will exit > >> > with group_exit_code == signr /* coredumping signal */. > >> > > >> > Yes, the fix is obvious and trivial... > >> > >> The signal handling from the coredump is arguably correct. The process > >> has already exited, and gotten an exit code. > > > > And zap_process() sets roup_exit_code = signr. But, > > > >> But I really don't care about the exit_code either way. I just want to > >> make ``killing'' a dead process while it core dumps independent of > >> complete_signal. > >> > >> That ``killing'' of a dead process is a completely special case. > > > > Sorry I fail to understand... > > > > If the coredumping process is killed by SIGKILL, it should exit with > > group_exit_code = SIGKILL, right? At least this is what we have now. > > In general when a fatal signal is sent: > - It is short circuit delivered. > - If SIGNAL_GROUP_EXIT is not set > SIGNAL_GROUP_EXIT is set > signal->group_exit_code is set > > Under those rules group_exit_code should not be updated. Frankly no > signals should even be processed (except to be queued) after a fatal > signal comes in. > > There is an issue that short circuit delivery does not work on coredump > signals (because of the way the coredump code works). So it winds up > being zap_threads that tests if SIGNAL_GROUP_EXIT is already set and > zap_process that sets SIGNAL_GROUP_EXIT. Essentially the logic remains > the same, and importantly no later signal is able to set > group_exit_code. Or really have any effect because the signal sent was > fatal. > > Except except except when the kernel in the context of the userspace > process is writing a coredump for that userspace process. Then we allow > the kernel to be sent SIGKILL to stop it's coredumping activities > because sometimes it can block indefinitely otherwise. > > Which is why I call handling that SIGKILL after a coredump has > begun and SIGNAL_GROUP_EXIT is already set a completely special case. > > We might have to change group_exit_code to SIGKILL in that special case, > if someone in userspace cares. However I expect no one cares. > > Further if adding support for SIGKILL during a coredump were being added > from scratch. The semantics I would choose would be for that SIGKILL > and indeed all of the coredumping activity would be invisible to > userspace except for the delay to make it happen. Otherwise a coredump > which every occasionally gets it's return code changed could introduce > heisenbugs. > > But none of this is documented in the change description and at a bare > minimum this change of behavior should be before such code is merged. > > Eric >
Oleg Nesterov <oleg@redhat.com> writes: > Another case when I can hardly understand your reply... > > This patch adds a minor user visible change, that was my point. > > If you say that the new behaviour is better / more consistent - > I won't really argue, "I expect no one cares" below is probably > true. In my opinion group_exit_code = SIGKILL makes more sense > in this special case, but again, I won't insist. > > But then this change should be mentioned and explained in the > changelog, agree? I very much agree. It was an oversight and bug not to have included that in the change description. > As for "zap_threads that tests if SIGNAL_GROUP_EXIT is already set", > this is another thing but probably I misundertood you. It is not that > zap_threads/zap_process do not set ->group_exit_code in this case, > in this case do_coredump() will be aborted. > > And to remind, zap_threads() used to set SIGNAL_GROUP_COREDUMP, not > SIGNAL_GROUP_EXIT. Because to me the coredumping process is not exiting > yet, it tries to handle the coredumping signal. That is why I prefer > group_exit_code = SIGKILL if it is killed during the dump. But this is > slightly offtopic today. Slightly. A major goal of this set of changes is to unify all of the process teardown in complete_signal, do_group_exit, and zap_process into a single subroutine for consistency. When a coredump is not generated the code for dumpable signals and other fatal signals should be the same. Including short circuit delivery. It isn't today. My rougher in progress patchset that follows this one makes, teaches get_signal to dequeue signals that have been processed with short circuit delivery and makes it so that do_coredump is just a little bit of extra code that runs. With the net result that all of the code is simpler and easier to reason about. Messing with the coredump code today is a real pain because of io_uring and those funny interactions. Eric
© 2016 - 2026 Red Hat, Inc.