io_uring truncating coredumps

Posted by Eric W. Biederman 4 years, 5 months ago

Subject updated to reflect the current discussion.

> Linus Torvalds <torvalds@linux-foundation.org> writes:

> But I really think it's wrong.
> 
> You're trying to work around a problem the wrong way around. If a task
> is dead, and is dumping core, then signals just shouldn't matter in
> the first place, and thus the whole "TASK_INTERRUPTIBLE vs
> TASK_UNINTERRUPTIBLE" really shouldn't be an issue. The fact that it
> is an issue means there's something wrong in signaling, not in the
> pipe code.
> 
> So I really think that's where the fix should be - on the signal delivery side.

Thinking about it from the perspective of not delivering the wake-ups
fixing io_uring and coredumps in a non-hacky way looks comparatively
simple.  The function task_work_add just needs to not wake anything up
after a process has started dying.

Something like the patch below.

The only tricky part I can see is making certain there are not any races
between task_work_add and do_coredump depending on task_work_add not
causing signal_pending to return true.

diff --git a/kernel/task_work.c b/kernel/task_work.c
index fad745c59234..5f941e377268 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -44,6 +44,9 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
 		work->next = head;
 	} while (cmpxchg(&task->task_works, head, work) != head);

+	if (notify && (task->signal->flags & SIGNAL_GROUP_EXIT))
+		return 0;
+
 	switch (notify) {
 	case TWA_NONE:
 		break;

Eric

Re: io_uring truncating coredumps

Posted by Linus Torvalds 4 years, 5 months ago

On Mon, Jan 17, 2022 at 8:47 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Thinking about it from the perspective of not delivering the wake-ups
> fixing io_uring and coredumps in a non-hacky way looks comparatively
> simple.  The function task_work_add just needs to not wake anything up
> after a process has started dying.
>
> Something like the patch below.

Hmm. Yes, I think this is the right direction.

That said, I think it should not add the work at all, and return
-ESRCH, the exact same way that it does for that work_exited
condition.

Because it's basically the same thing: the task is dead and shouldn't
do more work. In fact, task_work_run() is the thing that sets it to
&work_exited as it sees PF_EXITING, so it feels to me that THAT is
actually the issue here - we react to PF_EXITING too late. We react to
it *after* we've already added the work, and then we do that "no more
work" logic only after we've accepted those late work entries?

So my gut feel is that task_work_add() should just also test PF_EXITING.

And in fact, my gut feel is that PF_EXITING is too late anyway (it
happens after core-dumping, no?)

But I guess that thing may be on purpose, and maybe the act of dumping
core itself wants to do more work, and so that isn't an option?

So I don't think your patch is "right" as-is, and it all worries me,
but yes, I think this area is very much the questionable one.

I think that work stopping and the io_uring shutdown should probably
move earlier in the exit queue, but as mentioned above, maybe the work
addition boundary in particular really wants to be late because the
exit process itself still uses task works? ;(

              Linus