[PATCH RFC exit] Sleep at TASK_IDLE when waiting for application core dump

Paul E. McKenney posted 1 patch 1 year, 4 months ago
[PATCH RFC exit] Sleep at TASK_IDLE when waiting for application core dump
Posted by Paul E. McKenney 1 year, 4 months ago
Currently, the coredump_task_exit() function sets the task state to
TASK_UNINTERRUPTIBLE|TASK_FREEZABLE, which usually works well.  But a
combination of large memory and slow (and/or highly contended) mass
storage can cause application core dumps to take more than two minutes,
which can triggers "task blocked" splats.  There does not seem to be
any reasonable benefit to getting these splats.

Therefore, change that TASK_UNINTERRUPTIBLE to TASK_IDLE.

Reported-by: Anhad Jai Singh <ffledgling@meta.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Rik van Riel <riel@surriel.com>

diff --git a/kernel/exit.c b/kernel/exit.c
index f95a2c1338a8..b0d18f7b6d15 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -429,7 +429,7 @@ static void coredump_task_exit(struct task_struct *tsk)
 			complete(&core_state->startup);
 
 		for (;;) {
-			set_current_state(TASK_UNINTERRUPTIBLE|TASK_FREEZABLE);
+			set_current_state(TASK_IDLE|TASK_FREEZABLE);
 			if (!self.task) /* see coredump_finish() */
 				break;
 			schedule();
Re: [PATCH RFC exit] Sleep at TASK_IDLE when waiting for application core dump
Posted by Oleg Nesterov 1 year, 4 months ago
On 07/24, Paul E. McKenney wrote:
>
> Currently, the coredump_task_exit() function sets the task state to
> TASK_UNINTERRUPTIBLE|TASK_FREEZABLE, which usually works well.  But a
> combination of large memory and slow (and/or highly contended) mass
> storage can cause application core dumps to take more than two minutes,
> which can triggers "task blocked" splats.

Do you mean check_hung_uninterruptible_tasks() ?

In any case,

> Therefore, change that TASK_UNINTERRUPTIBLE to TASK_IDLE.
...
> @@ -429,7 +429,7 @@ static void coredump_task_exit(struct task_struct *tsk)
>  			complete(&core_state->startup);
>
>  		for (;;) {
> -			set_current_state(TASK_UNINTERRUPTIBLE|TASK_FREEZABLE);
> +			set_current_state(TASK_IDLE|TASK_FREEZABLE);

To me this change makes sense regardless...

To some degree TASK_UNINTERRUPTIBLE is misleading, in that the task which
sleeps in coredump_task_exit() is _KILLABLE, although not "directly".

A SIGKILL sent to the coredumping process will interrupt the coredumping thread
(see the signal->core_state check in prepare_signal() and fatal_signal_pending()
check in dump_interrupted()) and then this thread will wakeup the threads sleeping
in coredump_task_exit(), see coredump_finish().

Acked-by: Oleg Nesterov <oleg@redhat.com>
Re: [PATCH RFC exit] Sleep at TASK_IDLE when waiting for application core dump
Posted by Paul E. McKenney 1 year, 4 months ago
On Thu, Jul 25, 2024 at 03:29:35PM +0200, Oleg Nesterov wrote:
> On 07/24, Paul E. McKenney wrote:
> >
> > Currently, the coredump_task_exit() function sets the task state to
> > TASK_UNINTERRUPTIBLE|TASK_FREEZABLE, which usually works well.  But a
> > combination of large memory and slow (and/or highly contended) mass
> > storage can cause application core dumps to take more than two minutes,
> > which can triggers "task blocked" splats.
> 
> Do you mean check_hung_uninterruptible_tasks() ?

Yes, from its call to check_hung_task().  Good point, I will add that
to the commit log.

> In any case,
> 
> > Therefore, change that TASK_UNINTERRUPTIBLE to TASK_IDLE.
> ...
> > @@ -429,7 +429,7 @@ static void coredump_task_exit(struct task_struct *tsk)
> >  			complete(&core_state->startup);
> >
> >  		for (;;) {
> > -			set_current_state(TASK_UNINTERRUPTIBLE|TASK_FREEZABLE);
> > +			set_current_state(TASK_IDLE|TASK_FREEZABLE);
> 
> To me this change makes sense regardless...
> 
> To some degree TASK_UNINTERRUPTIBLE is misleading, in that the task which
> sleeps in coredump_task_exit() is _KILLABLE, although not "directly".
> 
> A SIGKILL sent to the coredumping process will interrupt the coredumping thread
> (see the signal->core_state check in prepare_signal() and fatal_signal_pending()
> check in dump_interrupted()) and then this thread will wakeup the threads sleeping
> in coredump_task_exit(), see coredump_finish().

Very good, I will add this to the commit log with attribution.

> Acked-by: Oleg Nesterov <oleg@redhat.com>

Thank you for the thorough review!  How does the updated patch shown
below look to you?

							Thanx, Paul

------------------------------------------------------------------------

commit a6c7779283d67a409b81616a5b485ac21637d7e7
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Wed Jul 24 16:51:52 2024 -0700

    exit: Sleep at TASK_IDLE when waiting for application core dump
    
    Currently, the coredump_task_exit() function sets the task state
    to TASK_UNINTERRUPTIBLE|TASK_FREEZABLE, which usually works well.
    But a combination of large memory and slow (and/or highly contended)
    mass storage can cause application core dumps to take more than
    two minutes, which can cause check_hung_task(), which is invoked by
    check_hung_uninterruptible_tasks(), to produce task-blocked splats.
    There does not seem to be any reasonable benefit to getting these splats.
    
    Furthermore, as Oleg Nesterov points out, TASK_UNINTERRUPTIBLE could
    be misleading because the task sleeping in coredump_task_exit() really
    is killable, albeit indirectly.  See the check of signal->core_state
    in prepare_signal() and the check of fatal_signal_pending()
    in dump_interrupted(), which bypass the normal unkillability of
    TASK_UNINTERRUPTIBLE, resulting in coredump_finish() invoking
    wake_up_process() on any threads sleeping in coredump_task_exit().
    
    Therefore, change that TASK_UNINTERRUPTIBLE to TASK_IDLE.
    
    Reported-by: Anhad Jai Singh <ffledgling@meta.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Acked-by: Oleg Nesterov <oleg@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Chris Mason <clm@fb.com>
    Cc: Rik van Riel <riel@surriel.com>

diff --git a/kernel/exit.c b/kernel/exit.c
index f95a2c1338a8..b0d18f7b6d15 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -429,7 +429,7 @@ static void coredump_task_exit(struct task_struct *tsk)
 			complete(&core_state->startup);
 
 		for (;;) {
-			set_current_state(TASK_UNINTERRUPTIBLE|TASK_FREEZABLE);
+			set_current_state(TASK_IDLE|TASK_FREEZABLE);
 			if (!self.task) /* see coredump_finish() */
 				break;
 			schedule();
Re: [PATCH RFC exit] Sleep at TASK_IDLE when waiting for application core dump
Posted by Oleg Nesterov 1 year, 4 months ago
On 07/25, Paul E. McKenney wrote:
>
> How does the updated patch shown below look to you?

Thanks, looks good to me ;)

Oleg.

> ------------------------------------------------------------------------
> 
> commit a6c7779283d67a409b81616a5b485ac21637d7e7
> Author: Paul E. McKenney <paulmck@kernel.org>
> Date:   Wed Jul 24 16:51:52 2024 -0700
> 
>     exit: Sleep at TASK_IDLE when waiting for application core dump
>     
>     Currently, the coredump_task_exit() function sets the task state
>     to TASK_UNINTERRUPTIBLE|TASK_FREEZABLE, which usually works well.
>     But a combination of large memory and slow (and/or highly contended)
>     mass storage can cause application core dumps to take more than
>     two minutes, which can cause check_hung_task(), which is invoked by
>     check_hung_uninterruptible_tasks(), to produce task-blocked splats.
>     There does not seem to be any reasonable benefit to getting these splats.
>     
>     Furthermore, as Oleg Nesterov points out, TASK_UNINTERRUPTIBLE could
>     be misleading because the task sleeping in coredump_task_exit() really
>     is killable, albeit indirectly.  See the check of signal->core_state
>     in prepare_signal() and the check of fatal_signal_pending()
>     in dump_interrupted(), which bypass the normal unkillability of
>     TASK_UNINTERRUPTIBLE, resulting in coredump_finish() invoking
>     wake_up_process() on any threads sleeping in coredump_task_exit().
>     
>     Therefore, change that TASK_UNINTERRUPTIBLE to TASK_IDLE.
>     
>     Reported-by: Anhad Jai Singh <ffledgling@meta.com>
>     Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
>     Acked-by: Oleg Nesterov <oleg@redhat.com>
>     Cc: Jens Axboe <axboe@kernel.dk>
>     Cc: Christian Brauner <brauner@kernel.org>
>     Cc: Andrew Morton <akpm@linux-foundation.org>
>     Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
>     Cc: Chris Mason <clm@fb.com>
>     Cc: Rik van Riel <riel@surriel.com>
> 
> diff --git a/kernel/exit.c b/kernel/exit.c
> index f95a2c1338a8..b0d18f7b6d15 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -429,7 +429,7 @@ static void coredump_task_exit(struct task_struct *tsk)
>  			complete(&core_state->startup);
>  
>  		for (;;) {
> -			set_current_state(TASK_UNINTERRUPTIBLE|TASK_FREEZABLE);
> +			set_current_state(TASK_IDLE|TASK_FREEZABLE);
>  			if (!self.task) /* see coredump_finish() */
>  				break;
>  			schedule();
>