[v1] Suppress undesirable hung task warnings.

[PATCH 0/3] Suppress undesirable hung task warnings.

Posted by Julian Sun 1 week, 2 days ago

As suggested by Andrew Morton in [1], we need a general mechanism 
that allows the hung task detector to ignore unnecessary hung 
tasks. This patch set implements this functionality.

Patch 1 introduces a PF_DONT_HUNG flag. The hung task detector will 
ignores all tasks that have the PF_DONT_HUNG flag set.

Patch 2 introduces wait_event_no_hung() and wb_wait_for_completion_no_hung(), 
which enable the hung task detector to ignore hung tasks caused by these
wait events.

Patch 3 uses wb_wait_for_completion_no_hung() in the final phase of memcg 
teardown to eliminate the hung task warning.

Julian Sun (3):
  sched: Introduce a new flag PF_DONT_HUNG.
  writeback: Introduce wb_wait_for_completion_no_hung().
  memcg: Don't trigger hung task when memcg is releasing.

 fs/fs-writeback.c           | 15 +++++++++++++++
 include/linux/backing-dev.h |  1 +
 include/linux/sched.h       | 12 +++++++++++-
 include/linux/wait.h        | 15 +++++++++++++++
 kernel/hung_task.c          |  6 ++++++
 mm/memcontrol.c             |  2 +-
 6 files changed, 49 insertions(+), 2 deletions(-)

-- 
2.39.5

Re: [PATCH 0/3] Suppress undesirable hung task warnings.

Posted by Peter Zijlstra 1 week, 2 days ago

On Mon, Sep 22, 2025 at 05:41:43PM +0800, Julian Sun wrote:
> As suggested by Andrew Morton in [1], we need a general mechanism 
> that allows the hung task detector to ignore unnecessary hung 
> tasks. This patch set implements this functionality.
> 
> Patch 1 introduces a PF_DONT_HUNG flag. The hung task detector will 
> ignores all tasks that have the PF_DONT_HUNG flag set.
> 
> Patch 2 introduces wait_event_no_hung() and wb_wait_for_completion_no_hung(), 
> which enable the hung task detector to ignore hung tasks caused by these
> wait events.
> 
> Patch 3 uses wb_wait_for_completion_no_hung() in the final phase of memcg 
> teardown to eliminate the hung task warning.
> 
> Julian Sun (3):
>   sched: Introduce a new flag PF_DONT_HUNG.
>   writeback: Introduce wb_wait_for_completion_no_hung().
>   memcg: Don't trigger hung task when memcg is releasing.

This is all quite terrible. I'm not at all sure why a task that is
genuinely not making progress and isn't killable should not be reported.

Re: [PATCH 0/3] Suppress undesirable hung task warnings.

Posted by Christoph Hellwig 1 week, 2 days ago

On Mon, Sep 22, 2025 at 03:27:18PM +0200, Peter Zijlstra wrote:
> > Julian Sun (3):
> >   sched: Introduce a new flag PF_DONT_HUNG.
> >   writeback: Introduce wb_wait_for_completion_no_hung().
> >   memcg: Don't trigger hung task when memcg is releasing.
> 
> This is all quite terrible. I'm not at all sure why a task that is
> genuinely not making progress and isn't killable should not be reported.

The hung device detector is way to aggressive for very slow I/O.
See blk_wait_io, which has been around for a long time to work
around just that.  Given that this series targets writeback I suspect
it is about an overloaded device as well.

> 
---end quoted text---

Re: [PATCH 0/3] Suppress undesirable hung task warnings.

Posted by Andrew Morton 1 week, 2 days ago

On Mon, 22 Sep 2025 11:08:32 -0700 Christoph Hellwig <hch@infradead.org> wrote:

> On Mon, Sep 22, 2025 at 03:27:18PM +0200, Peter Zijlstra wrote:
> > > Julian Sun (3):
> > >   sched: Introduce a new flag PF_DONT_HUNG.
> > >   writeback: Introduce wb_wait_for_completion_no_hung().
> > >   memcg: Don't trigger hung task when memcg is releasing.
> > 
> > This is all quite terrible. I'm not at all sure why a task that is
> > genuinely not making progress and isn't killable should not be reported.
> 
> The hung device detector is way to aggressive for very slow I/O.
> See blk_wait_io, which has been around for a long time to work
> around just that.  Given that this series targets writeback I suspect
> it is about an overloaded device as well.

Yup, it's writeback - the bug report is in
https://lkml.kernel.org/r/20250917212959.355656-1-sunjunchao@bytedance.com

Memory is big and storage is slow, there's nothing wrong if a task
which is designed to wait for writeback waits for a long time.

Of course, there's something wrong if some other task which isn't
designed to wait for writeback gets stuck waiting for the task which
*is* designed to wait for writeback, but we'll still warn about that.

Regarding an implementation, I'm wondering if we can put a flag in
`struct completion' telling the hung task detector that this one is
expected to wait for long periods sometimes.  Probably messy and it
only works for completions (not semaphores, mutexes, etc).  Just
putting it out there ;)

Re: [PATCH 0/3] Suppress undesirable hung task warnings.

Posted by Peter Zijlstra 1 week, 1 day ago

On Mon, Sep 22, 2025 at 02:50:45PM -0700, Andrew Morton wrote:
> On Mon, 22 Sep 2025 11:08:32 -0700 Christoph Hellwig <hch@infradead.org> wrote:
> 
> > On Mon, Sep 22, 2025 at 03:27:18PM +0200, Peter Zijlstra wrote:
> > > > Julian Sun (3):
> > > >   sched: Introduce a new flag PF_DONT_HUNG.
> > > >   writeback: Introduce wb_wait_for_completion_no_hung().
> > > >   memcg: Don't trigger hung task when memcg is releasing.
> > > 
> > > This is all quite terrible. I'm not at all sure why a task that is
> > > genuinely not making progress and isn't killable should not be reported.
> > 
> > The hung device detector is way to aggressive for very slow I/O.
> > See blk_wait_io, which has been around for a long time to work
> > around just that.  Given that this series targets writeback I suspect
> > it is about an overloaded device as well.
> 
> Yup, it's writeback - the bug report is in
> https://lkml.kernel.org/r/20250917212959.355656-1-sunjunchao@bytedance.com
> 
> Memory is big and storage is slow, there's nothing wrong if a task
> which is designed to wait for writeback waits for a long time.
> 
> Of course, there's something wrong if some other task which isn't
> designed to wait for writeback gets stuck waiting for the task which
> *is* designed to wait for writeback, but we'll still warn about that.
> 
> 
> Regarding an implementation, I'm wondering if we can put a flag in
> `struct completion' telling the hung task detector that this one is
> expected to wait for long periods sometimes.  Probably messy and it
> only works for completions (not semaphores, mutexes, etc).  Just
> putting it out there ;)

So the problem is that there *is* progress (albeit rather slowly), the
watchdog just doesn't see that. Perhaps that is the thing we should look
at fixing.

How about something like the below? That will 'spuriously' wake up the
waiters as long as there is some progress being made. Thereby increasing
the context switch counters of the tasks and thus the hung_task watchdog
sees progress.

This approach should be safer than the blk_wait_io() hack, which has a
timer ticking, regardless of actual completions happening or not.

---

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a07b8cf73ae2..1326193b4d95 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -174,9 +174,10 @@ static void finish_writeback_work(struct wb_writeback_work *work)
 		kfree(work);
 	if (done) {
 		wait_queue_head_t *waitq = done->waitq;
+		bool force_wake = (jiffies - done->stamp) > HZ/2;
 
 		/* @done can't be accessed after the following dec */
-		if (atomic_dec_and_test(&done->cnt))
+		if (atomic_dec_and_test(&done->cnt) || force_wake)
 			wake_up_all(waitq);
 	}
 }
@@ -213,7 +214,7 @@ static void wb_queue_work(struct bdi_writeback *wb,
 void wb_wait_for_completion(struct wb_completion *done)
 {
 	atomic_dec(&done->cnt);		/* put down the initial count */
-	wait_event(*done->waitq, !atomic_read(&done->cnt));
+	wait_event(*done->waitq, ({ done->stamp = jiffies; !atomic_read(&done->cnt); }));
 }
 
 #ifdef CONFIG_CGROUP_WRITEBACK
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 2ad261082bba..197593193ce3 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -63,6 +63,7 @@ enum wb_reason {
 struct wb_completion {
 	atomic_t		cnt;
 	wait_queue_head_t	*waitq;
+	unsigned long		stamp;
 };
 
 #define __WB_COMPLETION_INIT(_waitq)	\

Re: [PATCH 0/3] Suppress undesirable hung task warnings.

Posted by Jan Kara 1 week ago

On Tue 23-09-25 09:16:07, Peter Zijlstra wrote:
> On Mon, Sep 22, 2025 at 02:50:45PM -0700, Andrew Morton wrote:
> > On Mon, 22 Sep 2025 11:08:32 -0700 Christoph Hellwig <hch@infradead.org> wrote:
> > 
> > > On Mon, Sep 22, 2025 at 03:27:18PM +0200, Peter Zijlstra wrote:
> > > > > Julian Sun (3):
> > > > >   sched: Introduce a new flag PF_DONT_HUNG.
> > > > >   writeback: Introduce wb_wait_for_completion_no_hung().
> > > > >   memcg: Don't trigger hung task when memcg is releasing.
> > > > 
> > > > This is all quite terrible. I'm not at all sure why a task that is
> > > > genuinely not making progress and isn't killable should not be reported.
> > > 
> > > The hung device detector is way to aggressive for very slow I/O.
> > > See blk_wait_io, which has been around for a long time to work
> > > around just that.  Given that this series targets writeback I suspect
> > > it is about an overloaded device as well.
> > 
> > Yup, it's writeback - the bug report is in
> > https://lkml.kernel.org/r/20250917212959.355656-1-sunjunchao@bytedance.com
> > 
> > Memory is big and storage is slow, there's nothing wrong if a task
> > which is designed to wait for writeback waits for a long time.
> > 
> > Of course, there's something wrong if some other task which isn't
> > designed to wait for writeback gets stuck waiting for the task which
> > *is* designed to wait for writeback, but we'll still warn about that.
> > 
> > 
> > Regarding an implementation, I'm wondering if we can put a flag in
> > `struct completion' telling the hung task detector that this one is
> > expected to wait for long periods sometimes.  Probably messy and it
> > only works for completions (not semaphores, mutexes, etc).  Just
> > putting it out there ;)
> 
> So the problem is that there *is* progress (albeit rather slowly), the
> watchdog just doesn't see that. Perhaps that is the thing we should look
> at fixing.
> 
> How about something like the below? That will 'spuriously' wake up the
> waiters as long as there is some progress being made. Thereby increasing
> the context switch counters of the tasks and thus the hung_task watchdog
> sees progress.
> 
> This approach should be safer than the blk_wait_io() hack, which has a
> timer ticking, regardless of actual completions happening or not.

I like the idea. The problem with your patch is that the progress is not
visible with high enough granularity in wb_writeback_work->done completion.
That is only incremented by 1, when say a request to writeout 1GB is queued
and decremented by 1 when that 1GB is written. The progress can be observed
with higher granularity by wb_writeback_work->nr_pages getting decremented
as we submit pages for writeback but this counter still gets updated only
once we are done with a particular inode so if all those 1GB of data are in
one inode there wouldn't be much to observe. So we might need to observe
how struct writeback_control member nr_to_write gets updated. That is
really updated frequently on IO submission but each filesystem updates it
in their writepages() function so implementing that gets messy pretty
quickly.

But maybe a good place to hook into for registering progress would be
wbc_init_bio()? Filesystems call that whenever we create new bio for writeback
purposes. We do have struct writeback_control available there so through
that we could propagate information that forward progress is being made.

What do people think?

								Honza

> ---
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index a07b8cf73ae2..1326193b4d95 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -174,9 +174,10 @@ static void finish_writeback_work(struct wb_writeback_work *work)
>  		kfree(work);
>  	if (done) {
>  		wait_queue_head_t *waitq = done->waitq;
> +		bool force_wake = (jiffies - done->stamp) > HZ/2;
>  
>  		/* @done can't be accessed after the following dec */
> -		if (atomic_dec_and_test(&done->cnt))
> +		if (atomic_dec_and_test(&done->cnt) || force_wake)
>  			wake_up_all(waitq);
>  	}
>  }
> @@ -213,7 +214,7 @@ static void wb_queue_work(struct bdi_writeback *wb,
>  void wb_wait_for_completion(struct wb_completion *done)
>  {
>  	atomic_dec(&done->cnt);		/* put down the initial count */
> -	wait_event(*done->waitq, !atomic_read(&done->cnt));
> +	wait_event(*done->waitq, ({ done->stamp = jiffies; !atomic_read(&done->cnt); }));
>  }
>  
>  #ifdef CONFIG_CGROUP_WRITEBACK
> diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> index 2ad261082bba..197593193ce3 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -63,6 +63,7 @@ enum wb_reason {
>  struct wb_completion {
>  	atomic_t		cnt;
>  	wait_queue_head_t	*waitq;
> +	unsigned long		stamp;
>  };
>  
>  #define __WB_COMPLETION_INIT(_waitq)	\
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

Re: [External] Re: [PATCH 0/3] Suppress undesirable hung task warnings.

Posted by Julian Sun 6 days, 12 hours ago

Hi,

On Wed, Sep 24, 2025 at 6:34 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 23-09-25 09:16:07, Peter Zijlstra wrote:
> > On Mon, Sep 22, 2025 at 02:50:45PM -0700, Andrew Morton wrote:
> > > On Mon, 22 Sep 2025 11:08:32 -0700 Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > > On Mon, Sep 22, 2025 at 03:27:18PM +0200, Peter Zijlstra wrote:
> > > > > > Julian Sun (3):
> > > > > >   sched: Introduce a new flag PF_DONT_HUNG.
> > > > > >   writeback: Introduce wb_wait_for_completion_no_hung().
> > > > > >   memcg: Don't trigger hung task when memcg is releasing.
> > > > >
> > > > > This is all quite terrible. I'm not at all sure why a task that is
> > > > > genuinely not making progress and isn't killable should not be reported.
> > > >
> > > > The hung device detector is way to aggressive for very slow I/O.
> > > > See blk_wait_io, which has been around for a long time to work
> > > > around just that.  Given that this series targets writeback I suspect
> > > > it is about an overloaded device as well.
> > >
> > > Yup, it's writeback - the bug report is in
> > > https://lkml.kernel.org/r/20250917212959.355656-1-sunjunchao@bytedance.com
> > >
> > > Memory is big and storage is slow, there's nothing wrong if a task
> > > which is designed to wait for writeback waits for a long time.
> > >
> > > Of course, there's something wrong if some other task which isn't
> > > designed to wait for writeback gets stuck waiting for the task which
> > > *is* designed to wait for writeback, but we'll still warn about that.
> > >
> > >
> > > Regarding an implementation, I'm wondering if we can put a flag in
> > > `struct completion' telling the hung task detector that this one is
> > > expected to wait for long periods sometimes.  Probably messy and it
> > > only works for completions (not semaphores, mutexes, etc).  Just
> > > putting it out there ;)
> >
> > So the problem is that there *is* progress (albeit rather slowly), the
> > watchdog just doesn't see that. Perhaps that is the thing we should look
> > at fixing.
> >
> > How about something like the below? That will 'spuriously' wake up the
> > waiters as long as there is some progress being made. Thereby increasing
> > the context switch counters of the tasks and thus the hung_task watchdog
> > sees progress.
> >
> > This approach should be safer than the blk_wait_io() hack, which has a
> > timer ticking, regardless of actual completions happening or not.
>
> I like the idea. The problem with your patch is that the progress is not
> visible with high enough granularity in wb_writeback_work->done completion.
> That is only incremented by 1, when say a request to writeout 1GB is queued
> and decremented by 1 when that 1GB is written. The progress can be observed
> with higher granularity by wb_writeback_work->nr_pages getting decremented
> as we submit pages for writeback but this counter still gets updated only
> once we are done with a particular inode so if all those 1GB of data are in
> one inode there wouldn't be much to observe. So we might need to observe
> how struct writeback_control member nr_to_write gets updated. That is
> really updated frequently on IO submission but each filesystem updates it
> in their writepages() function so implementing that gets messy pretty
> quickly.
>
> But maybe a good place to hook into for registering progress would be
> wbc_init_bio()? Filesystems call that whenever we create new bio for writeback
> purposes. We do have struct writeback_control available there so through
> that we could propagate information that forward progress is being made.
>
> What do people think?

Sorry for the late reply. Yes, Jan, I agree — your proposal sounds
both fine-grained and elegant. But do we really have a strong need for
such detailed progress tracking?

In background writeback, for example, if the bandwidth is very low
(e.g. avg_write_bandwidth=24), writeback_chunk_size() already splits
pages into chunks of MIN_WRITEBACK_PAGES (1024). This is usually
enough to avoid hung task warnings, so reporting progress there might
be sufficient.

I’m also a bit concerned that reporting progress on every
wbc_init_bio() could lead to excessive wakeups in normal or
high-throughput cases, which might have side effects. Please correct
me if I’m missing something.

>
>                                                                 Honza
>
> > ---
> >
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index a07b8cf73ae2..1326193b4d95 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -174,9 +174,10 @@ static void finish_writeback_work(struct wb_writeback_work *work)
> >               kfree(work);
> >       if (done) {
> >               wait_queue_head_t *waitq = done->waitq;
> > +             bool force_wake = (jiffies - done->stamp) > HZ/2;
> >
> >               /* @done can't be accessed after the following dec */
> > -             if (atomic_dec_and_test(&done->cnt))
> > +             if (atomic_dec_and_test(&done->cnt) || force_wake)
> >                       wake_up_all(waitq);
> >       }
> >  }
> > @@ -213,7 +214,7 @@ static void wb_queue_work(struct bdi_writeback *wb,
> >  void wb_wait_for_completion(struct wb_completion *done)
> >  {
> >       atomic_dec(&done->cnt);         /* put down the initial count */
> > -     wait_event(*done->waitq, !atomic_read(&done->cnt));
> > +     wait_event(*done->waitq, ({ done->stamp = jiffies; !atomic_read(&done->cnt); }));
> >  }
> >
> >  #ifdef CONFIG_CGROUP_WRITEBACK
> > diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> > index 2ad261082bba..197593193ce3 100644
> > --- a/include/linux/backing-dev-defs.h
> > +++ b/include/linux/backing-dev-defs.h
> > @@ -63,6 +63,7 @@ enum wb_reason {
> >  struct wb_completion {
> >       atomic_t                cnt;
> >       wait_queue_head_t       *waitq;
> > +     unsigned long           stamp;
> >  };
> >
> >  #define __WB_COMPLETION_INIT(_waitq) \
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


Thanks,
-- 
Julian Sun <sunjunchao@bytedance.com>

Re: [External] Re: [PATCH 0/3] Suppress undesirable hung task warnings.

Posted by Jan Kara 6 days, 11 hours ago

On Thu 25-09-25 23:07:24, Julian Sun wrote:
> On Wed, Sep 24, 2025 at 6:34 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Tue 23-09-25 09:16:07, Peter Zijlstra wrote:
> > > On Mon, Sep 22, 2025 at 02:50:45PM -0700, Andrew Morton wrote:
> > > > On Mon, 22 Sep 2025 11:08:32 -0700 Christoph Hellwig <hch@infradead.org> wrote:
> > > >
> > > > > On Mon, Sep 22, 2025 at 03:27:18PM +0200, Peter Zijlstra wrote:
> > > > > > > Julian Sun (3):
> > > > > > >   sched: Introduce a new flag PF_DONT_HUNG.
> > > > > > >   writeback: Introduce wb_wait_for_completion_no_hung().
> > > > > > >   memcg: Don't trigger hung task when memcg is releasing.
> > > > > >
> > > > > > This is all quite terrible. I'm not at all sure why a task that is
> > > > > > genuinely not making progress and isn't killable should not be reported.
> > > > >
> > > > > The hung device detector is way to aggressive for very slow I/O.
> > > > > See blk_wait_io, which has been around for a long time to work
> > > > > around just that.  Given that this series targets writeback I suspect
> > > > > it is about an overloaded device as well.
> > > >
> > > > Yup, it's writeback - the bug report is in
> > > > https://lkml.kernel.org/r/20250917212959.355656-1-sunjunchao@bytedance.com
> > > >
> > > > Memory is big and storage is slow, there's nothing wrong if a task
> > > > which is designed to wait for writeback waits for a long time.
> > > >
> > > > Of course, there's something wrong if some other task which isn't
> > > > designed to wait for writeback gets stuck waiting for the task which
> > > > *is* designed to wait for writeback, but we'll still warn about that.
> > > >
> > > >
> > > > Regarding an implementation, I'm wondering if we can put a flag in
> > > > `struct completion' telling the hung task detector that this one is
> > > > expected to wait for long periods sometimes.  Probably messy and it
> > > > only works for completions (not semaphores, mutexes, etc).  Just
> > > > putting it out there ;)
> > >
> > > So the problem is that there *is* progress (albeit rather slowly), the
> > > watchdog just doesn't see that. Perhaps that is the thing we should look
> > > at fixing.
> > >
> > > How about something like the below? That will 'spuriously' wake up the
> > > waiters as long as there is some progress being made. Thereby increasing
> > > the context switch counters of the tasks and thus the hung_task watchdog
> > > sees progress.
> > >
> > > This approach should be safer than the blk_wait_io() hack, which has a
> > > timer ticking, regardless of actual completions happening or not.
> >
> > I like the idea. The problem with your patch is that the progress is not
> > visible with high enough granularity in wb_writeback_work->done completion.
> > That is only incremented by 1, when say a request to writeout 1GB is queued
> > and decremented by 1 when that 1GB is written. The progress can be observed
> > with higher granularity by wb_writeback_work->nr_pages getting decremented
> > as we submit pages for writeback but this counter still gets updated only
> > once we are done with a particular inode so if all those 1GB of data are in
> > one inode there wouldn't be much to observe. So we might need to observe
> > how struct writeback_control member nr_to_write gets updated. That is
> > really updated frequently on IO submission but each filesystem updates it
> > in their writepages() function so implementing that gets messy pretty
> > quickly.
> >
> > But maybe a good place to hook into for registering progress would be
> > wbc_init_bio()? Filesystems call that whenever we create new bio for writeback
> > purposes. We do have struct writeback_control available there so through
> > that we could propagate information that forward progress is being made.
> >
> > What do people think?
> 
> Sorry for the late reply. Yes, Jan, I agree — your proposal sounds
> both fine-grained and elegant. But do we really have a strong need for
> such detailed progress tracking?
> 
> In background writeback, for example, if the bandwidth is very low
> (e.g. avg_write_bandwidth=24), writeback_chunk_size() already splits
> pages into chunks of MIN_WRITEBACK_PAGES (1024). This is usually
> enough to avoid hung task warnings, so reporting progress there might
> be sufficient.

Right.

> I’m also a bit concerned that reporting progress on every
> wbc_init_bio() could lead to excessive wakeups in normal or
> high-throughput cases, which might have side effects. Please correct
> me if I’m missing something.

Hum, fair, we'd have to somehow ratelimit it which adds even more
complexity. If the waking on completion is enough to silence the hung task
detector in your cases, I'm all for a simple solution. We can always
reconsider if it proves to not be good enough.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

Re: [PATCH 0/3] Suppress undesirable hung task warnings.

Posted by Masami Hiramatsu (Google) 1 week, 1 day ago

On Tue, 23 Sep 2025 09:16:07 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Sep 22, 2025 at 02:50:45PM -0700, Andrew Morton wrote:
> > On Mon, 22 Sep 2025 11:08:32 -0700 Christoph Hellwig <hch@infradead.org> wrote:
> > 
> > > On Mon, Sep 22, 2025 at 03:27:18PM +0200, Peter Zijlstra wrote:
> > > > > Julian Sun (3):
> > > > >   sched: Introduce a new flag PF_DONT_HUNG.
> > > > >   writeback: Introduce wb_wait_for_completion_no_hung().
> > > > >   memcg: Don't trigger hung task when memcg is releasing.
> > > > 
> > > > This is all quite terrible. I'm not at all sure why a task that is
> > > > genuinely not making progress and isn't killable should not be reported.
> > > 
> > > The hung device detector is way to aggressive for very slow I/O.
> > > See blk_wait_io, which has been around for a long time to work
> > > around just that.  Given that this series targets writeback I suspect
> > > it is about an overloaded device as well.
> > 
> > Yup, it's writeback - the bug report is in
> > https://lkml.kernel.org/r/20250917212959.355656-1-sunjunchao@bytedance.com
> > 
> > Memory is big and storage is slow, there's nothing wrong if a task
> > which is designed to wait for writeback waits for a long time.
> > 
> > Of course, there's something wrong if some other task which isn't
> > designed to wait for writeback gets stuck waiting for the task which
> > *is* designed to wait for writeback, but we'll still warn about that.
> > 
> > 
> > Regarding an implementation, I'm wondering if we can put a flag in
> > `struct completion' telling the hung task detector that this one is
> > expected to wait for long periods sometimes.  Probably messy and it
> > only works for completions (not semaphores, mutexes, etc).  Just
> > putting it out there ;)
> 
> So the problem is that there *is* progress (albeit rather slowly), the
> watchdog just doesn't see that. Perhaps that is the thing we should look
> at fixing.
> 
> How about something like the below? That will 'spuriously' wake up the
> waiters as long as there is some progress being made. Thereby increasing
> the context switch counters of the tasks and thus the hung_task watchdog
> sees progress.
> 
> This approach should be safer than the blk_wait_io() hack, which has a
> timer ticking, regardless of actual completions happening or not.

I like this idea, because this does not priotize any task, but priotize
context. The problem sounds like the kernel knows the operation
should be slow. Then what we need is a "progress bar" for the hung task
instead of an "ever-spinning circle".

It seems that the task hang detector can be triggered by an IO device
problem. This is not because IO is slow but in progress, but because
removable devices, etc., may not return from IO.

So, is it possible to reset such a watchdog only when it is confirmed
that some IO is slow but in progress? No matter how slow the IO is or
how throttled it is, I think it's rare for there to be no IO at all
within a few seconds.

Thank you,

> 
> ---
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index a07b8cf73ae2..1326193b4d95 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -174,9 +174,10 @@ static void finish_writeback_work(struct wb_writeback_work *work)
>  		kfree(work);
>  	if (done) {
>  		wait_queue_head_t *waitq = done->waitq;
> +		bool force_wake = (jiffies - done->stamp) > HZ/2;
>  
>  		/* @done can't be accessed after the following dec */
> -		if (atomic_dec_and_test(&done->cnt))
> +		if (atomic_dec_and_test(&done->cnt) || force_wake)
>  			wake_up_all(waitq);
>  	}
>  }
> @@ -213,7 +214,7 @@ static void wb_queue_work(struct bdi_writeback *wb,
>  void wb_wait_for_completion(struct wb_completion *done)
>  {
>  	atomic_dec(&done->cnt);		/* put down the initial count */
> -	wait_event(*done->waitq, !atomic_read(&done->cnt));
> +	wait_event(*done->waitq, ({ done->stamp = jiffies; !atomic_read(&done->cnt); }));
>  }
>  
>  #ifdef CONFIG_CGROUP_WRITEBACK
> diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> index 2ad261082bba..197593193ce3 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -63,6 +63,7 @@ enum wb_reason {
>  struct wb_completion {
>  	atomic_t		cnt;
>  	wait_queue_head_t	*waitq;
> +	unsigned long		stamp;
>  };
>  
>  #define __WB_COMPLETION_INIT(_waitq)	\


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

Re: [PATCH 0/3] Suppress undesirable hung task warnings.

Posted by Jan Kara 1 week, 2 days ago

On Mon 22-09-25 15:27:18, Peter Zijlstra wrote:
> On Mon, Sep 22, 2025 at 05:41:43PM +0800, Julian Sun wrote:
> > As suggested by Andrew Morton in [1], we need a general mechanism 
> > that allows the hung task detector to ignore unnecessary hung 
> > tasks. This patch set implements this functionality.
> > 
> > Patch 1 introduces a PF_DONT_HUNG flag. The hung task detector will 
> > ignores all tasks that have the PF_DONT_HUNG flag set.
> > 
> > Patch 2 introduces wait_event_no_hung() and wb_wait_for_completion_no_hung(), 
> > which enable the hung task detector to ignore hung tasks caused by these
> > wait events.
> > 
> > Patch 3 uses wb_wait_for_completion_no_hung() in the final phase of memcg 
> > teardown to eliminate the hung task warning.
> > 
> > Julian Sun (3):
> >   sched: Introduce a new flag PF_DONT_HUNG.
> >   writeback: Introduce wb_wait_for_completion_no_hung().
> >   memcg: Don't trigger hung task when memcg is releasing.
> 
> This is all quite terrible. I'm not at all sure why a task that is
> genuinely not making progress and isn't killable should not be reported.

In principle it is a variation of the old problem where hung task detector
was reporting tasks that were waiting for IO to complete for too long (e.g.
if sync(2) took longer than 2 minutes or whatever the limit is set). In
this case we are waiting for IO in a cgroup to complete which has the
additional dimension that cgroup IO may be throttled so it progresses extra
slowly. But the reports of hang check firing for long running IO
disappeared quite some time ago - I have a vague recollection there were
some tweaks to it for this case but maybe I'm wrong and people just learned
to tune the hang check timer :).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

Re: [External] Re: [PATCH 0/3] Suppress undesirable hung task warnings.

Posted by Julian Sun 1 week, 2 days ago

On Mon, Sep 22, 2025 at 9:27 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Sep 22, 2025 at 05:41:43PM +0800, Julian Sun wrote:
> > As suggested by Andrew Morton in [1], we need a general mechanism
> > that allows the hung task detector to ignore unnecessary hung
> > tasks. This patch set implements this functionality.
> >
> > Patch 1 introduces a PF_DONT_HUNG flag. The hung task detector will
> > ignores all tasks that have the PF_DONT_HUNG flag set.
> >
> > Patch 2 introduces wait_event_no_hung() and wb_wait_for_completion_no_hung(),
> > which enable the hung task detector to ignore hung tasks caused by these
> > wait events.
> >
> > Patch 3 uses wb_wait_for_completion_no_hung() in the final phase of memcg
> > teardown to eliminate the hung task warning.
> >
> > Julian Sun (3):
> >   sched: Introduce a new flag PF_DONT_HUNG.
> >   writeback: Introduce wb_wait_for_completion_no_hung().
> >   memcg: Don't trigger hung task when memcg is releasing.
>
> This is all quite terrible. I'm not at all sure why a task that is
> genuinely not making progress and isn't killable should not be reported.

Actually, I tried another approach to fix this issue [1], but Andrew
thinks eliminating the warning should be simpler. Either approach is
fine with me.

[1]: https://lore.kernel.org/cgroups/20250917212959.355656-1-sunjunchao@bytedance.com/

Thanks,
-- 
Julian Sun <sunjunchao@bytedance.com>

Re: [PATCH 0/3] Suppress undesirable hung task warnings.

Posted by Michal Hocko 1 week, 2 days ago

On Mon 22-09-25 17:41:43, Julian Sun wrote:
> As suggested by Andrew Morton in [1], we need a general mechanism 

what is the reference?

> that allows the hung task detector to ignore unnecessary hung 
> tasks. This patch set implements this functionality.
> 
> Patch 1 introduces a PF_DONT_HUNG flag. The hung task detector will 
> ignores all tasks that have the PF_DONT_HUNG flag set.
> 
> Patch 2 introduces wait_event_no_hung() and wb_wait_for_completion_no_hung(), 
> which enable the hung task detector to ignore hung tasks caused by these
> wait events.
> 
> Patch 3 uses wb_wait_for_completion_no_hung() in the final phase of memcg 
> teardown to eliminate the hung task warning.
> 
> Julian Sun (3):
>   sched: Introduce a new flag PF_DONT_HUNG.
>   writeback: Introduce wb_wait_for_completion_no_hung().
>   memcg: Don't trigger hung task when memcg is releasing.
> 
>  fs/fs-writeback.c           | 15 +++++++++++++++
>  include/linux/backing-dev.h |  1 +
>  include/linux/sched.h       | 12 +++++++++++-
>  include/linux/wait.h        | 15 +++++++++++++++
>  kernel/hung_task.c          |  6 ++++++
>  mm/memcontrol.c             |  2 +-
>  6 files changed, 49 insertions(+), 2 deletions(-)
> 
> -- 
> 2.39.5

-- 
Michal Hocko
SUSE Labs

Re: [External] Re: [PATCH 0/3] Suppress undesirable hung task warnings.

Posted by Julian Sun 1 week, 2 days ago

On Mon, Sep 22, 2025 at 9:07 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 22-09-25 17:41:43, Julian Sun wrote:
> > As suggested by Andrew Morton in [1], we need a general mechanism
>
> what is the reference?

Sorry, the link is:
https://lore.kernel.org/cgroups/20250917152155.5a8ddb3e4ff813289ea0b4c9@linux-foundation.org/
>
> > that allows the hung task detector to ignore unnecessary hung
> > tasks. This patch set implements this functionality.
> >
> > Patch 1 introduces a PF_DONT_HUNG flag. The hung task detector will
> > ignores all tasks that have the PF_DONT_HUNG flag set.
> >
> > Patch 2 introduces wait_event_no_hung() and wb_wait_for_completion_no_hung(),
> > which enable the hung task detector to ignore hung tasks caused by these
> > wait events.
> >
> > Patch 3 uses wb_wait_for_completion_no_hung() in the final phase of memcg
> > teardown to eliminate the hung task warning.
> >
> > Julian Sun (3):
> >   sched: Introduce a new flag PF_DONT_HUNG.
> >   writeback: Introduce wb_wait_for_completion_no_hung().
> >   memcg: Don't trigger hung task when memcg is releasing.
> >
> >  fs/fs-writeback.c           | 15 +++++++++++++++
> >  include/linux/backing-dev.h |  1 +
> >  include/linux/sched.h       | 12 +++++++++++-
> >  include/linux/wait.h        | 15 +++++++++++++++
> >  kernel/hung_task.c          |  6 ++++++
> >  mm/memcontrol.c             |  2 +-
> >  6 files changed, 49 insertions(+), 2 deletions(-)
> >
> > --
> > 2.39.5
>
> --
> Michal Hocko
> SUSE Labs


Thanks,
-- 
Julian Sun <sunjunchao@bytedance.com>

Re: [PATCH 0/3] Suppress undesirable hung task warnings.

Posted by Lance Yang 1 week, 2 days ago

Hi Julian

Thanks for the patch series!

On 2025/9/22 17:41, Julian Sun wrote:
> As suggested by Andrew Morton in [1], we need a general mechanism
> that allows the hung task detector to ignore unnecessary hung

Yep, I understand the goal is to suppress what can be a benign hung task
warning during memcg teardown.

> tasks. This patch set implements this functionality.
> 
> Patch 1 introduces a PF_DONT_HUNG flag. The hung task detector will
> ignores all tasks that have the PF_DONT_HUNG flag set.

However, I'm concerned that the PF_DONT_HUNG flag is a bit too powerful
and might mask real, underlying hangs.

> 
> Patch 2 introduces wait_event_no_hung() and wb_wait_for_completion_no_hung(),
> which enable the hung task detector to ignore hung tasks caused by these
> wait events.

Instead of making the detector ignore the task, what if we just change
the waiting mechanism? Looking at wb_wait_for_completion(), we could
introduce a new helper that internally uses wait_event_timeout() in a
loop.

Something simple like this:

void wb_wait_for_completion_no_hung(struct wb_completion *done)
{
         atomic_dec(&done->cnt);
         while (atomic_read(&done->cnt))
                 wait_event_timeout(*done->waitq, 
!atomic_read(&done->cnt), timeout);
}

The periodic wake-ups from wait_event_timeout() would naturally prevent
the detector from complaining about slow but eventually completing 
writeback.

> 
> Patch 3 uses wb_wait_for_completion_no_hung() in the final phase of memcg
> teardown to eliminate the hung task warning.
> 
> Julian Sun (3):
>    sched: Introduce a new flag PF_DONT_HUNG.
>    writeback: Introduce wb_wait_for_completion_no_hung().
>    memcg: Don't trigger hung task when memcg is releasing.
> 
>   fs/fs-writeback.c           | 15 +++++++++++++++
>   include/linux/backing-dev.h |  1 +
>   include/linux/sched.h       | 12 +++++++++++-
>   include/linux/wait.h        | 15 +++++++++++++++
>   kernel/hung_task.c          |  6 ++++++
>   mm/memcontrol.c             |  2 +-
>   6 files changed, 49 insertions(+), 2 deletions(-)
>

Re: [PATCH 0/3] Suppress undesirable hung task warnings.

Posted by Andrew Morton 1 week, 2 days ago

On Mon, 22 Sep 2025 19:38:21 +0800 Lance Yang <lance.yang@linux.dev> wrote:

> On 2025/9/22 17:41, Julian Sun wrote:
> > As suggested by Andrew Morton in [1], we need a general mechanism
> > that allows the hung task detector to ignore unnecessary hung
> 
> Yep, I understand the goal is to suppress what can be a benign hung task
> warning during memcg teardown.
> 
> > tasks. This patch set implements this functionality.
> > 
> > Patch 1 introduces a PF_DONT_HUNG flag. The hung task detector will
> > ignores all tasks that have the PF_DONT_HUNG flag set.
> 
> However, I'm concerned that the PF_DONT_HUNG flag is a bit too powerful
> and might mask real, underlying hangs.

I think that's OK if the calling task is discriminating about it.  Just
set PF_DONT_HUNG (unpleasing name!) around those bits of code where
it's needed, clear it otherwise.

Julian, did you take a look at what a touch_hung_task_detector() would
involve?  It's a bit of an interface inconsistency - our various other
timeout detectors (softlockup, NMI, rcu) each have a touch_ function.

Re: [PATCH 0/3] Suppress undesirable hung task warnings.

Posted by Lance Yang 1 week, 2 days ago


On 2025/9/23 05:57, Andrew Morton wrote:
> On Mon, 22 Sep 2025 19:38:21 +0800 Lance Yang <lance.yang@linux.dev> wrote:
> 
>> On 2025/9/22 17:41, Julian Sun wrote:
>>> As suggested by Andrew Morton in [1], we need a general mechanism
>>> that allows the hung task detector to ignore unnecessary hung
>>
>> Yep, I understand the goal is to suppress what can be a benign hung task
>> warning during memcg teardown.
>>
>>> tasks. This patch set implements this functionality.
>>>
>>> Patch 1 introduces a PF_DONT_HUNG flag. The hung task detector will
>>> ignores all tasks that have the PF_DONT_HUNG flag set.
>>
>> However, I'm concerned that the PF_DONT_HUNG flag is a bit too powerful
>> and might mask real, underlying hangs.
> 
> I think that's OK if the calling task is discriminating about it.  Just
> set PF_DONT_HUNG (unpleasing name!) around those bits of code where
> it's needed, clear it otherwise.

Makes sense to me :)

> 
> Julian, did you take a look at what a touch_hung_task_detector() would
> involve?  It's a bit of an interface inconsistency - our various other
> timeout detectors (softlockup, NMI, rcu) each have a touch_ function.

On second thought, I agree that a touch_hung_task_detector() would be a
much better approach for interface consistency.

We could implement touch_hung_task_detector() to grant the task temporary
immunity from hung task checks for as long as it remains uninterruptible.
Once the task becomes runnable again, the immunity is automatically revoked.

Something like this:

---
diff --git a/include/linux/hung_task.h b/include/linux/hung_task.h
index c4403eeb7144..fac92039dce0 100644
--- a/include/linux/hung_task.h
+++ b/include/linux/hung_task.h
@@ -98,4 +98,9 @@ static inline void *hung_task_blocker_to_lock(unsigned 
long blocker)
  }
  #endif

+void touch_hung_task_detector(struct task_struct *t)
+{
+	t->last_switch_count = ULONG_MAX;
+}
+
  #endif /* __LINUX_HUNG_TASK_H */
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 8708a1205f82..094a277b3b39 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -203,6 +203,9 @@ static void check_hung_task(struct task_struct *t, 
unsigned long timeout)
  	if (unlikely(!switch_count))
  		return;

+	if (t->last_switch_count == ULONG_MAX)
+		return;
+
  	if (switch_count != t->last_switch_count) {
  		t->last_switch_count = switch_count;
  		t->last_switch_time = jiffies;
@@ -317,6 +320,9 @@ static void 
check_hung_uninterruptible_tasks(unsigned long timeout)
  		    !(state & TASK_WAKEKILL) &&
  		    !(state & TASK_NOLOAD))
  			check_hung_task(t, timeout);
+		else if (t->last_switch_count == ULONG_MAX)
+			t->last_switch_count = t->nvcsw + t->nivcsw;
+
  	}
   unlock:
  	rcu_read_unlock();
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8dc470aa6c3c..3d5f36455b74 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3910,8 +3910,10 @@ static void mem_cgroup_css_free(struct 
cgroup_subsys_state *css)
  	int __maybe_unused i;

  #ifdef CONFIG_CGROUP_WRITEBACK
-	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
+	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
+		touch_hung_task_detector(current);
  		wb_wait_for_completion(&memcg->cgwb_frn[i].done);
+	}
  #endif
  	if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
  		static_branch_dec(&memcg_sockets_enabled_key);
---

Using ULONG_MAX as a marker to grant this immunity. As long as the task
remains in state D, check_hung_task() sees the marker and bails out.

Re: [External] Re: [PATCH 0/3] Suppress undesirable hung task warnings.

Posted by Julian Sun 1 week, 2 days ago

On Tue, Sep 23, 2025 at 10:30 AM Lance Yang <lance.yang@linux.dev> wrote:
>
>
>
> On 2025/9/23 05:57, Andrew Morton wrote:
> > On Mon, 22 Sep 2025 19:38:21 +0800 Lance Yang <lance.yang@linux.dev> wrote:
> >
> >> On 2025/9/22 17:41, Julian Sun wrote:
> >>> As suggested by Andrew Morton in [1], we need a general mechanism
> >>> that allows the hung task detector to ignore unnecessary hung
> >>
> >> Yep, I understand the goal is to suppress what can be a benign hung task
> >> warning during memcg teardown.
> >>
> >>> tasks. This patch set implements this functionality.
> >>>
> >>> Patch 1 introduces a PF_DONT_HUNG flag. The hung task detector will
> >>> ignores all tasks that have the PF_DONT_HUNG flag set.
> >>
> >> However, I'm concerned that the PF_DONT_HUNG flag is a bit too powerful
> >> and might mask real, underlying hangs.
> >
> > I think that's OK if the calling task is discriminating about it.  Just
> > set PF_DONT_HUNG (unpleasing name!) around those bits of code where
> > it's needed, clear it otherwise.
>
> Makes sense to me :)
>
> >
> > Julian, did you take a look at what a touch_hung_task_detector() would
> > involve?  It's a bit of an interface inconsistency - our various other
> > timeout detectors (softlockup, NMI, rcu) each have a touch_ function.
>
> On second thought, I agree that a touch_hung_task_detector() would be a
> much better approach for interface consistency.
>
> We could implement touch_hung_task_detector() to grant the task temporary
> immunity from hung task checks for as long as it remains uninterruptible.
> Once the task becomes runnable again, the immunity is automatically revoked.

Yes, this looks much cleaner.  I didn’t think of this specific code
implementation method :)
>
> Something like this:
>
> ---
> diff --git a/include/linux/hung_task.h b/include/linux/hung_task.h
> index c4403eeb7144..fac92039dce0 100644
> --- a/include/linux/hung_task.h
> +++ b/include/linux/hung_task.h
> @@ -98,4 +98,9 @@ static inline void *hung_task_blocker_to_lock(unsigned
> long blocker)
>   }
>   #endif
>
> +void touch_hung_task_detector(struct task_struct *t)
> +{
> +       t->last_switch_count = ULONG_MAX;
> +}
> +
>   #endif /* __LINUX_HUNG_TASK_H */
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index 8708a1205f82..094a277b3b39 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -203,6 +203,9 @@ static void check_hung_task(struct task_struct *t,
> unsigned long timeout)
>         if (unlikely(!switch_count))
>                 return;
>
> +       if (t->last_switch_count == ULONG_MAX)
> +               return;
> +
>         if (switch_count != t->last_switch_count) {
>                 t->last_switch_count = switch_count;
>                 t->last_switch_time = jiffies;
> @@ -317,6 +320,9 @@ static void
> check_hung_uninterruptible_tasks(unsigned long timeout)
>                     !(state & TASK_WAKEKILL) &&
>                     !(state & TASK_NOLOAD))
>                         check_hung_task(t, timeout);
> +               else if (t->last_switch_count == ULONG_MAX)
> +                       t->last_switch_count = t->nvcsw + t->nivcsw;

Maybe we don't need this statement here, the if (switch_count !=
t->last_switch_count) statement inside the check_hung_task() will do
it automatically. Or am I missing something?
> +
>         }
>    unlock:
>         rcu_read_unlock();
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8dc470aa6c3c..3d5f36455b74 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3910,8 +3910,10 @@ static void mem_cgroup_css_free(struct
> cgroup_subsys_state *css)
>         int __maybe_unused i;
>
>   #ifdef CONFIG_CGROUP_WRITEBACK
> -       for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
> +       for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
> +               touch_hung_task_detector(current);
>                 wb_wait_for_completion(&memcg->cgwb_frn[i].done);
> +       }
>   #endif
>         if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
>                 static_branch_dec(&memcg_sockets_enabled_key);
> ---
>
> Using ULONG_MAX as a marker to grant this immunity. As long as the task
> remains in state D, check_hung_task() sees the marker and bails out.

Thanks for your review, I will send patch v2 with this approach.


-- 
Julian Sun <sunjunchao@bytedance.com>

Re: [External] Re: [PATCH 0/3] Suppress undesirable hung task warnings.

Posted by Lance Yang 1 week, 2 days ago


On 2025/9/23 10:45, Julian Sun wrote:
> On Tue, Sep 23, 2025 at 10:30 AM Lance Yang <lance.yang@linux.dev> wrote:
>>
>>
>>
>> On 2025/9/23 05:57, Andrew Morton wrote:
>>> On Mon, 22 Sep 2025 19:38:21 +0800 Lance Yang <lance.yang@linux.dev> wrote:
>>>
>>>> On 2025/9/22 17:41, Julian Sun wrote:
>>>>> As suggested by Andrew Morton in [1], we need a general mechanism
>>>>> that allows the hung task detector to ignore unnecessary hung
>>>>
>>>> Yep, I understand the goal is to suppress what can be a benign hung task
>>>> warning during memcg teardown.
>>>>
>>>>> tasks. This patch set implements this functionality.
>>>>>
>>>>> Patch 1 introduces a PF_DONT_HUNG flag. The hung task detector will
>>>>> ignores all tasks that have the PF_DONT_HUNG flag set.
>>>>
>>>> However, I'm concerned that the PF_DONT_HUNG flag is a bit too powerful
>>>> and might mask real, underlying hangs.
>>>
>>> I think that's OK if the calling task is discriminating about it.  Just
>>> set PF_DONT_HUNG (unpleasing name!) around those bits of code where
>>> it's needed, clear it otherwise.
>>
>> Makes sense to me :)
>>
>>>
>>> Julian, did you take a look at what a touch_hung_task_detector() would
>>> involve?  It's a bit of an interface inconsistency - our various other
>>> timeout detectors (softlockup, NMI, rcu) each have a touch_ function.
>>
>> On second thought, I agree that a touch_hung_task_detector() would be a
>> much better approach for interface consistency.
>>
>> We could implement touch_hung_task_detector() to grant the task temporary
>> immunity from hung task checks for as long as it remains uninterruptible.
>> Once the task becomes runnable again, the immunity is automatically revoked.
> 
> Yes, this looks much cleaner.  I didn’t think of this specific code
> implementation method :)
>>
>> Something like this:
>>
>> ---
>> diff --git a/include/linux/hung_task.h b/include/linux/hung_task.h
>> index c4403eeb7144..fac92039dce0 100644
>> --- a/include/linux/hung_task.h
>> +++ b/include/linux/hung_task.h
>> @@ -98,4 +98,9 @@ static inline void *hung_task_blocker_to_lock(unsigned
>> long blocker)
>>    }
>>    #endif
>>
>> +void touch_hung_task_detector(struct task_struct *t)
>> +{
>> +       t->last_switch_count = ULONG_MAX;
>> +}
>> +
>>    #endif /* __LINUX_HUNG_TASK_H */
>> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
>> index 8708a1205f82..094a277b3b39 100644
>> --- a/kernel/hung_task.c
>> +++ b/kernel/hung_task.c
>> @@ -203,6 +203,9 @@ static void check_hung_task(struct task_struct *t,
>> unsigned long timeout)
>>          if (unlikely(!switch_count))
>>                  return;
>>
>> +       if (t->last_switch_count == ULONG_MAX)
>> +               return;
>> +
>>          if (switch_count != t->last_switch_count) {
>>                  t->last_switch_count = switch_count;
>>                  t->last_switch_time = jiffies;
>> @@ -317,6 +320,9 @@ static void
>> check_hung_uninterruptible_tasks(unsigned long timeout)
>>                      !(state & TASK_WAKEKILL) &&
>>                      !(state & TASK_NOLOAD))
>>                          check_hung_task(t, timeout);
>> +               else if (t->last_switch_count == ULONG_MAX)
>> +                       t->last_switch_count = t->nvcsw + t->nivcsw;
> 
> Maybe we don't need this statement here, the if (switch_count !=
> t->last_switch_count) statement inside the check_hung_task() will do
> it automatically. Or am I missing something?

IIUC, we do need that "else if" block. check_hung_task() is ONLY called
for tasks that are currently in TASK_UNINTERRUPTIBLE state.

Without the "else if" block, the task's last_switch_count would remain
ULONG_MAX forever, effectively granting it permanent immunity from all
future hung task checks. Unless I am missing something ;)

>> +
>>          }
>>     unlock:
>>          rcu_read_unlock();
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 8dc470aa6c3c..3d5f36455b74 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3910,8 +3910,10 @@ static void mem_cgroup_css_free(struct
>> cgroup_subsys_state *css)
>>          int __maybe_unused i;
>>
>>    #ifdef CONFIG_CGROUP_WRITEBACK
>> -       for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
>> +       for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
>> +               touch_hung_task_detector(current);
>>                  wb_wait_for_completion(&memcg->cgwb_frn[i].done);
>> +       }
>>    #endif
>>          if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
>>                  static_branch_dec(&memcg_sockets_enabled_key);
>> ---
>>
>> Using ULONG_MAX as a marker to grant this immunity. As long as the task
>> remains in state D, check_hung_task() sees the marker and bails out.
> 
> Thanks for your review, I will send patch v2 with this approach.

Cheers!

Re: [PATCH 0/3] Suppress undesirable hung task warnings.

Posted by Julian Sun 1 week, 2 days ago

On 9/22/25 7:38 PM, Lance Yang wrote:

Hi, Lance

Thanks for your review and comments.

> Hi Julian
> 
> Thanks for the patch series!
> 
> On 2025/9/22 17:41, Julian Sun wrote:
>> As suggested by Andrew Morton in [1], we need a general mechanism
>> that allows the hung task detector to ignore unnecessary hung
> 
> Yep, I understand the goal is to suppress what can be a benign hung task
> warning during memcg teardown.
> 
>> tasks. This patch set implements this functionality.
>>
>> Patch 1 introduces a PF_DONT_HUNG flag. The hung task detector will
>> ignores all tasks that have the PF_DONT_HUNG flag set.
> 
> However, I'm concerned that the PF_DONT_HUNG flag is a bit too powerful
> and might mask real, underlying hangs.

The flag takes effect only when wait_event_no_hung() or 
wb_wait_for_completion_no_hung() is called, and its effect is limited to 
a single wait event, without affecting subsequent wait events. So AFAICS 
it will not mask real hang warnings.>
>>
>> Patch 2 introduces wait_event_no_hung() and 
>> wb_wait_for_completion_no_hung(),
>> which enable the hung task detector to ignore hung tasks caused by these
>> wait events.
> 
> Instead of making the detector ignore the task, what if we just change
> the waiting mechanism? Looking at wb_wait_for_completion(), we could
> introduce a new helper that internally uses wait_event_timeout() in a
> loop.
> 
> Something simple like this:
> 
> void wb_wait_for_completion_no_hung(struct wb_completion *done)
> {
>          atomic_dec(&done->cnt);
>          while (atomic_read(&done->cnt))
>                  wait_event_timeout(*done->waitq, !atomic_read(&done- 
>  >cnt), timeout);
> }
> 
> The periodic wake-ups from wait_event_timeout() would naturally prevent
> the detector from complaining about slow but eventually completing 
> writeback.

Yeah, this could definitely eliminate the hung task warning complained here.
However what I aim to provide is a general mechanism for waiting on 
events. Of course, we could use code similar to the following, but this 
would introduce additional overhead from waking tasks and multiple 
operations on wq_head—something I don't want to introduce.

+#define wait_event_no_hung(wq_head, condition) \
+do {                   \
+       while (!(condition))    \
+               wait_event_timeout(wq_head, condition, timeout); \
+}

But I can try this approach or do not introcude wait_event_no_hung() if 
you want.>
>>
>> Patch 3 uses wb_wait_for_completion_no_hung() in the final phase of memcg
>> teardown to eliminate the hung task warning.
>>
>> Julian Sun (3):
>>    sched: Introduce a new flag PF_DONT_HUNG.
>>    writeback: Introduce wb_wait_for_completion_no_hung().
>>    memcg: Don't trigger hung task when memcg is releasing.
>>
>>   fs/fs-writeback.c           | 15 +++++++++++++++
>>   include/linux/backing-dev.h |  1 +
>>   include/linux/sched.h       | 12 +++++++++++-
>>   include/linux/wait.h        | 15 +++++++++++++++
>>   kernel/hung_task.c          |  6 ++++++
>>   mm/memcontrol.c             |  2 +-
>>   6 files changed, 49 insertions(+), 2 deletions(-)
>>
> 

Thanks,
-- 
Julian Sun <sunjunchao@bytedance.com>

Re: [PATCH 0/3] Suppress undesirable hung task warnings.

Posted by Lance Yang 1 week, 2 days ago


On 2025/9/22 20:40, Julian Sun wrote:
> On 9/22/25 7:38 PM, Lance Yang wrote:
> 
> Hi, Lance
> 
> Thanks for your review and comments.
> 
>> Hi Julian
>>
>> Thanks for the patch series!
>>
>> On 2025/9/22 17:41, Julian Sun wrote:
>>> As suggested by Andrew Morton in [1], we need a general mechanism
>>> that allows the hung task detector to ignore unnecessary hung
>>
>> Yep, I understand the goal is to suppress what can be a benign hung task
>> warning during memcg teardown.
>>
>>> tasks. This patch set implements this functionality.
>>>
>>> Patch 1 introduces a PF_DONT_HUNG flag. The hung task detector will
>>> ignores all tasks that have the PF_DONT_HUNG flag set.
>>
>> However, I'm concerned that the PF_DONT_HUNG flag is a bit too powerful
>> and might mask real, underlying hangs.
> 
> The flag takes effect only when wait_event_no_hung() or 
> wb_wait_for_completion_no_hung() is called, and its effect is limited to 
> a single wait event, without affecting subsequent wait events. So AFAICS 
> it will not mask real hang warnings.>

Emm... the risk of future misuse is what worries me. I would rather have
call sites actively "pet the watchdog" by periodically calling a helper
like touch_hung_task_detector(), instead of passively ignoring the detector.

>>>
>>> Patch 2 introduces wait_event_no_hung() and 
>>> wb_wait_for_completion_no_hung(),
>>> which enable the hung task detector to ignore hung tasks caused by these
>>> wait events.
>>
>> Instead of making the detector ignore the task, what if we just change
>> the waiting mechanism? Looking at wb_wait_for_completion(), we could
>> introduce a new helper that internally uses wait_event_timeout() in a
>> loop.
>>
>> Something simple like this:
>>
>> void wb_wait_for_completion_no_hung(struct wb_completion *done)
>> {
>>          atomic_dec(&done->cnt);
>>          while (atomic_read(&done->cnt))
>>                  wait_event_timeout(*done->waitq, !atomic_read(&done- 
>>  >cnt), timeout);
>> }
>>
>> The periodic wake-ups from wait_event_timeout() would naturally prevent
>> the detector from complaining about slow but eventually completing 
>> writeback.
> 
> Yeah, this could definitely eliminate the hung task warning complained 
> here.
> However what I aim to provide is a general mechanism for waiting on 
> events. Of course, we could use code similar to the following, but this 
> would introduce additional overhead from waking tasks and multiple 
> operations on wq_head—something I don't want to introduce.

Yeah, I agree there's some overhead with a polling approach, but
mem_cgroup_css_free() should be an infrequent operation. So, I think it's
an acceptable trade-off :)

> 
> +#define wait_event_no_hung(wq_head, condition) \
> +do {                   \
> +       while (!(condition))    \
> +               wait_event_timeout(wq_head, condition, timeout); \
> +}
> 
> But I can try this approach or do not introcude wait_event_no_hung() if 
> you want.>

Well, let's see what other folks think ;)

Cheers,
Lance

>>>
>>> Patch 3 uses wb_wait_for_completion_no_hung() in the final phase of 
>>> memcg
>>> teardown to eliminate the hung task warning.
>>>
>>> Julian Sun (3):
>>>    sched: Introduce a new flag PF_DONT_HUNG.
>>>    writeback: Introduce wb_wait_for_completion_no_hung().
>>>    memcg: Don't trigger hung task when memcg is releasing.
>>>
>>>   fs/fs-writeback.c           | 15 +++++++++++++++
>>>   include/linux/backing-dev.h |  1 +
>>>   include/linux/sched.h       | 12 +++++++++++-
>>>   include/linux/wait.h        | 15 +++++++++++++++
>>>   kernel/hung_task.c          |  6 ++++++
>>>   mm/memcontrol.c             |  2 +-
>>>   6 files changed, 49 insertions(+), 2 deletions(-)
>>>
>>
> 
> Thanks,