[PATCH v2 2/2] memcg: Don't trigger hung task warnings when memcg is releasing resources.

Julian Sun posted 2 patches 1 week ago
[PATCH v2 2/2] memcg: Don't trigger hung task warnings when memcg is releasing resources.
Posted by Julian Sun 1 week ago
Hung task warning in mem_cgroup_css_free() is undesirable and
unnecessary since the behavior of waiting for a long time is
expected.

Use touch_hung_task_detector() to eliminate the possible
hung task warning.

Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
---

 I didn’t add a fixes tag because there is no actual bug in the
 original code, and this patch is more of an improvement-type one.

 mm/memcontrol.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8dd7fbed5a94..fc73a56372a4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -63,6 +63,7 @@
 #include <linux/seq_buf.h>
 #include <linux/sched/isolation.h>
 #include <linux/kmemleak.h>
+#include <linux/nmi.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -3912,8 +3913,15 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 	int __maybe_unused i;
 
 #ifdef CONFIG_CGROUP_WRITEBACK
-	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
+	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
+		/*
+		 * We don't want the hung task detector to report warnings
+		 * here since there's nothing wrong if the writeback work
+		 * lasts for a long time.
+		 */
+		touch_hung_task_detector(current);
 		wb_wait_for_completion(&memcg->cgwb_frn[i].done);
+	}
 #endif
 	if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
 		static_branch_dec(&memcg_sockets_enabled_key);
-- 
2.39.5

Re: [PATCH v2 2/2] memcg: Don't trigger hung task warnings when memcg is releasing resources.
Posted by Peter Zijlstra 1 week ago
On Wed, Sep 24, 2025 at 11:41:00AM +0800, Julian Sun wrote:
> Hung task warning in mem_cgroup_css_free() is undesirable and
> unnecessary since the behavior of waiting for a long time is
> expected.
> 
> Use touch_hung_task_detector() to eliminate the possible
> hung task warning.
> 
> Signed-off-by: Julian Sun <sunjunchao@bytedance.com>

Still hate this. It is not tied to progress. If progress really stalls,
no warning will be given.

>  mm/memcontrol.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8dd7fbed5a94..fc73a56372a4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -63,6 +63,7 @@
>  #include <linux/seq_buf.h>
>  #include <linux/sched/isolation.h>
>  #include <linux/kmemleak.h>
> +#include <linux/nmi.h>
>  #include "internal.h"
>  #include <net/sock.h>
>  #include <net/ip.h>
> @@ -3912,8 +3913,15 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
>  	int __maybe_unused i;
>  
>  #ifdef CONFIG_CGROUP_WRITEBACK
> -	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
> +	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
> +		/*
> +		 * We don't want the hung task detector to report warnings
> +		 * here since there's nothing wrong if the writeback work
> +		 * lasts for a long time.
> +		 */
> +		touch_hung_task_detector(current);
>  		wb_wait_for_completion(&memcg->cgwb_frn[i].done);
> +	}
>  #endif
>  	if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
>  		static_branch_dec(&memcg_sockets_enabled_key);
> -- 
> 2.39.5
>
Re: [External] Re: [PATCH v2 2/2] memcg: Don't trigger hung task warnings when memcg is releasing resources.
Posted by Julian Sun 1 week ago
On 9/24/25 2:32 PM, Peter Zijlstra wrote:
> On Wed, Sep 24, 2025 at 11:41:00AM +0800, Julian Sun wrote:
>> Hung task warning in mem_cgroup_css_free() is undesirable and
>> unnecessary since the behavior of waiting for a long time is
>> expected.
>>
>> Use touch_hung_task_detector() to eliminate the possible
>> hung task warning.
>>
>> Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
> 
> Still hate this. It is not tied to progress. If progress really stalls,
> no warning will be given.

Hi, peter

Thanks for your review and comments.

I did take a look at your solution provided yesterday, and get your 
point. However AFAICS it can't resolve the unexpected warnings here. 
Because it only works after we reach the finish_writeback_work(), and 
the key point here is, it *already* takes a long time before we reach 
finish_writeback_work(), and there is true progress before finish the 
writeback work that hung task detector still can not know.

If we want to make the hung task detector to known the progress of 
writeback work, we need to add some code within do_writepages(): after 
each finish of a_ops->writepages(), we need to make detector to known 
there's progress. Something like this:

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 3e248d1c3969..49572a83c47b 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2635,6 +2635,10 @@ int do_writepages(struct address_space *mapping, 
struct writeback_control *wbc)
                 else
                         /* deal with chardevs and other special files */
                         ret = 0;
+               /* Make hung task detector to known there's progress. */
+               if (force_wake)
+                       wake_up_all(waitq);
+
                 if (ret != -ENOMEM || wbc->sync_mode != WB_SYNC_ALL)
                         break;

which has a big impact on current code - I don't want to introduce this.

Yes, the behavior in this patch does have the possibility to paper cover 
the real warnings, and what I want to argue is that the essence of this 
patch is the same as the current touch_nmi_watchdog() and 
touch_softlockup_watchdog() - these functions are used only in specific 
scenarios we known and only affect a single event. And there seems no 
report that touch_nmi/softlockup_watchdog() will paper cover the real 
warnings (do we?).

Correct me if there's anything I'm missing or misunderstanding.


> 
>>   mm/memcontrol.c | 10 +++++++++-
>>   1 file changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 8dd7fbed5a94..fc73a56372a4 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -63,6 +63,7 @@
>>   #include <linux/seq_buf.h>
>>   #include <linux/sched/isolation.h>
>>   #include <linux/kmemleak.h>
>> +#include <linux/nmi.h>
>>   #include "internal.h"
>>   #include <net/sock.h>
>>   #include <net/ip.h>
>> @@ -3912,8 +3913,15 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
>>   	int __maybe_unused i;
>>   
>>   #ifdef CONFIG_CGROUP_WRITEBACK
>> -	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
>> +	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
>> +		/*
>> +		 * We don't want the hung task detector to report warnings
>> +		 * here since there's nothing wrong if the writeback work
>> +		 * lasts for a long time.
>> +		 */
>> +		touch_hung_task_detector(current);
>>   		wb_wait_for_completion(&memcg->cgwb_frn[i].done);
>> +	}
>>   #endif
>>   	if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
>>   		static_branch_dec(&memcg_sockets_enabled_key);
>> -- 
>> 2.39.5
>>

Thanks,
-- 
Julian Sun <sunjunchao@bytedance.com>
Re: [External] Re: [PATCH v2 2/2] memcg: Don't trigger hung task warnings when memcg is releasing resources.
Posted by Peter Zijlstra 1 week ago
On Wed, Sep 24, 2025 at 03:50:41PM +0800, Julian Sun wrote:
> On 9/24/25 2:32 PM, Peter Zijlstra wrote:
> > On Wed, Sep 24, 2025 at 11:41:00AM +0800, Julian Sun wrote:
> > > Hung task warning in mem_cgroup_css_free() is undesirable and
> > > unnecessary since the behavior of waiting for a long time is
> > > expected.
> > > 
> > > Use touch_hung_task_detector() to eliminate the possible
> > > hung task warning.
> > > 
> > > Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
> > 
> > Still hate this. It is not tied to progress. If progress really stalls,
> > no warning will be given.
> 
> Hi, peter
> 
> Thanks for your review and comments.
> 
> I did take a look at your solution provided yesterday, and get your point.
> However AFAICS it can't resolve the unexpected warnings here. Because it
> only works after we reach the finish_writeback_work(), and the key point
> here is, it *already* takes a long time before we reach
> finish_writeback_work(), and there is true progress before finish the
> writeback work that hung task detector still can not know.

But wb_split_bdi_pages() should already split things into smaller chunks
if there is low bandwidth, right? And we call finish_writeback_work()
for each chunk.

If a chunk is still taking too long, surely the solution is to use
smaller chunks?

> If we want to make the hung task detector to known the progress of writeback
> work, we need to add some code within do_writepages(): after each finish of
> a_ops->writepages(), we need to make detector to known there's progress.
> Something like this:
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 3e248d1c3969..49572a83c47b 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2635,6 +2635,10 @@ int do_writepages(struct address_space *mapping,
> struct writeback_control *wbc)
>                 else
>                         /* deal with chardevs and other special files */
>                         ret = 0;
> +               /* Make hung task detector to known there's progress. */
> +               if (force_wake)
> +                       wake_up_all(waitq);
> +
>                 if (ret != -ENOMEM || wbc->sync_mode != WB_SYNC_ALL)
>                         break;
> 
> which has a big impact on current code - I don't want to introduce this.

You sure? It looks to me like the next level down is wb_writeback() and
writeback_sb_inodes(), and those already have time based breaks in and
still have access to wb_writeback_work::done, while do_writepages() no
longer has that context.

> Yes, the behavior in this patch does have the possibility to paper cover the
> real warnings, and what I want to argue is that the essence of this patch is
> the same as the current touch_nmi_watchdog() and touch_softlockup_watchdog()
> - these functions are used only in specific scenarios we known and only
> affect a single event. And there seems no report that
> touch_nmi/softlockup_watchdog() will paper cover the real warnings (do we?).
> 
> Correct me if there's anything I'm missing or misunderstanding.

The thing with touch_nmi_watchdog() is that you need to keep doing it.
The moment you stop calling touch_nmi_watchdog(), you will cause it to
fire.

That is very much in line with the thing I proposed, and rather unlike
your proposal that blanket kill reporting for the task, irrespective of
how long it sits there waiting.
Re: [External] Re: [PATCH v2 2/2] memcg: Don't trigger hung task warnings when memcg is releasing resources.
Posted by Julian Sun 1 week ago
On 9/24/25 4:28 PM, Peter Zijlstra wrote:

Hi,
> On Wed, Sep 24, 2025 at 03:50:41PM +0800, Julian Sun wrote:
>> On 9/24/25 2:32 PM, Peter Zijlstra wrote:
>>> On Wed, Sep 24, 2025 at 11:41:00AM +0800, Julian Sun wrote:
>>>> Hung task warning in mem_cgroup_css_free() is undesirable and
>>>> unnecessary since the behavior of waiting for a long time is
>>>> expected.
>>>>
>>>> Use touch_hung_task_detector() to eliminate the possible
>>>> hung task warning.
>>>>
>>>> Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
>>>
>>> Still hate this. It is not tied to progress. If progress really stalls,
>>> no warning will be given.
>>
>> Hi, peter
>>
>> Thanks for your review and comments.
>>
>> I did take a look at your solution provided yesterday, and get your point.
>> However AFAICS it can't resolve the unexpected warnings here. Because it
>> only works after we reach the finish_writeback_work(), and the key point
>> here is, it *already* takes a long time before we reach
>> finish_writeback_work(), and there is true progress before finish the
>> writeback work that hung task detector still can not know.
> 
> But wb_split_bdi_pages() should already split things into smaller chunks
> if there is low bandwidth, right? And we call finish_writeback_work()
> for each chunk.

AFAICS, wb_split_bdi_pages() will only be invoked in the sync scenarios, 
and not in the background writeback scenarios - which is exactly the 
case here.

And I noticed that there's something similar in background writeback, 
where writeback_chunk_size() will split all pages into several chunks, 
the min chunk size is 1024(MIN_WRITEBACK_PAGES) pages. The difference 
from wb_split_bdi_pages() is that we don't report progress after 
finishing the writeback of a chunk.
> 
> If a chunk is still taking too long, surely the solution is to use
> smaller chunks?

Yeah it still takes a long time, I checked the write_bandwidth and 
avg_write_bandwidth when warning was triggered:

	>>> wb.write_bandwidth
	(unsigned long)24
	>>> wb.avg_write_bandwidth
	(unsigned long)24
	>>> wb.write_bandwidth
	(unsigned long)13
	>>> wb.write_bandwidth
	(unsigned long)13

At this bandwidth, it will still takes a lot of seconds to write back 
MIN_WRITEBACK_PAGES pages.

So it might be a solution, but given the fact that the current minimum 
chunk size (1024) has been in place for over ten years, and that making 
it smaller would probably have a negative impact on performance. I'm 
afraid the filesystem maintainers will not accept this change.
If we don’t modify this part but can report progress after finishing the 
chunk writeback, it should probably eliminate most of the unexpected 
warnings.
> 
>> If we want to make the hung task detector to known the progress of writeback
>> work, we need to add some code within do_writepages(): after each finish of
>> a_ops->writepages(), we need to make detector to known there's progress.
>> Something like this:
>>
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index 3e248d1c3969..49572a83c47b 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -2635,6 +2635,10 @@ int do_writepages(struct address_space *mapping,
>> struct writeback_control *wbc)
>>                  else
>>                          /* deal with chardevs and other special files */
>>                          ret = 0;
>> +               /* Make hung task detector to known there's progress. */
>> +               if (force_wake)
>> +                       wake_up_all(waitq);
>> +
>>                  if (ret != -ENOMEM || wbc->sync_mode != WB_SYNC_ALL)
>>                          break;
>>
>> which has a big impact on current code - I don't want to introduce this.
> 
> You sure? It looks to me like the next level down is wb_writeback() and
> writeback_sb_inodes(), and those already have time based breaks in and
> still have access to wb_writeback_work::done, while do_writepages() no
> longer has that context.

Yeah, exactly. What I mean is report progress within the whole writeback 
work, either writeback_sb_inodes() or do_writepages() is ok.
> 
>> Yes, the behavior in this patch does have the possibility to paper cover the
>> real warnings, and what I want to argue is that the essence of this patch is
>> the same as the current touch_nmi_watchdog() and touch_softlockup_watchdog()
>> - these functions are used only in specific scenarios we known and only
>> affect a single event. And there seems no report that
>> touch_nmi/softlockup_watchdog() will paper cover the real warnings (do we?).
>>
>> Correct me if there's anything I'm missing or misunderstanding.
> 
> The thing with touch_nmi_watchdog() is that you need to keep doing it.
> The moment you stop calling touch_nmi_watchdog(), you will cause it to
> fire.
> 
> That is very much in line with the thing I proposed, and rather unlike
> your proposal that blanket kill reporting for the task, irrespective of
> how long it sits there waiting.
> 

Thanks for clarification. So how about the following solution?

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a07b8cf73ae2..e0698fd3f9ab 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -14,6 +14,7 @@
   *             Additions for address_space-based writeback
   */

+#include <linux/sched/sysctl.h>
  #include <linux/kernel.h>
  #include <linux/export.h>
  #include <linux/spinlock.h>
@@ -213,7 +214,7 @@ static void wb_queue_work(struct bdi_writeback *wb,
  void wb_wait_for_completion(struct wb_completion *done)
  {
         atomic_dec(&done->cnt);         /* put down the initial count */
-       wait_event(*done->waitq, !atomic_read(&done->cnt));
+       wait_event(*done->waitq, (done->stamp = jiffies; 
!atomic_read(&done->cnt)));
  }

  #ifdef CONFIG_CGROUP_WRITEBACK
@@ -1975,6 +1976,11 @@ static long writeback_sb_inodes(struct 
super_block *sb,
                  */
                 __writeback_single_inode(inode, &wbc);

+               /* Report progress to make hung task detector know it. */
+               if (jiffies - work->done->stamp >
+                   HZ * sysctl_hung_task_timeout_secs / 2)
+                       wake_up_all(work->done->waitq);
+
                 wbc_detach_inode(&wbc);
                 work->nr_pages -= write_chunk - wbc.nr_to_write;
                 wrote = write_chunk - wbc.nr_to_write - wbc.pages_skipped;

Instead of waking up all waiting threads every half second, we only wake 
them up if the writeback work lasts for the value of 
sysctl_hung_task_timeout_secs / 2 seconds to reduce possible overhead.

Hi, Jan, Christian, how do you think about it?

Please correct me if there's anything I'm missing or misunderstanding.

Thanks,
-- 
Julian Sun <sunjunchao@bytedance.com>