Hung task warning in mem_cgroup_css_free() is undesirable and
unnecessary since the behavior of waiting for a long time is
expected.
Use touch_hung_task_detector() to eliminate the possible
hung task warning.
Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
---
I didn’t add a fixes tag because there is no actual bug in the
original code, and this patch is more of an improvement-type one.
mm/memcontrol.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8dd7fbed5a94..fc73a56372a4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -63,6 +63,7 @@
#include <linux/seq_buf.h>
#include <linux/sched/isolation.h>
#include <linux/kmemleak.h>
+#include <linux/nmi.h>
#include "internal.h"
#include <net/sock.h>
#include <net/ip.h>
@@ -3912,8 +3913,15 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
int __maybe_unused i;
#ifdef CONFIG_CGROUP_WRITEBACK
- for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
+ for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
+ /*
+ * We don't want the hung task detector to report warnings
+ * here since there's nothing wrong if the writeback work
+ * lasts for a long time.
+ */
+ touch_hung_task_detector(current);
wb_wait_for_completion(&memcg->cgwb_frn[i].done);
+ }
#endif
if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
static_branch_dec(&memcg_sockets_enabled_key);
--
2.39.5
On Wed, Sep 24, 2025 at 11:41:00AM +0800, Julian Sun wrote: > Hung task warning in mem_cgroup_css_free() is undesirable and > unnecessary since the behavior of waiting for a long time is > expected. > > Use touch_hung_task_detector() to eliminate the possible > hung task warning. > > Signed-off-by: Julian Sun <sunjunchao@bytedance.com> Still hate this. It is not tied to progress. If progress really stalls, no warning will be given. > mm/memcontrol.c | 10 +++++++++- > 1 file changed, 9 insertions(+), 1 deletion(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 8dd7fbed5a94..fc73a56372a4 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -63,6 +63,7 @@ > #include <linux/seq_buf.h> > #include <linux/sched/isolation.h> > #include <linux/kmemleak.h> > +#include <linux/nmi.h> > #include "internal.h" > #include <net/sock.h> > #include <net/ip.h> > @@ -3912,8 +3913,15 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css) > int __maybe_unused i; > > #ifdef CONFIG_CGROUP_WRITEBACK > - for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) > + for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) { > + /* > + * We don't want the hung task detector to report warnings > + * here since there's nothing wrong if the writeback work > + * lasts for a long time. > + */ > + touch_hung_task_detector(current); > wb_wait_for_completion(&memcg->cgwb_frn[i].done); > + } > #endif > if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket) > static_branch_dec(&memcg_sockets_enabled_key); > -- > 2.39.5 >
On 9/24/25 2:32 PM, Peter Zijlstra wrote: > On Wed, Sep 24, 2025 at 11:41:00AM +0800, Julian Sun wrote: >> Hung task warning in mem_cgroup_css_free() is undesirable and >> unnecessary since the behavior of waiting for a long time is >> expected. >> >> Use touch_hung_task_detector() to eliminate the possible >> hung task warning. >> >> Signed-off-by: Julian Sun <sunjunchao@bytedance.com> > > Still hate this. It is not tied to progress. If progress really stalls, > no warning will be given. Hi, peter Thanks for your review and comments. I did take a look at your solution provided yesterday, and get your point. However AFAICS it can't resolve the unexpected warnings here. Because it only works after we reach the finish_writeback_work(), and the key point here is, it *already* takes a long time before we reach finish_writeback_work(), and there is true progress before finish the writeback work that hung task detector still can not know. If we want to make the hung task detector to known the progress of writeback work, we need to add some code within do_writepages(): after each finish of a_ops->writepages(), we need to make detector to known there's progress. Something like this: diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 3e248d1c3969..49572a83c47b 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2635,6 +2635,10 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc) else /* deal with chardevs and other special files */ ret = 0; + /* Make hung task detector to known there's progress. */ + if (force_wake) + wake_up_all(waitq); + if (ret != -ENOMEM || wbc->sync_mode != WB_SYNC_ALL) break; which has a big impact on current code - I don't want to introduce this. Yes, the behavior in this patch does have the possibility to paper cover the real warnings, and what I want to argue is that the essence of this patch is the same as the current touch_nmi_watchdog() and touch_softlockup_watchdog() - these functions are used only in specific scenarios we known and only affect a single event. And there seems no report that touch_nmi/softlockup_watchdog() will paper cover the real warnings (do we?). Correct me if there's anything I'm missing or misunderstanding. > >> mm/memcontrol.c | 10 +++++++++- >> 1 file changed, 9 insertions(+), 1 deletion(-) >> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >> index 8dd7fbed5a94..fc73a56372a4 100644 >> --- a/mm/memcontrol.c >> +++ b/mm/memcontrol.c >> @@ -63,6 +63,7 @@ >> #include <linux/seq_buf.h> >> #include <linux/sched/isolation.h> >> #include <linux/kmemleak.h> >> +#include <linux/nmi.h> >> #include "internal.h" >> #include <net/sock.h> >> #include <net/ip.h> >> @@ -3912,8 +3913,15 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css) >> int __maybe_unused i; >> >> #ifdef CONFIG_CGROUP_WRITEBACK >> - for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) >> + for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) { >> + /* >> + * We don't want the hung task detector to report warnings >> + * here since there's nothing wrong if the writeback work >> + * lasts for a long time. >> + */ >> + touch_hung_task_detector(current); >> wb_wait_for_completion(&memcg->cgwb_frn[i].done); >> + } >> #endif >> if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket) >> static_branch_dec(&memcg_sockets_enabled_key); >> -- >> 2.39.5 >> Thanks, -- Julian Sun <sunjunchao@bytedance.com>
On Wed, Sep 24, 2025 at 03:50:41PM +0800, Julian Sun wrote: > On 9/24/25 2:32 PM, Peter Zijlstra wrote: > > On Wed, Sep 24, 2025 at 11:41:00AM +0800, Julian Sun wrote: > > > Hung task warning in mem_cgroup_css_free() is undesirable and > > > unnecessary since the behavior of waiting for a long time is > > > expected. > > > > > > Use touch_hung_task_detector() to eliminate the possible > > > hung task warning. > > > > > > Signed-off-by: Julian Sun <sunjunchao@bytedance.com> > > > > Still hate this. It is not tied to progress. If progress really stalls, > > no warning will be given. > > Hi, peter > > Thanks for your review and comments. > > I did take a look at your solution provided yesterday, and get your point. > However AFAICS it can't resolve the unexpected warnings here. Because it > only works after we reach the finish_writeback_work(), and the key point > here is, it *already* takes a long time before we reach > finish_writeback_work(), and there is true progress before finish the > writeback work that hung task detector still can not know. But wb_split_bdi_pages() should already split things into smaller chunks if there is low bandwidth, right? And we call finish_writeback_work() for each chunk. If a chunk is still taking too long, surely the solution is to use smaller chunks? > If we want to make the hung task detector to known the progress of writeback > work, we need to add some code within do_writepages(): after each finish of > a_ops->writepages(), we need to make detector to known there's progress. > Something like this: > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index 3e248d1c3969..49572a83c47b 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -2635,6 +2635,10 @@ int do_writepages(struct address_space *mapping, > struct writeback_control *wbc) > else > /* deal with chardevs and other special files */ > ret = 0; > + /* Make hung task detector to known there's progress. */ > + if (force_wake) > + wake_up_all(waitq); > + > if (ret != -ENOMEM || wbc->sync_mode != WB_SYNC_ALL) > break; > > which has a big impact on current code - I don't want to introduce this. You sure? It looks to me like the next level down is wb_writeback() and writeback_sb_inodes(), and those already have time based breaks in and still have access to wb_writeback_work::done, while do_writepages() no longer has that context. > Yes, the behavior in this patch does have the possibility to paper cover the > real warnings, and what I want to argue is that the essence of this patch is > the same as the current touch_nmi_watchdog() and touch_softlockup_watchdog() > - these functions are used only in specific scenarios we known and only > affect a single event. And there seems no report that > touch_nmi/softlockup_watchdog() will paper cover the real warnings (do we?). > > Correct me if there's anything I'm missing or misunderstanding. The thing with touch_nmi_watchdog() is that you need to keep doing it. The moment you stop calling touch_nmi_watchdog(), you will cause it to fire. That is very much in line with the thing I proposed, and rather unlike your proposal that blanket kill reporting for the task, irrespective of how long it sits there waiting.
On 9/24/25 4:28 PM, Peter Zijlstra wrote: Hi, > On Wed, Sep 24, 2025 at 03:50:41PM +0800, Julian Sun wrote: >> On 9/24/25 2:32 PM, Peter Zijlstra wrote: >>> On Wed, Sep 24, 2025 at 11:41:00AM +0800, Julian Sun wrote: >>>> Hung task warning in mem_cgroup_css_free() is undesirable and >>>> unnecessary since the behavior of waiting for a long time is >>>> expected. >>>> >>>> Use touch_hung_task_detector() to eliminate the possible >>>> hung task warning. >>>> >>>> Signed-off-by: Julian Sun <sunjunchao@bytedance.com> >>> >>> Still hate this. It is not tied to progress. If progress really stalls, >>> no warning will be given. >> >> Hi, peter >> >> Thanks for your review and comments. >> >> I did take a look at your solution provided yesterday, and get your point. >> However AFAICS it can't resolve the unexpected warnings here. Because it >> only works after we reach the finish_writeback_work(), and the key point >> here is, it *already* takes a long time before we reach >> finish_writeback_work(), and there is true progress before finish the >> writeback work that hung task detector still can not know. > > But wb_split_bdi_pages() should already split things into smaller chunks > if there is low bandwidth, right? And we call finish_writeback_work() > for each chunk. AFAICS, wb_split_bdi_pages() will only be invoked in the sync scenarios, and not in the background writeback scenarios - which is exactly the case here. And I noticed that there's something similar in background writeback, where writeback_chunk_size() will split all pages into several chunks, the min chunk size is 1024(MIN_WRITEBACK_PAGES) pages. The difference from wb_split_bdi_pages() is that we don't report progress after finishing the writeback of a chunk. > > If a chunk is still taking too long, surely the solution is to use > smaller chunks? Yeah it still takes a long time, I checked the write_bandwidth and avg_write_bandwidth when warning was triggered: >>> wb.write_bandwidth (unsigned long)24 >>> wb.avg_write_bandwidth (unsigned long)24 >>> wb.write_bandwidth (unsigned long)13 >>> wb.write_bandwidth (unsigned long)13 At this bandwidth, it will still takes a lot of seconds to write back MIN_WRITEBACK_PAGES pages. So it might be a solution, but given the fact that the current minimum chunk size (1024) has been in place for over ten years, and that making it smaller would probably have a negative impact on performance. I'm afraid the filesystem maintainers will not accept this change. If we don’t modify this part but can report progress after finishing the chunk writeback, it should probably eliminate most of the unexpected warnings. > >> If we want to make the hung task detector to known the progress of writeback >> work, we need to add some code within do_writepages(): after each finish of >> a_ops->writepages(), we need to make detector to known there's progress. >> Something like this: >> >> diff --git a/mm/page-writeback.c b/mm/page-writeback.c >> index 3e248d1c3969..49572a83c47b 100644 >> --- a/mm/page-writeback.c >> +++ b/mm/page-writeback.c >> @@ -2635,6 +2635,10 @@ int do_writepages(struct address_space *mapping, >> struct writeback_control *wbc) >> else >> /* deal with chardevs and other special files */ >> ret = 0; >> + /* Make hung task detector to known there's progress. */ >> + if (force_wake) >> + wake_up_all(waitq); >> + >> if (ret != -ENOMEM || wbc->sync_mode != WB_SYNC_ALL) >> break; >> >> which has a big impact on current code - I don't want to introduce this. > > You sure? It looks to me like the next level down is wb_writeback() and > writeback_sb_inodes(), and those already have time based breaks in and > still have access to wb_writeback_work::done, while do_writepages() no > longer has that context. Yeah, exactly. What I mean is report progress within the whole writeback work, either writeback_sb_inodes() or do_writepages() is ok. > >> Yes, the behavior in this patch does have the possibility to paper cover the >> real warnings, and what I want to argue is that the essence of this patch is >> the same as the current touch_nmi_watchdog() and touch_softlockup_watchdog() >> - these functions are used only in specific scenarios we known and only >> affect a single event. And there seems no report that >> touch_nmi/softlockup_watchdog() will paper cover the real warnings (do we?). >> >> Correct me if there's anything I'm missing or misunderstanding. > > The thing with touch_nmi_watchdog() is that you need to keep doing it. > The moment you stop calling touch_nmi_watchdog(), you will cause it to > fire. > > That is very much in line with the thing I proposed, and rather unlike > your proposal that blanket kill reporting for the task, irrespective of > how long it sits there waiting. > Thanks for clarification. So how about the following solution? diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index a07b8cf73ae2..e0698fd3f9ab 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -14,6 +14,7 @@ * Additions for address_space-based writeback */ +#include <linux/sched/sysctl.h> #include <linux/kernel.h> #include <linux/export.h> #include <linux/spinlock.h> @@ -213,7 +214,7 @@ static void wb_queue_work(struct bdi_writeback *wb, void wb_wait_for_completion(struct wb_completion *done) { atomic_dec(&done->cnt); /* put down the initial count */ - wait_event(*done->waitq, !atomic_read(&done->cnt)); + wait_event(*done->waitq, (done->stamp = jiffies; !atomic_read(&done->cnt))); } #ifdef CONFIG_CGROUP_WRITEBACK @@ -1975,6 +1976,11 @@ static long writeback_sb_inodes(struct super_block *sb, */ __writeback_single_inode(inode, &wbc); + /* Report progress to make hung task detector know it. */ + if (jiffies - work->done->stamp > + HZ * sysctl_hung_task_timeout_secs / 2) + wake_up_all(work->done->waitq); + wbc_detach_inode(&wbc); work->nr_pages -= write_chunk - wbc.nr_to_write; wrote = write_chunk - wbc.nr_to_write - wbc.pages_skipped; Instead of waking up all waiting threads every half second, we only wake them up if the writeback work lasts for the value of sysctl_hung_task_timeout_secs / 2 seconds to reduce possible overhead. Hi, Jan, Christian, how do you think about it? Please correct me if there's anything I'm missing or misunderstanding. Thanks, -- Julian Sun <sunjunchao@bytedance.com>
© 2016 - 2025 Red Hat, Inc.