From: Kairui Song <kasong@tencent.com>

Currently MGLRU and non-MGLRU handle reclaim statistics and
writeback very differently, especially throttling; MGLRU basically
just ignores the throttling part.

Let's unify this part: use a helper to deduplicate the code so
that both setups share the same behavior.

Test with the following bash reproducer:

echo "Setup a slow device using dm delay"
dd if=/dev/zero of=/var/tmp/backing bs=1M count=2048
LOOP=$(losetup --show -f /var/tmp/backing)
mkfs.ext4 -q $LOOP
echo "0 $(blockdev --getsz $LOOP) delay $LOOP 0 0 $LOOP 0 1000" | \
dmsetup create slow_dev
mkdir -p /mnt/slow && mount /dev/mapper/slow_dev /mnt/slow

echo "Start writeback pressure"
sync && echo 3 > /proc/sys/vm/drop_caches
mkdir /sys/fs/cgroup/test_wb
echo 128M > /sys/fs/cgroup/test_wb/memory.max
(echo $BASHPID > /sys/fs/cgroup/test_wb/cgroup.procs && \
dd if=/dev/zero of=/mnt/slow/testfile bs=1M count=192)

echo "Clean up"
echo "0 $(blockdev --getsz $LOOP) error" | dmsetup load slow_dev
dmsetup resume slow_dev
umount -l /mnt/slow && sync
dmsetup remove slow_dev
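
For reference, whether MGLRU is active on the test kernel can be checked
through the lru_gen sysfs interface before running the reproducer (a
minimal sketch; the file exists only when the kernel is built with
CONFIG_LRU_GEN):

```shell
# Check whether MGLRU is enabled; /sys/kernel/mm/lru_gen/enabled
# holds a feature bitmask, 0x0000 meaning MGLRU is disabled.
LRU_GEN=/sys/kernel/mm/lru_gen/enabled
if [ -f "$LRU_GEN" ]; then
    echo "lru_gen enabled mask: $(cat "$LRU_GEN")"
else
    echo "MGLRU not available (kernel built without CONFIG_LRU_GEN)"
fi
```

MGLRU can also be toggled at runtime by writing y or n to the same file,
which makes it easy to compare the before/after behavior on one boot.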

Before this commit, `dd` gets OOM killed almost immediately if
MGLRU is enabled; the classic LRU is fine.

After this commit, throttling is effective: no more spinning on the
LRU and no premature OOM. Stress tests on other workloads also look
good.

Global throttling is not covered here yet; we will fix that
separately later.

Suggested-by: Chen Ridong <chenridong@huaweicloud.com>
Tested-by: Leno Hou <lenohou@gmail.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 90 ++++++++++++++++++++++++++++---------------------------------
1 file changed, 41 insertions(+), 49 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9120d914445e..a7b3e5b6676b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1942,6 +1942,44 @@ static int current_may_throttle(void)
return !(current->flags & PF_LOCAL_THROTTLE);
}
+static void handle_reclaim_writeback(unsigned long nr_taken,
+ struct pglist_data *pgdat,
+ struct scan_control *sc,
+ struct reclaim_stat *stat)
+{
+ /*
+ * If dirty folios are scanned that are not queued for IO, it
+ * implies that flushers are not doing their job. This can
+ * happen when memory pressure pushes dirty folios to the end of
+ * the LRU before the dirty limits are breached and the dirty
+ * data has expired. It can also happen when the proportion of
+ * dirty folios grows not through writes but through memory
+ * pressure reclaiming all the clean cache. And in some cases,
+ * the flushers simply cannot keep up with the allocation
+ * rate. Nudge the flusher threads in case they are asleep.
+ */
+ if (stat->nr_unqueued_dirty == nr_taken && nr_taken) {
+ wakeup_flusher_threads(WB_REASON_VMSCAN);
+ /*
+ * For cgroupv1 dirty throttling is achieved by waking up
+ * the kernel flusher here and later waiting on folios
+ * which are in writeback to finish (see shrink_folio_list()).
+ *
+ * Flusher may not be able to issue writeback quickly
+ * enough for cgroupv1 writeback throttling to work
+ * on a large system.
+ */
+ if (!writeback_throttling_sane(sc))
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
+ }
+
+ sc->nr.dirty += stat->nr_dirty;
+ sc->nr.congested += stat->nr_congested;
+ sc->nr.writeback += stat->nr_writeback;
+ sc->nr.immediate += stat->nr_immediate;
+ sc->nr.taken += nr_taken;
+}
+
/*
* shrink_inactive_list() is a helper for shrink_node(). It returns the number
* of reclaimed pages
@@ -2005,39 +2043,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
lruvec_lock_irq(lruvec);
lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout,
nr_scanned - nr_reclaimed);
-
- /*
- * If dirty folios are scanned that are not queued for IO, it
- * implies that flushers are not doing their job. This can
- * happen when memory pressure pushes dirty folios to the end of
- * the LRU before the dirty limits are breached and the dirty
- * data has expired. It can also happen when the proportion of
- * dirty folios grows not through writes but through memory
- * pressure reclaiming all the clean cache. And in some cases,
- * the flushers simply cannot keep up with the allocation
- * rate. Nudge the flusher threads in case they are asleep.
- */
- if (stat.nr_unqueued_dirty == nr_taken) {
- wakeup_flusher_threads(WB_REASON_VMSCAN);
- /*
- * For cgroupv1 dirty throttling is achieved by waking up
- * the kernel flusher here and later waiting on folios
- * which are in writeback to finish (see shrink_folio_list()).
- *
- * Flusher may not be able to issue writeback quickly
- * enough for cgroupv1 writeback throttling to work
- * on a large system.
- */
- if (!writeback_throttling_sane(sc))
- reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
- }
-
- sc->nr.dirty += stat.nr_dirty;
- sc->nr.congested += stat.nr_congested;
- sc->nr.writeback += stat.nr_writeback;
- sc->nr.immediate += stat.nr_immediate;
- sc->nr.taken += nr_taken;
-
+ handle_reclaim_writeback(nr_taken, pgdat, sc, &stat);
trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
nr_scanned, nr_reclaimed, &stat, sc->priority, file);
return nr_reclaimed;
@@ -4824,26 +4830,11 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
retry:
reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
sc->nr_reclaimed += reclaimed;
+ handle_reclaim_writeback(isolated, pgdat, sc, &stat);
trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
type_scanned, reclaimed, &stat, sc->priority,
type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
- /*
- * If too many file cache in the coldest generation can't be evicted
- * due to being dirty, wake up the flusher.
- */
- if (stat.nr_unqueued_dirty == isolated) {
- wakeup_flusher_threads(WB_REASON_VMSCAN);
-
- /*
- * For cgroupv1 dirty throttling is achieved by waking up
- * the kernel flusher here and later waiting on folios
- * which are in writeback to finish (see shrink_folio_list()).
- */
- if (!writeback_throttling_sane(sc))
- reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
- }
-
list_for_each_entry_safe_reverse(folio, next, &list, lru) {
DEFINE_MIN_SEQ(lruvec);
@@ -4886,6 +4877,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
if (!list_empty(&list)) {
skip_retry = true;
+ isolated = 0;
goto retry;
}
--
2.53.0
On Thu, Apr 2, 2026 at 11:53 AM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Currently MGLRU and non-MGLRU handle reclaim statistics and
> writeback very differently, especially throttling; MGLRU basically
> just ignores the throttling part.
>
> Let's unify this part: use a helper to deduplicate the code so
> that both setups share the same behavior.
>
> Test with the following bash reproducer:
>
> echo "Setup a slow device using dm delay"
> dd if=/dev/zero of=/var/tmp/backing bs=1M count=2048
> LOOP=$(losetup --show -f /var/tmp/backing)
> mkfs.ext4 -q $LOOP
> echo "0 $(blockdev --getsz $LOOP) delay $LOOP 0 0 $LOOP 0 1000" | \
> dmsetup create slow_dev
> mkdir -p /mnt/slow && mount /dev/mapper/slow_dev /mnt/slow
>
> echo "Start writeback pressure"
> sync && echo 3 > /proc/sys/vm/drop_caches
> mkdir /sys/fs/cgroup/test_wb
> echo 128M > /sys/fs/cgroup/test_wb/memory.max
> (echo $BASHPID > /sys/fs/cgroup/test_wb/cgroup.procs && \
> dd if=/dev/zero of=/mnt/slow/testfile bs=1M count=192)
>
> echo "Clean up"
> echo "0 $(blockdev --getsz $LOOP) error" | dmsetup load slow_dev
> dmsetup resume slow_dev
> umount -l /mnt/slow && sync
> dmsetup remove slow_dev
>
> Before this commit, `dd` gets OOM killed almost immediately if
> MGLRU is enabled; the classic LRU is fine.
>
> After this commit, throttling is effective: no more spinning on the
> LRU and no premature OOM. Stress tests on other workloads also look
> good.
>
> Global throttling is not covered here yet; we will fix that
> separately later.
If I understand correctly, I think this fixes this regression report
[1] from a long time ago that was never fully resolved?

[1]: https://lore.kernel.org/lkml/ZeC-u7GRSptoVqia@chrisdown.name/

We investigated at that time, but I don't feel we got to a consensus
on how to solve it. I think we got a bit bogged down trying to
"completely solve writeback throttling" rather than just doing some
incremental improvement which fixed that particular case.
>
> Suggested-by: Chen Ridong <chenridong@huaweicloud.com>
> Tested-by: Leno Hou <lenohou@gmail.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/vmscan.c | 90 ++++++++++++++++++++++++++++---------------------------------
> 1 file changed, 41 insertions(+), 49 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9120d914445e..a7b3e5b6676b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1942,6 +1942,44 @@ static int current_may_throttle(void)
> return !(current->flags & PF_LOCAL_THROTTLE);
> }
>
> +static void handle_reclaim_writeback(unsigned long nr_taken,
> + struct pglist_data *pgdat,
> + struct scan_control *sc,
> + struct reclaim_stat *stat)
> +{
> + /*
> + * If dirty folios are scanned that are not queued for IO, it
> + * implies that flushers are not doing their job. This can
> + * happen when memory pressure pushes dirty folios to the end of
> + * the LRU before the dirty limits are breached and the dirty
> + * data has expired. It can also happen when the proportion of
> + * dirty folios grows not through writes but through memory
> + * pressure reclaiming all the clean cache. And in some cases,
> + * the flushers simply cannot keep up with the allocation
> + * rate. Nudge the flusher threads in case they are asleep.
> + */
> + if (stat->nr_unqueued_dirty == nr_taken && nr_taken) {
> + wakeup_flusher_threads(WB_REASON_VMSCAN);
> + /*
> + * For cgroupv1 dirty throttling is achieved by waking up
> + * the kernel flusher here and later waiting on folios
> + * which are in writeback to finish (see shrink_folio_list()).
> + *
> + * Flusher may not be able to issue writeback quickly
> + * enough for cgroupv1 writeback throttling to work
> + * on a large system.
> + */
> + if (!writeback_throttling_sane(sc))
> + reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> + }
> +
> + sc->nr.dirty += stat->nr_dirty;
> + sc->nr.congested += stat->nr_congested;
> + sc->nr.writeback += stat->nr_writeback;
> + sc->nr.immediate += stat->nr_immediate;
> + sc->nr.taken += nr_taken;
> +}
> +
> /*
> * shrink_inactive_list() is a helper for shrink_node(). It returns the number
> * of reclaimed pages
> @@ -2005,39 +2043,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
> lruvec_lock_irq(lruvec);
> lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout,
> nr_scanned - nr_reclaimed);
> -
> - /*
> - * If dirty folios are scanned that are not queued for IO, it
> - * implies that flushers are not doing their job. This can
> - * happen when memory pressure pushes dirty folios to the end of
> - * the LRU before the dirty limits are breached and the dirty
> - * data has expired. It can also happen when the proportion of
> - * dirty folios grows not through writes but through memory
> - * pressure reclaiming all the clean cache. And in some cases,
> - * the flushers simply cannot keep up with the allocation
> - * rate. Nudge the flusher threads in case they are asleep.
> - */
> - if (stat.nr_unqueued_dirty == nr_taken) {
> - wakeup_flusher_threads(WB_REASON_VMSCAN);
> - /*
> - * For cgroupv1 dirty throttling is achieved by waking up
> - * the kernel flusher here and later waiting on folios
> - * which are in writeback to finish (see shrink_folio_list()).
> - *
> - * Flusher may not be able to issue writeback quickly
> - * enough for cgroupv1 writeback throttling to work
> - * on a large system.
> - */
> - if (!writeback_throttling_sane(sc))
> - reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> - }
> -
> - sc->nr.dirty += stat.nr_dirty;
> - sc->nr.congested += stat.nr_congested;
> - sc->nr.writeback += stat.nr_writeback;
> - sc->nr.immediate += stat.nr_immediate;
> - sc->nr.taken += nr_taken;
> -
> + handle_reclaim_writeback(nr_taken, pgdat, sc, &stat);
> trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> nr_scanned, nr_reclaimed, &stat, sc->priority, file);
> return nr_reclaimed;
> @@ -4824,26 +4830,11 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> retry:
> reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
> sc->nr_reclaimed += reclaimed;
> + handle_reclaim_writeback(isolated, pgdat, sc, &stat);
> trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> type_scanned, reclaimed, &stat, sc->priority,
> type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
>
> - /*
> - * If too many file cache in the coldest generation can't be evicted
> - * due to being dirty, wake up the flusher.
> - */
> - if (stat.nr_unqueued_dirty == isolated) {
> - wakeup_flusher_threads(WB_REASON_VMSCAN);
> -
> - /*
> - * For cgroupv1 dirty throttling is achieved by waking up
> - * the kernel flusher here and later waiting on folios
> - * which are in writeback to finish (see shrink_folio_list()).
> - */
> - if (!writeback_throttling_sane(sc))
> - reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> - }
> -
> list_for_each_entry_safe_reverse(folio, next, &list, lru) {
> DEFINE_MIN_SEQ(lruvec);
>
> @@ -4886,6 +4877,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>
> if (!list_empty(&list)) {
> skip_retry = true;
> + isolated = 0;
> goto retry;
> }
>
>
> --
> 2.53.0
>
>
On Sat, Apr 4, 2026 at 5:16 AM Axel Rasmussen <axelrasmussen@google.com> wrote:
> [...]
>
> If I understand correctly, I think this fixes this regression report
> [1] from a long time ago that was never fully resolved?
>
> [1]: https://lore.kernel.org/lkml/ZeC-u7GRSptoVqia@chrisdown.name/
>
> We investigated at that time, but I don't feel we got to a consensus
> on how to solve it. I think we got a bit bogged down trying to
> "completely solve writeback throttling" rather than just doing some
> incremental improvement which fixed that particular case.

Hello Axel! Yes, we also observed that problem. I had almost forgotten
about that report, thanks for the link!

No worries, for the majority of users I think the problem was already
fixed a year ago. I previously asked Jingxiang to help fix it by waking
up writeback. In that discussion, the info showed that the flusher was
not waking up at all, and Yafang reported that reverting 14aa8b2d5c2e
fixed it. So Jingxiang's fix seemed to work well at that time:
https://lore.kernel.org/linux-mm/20241026115714.1437435-1-jingxiangzeng.cas@gmail.com/

AFAIK there have been no more reports of premature OOM on the mailing
list since then, but later we found that fix isn't enough for some
particular and rare setups (for example, I used dm-delay in the test
script above to simulate slow IO). Usually reclaim can keep up, since
it's rare for the LRU to be full of writeback folios and there are
always clean folios to drop, so waking up the flusher is good enough.
But under extreme pressure or with very slow devices, the LRU can get
congested with writeback folios. It's hard to apply reasonable
throttling or improve dirty flushing without a bit more refactoring
first, and that's not the only cgroup OOM problem we encountered.

With this series, I think the known problems mentioned above are all
covered in a clean way. Global pressure throttling is still missing,
but that's an even rarer problem, since the LRU getting congested with
writeback globally already seems like a really bad situation to me.
That can also be fixed separately later.