From: Kairui Song <kasong@tencent.com>

Currently MGLRU and non-MGLRU handle reclaim statistics and
writeback very differently, especially throttling; MGLRU basically
just ignores the throttling part.

Let's unify this part: use a helper to deduplicate the code so
that both setups share the same behavior.

Test with the following bash reproducer:

echo "Setup a slow device using dm delay"
dd if=/dev/zero of=/var/tmp/backing bs=1M count=2048
LOOP=$(losetup --show -f /var/tmp/backing)
mkfs.ext4 -q $LOOP
echo "0 $(blockdev --getsz $LOOP) delay $LOOP 0 0 $LOOP 0 1000" | \
dmsetup create slow_dev
mkdir -p /mnt/slow && mount /dev/mapper/slow_dev /mnt/slow

echo "Start writeback pressure"
sync && echo 3 > /proc/sys/vm/drop_caches
mkdir /sys/fs/cgroup/test_wb
echo 128M > /sys/fs/cgroup/test_wb/memory.max
(echo $BASHPID > /sys/fs/cgroup/test_wb/cgroup.procs && \
dd if=/dev/zero of=/mnt/slow/testfile bs=1M count=192)

echo "Clean up"
echo "0 $(blockdev --getsz $LOOP) error" | dmsetup load slow_dev
dmsetup resume slow_dev
umount -l /mnt/slow && sync
dmsetup remove slow_dev
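
For reference, whether MGLRU is active on the test kernel can be checked
through the lru_gen sysfs interface before running the reproducer (a
minimal sketch; the file exists only when the kernel is built with
CONFIG_LRU_GEN):

```shell
# Check whether MGLRU is enabled; /sys/kernel/mm/lru_gen/enabled
# holds a feature bitmask, 0x0000 meaning MGLRU is disabled.
LRU_GEN=/sys/kernel/mm/lru_gen/enabled
if [ -f "$LRU_GEN" ]; then
    echo "lru_gen enabled mask: $(cat "$LRU_GEN")"
else
    echo "MGLRU not available (kernel built without CONFIG_LRU_GEN)"
fi
```

MGLRU can also be toggled at runtime by writing y or n to the same file,
which makes it easy to compare the before/after behavior on one boot.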

Before this commit, `dd` gets OOM killed almost immediately if
MGLRU is enabled; the classic LRU is fine.

After this commit, throttling is effective: no more spinning on the
LRU and no premature OOM. Stress tests on other workloads also look
good.

Global throttling is not covered here yet; we will fix that
separately later.

Suggested-by: Chen Ridong <chenridong@huaweicloud.com>
Tested-by: Leno Hou <lenohou@gmail.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 90 ++++++++++++++++++++++++++++---------------------------------
1 file changed, 41 insertions(+), 49 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9120d914445e..a7b3e5b6676b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1942,6 +1942,44 @@ static int current_may_throttle(void)
return !(current->flags & PF_LOCAL_THROTTLE);
}
+static void handle_reclaim_writeback(unsigned long nr_taken,
+ struct pglist_data *pgdat,
+ struct scan_control *sc,
+ struct reclaim_stat *stat)
+{
+ /*
+ * If dirty folios are scanned that are not queued for IO, it
+ * implies that flushers are not doing their job. This can
+ * happen when memory pressure pushes dirty folios to the end of
+ * the LRU before the dirty limits are breached and the dirty
+ * data has expired. It can also happen when the proportion of
+ * dirty folios grows not through writes but through memory
+ * pressure reclaiming all the clean cache. And in some cases,
+ * the flushers simply cannot keep up with the allocation
+ * rate. Nudge the flusher threads in case they are asleep.
+ */
+ if (stat->nr_unqueued_dirty == nr_taken && nr_taken) {
+ wakeup_flusher_threads(WB_REASON_VMSCAN);
+ /*
+ * For cgroupv1 dirty throttling is achieved by waking up
+ * the kernel flusher here and later waiting on folios
+ * which are in writeback to finish (see shrink_folio_list()).
+ *
+ * Flusher may not be able to issue writeback quickly
+ * enough for cgroupv1 writeback throttling to work
+ * on a large system.
+ */
+ if (!writeback_throttling_sane(sc))
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
+ }
+
+ sc->nr.dirty += stat->nr_dirty;
+ sc->nr.congested += stat->nr_congested;
+ sc->nr.writeback += stat->nr_writeback;
+ sc->nr.immediate += stat->nr_immediate;
+ sc->nr.taken += nr_taken;
+}
+
/*
* shrink_inactive_list() is a helper for shrink_node(). It returns the number
* of reclaimed pages
@@ -2005,39 +2043,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
lruvec_lock_irq(lruvec);
lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout,
nr_scanned - nr_reclaimed);
-
- /*
- * If dirty folios are scanned that are not queued for IO, it
- * implies that flushers are not doing their job. This can
- * happen when memory pressure pushes dirty folios to the end of
- * the LRU before the dirty limits are breached and the dirty
- * data has expired. It can also happen when the proportion of
- * dirty folios grows not through writes but through memory
- * pressure reclaiming all the clean cache. And in some cases,
- * the flushers simply cannot keep up with the allocation
- * rate. Nudge the flusher threads in case they are asleep.
- */
- if (stat.nr_unqueued_dirty == nr_taken) {
- wakeup_flusher_threads(WB_REASON_VMSCAN);
- /*
- * For cgroupv1 dirty throttling is achieved by waking up
- * the kernel flusher here and later waiting on folios
- * which are in writeback to finish (see shrink_folio_list()).
- *
- * Flusher may not be able to issue writeback quickly
- * enough for cgroupv1 writeback throttling to work
- * on a large system.
- */
- if (!writeback_throttling_sane(sc))
- reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
- }
-
- sc->nr.dirty += stat.nr_dirty;
- sc->nr.congested += stat.nr_congested;
- sc->nr.writeback += stat.nr_writeback;
- sc->nr.immediate += stat.nr_immediate;
- sc->nr.taken += nr_taken;
-
+ handle_reclaim_writeback(nr_taken, pgdat, sc, &stat);
trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
nr_scanned, nr_reclaimed, &stat, sc->priority, file);
return nr_reclaimed;
@@ -4824,26 +4830,11 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
retry:
reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
sc->nr_reclaimed += reclaimed;
+ handle_reclaim_writeback(isolated, pgdat, sc, &stat);
trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
type_scanned, reclaimed, &stat, sc->priority,
type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
- /*
- * If too many file cache in the coldest generation can't be evicted
- * due to being dirty, wake up the flusher.
- */
- if (stat.nr_unqueued_dirty == isolated) {
- wakeup_flusher_threads(WB_REASON_VMSCAN);
-
- /*
- * For cgroupv1 dirty throttling is achieved by waking up
- * the kernel flusher here and later waiting on folios
- * which are in writeback to finish (see shrink_folio_list()).
- */
- if (!writeback_throttling_sane(sc))
- reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
- }
-
list_for_each_entry_safe_reverse(folio, next, &list, lru) {
DEFINE_MIN_SEQ(lruvec);
@@ -4886,6 +4877,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
if (!list_empty(&list)) {
skip_retry = true;
+ isolated = 0;
goto retry;
}
--
2.53.0
On Thu, Apr 2, 2026 at 11:53 AM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Currently MGLRU and non-MGLRU handle reclaim statistics and
> writeback very differently, especially throttling; MGLRU basically
> just ignores the throttling part.
>
> Let's unify this part: use a helper to deduplicate the code so
> that both setups share the same behavior.
>
> Test with the following bash reproducer:
>
> echo "Setup a slow device using dm delay"
> dd if=/dev/zero of=/var/tmp/backing bs=1M count=2048
> LOOP=$(losetup --show -f /var/tmp/backing)
> mkfs.ext4 -q $LOOP
> echo "0 $(blockdev --getsz $LOOP) delay $LOOP 0 0 $LOOP 0 1000" | \
> dmsetup create slow_dev
> mkdir -p /mnt/slow && mount /dev/mapper/slow_dev /mnt/slow
>
> echo "Start writeback pressure"
> sync && echo 3 > /proc/sys/vm/drop_caches
> mkdir /sys/fs/cgroup/test_wb
> echo 128M > /sys/fs/cgroup/test_wb/memory.max
> (echo $BASHPID > /sys/fs/cgroup/test_wb/cgroup.procs && \
> dd if=/dev/zero of=/mnt/slow/testfile bs=1M count=192)
>
> echo "Clean up"
> echo "0 $(blockdev --getsz $LOOP) error" | dmsetup load slow_dev
> dmsetup resume slow_dev
> umount -l /mnt/slow && sync
> dmsetup remove slow_dev
>
> Before this commit, `dd` gets OOM killed almost immediately if
> MGLRU is enabled; the classic LRU is fine.
>
> After this commit, throttling is effective: no more spinning on the
> LRU and no premature OOM. Stress tests on other workloads also look
> good.
>
> Global throttling is not covered here yet; we will fix that
> separately later.
If I understand correctly, I think this fixes this regression report
[1] from a long time ago that was never fully resolved?

[1]: https://lore.kernel.org/lkml/ZeC-u7GRSptoVqia@chrisdown.name/

We investigated at that time, but I don't feel we got to a consensus
on how to solve it. I think we got a bit bogged down trying to
"completely solve writeback throttling" rather than just doing some
incremental improvement which fixed that particular case.
>
> Suggested-by: Chen Ridong <chenridong@huaweicloud.com>
> Tested-by: Leno Hou <lenohou@gmail.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/vmscan.c | 90 ++++++++++++++++++++++++++++---------------------------------
> 1 file changed, 41 insertions(+), 49 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9120d914445e..a7b3e5b6676b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1942,6 +1942,44 @@ static int current_may_throttle(void)
> return !(current->flags & PF_LOCAL_THROTTLE);
> }
>
> +static void handle_reclaim_writeback(unsigned long nr_taken,
> + struct pglist_data *pgdat,
> + struct scan_control *sc,
> + struct reclaim_stat *stat)
> +{
> + /*
> + * If dirty folios are scanned that are not queued for IO, it
> + * implies that flushers are not doing their job. This can
> + * happen when memory pressure pushes dirty folios to the end of
> + * the LRU before the dirty limits are breached and the dirty
> + * data has expired. It can also happen when the proportion of
> + * dirty folios grows not through writes but through memory
> + * pressure reclaiming all the clean cache. And in some cases,
> + * the flushers simply cannot keep up with the allocation
> + * rate. Nudge the flusher threads in case they are asleep.
> + */
> + if (stat->nr_unqueued_dirty == nr_taken && nr_taken) {
> + wakeup_flusher_threads(WB_REASON_VMSCAN);
> + /*
> + * For cgroupv1 dirty throttling is achieved by waking up
> + * the kernel flusher here and later waiting on folios
> + * which are in writeback to finish (see shrink_folio_list()).
> + *
> + * Flusher may not be able to issue writeback quickly
> + * enough for cgroupv1 writeback throttling to work
> + * on a large system.
> + */
> + if (!writeback_throttling_sane(sc))
> + reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> + }
> +
> + sc->nr.dirty += stat->nr_dirty;
> + sc->nr.congested += stat->nr_congested;
> + sc->nr.writeback += stat->nr_writeback;
> + sc->nr.immediate += stat->nr_immediate;
> + sc->nr.taken += nr_taken;
> +}
> +
> /*
> * shrink_inactive_list() is a helper for shrink_node(). It returns the number
> * of reclaimed pages
> @@ -2005,39 +2043,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
> lruvec_lock_irq(lruvec);
> lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout,
> nr_scanned - nr_reclaimed);
> -
> - /*
> - * If dirty folios are scanned that are not queued for IO, it
> - * implies that flushers are not doing their job. This can
> - * happen when memory pressure pushes dirty folios to the end of
> - * the LRU before the dirty limits are breached and the dirty
> - * data has expired. It can also happen when the proportion of
> - * dirty folios grows not through writes but through memory
> - * pressure reclaiming all the clean cache. And in some cases,
> - * the flushers simply cannot keep up with the allocation
> - * rate. Nudge the flusher threads in case they are asleep.
> - */
> - if (stat.nr_unqueued_dirty == nr_taken) {
> - wakeup_flusher_threads(WB_REASON_VMSCAN);
> - /*
> - * For cgroupv1 dirty throttling is achieved by waking up
> - * the kernel flusher here and later waiting on folios
> - * which are in writeback to finish (see shrink_folio_list()).
> - *
> - * Flusher may not be able to issue writeback quickly
> - * enough for cgroupv1 writeback throttling to work
> - * on a large system.
> - */
> - if (!writeback_throttling_sane(sc))
> - reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> - }
> -
> - sc->nr.dirty += stat.nr_dirty;
> - sc->nr.congested += stat.nr_congested;
> - sc->nr.writeback += stat.nr_writeback;
> - sc->nr.immediate += stat.nr_immediate;
> - sc->nr.taken += nr_taken;
> -
> + handle_reclaim_writeback(nr_taken, pgdat, sc, &stat);
> trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> nr_scanned, nr_reclaimed, &stat, sc->priority, file);
> return nr_reclaimed;
> @@ -4824,26 +4830,11 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> retry:
> reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
> sc->nr_reclaimed += reclaimed;
> + handle_reclaim_writeback(isolated, pgdat, sc, &stat);
> trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> type_scanned, reclaimed, &stat, sc->priority,
> type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
>
> - /*
> - * If too many file cache in the coldest generation can't be evicted
> - * due to being dirty, wake up the flusher.
> - */
> - if (stat.nr_unqueued_dirty == isolated) {
> - wakeup_flusher_threads(WB_REASON_VMSCAN);
> -
> - /*
> - * For cgroupv1 dirty throttling is achieved by waking up
> - * the kernel flusher here and later waiting on folios
> - * which are in writeback to finish (see shrink_folio_list()).
> - */
> - if (!writeback_throttling_sane(sc))
> - reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> - }
> -
> list_for_each_entry_safe_reverse(folio, next, &list, lru) {
> DEFINE_MIN_SEQ(lruvec);
>
> @@ -4886,6 +4877,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>
> if (!list_empty(&list)) {
> skip_retry = true;
> + isolated = 0;
> goto retry;
> }
>
>
> --
> 2.53.0
>
>
On Sat, Apr 4, 2026 at 5:16 AM Axel Rasmussen <axelrasmussen@google.com> wrote:
> [...]
>
> If I understand correctly, I think this fixes this regression report
> [1] from a long time ago that was never fully resolved?
>
> [1]: https://lore.kernel.org/lkml/ZeC-u7GRSptoVqia@chrisdown.name/
>
> We investigated at that time, but I don't feel we got to a consensus
> on how to solve it. I think we got a bit bogged down trying to
> "completely solve writeback throttling" rather than just doing some
> incremental improvement which fixed that particular case.

Hello Axel! Yes, we also observed that problem. I had almost forgotten
about that report, thanks for the link!

No worries, for the majority of users I think the problem was already
fixed a year ago. I previously asked Jingxiang to help fix it by waking
up writeback. In that discussion, the info showed that the flusher was
not waking up at all, and Yafang reported that reverting 14aa8b2d5c2e
fixed it. So Jingxiang's fix seemed to work well at that time:
https://lore.kernel.org/linux-mm/20241026115714.1437435-1-jingxiangzeng.cas@gmail.com/

AFAIK there have been no more reports of premature OOM on the mailing
list since then, but later we found that fix isn't enough for some
particular and rare setups (for example, I used dm-delay in the test
script above to simulate slow IO). Usually reclaim can keep up, since
it's rare for the LRU to be full of writeback folios and there are
always clean folios to drop, so waking up the flusher is good enough.
But under extreme pressure or with very slow devices, the LRU can get
congested with writeback folios. It's hard to apply reasonable
throttling or improve dirty flushing without a bit more refactoring
first, and that's not the only cgroup OOM problem we encountered.

With this series, I think the known problems mentioned above are all
covered in a clean way. Global pressure throttling is still missing,
but that's an even rarer problem, since the LRU getting congested with
writeback globally already seems like a really bad situation to me.
That can also be fixed separately later.