[PATCH v3 1/2] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim

Jiayuan Chen posted 2 patches 3 weeks, 4 days ago
There is a newer version of this series
[PATCH v3 1/2] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
Posted by Jiayuan Chen 3 weeks, 4 days ago
From: Jiayuan Chen <jiayuan.chen@shopee.com>

When kswapd fails to reclaim memory, kswapd_failures is incremented.
Once it reaches MAX_RECLAIM_RETRIES, kswapd stops running to avoid
futile reclaim attempts. However, any successful direct reclaim
unconditionally resets kswapd_failures to 0, which can cause problems.
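
For reference, the mechanism looks roughly like this (a simplified
sketch paraphrased from mm/vmscan.c, not the exact code):

	/* balance_pgdat(): a pass that reclaims nothing counts as a failure */
	if (!sc.nr_reclaimed)
		atomic_inc(&pgdat->kswapd_failures);

	/* wakeup_kswapd(): do not wake a node kswapd has given up on */
	if (atomic_read(&pgdat->kswapd_failures) >= MAX_RECLAIM_RETRIES)
		return;

	/* shrink_node(): any reclaimer that makes progress revives kswapd */
	if (reclaimable)
		atomic_set(&pgdat->kswapd_failures, 0);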

We observed an issue in production on a multi-NUMA system where a
process allocated a large amount of anonymous memory on a single NUMA
node, pushing that node's free memory below the high watermark and
evicting most of its file pages:

$ numastat -m
Per-node system memory usage (in MBs):
                          Node 0          Node 1           Total
                 --------------- --------------- ---------------
MemTotal               128222.19       127983.91       256206.11
MemFree                  1414.48         1432.80         2847.29
MemUsed                126807.71       126551.11       252358.82
SwapCached                  0.00            0.00            0.00
Active                  29017.91        25554.57        54572.48
Inactive                92749.06        95377.00       188126.06
Active(anon)            28998.96        23356.47        52355.43
Inactive(anon)          92685.27        87466.11       180151.39
Active(file)               18.95         2198.10         2217.05
Inactive(file)             63.79         7910.89         7974.68

With swap disabled, only file pages can be reclaimed. When kswapd is
woken (e.g., via wake_all_kswapds()), it runs continuously but cannot
raise free memory above the high watermark since reclaimable file pages
are insufficient. Normally, kswapd would eventually stop after
kswapd_failures reaches MAX_RECLAIM_RETRIES.

However, containers on this machine have memory.high set in their
cgroup. Business processes continuously trigger the high limit, causing
frequent direct reclaim that keeps resetting kswapd_failures to 0. This
prevents kswapd from ever stopping.
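
In the memory.high case, the path that ends up clearing the counter is
roughly the following (paraphrased; the exact entry points vary between
kernel versions):

	task breaches memory.high
	  -> mem_cgroup_handle_over_high()      /* on return to userspace */
	     -> reclaim_high()
	        -> try_to_free_mem_cgroup_pages()
	           -> shrink_node()
	              -> atomic_set(&pgdat->kswapd_failures, 0) on any progress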

The key insight is that direct reclaim triggered by cgroup memory.high
performs aggressive scanning to throttle the allocating process. With
sufficiently aggressive scanning, even hot pages will eventually be
reclaimed, making direct reclaim "successful" at freeing some memory.
However, this success does not mean the node has reached a balanced
state - the freed memory may still be insufficient to bring free pages
above the high watermark. Unconditionally resetting kswapd_failures in
this case keeps kswapd alive indefinitely.

The result is that kswapd runs endlessly. Unlike direct reclaim which
only reclaims from the allocating cgroup, kswapd scans the entire node's
memory. This causes hot file pages from all workloads on the node to be
evicted, not just those from the cgroup triggering memory.high. These
pages constantly refault, generating sustained heavy IO READ pressure
across the entire system.

Fix this by only resetting kswapd_failures when the node is actually
balanced. This allows both kswapd and direct reclaim to clear
kswapd_failures upon successful reclaim, but only when the reclaim
actually resolves the memory pressure (i.e., the node becomes balanced).
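
Note that pgdat_balanced() is the same test kswapd itself uses to decide
whether a node is balanced: roughly, it returns true only when an
eligible zone still meets its high watermark for the requested order,
along the lines of

	zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
			       highest_zoneidx);

so the reset now happens exactly when kswapd would consider its job done.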

Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
 mm/vmscan.c | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 670fe9fae5ba..6fd100130987 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2650,6 +2650,25 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
 			  lruvec_memcg(lruvec));
 }
 
+static void pgdat_reset_kswapd_failures(pg_data_t *pgdat)
+{
+	atomic_set(&pgdat->kswapd_failures, 0);
+}
+
+/*
+ * Reset kswapd_failures only when the node is balanced. Without this
+ * check, successful direct reclaim (e.g., from cgroup memory.high
+ * throttling) can keep resetting kswapd_failures even when the node
+ * cannot be balanced, causing kswapd to run endlessly.
+ */
+static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx);
+static inline void pgdat_try_reset_kswapd_failures(struct pglist_data *pgdat,
+						   struct scan_control *sc)
+{
+	if (pgdat_balanced(pgdat, sc->order, sc->reclaim_idx))
+		pgdat_reset_kswapd_failures(pgdat);
+}
+
 #ifdef CONFIG_LRU_GEN
 
 #ifdef CONFIG_LRU_GEN_ENABLED
@@ -5067,7 +5086,7 @@ static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *
 	blk_finish_plug(&plug);
 done:
 	if (sc->nr_reclaimed > reclaimed)
-		atomic_set(&pgdat->kswapd_failures, 0);
+		pgdat_try_reset_kswapd_failures(pgdat, sc);
 }
 
 /******************************************************************************
@@ -6141,7 +6160,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	 * successful direct reclaim run will revive a dormant kswapd.
 	 */
 	if (reclaimable)
-		atomic_set(&pgdat->kswapd_failures, 0);
+		pgdat_try_reset_kswapd_failures(pgdat, sc);
 	else if (sc->cache_trim_mode)
 		sc->cache_trim_mode_failed = 1;
 }
-- 
2.43.0
Re: [PATCH v3 1/2] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
Posted by Shakeel Butt 3 weeks ago
On Wed, Jan 14, 2026 at 03:40:35PM +0800, Jiayuan Chen wrote:
> From: Jiayuan Chen <jiayuan.chen@shopee.com>
> 
> When kswapd fails to reclaim memory, kswapd_failures is incremented.
> Once it reaches MAX_RECLAIM_RETRIES, kswapd stops running to avoid
> futile reclaim attempts. However, any successful direct reclaim
> unconditionally resets kswapd_failures to 0, which can cause problems.
> 
> We observed an issue in production on a multi-NUMA system where a
> process allocated a large amount of anonymous memory on a single NUMA
> node, pushing that node's free memory below the high watermark and
> evicting most of its file pages:
> 
> $ numastat -m
> Per-node system memory usage (in MBs):
>                           Node 0          Node 1           Total
>                  --------------- --------------- ---------------
> MemTotal               128222.19       127983.91       256206.11
> MemFree                  1414.48         1432.80         2847.29
> MemUsed                126807.71       126551.11       252358.82
> SwapCached                  0.00            0.00            0.00
> Active                  29017.91        25554.57        54572.48
> Inactive                92749.06        95377.00       188126.06
> Active(anon)            28998.96        23356.47        52355.43
> Inactive(anon)          92685.27        87466.11       180151.39
> Active(file)               18.95         2198.10         2217.05
> Inactive(file)             63.79         7910.89         7974.68
> 
> With swap disabled, only file pages can be reclaimed. When kswapd is
> woken (e.g., via wake_all_kswapds()), it runs continuously but cannot
> raise free memory above the high watermark since reclaimable file pages
> are insufficient. Normally, kswapd would eventually stop after
> kswapd_failures reaches MAX_RECLAIM_RETRIES.
> 
> However, containers on this machine have memory.high set in their
> cgroup. Business processes continuously trigger the high limit, causing
> frequent direct reclaim that keeps resetting kswapd_failures to 0. This
> prevents kswapd from ever stopping.
> 
> The key insight is that direct reclaim triggered by cgroup memory.high
> performs aggressive scanning to throttle the allocating process. With
> sufficiently aggressive scanning, even hot pages will eventually be
> reclaimed, making direct reclaim "successful" at freeing some memory.
> However, this success does not mean the node has reached a balanced
> state - the freed memory may still be insufficient to bring free pages
> above the high watermark. Unconditionally resetting kswapd_failures in
> this case keeps kswapd alive indefinitely.
> 
> The result is that kswapd runs endlessly. Unlike direct reclaim which
> only reclaims from the allocating cgroup, kswapd scans the entire node's
> memory. This causes hot file pages from all workloads on the node to be
> evicted, not just those from the cgroup triggering memory.high. These
> pages constantly refault, generating sustained heavy IO READ pressure
> across the entire system.
> 
> Fix this by only resetting kswapd_failures when the node is actually
> balanced. This allows both kswapd and direct reclaim to clear
> kswapd_failures upon successful reclaim, but only when the reclaim
> actually resolves the memory pressure (i.e., the node becomes balanced).
> 
> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>

After incorporating suggestions from Johannes, you can add:

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Re: [PATCH v3 1/2] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
Posted by Johannes Weiner 3 weeks, 1 day ago
On Wed, Jan 14, 2026 at 03:40:35PM +0800, Jiayuan Chen wrote:
> From: Jiayuan Chen <jiayuan.chen@shopee.com>
> 
> When kswapd fails to reclaim memory, kswapd_failures is incremented.
> Once it reaches MAX_RECLAIM_RETRIES, kswapd stops running to avoid
> futile reclaim attempts. However, any successful direct reclaim
> unconditionally resets kswapd_failures to 0, which can cause problems.
> 
> We observed an issue in production on a multi-NUMA system where a
> process allocated a large amount of anonymous memory on a single NUMA
> node, pushing that node's free memory below the high watermark and
> evicting most of its file pages:
> 
> $ numastat -m
> Per-node system memory usage (in MBs):
>                           Node 0          Node 1           Total
>                  --------------- --------------- ---------------
> MemTotal               128222.19       127983.91       256206.11
> MemFree                  1414.48         1432.80         2847.29
> MemUsed                126807.71       126551.11       252358.82
> SwapCached                  0.00            0.00            0.00
> Active                  29017.91        25554.57        54572.48
> Inactive                92749.06        95377.00       188126.06
> Active(anon)            28998.96        23356.47        52355.43
> Inactive(anon)          92685.27        87466.11       180151.39
> Active(file)               18.95         2198.10         2217.05
> Inactive(file)             63.79         7910.89         7974.68
> 
> With swap disabled, only file pages can be reclaimed. When kswapd is
> woken (e.g., via wake_all_kswapds()), it runs continuously but cannot
> raise free memory above the high watermark since reclaimable file pages
> are insufficient. Normally, kswapd would eventually stop after
> kswapd_failures reaches MAX_RECLAIM_RETRIES.
> 
> However, containers on this machine have memory.high set in their
> cgroup. Business processes continuously trigger the high limit, causing
> frequent direct reclaim that keeps resetting kswapd_failures to 0. This
> prevents kswapd from ever stopping.
> 
> The key insight is that direct reclaim triggered by cgroup memory.high
> performs aggressive scanning to throttle the allocating process. With
> sufficiently aggressive scanning, even hot pages will eventually be
> reclaimed, making direct reclaim "successful" at freeing some memory.
> However, this success does not mean the node has reached a balanced
> state - the freed memory may still be insufficient to bring free pages
> above the high watermark. Unconditionally resetting kswapd_failures in
> this case keeps kswapd alive indefinitely.
> 
> The result is that kswapd runs endlessly. Unlike direct reclaim which
> only reclaims from the allocating cgroup, kswapd scans the entire node's
> memory. This causes hot file pages from all workloads on the node to be
> evicted, not just those from the cgroup triggering memory.high. These
> pages constantly refault, generating sustained heavy IO READ pressure
> across the entire system.
> 
> Fix this by only resetting kswapd_failures when the node is actually
> balanced. This allows both kswapd and direct reclaim to clear
> kswapd_failures upon successful reclaim, but only when the reclaim
> actually resolves the memory pressure (i.e., the node becomes balanced).
> 
> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>

Great analysis, and I agree with both the fix and adding tracepoints.

Two minor nits:

> @@ -2650,6 +2650,25 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
>  			  lruvec_memcg(lruvec));
>  }
>  
> +static void pgdat_reset_kswapd_failures(pg_data_t *pgdat)
> +{
> +	atomic_set(&pgdat->kswapd_failures, 0);
> +}
> +
> +/*
> + * Reset kswapd_failures only when the node is balanced. Without this
> + * check, successful direct reclaim (e.g., from cgroup memory.high
> + * throttling) can keep resetting kswapd_failures even when the node
> + * cannot be balanced, causing kswapd to run endlessly.
> + */
> +static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx);
> +static inline void pgdat_try_reset_kswapd_failures(struct pglist_data *pgdat,

Please remove the inline, the compiler will figure it out.

> +						   struct scan_control *sc)
> +{
> +	if (pgdat_balanced(pgdat, sc->order, sc->reclaim_idx))
> +		pgdat_reset_kswapd_failures(pgdat);
> +}

As this is kswapd API, please move these down to after wakeup_kswapd().

I think we can streamline the names a bit. We already use "hopeless"
for that state in the comments; can you please rename the functions
kswapd_clear_hopeless() and kswapd_try_clear_hopeless()?

We should then also replace the open-coded kswapd_failure checks with
kswapd_test_hopeless(). But I can send a follow-up patch if you don't
want to, just let me know.
Re: [PATCH v3 1/2] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
Posted by Jiayuan Chen 2 weeks, 6 days ago
2026/1/17 01:00, "Johannes Weiner" <hannes@cmpxchg.org> wrote:

[...]
> > 
> Great analysis, and I agree with both the fix and adding tracepoints.
> 
> Two minor nits:
> 
> > 
> > @@ -2650,6 +2650,25 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
> >  lruvec_memcg(lruvec));
> >  }
> >  
> >  +static void pgdat_reset_kswapd_failures(pg_data_t *pgdat)
> >  +{
> >  + atomic_set(&pgdat->kswapd_failures, 0);
> >  +}
> >  +
> >  +/*
> >  + * Reset kswapd_failures only when the node is balanced. Without this
> >  + * check, successful direct reclaim (e.g., from cgroup memory.high
> >  + * throttling) can keep resetting kswapd_failures even when the node
> >  + * cannot be balanced, causing kswapd to run endlessly.
> >  + */
> >  +static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx);
> >  +static inline void pgdat_try_reset_kswapd_failures(struct pglist_data *pgdat,
> > 
> Please remove the inline, the compiler will figure it out.
> 
> > 
> > + struct scan_control *sc)
> >  +{
> >  + if (pgdat_balanced(pgdat, sc->order, sc->reclaim_idx))
> >  + pgdat_reset_kswapd_failures(pgdat);
> >  +}
> > 
> As this is kswapd API, please move these down to after wakeup_kswapd().
> 
> I think we can streamline the names a bit. We already use "hopeless"
> for that state in the comments; can you please rename the functions
> kswapd_clear_hopeless() and kswapd_try_clear_hopeless()?
> 
> We should then also replace the open-coded kswapd_failure checks with
> kswapd_test_hopeless(). But I can send a follow-up patch if you don't
> want to, just let me know.
>

Thanks, Johannes and Shakeel. I'll send an updated version with these fixes.
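
Just to confirm my understanding of the renaming, something along these
lines (untested sketch):

	static void kswapd_clear_hopeless(pg_data_t *pgdat)
	{
		atomic_set(&pgdat->kswapd_failures, 0);
	}

	static bool kswapd_test_hopeless(pg_data_t *pgdat)
	{
		return atomic_read(&pgdat->kswapd_failures) >= MAX_RECLAIM_RETRIES;
	}

	static void kswapd_try_clear_hopeless(pg_data_t *pgdat,
					      struct scan_control *sc)
	{
		if (pgdat_balanced(pgdat, sc->order, sc->reclaim_idx))
			kswapd_clear_hopeless(pgdat);
	}

placed after wakeup_kswapd(), with the open-coded kswapd_failures checks
switched to kswapd_test_hopeless() in a follow-up.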