[PATCH v2] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim

Posted by Jiayuan Chen 1 month, 1 week ago
From: Jiayuan Chen <jiayuan.chen@shopee.com>

This is v2 of this patch series. For v1, see [1].

When kswapd fails to reclaim memory, kswapd_failures is incremented.
Once it reaches MAX_RECLAIM_RETRIES, kswapd stops running to avoid
futile reclaim attempts. However, any successful direct reclaim
unconditionally resets kswapd_failures to 0, which can cause problems.
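
In simplified form (an illustrative sketch, not verbatim kernel code),
the relevant pre-patch logic is:

	/* kswapd: balance_pgdat() made no reclaim progress */
	atomic_inc(&pgdat->kswapd_failures);

	/* wakeup path: do not keep waking a kswapd that has given up */
	if (atomic_read(&pgdat->kswapd_failures) >= MAX_RECLAIM_RETRIES)
		return;

	/*
	 * shrink_node(): any reclaim progress, including from direct
	 * reclaim, revives kswapd unconditionally.
	 */
	if (reclaimable)
		atomic_set(&pgdat->kswapd_failures, 0);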

We observed an issue in production on a multi-NUMA system where a
process allocated a large amount of anonymous memory on a single NUMA
node, pushing that node's free memory below the high watermark and
evicting most of its file pages:

$ numastat -m
Per-node system memory usage (in MBs):
                          Node 0          Node 1           Total
                 --------------- --------------- ---------------
MemTotal               128222.19       127983.91       256206.11
MemFree                  1414.48         1432.80         2847.29
MemUsed                126807.71       126551.11       252358.82
SwapCached                  0.00            0.00            0.00
Active                  29017.91        25554.57        54572.48
Inactive                92749.06        95377.00       188126.06
Active(anon)            28998.96        23356.47        52355.43
Inactive(anon)          92685.27        87466.11       180151.39
Active(file)               18.95         2198.10         2217.05
Inactive(file)             63.79         7910.89         7974.68

With swap disabled, only file pages can be reclaimed. When kswapd is
woken (e.g., via wake_all_kswapds()), it runs continuously but cannot
raise free memory above the high watermark since reclaimable file pages
are insufficient. Normally, kswapd would eventually stop after
kswapd_failures reaches MAX_RECLAIM_RETRIES.

However, containers on this machine have memory.high set in their
cgroups. Their workloads continuously hit the high limit, causing
frequent direct reclaim that keeps resetting kswapd_failures to 0. This
prevents kswapd from ever stopping.

The key insight is that direct reclaim triggered by cgroup memory.high
performs aggressive scanning to throttle the allocating process. With
sufficiently aggressive scanning, even hot pages will eventually be
reclaimed, making direct reclaim "successful" at freeing some memory.
However, this success does not mean the node has reached a balanced
state - the freed memory may still be insufficient to bring free pages
above the high watermark. Unconditionally resetting kswapd_failures in
this case keeps kswapd alive indefinitely.

The result is that kswapd runs endlessly. Unlike direct reclaim which
only reclaims from the allocating cgroup, kswapd scans the entire node's
memory. This causes hot file pages from all workloads on the node to be
evicted, not just those from the cgroup triggering memory.high. These
pages constantly refault, generating sustained heavy IO READ pressure
across the entire system.

Fix this by resetting kswapd_failures only when the node is actually
balanced. Both kswapd and direct reclaim can still clear
kswapd_failures after successful reclaim, but only when that reclaim
actually resolves the memory pressure, i.e. the node becomes balanced.

[1] https://lore.kernel.org/all/20251222122022.254268-1-jiayuan.chen@linux.dev/
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
---
 mm/vmscan.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 453d654727c1..594bf6eb52fb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2648,6 +2648,20 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
 			  lruvec_memcg(lruvec));
 }
 
+/*
+ * Reset kswapd_failures only when the node is balanced. Without this
+ * check, successful direct reclaim (e.g., from cgroup memory.high
+ * throttling) can keep resetting kswapd_failures even when the node
+ * cannot be balanced, causing kswapd to run endlessly.
+ */
+static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx);
+static inline void reset_kswapd_failures(struct pglist_data *pgdat,
+					 struct scan_control *sc)
+{
+	if (pgdat_balanced(pgdat, sc->order, sc->reclaim_idx))
+		atomic_set(&pgdat->kswapd_failures, 0);
+}
+
 #ifdef CONFIG_LRU_GEN
 
 #ifdef CONFIG_LRU_GEN_ENABLED
@@ -5065,7 +5079,7 @@ static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *
 	blk_finish_plug(&plug);
 done:
 	if (sc->nr_reclaimed > reclaimed)
-		atomic_set(&pgdat->kswapd_failures, 0);
+		reset_kswapd_failures(pgdat, sc);
 }
 
 /******************************************************************************
@@ -6139,7 +6153,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	 * successful direct reclaim run will revive a dormant kswapd.
 	 */
 	if (reclaimable)
-		atomic_set(&pgdat->kswapd_failures, 0);
+		reset_kswapd_failures(pgdat, sc);
 	else if (sc->cache_trim_mode)
 		sc->cache_trim_mode_failed = 1;
 }
-- 
2.43.0
Re: [PATCH v2] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
Posted by Shakeel Butt 1 month ago
On Fri, Dec 26, 2025 at 04:00:42PM +0800, Jiayuan Chen wrote:
[...]

Hi Jiayuan, can you please send v3 of this patch with the following
additional information:

1. Impact of the patch on your production jobs, i.e., does it really
solve the issue?

2. Memory reclaim stats or CPU usage of kswapd with and without the patch.

thanks,
Shakeel
Re: [PATCH v2] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
Posted by Jiayuan Chen 1 month ago
On January 7, 2026 at 06:06, "Shakeel Butt" <shakeel.butt@linux.dev> wrote:

> On Fri, Dec 26, 2025 at 04:00:42PM +0800, Jiayuan Chen wrote:
> > [...]
> 
> Hi Jiayuan, can you please send v3 of this patch with the following
> additional information:
> 
> 1. Impact of the patch on your production jobs, i.e., does it really
> solve the issue?
> 
> 2. Memory reclaim stats or CPU usage of kswapd with and without the patch.
> 
> thanks,
> Shakeel
>


Hi Shakeel,

Thanks for the feedback.

To be honest, the issue is difficult to reproduce because the boundary conditions are quite complex.
We also haven't deployed this patch in production yet. I discovered the relationship between
kswapd_failures and direct reclaim through the following bpftrace script:

'''bash

bpftrace -e '
#include <linux/mmzone.h>
#include <linux/shrinker.h>
kprobe:balance_pgdat {
	$pgdat = (struct pglist_data *)arg0;
	if ($pgdat->kswapd_failures > 0) {
		printf("[node %d] [%lu] kswapd end, kswapd_failures %d\n", $pgdat->node_id, jiffies, $pgdat->kswapd_failures);
	}
}
tracepoint:vmscan:mm_vmscan_direct_reclaim_end {
	printf("[cpu %d] [%lu] reset kswapd_failures, nr_reclaimed %lu\n", cpu, jiffies, args.nr_reclaimed);
}
'

'''

The trace output showed that once kswapd_failures reached 15, continuous direct reclaim kept
resetting it to 0. This was accompanied by a flood of such trace entries, and shortly
afterwards we observed massive refaults.
(Note that I can only observe up to 15 in the trace due to a kprobe limitation:
the kprobe on balance_pgdat fires at function entry, but kswapd_failures is incremented to 16 only
when balance_pgdat fails to reclaim any pages - at which point kswapd goes to sleep and there's no
suitable hook point to capture it.)


Before I send v3, I'd like to continue the discussion to make sure we're aligned on the approach:

    Do you think the bpftrace evidence above is sufficient?


If you and Michal are okay with the current approach, I'll prepare v3 with the review comments addressed in more detail.

By the way, this tracing limitation makes me wonder: would it be appropriate to add two tracepoints for
kswapd_failures? One for when kswapd_failures reaches MAX_RECLAIM_RETRIES (16), and another for when it
gets reset to 0. Currently, the only way to detect this is by polling node_unreclaimable from /proc/zoneinfo,
but the sampling interval is usually too coarse to catch these events.
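
As a rough sketch of what I have in mind (the event name, fields and
call sites are placeholders only, not a concrete proposal yet),
something along these lines in include/trace/events/vmscan.h:

'''c
TRACE_EVENT(mm_vmscan_kswapd_failures,

	TP_PROTO(int nid, int failures),

	TP_ARGS(nid, failures),

	TP_STRUCT__entry(
		__field(int, nid)
		__field(int, failures)
	),

	TP_fast_assign(
		__entry->nid = nid;
		__entry->failures = failures;
	),

	TP_printk("nid=%d kswapd_failures=%d",
		__entry->nid, __entry->failures)
);
'''

It could be emitted once when the counter reaches MAX_RECLAIM_RETRIES
and once from the reset path, so both the "gave up" and "revived"
transitions become visible without polling /proc/zoneinfo.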

Thanks
Re: [PATCH v2] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
Posted by Shakeel Butt 3 weeks, 4 days ago
Hi Jiayuan,

Sorry for late reply. Let me respond in-place below.

On Wed, Jan 07, 2026 at 11:39:36AM +0000, Jiayuan Chen wrote:
[...]
> 
> Hi Shakeel,
> 
> Thanks for the feedback.
> 
> To be honest, the issue is difficult to reproduce because the boundary conditions are quite complex.
> We also haven't deployed this patch in production yet. I discovered the relationship between
> kswapd_failures and direct reclaim through the following bpftrace script:
> 
> [...]
> 
> Before I send v3, I'd like to continue the discussion to make sure we're aligned on the approach:
> 
>     Do you think the bpftrace evidence above is sufficient?

Mainly I want to see whether the patch contributes positively or
negatively to the situation you are seeing in production. Overall I
think Michal and I are on the same page that the patch is a net positive,
but testing in production would eliminate the concerns completely.
Anyway, we can proceed with the patch and we can always change it in the
future if it does not work. Please go ahead with v3 and the additional
explanation.

> 
> 
> If you and Michal are okay with the current approach, I'll prepare v3 with the review comments addressed in more detail.
> 
> By the way, this tracing limitation makes me wonder: would it be appropriate to add two tracepoints for
> kswapd_failures? One for when kswapd_failures reaches MAX_RECLAIM_RETRIES (16), and another for when it
> gets reset to 0. Currently, the only way to detect this is by polling node_unreclaimable from /proc/zoneinfo,
> but the sampling interval is usually too coarse to catch these events.

tracepoints are cheap and I am all for more observability. Go ahead and
propose the tracepoints as you see fit.
Re: [PATCH v2] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
Posted by Andrew Morton 1 month, 1 week ago
On Fri, 26 Dec 2025 16:00:42 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote:

> This is v2 of this patch series. For v1, see [1].

This isn't significantly different from v1.  You appear to be
mid-discussion with Shakeel on v1.

I guess I'll toss v2 into mm.git for some additional exposure but
please let's continue that discussion.
Re: [PATCH v2] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
Posted by Shakeel Butt 1 month, 1 week ago
On Sun, Dec 28, 2025 at 11:46:22AM -0800, Andrew Morton wrote:
> On Fri, 26 Dec 2025 16:00:42 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
> 
> > This is v2 of this patch series. For v1, see [1].
> 
> This isn't significantly different from v1.  You appear to be
> mid-discussion with Shakeel on v1.

Yes, the discussion will continue.

Jiayuan, please don't send a new version until the discussion has
concluded. If responses are slow, just ping on the same email thread.
I will respond on v1 to keep the context of the discussion in one
place.