From: Jiayuan Chen
To: linux-mm@kvack.org
Cc: Jiayuan Chen, Andrew Morton, Johannes Weiner, David Hildenbrand,
	Michal Hocko, Qi Zheng, Shakeel Butt, Lorenzo Stoakes,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-kernel@vger.kernel.org
Subject: [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
Date: Mon, 22 Dec 2025 20:20:21 +0800
Message-ID: <20251222122022.254268-1-jiayuan.chen@linux.dev>
X-Mailer: git-send-email 2.43.0

From: Jiayuan Chen

When kswapd fails to reclaim memory, kswapd_failures is incremented.
Once it reaches MAX_RECLAIM_RETRIES, kswapd stops running to avoid
futile reclaim attempts. However, any successful direct reclaim
unconditionally resets kswapd_failures to 0, which can cause problems.

We observed an issue in production on a multi-NUMA system: a process
allocated large amounts of anonymous memory on a single NUMA node,
pushing free memory on that node below the high watermark and evicting
most of its file pages:

$ numastat -m
Per-node system memory usage (in MBs):
                          Node 0          Node 1           Total
                 --------------- --------------- ---------------
MemTotal               128222.19       127983.91       256206.11
MemFree                  1414.48         1432.80         2847.29
MemUsed                126807.71       126551.11       252358.82
SwapCached                  0.00            0.00            0.00
Active                  29017.91        25554.57        54572.48
Inactive                92749.06        95377.00       188126.06
Active(anon)            28998.96        23356.47        52355.43
Inactive(anon)          92685.27        87466.11       180151.39
Active(file)               18.95         2198.10         2217.05
Inactive(file)             63.79         7910.89         7974.68

With swap disabled, only file pages can be reclaimed. When kswapd is
woken (e.g., via wake_all_kswapds()), it runs continuously but cannot
raise free memory above the high watermark because too few reclaimable
file pages remain. Normally, kswapd would eventually stop once
kswapd_failures reaches MAX_RECLAIM_RETRIES.
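For reference, the pre-patch interaction between kswapd and direct
reclaim in mm/vmscan.c looks roughly like this (a simplified sketch of
the relevant mainline snippets, not part of this patch):

	/* balance_pgdat(): a run that reclaims nothing counts as a failure */
	if (!sc.nr_reclaimed)
		atomic_inc(&pgdat->kswapd_failures);

	/* prepare_kswapd_sleep(): hopeless node, leave it to direct reclaim */
	if (atomic_read(&pgdat->kswapd_failures) >= MAX_RECLAIM_RETRIES)
		return true;

	/* shrink_node(): any direct reclaim progress revives a dormant kswapd */
	if (reclaimable)
		atomic_set(&pgdat->kswapd_failures, 0);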
However, pods on this machine have memory.high set in their cgroups.
Business processes continuously trigger the high limit, causing
frequent direct reclaim that keeps resetting kswapd_failures to 0.
This prevents kswapd from ever stopping.

The result is that kswapd runs endlessly, repeatedly evicting the few
remaining file pages, which are actually hot. These pages constantly
refault, generating sustained heavy read IO pressure.

Fix this by only resetting kswapd_failures from direct reclaim when
the node is actually balanced. This prevents direct reclaim from
keeping kswapd alive when the node cannot be balanced through reclaim
alone.

Signed-off-by: Jiayuan Chen
---
 mm/vmscan.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 453d654727c1..b450bde4e489 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2648,6 +2648,15 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
 			  lruvec_memcg(lruvec));
 }
 
+static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx);
+static inline void reset_kswapd_failures(struct pglist_data *pgdat,
+					 struct scan_control *sc)
+{
+	if (!current_is_kswapd() &&
+	    pgdat_balanced(pgdat, sc->order, sc->reclaim_idx))
+		atomic_set(&pgdat->kswapd_failures, 0);
+}
+
 #ifdef CONFIG_LRU_GEN
 
 #ifdef CONFIG_LRU_GEN_ENABLED
@@ -5065,7 +5074,7 @@ static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *
 		blk_finish_plug(&plug);
 done:
 	if (sc->nr_reclaimed > reclaimed)
-		atomic_set(&pgdat->kswapd_failures, 0);
+		reset_kswapd_failures(pgdat, sc);
 }
 
 /******************************************************************************
@@ -6139,7 +6148,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	 * successful direct reclaim run will revive a dormant kswapd.
 	 */
 	if (reclaimable)
-		atomic_set(&pgdat->kswapd_failures, 0);
+		reset_kswapd_failures(pgdat, sc);
 	else if (sc->cache_trim_mode)
 		sc->cache_trim_mode_failed = 1;
 }
-- 
2.43.0