From: Jiayuan Chen
To: linux-mm@kvack.org, shakeel.butt@linux.dev
Cc: Jiayuan Chen, Jiayuan Chen, Andrew Morton, David Hildenbrand,
    Lorenzo Stoakes, "Liam R. Howlett", Vlastimil Babka, Mike Rapoport,
    Suren Baghdasaryan, Michal Hocko, Axel Rasmussen, Yuanchu Xie,
    Wei Xu, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
    Brendan Jackman, Johannes Weiner, Zi Yan, Qi Zheng,
    linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org
Subject: [PATCH v3 1/2] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
Date: Wed, 14 Jan 2026 15:40:35 +0800
Message-ID: <20260114074049.229935-2-jiayuan.chen@linux.dev>
In-Reply-To: <20260114074049.229935-1-jiayuan.chen@linux.dev>
References: <20260114074049.229935-1-jiayuan.chen@linux.dev>

From: Jiayuan Chen

When kswapd fails to reclaim memory, kswapd_failures is incremented. Once
it reaches MAX_RECLAIM_RETRIES, kswapd stops running to avoid futile
reclaim attempts. However, any successful direct reclaim unconditionally
resets kswapd_failures to 0, which can cause problems.
We observed an issue in production on a multi-NUMA system where a process
allocated large amounts of anonymous pages on a single NUMA node, causing
its free memory to drop below the high watermark and evicting most file
pages:

$ numastat -m

Per-node system memory usage (in MBs):
                          Node 0          Node 1           Total
                 --------------- --------------- ---------------
MemTotal               128222.19       127983.91       256206.11
MemFree                  1414.48         1432.80         2847.29
MemUsed                126807.71       126551.11       252358.82
SwapCached                  0.00            0.00            0.00
Active                  29017.91        25554.57        54572.48
Inactive                92749.06        95377.00       188126.06
Active(anon)            28998.96        23356.47        52355.43
Inactive(anon)          92685.27        87466.11       180151.39
Active(file)               18.95         2198.10         2217.05
Inactive(file)             63.79         7910.89         7974.68

With swap disabled, only file pages can be reclaimed. When kswapd is woken
(e.g., via wake_all_kswapds()), it runs continuously but cannot raise free
memory above the high watermark, since the reclaimable file pages are
insufficient.

Normally, kswapd would eventually stop once kswapd_failures reaches
MAX_RECLAIM_RETRIES. However, containers on this machine have memory.high
set in their cgroup. Business processes continuously hit the high limit,
causing frequent direct reclaim that keeps resetting kswapd_failures to 0.
This prevents kswapd from ever stopping.

The key insight is that direct reclaim triggered by cgroup memory.high
performs aggressive scanning to throttle the allocating process. With
sufficiently aggressive scanning, even hot pages will eventually be
reclaimed, making direct reclaim "successful" at freeing some memory.
However, this success does not mean the node has reached a balanced state:
the freed memory may still be insufficient to bring free pages above the
high watermark. Unconditionally resetting kswapd_failures in this case
keeps kswapd alive indefinitely.

The result is that kswapd runs endlessly. Unlike direct reclaim, which
only reclaims from the allocating cgroup, kswapd scans the entire node's
memory.
This causes hot file pages from all workloads on the node to be evicted,
not just those from the cgroup triggering memory.high. These pages
constantly refault, generating sustained heavy IO READ pressure across the
entire system.

Fix this by only resetting kswapd_failures when the node is actually
balanced. This allows both kswapd and direct reclaim to clear
kswapd_failures upon successful reclaim, but only when the reclaim
actually resolves the memory pressure (i.e., the node becomes balanced).

Signed-off-by: Jiayuan Chen
Signed-off-by: Jiayuan Chen
Acked-by: Shakeel Butt
---
 mm/vmscan.c | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 670fe9fae5ba..6fd100130987 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2650,6 +2650,25 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
 					  lruvec_memcg(lruvec));
 }
 
+static void pgdat_reset_kswapd_failures(pg_data_t *pgdat)
+{
+	atomic_set(&pgdat->kswapd_failures, 0);
+}
+
+/*
+ * Reset kswapd_failures only when the node is balanced. Without this
+ * check, successful direct reclaim (e.g., from cgroup memory.high
+ * throttling) can keep resetting kswapd_failures even when the node
+ * cannot be balanced, causing kswapd to run endlessly.
+ */
+static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx);
+static inline void pgdat_try_reset_kswapd_failures(struct pglist_data *pgdat,
+						   struct scan_control *sc)
+{
+	if (pgdat_balanced(pgdat, sc->order, sc->reclaim_idx))
+		pgdat_reset_kswapd_failures(pgdat);
+}
+
 #ifdef CONFIG_LRU_GEN
 
 #ifdef CONFIG_LRU_GEN_ENABLED
@@ -5067,7 +5086,7 @@ static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *
 	blk_finish_plug(&plug);
 done:
 	if (sc->nr_reclaimed > reclaimed)
-		atomic_set(&pgdat->kswapd_failures, 0);
+		pgdat_try_reset_kswapd_failures(pgdat, sc);
 }
 
 /******************************************************************************
@@ -6141,7 +6160,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	 * successful direct reclaim run will revive a dormant kswapd.
 	 */
 	if (reclaimable)
-		atomic_set(&pgdat->kswapd_failures, 0);
+		pgdat_try_reset_kswapd_failures(pgdat, sc);
 	else if (sc->cache_trim_mode)
 		sc->cache_trim_mode_failed = 1;
 }
-- 
2.43.0
From: Jiayuan Chen
To: linux-mm@kvack.org, shakeel.butt@linux.dev
Cc: Jiayuan Chen, Jiayuan Chen, Andrew Morton, Axel Rasmussen,
    Yuanchu Xie, Wei Xu, David Hildenbrand, Lorenzo Stoakes,
    "Liam R. Howlett", Vlastimil Babka, Mike Rapoport,
    Suren Baghdasaryan, Michal Hocko, Steven Rostedt, Masami Hiramatsu,
    Mathieu Desnoyers, Brendan Jackman, Johannes Weiner, Zi Yan,
    Qi Zheng, linux-kernel@vger.kernel.org,
    linux-trace-kernel@vger.kernel.org
Subject: [PATCH v3 2/2] mm/vmscan: add tracepoint and reason for kswapd_failures reset
Date: Wed, 14 Jan 2026 15:40:36 +0800
Message-ID: <20260114074049.229935-3-jiayuan.chen@linux.dev>
In-Reply-To: <20260114074049.229935-1-jiayuan.chen@linux.dev>
References: <20260114074049.229935-1-jiayuan.chen@linux.dev>

From: Jiayuan Chen

Currently, kswapd_failures is reset in multiple places (kswapd, direct
reclaim, PCP freeing, memory-tiers), but there is no way to trace when and
why it was reset, making it difficult to debug memory reclaim issues.

This patch:

1. Introduce pgdat_reset_kswapd_failures() as a wrapper function to
   centralize the kswapd_failures reset logic.

2. Add a reset_kswapd_failures_reason enum to distinguish reset sources:
   - RESET_KSWAPD_FAILURES_KSWAPD: reset from kswapd context
   - RESET_KSWAPD_FAILURES_DIRECT: reset from direct reclaim
   - RESET_KSWAPD_FAILURES_PCP: reset from PCP page freeing
   - RESET_KSWAPD_FAILURES_OTHER: reset from other paths

3. Add tracepoints for better observability:
   - mm_vmscan_reset_kswapd_failures: traces each reset with its reason
   - mm_vmscan_kswapd_reclaim_fail: traces each kswapd reclaim failure

Acked-by: Shakeel Butt
---
Test results:

$ trace-cmd record -e vmscan:mm_vmscan_reset_kswapd_failures -e vmscan:mm_vmscan_kswapd_reclaim_fail
$ # generate memory pressure
$ trace-cmd report
cpus=4
 kswapd1-73  [002]  24.863112: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=1
 kswapd1-73  [002]  24.863472: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=2
 kswapd1-73  [002]  24.863813: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=3
 kswapd1-73  [002]  24.864141: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=4
 kswapd1-73  [002]  24.864462: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=5
 kswapd1-73  [002]  24.864779: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=6
 kswapd1-73  [002]  24.865103: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=7
 kswapd1-73  [002]  24.865421: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=8
 kswapd1-73  [002]  24.865737: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=9
 kswapd1-73  [002]  24.866070: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=10
 kswapd1-73  [002]  24.866385: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=11
 kswapd1-73  [002]  24.866701: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=12
 kswapd1-73  [002]  24.867016: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=13
 kswapd1-73  [002]  24.867333: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=14
 kswapd1-73  [002]  24.867649: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=15
 kswapd1-73  [002]  24.867965: mm_vmscan_kswapd_reclaim_fail: nid=1 failures=16
 kswapd0-72  [001]  25.020464: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=1
 kswapd0-72  [001]  25.021054: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=2
 kswapd0-72  [001]  25.021628: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=3
 kswapd0-72  [001]  25.022217: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=4
 kswapd0-72  [001]  25.022790: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=5
 kswapd0-72  [001]  25.023366: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=6
 kswapd0-72  [001]  25.023937: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=7
 kswapd0-72  [001]  25.024511: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=8
 kswapd0-72  [001]  25.025092: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=9
 kswapd0-72  [001]  25.025665: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=10
 kswapd0-72  [001]  25.026249: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=11
 kswapd0-72  [001]  25.026824: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=12
 kswapd0-72  [001]  25.027398: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=13
 kswapd0-72  [001]  25.027976: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=14
 kswapd0-72  [001]  25.028554: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=15
 kswapd0-72  [001]  25.029140: mm_vmscan_kswapd_reclaim_fail: nid=0 failures=16
 ann-416     [002]  25.577925: mm_vmscan_reset_kswapd_failures: nid=0 reason=PCP
 dd-417      [002]  35.111721: mm_vmscan_reset_kswapd_failures: nid=1 reason=DIRECT

Signed-off-by: Jiayuan Chen
Signed-off-by: Jiayuan Chen
---
 include/linux/mmzone.h        |  9 +++++++
 include/trace/events/vmscan.h | 51 +++++++++++++++++++++++++++++++++++
 mm/memory-tiers.c             |  2 +-
 mm/page_alloc.c               |  2 +-
 mm/vmscan.c                   | 16 +++++++----
 5 files changed, 73 insertions(+), 7 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 75ef7c9f9307..3f4d2928d8dc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1531,6 +1531,15 @@ static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
 	return pgdat->node_start_pfn + pgdat->node_spanned_pages;
 }
 
+enum reset_kswapd_failures_reason {
+	RESET_KSWAPD_FAILURES_OTHER = 0,
+	RESET_KSWAPD_FAILURES_KSWAPD,
+	RESET_KSWAPD_FAILURES_DIRECT,
+	RESET_KSWAPD_FAILURES_PCP,
+};
+
+void pgdat_reset_kswapd_failures(pg_data_t *pgdat, enum reset_kswapd_failures_reason reason);
+
 #include 
 
 void build_all_zonelists(pg_data_t *pgdat);
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 490958fa10de..0747ad2f7932 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -40,6 +40,16 @@
 	{_VMSCAN_THROTTLE_CONGESTED, "VMSCAN_THROTTLE_CONGESTED"} \
 	) : "VMSCAN_THROTTLE_NONE"
 
+TRACE_DEFINE_ENUM(RESET_KSWAPD_FAILURES_OTHER);
+TRACE_DEFINE_ENUM(RESET_KSWAPD_FAILURES_KSWAPD);
+TRACE_DEFINE_ENUM(RESET_KSWAPD_FAILURES_DIRECT);
+TRACE_DEFINE_ENUM(RESET_KSWAPD_FAILURES_PCP);
+
+#define reset_kswapd_src \
+	{RESET_KSWAPD_FAILURES_KSWAPD, "KSWAPD"}, \
+	{RESET_KSWAPD_FAILURES_DIRECT, "DIRECT"}, \
+	{RESET_KSWAPD_FAILURES_PCP, "PCP"}, \
+	{RESET_KSWAPD_FAILURES_OTHER, "OTHER"}
 
 #define trace_reclaim_flags(file) ( \
 	(file ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
@@ -535,6 +545,47 @@ TRACE_EVENT(mm_vmscan_throttled,
 		__entry->usec_delayed,
 		show_throttle_flags(__entry->reason))
 );
+
+TRACE_EVENT(mm_vmscan_kswapd_reclaim_fail,
+
+	TP_PROTO(int nid, int failures),
+
+	TP_ARGS(nid, failures),
+
+	TP_STRUCT__entry(
+		__field(int, nid)
+		__field(int, failures)
+	),
+
+	TP_fast_assign(
+		__entry->nid = nid;
+		__entry->failures = failures;
+	),
+
+	TP_printk("nid=%d failures=%d",
+		__entry->nid, __entry->failures)
+);
+
+TRACE_EVENT(mm_vmscan_reset_kswapd_failures,
+
+	TP_PROTO(int nid, int reason),
+
+	TP_ARGS(nid, reason),
+
+	TP_STRUCT__entry(
+		__field(int, nid)
+		__field(int, reason)
+	),
+
+	TP_fast_assign(
+		__entry->nid = nid;
+		__entry->reason = reason;
+	),
+
+	TP_printk("nid=%d reason=%s",
+		__entry->nid,
+		__print_symbolic(__entry->reason, reset_kswapd_src))
+);
 #endif /* _TRACE_VMSCAN_H */
 
 /* This part must be outside protection */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 864811fff409..8188f341bd77 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -956,7 +956,7 @@ static ssize_t demotion_enabled_store(struct kobject *kobj,
 		struct pglist_data *pgdat;
 
 		for_each_online_pgdat(pgdat)
-			atomic_set(&pgdat->kswapd_failures, 0);
+			pgdat_reset_kswapd_failures(pgdat, RESET_KSWAPD_FAILURES_OTHER);
 	}
 
 	return count;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c380f063e8b7..cadf2c8b06a5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2918,7 +2918,7 @@ static bool free_frozen_page_commit(struct zone *zone,
 	 */
 	if (atomic_read(&pgdat->kswapd_failures) >= MAX_RECLAIM_RETRIES &&
 	    next_memory_node(pgdat->node_id) < MAX_NUMNODES)
-		atomic_set(&pgdat->kswapd_failures, 0);
+		pgdat_reset_kswapd_failures(pgdat, RESET_KSWAPD_FAILURES_PCP);
 	}
 	return ret;
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6fd100130987..8d9f3d29fe3b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2650,9 +2650,11 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
 					  lruvec_memcg(lruvec));
 }
 
-static void pgdat_reset_kswapd_failures(pg_data_t *pgdat)
+void pgdat_reset_kswapd_failures(pg_data_t *pgdat, enum reset_kswapd_failures_reason reason)
 {
-	atomic_set(&pgdat->kswapd_failures, 0);
+	/* Only trace actual resets, not redundant zero-to-zero */
+	if (atomic_xchg(&pgdat->kswapd_failures, 0))
+		trace_mm_vmscan_reset_kswapd_failures(pgdat->node_id, reason);
 }
 
 /*
@@ -2666,7 +2668,8 @@ static inline void pgdat_try_reset_kswapd_failures(struct pglist_data *pgdat,
 					struct scan_control *sc)
 {
 	if (pgdat_balanced(pgdat, sc->order, sc->reclaim_idx))
-		pgdat_reset_kswapd_failures(pgdat);
+		pgdat_reset_kswapd_failures(pgdat, current_is_kswapd() ?
+			RESET_KSWAPD_FAILURES_KSWAPD : RESET_KSWAPD_FAILURES_DIRECT);
 }
 
 #ifdef CONFIG_LRU_GEN
@@ -7153,8 +7156,11 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
 	 * watermark_high at this point. We need to avoid increasing the
 	 * failure count to prevent the kswapd thread from stopping.
 	 */
-	if (!sc.nr_reclaimed && !boosted)
-		atomic_inc(&pgdat->kswapd_failures);
+	if (!sc.nr_reclaimed && !boosted) {
+		int fail_cnt = atomic_inc_return(&pgdat->kswapd_failures);
+		/* kswapd context, low overhead to trace every failure */
+		trace_mm_vmscan_kswapd_reclaim_fail(pgdat->node_id, fail_cnt);
+	}
 
 out:
 	clear_reclaim_active(pgdat, highest_zoneidx);
-- 
2.43.0