From nobody Mon Jun  8 12:13:56 2026
Received: from xmbghk7.mail.qq.com (xmbghk7.mail.qq.com [43.163.128.44])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4A41D3DFC93;
	Fri, 29 May 2026 12:19:33 +0000 (UTC)
Received: from node68.. ([166.111.236.25])
	by newxmesmtplogicsvrsza73-0.qq.com (NewEsmtp) with SMTP
	id 4DC10017; Fri, 29 May 2026 20:19:28 +0800
X-QQ-mid: xmsmtpt1780057168tzjwrsgd1
Message-ID: <tencent_C78A02F3C41E15233C371816825C7DCF8708@qq.com>
From: fujunjie <fujunjie1@qq.com>
To: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org,
	Alexandre Ghiti <alexghiti@meta.com>,
	Kairui Song <kasong@tencent.com>,
	Usama Arif <usamaarif642@gmail.com>
Cc: Chris Li <chrisl@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Yosry Ahmed <yosry@kernel.org>,
	Nhat Pham <nphamcs@gmail.com>,
	David Hildenbrand <david@kernel.org>,
	Hugh Dickins <hughd@google.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org
Subject: [RFC PATCH v2 1/9] mm/zswap: expose range state for swapin policy
Date: Fri, 29 May 2026 12:19:20 +0000
X-OQ-MSGID: <20260529121928.4115683-1-fujunjie1@qq.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <tencent_98CD9F78E48D08DC005A6471A13CFF28B60A@qq.com>
References: <tencent_98CD9F78E48D08DC005A6471A13CFF28B60A@qq.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Large folio swapin needs to know whether a candidate swap range is fully
backed by zswap before it can choose an order. That decision should stay
in common swapin code, not inside zswap.

Expose two pieces of zswap state to that caller: a lockless occupancy
snapshot of a swap range and the current zswap reclaim-pressure state.
The range state is advisory only: writeback or invalidation can change
the backend after the snapshot, so callers must recheck before issuing
large-folio IO.

Signed-off-by: fujunjie <fujunjie1@qq.com>
---
 include/linux/zswap.h | 26 +++++++++++++++++++++++++
 mm/zswap.c            | 44 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 70 insertions(+)
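
Illustration only, not part of the patch: the classification done by
zswap_probe_range() can be modelled in userspace roughly as below. This
is a minimal sketch under assumptions: enum range_state and
model_probe_range() are hypothetical names, and the slots[] array stands
in for the per-offset xa_load() lookups.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace mirror of enum zswap_range_state. */
enum range_state {
	RANGE_NEVER_ENABLED,
	RANGE_NO_ZSWAP,
	RANGE_ALL_ZSWAP,
	RANGE_MIXED,
};

/*
 * slots[i] is true when slot i has a zswap entry; it stands in for
 * xa_load() returning non-NULL in the kernel code.
 */
static enum range_state model_probe_range(bool ever_enabled,
					  const bool *slots, size_t nr)
{
	bool present = false, missing = false;
	size_t i;

	if (!ever_enabled)
		return RANGE_NEVER_ENABLED;

	for (i = 0; i < nr; i++) {
		if (slots[i])
			present = true;
		else
			missing = true;

		/* Stop at the first disagreement between slots. */
		if (present && missing)
			return RANGE_MIXED;
	}

	return present ? RANGE_ALL_ZSWAP : RANGE_NO_ZSWAP;
}
```

As in the kernel function, an empty range falls through to the NO_ZSWAP
result, and the answer is only a snapshot of the slots at scan time.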

diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 30c193a1207e..8f9aee97517c 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -9,6 +9,18 @@ struct lruvec;
=20
 extern atomic_long_t zswap_stored_pages;
=20
+/*
+ * Advisory zswap occupancy snapshot for a swap range. This is not a compl=
ete
+ * backend classifier; callers must recheck before depending on ALL_ZSWAP =
for
+ * large-folio IO.
+ */
+enum zswap_range_state {
+	ZSWAP_RANGE_NEVER_ENABLED,
+	ZSWAP_RANGE_NO_ZSWAP,
+	ZSWAP_RANGE_ALL_ZSWAP,
+	ZSWAP_RANGE_MIXED,
+};
+
 #ifdef CONFIG_ZSWAP
=20
 struct zswap_lruvec_state {
@@ -27,6 +39,9 @@ struct zswap_lruvec_state {
 unsigned long zswap_total_pages(void);
 bool zswap_store(struct folio *folio);
 int zswap_load(struct folio *folio);
+enum zswap_range_state zswap_probe_range(swp_entry_t swp,
+					 unsigned int nr_pages);
+bool zswap_pool_reclaim_pressure(void);
 void zswap_invalidate(swp_entry_t swp);
 int zswap_swapon(int type, unsigned long nr_pages);
 void zswap_swapoff(int type);
@@ -49,6 +64,17 @@ static inline int zswap_load(struct folio *folio)
 	return -ENOENT;
 }
=20
+static inline enum zswap_range_state zswap_probe_range(swp_entry_t swp,
+						       unsigned int nr_pages)
+{
+	return ZSWAP_RANGE_NEVER_ENABLED;
+}
+
+static inline bool zswap_pool_reclaim_pressure(void)
+{
+	return false;
+}
+
 static inline void zswap_invalidate(swp_entry_t swp) {}
 static inline int zswap_swapon(int type, unsigned long nr_pages)
 {
diff --git a/mm/zswap.c b/mm/zswap.c
index 761cd699e0a3..da5297f7bd69 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -506,6 +506,19 @@ unsigned long zswap_total_pages(void)
 	return total;
 }
=20
+/*
+ * Expose whether zswap reclaim pressure is active. This is a backend fact:
+ * zswap_check_limits() sets the state once the pool reaches the hard limi=
t and
+ * keeps it set until the pool falls below the accept threshold.
+ */
+bool zswap_pool_reclaim_pressure(void)
+{
+	if (zswap_never_enabled())
+		return false;
+
+	return READ_ONCE(zswap_pool_reached_full);
+}
+
 static bool zswap_check_limits(void)
 {
 	unsigned long cur_pages =3D zswap_total_pages();
@@ -1559,6 +1572,37 @@ bool zswap_store(struct folio *folio)
 	return ret;
 }
=20
+enum zswap_range_state zswap_probe_range(swp_entry_t swp,
+					 unsigned int nr_pages)
+{
+	unsigned int type =3D swp_type(swp);
+	pgoff_t offset =3D swp_offset(swp);
+	bool present =3D false, missing =3D false;
+	unsigned int i;
+
+	/*
+	 * This is an advisory, lockless snapshot for common swapin admission.
+	 * Callers must recheck before depending on an all-zswap range for IO:
+	 * concurrent writeback or invalidation can change the backend state.
+	 */
+	if (zswap_never_enabled())
+		return ZSWAP_RANGE_NEVER_ENABLED;
+
+	for (i =3D 0; i < nr_pages; i++) {
+		struct xarray *tree =3D swap_zswap_tree(swp_entry(type, offset + i));
+
+		if (xa_load(tree, offset + i))
+			present =3D true;
+		else
+			missing =3D true;
+
+		if (present && missing)
+			return ZSWAP_RANGE_MIXED;
+	}
+
+	return present ? ZSWAP_RANGE_ALL_ZSWAP : ZSWAP_RANGE_NO_ZSWAP;
+}
+
 /**
  * zswap_load() - load a folio from zswap
  * @folio: folio to load
--=20
2.34.1
From nobody Mon Jun  8 12:13:56 2026
Received: from xmbghk7.mail.qq.com (xmbghk7.mail.qq.com [43.163.128.48])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9C1F23BA24F;
	Fri, 29 May 2026 12:19:41 +0000 (UTC)
Received: from node68.. ([166.111.236.25])
	by newxmesmtplogicsvrsza73-0.qq.com (NewEsmtp) with SMTP
	id 4DC10017; Fri, 29 May 2026 20:19:28 +0800
X-QQ-mid: xmsmtpt1780057170tklm0azzd
Message-ID: <tencent_930D3FD72B2ECF2379248F7CDE48587C700A@qq.com>
From: fujunjie <fujunjie1@qq.com>
To: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org,
	Alexandre Ghiti <alexghiti@meta.com>,
	Kairui Song <kasong@tencent.com>,
	Usama Arif <usamaarif642@gmail.com>
Cc: Chris Li <chrisl@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Yosry Ahmed <yosry@kernel.org>,
	Nhat Pham <nphamcs@gmail.com>,
	David Hildenbrand <david@kernel.org>,
	Hugh Dickins <hughd@google.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org
Subject: [RFC PATCH v2 2/9] mm: let swap_read_folio() report retryable zswap
 races
Date: Fri, 29 May 2026 12:19:21 +0000
X-OQ-MSGID: <20260529121928.4115683-2-fujunjie1@qq.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <tencent_98CD9F78E48D08DC005A6471A13CFF28B60A@qq.com>
References: <tencent_98CD9F78E48D08DC005A6471A13CFF28B60A@qq.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Large zswap loads need a way to ask the caller to drop a speculative large
swapcache folio and retry order-0. A void swap_read_folio() cannot express
that without turning a backend race into an IO failure.

Return int from swap_read_folio() and reserve -EAGAIN for retryable large
zswap races. Existing order-0 paths keep treating the read as before; the
synchronous swapin path only warns for now. A later patch will consume
-EAGAIN and retry order-0.

Signed-off-by: fujunjie <fujunjie1@qq.com>
---
 mm/page_io.c    | 19 +++++++++++++++++--
 mm/swap.h       |  5 +++--
 mm/swap_state.c | 13 +++++++++++--
 3 files changed, 31 insertions(+), 6 deletions(-)
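
Illustration only, not part of the patch: a caller that consumes the new
return value might look roughly like the userspace sketch below. It
assumes a hypothetical read callback (read_fn), a hypothetical
model_swapin() wrapper and a sample racy_read() backend; none of these
names exist in the kernel, and the real retry logic is left to a later
patch in the series.

```c
#include <errno.h>

/* Hypothetical read callback: returns 0 or a negative errno. */
typedef int (*read_fn)(unsigned int order, void *ctx);

/*
 * Try the requested order first. On -EAGAIN only, fall back to
 * order 0. Returns the order finally read, or a negative errno.
 */
static int model_swapin(read_fn do_read, unsigned int order, void *ctx)
{
	int ret = do_read(order, ctx);

	if (ret == -EAGAIN && order > 0) {
		order = 0;
		ret = do_read(order, ctx);
	}
	return ret ? ret : (int)order;
}

/* Sample backend: every large read races, order-0 reads succeed. */
static int racy_read(unsigned int order, void *ctx)
{
	int *calls = ctx;

	(*calls)++;
	return order > 0 ? -EAGAIN : 0;
}
```

With racy_read(), a request at order 4 costs two calls and ends at
order 0, while an order-0 request completes in a single call.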

diff --git a/mm/page_io.c b/mm/page_io.c
index f2d8fe7fd057..16724bdfb400 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -653,13 +653,21 @@ static void swap_read_folio_bdev_async(struct folio *=
folio,
 	submit_bio(bio);
 }
=20
-void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
+/*
+ * Return -EAGAIN only when a locked large swapcache folio hit a retryable
+ * zswap backend race. The caller owns that still-locked folio and must dr=
op or
+ * retry it. Other zswap errors are still reported through the usual folio
+ * state: the folio is unlocked without PG_uptodate and the fault path will
+ * turn that into an I/O error.
+ */
+int swap_read_folio(struct folio *folio, struct swap_iocb **plug)
 {
 	struct swap_info_struct *sis =3D __swap_entry_to_info(folio->swap);
 	bool synchronous =3D sis->flags & SWP_SYNCHRONOUS_IO;
 	bool workingset =3D folio_test_workingset(folio);
 	unsigned long pflags;
 	bool in_thrashing;
+	int ret =3D 0;
=20
 	VM_BUG_ON_FOLIO(!folio_test_swapcache(folio) && !synchronous, folio);
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
@@ -681,8 +689,14 @@ void swap_read_folio(struct folio *folio, struct swap_=
iocb **plug)
 		goto finish;
 	}
=20
-	if (zswap_load(folio) !=3D -ENOENT)
+	ret =3D zswap_load(folio);
+	if (ret =3D=3D -EAGAIN) {
+		VM_WARN_ON_ONCE_FOLIO(!folio_test_large(folio), folio);
 		goto finish;
+	}
+	if (ret !=3D -ENOENT)
+		goto finish;
+	ret =3D 0;
=20
 	/* We have to read from slower devices. Increase zswap protection. */
 	zswap_folio_swapin(folio);
@@ -701,6 +715,7 @@ void swap_read_folio(struct folio *folio, struct swap_i=
ocb **plug)
 		psi_memstall_leave(&pflags);
 	}
 	delayacct_swapin_end();
+	return ret;
 }
=20
 void __swap_read_unplug(struct swap_iocb *sio)
diff --git a/mm/swap.h b/mm/swap.h
index 77d2d14eda42..ea7e1f3c4410 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -241,7 +241,7 @@ extern void __swap_cluster_free_entries(struct swap_inf=
o_struct *si,
 /* linux/mm/page_io.c */
 int sio_pool_init(void);
 struct swap_iocb;
-void swap_read_folio(struct folio *folio, struct swap_iocb **plug);
+int swap_read_folio(struct folio *folio, struct swap_iocb **plug);
 void __swap_read_unplug(struct swap_iocb *plug);
 static inline void swap_read_unplug(struct swap_iocb *plug)
 {
@@ -381,8 +381,9 @@ static inline void folio_put_swap(struct folio *folio, =
struct page *page)
 {
 }
=20
-static inline void swap_read_folio(struct folio *folio, struct swap_iocb *=
*plug)
+static inline int swap_read_folio(struct folio *folio, struct swap_iocb **=
plug)
 {
+	return 0;
 }
=20
 static inline void swap_write_unplug(struct swap_iocb *sio)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 04f5ce992401..d37097913b30 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -628,6 +628,7 @@ static struct folio *swap_cache_read_folio(swp_entry_t =
entry, gfp_t gfp,
 					   struct swap_iocb **plug, bool readahead)
 {
 	struct folio *folio;
+	int ret;
=20
 	do {
 		folio =3D swap_cache_get_folio(entry);
@@ -639,7 +640,13 @@ static struct folio *swap_cache_read_folio(swp_entry_t=
 entry, gfp_t gfp,
 	if (IS_ERR_OR_NULL(folio))
 		return NULL;
=20
-	swap_read_folio(folio, plug);
+	ret =3D swap_read_folio(folio, plug);
+	/*
+	 * Swap readahead allocates order-0 folios. -EAGAIN is reserved for
+	 * retryable large zswap backend races and must be handled by the
+	 * synchronous common swapin path.
+	 */
+	VM_WARN_ON_ONCE(ret =3D=3D -EAGAIN);
 	if (readahead) {
 		folio_set_readahead(folio);
 		count_vm_event(SWAP_RA);
@@ -668,6 +675,7 @@ struct folio *swapin_sync(swp_entry_t entry, gfp_t gfp,=
 unsigned long orders,
 			   struct vm_fault *vmf, struct mempolicy *mpol, pgoff_t ilx)
 {
 	struct folio *folio;
+	int ret;
=20
 	do {
 		folio =3D swap_cache_get_folio(entry);
@@ -679,7 +687,8 @@ struct folio *swapin_sync(swp_entry_t entry, gfp_t gfp,=
 unsigned long orders,
 	if (IS_ERR(folio))
 		return folio;
=20
-	swap_read_folio(folio, NULL);
+	ret =3D swap_read_folio(folio, NULL);
+	VM_WARN_ON_ONCE(ret =3D=3D -EAGAIN);
 	return folio;
 }
=20
--=20
2.34.1
From nobody Mon Jun  8 12:13:56 2026
Received: from xmbghk7.mail.qq.com (xmbghk7.mail.qq.com [43.163.128.52])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 868173E00BA;
	Fri, 29 May 2026 12:19:42 +0000 (UTC)
Received: from node68.. ([166.111.236.25])
	by newxmesmtplogicsvrsza73-0.qq.com (NewEsmtp) with SMTP
	id 4DC10017; Fri, 29 May 2026 20:19:28 +0800
X-QQ-mid: xmsmtpt1780057172tac6i9zgh
Message-ID: <tencent_7D186EDC2C9AB9009F9915C1E68F3CF44609@qq.com>
From: fujunjie <fujunjie1@qq.com>
To: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org,
	Alexandre Ghiti <alexghiti@meta.com>,
	Kairui Song <kasong@tencent.com>,
	Usama Arif <usamaarif642@gmail.com>
Cc: Chris Li <chrisl@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Yosry Ahmed <yosry@kernel.org>,
	Nhat Pham <nphamcs@gmail.com>,
	David Hildenbrand <david@kernel.org>,
	Hugh Dickins <hughd@google.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org
Subject: [RFC PATCH v2 3/9] mm/zswap: support fully zswap-backed large folio
 loads
Date: Fri, 29 May 2026 12:19:22 +0000
X-OQ-MSGID: <20260529121928.4115683-3-fujunjie1@qq.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <tencent_98CD9F78E48D08DC005A6471A13CFF28B60A@qq.com>
References: <tencent_98CD9F78E48D08DC005A6471A13CFF28B60A@qq.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

zswap currently refuses large swapcache folios. That is correct for mixed
backend ranges, but it also prevents the common swapin path from loading a
range that is still fully backed by zswap.

Teach zswap_load() to fill a locked large swapcache folio by decompressing
each base-page entry into the matching folio offset, then flushing the
folio once. A mixed range, or an entry that disappears between the scan
and the decompression pass, is reported as -EAGAIN so the caller can drop
the speculative large folio and retry order-0.

The large load keeps the zswap entries in place. It is a clean speculative
fill: until the swap slots are freed, zswap remains the backing copy if
reclaim drops the large folio before PTEs are installed.

Signed-off-by: fujunjie <fujunjie1@qq.com>
---
 mm/zswap.c | 105 ++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 87 insertions(+), 18 deletions(-)
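
Illustration only, not part of the patch: the scan-then-fill shape of
the large-folio load can be modelled in userspace roughly as below. This
is a minimal sketch under assumptions: model_large_load() and
MODEL_PAGE_SIZE are hypothetical names, entries[] stands in for the
per-offset zswap lookups (NULL means no entry), and memcpy() stands in
for decompression. The model has no concurrency, so the kernel case of
an entry vanishing during the second pass cannot occur here.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/*
 * Fill dst (nr pages long) only when every slot has an entry.
 * Returns 0, -ENOENT (no entries at all) or -EAGAIN (mixed range).
 */
static int model_large_load(const unsigned char *const *entries,
			    size_t nr, unsigned char *dst)
{
	size_t i, found = 0;

	/* First pass: classify the range without touching dst. */
	for (i = 0; i < nr; i++)
		if (entries[i])
			found++;

	if (!found)
		return -ENOENT;
	if (found != nr)
		return -EAGAIN;

	/* Second pass: place each page at its offset in the folio. */
	for (i = 0; i < nr; i++)
		memcpy(dst + i * MODEL_PAGE_SIZE, entries[i],
		       MODEL_PAGE_SIZE);
	return 0;
}
```

The entries are left untouched by the fill, which matches the patch
keeping zswap as the backing copy until the swap slots are freed.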

diff --git a/mm/zswap.c b/mm/zswap.c
index da5297f7bd69..94ba112a2982 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -15,6 +15,8 @@
=20
 #include <linux/module.h>
 #include <linux/cpu.h>
+#include <linux/mm.h>
+#include <linux/huge_mm.h>
 #include <linux/highmem.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
@@ -934,7 +936,8 @@ static bool zswap_compress(struct page *page, struct zs=
wap_entry *entry,
 	return comp_ret =3D=3D 0 && alloc_ret =3D=3D 0;
 }
=20
-static bool zswap_decompress(struct zswap_entry *entry, struct folio *foli=
o)
+static bool zswap_decompress(struct zswap_entry *entry, struct folio *foli=
o,
+			     unsigned int page_idx, bool flush_dcache)
 {
 	struct zswap_pool *pool =3D entry->pool;
 	struct scatterlist input[2]; /* zsmalloc returns an SG list 1-2 entries */
@@ -952,14 +955,15 @@ static bool zswap_decompress(struct zswap_entry *entr=
y, struct folio *folio)
=20
 		WARN_ON_ONCE(input->length !=3D PAGE_SIZE);
=20
-		dst =3D kmap_local_folio(folio, 0);
+		dst =3D kmap_local_folio(folio, page_idx * PAGE_SIZE);
 		memcpy_from_sglist(dst, input, 0, PAGE_SIZE);
 		dlen =3D PAGE_SIZE;
 		kunmap_local(dst);
-		flush_dcache_folio(folio);
+		if (flush_dcache)
+			flush_dcache_folio(folio);
 	} else {
 		sg_init_table(&output, 1);
-		sg_set_folio(&output, folio, PAGE_SIZE, 0);
+		sg_set_folio(&output, folio, PAGE_SIZE, page_idx * PAGE_SIZE);
 		acomp_request_set_params(acomp_ctx->req, input, &output,
 					 entry->length, PAGE_SIZE);
 		ret =3D crypto_acomp_decompress(acomp_ctx->req);
@@ -1042,7 +1046,7 @@ static int zswap_writeback_entry(struct zswap_entry *=
entry,
 		goto out;
 	}
=20
-	if (!zswap_decompress(entry, folio)) {
+	if (!zswap_decompress(entry, folio, 0, true)) {
 		ret =3D -EIO;
 		goto out;
 	}
@@ -1615,10 +1619,9 @@ enum zswap_range_state zswap_probe_range(swp_entry_t=
 swp,
  *  NOT marked up-to-date, so that an IO error is emitted (e.g. do_swap_pa=
ge()
  *  will SIGBUS).
  *
- *  -EINVAL: if the swapped out content was in zswap, but the page belongs
- *  to a large folio, which is not supported by zswap. The folio is unlock=
ed,
- *  but NOT marked up-to-date, so that an IO error is emitted (e.g.
- *  do_swap_page() will SIGBUS).
+ *  -EAGAIN: if the swapped out content belongs to a large folio, but the
+ *  range is mixed or raced with writeback. The folio remains locked so the
+ *  caller can drop the large swapcache folio and retry order-0.
  *
  *  -ENOENT: if the swapped out content was not in zswap. The folio remains
  *  locked on return.
@@ -1626,9 +1629,12 @@ enum zswap_range_state zswap_probe_range(swp_entry_t=
 swp,
 int zswap_load(struct folio *folio)
 {
 	swp_entry_t swp =3D folio->swap;
+	unsigned int nr_pages =3D folio_nr_pages(folio);
+	unsigned int type =3D swp_type(swp);
 	pgoff_t offset =3D swp_offset(swp);
-	struct xarray *tree =3D swap_zswap_tree(swp);
+	struct xarray *tree;
 	struct zswap_entry *entry;
+	unsigned int i;
=20
 	VM_WARN_ON_ONCE(!folio_test_locked(folio));
 	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
@@ -1636,21 +1642,84 @@ int zswap_load(struct folio *folio)
 	if (zswap_never_enabled())
 		return -ENOENT;
=20
-	/*
-	 * Large folios should not be swapped in while zswap is being used, as
-	 * they are not properly handled. Zswap does not properly load large
-	 * folios, and a large folio may only be partially in zswap.
-	 */
-	if (WARN_ON_ONCE(folio_test_large(folio))) {
+	if (folio_test_large(folio)) {
+		struct obj_cgroup *first_objcg =3D NULL;
+		bool same_objcg =3D true;
+		bool saw_zswap =3D false;
+		bool saw_non_zswap =3D false;
+
+		/*
+		 * The locked large swapcache folio now covers the range and
+		 * conflicts with zswap writeback's order-0 swapcache allocation.
+		 * If the range is mixed or an entry disappears, retry order-0.
+		 */
+		for (i =3D 0; i < nr_pages; i++) {
+			tree =3D swap_zswap_tree(swp_entry(type, offset + i));
+			entry =3D xa_load(tree, offset + i);
+			if (!entry) {
+				if (saw_zswap)
+					return -EAGAIN;
+				saw_non_zswap =3D true;
+				continue;
+			}
+			if (saw_non_zswap)
+				return -EAGAIN;
+
+			if (!saw_zswap)
+				first_objcg =3D entry->objcg;
+			else if (entry->objcg !=3D first_objcg)
+				same_objcg =3D false;
+			saw_zswap =3D true;
+		}
+		if (!saw_zswap)
+			return -ENOENT;
+
+		for (i =3D 0; i < nr_pages; i++) {
+			tree =3D swap_zswap_tree(swp_entry(type, offset + i));
+			entry =3D xa_load(tree, offset + i);
+			if (!entry)
+				return -EAGAIN;
+
+			if (!zswap_decompress(entry, folio, i, false)) {
+				folio_unlock(folio);
+				return -EIO;
+			}
+		}
+
+		flush_dcache_folio(folio);
+		/*
+		 * Keep zswap entries until swap slots are freed. This is a clean
+		 * speculative fill; zswap remains the backing copy if reclaim
+		 * drops the large folio before PTEs are installed.
+		 */
+		folio_mark_uptodate(folio);
+		count_vm_events(ZSWPIN, nr_pages);
+		count_mthp_stat(folio_order(folio), MTHP_STAT_SWPIN);
+
+		if (same_objcg) {
+			if (first_objcg)
+				count_objcg_events(first_objcg, ZSWPIN, nr_pages);
+		} else {
+			for (i =3D 0; i < nr_pages; i++) {
+				tree =3D swap_zswap_tree(swp_entry(type, offset + i));
+				entry =3D xa_load(tree, offset + i);
+				if (WARN_ON_ONCE(!entry))
+					continue;
+				if (entry->objcg)
+					count_objcg_events(entry->objcg, ZSWPIN, 1);
+			}
+		}
+
 		folio_unlock(folio);
-		return -EINVAL;
+		return 0;
 	}
=20
+	tree =3D swap_zswap_tree(swp);
 	entry =3D xa_load(tree, offset);
 	if (!entry)
 		return -ENOENT;
=20
-	if (!zswap_decompress(entry, folio)) {
+	if (!zswap_decompress(entry, folio, 0, true)) {
 		folio_unlock(folio);
 		return -EIO;
 	}
--=20
2.34.1
From nobody Mon Jun  8 12:13:56 2026
Received: from out162-62-58-216.mail.qq.com (out162-62-58-216.mail.qq.com
 [162.62.58.216])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 211D43BA24F;
	Fri, 29 May 2026 12:23:16 +0000 (UTC)
Received: from node68.. ([166.111.236.25])
	by newxmesmtplogicsvrsza73-0.qq.com (NewEsmtp) with SMTP
	id 4DC10017; Fri, 29 May 2026 20:19:28 +0800
X-QQ-mid: xmsmtpt1780057174txygo7p0f
Message-ID: <tencent_EB78848E34DC7858C873193D67286ECD4B0A@qq.com>
From: fujunjie <fujunjie1@qq.com>
To: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org,
	Alexandre Ghiti <alexghiti@meta.com>,
	Kairui Song <kasong@tencent.com>,
	Usama Arif <usamaarif642@gmail.com>
Cc: Chris Li <chrisl@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Yosry Ahmed <yosry@kernel.org>,
	Nhat Pham <nphamcs@gmail.com>,
	David Hildenbrand <david@kernel.org>,
	Hugh Dickins <hughd@google.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org
Subject: [RFC PATCH v2 4/9] mm: admit large swapin by backend range in
 swapin_sync()
Date: Fri, 29 May 2026 12:19:23 +0000
X-OQ-MSGID: <20260529121928.4115683-4-fujunjie1@qq.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <tencent_98CD9F78E48D08DC005A6471A13CFF28B60A@qq.com>
References: <tencent_98CD9F78E48D08DC005A6471A13CFF28B60A@qq.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

A large swapin can read into a single folio only when the whole range has
compatible backing. Mixed zswap/disk ranges must not reach large-folio IO,
and zswap range probes are only snapshots that may be stale by the time
the folio is read.

Filter the orders passed to swap_cache_alloc_folio() in swapin_sync().
Uniform zeromap ranges and all-disk ranges keep the existing large swapin
path. Fully zswap-backed ranges may be tried. Mixed zswap/disk ranges fall
back before allocation.

After a large swapcache folio is installed, recheck the zswap range and
drop the fresh folio if the range has become mixed. Handle -EAGAIN from
swap_read_folio() the same way. Both cases retry at order-0, where each
slot can resolve its current backend independently.

Signed-off-by: fujunjie <fujunjie1@qq.com>
---
 mm/memcontrol-v1.c |   8 ++-
 mm/memory.c        |  31 ++++++++-
 mm/swap_state.c    | 169 ++++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 194 insertions(+), 14 deletions(-)
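
Note for reviewers: the order filtering described above can be modelled in
user space. The sketch below is only an illustration under stated
assumptions, not the kernel code: the backend is a toy 64-slot bitmap, the
model_* names and MODEL_MAX_ORDER are hypothetical, and the zeromap check
and the post-install recheck are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the range states used by the patch. */
enum model_range_state {
	MODEL_RANGE_NO_ZSWAP,
	MODEL_RANGE_ALL_ZSWAP,
	MODEL_RANGE_MIXED,
};

/* Assumed cap for the toy model: order 4 is 16 slots. */
#define MODEL_MAX_ORDER 4

/* Toy backend map: bit i set means slot i is currently held by zswap. */
static enum model_range_state model_probe_range(uint64_t zswap_map,
						unsigned int start,
						unsigned int nr)
{
	unsigned int i, in_zswap = 0;

	assert(start + nr <= 64);
	for (i = 0; i < nr; i++)
		if (zswap_map & (UINT64_C(1) << (start + i)))
			in_zswap++;

	if (in_zswap == 0)
		return MODEL_RANGE_NO_ZSWAP;
	return in_zswap == nr ? MODEL_RANGE_ALL_ZSWAP : MODEL_RANGE_MIXED;
}

/*
 * Keep order 0 if it was requested. Walk the large orders from the
 * highest down and admit each one whose aligned range is not mixed.
 * Fall back to order 0 when nothing survives.
 */
static unsigned long model_admit_orders(uint64_t zswap_map,
					unsigned int offset,
					unsigned long orders)
{
	unsigned long admitted = orders & 1UL;
	unsigned int order;

	for (order = MODEL_MAX_ORDER; order >= 1; order--) {
		unsigned int nr = 1U << order;
		unsigned int start = offset & ~(nr - 1);

		if (!(orders & (1UL << order)))
			continue;
		if (model_probe_range(zswap_map, start, nr) !=
		    MODEL_RANGE_MIXED)
			admitted |= 1UL << order;
	}

	return admitted ? admitted : 1UL;
}
```

In this model a fault at slot 1 with slots 0-3 in zswap keeps orders 0, 1
and 2, and drops order 3 because the 8-slot range mixes zswap and disk.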

diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 765069211567..5b11b8055c66 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -682,8 +682,8 @@ void __memcg1_swapout(struct folio *folio, struct swap_=
cluster_info *ci)
  * memcg1_swapin - uncharge swap slot on swapin
  * @folio: folio being swapped in
  *
- * Call this function after successfully adding the charged
- * folio to swapcache.
+ * Call this after the charged folio has been added to swapcache and the c=
aller
+ * is no longer going to drop it back to swapped-out state.
  *
  * Context: The folio has to be in swap cache and locked.
  */
@@ -721,7 +721,9 @@ void memcg1_swapin(struct folio *folio)
 	id =3D __swap_cgroup_clear(ci, swp_cluster_offset(folio->swap),
 				 nr_pages);
 	swap_cluster_unlock(ci);
-	mem_cgroup_uncharge_swap(id, nr_pages);
+
+	if (id)
+		mem_cgroup_uncharge_swap(id, nr_pages);
 }
 #endif
=20
diff --git a/mm/memory.c b/mm/memory.c
index 5a365492a9a2..d73a19692dea 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4538,6 +4538,24 @@ static inline bool should_try_to_free_swap(struct sw=
ap_info_struct *si,
 		folio_ref_count(folio) =3D=3D (extra_refs + folio_nr_pages(folio));
 }
=20
+static void memcg1_swapin_retry_folio(struct folio *folio,
+				      struct vm_fault *vmf)
+{
+	if (!folio_test_large(folio) || !folio_test_swapcache(folio))
+		return;
+
+	if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT) {
+		if (!folio_trylock(folio))
+			return;
+	} else {
+		folio_lock(folio);
+	}
+
+	if (folio_test_large(folio) && folio_test_swapcache(folio))
+		memcg1_swapin(folio);
+	folio_unlock(folio);
+}
+
 static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
 {
 	vmf->pte =3D pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
@@ -4857,8 +4875,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
=20
 	swapcache =3D folio;
 	ret |=3D folio_lock_or_retry(folio, vmf);
-	if (ret & VM_FAULT_RETRY)
+	if (ret & VM_FAULT_RETRY) {
+		memcg1_swapin_retry_folio(folio, vmf);
 		goto out_release;
+	}
=20
 	page =3D folio_file_page(folio, swp_offset(entry));
 	/*
@@ -5067,6 +5087,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (unlikely(folio !=3D swapcache)) {
 		folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
 		folio_add_lru_vma(folio, vma);
+		if (folio_test_large(swapcache))
+			memcg1_swapin(swapcache);
 		folio_put_swap(swapcache, NULL);
 	} else if (!folio_test_anon(folio)) {
 		/*
@@ -5076,6 +5098,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) !=3D nr_pages, folio);
 		VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
 		folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
+		if (folio_test_large(folio))
+			memcg1_swapin(folio);
 		folio_put_swap(folio, NULL);
 	} else {
 		VM_WARN_ON_ONCE(nr_pages !=3D 1 && nr_pages !=3D folio_nr_pages(folio));
@@ -5132,8 +5156,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 out_page:
-	if (folio_test_swapcache(folio))
+	if (folio_test_swapcache(folio)) {
+		if (folio_test_large(folio))
+			memcg1_swapin(folio);
 		folio_free_swap(folio);
+	}
 	folio_unlock(folio);
 out_release:
 	folio_put(folio);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index d37097913b30..f03ad4832f16 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -21,6 +21,7 @@
 #include <linux/migrate.h>
 #include <linux/vmalloc.h>
 #include <linux/huge_mm.h>
+#include <linux/zswap.h>
 #include <linux/shmem_fs.h>
 #include "internal.h"
 #include "swap_table.h"
@@ -403,7 +404,8 @@ void __swap_cache_replace_folio(struct swap_cluster_inf=
o *ci,
 static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
 					swp_entry_t targ_entry, gfp_t gfp,
 					unsigned int order, struct vm_fault *vmf,
-					struct mempolicy *mpol, pgoff_t ilx)
+					struct mempolicy *mpol, pgoff_t ilx,
+					bool defer_memcg1_swapin)
 {
 	int err;
 	swp_entry_t entry;
@@ -466,7 +468,8 @@ static struct folio *__swap_cache_alloc(struct swap_clu=
ster_info *ci,
 	}
=20
 	/* memsw uncharges swap when folio is added to swap cache */
-	memcg1_swapin(folio);
+	if (!defer_memcg1_swapin || !order)
+		memcg1_swapin(folio);
 	if (shadow)
 		workingset_refault(folio, shadow);
=20
@@ -495,9 +498,12 @@ static struct folio *__swap_cache_alloc(struct swap_cl=
uster_info *ci,
  * Return: Returns the folio if allocation succeeded and folio is in the s=
wap
  * cache. Returns error code if failed due to race, OOM or invalid argumen=
ts.
  */
-struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp,
-				     unsigned long orders, struct vm_fault *vmf,
-				     struct mempolicy *mpol, pgoff_t ilx)
+static struct folio *__swap_cache_alloc_folio(swp_entry_t targ_entry,
+					      gfp_t gfp, unsigned long orders,
+					      struct vm_fault *vmf,
+					      struct mempolicy *mpol,
+					      pgoff_t ilx,
+					      bool defer_memcg1_swapin)
 {
 	int order, err;
 	struct folio *ret;
@@ -512,7 +518,8 @@ struct folio *swap_cache_alloc_folio(swp_entry_t targ_e=
ntry, gfp_t gfp,
=20
 	do {
 		ret =3D __swap_cache_alloc(ci, targ_entry, gfp, order,
-					 vmf, mpol, ilx);
+					 vmf, mpol, ilx,
+					 defer_memcg1_swapin);
 		if (!IS_ERR(ret))
 			break;
 		err =3D PTR_ERR(ret);
@@ -525,6 +532,124 @@ struct folio *swap_cache_alloc_folio(swp_entry_t targ=
_entry, gfp_t gfp,
 	return ret;
 }
=20
+struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp,
+				     unsigned long orders, struct vm_fault *vmf,
+				     struct mempolicy *mpol, pgoff_t ilx)
+{
+	return __swap_cache_alloc_folio(targ_entry, gfp, orders, vmf,
+					mpol, ilx, false);
+}
+
+static struct folio *swap_cache_alloc_speculative_folio(swp_entry_t targ_e=
ntry,
+							gfp_t gfp,
+							unsigned long orders,
+							struct vm_fault *vmf,
+							struct mempolicy *mpol,
+							pgoff_t ilx)
+{
+	/*
+	 * Speculative large swapin may drop this fresh swapcache folio and
+	 * retry order-0 after backend or page-table revalidation. Keep the
+	 * cgroup v1 memsw swap owner until the caller commits the folio.
+	 */
+	return __swap_cache_alloc_folio(targ_entry, gfp, orders, vmf,
+					mpol, ilx, true);
+}
+
+static bool swapin_zeromap_same(swp_entry_t entry, unsigned int nr_pages)
+{
+	unsigned int ci_start =3D swp_cluster_offset(entry);
+	struct swap_cluster_info *ci =3D __swap_entry_to_cluster(entry);
+	bool is_zero;
+	unsigned int i;
+
+	if (ci_start + nr_pages > SWAPFILE_CLUSTER) {
+		VM_WARN_ON_ONCE(1);
+		return false;
+	}
+
+	rcu_read_lock();
+	if (!rcu_dereference(ci->table)) {
+		rcu_read_unlock();
+		return true;
+	}
+
+	is_zero =3D __swap_table_test_zero(ci, ci_start);
+	for (i =3D 1; i < nr_pages; i++) {
+		if (is_zero !=3D __swap_table_test_zero(ci, ci_start + i)) {
+			rcu_read_unlock();
+			return false;
+		}
+	}
+	rcu_read_unlock();
+
+	return true;
+}
+
+static unsigned long swapin_admit_orders(swp_entry_t entry,
+					 unsigned long orders)
+{
+	unsigned long candidates =3D orders & ~BIT(0);
+	unsigned long admitted =3D orders & BIT(0);
+	int order;
+
+	if (!candidates)
+		return orders;
+
+	while (candidates) {
+		enum zswap_range_state state;
+		unsigned int nr_pages;
+		swp_entry_t range_entry;
+		bool admit =3D false;
+
+		order =3D fls_long(candidates) - 1;
+		if (order > MAX_PAGE_ORDER) {
+			candidates &=3D ~BIT(order);
+			continue;
+		}
+
+		nr_pages =3D 1U << order;
+		range_entry =3D swp_entry(swp_type(entry),
+					round_down(swp_offset(entry), nr_pages));
+		if (!swapin_zeromap_same(range_entry, nr_pages))
+			goto next;
+
+		state =3D zswap_probe_range(range_entry, nr_pages);
+		switch (state) {
+		case ZSWAP_RANGE_MIXED:
+			break;
+		case ZSWAP_RANGE_ALL_ZSWAP:
+		case ZSWAP_RANGE_NEVER_ENABLED:
+		case ZSWAP_RANGE_NO_ZSWAP:
+			admit =3D true;
+			break;
+		}
+
+next:
+		if (admit)
+			admitted |=3D BIT(order);
+		else
+			count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK);
+		candidates &=3D ~BIT(order);
+	}
+
+	return admitted ? admitted : BIT(0);
+}
+
+static bool zswap_needs_order0_retry(struct folio *folio)
+{
+	if (!folio_test_large(folio))
+		return false;
+
+	/*
+	 * Admission sees only an advisory zswap snapshot. Recheck after the
+	 * large swapcache folio is installed; if the range became mixed, drop
+	 * the fresh folio before IO and let order-0 handle each slot.
+	 */
+	return zswap_probe_range(folio->swap, folio_nr_pages(folio)) =3D=3D
+	       ZSWAP_RANGE_MIXED;
+}
+
 /*
  * If we are the only user, then try to free up the swap cache.
  *
@@ -634,7 +759,8 @@ static struct folio *swap_cache_read_folio(swp_entry_t =
entry, gfp_t gfp,
 		folio =3D swap_cache_get_folio(entry);
 		if (folio)
 			return folio;
-		folio =3D swap_cache_alloc_folio(entry, gfp, BIT(0), NULL, mpol, ilx);
+		folio =3D swap_cache_alloc_folio(entry, gfp, BIT(0), NULL,
+					       mpol, ilx);
 	} while (PTR_ERR(folio) =3D=3D -EEXIST);
=20
 	if (IS_ERR_OR_NULL(folio))
@@ -677,18 +803,43 @@ struct folio *swapin_sync(swp_entry_t entry, gfp_t gf=
p, unsigned long orders,
 	struct folio *folio;
 	int ret;
=20
+	orders =3D swapin_admit_orders(entry, orders);
+again:
 	do {
 		folio =3D swap_cache_get_folio(entry);
 		if (folio)
 			return folio;
-		folio =3D swap_cache_alloc_folio(entry, gfp, orders, vmf, mpol, ilx);
+		folio =3D swap_cache_alloc_speculative_folio(entry, gfp, orders,
+							   vmf, mpol, ilx);
 	} while (PTR_ERR(folio) =3D=3D -EEXIST);
=20
 	if (IS_ERR(folio))
 		return folio;
=20
+	if (zswap_needs_order0_retry(folio)) {
+		count_mthp_stat(folio_order(folio), MTHP_STAT_SWPIN_FALLBACK);
+		/*
+		 * The folio is newly allocated, locked, clean and not uptodate;
+		 * no data has been read into it. Removing it only restores the
+		 * swap table entries so order-0 swapin can resolve a backend
+		 * race without attempting speculative large-folio zswapin.
+		 */
+		swap_cache_del_folio(folio);
+		folio_unlock(folio);
+		folio_put(folio);
+		orders =3D BIT(0);
+		goto again;
+	}
+
 	ret =3D swap_read_folio(folio, NULL);
-	VM_WARN_ON_ONCE(ret =3D=3D -EAGAIN);
+	if (ret =3D=3D -EAGAIN) {
+		count_mthp_stat(folio_order(folio), MTHP_STAT_SWPIN_FALLBACK);
+		swap_cache_del_folio(folio);
+		folio_unlock(folio);
+		folio_put(folio);
+		orders =3D BIT(0);
+		goto again;
+	}
 	return folio;
 }
=20
--=20
2.34.1
From nobody Mon Jun  8 12:13:56 2026
Received: from xmbghk7.mail.qq.com (xmbghk7.mail.qq.com [43.163.128.48])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9B9BE3438BD;
	Fri, 29 May 2026 12:19:41 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=43.163.128.48
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1780057183; cv=none;
 b=OKFcv5ewCkaTdaR/5+GnoTDV9dt6jM7xyy0eAQhfEx2oFmVxF068lSrZeCPafcqo+5gsbefaJvpEGqcgCWAXme5vAOlm5GUeWU7hnDjoinkJqvKzy1zNO2n1UPos7vljiwj/3J3Ue4LD6yvdU6KRP0ou9NBYHgnE01ffAVkNG3M=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1780057183; c=relaxed/simple;
	bh=IJIhxNj6wWIRUVVmjKYBMeKl3sJ/RkXIW4Hu88GF9gE=;
	h=Message-ID:From:To:Cc:Subject:Date:In-Reply-To:References:
	 MIME-Version;
 b=Pqz2tQMM+4HOdlTjWH3oOqkimIk+E5iqyimD93/K6c/VVR3Uvv1zO2hEXIyIPWVVv/FgwVViSbjzyKR39rp4YfEFunX2r2+7aonI6XLZ5whmAeUKlSETanzaQrGiedbfFE4lq94z+F4cpNDpU+2zX6Ls/opYcS6uQOj5dim9O7c=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=qq.com;
 spf=pass smtp.mailfrom=qq.com;
 dkim=pass (1024-bit key) header.d=qq.com header.i=@qq.com header.b=h+Xilcb+;
 arc=none smtp.client-ip=43.163.128.48
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=qq.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=qq.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=qq.com header.i=@qq.com header.b="h+Xilcb+"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=qq.com; s=s201512;
	t=1780057179; bh=e9pDd68+wLvH4AeGJEXTLZCNqdsKkVQuDmz0y/Qyl40=;
	h=From:To:Cc:Subject:Date:In-Reply-To:References;
	b=h+Xilcb+Fe2a+TdyND9eRmzpcWn65K1nBdf+U1Tcpxrle9mkaWuaGs92Dj7+4Gyqu
	 E5kqBNXOpgr4kNi7T3fhli4vTM6GVQ2gt42yEsJ1O+AzgjDPfPS6p4L/Zf9KtjNZMy
	 Bx2TH95F4j3um3bRRG5JuxuisRPgCw5F6ZQVWuXw=
Received: from node68.. ([166.111.236.25])
	by newxmesmtplogicsvrsza73-0.qq.com (NewEsmtp) with SMTP
	id 4DC10017; Fri, 29 May 2026 20:19:28 +0800
X-QQ-mid: xmsmtpt1780057176tr24p40mn
Message-ID: <tencent_69E7033C2446FE6E922D28B82E9F59142D09@qq.com>
X-QQ-XMAILINFO: NcJw5CIirWgGxMPMMf2Xe4hBEBXn4hAHfolRSxm6+zPzISbV8Hly/4GctJa4j0
	 /Zwax7rBPKgCvjRhIpWIEMfK/hLRiJNnJv5iYpCP8i5S5K3nMVpulonB9o0ul+xA9hcQ0mx+X+85
	 2NJdkYsR/B9ahT5scviN33mGJYnW4QltRFCRA/lTjDQyeTozqDPEHYSzmODT52L4xGBHcR4WnJaF
	 NhiOv+sE1NQE9qdszpvr0iVpMg8Pg4hGvHwTZAYU/iWhgHiNF55eMcX50UFh1lRdS/1nkur5lydG
	 FLVGBLwrVzv+ga0RF6bxF7z4ZioHDKh0OD31pp0hlwJyixBEswr0ReNnl/sQqaY5YJqwyMEfdekZ
	 BxIHOMNKyY9mOQOAFul5mbS43GsrLLEZHPI3LG3oHH9JFzJCLc7NBbvGNyw2ymKgu5pYqJcFhPYg
	 kqyDna8DHjyvVjont2VGFEVUjYSRU2XMZZWUA9y+4pG8beI1CPfssyimdEl6HmNaMfyxfF9t/3Yc
	 tEF5zoBXdSUa77Ch8bC1GoOt1OsGGkoANEY+PUbZ2Euob8C4ZQXrFYl8TfFBPIw6vFsbr0zBx+bg
	 kLjBaUboDP11xAE1vE1nGBGjt9PGzGtCxUfbzmwXXnXOrhjRPzBOCj4gsNWALJu4p7eL21EPDvmQ
	 yfjvioRg6edGBBhgP+AUxlOWr30REOTMsq1ImSZqKgUENMxZdrua4namZUD2CBC0ig8kQu7LKTnG
	 GgNKS3nkTVMEYshUKswxb8TGE/0LPku+SBz63hq8MjjbqqDvDG62UnFwV9QAYSYmgo/iK7qkwQyv
	 Kpug2b7ABBs7Ngszgk0QwvkTIIEMLxv3f+VgAERJ2tQLbyc6++FXndclzN1PUsN64zhsQvfNabqE
	 Y+vXnSkAEg914R0pjjmh1S0keSAE5pBD+4U2HAZ4NDfZSLV9QsnaOzuy0CYJj0fEDjpSVjTRUFRC
	 NuZ8cFSN+FpmpCHw9H+z+PS6RIuqlUi3KihEhPHgtzjuv+AgbUeKiC4tZfg6FUWgmuP4QpqeST3T
	 uVTGS3pWJ1c3t/Vd5ezCpKIta1kwPhccBSJQYQaFnXzjMy7UEUUwf1rvHM7Ih/TUBiGqSBww==
X-QQ-XMRINFO: NyFYKkN4Ny6FuXrnB5Ye7Aabb3ujjtK+gg==
From: fujunjie <fujunjie1@qq.com>
To: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org,
	Alexandre Ghiti <alexghiti@meta.com>,
	Kairui Song <kasong@tencent.com>,
	Usama Arif <usamaarif642@gmail.com>
Cc: Chris Li <chrisl@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Yosry Ahmed <yosry@kernel.org>,
	Nhat Pham <nphamcs@gmail.com>,
	David Hildenbrand <david@kernel.org>,
	Hugh Dickins <hughd@google.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org
Subject: [RFC PATCH v2 5/9] mm: add common locality admission for zswap large
 swapin
Date: Fri, 29 May 2026 12:19:24 +0000
X-OQ-MSGID: <20260529121928.4115683-5-fujunjie1@qq.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <tencent_98CD9F78E48D08DC005A6471A13CFF28B60A@qq.com>
References: <tencent_98CD9F78E48D08DC005A6471A13CFF28B60A@qq.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Fully zswap-backed ranges are safe to load as a large folio only when
the caller has a reason to expect the neighbouring slots to be useful.
Otherwise a sparse refault can turn one 4K demand fault into a 64K
decompression and swapcache fill.

Add a common admission gate for zswap-backed large swapin. The common
layer keeps backend checks, the 64K cap, recent-refault rejection, and
zswap reclaim-pressure rejection. It consumes a caller-provided locality
order mask instead of looking at anon or shmem state directly.

Callers pass no locality evidence for now, so this patch only installs
the common policy hook. Later patches add anon and shmem producers.

Signed-off-by: fujunjie <fujunjie1@qq.com>
---
 mm/memory.c     |   2 +-
 mm/shmem.c      |   2 +-
 mm/swap.h       |   8 ++--
 mm/swap_state.c | 118 ++++++++++++++++++++++++++++++++++++++++++++----
 4 files changed, 117 insertions(+), 13 deletions(-)
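
Note for reviewers: the gate ordering described above can be modelled in
user space. This is a sketch under stated assumptions rather than the
kernel implementation: a 4K base page is assumed, the refault and
pressure inputs are passed in as plain booleans, and all model_* names
(including the pressure_queries counter) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SIZE	4096UL		/* assumed 4K base pages */
#define MODEL_MAX_SIZE	(64UL * 1024)	/* the 64K cap */

/* Largest order whose folio size still fits under the cap. */
static unsigned int model_max_order(void)
{
	unsigned int order = 0;

	while ((MODEL_PAGE_SIZE << (order + 1)) <= MODEL_MAX_SIZE)
		order++;
	return order;
}

/* Pressure is sampled at most once per admission pass, then cached. */
struct model_admit_ctx {
	bool pressure_checked;
	bool reclaim_pressure;
	unsigned int pressure_queries;	/* model-only bookkeeping */
};

static bool model_zswap_admit(unsigned int order,
			      unsigned long locality_orders,
			      bool refaulted, bool pressure_now,
			      struct model_admit_ctx *ctx)
{
	/* Size cap first: it needs no other state. */
	if (order == 0 || order > model_max_order())
		return false;

	/* The caller must supply locality evidence for this order. */
	if (!(locality_orders & (1UL << order)))
		return false;

	/* A recent refault in the range rejects the large attempt. */
	if (refaulted)
		return false;

	if (!ctx->pressure_checked) {
		ctx->reclaim_pressure = pressure_now;
		ctx->pressure_checked = true;
		ctx->pressure_queries++;
	}
	return !ctx->reclaim_pressure;
}
```

With a locality mask of 0, which is what every caller passes in this
patch, the model rejects all large orders before it samples pressure.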

diff --git a/mm/memory.c b/mm/memory.c
index d73a19692dea..92a82008d583 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4849,7 +4849,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
 			folio =3D swapin_sync(entry, GFP_HIGHUSER_MOVABLE,
 					    thp_swapin_suitable_orders(vmf) | BIT(0),
-					    vmf, NULL, 0);
+					    0, vmf, NULL, 0);
 		else
 			folio =3D swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
=20
diff --git a/mm/shmem.c b/mm/shmem.c
index 56c23a7b15c7..fa99b48ed62b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2031,7 +2031,7 @@ static struct folio *shmem_swap_alloc_folio(struct in=
ode *inode,
=20
 again:
 	mpol =3D shmem_get_pgoff_policy(info, index, order, &ilx);
-	folio =3D swapin_sync(entry, gfp, BIT(order), vmf, mpol, ilx);
+	folio =3D swapin_sync(entry, gfp, BIT(order), 0, vmf, mpol, ilx);
 	mpol_cond_put(mpol);
=20
 	if (!IS_ERR(folio))
diff --git a/mm/swap.h b/mm/swap.h
index ea7e1f3c4410..dd35a310d06d 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -323,9 +323,10 @@ struct folio *read_swap_cache_async(swp_entry_t entry,=
 gfp_t gfp_mask,
 struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
 		struct mempolicy *mpol, pgoff_t ilx);
 struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
-		struct vm_fault *vmf);
+			struct vm_fault *vmf);
 struct folio *swapin_sync(swp_entry_t entry, gfp_t flag, unsigned long ord=
ers,
-			   struct vm_fault *vmf, struct mempolicy *mpol, pgoff_t ilx);
+			  unsigned long locality_orders, struct vm_fault *vmf,
+			  struct mempolicy *mpol, pgoff_t ilx);
 void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
 			   unsigned long addr);
=20
@@ -418,7 +419,8 @@ static inline struct folio *swapin_readahead(swp_entry_=
t swp, gfp_t gfp_mask,
=20
 static inline struct folio *swapin_sync(
 	swp_entry_t entry, gfp_t flag, unsigned long orders,
-	struct vm_fault *vmf, struct mempolicy *mpol, pgoff_t ilx)
+	unsigned long locality_orders, struct vm_fault *vmf,
+	struct mempolicy *mpol, pgoff_t ilx)
 {
 	return NULL;
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index f03ad4832f16..5a4ca289009a 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -21,6 +21,7 @@
 #include <linux/migrate.h>
 #include <linux/vmalloc.h>
 #include <linux/huge_mm.h>
+#include <linux/sizes.h>
 #include <linux/zswap.h>
 #include <linux/shmem_fs.h>
 #include "internal.h"
@@ -556,6 +557,24 @@ static struct folio *swap_cache_alloc_speculative_foli=
o(swp_entry_t targ_entry,
 					mpol, ilx, true);
 }
=20
+/*
+ * Initial conservative cap for speculative zswap large swapin. Locality
+ * evidence is supplied by the caller or by generic VMA hints; the common
+ * swapin layer keeps backend safety and pressure decisions here.
+ */
+#define SWAPIN_ZSWAP_MAX_SIZE			SZ_64K
+#if PAGE_SIZE < SWAPIN_ZSWAP_MAX_SIZE
+#define SWAPIN_ZSWAP_MAX_ORDER			\
+	ilog2(SWAPIN_ZSWAP_MAX_SIZE / PAGE_SIZE)
+#else
+#define SWAPIN_ZSWAP_MAX_ORDER			0
+#endif
+
+struct zswap_admit_ctx {
+	bool pressure_checked;
+	bool reclaim_pressure;
+};
+
 static bool swapin_zeromap_same(swp_entry_t entry, unsigned int nr_pages)
 {
 	unsigned int ci_start =3D swp_cluster_offset(entry);
@@ -586,11 +605,84 @@ static bool swapin_zeromap_same(swp_entry_t entry, un=
signed int nr_pages)
 	return true;
 }
=20
+static bool swapin_zswap_locality(struct vm_fault *vmf, unsigned int order,
+				  unsigned long locality_orders)
+{
+	struct vm_area_struct *vma =3D vmf ? vmf->vma : NULL;
+
+	if (!order || order > MAX_PAGE_ORDER)
+		return false;
+
+	if (vma && (vma->vm_flags & VM_RAND_READ))
+		return false;
+
+	return locality_orders & BIT(order);
+}
+
+static bool swapin_zswap_refaulted(swp_entry_t entry, unsigned int nr_page=
s)
+{
+	unsigned int type =3D swp_type(entry);
+	pgoff_t offset =3D swp_offset(entry);
+	unsigned int i;
+
+	for (i =3D 0; i < nr_pages; i++) {
+		bool workingset;
+		void *shadow;
+
+		shadow =3D swap_cache_get_shadow(swp_entry(type, offset + i));
+		if (!shadow)
+			continue;
+		if (workingset_test_recent(shadow, false, &workingset, false) &&
+		    workingset)
+			return true;
+	}
+
+	return false;
+}
+
+static bool swapin_zswap_admit(swp_entry_t entry,
+			       unsigned int order, unsigned int nr_pages,
+			       struct vm_fault *vmf,
+			       unsigned long locality_orders,
+			       struct zswap_admit_ctx *ctx)
+{
+	if (order > SWAPIN_ZSWAP_MAX_ORDER)
+		return false;
+
+	/*
+	 * Treat zswap-backed large swapin as speculative. The common layer
+	 * consumes caller-provided locality orders, but does not inspect
+	 * anon-specific PTE state or shmem-specific mapping state directly.
+	 */
+	if (!swapin_zswap_locality(vmf, order, locality_orders))
+		return false;
+
+	/*
+	 * A recent workingset refault shadow in the target range means reclaim
+	 * already saw churn there. Keep the refault path narrow instead of
+	 * speculatively decompressing neighbouring slots.
+	 */
+	if (swapin_zswap_refaulted(entry, nr_pages))
+		return false;
+
+	if (!ctx->pressure_checked) {
+		ctx->reclaim_pressure =3D zswap_pool_reclaim_pressure();
+		ctx->pressure_checked =3D true;
+	}
+	if (ctx->reclaim_pressure)
+		return false;
+
+	return true;
+}
+
 static unsigned long swapin_admit_orders(swp_entry_t entry,
-					 unsigned long orders)
+					 unsigned long orders,
+					 struct vm_fault *vmf,
+					 unsigned long locality_orders)
 {
 	unsigned long candidates =3D orders & ~BIT(0);
 	unsigned long admitted =3D orders & BIT(0);
+	struct zswap_admit_ctx zswap_ctx =3D {};
 	int order;
=20
 	if (!candidates)
@@ -616,9 +708,14 @@ static unsigned long swapin_admit_orders(swp_entry_t e=
ntry,
=20
 		state =3D zswap_probe_range(range_entry, nr_pages);
 		switch (state) {
+		case ZSWAP_RANGE_ALL_ZSWAP:
+			admit =3D swapin_zswap_admit(range_entry, order,
+						   nr_pages, vmf,
+						   locality_orders,
+						   &zswap_ctx);
+			break;
 		case ZSWAP_RANGE_MIXED:
 			break;
-		case ZSWAP_RANGE_ALL_ZSWAP:
 		case ZSWAP_RANGE_NEVER_ENABLED:
 		case ZSWAP_RANGE_NO_ZSWAP:
 			admit =3D true;
@@ -769,8 +866,8 @@ static struct folio *swap_cache_read_folio(swp_entry_t =
entry, gfp_t gfp,
 	ret =3D swap_read_folio(folio, plug);
 	/*
 	 * Swap readahead allocates order-0 folios. -EAGAIN is reserved for
-	 * retryable large zswap backend races and must be handled by the
-	 * synchronous common swapin path.
+	 * retryable large zswap backend races and should never escape to this
+	 * order-0 path.
 	 */
 	VM_WARN_ON_ONCE(ret =3D=3D -EAGAIN);
 	if (readahead) {
@@ -786,6 +883,7 @@ static struct folio *swap_cache_read_folio(swp_entry_t =
entry, gfp_t gfp,
  * @entry: swap entry indicating the target slot
  * @gfp: memory allocation flags
  * @orders: allocation orders
+ * @locality_orders: orders with caller-provided locality evidence
  * @vmf: fault information
  * @mpol: NUMA memory allocation policy to be applied
  * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
@@ -794,16 +892,20 @@ static struct folio *swap_cache_read_folio(swp_entry_=
t entry, gfp_t gfp,
  * existing folio in the swap cache for @entry. This initiates the IO, too,
  * if needed. @entry is rounded down if @orders allow large allocation.
  *
- * Context: Caller must ensure @entry is valid and pin the swap device wit=
h refcount.
+ * Context: Caller must ensure @entry is valid and pin the swap device with
+ * refcount.
  * Return: Returns the folio on success, error code if failed.
  */
-struct folio *swapin_sync(swp_entry_t entry, gfp_t gfp, unsigned long orde=
rs,
-			   struct vm_fault *vmf, struct mempolicy *mpol, pgoff_t ilx)
+struct folio *swapin_sync(swp_entry_t entry, gfp_t gfp,
+			  unsigned long orders,
+			  unsigned long locality_orders,
+			  struct vm_fault *vmf, struct mempolicy *mpol,
+			  pgoff_t ilx)
 {
 	struct folio *folio;
 	int ret;
=20
-	orders =3D swapin_admit_orders(entry, orders);
+	orders =3D swapin_admit_orders(entry, orders, vmf, locality_orders);
 again:
 	do {
 		folio =3D swap_cache_get_folio(entry);
--=20
2.34.1
From nobody Mon Jun  8 12:13:56 2026
Received: from xmbghk7.mail.qq.com (xmbghk7.mail.qq.com [43.163.128.48])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 964623DBD53;
	Fri, 29 May 2026 12:19:43 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=43.163.128.48
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1780057186; cv=none;
 b=hMexri45wr8wHD5sZZ/BYji4OyaSSFvIJwuvyIVVxBvhrqC7UdnVCeMoYQxI71j1XEa8LiUv1S9CJ6f33dyinvWAmWLI1b+82dkq50T5XNT7id8TK8LAJ7NQns9Jzjk2J2QdfIraXJ/eQIYDztaDXUylgYdneJKt8XMFUSgTVhw=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1780057186; c=relaxed/simple;
	bh=XIjrOmuzKDqZwfu/U4RkH6jnfzV1A1isv3f+BqgqDpg=;
	h=Message-ID:From:To:Cc:Subject:Date:In-Reply-To:References:
	 MIME-Version;
 b=EEqIo00Pqpfvr7iorGtimbdwrCoiBOKKkfwWR56dlCHxH8qt3D3X8+fP0nlxir01yKSlY+YqZtyEVN6ANHqSnghgkxHmtaTcZbTtnqc2MG/g/DJqF11SN3r3i4sa3YQQNlCBY/2B1kojjL/vV9Sxs/QjUxdufM67uoKuAvFZtv4=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=qq.com;
 spf=pass smtp.mailfrom=qq.com;
 dkim=pass (1024-bit key) header.d=qq.com header.i=@qq.com header.b=bg/vJ0vD;
 arc=none smtp.client-ip=43.163.128.48
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=qq.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=qq.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=qq.com header.i=@qq.com header.b="bg/vJ0vD"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=qq.com; s=s201512;
	t=1780057181; bh=LLCdcZM28LSqnhskCBNjIY/IATM4Pd8LdF/tlghQRJQ=;
	h=From:To:Cc:Subject:Date:In-Reply-To:References;
	b=bg/vJ0vD8Wq0E8xQ9lzRiF2CcmID31e+p9BP697CZf008EDAhLiqgi/G+SyfO1EhC
	 jpiezYKBUAoDoo7KWwZBFkvdR7QdfLPc/tNIhgLI8JXSAii3DfHe/TekIyu78dmIF/
	 nbi5hAtlAIzswGMAolgpOS3lTeuahX2KL+SqvQHA=
Received: from node68.. ([166.111.236.25])
	by newxmesmtplogicsvrsza73-0.qq.com (NewEsmtp) with SMTP
	id 4DC10017; Fri, 29 May 2026 20:19:28 +0800
X-QQ-mid: xmsmtpt1780057178t8d71qf7g
Message-ID: <tencent_913470853E9B289ECF0379248E24DFB4590A@qq.com>
X-QQ-XMAILINFO: MZtEYADUG4AgFKeqSv1IeByIiaUnNl54ksdfcW2cSvrmU+UkYbeDznU9ZeKWcW
	 Fa+snv+S5Hh8iRLGeTdFPQy6v/uDbbAxElaP+GdWL1jv1GUoon3bMyomyKlDb2joL4deC9UcFUdh
	 jrdkn+9GB2y7uy8jMSO4LCWrJYqQGdJDfr8aB6b72wsxjDj7tKka3KOqSo6BWX9s7hqEhBZVAPgy
	 1pw3i5IAr3kIYT9bV9ab8m6+4IgWGGl1WWDCWrxWSf3ubJGR0Z+dMQNcK7VU80LdrMqVwQZNklUi
	 X4gLiztR5flLNtzfpqa6lOakdPqCcC7xpvSEfL7oDo4eWtiX1Eg/RdDAwdXEHKwlFn2SCPCI0N97
	 sg4GjU2G5ub0CsBjw1/zGN1nSuwFoI0ghjrq59bSLAeP3vDn5PlvT0Q1sfI+TP8r22UhUpOLvA18
	 K8UF5A5fbdIxWom4TQ6Js1pxoEQ1/lLgQ2J4j9Lfqqnk6ifMuJ6zu4x8OEHsg54zPX2m9Eiy+fVD
	 nh/AXqAxQbyhoMa0DMQfWw3ti7N2vi9FQbsobRalnefxIJ+Dhi8DUaSnZLb1lKxql47IeTggC27y
	 wu92J78jkOkEWMKCB1M7KPEPoEJ6YGNJbpp7UrdVdCtx4bbjBC03uLD9zVNVWpOzEei/UT3ygdHx
	 U5HMqjD1WbgIagPg6JFuenf62rVAoZA21QX3Dvs+g21NLdmVZNiEIgxX4pgW/Cb+FNqB3gG8Kkgj
	 e8VTkq7zSTolqLjFHp1IYBVnwurn9rABUxFkkPkSsu6kw2ep9ZrBXdlwzbnYEKX8hIsars4l9K6M
	 3kPCF6HRnRI32pJ2S39kD8tX6AxN+KwCuOhnfTI2fxXZh8l9TjdjEGlLlh91KfsRILyQfq3ks0Fx
	 brQNXSIztgvz7hIwZ55iTToNhhx+q6tB0LFUeiCaDUl0RNTaazo3ZCrVvI9AT2WGNOq+vHXYvEZQ
	 tUXcB3RfkPaFY8/jt4xGm0ItYEiEj/v7tJg7OCHN3+doHSHZ6UPsYHJaAVXMK0A5PBo+aYufpYAH
	 72vYPw5VAu82yPkzMmsZMq4Z5kYCXYxkvBQ66bPrfog+4FWHHNCUP+McfAgLuzn4OXouHWnQ==
X-QQ-XMRINFO: Nq+8W0+stu50tPAe92KXseR0ZZmBTk3gLg==
From: fujunjie <fujunjie1@qq.com>
To: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org,
	Alexandre Ghiti <alexghiti@meta.com>,
	Kairui Song <kasong@tencent.com>,
	Usama Arif <usamaarif642@gmail.com>
Cc: Chris Li <chrisl@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Yosry Ahmed <yosry@kernel.org>,
	Nhat Pham <nphamcs@gmail.com>,
	David Hildenbrand <david@kernel.org>,
	Hugh Dickins <hughd@google.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org
Subject: [RFC PATCH v2 6/9] mm: provide anon locality evidence for zswap large
 swapin
Date: Fri, 29 May 2026 12:19:25 +0000
X-OQ-MSGID: <20260529121928.4115683-6-fujunjie1@qq.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <tencent_98CD9F78E48D08DC005A6471A13CFF28B60A@qq.com>
References: <tencent_98CD9F78E48D08DC005A6471A13CFF28B60A@qq.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

The common zswap large-swapin policy needs locality evidence from
callers before it can admit a large folio. For anonymous faults, provide
that evidence from existing VMA hints and from the PTE young state left
by earlier zswap-backed large swapins.

When mapping a speculative all-zswap large folio, keep the non-faulting
PTEs old and mark only the faulting PTE young. A later fault can then
require that the previous aligned range be densely young (at least 75%
of its PTEs) before admitting another large swapin, without adding VMA
state.

This also removes the !zswap_never_enabled() guard from
thp_swapin_suitable_orders(). The common swapin path now classifies the
backend range and falls back to order-0 for mixed zswap/disk ranges or
for backend races.

Signed-off-by: fujunjie <fujunjie1@qq.com>
---
 mm/memory.c     | 234 +++++++++++++++++++++++++++++++++++++++++++-----
 mm/swap.h       |   6 ++
 mm/swap_state.c |  15 ++++
 3 files changed, 235 insertions(+), 20 deletions(-)
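
Note: the gate applied when there is no VM_SEQ_READ hint can be modelled
in userspace C roughly as below. This is only a sketch: the two constants
are copied from this patch, but young_threshold(), fault_near_start() and
neighbour_admits() are hypothetical names, and the model assumes the
previous range is fully present and mapped by a single folio of the same
order, which the kernel code checks under the PTE lock.

```c
#include <stdbool.h>

/* Constants copied from this patch. */
#define SWAPIN_ANON_YOUNG_MIN_PERCENT		75
#define SWAPIN_ANON_MAX_FAULT_SKIP_SHIFT	2

/* Hypothetical helper: young PTEs required in the previous range. */
static unsigned int young_threshold(unsigned int order)
{
	unsigned int nr_pages = 1U << order;

	/* Same rounding as DIV_ROUND_UP(nr_pages * percent, 100). */
	return (nr_pages * SWAPIN_ANON_YOUNG_MIN_PERCENT + 99) / 100;
}

/* Hypothetical helper: is the fault close to the range start? */
static bool fault_near_start(unsigned long fault_idx, unsigned int order)
{
	unsigned long max_skip =
		(1UL << order) >> SWAPIN_ANON_MAX_FAULT_SKIP_SHIFT;

	return fault_idx <= max_skip;
}

/* Hypothetical model of the neighbour gate for one candidate order. */
static bool neighbour_admits(unsigned int order, unsigned long fault_idx,
			     unsigned int young)
{
	if (!order)
		return false;
	if (!fault_near_start(fault_idx, order))
		return false;
	return young >= young_threshold(order);
}
```

For order 4 (16 pages) this asks for at least 12 young PTEs in the
previous range and a fault within the first 5 pages of the candidate
range.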

diff --git a/mm/memory.c b/mm/memory.c
index 92a82008d583..7bbb89632000 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4556,6 +4556,35 @@ static void memcg1_swapin_retry_folio(struct folio *folio,
 	folio_unlock(folio);
 }
 
+static void set_swapin_ptes(struct vm_area_struct *vma,
+			    unsigned long address, pte_t *ptep, pte_t pte,
+			    unsigned int nr_pages, unsigned int fault_pte_idx,
+			    bool fault_only_young)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pte_t old_pte;
+
+	if (!fault_only_young || nr_pages == 1) {
+		set_ptes(mm, address, ptep, pte, nr_pages);
+		return;
+	}
+
+	old_pte = pte_mkold(pte);
+	if (fault_pte_idx)
+		set_ptes(mm, address, ptep, old_pte, fault_pte_idx);
+
+	set_pte_at(mm, address + fault_pte_idx * PAGE_SIZE,
+		   ptep + fault_pte_idx,
+		   pte_mkyoung(pte_advance_pfn(pte, fault_pte_idx)));
+
+	fault_pte_idx++;
+	if (fault_pte_idx < nr_pages)
+		set_ptes(mm, address + fault_pte_idx * PAGE_SIZE,
+			 ptep + fault_pte_idx,
+			 pte_advance_pfn(old_pte, fault_pte_idx),
+			 nr_pages - fault_pte_idx);
+}
+
 static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
 {
 	vmf->pte =3D pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
@@ -4628,6 +4657,157 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define SWAPIN_ANON_YOUNG_MIN_PERCENT		75
+#define SWAPIN_ANON_MAX_FAULT_SKIP_SHIFT	2
+
+static bool swapin_anon_prev_young_dense(struct vm_fault *vmf,
+					 unsigned int order)
+{
+	struct vm_area_struct *vma;
+	unsigned int nr_pages;
+	unsigned int threshold;
+	unsigned long size;
+	unsigned long base, prev, addr;
+	struct folio *first = NULL;
+	unsigned int present = 0;
+	unsigned int young = 0;
+	pmd_t *pmd;
+	pmd_t pmdval;
+	spinlock_t *ptl; /* protects the previous PTE range */
+	pte_t *ptep;
+	unsigned int i;
+
+	if (!IS_ENABLED(CONFIG_MMU) || !arch_has_hw_pte_young() || !vmf ||
+	    !vmf->vma || !vmf->pmd || !order || order > MAX_PAGE_ORDER)
+		return false;
+
+	nr_pages = 1U << order;
+	threshold = DIV_ROUND_UP(nr_pages *
+				 SWAPIN_ANON_YOUNG_MIN_PERCENT, 100);
+	size = PAGE_SIZE << order;
+
+	vma = vmf->vma;
+	base = ALIGN_DOWN(vmf->address, size);
+	if (base < size)
+		return false;
+
+	prev = base - size;
+	if (prev < vma->vm_start || prev + size > vma->vm_end)
+		return false;
+
+	pmd = vmf->pmd;
+	if ((prev & PMD_MASK) != (base & PMD_MASK)) {
+		pmd = mm_find_pmd(vma->vm_mm, prev);
+		if (!pmd)
+			return false;
+	}
+
+	pmdval = pmdp_get_lockless(pmd);
+	if (!pmd_present(pmdval) || pmd_leaf(pmdval))
+		return false;
+
+	ptep = pte_offset_map_lock(vma->vm_mm, pmd, prev, &ptl);
+	if (!ptep)
+		return false;
+
+	for (i = 0, addr = prev; i < nr_pages; i++, addr += PAGE_SIZE) {
+		struct folio *folio;
+		pte_t pte = ptep_get(ptep + i);
+
+		if (!pte_present(pte))
+			break;
+
+		folio = vm_normal_folio(vma, addr, pte);
+		if (!folio || folio_order(folio) != order)
+			break;
+		if (!first)
+			first = folio;
+		else if (folio != first)
+			break;
+
+		present++;
+		if (pte_young(pte))
+			young++;
+	}
+
+	pte_unmap_unlock(ptep, ptl);
+	if (present != nr_pages)
+		return false;
+
+	return young >= threshold;
+}
+
+static bool swapin_anon_accessed_neighbour(struct vm_fault *vmf,
+					   unsigned int order)
+{
+	unsigned long size;
+	unsigned long base;
+	unsigned long fault_idx;
+	unsigned long max_skip;
+
+	if (!vmf || !vmf->vma || !order || order > MAX_PAGE_ORDER)
+		return false;
+
+	size = PAGE_SIZE << order;
+	base = ALIGN_DOWN(vmf->address, size);
+
+	/*
+	 * Without a sequential hint, require prior young-density evidence and
+	 * only allow faults near the start of the candidate range.
+	 */
+	fault_idx = (vmf->address - base) >> PAGE_SHIFT;
+	max_skip = (1UL << order) >> SWAPIN_ANON_MAX_FAULT_SKIP_SHIFT;
+	if (fault_idx > max_skip)
+		return false;
+
+	return swapin_anon_prev_young_dense(vmf, order);
+}
+
+static bool swapin_anon_fault_starts_range(struct vm_fault *vmf,
+					   unsigned int order)
+{
+	struct vm_area_struct *vma;
+	unsigned long size;
+	unsigned long base;
+	unsigned long first;
+
+	if (!vmf || !vmf->vma || !order || order > MAX_PAGE_ORDER)
+		return false;
+
+	vma = vmf->vma;
+	size = PAGE_SIZE << order;
+	base = ALIGN_DOWN(vmf->address, size);
+	first = ALIGN(vma->vm_start, size);
+
+	return base == first && vmf->address == base &&
+	       base + size <= vma->vm_end;
+}
+
+static unsigned long swapin_anon_locality_orders(struct vm_fault *vmf,
+						 unsigned long orders)
+{
+	struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
+	unsigned long locality_orders = 0;
+	unsigned long candidates = orders & ~BIT(0);
+	int order;
+
+	if (vma && (vma->vm_flags & VM_RAND_READ))
+		return 0;
+
+	if (vma && (vma->vm_flags & VM_SEQ_READ))
+		return candidates;
+
+	while (candidates) {
+		order = fls_long(candidates) - 1;
+		if (swapin_anon_fault_starts_range(vmf, order) ||
+		    swapin_anon_accessed_neighbour(vmf, order))
+			locality_orders |= BIT(order);
+		candidates &= ~BIT(order);
+	}
+
+	return locality_orders;
+}
+
 /*
  * Check if the PTEs within a range are contiguous swap entries.
  */
@@ -4644,9 +4824,9 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
 	if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx)))
 		return false;
 	/*
-	 * swap_read_folio() can't handle the case a large folio is hybridly
-	 * from different backends. And they are likely corner cases. Similar
-	 * things might be added once zswap support large folios.
+	 * swap_read_folio() can't do mixed-backend large folio IO. The common
+	 * synchronous swapin path will recheck backend state and fall back to
+	 * order-0 if a zswap/disk race makes the range mixed.
 	 */
 	if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages)
 		return false;
@@ -4693,14 +4873,6 @@ static unsigned long thp_swapin_suitable_orders(struct vm_fault *vmf)
 	if (unlikely(userfaultfd_armed(vma)))
 		return 0;
 
-	/*
-	 * A large swapped out folio could be partially or fully in zswap. We
-	 * lack handling for such cases, so fallback to swapping in order-0
-	 * folio.
-	 */
-	if (!zswap_never_enabled())
-		return 0;
-
 	entry = softleaf_from_pte(vmf->orig_pte);
 	/*
 	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
@@ -4708,10 +4880,13 @@ static unsigned long thp_swapin_suitable_orders(struct vm_fault *vmf)
 	 */
 	orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
 					  BIT(PMD_ORDER) - 1);
+	if (!orders)
+		return 0;
 	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
+	if (!orders)
+		return 0;
 	orders = thp_swap_suitable_orders(swp_offset(entry),
 					  vmf->address, orders);
-
 	if (!orders)
 		return 0;
 
@@ -4741,6 +4916,12 @@ static unsigned long thp_swapin_suitable_orders(struct vm_fault *vmf)
 {
 	return 0;
 }
+
+static unsigned long swapin_anon_locality_orders(struct vm_fault *vmf,
+						 unsigned long orders)
+{
+	return 0;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 /* Sanity check that a folio is fully exclusive */
@@ -4777,6 +4958,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	unsigned long page_idx;
 	unsigned long address;
 	pte_t *ptep;
+	bool fault_only_young = false;
 
 	if (!pte_unmap_same(vmf))
 		goto out;
@@ -4845,13 +5027,22 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (folio)
 		swap_update_readahead(folio, vma, vmf->address);
 	if (!folio) {
-		/* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices */
-		if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
+		/*
+		 * Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices.
+		 * The swap device is pinned while checking the flag, matching
+		 * the existing fault path.
+		 */
+		if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
+			unsigned long swapin_orders = thp_swapin_suitable_orders(vmf);
+			unsigned long locality_orders =
+				swapin_anon_locality_orders(vmf, swapin_orders);
+
 			folio = swapin_sync(entry, GFP_HIGHUSER_MOVABLE,
-					    thp_swapin_suitable_orders(vmf) | BIT(0),
-					    0, vmf, NULL, 0);
-		else
+					    swapin_orders | BIT(0),
+					    locality_orders, vmf, NULL, 0);
+		} else {
 			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
+		}
 
 		if (IS_ERR_OR_NULL(folio)) {
 			/*
@@ -5110,9 +5301,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 
 	VM_BUG_ON(!folio_test_anon(folio) ||
 			(pte_write(pte) && !PageAnonExclusive(page)));
-	set_ptes(vma->vm_mm, address, ptep, pte, nr_pages);
-	arch_do_swap_page_nr(vma->vm_mm, vma, address,
-			pte, pte, nr_pages);
+	if (folio == swapcache && nr_pages == folio_nr_pages(folio) &&
+	    arch_has_hw_pte_young())
+		fault_only_young = swapin_fault_only_young(folio);
+	set_swapin_ptes(vma, address, ptep, pte, nr_pages, page_idx,
+			fault_only_young);
+	arch_do_swap_page_nr(vma->vm_mm, vma, address, pte, pte, nr_pages);
 
 	/*
 	 * Remove the swap entry and conditionally try to free up the swapcache.
diff --git a/mm/swap.h b/mm/swap.h
index dd35a310d06d..5d1c81ab49b9 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -327,6 +327,7 @@ struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
 struct folio *swapin_sync(swp_entry_t entry, gfp_t flag, unsigned long orders,
 			  unsigned long locality_orders, struct vm_fault *vmf,
 			  struct mempolicy *mpol, pgoff_t ilx);
+bool swapin_fault_only_young(struct folio *folio);
 void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
 			   unsigned long addr);
 
@@ -430,6 +431,11 @@ static inline void swap_update_readahead(struct folio *folio,
 {
 }
 
+static inline bool swapin_fault_only_young(struct folio *folio)
+{
+	return false;
+}
+
 static inline int swap_writeout(struct folio *folio,
 		struct swap_iocb **swap_plug)
 {
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 5a4ca289009a..80dff6a1ee65 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -747,6 +747,21 @@ static bool zswap_needs_order0_retry(struct folio *folio)
 	       ZSWAP_RANGE_MIXED;
 }
 
+/*
+ * A speculative large swapin may install PTEs for pages that did not fault.
+ * Keep those non-faulting PTEs old so a later anon fault can report
+ * PTE-young density as caller-provided locality evidence without storing
+ * state in the VMA.
+ */
+bool swapin_fault_only_young(struct folio *folio)
+{
+	if (!folio_test_large(folio) || !folio_test_swapcache(folio))
+		return false;
+
+	return zswap_probe_range(folio->swap, folio_nr_pages(folio)) ==
+	       ZSWAP_RANGE_ALL_ZSWAP;
+}
+
 /*
  * If we are the only user, then try to free up the swap cache.
  *
-- 
2.34.1
From nobody Mon Jun  8 12:13:56 2026
Message-ID: <tencent_A2335C92CF3C8A7EF3121E4A23D437531608@qq.com>
From: fujunjie <fujunjie1@qq.com>
To: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org,
	Alexandre Ghiti <alexghiti@meta.com>,
	Kairui Song <kasong@tencent.com>,
	Usama Arif <usamaarif642@gmail.com>
Cc: Chris Li <chrisl@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Yosry Ahmed <yosry@kernel.org>,
	Nhat Pham <nphamcs@gmail.com>,
	David Hildenbrand <david@kernel.org>,
	Hugh Dickins <hughd@google.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org
Subject: [RFC PATCH v2 7/9] mm/shmem: provide VMA-hint locality for zswap
 large swapin
Date: Fri, 29 May 2026 12:19:26 +0000
X-OQ-MSGID: <20260529121928.4115683-7-fujunjie1@qq.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <tencent_98CD9F78E48D08DC005A6471A13CFF28B60A@qq.com>
References: <tencent_98CD9F78E48D08DC005A6471A13CFF28B60A@qq.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Let the shmem swap fault path pass locality evidence into the common
zswap large-swapin policy. Shmem does not have anon PTE-young density
evidence, so this first step only treats explicit VM_SEQ_READ as
positive evidence and VM_RAND_READ as a veto.

The non-fault shmem readahead path remains unchanged. This keeps large
zswap swapin limited to synchronous shmem faults where the caller
supplies a VMA and the common policy can still fall back to order-0.

Signed-off-by: fujunjie <fujunjie1@qq.com>
---
 mm/shmem.c | 42 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 37 insertions(+), 5 deletions(-)
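
Note: the shmem producer added here reduces to a three-way decision on
the VMA read hints (the ones userspace sets with madvise(MADV_SEQUENTIAL)
and madvise(MADV_RANDOM)). A minimal userspace sketch follows; the
MODEL_* flag values and the function name are placeholders chosen for
illustration, not the kernel's definitions.

```c
/* Placeholder bits for illustration; not the kernel's vm_flags values. */
#define MODEL_VM_SEQ_READ	(1UL << 0)
#define MODEL_VM_RAND_READ	(1UL << 1)

/*
 * Hypothetical model of shmem_swapin_locality_orders(): a random-read
 * hint vetoes, a sequential-read hint vouches for every large order,
 * and no hint means no locality evidence.
 */
static unsigned long model_shmem_locality_orders(unsigned long vm_flags,
						 unsigned long orders)
{
	/* Order 0 never needs locality evidence, so mask it out. */
	unsigned long candidates = orders & ~1UL;

	if (vm_flags & MODEL_VM_RAND_READ)
		return 0;
	if (vm_flags & MODEL_VM_SEQ_READ)
		return candidates;
	return 0;
}
```

The veto is tested first, so a VMA carrying both hints in this model
still reports no locality.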

diff --git a/mm/shmem.c b/mm/shmem.c
index fa99b48ed62b..a5ac35ac85fb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -30,6 +30,7 @@
 #include <linux/fileattr.h>
 #include <linux/filelock.h>
 #include <linux/mm.h>
+#include <linux/memcontrol.h>
 #include <linux/random.h>
 #include <linux/sched/signal.h>
 #include <linux/export.h>
@@ -1791,6 +1792,29 @@ static struct folio *shmem_swapin_cluster(swp_entry_t swap, gfp_t gfp,
 	return folio;
 }
 
+static unsigned long shmem_swapin_locality_orders(struct vm_fault *vmf,
+						  unsigned long orders)
+{
+	struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
+	unsigned long candidates = orders & ~BIT(0);
+
+	/*
+	 * Shmem does not have anon-style PTE young density evidence. Start with
+	 * explicit VMA access hints; future shmem/page-cache readahead evidence
+	 * can be folded into this producer without changing common swapin policy.
+	 */
+	if (!vma)
+		return 0;
+
+	if (vma->vm_flags & VM_RAND_READ)
+		return 0;
+
+	if (vma->vm_flags & VM_SEQ_READ)
+		return candidates;
+
+	return 0;
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 bool shmem_hpage_pmd_enabled(void)
 {
@@ -2020,18 +2044,22 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
 		struct vm_fault *vmf, pgoff_t index,
 		swp_entry_t entry, int order, gfp_t gfp)
 {
+	unsigned long locality_orders;
+	unsigned long orders;
 	pgoff_t ilx;
 	struct folio *folio;
 	struct mempolicy *mpol;
 	struct shmem_inode_info *info = SHMEM_I(inode);
 
-	if ((vmf && unlikely(userfaultfd_armed(vmf->vma))) ||
-	     !zswap_never_enabled())
+	if (vmf && unlikely(userfaultfd_armed(vmf->vma)))
 		order = 0;
 
 again:
+	orders = BIT(order);
+	locality_orders = shmem_swapin_locality_orders(vmf, orders);
 	mpol = shmem_get_pgoff_policy(info, index, order, &ilx);
-	folio = swapin_sync(entry, gfp, BIT(order), 0, vmf, mpol, ilx);
+	folio = swapin_sync(entry, gfp, orders, locality_orders, vmf, mpol,
+			    ilx);
 	mpol_cond_put(mpol);
 
 	if (!IS_ERR(folio))
@@ -2339,7 +2367,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	if (!folio_matches_swap_entry(folio, swap) ||
 	    shmem_confirm_swap(mapping, index, swap) < 0) {
 		error = -EEXIST;
-		goto unlock;
+		goto failed_swapcache;
 	}
 	if (!folio_test_uptodate(folio)) {
 		error =3D -EIO;
@@ -2369,6 +2397,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	if (sgp == SGP_WRITE)
 		folio_mark_accessed(folio);
 
+	if (folio_test_large(folio))
+		memcg1_swapin(folio);
 	folio_put_swap(folio, NULL);
 	swap_cache_del_folio(folio);
 	folio_mark_dirty(folio);
@@ -2379,9 +2409,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 failed:
 	if (shmem_confirm_swap(mapping, index, swap) < 0)
 		error = -EEXIST;
+failed_swapcache:
+	if (folio && folio_test_large(folio) && folio_test_swapcache(folio))
+		memcg1_swapin(folio);
 	if (error == -EIO)
 		shmem_set_folio_swapin_error(inode, index, folio, swap);
-unlock:
 	if (folio)
 		folio_unlock(folio);
 failed_nolock:
-- 
2.34.1
From nobody Mon Jun  8 12:13:56 2026
Message-ID: <tencent_5185ABF4594EB17C1ED5C36F44EE26D2E205@qq.com>
From: fujunjie <fujunjie1@qq.com>
To: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org,
	Alexandre Ghiti <alexghiti@meta.com>,
	Kairui Song <kasong@tencent.com>,
	Usama Arif <usamaarif642@gmail.com>
Cc: Chris Li <chrisl@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Yosry Ahmed <yosry@kernel.org>,
	Nhat Pham <nphamcs@gmail.com>,
	David Hildenbrand <david@kernel.org>,
	Hugh Dickins <hughd@google.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org
Subject: [RFC PATCH v2 8/9] mm: try all-zswap large swapin within swap
 readahead windows
Date: Fri, 29 May 2026 12:19:27 +0000
X-OQ-MSGID: <20260529121928.4115683-8-fujunjie1@qq.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <tencent_98CD9F78E48D08DC005A6471A13CFF28B60A@qq.com>
References: <tencent_98CD9F78E48D08DC005A6471A13CFF28B60A@qq.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

The non-synchronous swap fault path already computes either a VMA-based
or cluster-based readahead window. Use that existing window as locality
evidence for zswap-backed large swapin instead of mixing it with the
synchronous anon/shmem evidence.

The path first prepares the normal readahead window. If the faulting
aligned range is fully covered by that window and is still all-zswap, it
may be loaded as one large folio. If the large attempt fails or a backend
race is detected, the precomputed order-0 readahead window is used
without updating readahead state again.

Mixed zswap/disk ranges remain order-0 only. Disk-backed large swapin is
not added by this change.

Signed-off-by: fujunjie <fujunjie1@qq.com>
---
 mm/memory.c     |   6 +-
 mm/swap.h       |   4 +-
 mm/swap_state.c | 434 +++++++++++++++++++++++++++++++++++++++---------
 mm/swapfile.c   |   2 +-
 4 files changed, 360 insertions(+), 86 deletions(-)
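
Note: the "fully covered by that window" condition from the changelog
can be pictured with the userspace sketch below. It is an assumption
about the shape of the check, not code from this patch: the window is
modelled as a half-open range of swap offsets [win_start, win_end), and
range_in_window() is a hypothetical name.

```c
#include <stdbool.h>

/*
 * Hypothetical model: does the naturally aligned 2^order range that
 * contains fault_off lie entirely inside the readahead window
 * [win_start, win_end)?
 */
static bool range_in_window(unsigned long fault_off, unsigned int order,
			    unsigned long win_start,
			    unsigned long win_end)
{
	unsigned long nr = 1UL << order;
	/* Round the faulting offset down to the start of its range. */
	unsigned long start = fault_off & ~(nr - 1);

	return start >= win_start && start + nr <= win_end;
}
```

A range that sticks out of the window on either side is not a large
swapin candidate in this model; per the changelog, such faults keep
using the precomputed order-0 readahead window.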

diff --git a/mm/memory.c b/mm/memory.c
index 7bbb89632000..451375090d83 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5027,13 +5027,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (folio)
 		swap_update_readahead(folio, vma, vmf->address);
 	if (!folio) {
+		unsigned long swapin_orders = thp_swapin_suitable_orders(vmf);
+
 		/*
 		 * Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices.
 		 * The swap device is pinned while checking the flag, matching
 		 * the existing fault path.
 		 */
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
-			unsigned long swapin_orders = thp_swapin_suitable_orders(vmf);
 			unsigned long locality_orders =
 				swapin_anon_locality_orders(vmf, swapin_orders);
 
@@ -5041,7 +5042,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 					    swapin_orders | BIT(0),
 					    locality_orders, vmf, NULL, 0);
 		} else {
-			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
+			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
+						 swapin_orders, vmf);
 		}
 
 		if (IS_ERR_OR_NULL(folio)) {
diff --git a/mm/swap.h b/mm/swap.h
index 5d1c81ab49b9..0e1bf9218b5e 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -323,7 +323,7 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
 		struct mempolicy *mpol, pgoff_t ilx);
 struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
-			struct vm_fault *vmf);
+			       unsigned long orders, struct vm_fault *vmf);
 struct folio *swapin_sync(swp_entry_t entry, gfp_t flag, unsigned long orders,
 			  unsigned long locality_orders, struct vm_fault *vmf,
 			  struct mempolicy *mpol, pgoff_t ilx);
@@ -413,7 +413,7 @@ static inline struct folio *swap_cluster_readahead(swp_entry_t entry,
 }
 
 static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
-			struct vm_fault *vmf)
+				unsigned long orders, struct vm_fault *vmf)
 {
 	return NULL;
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 80dff6a1ee65..4f1eb0a7f9f5 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -678,20 +678,24 @@ static bool swapin_zswap_admit(swp_entry_t entry,
 static unsigned long swapin_admit_orders(swp_entry_t entry,
 					 unsigned long orders,
 					 struct vm_fault *vmf,
-					 unsigned long locality_orders)
+					 unsigned long locality_orders,
+					 bool zswap_only)
 {
 	unsigned long candidates = orders & ~BIT(0);
-	unsigned long admitted = orders & BIT(0);
+	unsigned long admitted = zswap_only ? 0 : orders & BIT(0);
+	enum zswap_range_state fault_zswap_state = ZSWAP_RANGE_NEVER_ENABLED;
 	struct zswap_admit_ctx zswap_ctx = {};
+	bool fault_zswap_checked = false;
 	int order;
 
 	if (!candidates)
-		return orders;
+		return zswap_only ? 0 : orders;
 
 	while (candidates) {
 		enum zswap_range_state state;
 		unsigned int nr_pages;
 		swp_entry_t range_entry;
+		bool zswap_locality;
 		bool admit =3D false;
 
 		order = fls_long(candidates) - 1;
@@ -703,6 +707,29 @@ static unsigned long swapin_admit_orders(swp_entry_t entry,
 		nr_pages = 1U << order;
 		range_entry = swp_entry(swp_type(entry),
 					round_down(swp_offset(entry), nr_pages));
+		zswap_locality = order <= SWAPIN_ZSWAP_MAX_ORDER &&
+				 swapin_zswap_locality(vmf, order,
+						       locality_orders);
+		/*
+		 * If the faulting slot is already in zswap but this order has
+		 * no zswap locality evidence, a larger range covering the fault
+		 * cannot be admitted: it is either all-zswap or mixed, and both
+		 * require zswap locality. Avoid scanning the whole range on
+		 * sparse/random zswap refaults. If the faulting slot is not in
+		 * zswap, keep the full classification so all-disk large swapin
+		 * can follow the existing policy.
+		 */
+		if (!zswap_locality) {
+			if (zswap_only)
+				goto next;
+			if (!fault_zswap_checked) {
+				fault_zswap_state = zswap_probe_range(entry, 1);
+				fault_zswap_checked = true;
+			}
+			if (fault_zswap_state == ZSWAP_RANGE_ALL_ZSWAP)
+				goto next;
+		}
+
 		if (!swapin_zeromap_same(range_entry, nr_pages))
 			goto next;
 
@@ -718,7 +745,7 @@ static unsigned long swapin_admit_orders(swp_entry_t entry,
 			break;
 		case ZSWAP_RANGE_NEVER_ENABLED:
 		case ZSWAP_RANGE_NO_ZSWAP:
-			admit = true;
+			admit = !zswap_only;
 			break;
 		}
 
@@ -730,21 +757,32 @@ static unsigned long swapin_admit_orders(swp_entry_t entry,
 		candidates &= ~BIT(order);
 	}
 
-	return admitted ? admitted : BIT(0);
+	return admitted ? admitted : (zswap_only ? 0 : BIT(0));
 }
 
-static bool zswap_needs_order0_retry(struct folio *folio)
+static bool zswap_folio_all_zswap(struct folio *folio)
 {
+	return zswap_probe_range(folio->swap, folio_nr_pages(folio)) ==
+	       ZSWAP_RANGE_ALL_ZSWAP;
+}
+static bool zswap_needs_fallback(struct folio *folio, bool zswap_only)
+{
+	enum zswap_range_state state;
+
 	if (!folio_test_large(folio))
 		return false;
 
+	state = zswap_probe_range(folio->swap, folio_nr_pages(folio));
+	if (zswap_only)
+		return state != ZSWAP_RANGE_ALL_ZSWAP;
+
 	/*
 	 * Admission sees only an advisory zswap snapshot. Recheck after the
 	 * large swapcache folio is installed; if the range became mixed, drop
 	 * the fresh folio before IO and let order-0 handle each slot.
 	 */
-	return zswap_probe_range(folio->swap, folio_nr_pages(folio)) ==
-	       ZSWAP_RANGE_MIXED;
+	return state == ZSWAP_RANGE_MIXED;
 }
 
 /*
@@ -758,8 +796,7 @@ bool swapin_fault_only_young(struct folio *folio)
 	if (!folio_test_large(folio) || !folio_test_swapcache(folio))
 		return false;
 
-	return zswap_probe_range(folio->swap, folio_nr_pages(folio)) ==
-	       ZSWAP_RANGE_ALL_ZSWAP;
+	return zswap_folio_all_zswap(folio);
 }
 
 /*
@@ -893,34 +930,15 @@ static struct folio *swap_cache_read_folio(swp_entry_t entry, gfp_t gfp,
 	return folio;
 }
 
=20
-/**
- * swapin_sync - swap-in one or multiple entries skipping readahead.
- * @entry: swap entry indicating the target slot
- * @gfp: memory allocation flags
- * @orders: allocation orders
- * @locality_orders: orders with caller-provided locality evidence
- * @vmf: fault information
- * @mpol: NUMA memory allocation policy to be applied
- * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
- *
- * This allocates a folio suitable for given @orders, or returns the
- * existing folio in the swap cache for @entry. This initiates the IO, too,
- * if needed. @entry is rounded down if @orders allow large allocation.
- *
- * Context: Caller must ensure @entry is valid and pin the swap device with
- * refcount.
- * Return: Returns the folio on success, error code if failed.
- */
-struct folio *swapin_sync(swp_entry_t entry, gfp_t gfp,
-			  unsigned long orders,
-			  unsigned long locality_orders,
-			  struct vm_fault *vmf, struct mempolicy *mpol,
-			  pgoff_t ilx)
+static struct folio *swapin_alloc_read(swp_entry_t entry, gfp_t gfp,
+				       unsigned long orders,
+				       struct vm_fault *vmf,
+				       struct mempolicy *mpol, pgoff_t ilx,
+				       bool retry_order0, bool zswap_only)
 {
 	struct folio *folio;
 	int ret;
=20
-	orders =3D swapin_admit_orders(entry, orders, vmf, locality_orders);
 again:
 	do {
 		folio =3D swap_cache_get_folio(entry);
@@ -931,19 +949,21 @@ struct folio *swapin_sync(swp_entry_t entry, gfp_t gf=
p,
 	} while (PTR_ERR(folio) =3D=3D -EEXIST);
=20
 	if (IS_ERR(folio))
-		return folio;
+		return retry_order0 ? folio : NULL;
=20
-	if (zswap_needs_order0_retry(folio)) {
+	if (zswap_needs_fallback(folio, zswap_only)) {
 		count_mthp_stat(folio_order(folio), MTHP_STAT_SWPIN_FALLBACK);
 		/*
 		 * The folio is newly allocated, locked, clean and not uptodate;
 		 * no data has been read into it. Removing it only restores the
-		 * swap table entries so order-0 swapin can resolve a backend
+		 * swap table entries so the fallback path can resolve a backend
 		 * race without attempting speculative large-folio zswapin.
 		 */
 		swap_cache_del_folio(folio);
 		folio_unlock(folio);
 		folio_put(folio);
+		if (!retry_order0)
+			return NULL;
 		orders =3D BIT(0);
 		goto again;
 	}
@@ -954,12 +974,62 @@ struct folio *swapin_sync(swp_entry_t entry, gfp_t gf=
p,
 		swap_cache_del_folio(folio);
 		folio_unlock(folio);
 		folio_put(folio);
+		if (!retry_order0)
+			return NULL;
 		orders =3D BIT(0);
 		goto again;
 	}
 	return folio;
 }
=20
+/**
+ * swapin_sync - swap-in one or multiple entries skipping readahead.
+ * @entry: swap entry indicating the target slot
+ * @gfp: memory allocation flags
+ * @orders: allocation orders
+ * @locality_orders: orders with caller-provided locality evidence
+ * @vmf: fault information
+ * @mpol: NUMA memory allocation policy to be applied
+ * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
+ *
+ * This allocates a folio suitable for given @orders, or returns the
+ * existing folio in the swap cache for @entry. This initiates the IO, too,
+ * if needed. @entry is rounded down if @orders allow large allocation.
+ *
+ * Context: Caller must ensure @entry is valid and pin the swap device with
+ * refcount.
+ * Return: the folio on success, or an error pointer on failure.
+ */
+struct folio *swapin_sync(swp_entry_t entry, gfp_t gfp,
+			  unsigned long orders,
+			  unsigned long locality_orders,
+			  struct vm_fault *vmf, struct mempolicy *mpol,
+			  pgoff_t ilx)
+{
+	orders =3D swapin_admit_orders(entry, orders, vmf,
+				     locality_orders, false);
+	return swapin_alloc_read(entry, gfp, orders, vmf, mpol, ilx,
+				 true, false);
+}
+
+static struct folio *swapin_zswap_large(swp_entry_t entry, gfp_t gfp,
+					unsigned long orders,
+					unsigned long locality_orders,
+					struct vm_fault *vmf,
+					struct mempolicy *mpol, pgoff_t ilx)
+{
+	if (READ_ONCE(page_cluster) <=3D 0)
+		return NULL;
+
+	orders =3D swapin_admit_orders(entry, orders, vmf,
+				     locality_orders, true);
+	if (!orders)
+		return NULL;
+
+	return swapin_alloc_read(entry, gfp, orders, vmf, mpol, ilx,
+				 false, true);
+}
+
 /*
  * Locate a page of swap in physical memory, reserving swap cache space
  * and reading the disk if it is not already cached.
@@ -1048,12 +1118,88 @@ static unsigned long swapin_nr_pages(unsigned long =
offset)
 	return pages;
 }
=20
+struct swap_cluster_ra {
+	unsigned long start_offset;
+	unsigned long end_offset;
+	bool readahead;
+};
+
+static void swap_cluster_ra_prepare(swp_entry_t entry,
+				    struct swap_cluster_ra *ra)
+{
+	struct swap_info_struct *si =3D __swap_entry_to_info(entry);
+	unsigned long entry_offset =3D swp_offset(entry);
+	unsigned long mask;
+
+	mask =3D swapin_nr_pages(entry_offset) - 1;
+	ra->readahead =3D !!mask;
+	ra->start_offset =3D entry_offset;
+	ra->end_offset =3D entry_offset;
+	if (!mask)
+		return;
+
+	/* Read a page_cluster sized and aligned cluster around offset. */
+	ra->start_offset =3D entry_offset & ~mask;
+	ra->end_offset =3D entry_offset | mask;
+	if (!ra->start_offset)	/* First page is swap header. */
+		ra->start_offset++;
+	if (ra->end_offset >=3D si->max)
+		ra->end_offset =3D si->max - 1;
+}
+
+static unsigned long swap_cluster_ra_orders(swp_entry_t entry,
+					    unsigned long orders,
+					    const struct swap_cluster_ra *ra)
+{
+	unsigned long admitted =3D 0;
+	unsigned long candidates =3D orders & ~BIT(0);
+	unsigned long entry_offset =3D swp_offset(entry);
+	int order;
+
+	if (!ra->readahead)
+		return 0;
+
+	while (candidates) {
+		unsigned long nr_pages;
+		unsigned long start_offset;
+		unsigned long end_offset;
+
+		order =3D fls_long(candidates) - 1;
+		if (order > MAX_PAGE_ORDER) {
+			candidates &=3D ~BIT(order);
+			continue;
+		}
+
+		nr_pages =3D 1UL << order;
+		start_offset =3D round_down(entry_offset, nr_pages);
+		end_offset =3D start_offset + nr_pages - 1;
+		if (start_offset >=3D ra->start_offset &&
+		    end_offset <=3D ra->end_offset)
+			admitted |=3D BIT(order);
+		candidates &=3D ~BIT(order);
+	}
+
+	return admitted;
+}
+
+static bool swapin_readahead_skip(unsigned long index,
+				  unsigned long skip_start,
+				  unsigned long skip_end)
+{
+	return skip_start < skip_end &&
+	       index >=3D skip_start && index < skip_end;
+}
+
 /**
- * swap_cluster_readahead - swap in pages in hope we need them soon
+ * swap_cluster_readahead_win - swap in pages from a prepared swap window
  * @entry: swap entry of this memory
  * @gfp_mask: memory allocation flags
  * @mpol: NUMA memory allocation policy to be applied
  * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
+ * @ra: readahead window prepared by swap_cluster_ra_prepare()
+ * @skip_start: first offset already covered by @target_folio
+ * @skip_end: offset after the already covered range
+ * @target_folio: target folio to return after queueing the rest of the wi=
ndow
  *
  * Returns the struct folio for entry and addr, after queueing swapin.
  *
@@ -1066,33 +1212,38 @@ static unsigned long swapin_nr_pages(unsigned long =
offset)
  * are used for every page of the readahead: neighbouring pages on swap
  * are fairly likely to have been swapped out from the same node.
  */
-struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
-				     struct mempolicy *mpol, pgoff_t ilx)
+static struct folio *swap_cluster_readahead_win(swp_entry_t entry,
+						gfp_t gfp_mask,
+						struct mempolicy *mpol,
+						pgoff_t ilx,
+						const struct swap_cluster_ra *ra,
+						unsigned long skip_start,
+						unsigned long skip_end,
+						struct folio *target_folio)
 {
 	struct folio *folio;
 	unsigned long entry_offset =3D swp_offset(entry);
-	unsigned long offset =3D entry_offset;
-	unsigned long start_offset, end_offset;
-	unsigned long mask;
-	struct swap_info_struct *si =3D __swap_entry_to_info(entry);
+	unsigned long offset;
 	struct blk_plug plug;
 	struct swap_iocb *splug =3D NULL;
 	swp_entry_t ra_entry;
=20
-	mask =3D swapin_nr_pages(offset) - 1;
-	if (!mask)
+	if (!ra->readahead)
 		goto skip;
=20
-	/* Read a page_cluster sized and aligned cluster around offset. */
-	start_offset =3D offset & ~mask;
-	end_offset =3D offset | mask;
-	if (!start_offset)	/* First page is swap header. */
-		start_offset++;
-	if (end_offset >=3D si->max)
-		end_offset =3D si->max - 1;
+	if (target_folio &&
+	    skip_start <=3D ra->start_offset && skip_end > ra->end_offset)
+		goto skip;
=20
 	blk_start_plug(&plug);
-	for (offset =3D start_offset; offset <=3D end_offset ; offset++) {
+	for (offset =3D ra->start_offset; offset <=3D ra->end_offset; offset++) {
+		if (swapin_readahead_skip(offset, skip_start, skip_end)) {
+			if (skip_end > ra->end_offset)
+				break;
+			offset =3D skip_end - 1;
+			continue;
+		}
+
 		/* Ok, do the async read-ahead now */
 		ra_entry =3D swp_entry(swp_type(entry), offset);
 		folio =3D swap_cache_read_folio(ra_entry, gfp_mask, mpol, ilx,
@@ -1105,10 +1256,29 @@ struct folio *swap_cluster_readahead(swp_entry_t en=
try, gfp_t gfp_mask,
 	swap_read_unplug(splug);
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 skip:
+	if (target_folio)
+		return target_folio;
+
 	/* The page was likely read above, so no need for plugging here */
 	return swap_cache_read_folio(entry, gfp_mask, mpol, ilx, NULL, false);
 }
=20
+struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
+				     struct mempolicy *mpol, pgoff_t ilx)
+{
+	struct swap_cluster_ra ra;
+
+	swap_cluster_ra_prepare(entry, &ra);
+	return swap_cluster_readahead_win(entry, gfp_mask, mpol, ilx, &ra,
+					 0, 0, NULL);
+}
+
+struct swap_vma_ra {
+	unsigned long start;
+	unsigned long end;
+	int win;
+};
+
 static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start,
 			   unsigned long *end)
 {
@@ -1147,35 +1317,69 @@ static int swap_vma_ra_win(struct vm_fault *vmf, un=
signed long *start,
 	return win;
 }
=20
-/**
- * swap_vma_readahead - swap in pages in hope we need them soon
- * @targ_entry: swap entry of the targeted memory
- * @gfp_mask: memory allocation flags
- * @mpol: NUMA memory allocation policy to be applied
- * @targ_ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
- * @vmf: fault information
- *
- * Returns the struct folio for entry and addr, after queueing swapin.
- *
- * Primitive swap readahead code. We simply read in a few pages whose
- * virtual addresses are around the fault address in the same vma.
- *
- * Caller must hold read mmap_lock if vmf->vma is not NULL.
- *
+static unsigned long swap_vma_ra_orders(struct vm_fault *vmf,
+					unsigned long orders,
+					const struct swap_vma_ra *ra)
+{
+	unsigned long admitted =3D 0;
+	unsigned long candidates =3D orders & ~BIT(0);
+	int order;
+
+	if (ra->win <=3D 1)
+		return 0;
+
+	while (candidates) {
+		unsigned long size;
+		unsigned long start;
+		unsigned long end;
+
+		order =3D fls_long(candidates) - 1;
+		if (order > MAX_PAGE_ORDER) {
+			candidates &=3D ~BIT(order);
+			continue;
+		}
+
+		size =3D PAGE_SIZE << order;
+		start =3D ALIGN_DOWN(vmf->address, size);
+		end =3D start + size;
+		if (start >=3D ra->start && end <=3D ra->end)
+			admitted |=3D BIT(order);
+		candidates &=3D ~BIT(order);
+	}
+
+	return admitted;
+}
+
+/*
+ * Queue swapin for a precomputed VMA readahead window. The window has alr=
eady
+ * been accounted in vma->swap_readahead_info, so fallback after a failed
+ * zswap-large attempt does not update readahead state a second time. If
+ * @target_folio is already populated, queue only the part of the window o=
utside
+ * [@skip_start, @skip_end) and return @target_folio.
  */
-static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_=
mask,
-		struct mempolicy *mpol, pgoff_t targ_ilx, struct vm_fault *vmf)
+static struct folio *swap_vma_readahead_win(swp_entry_t targ_entry,
+					    gfp_t gfp_mask,
+					    struct mempolicy *mpol,
+					    pgoff_t targ_ilx,
+					    struct vm_fault *vmf,
+					    const struct swap_vma_ra *ra,
+					    unsigned long skip_start,
+					    unsigned long skip_end,
+					    struct folio *target_folio)
 {
 	struct blk_plug plug;
 	struct swap_iocb *splug =3D NULL;
 	struct folio *folio;
 	pte_t *pte =3D NULL, pentry;
-	int win;
 	unsigned long start, end, addr;
 	pgoff_t ilx =3D targ_ilx;
=20
-	win =3D swap_vma_ra_win(vmf, &start, &end);
-	if (win =3D=3D 1)
+	if (ra->win <=3D 1)
+		goto skip;
+
+	start =3D ra->start;
+	end =3D ra->end;
+	if (target_folio && skip_start <=3D start && skip_end >=3D end)
 		goto skip;
=20
 	ilx =3D targ_ilx - PFN_DOWN(vmf->address - start);
@@ -1185,6 +1389,18 @@ static struct folio *swap_vma_readahead(swp_entry_t =
targ_entry, gfp_t gfp_mask,
 		struct swap_info_struct *si =3D NULL;
 		softleaf_t entry;
=20
+		if (swapin_readahead_skip(addr, skip_start, skip_end)) {
+			unsigned long next =3D min(skip_end, end);
+
+			if (pte) {
+				pte_unmap(pte);
+				pte =3D NULL;
+			}
+			ilx +=3D PFN_DOWN(next - addr) - 1;
+			addr =3D next - PAGE_SIZE;
+			continue;
+		}
+
 		if (!pte++) {
 			pte =3D pte_offset_map(vmf->pmd, addr);
 			if (!pte)
@@ -1220,6 +1436,9 @@ static struct folio *swap_vma_readahead(swp_entry_t t=
arg_entry, gfp_t gfp_mask,
 	swap_read_unplug(splug);
 	lru_add_drain();
 skip:
+	if (target_folio)
+		return target_folio;
+
 	/* The folio was likely read above, so no need for plugging here */
 	folio =3D swap_cache_read_folio(targ_entry, gfp_mask, mpol, targ_ilx,
 				      NULL, false);
@@ -1230,25 +1449,78 @@ static struct folio *swap_vma_readahead(swp_entry_t=
 targ_entry, gfp_t gfp_mask,
  * swapin_readahead - swap in pages in hope we need them soon
  * @entry: swap entry of this memory
  * @gfp_mask: memory allocation flags
+ * @orders: large folio orders suitable for the faulting entry
  * @vmf: fault information
  *
  * Returns the struct folio for entry and addr, after queueing swapin.
  *
- * It's a main entry function for swap readahead. By the configuration,
- * it will read ahead blocks by cluster-based(ie, physical disk based)
- * or vma-based(ie, virtual address based on faulty address) readahead.
+ * This first computes the normal VMA or cluster readahead window. If the
+ * window fully covers an aligned all-zswap range containing the fault, th=
at
+ * range may be swapped in as one large folio. The remaining window is sti=
ll
+ * queued through the original order-0 readahead path, skipping the already
+ * covered target range and without updating readahead state a second time.
  */
 struct folio *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
-				struct vm_fault *vmf)
+				unsigned long orders, struct vm_fault *vmf)
 {
 	struct mempolicy *mpol;
 	pgoff_t ilx;
 	struct folio *folio;
+	unsigned long ra_orders;
+	bool vma_ra;
=20
 	mpol =3D get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
-	folio =3D swap_use_vma_readahead() ?
-		swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf) :
-		swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
+	vma_ra =3D swap_use_vma_readahead();
+	if (vma_ra) {
+		struct swap_vma_ra ra =3D {};
+		unsigned long skip_start =3D 0;
+		unsigned long skip_end =3D 0;
+
+		ra.win =3D swap_vma_ra_win(vmf, &ra.start, &ra.end);
+		ra_orders =3D swap_vma_ra_orders(vmf, orders, &ra);
+		if (ra_orders) {
+			folio =3D swapin_zswap_large(entry, gfp_mask, ra_orders,
+						   ra_orders, vmf, mpol, ilx);
+			if (folio) {
+				skip_start =3D ALIGN_DOWN(vmf->address,
+							folio_size(folio));
+				skip_end =3D skip_start + folio_size(folio);
+				folio =3D swap_vma_readahead_win(entry, gfp_mask,
+							       mpol, ilx, vmf,
+							       &ra, skip_start,
+							       skip_end, folio);
+				goto out;
+			}
+		}
+		folio =3D swap_vma_readahead_win(entry, gfp_mask, mpol, ilx,
+					       vmf, &ra, 0, 0, NULL);
+	} else {
+		struct swap_cluster_ra ra;
+		unsigned long skip_start =3D 0;
+		unsigned long skip_end =3D 0;
+
+		swap_cluster_ra_prepare(entry, &ra);
+		ra_orders =3D swap_cluster_ra_orders(entry, orders, &ra);
+		if (ra_orders) {
+			folio =3D swapin_zswap_large(entry, gfp_mask, ra_orders,
+						   ra_orders, vmf, mpol, ilx);
+			if (folio) {
+				skip_start =3D swp_offset(folio->swap);
+				skip_end =3D skip_start + folio_nr_pages(folio);
+				folio =3D swap_cluster_readahead_win(entry,
+								   gfp_mask,
+								   mpol, ilx,
+								   &ra,
+								   skip_start,
+								   skip_end,
+								   folio);
+				goto out;
+			}
+		}
+		folio =3D swap_cluster_readahead_win(entry, gfp_mask, mpol, ilx,
+						   &ra, 0, 0, NULL);
+	}
+out:
 	mpol_cond_put(mpol);
=20
 	return folio;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 615d90867111..3b7e7d8ae89d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2452,7 +2452,7 @@ static int unuse_pte_range(struct vm_area_struct *vma=
, pmd_t *pmd,
 			};
=20
 			folio =3D swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
-						&vmf);
+						 0, &vmf);
 		}
 		if (!folio) {
 			swp_tb =3D swap_table_get(__swap_entry_to_cluster(entry),
--=20
2.34.1
From nobody Mon Jun  8 12:13:56 2026
Received: from xmbghk7.mail.qq.com (xmbghk7.mail.qq.com [43.163.128.47])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 070813CEBA7;
	Fri, 29 May 2026 12:19:54 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=43.163.128.47
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1780057196; cv=none;
 b=okKcxVl8v9GS3xNzlTDzkbyqTe62JzSTU/UrQMxt5dqJM4e6qq10GhK1poCnrVVBDD/g+DmCTUpacMAJfIv/V3SY+eHDKrGvcKQINOFA/Bi9cGzxa0HmMzovEe+d+FEm4fvImkE33FpF7q2E5ZmoKwATPdyeSOKwcNRz4+WsKt8=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1780057196; c=relaxed/simple;
	bh=N0bkbZk9e+3vO2o+P1+AL6OGGPDRe1aokkNHMcWFoa0=;
	h=Message-ID:From:To:Cc:Subject:Date:In-Reply-To:References:
	 MIME-Version;
 b=hc0kaIVVXBUjxRN/04DgWof6+xwA5aPKgAFc0MhrCO1WAOxmRsxYMa40qmmSqbcSEPx/lgJOXehLxJWU5tmgFKm5EZGveN9aKZ2ApuxtQIsgWcxvHvDkq2UO6wqm4Yu7G2YN0kxSwbWZsUKPR1n5hpDWKex5AcO/gFesp9+luBQ=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=qq.com;
 spf=pass smtp.mailfrom=qq.com;
 dkim=pass (1024-bit key) header.d=qq.com header.i=@qq.com header.b=bSaVaO1p;
 arc=none smtp.client-ip=43.163.128.47
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=qq.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=qq.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=qq.com header.i=@qq.com header.b="bSaVaO1p"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=qq.com; s=s201512;
	t=1780057187; bh=0F7tBsnwtr9hum4Sk2+uPTGTQ76BR9xyPCoNaMnjzxw=;
	h=From:To:Cc:Subject:Date:In-Reply-To:References;
	b=bSaVaO1pHWsE2as9v5aqSMI3gC6pea5BdBgdNC7Zp/OSLATc3MR/tPAEO2FgIruVC
	 YB9mZhzuJch1Lyv9TcBU1iRgTAX1iW1VQk/UqItvsjW7pq3FuwOdkBfXuvHXobr1Zz
	 JyGIUF9yPPyeUphvvzOes8X1ZlDuBPPGyfWROCuI=
Received: from node68.. ([166.111.236.25])
	by newxmesmtplogicsvrsza73-0.qq.com (NewEsmtp) with SMTP
	id 4DC10017; Fri, 29 May 2026 20:19:28 +0800
X-QQ-mid: xmsmtpt1780057184ts6fo2ynp
Message-ID: <tencent_0F6F682C60B99E9E0F1553E6BF3D86468409@qq.com>
X-QQ-XMAILINFO: Mzcurg9uYAemtBwe+INFz74//KNnE+Z/ry9r/kfo1MYznumf3xBU+p9iy79Xaw
	 5SFqWyk9ocoWQh4qtxi57Wl8PIpPfEbG5//RFyUtp7BvID1cq8ZOL2RNORIdwQAzWbREte+FISGS
	 rQLBGEYc6eUsDDl76FFUFO5/22ui9p3L6U1JZw0N66akvDKNhv/1ynm4BWilnqfjBg2mCXJnkx67
	 pwCuegeawCfvgAVS+7KNfwxM5DDiHZnFNXH6ZG4+UCSsbK7o40q0YwxET8LyS0tiHcYiLb5VtZFL
	 AhOMNGUUEYOXEXb+8Skf3IL+tU0BsQDlyt/+KP2F+tZGfDgScU1V/cFt104Yj/eM6uMAX/97T5NW
	 74mzCT2cveD5PAqm4nQwFkHJFe5i9cJReF55joujorHiYDxaX8gXPlCIF8ejkOK+BlPhSqw7MEsR
	 GWIOs7Xh0FTyVz+6i/R+CfGiEPU1aMpB+fMq67UMOM0cs+kt69P07GXv8K3gO1g5Nr2DA/5BLnjZ
	 mAToYyjg9Ydu/7jk0GSMBpTyEG0GJEgZ14RiImwQOaFQxrcor9tUVNIuTs/1jVyeZjt/rfUMJH9D
	 H0vc0MXd/ln8DTRmhr0dDIfzVJtXGNKUGZLszineBd2wsEbTzl1DVmsjblQCkRC8dDR3iCGOoC9P
	 KhoOLRf/Y01X/QHibkPXac0vuPGeZFo9sYT5Xajl2kQgof722s+DxCQB4b6domZD1/i4gkgfsLSC
	 AwqOLfuWfOn3JASXbC8cLYiyXjduwjbvL2Y9NYDWSMmWTjAYBO9atEQ0d8770p2vzQ8PyP+KfWqg
	 tx6Y2DHMgRODRVKFoBsMSERP7Ar8uNDk2TvEdEStURJEMO6K50+et+x2VpxYvFUpk2QSPcwSvK0L
	 TXjiwyPvWmy8RVmLFleFoPIBszmIlosHJPcYsGUoKTbY/wH3sFaXX2C3JgUOYva0ncVfCDuOWYMa
	 FPuQLGY7kjudHqGukOTmTK7VYzX/GUfQwLlbrJwUrdcxSdrFqwAbD9eESEvPovIzqoUER1VAm/H6
	 uELPk+BgwYmrm1cLTFhdappr6OAVFkmrqx+qYiY9YcRRoUL9mzGIOdpgDNqH0=
X-QQ-XMRINFO: Mp0Kj//9VHAxzExpfF+O8yhSrljjwrznVg==
From: fujunjie <fujunjie1@qq.com>
To: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org,
	Alexandre Ghiti <alexghiti@meta.com>,
	Kairui Song <kasong@tencent.com>,
	Usama Arif <usamaarif642@gmail.com>
Cc: Chris Li <chrisl@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Yosry Ahmed <yosry@kernel.org>,
	Nhat Pham <nphamcs@gmail.com>,
	David Hildenbrand <david@kernel.org>,
	Hugh Dickins <hughd@google.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org
Subject: [RFC PATCH v2 9/9] docs: mm: update THP swapin counter descriptions
Date: Fri, 29 May 2026 12:19:28 +0000
X-OQ-MSGID: <20260529121928.4115683-9-fujunjie1@qq.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <tencent_98CD9F78E48D08DC005A6471A13CFF28B60A@qq.com>
References: <tencent_98CD9F78E48D08DC005A6471A13CFF28B60A@qq.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

The THP swapin counter documentation still says that large swapin comes
only from non-zswap swap devices. Update it now that zswap-backed large
folio swapin can also increment swpin.

Also describe policy and backend rejection as swpin_fallback cases,
since speculative zswap large swapin can intentionally fall back before
doing large IO.

Signed-off-by: fujunjie <fujunjie1@qq.com>
---
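For anyone testing this, a minimal sketch of how the counters described
above could be watched. It assumes the per-size stats directories that
transhuge.rst documents under /sys/kernel/mm/transparent_hugepage;
thp_swpin_stats is a hypothetical helper name and is not part of this
series.

```shell
# Hypothetical helper: print every swpin* counter for each mTHP size.
# $1 optionally overrides the sysfs root, which helps when testing.
thp_swpin_stats() {
	for f in \
	    "${1:-/sys/kernel/mm/transparent_hugepage}"/hugepages-*/stats/swpin*
	do
		# An unmatched glob stays literal and is skipped here.
		[ -r "$f" ] && printf '%s %s\n' "$f" "$(cat "$f")"
	done
	return 0
}
```

Sampling it before and after a swap-heavy run should show whether large
swapins complete (swpin) or fall back (swpin_fallback).
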
 Documentation/admin-guide/mm/transhuge.rst | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/adm=
in-guide/mm/transhuge.rst
index 23f8d13c2629..59b7a0d09243 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -667,13 +667,14 @@ zswpout
 	piece without splitting.
=20
 swpin
-	is incremented every time a huge page is swapped in from a non-zswap
-	swap device in one piece.
+	is incremented every time a huge page is swapped in from a swap
+	device or from zswap in one piece.
=20
 swpin_fallback
-	is incremented if swapin fails to allocate or charge a huge page
-	and instead falls back to using huge pages with lower orders or
-	small pages.
+	is incremented if swapin cannot use a huge page and instead falls
+	back to using huge pages with lower orders or small pages. This can
+	happen because allocation or charging fails, or because policy or
+	backend state rejects a speculative large swapin.
=20
 swpin_fallback_charge
 	is incremented if swapin fails to charge a huge page and instead
--=20
2.34.1