From nobody Thu Oct 9 10:49:42 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox, Kemeng Shi,
    Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel@vger.kernel.org,
    Kairui Song, stable@vger.kernel.org
Subject: [PATCH 1/4] mm/shmem, swap: improve cached mTHP handling and fix potential hang
Date: Wed, 18 Jun 2025 02:35:00 +0800
Message-ID: <20250617183503.10527-2-ryncsn@gmail.com>
In-Reply-To: <20250617183503.10527-1-ryncsn@gmail.com>
References: <20250617183503.10527-1-ryncsn@gmail.com>
Reply-To: Kairui Song

From: Kairui Song

The current swap-in code assumes that, when a swap entry in the shmem
mapping is order 0, its cached folios (if present) must be order 0 too,
which turns out not to be always correct.

The problem is that shmem_split_large_entry is called before verifying
that the folio will eventually be swapped in. One possible race is:

CPU1                              CPU2
shmem_swapin_folio
/* swap in of order > 0 swap entry S1 */
folio = swap_cache_get_folio
/* folio = NULL */
order = xa_get_order
/* order > 0 */
folio = shmem_swap_alloc_folio
/* mTHP alloc failure, folio = NULL */
<... Interrupted ...>
                                  shmem_swapin_folio
                                  /* S1 is swapped in */
                                  shmem_writeout
                                  /* S1 is swapped out, folio cached */
shmem_split_large_entry(..., S1)
/* S1 is split, but the folio covering it has order > 0 now */

Now any following swapin of S1 will hang: `xa_get_order` returns 0,
while the folio lookup keeps returning a folio of order > 0, so the
check `xa_get_order(&mapping->i_pages, index) != folio_order(folio)`
always trips and swap-in keeps failing with -EEXIST and retrying.

This is also fragile, so fix it by allowing a larger folio to be seen
in the swap cache, and by checking that the whole shmem mapping range
covered by the swap-in holds the expected swap values when inserting
the folio. Also drop the now-redundant tree walks before the insertion.

This actually improves performance, as it avoids two redundant Xarray
tree walks in the hot path. The only side effect is that in the failure
path shmem may redundantly reallocate a few folios, causing temporary,
slight memory pressure.
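
To make the insertion-time check described above easier to follow, here
is a minimal stand-alone sketch of the invariant it enforces (plain
user-space C, not the kernel code; the entry layout, names and values
are made up for illustration): every conflicting entry in the range
covered by the folio must hold the next expected swap value, and after
the walk the accumulated value must have advanced by exactly the
folio's page count.

  #include <stdbool.h>
  #include <stdio.h>

  /* Toy model of one populated mapping slot inside the insertion range. */
  struct slot_entry {
          unsigned long val;      /* swap value stored in this entry   */
          unsigned int order;     /* the entry covers 1 << order slots */
  };

  /*
   * Mirrors the idea of the xas_for_each_conflict() loop in the patch,
   * with the xarray replaced by a plain array: each conflict must
   * continue the expected sequence of swap values, and the whole range
   * must end up covered.
   */
  static bool range_is_expected_swap(const struct slot_entry *entries,
                                     int nr_entries, unsigned long first_val,
                                     unsigned long nr)
  {
          unsigned long iter = first_val;
          int i;

          for (i = 0; i < nr_entries; i++) {
                  if (entries[i].val != iter)
                          return false;           /* hole or foreign entry */
                  iter += 1UL << entries[i].order;
          }
          return iter - nr == first_val;          /* whole range covered */
  }

  int main(void)
  {
          /* An order-2 folio (4 pages) covering two order-1 swap entries. */
          struct slot_entry ok[]  = { { 100, 1 }, { 102, 1 } };
          /* Same range, but the second entry was already replaced. */
          struct slot_entry bad[] = { { 100, 1 }, { 555, 1 } };

          printf("%d %d\n", range_is_expected_swap(ok, 2, 100, 4),
                 range_is_expected_swap(bad, 2, 100, 4));
          return 0;
  }
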
Worth noting, it may seem that the order and value check before
inserting would help reduce lock contention, but that is not true. The
swap cache layer ensures that a raced swap-in will either see a swap
cache folio or fail the swap-in (the SWAP_HAS_CACHE bit is set even
when the swap cache is bypassed), so holding the folio lock and
checking the folio flag is already good enough for avoiding the lock
contention. The chance that a folio passes the swap entry value check
while the shmem mapping slot has changed should be very low.

Cc: stable@vger.kernel.org
Fixes: 058313515d5a ("mm: shmem: fix potential data corruption during shmem swapin")
Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
Signed-off-by: Kairui Song
Reviewed-by: Kemeng Shi
---
 mm/shmem.c | 30 +++++++++++++++++++++---------
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index eda35be2a8d9..4e7ef343a29b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -884,7 +884,9 @@ static int shmem_add_to_page_cache(struct folio *folio,
                                    pgoff_t index, void *expected, gfp_t gfp)
 {
         XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
-        long nr = folio_nr_pages(folio);
+        unsigned long nr = folio_nr_pages(folio);
+        swp_entry_t iter, swap;
+        void *entry;
 
         VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
         VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
@@ -896,14 +898,24 @@ static int shmem_add_to_page_cache(struct folio *folio,
 
         gfp &= GFP_RECLAIM_MASK;
         folio_throttle_swaprate(folio, gfp);
+        swap = iter = radix_to_swp_entry(expected);
 
         do {
                 xas_lock_irq(&xas);
-                if (expected != xas_find_conflict(&xas)) {
-                        xas_set_err(&xas, -EEXIST);
-                        goto unlock;
+                xas_for_each_conflict(&xas, entry) {
+                        /*
+                         * The range must either be empty, or filled with
+                         * expected swap entries. Shmem swap entries are never
+                         * partially freed without split of both entry and
+                         * folio, so there shouldn't be any holes.
+                         */
+                        if (!expected || entry != swp_to_radix_entry(iter)) {
+                                xas_set_err(&xas, -EEXIST);
+                                goto unlock;
+                        }
+                        iter.val += 1 << xas_get_order(&xas);
                 }
-                if (expected && xas_find_conflict(&xas)) {
+                if (expected && iter.val - nr != swap.val) {
                         xas_set_err(&xas, -EEXIST);
                         goto unlock;
                 }
@@ -2323,7 +2335,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
                         error = -ENOMEM;
                         goto failed;
                 }
-        } else if (order != folio_order(folio)) {
+        } else if (order > folio_order(folio)) {
                 /*
                  * Swap readahead may swap in order 0 folios into swapcache
                  * asynchronously, while the shmem mapping can still stores
@@ -2348,15 +2360,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 
                         swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
                 }
+        } else if (order < folio_order(folio)) {
+                swap.val = round_down(swp_type(swap), folio_order(folio));
         }
 
 alloced:
         /* We have to do this with folio locked to prevent races */
         folio_lock(folio);
         if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
-            folio->swap.val != swap.val ||
-            !shmem_confirm_swap(mapping, index, swap) ||
-            xa_get_order(&mapping->i_pages, index) != folio_order(folio)) {
+            folio->swap.val != swap.val) {
                 error = -EEXIST;
                 goto unlock;
         }
-- 
2.50.0

From nobody Thu Oct 9 10:49:42 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox, Kemeng Shi,
    Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel@vger.kernel.org,
    Kairui Song
Subject: [PATCH 2/4] mm/shmem, swap: avoid redundant Xarray lookup during swapin
Date: Wed, 18 Jun 2025 02:35:01 +0800
Message-ID: <20250617183503.10527-3-ryncsn@gmail.com>
In-Reply-To: <20250617183503.10527-1-ryncsn@gmail.com>
References: <20250617183503.10527-1-ryncsn@gmail.com>
Reply-To: Kairui Song

From: Kairui Song

Currently shmem calls xa_get_order to get the swap radix entry order,
requiring a full tree walk. This can easily be combined with the swap
entry value check (shmem_confirm_swap) to avoid the duplicated lookup,
which should improve performance.
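
As a rough illustration of the combined helper's contract (user-space
sketch only; the structure and names here are invented, the real
implementation is the xarray walk in the hunk below): one lookup now
answers both "is the expected swap entry still present?" and "what is
its order?", returning the order on a match and -1 otherwise, instead
of walking the tree once for each question.

  #include <stdio.h>

  /* Invented stand-in for one mapping slot: the stored value and its order. */
  struct toy_slot {
          unsigned long val;
          int order;
  };

  /* One walk, two answers: -1 means "entry changed, back off (-EEXIST)". */
  static int toy_check_entry(const struct toy_slot *slot, unsigned long expected)
  {
          return slot->val == expected ? slot->order : -1;
  }

  int main(void)
  {
          struct toy_slot slot = { .val = 100, .order = 2 };
          int order = toy_check_entry(&slot, 100);

          if (order < 0)
                  printf("raced with another swapin, back off\n");
          else
                  printf("entry still present, order %d\n", order);
          return 0;
  }
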
Signed-off-by: Kairui Song
Reviewed-by: Kemeng Shi
---
 mm/shmem.c | 33 ++++++++++++++++++++++++---------
 1 file changed, 24 insertions(+), 9 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 4e7ef343a29b..0ad49e57f736 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -505,15 +505,27 @@ static int shmem_replace_entry(struct address_space *mapping,
 
 /*
  * Sometimes, before we decide whether to proceed or to fail, we must check
- * that an entry was not already brought back from swap by a racing thread.
+ * that an entry was not already brought back or split by a racing thread.
  *
  * Checking folio is not enough: by the time a swapcache folio is locked, it
  * might be reused, and again be swapcache, using the same swap as before.
+ * Returns the swap entry's order if it still presents, else returns -1.
  */
-static bool shmem_confirm_swap(struct address_space *mapping,
-                               pgoff_t index, swp_entry_t swap)
+static int shmem_swap_check_entry(struct address_space *mapping, pgoff_t index,
+                                  swp_entry_t swap)
 {
-        return xa_load(&mapping->i_pages, index) == swp_to_radix_entry(swap);
+        XA_STATE(xas, &mapping->i_pages, index);
+        int ret = -1;
+        void *entry;
+
+        rcu_read_lock();
+        do {
+                entry = xas_load(&xas);
+                if (entry == swp_to_radix_entry(swap))
+                        ret = xas_get_order(&xas);
+        } while (xas_retry(&xas, entry));
+        rcu_read_unlock();
+        return ret;
 }
 
 /*
@@ -2256,16 +2268,20 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
                 return -EIO;
 
         si = get_swap_device(swap);
-        if (!si) {
-                if (!shmem_confirm_swap(mapping, index, swap))
+        order = shmem_swap_check_entry(mapping, index, swap);
+        if (unlikely(!si)) {
+                if (order < 0)
                         return -EEXIST;
                 else
                         return -EINVAL;
         }
+        if (unlikely(order < 0)) {
+                put_swap_device(si);
+                return -EEXIST;
+        }
 
         /* Look it up and read it in.. */
         folio = swap_cache_get_folio(swap, NULL, 0);
-        order = xa_get_order(&mapping->i_pages, index);
         if (!folio) {
                 int nr_pages = 1 << order;
                 bool fallback_order0 = false;
@@ -2415,7 +2431,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
         *foliop = folio;
         return 0;
 failed:
-        if (!shmem_confirm_swap(mapping, index, swap))
+        if (shmem_swap_check_entry(mapping, index, swap) < 0)
                 error = -EEXIST;
         if (error == -EIO)
                 shmem_set_folio_swapin_error(inode, index, folio, swap,
@@ -2428,7 +2444,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
                 folio_put(folio);
         }
         put_swap_device(si);
-
         return error;
 }
-- 
2.50.0

From nobody Thu Oct 9 10:49:42 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox, Kemeng Shi,
    Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel@vger.kernel.org,
    Kairui Song
Subject: [PATCH 3/4] mm/shmem, swap: improve mTHP swapin process
Date: Wed, 18 Jun 2025 02:35:02 +0800
Message-ID: <20250617183503.10527-4-ryncsn@gmail.com>
In-Reply-To: <20250617183503.10527-1-ryncsn@gmail.com>
References: <20250617183503.10527-1-ryncsn@gmail.com>
Reply-To: Kairui Song

From: Kairui Song

Tidy up the mTHP swapin workflow. There should be no functional change;
this consolidates the mTHP-related checks in one place, so they are all
wrapped by CONFIG_TRANSPARENT_HUGEPAGE and will be trimmed off by the
compiler if not needed.

Signed-off-by: Kairui Song
Reviewed-by: Kemeng Shi
---
 mm/shmem.c | 175 ++++++++++++++++++++++++-----------------------------
 1 file changed, 78 insertions(+), 97 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 0ad49e57f736..46dea2fa1b43 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1975,31 +1975,51 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
         return ERR_PTR(error);
 }
 
-static struct folio *shmem_swap_alloc_folio(struct inode *inode,
+static struct folio *shmem_swapin_direct(struct inode *inode,
                 struct vm_area_struct *vma, pgoff_t index,
-                swp_entry_t entry, int order, gfp_t gfp)
+                swp_entry_t entry, int *order, gfp_t gfp)
 {
         struct shmem_inode_info *info = SHMEM_I(inode);
+        int nr_pages = 1 << *order;
         struct folio *new;
+        pgoff_t offset;
         void *shadow;
-        int nr_pages;
 
         /*
          * We have arrived here because our zones are constrained, so don't
          * limit chance of success with further cpuset and node constraints.
          */
         gfp &= ~GFP_CONSTRAINT_MASK;
-        if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && order > 0) {
-                gfp_t huge_gfp = vma_thp_gfp_mask(vma);
+        if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+                if (WARN_ON_ONCE(*order))
+                        return ERR_PTR(-EINVAL);
+        } else if (*order) {
+                /*
+                 * If uffd is active for the vma, we need per-page fault
+                 * fidelity to maintain the uffd semantics, then fallback
+                 * to swapin order-0 folio, as well as for zswap case.
+                 * Any existing sub folio in the swap cache also blocks
+                 * mTHP swapin.
+                 */
+                if ((vma && userfaultfd_armed(vma)) ||
+                    !zswap_never_enabled() ||
+                    non_swapcache_batch(entry, nr_pages) != nr_pages) {
+                        offset = index - round_down(index, nr_pages);
+                        entry = swp_entry(swp_type(entry),
+                                          swp_offset(entry) + offset);
+                        *order = 0;
+                        nr_pages = 1;
+                } else {
+                        gfp_t huge_gfp = vma_thp_gfp_mask(vma);
 
-                gfp = limit_gfp_mask(huge_gfp, gfp);
+                        gfp = limit_gfp_mask(huge_gfp, gfp);
+                }
         }
 
-        new = shmem_alloc_folio(gfp, order, info, index);
+        new = shmem_alloc_folio(gfp, *order, info, index);
         if (!new)
                 return ERR_PTR(-ENOMEM);
 
-        nr_pages = folio_nr_pages(new);
         if (mem_cgroup_swapin_charge_folio(new, vma ? vma->vm_mm : NULL,
                                            gfp, entry)) {
                 folio_put(new);
@@ -2165,8 +2185,12 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
         swap_free_nr(swap, nr_pages);
 }
 
-static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
-                                   swp_entry_t swap, gfp_t gfp)
+/*
+ * Split an existing large swap entry. @index should point to one sub mapping
+ * slot within the entry @swap, this sub slot will be split into order 0.
+ */
+static int shmem_split_swap_entry(struct inode *inode, pgoff_t index,
+                                  swp_entry_t swap, gfp_t gfp)
 {
         struct address_space *mapping = inode->i_mapping;
         XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
@@ -2226,7 +2250,6 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
                 cur_order = split_order;
                 split_order = xas_try_split_min_order(split_order);
         }
-
 unlock:
         xas_unlock_irq(&xas);
 
@@ -2237,7 +2260,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
         if (xas_error(&xas))
                 return xas_error(&xas);
 
-        return entry_order;
+        return 0;
 }
 
 /*
@@ -2254,11 +2277,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
         struct address_space *mapping = inode->i_mapping;
         struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
         struct shmem_inode_info *info = SHMEM_I(inode);
+        int error, nr_pages, order, swap_order;
         struct swap_info_struct *si;
         struct folio *folio = NULL;
         bool skip_swapcache = false;
         swp_entry_t swap;
-        int error, nr_pages, order, split_order;
 
         VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
         swap = radix_to_swp_entry(*foliop);
@@ -2283,110 +2306,66 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
         /* Look it up and read it in.. */
         folio = swap_cache_get_folio(swap, NULL, 0);
         if (!folio) {
-                int nr_pages = 1 << order;
-                bool fallback_order0 = false;
-
                 /* Or update major stats only when swapin succeeds?? */
                 if (fault_type) {
                         *fault_type |= VM_FAULT_MAJOR;
                         count_vm_event(PGMAJFAULT);
                         count_memcg_event_mm(fault_mm, PGMAJFAULT);
                 }
-
-                /*
-                 * If uffd is active for the vma, we need per-page fault
-                 * fidelity to maintain the uffd semantics, then fallback
-                 * to swapin order-0 folio, as well as for zswap case.
-                 * Any existing sub folio in the swap cache also blocks
-                 * mTHP swapin.
-                 */
-                if (order > 0 && ((vma && unlikely(userfaultfd_armed(vma))) ||
-                                  !zswap_never_enabled() ||
-                                  non_swapcache_batch(swap, nr_pages) != nr_pages))
-                        fallback_order0 = true;
-
-                /* Skip swapcache for synchronous device. */
-                if (!fallback_order0 && data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
-                        folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp);
+                /* Try direct mTHP swapin bypassing swap cache and readahead */
+                if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
+                        swap_order = order;
+                        folio = shmem_swapin_direct(inode, vma, index,
+                                                    swap, &swap_order, gfp);
                         if (!IS_ERR(folio)) {
                                 skip_swapcache = true;
                                 goto alloced;
                         }
-
-                        /*
-                         * Fallback to swapin order-0 folio unless the swap entry
-                         * already exists.
-                         */
+                        /* Fallback if order > 0 swapin failed with -ENOMEM */
                         error = PTR_ERR(folio);
                         folio = NULL;
-                        if (error == -EEXIST)
+                        if (error != -ENOMEM || !swap_order)
                                 goto failed;
                 }
-
                 /*
-                 * Now swap device can only swap in order 0 folio, then we
-                 * should split the large swap entry stored in the pagecache
-                 * if necessary.
+                 * Try order 0 swapin using swap cache and readahead, it still
+                 * may return order > 0 folio due to raced swap cache.
                  */
-                split_order = shmem_split_large_entry(inode, index, swap, gfp);
-                if (split_order < 0) {
-                        error = split_order;
-                        goto failed;
-                }
-
-                /*
-                 * If the large swap entry has already been split, it is
-                 * necessary to recalculate the new swap entry based on
-                 * the old order alignment.
-                 */
-                if (split_order > 0) {
-                        pgoff_t offset = index - round_down(index, 1 << split_order);
-
-                        swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
-                }
-
-                /* Here we actually start the io */
                 folio = shmem_swapin_cluster(swap, gfp, info, index);
                 if (!folio) {
                         error = -ENOMEM;
                         goto failed;
                 }
-        } else if (order > folio_order(folio)) {
-                /*
-                 * Swap readahead may swap in order 0 folios into swapcache
-                 * asynchronously, while the shmem mapping can still stores
-                 * large swap entries. In such cases, we should split the
-                 * large swap entry to prevent possible data corruption.
-                 */
-                split_order = shmem_split_large_entry(inode, index, swap, gfp);
-                if (split_order < 0) {
-                        folio_put(folio);
-                        folio = NULL;
-                        error = split_order;
-                        goto failed;
-                }
-
-                /*
-                 * If the large swap entry has already been split, it is
-                 * necessary to recalculate the new swap entry based on
-                 * the old order alignment.
-                 */
-                if (split_order > 0) {
-                        pgoff_t offset = index - round_down(index, 1 << split_order);
-
-                        swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
-                }
-        } else if (order < folio_order(folio)) {
-                swap.val = round_down(swp_type(swap), folio_order(folio));
         }
-
 alloced:
+        /*
+         * We need to split an existing large entry if swapin brought in a
+         * smaller folio due to various of reasons.
+         *
+         * And worth noting there is a special case: if there is a smaller
+         * cached folio that covers @swap, but not @index (it only covers
+         * first few sub entries of the large entry, but @index points to
+         * later parts), the swap cache lookup will still see this folio,
+         * And we need to split the large entry here. Later checks will fail,
+         * as it can't satisfy the swap requirement, and we will retry
+         * the swapin from beginning.
+         */
+        swap_order = folio_order(folio);
+        if (order > swap_order) {
+                error = shmem_split_swap_entry(inode, index, swap, gfp);
+                if (error)
+                        goto failed_nolock;
+        }
+
+        index = round_down(index, 1 << swap_order);
+        swap.val = round_down(swap.val, 1 << swap_order);
+
         /* We have to do this with folio locked to prevent races */
         folio_lock(folio);
         if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
             folio->swap.val != swap.val) {
                 error = -EEXIST;
-                goto unlock;
+                goto failed_unlock;
         }
         if (!folio_test_uptodate(folio)) {
                 error = -EIO;
@@ -2407,8 +2386,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
                 goto failed;
         }
 
-        error = shmem_add_to_page_cache(folio, mapping,
-                                        round_down(index, nr_pages),
+        error = shmem_add_to_page_cache(folio, mapping, index,
                                         swp_to_radix_entry(swap), gfp);
         if (error)
                 goto failed;
@@ -2419,8 +2397,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
         folio_mark_accessed(folio);
 
         if (skip_swapcache) {
+                swapcache_clear(si, folio->swap, folio_nr_pages(folio));
                 folio->swap.val = 0;
-                swapcache_clear(si, swap, nr_pages);
         } else {
                 delete_from_swap_cache(folio);
         }
@@ -2436,13 +2414,16 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
         if (error == -EIO)
                 shmem_set_folio_swapin_error(inode, index, folio, swap,
                                              skip_swapcache);
-unlock:
-        if (skip_swapcache)
-                swapcache_clear(si, swap, folio_nr_pages(folio));
-        if (folio) {
+failed_unlock:
+        if (folio)
                 folio_unlock(folio);
-                folio_put(folio);
+failed_nolock:
+        if (skip_swapcache) {
+                swapcache_clear(si, folio->swap, folio_nr_pages(folio));
+                folio->swap.val = 0;
         }
+        if (folio)
+                folio_put(folio);
         put_swap_device(si);
         return error;
 }
-- 
2.50.0

From nobody Thu Oct 9 10:49:43 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox, Kemeng Shi,
    Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel@vger.kernel.org,
    Kairui Song
Subject: [PATCH 4/4] mm/shmem, swap: avoid false positive swap cache lookup
Date: Wed, 18 Jun 2025 02:35:03 +0800
Message-ID: <20250617183503.10527-5-ryncsn@gmail.com>
In-Reply-To: <20250617183503.10527-1-ryncsn@gmail.com>
References: <20250617183503.10527-1-ryncsn@gmail.com>
Reply-To: Kairui Song

From: Kairui Song

If the shmem read request's index points to the middle of a large swap
entry, shmem swap-in does the swap cache lookup using the large swap
entry's starting value (the first sub swap entry of this large entry).
This leads to a false positive lookup result if only the first few swap
entries are cached while the swap entry pointed to by index is
uncached.
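
The arithmetic behind the correct lookup can be shown with a tiny
user-space sketch (made-up numbers, only to illustrate the
index-to-swap-value mapping): for a large entry of the given order
whose first sub entry has swap value base, the sub entry backing index
is base + (index - round_down(index, 1 << order)), and that is the
value the swap cache lookup has to use.

  #include <stdio.h>

  /* Round index down to the first slot of the large entry it falls in. */
  static unsigned long round_down_ul(unsigned long x, unsigned long align)
  {
          return x - (x % align);
  }

  int main(void)
  {
          unsigned long base = 1024;      /* swap value of the entry's first sub entry */
          unsigned int order = 4;         /* the large entry covers 16 slots           */
          unsigned long index = 1000003;  /* faulting mapping index (arbitrary)        */

          unsigned long nr = 1UL << order;
          unsigned long offset = index - round_down_ul(index, nr);

          /*
           * Looking up "base" gives a false positive whenever only the first
           * few sub pages are in the swap cache; "base + offset" asks for the
           * page that actually backs "index".
           */
          printf("offset %lu, swap value to look up: %lu\n", offset, base + offset);
          return 0;
  }
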
Currently shmem will split the large entry and then retry the swapin
from the beginning, which wastes CPU and is fragile. Handle this
correctly instead.

Also add some sanity checks to help understand the code and ensure
things won't go wrong.

Signed-off-by: Kairui Song
Reviewed-by: Kemeng Shi
---
 mm/shmem.c | 61 ++++++++++++++++++++++++++----------------------------
 1 file changed, 29 insertions(+), 32 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 46dea2fa1b43..0bc30dafad90 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1977,12 +1977,12 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
 
 static struct folio *shmem_swapin_direct(struct inode *inode,
                 struct vm_area_struct *vma, pgoff_t index,
-                swp_entry_t entry, int *order, gfp_t gfp)
+                swp_entry_t swap_entry, swp_entry_t swap,
+                int *order, gfp_t gfp)
 {
         struct shmem_inode_info *info = SHMEM_I(inode);
         int nr_pages = 1 << *order;
         struct folio *new;
-        pgoff_t offset;
         void *shadow;
 
         /*
@@ -2003,13 +2003,11 @@ static struct folio *shmem_swapin_direct(struct inode *inode,
          */
         if ((vma && userfaultfd_armed(vma)) ||
             !zswap_never_enabled() ||
-            non_swapcache_batch(entry, nr_pages) != nr_pages) {
-                offset = index - round_down(index, nr_pages);
-                entry = swp_entry(swp_type(entry),
-                                  swp_offset(entry) + offset);
+            non_swapcache_batch(swap_entry, nr_pages) != nr_pages) {
                 *order = 0;
                 nr_pages = 1;
         } else {
+                swap.val = swap_entry.val;
                 gfp_t huge_gfp = vma_thp_gfp_mask(vma);
 
                 gfp = limit_gfp_mask(huge_gfp, gfp);
@@ -2021,7 +2019,7 @@ static struct folio *shmem_swapin_direct(struct inode *inode,
                 return ERR_PTR(-ENOMEM);
 
         if (mem_cgroup_swapin_charge_folio(new, vma ? vma->vm_mm : NULL,
-                                           gfp, entry)) {
+                                           gfp, swap)) {
                 folio_put(new);
                 return ERR_PTR(-ENOMEM);
         }
@@ -2036,17 +2034,17 @@ static struct folio *shmem_swapin_direct(struct inode *inode,
          * In this case, shmem_add_to_page_cache() will help identify the
          * concurrent swapin and return -EEXIST.
          */
-        if (swapcache_prepare(entry, nr_pages)) {
+        if (swapcache_prepare(swap, nr_pages)) {
                 folio_put(new);
                 return ERR_PTR(-EEXIST);
         }
 
         __folio_set_locked(new);
         __folio_set_swapbacked(new);
-        new->swap = entry;
+        new->swap = swap;
 
-        memcg1_swapin(entry, nr_pages);
-        shadow = get_shadow_from_swap_cache(entry);
+        memcg1_swapin(swap, nr_pages);
+        shadow = get_shadow_from_swap_cache(swap);
         if (shadow)
                 workingset_refault(new, shadow);
         folio_add_lru(new);
@@ -2278,20 +2276,21 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
         struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
         struct shmem_inode_info *info = SHMEM_I(inode);
         int error, nr_pages, order, swap_order;
+        swp_entry_t swap, swap_entry;
         struct swap_info_struct *si;
         struct folio *folio = NULL;
         bool skip_swapcache = false;
-        swp_entry_t swap;
+        pgoff_t offset;
 
         VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
-        swap = radix_to_swp_entry(*foliop);
+        swap_entry = radix_to_swp_entry(*foliop);
         *foliop = NULL;
 
-        if (is_poisoned_swp_entry(swap))
+        if (is_poisoned_swp_entry(swap_entry))
                 return -EIO;
 
-        si = get_swap_device(swap);
-        order = shmem_swap_check_entry(mapping, index, swap);
+        si = get_swap_device(swap_entry);
+        order = shmem_swap_check_entry(mapping, index, swap_entry);
         if (unlikely(!si)) {
                 if (order < 0)
                         return -EEXIST;
@@ -2303,7 +2302,9 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
                 return -EEXIST;
         }
 
-        /* Look it up and read it in.. */
+        /* @index may points to the middle of a large entry, get the real swap value first */
+        offset = index - round_down(index, 1 << order);
+        swap.val = swap_entry.val + offset;
         folio = swap_cache_get_folio(swap, NULL, 0);
         if (!folio) {
                 /* Or update major stats only when swapin succeeds?? */
@@ -2315,7 +2316,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
                 /* Try direct mTHP swapin bypassing swap cache and readahead */
                 if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
                         swap_order = order;
-                        folio = shmem_swapin_direct(inode, vma, index,
+                        folio = shmem_swapin_direct(inode, vma, index, swap_entry,
                                                     swap, &swap_order, gfp);
                         if (!IS_ERR(folio)) {
                                 skip_swapcache = true;
@@ -2338,28 +2339,25 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
                 }
         }
 alloced:
+        swap_order = folio_order(folio);
+        nr_pages = folio_nr_pages(folio);
+
+        /* The swap-in should cover both @swap and @index */
+        swap.val = round_down(swap.val, nr_pages);
+        VM_WARN_ON_ONCE(swap.val > swap_entry.val + offset);
+        VM_WARN_ON_ONCE(swap.val + nr_pages <= swap_entry.val + offset);
+
         /*
          * We need to split an existing large entry if swapin brought in a
          * smaller folio due to various of reasons.
-         *
-         * And worth noting there is a special case: if there is a smaller
-         * cached folio that covers @swap, but not @index (it only covers
-         * first few sub entries of the large entry, but @index points to
-         * later parts), the swap cache lookup will still see this folio,
-         * And we need to split the large entry here. Later checks will fail,
-         * as it can't satisfy the swap requirement, and we will retry
-         * the swapin from beginning.
          */
-        swap_order = folio_order(folio);
+        index = round_down(index, nr_pages);
         if (order > swap_order) {
-                error = shmem_split_swap_entry(inode, index, swap, gfp);
+                error = shmem_split_swap_entry(inode, index, swap_entry, gfp);
                 if (error)
                         goto failed_nolock;
         }
 
-        index = round_down(index, 1 << swap_order);
-        swap.val = round_down(swap.val, 1 << swap_order);
-
         /* We have to do this with folio locked to prevent races */
         folio_lock(folio);
         if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
@@ -2372,7 +2370,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
                 goto failed;
         }
         folio_wait_writeback(folio);
-        nr_pages = folio_nr_pages(folio);
 
         /*
          * Some architectures may have to restore extra metadata to the
-- 
2.50.0