From nobody Thu Dec 18 08:27:53 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Hugh Dickins, Yosry Ahmed,
 "Huang, Ying", Baoquan He, Nhat Pham, Johannes Weiner, Baolin Wang,
 Kalesh Singh, Matthew Wilcox, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v2 1/7] mm, swap: avoid reclaiming irrelevant swap cache
Date: Tue, 25 Feb 2025 02:02:06 +0800
Message-ID: <20250224180212.22802-2-ryncsn@gmail.com>
X-Mailer: git-send-email 2.48.1
In-Reply-To: <20250224180212.22802-1-ryncsn@gmail.com>
References: <20250224180212.22802-1-ryncsn@gmail.com>
Reply-To: Kairui Song

From: Kairui Song

The swap allocator performs swap cache reclaim to recycle HAS_CACHE
slots for allocation. It initiates the reclaim from the offset to be
reclaimed and looks up the corresponding folio. The lookup process is
lockless, so it's possible for the folio to be removed from the swap
cache and given a different swap entry before the reclaim locks it. If
that happens, the reclaim ends up reclaiming an irrelevant folio and
returning a wrong return value. This shouldn't cause any correctness
or stability problem, but it is confusing and unexpected, and it
increases fragmentation and hurts performance.

Fix this by checking whether the folio still points to the offset the
allocator wants to reclaim before reclaiming it.

Signed-off-by: Kairui Song
Reviewed-by: Baoquan He
---
 mm/swapfile.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index a7f60006c52c..5618cd1c4b03 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -210,6 +210,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
 	int ret, nr_pages;
 	bool need_reclaim;
 
+again:
 	folio = filemap_get_folio(address_space, swap_cache_index(entry));
 	if (IS_ERR(folio))
 		return 0;
@@ -227,8 +228,16 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
 	if (!folio_trylock(folio))
 		goto out;
 
-	/* offset could point to the middle of a large folio */
+	/*
+	 * Offset could point to the middle of a large folio, or folio
+	 * may no longer point to the expected offset before it's locked.
+	 */
 	entry = folio->swap;
+	if (offset < swp_offset(entry) || offset >= swp_offset(entry) + nr_pages) {
+		folio_unlock(folio);
+		folio_put(folio);
+		goto again;
+	}
 	offset = swp_offset(entry);
 
 	need_reclaim = ((flags & TTRS_ANYWAY) ||
-- 
2.48.1
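The fix above is an instance of the general lookup-then-lock rule: any
property observed during a lockless lookup must be revalidated once the
object is actually locked. A minimal standalone sketch of that rule, using
invented types and helpers (item, lookup, trylock_item, ...) rather than
the kernel API:

        /*
         * Lookup-then-lock revalidation sketch. All names here are
         * hypothetical stand-ins, not kernel functions.
         */
        #include <stdbool.h>
        #include <stddef.h>

        struct item {
                unsigned long start;    /* first offset this item covers */
                unsigned long len;      /* number of offsets covered */
        };

        extern struct item *lookup(unsigned long offset); /* lockless, takes a ref */
        extern bool trylock_item(struct item *it);
        extern void unlock_item(struct item *it);
        extern void put_item(struct item *it);

        static struct item *lookup_locked(unsigned long offset)
        {
                struct item *it;
        again:
                it = lookup(offset);
                if (!it)
                        return NULL;
                if (!trylock_item(it)) {
                        put_item(it);
                        return NULL;
                }
                /*
                 * The lockless lookup may have raced: the item could have
                 * been recycled for a different range before we locked it,
                 * so the pre-lock assumption must be rechecked here.
                 */
                if (offset < it->start || offset >= it->start + it->len) {
                        unlock_item(it);
                        put_item(it);
                        goto again;     /* another item owns this offset now */
                }
                return it;              /* locked and still covering offset */
        }

The retry loop mirrors the patch's `goto again`: once the stale folio is
unlocked and released, the lookup is repeated until it either finds a
folio that still owns the offset or finds nothing to reclaim.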
From nobody Thu Dec 18 08:27:53 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Hugh Dickins, Yosry Ahmed,
 "Huang, Ying", Baoquan He, Nhat Pham, Johannes Weiner, Baolin Wang,
 Kalesh Singh, Matthew Wilcox, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v2 2/7] mm, swap: drop the flag TTRS_DIRECT
Date: Tue, 25 Feb 2025 02:02:07 +0800
Message-ID: <20250224180212.22802-3-ryncsn@gmail.com>
X-Mailer: git-send-email 2.48.1
In-Reply-To: <20250224180212.22802-1-ryncsn@gmail.com>
References: <20250224180212.22802-1-ryncsn@gmail.com>
Reply-To: Kairui Song

From: Kairui Song

This flag existed temporarily to allow the allocator to bypass the slot
cache during freeing, so reclaiming one slot would free that slot
immediately. But slot cache usage on the freeing path has already been
removed, so this flag no longer has any effect.
Signed-off-by: Kairui Song
Reviewed-by: Baoquan He
---
 mm/swapfile.c | 23 +++--------------------
 1 file changed, 3 insertions(+), 20 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5618cd1c4b03..6f2de59c6355 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -158,8 +158,6 @@ static long swap_usage_in_pages(struct swap_info_struct *si)
 #define TTRS_UNMAPPED		0x2
 /* Reclaim the swap entry if swap is getting full */
 #define TTRS_FULL		0x4
-/* Reclaim directly, bypass the slot cache and don't touch device lock */
-#define TTRS_DIRECT		0x8
 
 static bool swap_only_has_cache(struct swap_info_struct *si,
 				unsigned long offset, int nr_pages)
@@ -257,23 +255,8 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
 	if (!need_reclaim)
 		goto out_unlock;
 
-	if (!(flags & TTRS_DIRECT)) {
-		/* Free through slot cache */
-		delete_from_swap_cache(folio);
-		folio_set_dirty(folio);
-		ret = nr_pages;
-		goto out_unlock;
-	}
-
-	xa_lock_irq(&address_space->i_pages);
-	__delete_from_swap_cache(folio, entry, NULL);
-	xa_unlock_irq(&address_space->i_pages);
-	folio_ref_sub(folio, nr_pages);
+	delete_from_swap_cache(folio);
 	folio_set_dirty(folio);
-
-	ci = lock_cluster(si, offset);
-	swap_entry_range_free(si, ci, entry, nr_pages);
-	unlock_cluster(ci);
 	ret = nr_pages;
 out_unlock:
 	folio_unlock(folio);
@@ -697,7 +680,7 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
 			offset++;
 			break;
 		case SWAP_HAS_CACHE:
-			nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT);
+			nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
 			if (nr_reclaim > 0)
 				offset += nr_reclaim;
 			else
@@ -849,7 +832,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
 		if (READ_ONCE(map[offset]) == SWAP_HAS_CACHE) {
 			spin_unlock(&ci->lock);
 			nr_reclaim = __try_to_reclaim_swap(si, offset,
-							   TTRS_ANYWAY | TTRS_DIRECT);
+							   TTRS_ANYWAY);
 			spin_lock(&ci->lock);
 			if (nr_reclaim) {
 				offset += abs(nr_reclaim);
-- 
2.48.1
From nobody Thu Dec 18 08:27:53 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Hugh Dickins, Yosry Ahmed,
 "Huang, Ying", Baoquan He, Nhat Pham, Johannes Weiner, Baolin Wang,
 Kalesh Singh, Matthew Wilcox, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v2 3/7] mm, swap: avoid redundant swap device pinning
Date: Tue, 25 Feb 2025 02:02:08 +0800
Message-ID: <20250224180212.22802-4-ryncsn@gmail.com>
X-Mailer: git-send-email 2.48.1
In-Reply-To: <20250224180212.22802-1-ryncsn@gmail.com>
References: <20250224180212.22802-1-ryncsn@gmail.com>
Reply-To: Kairui Song
From: Kairui Song

Currently __read_swap_cache_async() calls get/put_swap_device() to take
and drop a swap device reference, preventing swapoff. Some of its
callers already hold a swap device reference, e.g. do_swap_page() and
shmem_swapin_folio(), from which __read_swap_cache_async() is
eventually called. Only two callers do not hold a swap device reference
yet, so make them take one instead, and drop the get/put_swap_device()
calls in __read_swap_cache_async(). This should slightly reduce the
overhead of swap-in during page faults.

Signed-off-by: Kairui Song
Reviewed-by: Baoquan He
---
 mm/swap_state.c | 14 ++++++++------
 mm/zswap.c      |  6 ++++++
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index a54b035d6a6c..50840a2887a5 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -426,17 +426,13 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
 		bool skip_if_exists)
 {
-	struct swap_info_struct *si;
+	struct swap_info_struct *si = swp_swap_info(entry);
 	struct folio *folio;
 	struct folio *new_folio = NULL;
 	struct folio *result = NULL;
 	void *shadow = NULL;
 
 	*new_page_allocated = false;
-	si = get_swap_device(entry);
-	if (!si)
-		return NULL;
-
 	for (;;) {
 		int err;
 		/*
@@ -532,7 +528,6 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	put_swap_folio(new_folio, entry);
 	folio_unlock(new_folio);
put_and_return:
-	put_swap_device(si);
 	if (!(*new_page_allocated) && new_folio)
 		folio_put(new_folio);
 	return result;
@@ -552,11 +547,16 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		struct vm_area_struct *vma, unsigned long addr,
 		struct swap_iocb **plug)
 {
+	struct swap_info_struct *si;
 	bool page_allocated;
 	struct mempolicy *mpol;
 	pgoff_t ilx;
 	struct folio *folio;
 
+	si = get_swap_device(entry);
+	if (!si)
+		return NULL;
+
 	mpol = get_vma_policy(vma, addr, 0, &ilx);
 	folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
 					&page_allocated, false);
@@ -564,6 +564,8 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 
 	if (page_allocated)
 		swap_read_folio(folio, plug);
+
+	put_swap_device(si);
 	return folio;
 }
 
diff --git a/mm/zswap.c b/mm/zswap.c
index ac9d299e7d0c..83dfa1f9e689 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1051,14 +1051,20 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	struct folio *folio;
 	struct mempolicy *mpol;
 	bool folio_was_allocated;
+	struct swap_info_struct *si;
 	struct writeback_control wbc = {
 		.sync_mode = WB_SYNC_NONE,
 	};
 
 	/* try to allocate swap cache folio */
+	si = get_swap_device(swpentry);
+	if (!si)
+		return -EEXIST;
+
 	mpol = get_task_policy(current);
 	folio = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
 				NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
+	put_swap_device(si);
 	if (!folio)
 		return -ENOMEM;
 
-- 
2.48.1
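The patch moves teardown protection from the callee to its callers: the
inner function now assumes the device is already pinned, and only the
outer entry points that do not inherit a pin take one. A hedged
userspace-style sketch of that shape, with all names invented:

        /*
         * Sketch: callee assumes the device is pinned; entry points pin.
         * dev, dev_tryget, dev_put and read_async are invented names,
         * not the kernel API.
         */
        #include <stddef.h>

        struct dev;
        extern struct dev *dev_tryget(int id);  /* NULL if being torn down */
        extern void dev_put(struct dev *d);

        /* Callee: no get/put of its own, the caller must hold a reference. */
        extern void *read_async(struct dev *d, long offset);

        /* Entry point that does not inherit a pinned device. */
        static void *read_entry(int id, long offset)
        {
                struct dev *d = dev_tryget(id);
                void *ret;

                if (!d)
                        return NULL;    /* device going away, like swapoff */
                ret = read_async(d, offset); /* safe: teardown is blocked */
                dev_put(d);
                return ret;
        }

Callers that already run under a pin (the page fault path in the patch)
call the inner function directly and skip the redundant get/put pair.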
From nobody Thu Dec 18 08:27:53 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Hugh Dickins, Yosry Ahmed,
 "Huang, Ying", Baoquan He, Nhat Pham, Johannes Weiner, Baolin Wang,
 Kalesh Singh, Matthew Wilcox, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v2 4/7] mm, swap: don't update the counter up-front
Date: Tue, 25 Feb 2025 02:02:09 +0800
Message-ID: <20250224180212.22802-5-ryncsn@gmail.com>
X-Mailer: git-send-email 2.48.1
In-Reply-To: <20250224180212.22802-1-ryncsn@gmail.com>
References: <20250224180212.22802-1-ryncsn@gmail.com>
Reply-To: Kairui Song

From: Kairui Song

Updating the counter before allocation was useful to avoid an
unnecessary scan when the device is full: the allocator could abort
early if the counter indicated the device was full. But that is an
uncommon case, and scanning a full device is now very fast, so the
up-front update is no longer helpful. Remove it and simplify the slot
allocation logic.

Signed-off-by: Kairui Song
Reviewed-by: Baoquan He
---
 mm/swapfile.c | 18 ++----------------
 1 file changed, 2 insertions(+), 16 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6f2de59c6355..db836670c334 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1201,22 +1201,10 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 	int order = swap_entry_order(entry_order);
 	unsigned long size = 1 << order;
 	struct swap_info_struct *si, *next;
-	long avail_pgs;
 	int n_ret = 0;
 	int node;
 
 	spin_lock(&swap_avail_lock);
-
-	avail_pgs = atomic_long_read(&nr_swap_pages) / size;
-	if (avail_pgs <= 0) {
-		spin_unlock(&swap_avail_lock);
-		goto noswap;
-	}
-
-	n_goal = min3((long)n_goal, (long)SWAP_BATCH, avail_pgs);
-
-	atomic_long_sub(n_goal * size, &nr_swap_pages);
-
start_over:
 	node = numa_node_id();
 	plist_for_each_entry_safe(si, next, &swap_avail_heads[node], avail_lists[node]) {
@@ -1250,10 +1238,8 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 	spin_unlock(&swap_avail_lock);
 
check_out:
-	if (n_ret < n_goal)
-		atomic_long_add((long)(n_goal - n_ret) * size,
-				&nr_swap_pages);
-noswap:
+	atomic_long_sub(n_ret * size, &nr_swap_pages);
+
 	return n_ret;
 }
 
-- 
2.48.1
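The accounting change is from reserve-up-front-then-refund to
settle-once-after: subtract only what was actually allocated. A toy
sketch with a hypothetical counter and allocator:

        /*
         * Toy sketch of settle-after accounting; free_slots and
         * alloc_slots are hypothetical, not the kernel's names.
         */
        #include <stdatomic.h>

        static atomic_long free_slots;          /* assumed global counter */
        extern int alloc_slots(int want);       /* returns number obtained */

        static int get_slots(int want)
        {
                int got = alloc_slots(want);    /* may return fewer */

                /*
                 * One counter update after the fact: no early-abort read
                 * of the counter, and no refund path when the scan comes
                 * up short.
                 */
                atomic_fetch_sub(&free_slots, got);
                return got;
        }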
From nobody Thu Dec 18 08:27:53 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li,
 Barry Song, Hugh Dickins, Yosry Ahmed, "Huang, Ying", Baoquan He,
 Nhat Pham, Johannes Weiner, Baolin Wang, Kalesh Singh, Matthew Wilcox,
 linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v2 5/7] mm, swap: use percpu cluster as allocation fast path
Date: Tue, 25 Feb 2025 02:02:10 +0800
Message-ID: <20250224180212.22802-6-ryncsn@gmail.com>
X-Mailer: git-send-email 2.48.1
In-Reply-To: <20250224180212.22802-1-ryncsn@gmail.com>
References: <20250224180212.22802-1-ryncsn@gmail.com>
Reply-To: Kairui Song

From: Kairui Song

The current allocation workflow first traverses the plist with a global
lock held, and after choosing a device, it uses the percpu cluster on
that swap device. This commit moves the percpu cluster variable out of
being tied to individual swap devices, making it a global percpu
variable that is used directly for allocation as a fast path.

The global percpu cluster variable will never point to an HDD device,
and allocations on HDD devices are still globally serialized. This
improves the allocator performance and prepares for removal of the slot
cache in later commits.

There shouldn't be much observable behavior change, except one thing:
this changes how swap device allocation rotation works. Currently, each
allocation rotates the plist, and because of the slot cache (one
order 0 allocation usually returns 64 entries), swap devices of the
same priority are rotated for every 64 order 0 entries consumed. High
order allocations are different: they bypass the slot cache, so the
swap device is rotated for every 16K, 32K, or up to 2M allocation.

The rotation rule was never clearly defined or documented and was
changed several times without mention. After this commit, and once the
slot cache is gone in later commits, swap device rotation happens for
every consumed cluster. Ideally, non-HDD devices will be rotated once
2M of space has been consumed for each order. Fragmented clusters will
rotate the device faster, which seems OK. HDD devices are rotated for
every allocation regardless of the allocation order, which should be OK
too and is trivial.

This commit also slightly changes allocation behaviour for the slot
cache: the new cluster allocation fast path may allocate entries from a
different device into the slot cache. This is not observable from user
space, impacts performance only very slightly, and the slot cache is
removed in the next commit anyway, so it can be ignored.

Signed-off-by: Kairui Song
Reviewed-by: Baoquan He
---
 include/linux/swap.h |  11 ++--
 mm/swapfile.c        | 136 +++++++++++++++++++++++++++++--------------
 2 files changed, 95 insertions(+), 52 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2fe91c293636..374bffc87427 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -284,12 +284,10 @@ enum swap_cluster_flags {
 #endif
 
 /*
- * We assign a cluster to each CPU, so each CPU can allocate swap entry from
- * its own cluster and swapout sequentially. The purpose is to optimize swapout
- * throughput.
+ * We keep using the same cluster for rotational devices so IO will be
+ * sequential. The purpose is to optimize SWAP throughput on these devices.
 */
-struct percpu_cluster {
-	local_lock_t lock; /* Protect the percpu_cluster above */
+struct swap_sequential_cluster {
 	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
 };
 
@@ -315,8 +313,7 @@ struct swap_info_struct {
 	atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
 	unsigned int pages;		/* total of usable pages of swap */
 	atomic_long_t inuse_pages;	/* number of those currently in use */
-	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
-	struct percpu_cluster *global_cluster; /* Use one global cluster for rotating device */
+	struct swap_sequential_cluster *global_cluster; /* Use one global cluster for rotating device */
 	spinlock_t global_cluster_lock;	/* Serialize usage of global cluster */
 	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
 	struct block_device *bdev;	/* swap device or bdev of swap file */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index db836670c334..7caaaea95408 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -116,6 +116,18 @@ static atomic_t proc_poll_event = ATOMIC_INIT(0);
 
 atomic_t nr_rotate_swap = ATOMIC_INIT(0);
 
+struct percpu_swap_cluster {
+	struct swap_info_struct *si[SWAP_NR_ORDERS];
+	unsigned long offset[SWAP_NR_ORDERS];
+	local_lock_t lock;
+};
+
+static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
+	.si = { NULL },
+	.offset = { SWAP_ENTRY_INVALID },
+	.lock = INIT_LOCAL_LOCK(),
+};
+
 static struct swap_info_struct *swap_type_to_swap_info(int type)
 {
 	if (type >= MAX_SWAPFILES)
@@ -539,7 +551,7 @@ static bool swap_do_scheduled_discard(struct swap_info_struct *si)
 	ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
 	/*
 	 * Delete the cluster from list to prepare for discard, but keep
-	 * the CLUSTER_FLAG_DISCARD flag, there could be percpu_cluster
+	 * the CLUSTER_FLAG_DISCARD flag, percpu_swap_cluster could be
 	 * pointing to it, or ran into by relocate_cluster.
 	 */
 	list_del(&ci->list);
@@ -805,10 +817,12 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 out:
 	relocate_cluster(si, ci);
 	unlock_cluster(ci);
-	if (si->flags & SWP_SOLIDSTATE)
-		__this_cpu_write(si->percpu_cluster->next[order], next);
-	else
+	if (si->flags & SWP_SOLIDSTATE) {
+		__this_cpu_write(percpu_swap_cluster.si[order], si);
+		__this_cpu_write(percpu_swap_cluster.offset[order], next);
+	} else {
 		si->global_cluster->next[order] = next;
+	}
 	return found;
 }
 
@@ -862,9 +876,8 @@ static void swap_reclaim_work(struct work_struct *work)
 }
 
 /*
- * Try to get swap entries with specified order from current cpu's swap entry
- * pool (a cluster). This might involve allocating a new cluster for current CPU
- * too.
+ * Try to allocate swap entries with specified order and try to set a new
+ * cluster for current CPU too.
 */
 static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
 					      unsigned char usage)
@@ -872,18 +885,12 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
 	struct swap_cluster_info *ci;
 	unsigned int offset, found = 0;
 
-	if (si->flags & SWP_SOLIDSTATE) {
-		/* Fast path using per CPU cluster */
-		local_lock(&si->percpu_cluster->lock);
-		offset = __this_cpu_read(si->percpu_cluster->next[order]);
-	} else {
+	if (!(si->flags & SWP_SOLIDSTATE)) {
 		/* Serialize HDD SWAP allocation for each device. */
 		spin_lock(&si->global_cluster_lock);
 		offset = si->global_cluster->next[order];
-	}
-
-	if (offset) {
 		ci = lock_cluster(si, offset);
+		/* Cluster could have been used by another order */
 		if (cluster_is_usable(ci, order)) {
 			if (cluster_is_empty(ci))
 				offset = cluster_offset(si, ci);
@@ -973,9 +980,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
 		}
 	}
 done:
-	if (si->flags & SWP_SOLIDSTATE)
-		local_unlock(&si->percpu_cluster->lock);
-	else
+	if (!(si->flags & SWP_SOLIDSTATE))
 		spin_unlock(&si->global_cluster_lock);
 	return found;
 }
@@ -1196,6 +1201,49 @@ static bool get_swap_device_info(struct swap_info_struct *si)
 	return true;
 }
 
+/*
+ * Fast path try to get swap entries with specified order from current
+ * CPU's swap entry pool (a cluster).
+ */
+static int swap_alloc_fast(swp_entry_t entries[],
+			   unsigned char usage,
+			   int order, int n_goal)
+{
+	struct swap_cluster_info *ci;
+	struct swap_info_struct *si;
+	unsigned int offset, found;
+	int n_ret = 0;
+
+	n_goal = min(n_goal, SWAP_BATCH);
+
+	/*
+	 * Once allocated, swap_info_struct will never be completely freed,
+	 * so checking its liveness by get_swap_device_info is enough.
+	 */
+	si = __this_cpu_read(percpu_swap_cluster.si[order]);
+	offset = __this_cpu_read(percpu_swap_cluster.offset[order]);
+	if (!si || !offset || !get_swap_device_info(si))
+		return 0;
+
+	while (offset) {
+		ci = lock_cluster(si, offset);
+		if (!cluster_is_usable(ci, order))
+			break;
+		if (cluster_is_empty(ci))
+			offset = cluster_offset(si, ci);
+		found = alloc_swap_scan_cluster(si, ci, offset, order, usage);
+		if (!found)
+			break;
+		entries[n_ret++] = swp_entry(si->type, found);
+		if (n_ret == n_goal)
+			break;
+		offset = __this_cpu_read(percpu_swap_cluster.offset[order]);
+	}
+
+	put_swap_device(si);
+	return n_ret;
+}
+
 int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 {
 	int order = swap_entry_order(entry_order);
@@ -1204,19 +1252,36 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 	int n_ret = 0;
 	int node;
 
+	/* Fast path using percpu cluster */
+	local_lock(&percpu_swap_cluster.lock);
+	n_ret = swap_alloc_fast(swp_entries,
+				SWAP_HAS_CACHE,
+				order, n_goal);
+	if (n_ret == n_goal)
+		goto out;
+
+	n_goal = min_t(int, n_goal - n_ret, SWAP_BATCH);
+	/* Rotate the device and switch to a new cluster */
 	spin_lock(&swap_avail_lock);
start_over:
 	node = numa_node_id();
 	plist_for_each_entry_safe(si, next, &swap_avail_heads[node], avail_lists[node]) {
-		/* requeue si to after same-priority siblings */
 		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
 		spin_unlock(&swap_avail_lock);
 		if (get_swap_device_info(si)) {
-			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
-					n_goal, swp_entries, order);
+			/*
+			 * For order 0 allocations, try our best to fill the
+			 * request, as it's used by the slot cache.
+			 *
+			 * For mTHP allocations, n_goal is always 1, and a
+			 * failed mTHP swapin will just make the caller fall
+			 * back to order 0 allocation, so just bail out.
+			 */
+			n_ret += scan_swap_map_slots(si, SWAP_HAS_CACHE, n_goal,
+						     swp_entries + n_ret, order);
 			put_swap_device(si);
 			if (n_ret || size > 1)
-				goto check_out;
+				goto out;
 		}
 
 		spin_lock(&swap_avail_lock);
@@ -1234,12 +1299,10 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 		if (plist_node_empty(&next->avail_lists[node]))
 			goto start_over;
 	}
-
 	spin_unlock(&swap_avail_lock);
-
-check_out:
+out:
+	local_unlock(&percpu_swap_cluster.lock);
 	atomic_long_sub(n_ret * size, &nr_swap_pages);
-
 	return n_ret;
 }
 
@@ -2725,8 +2788,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	arch_swap_invalidate_area(p->type);
 	zswap_swapoff(p->type);
 	mutex_unlock(&swapon_mutex);
-	free_percpu(p->percpu_cluster);
-	p->percpu_cluster = NULL;
 	kfree(p->global_cluster);
 	p->global_cluster = NULL;
 	vfree(swap_map);
@@ -3125,7 +3186,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
 	struct swap_cluster_info *cluster_info;
 	unsigned long i, j, idx;
-	int cpu, err = -ENOMEM;
+	int err = -ENOMEM;
 
 	cluster_info = kvcalloc(nr_clusters, sizeof(*cluster_info), GFP_KERNEL);
 	if (!cluster_info)
@@ -3134,20 +3195,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	for (i = 0; i < nr_clusters; i++)
 		spin_lock_init(&cluster_info[i].lock);
 
-	if (si->flags & SWP_SOLIDSTATE) {
-		si->percpu_cluster = alloc_percpu(struct percpu_cluster);
-		if (!si->percpu_cluster)
-			goto err_free;
-
-		for_each_possible_cpu(cpu) {
-			struct percpu_cluster *cluster;
-
-			cluster = per_cpu_ptr(si->percpu_cluster, cpu);
-			for (i = 0; i < SWAP_NR_ORDERS; i++)
-				cluster->next[i] = SWAP_ENTRY_INVALID;
-			local_lock_init(&cluster->lock);
-		}
-	} else {
+	if (!(si->flags & SWP_SOLIDSTATE)) {
 		si->global_cluster = kmalloc(sizeof(*si->global_cluster),
 			     GFP_KERNEL);
 		if (!si->global_cluster)
@@ -3424,8 +3472,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
bad_swap_unlock_inode:
 	inode_unlock(inode);
bad_swap:
-	free_percpu(si->percpu_cluster);
-	si->percpu_cluster = NULL;
 	kfree(si->global_cluster);
 	si->global_cluster = NULL;
 	inode = NULL;
-- 
2.48.1
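The structure of the new fast path generalizes well: consult a CPU-local
(here modelled as thread-local) cached allocation target first, and fall
back to the globally locked device list only when the cache is empty or
runs dry. A userspace sketch of that shape, with all names invented:

        /*
         * Per-thread allocation fast path sketch; pool, pool_alloc_near
         * and pick_pool_slow are invented stand-ins, and _Thread_local
         * stands in for per-CPU data plus local_lock.
         */
        struct pool;
        extern long pool_alloc_near(struct pool *p, long hint); /* 0 = failed */
        extern struct pool *pick_pool_slow(long *hint);         /* global lock */

        static _Thread_local struct pool *cached_pool;
        static _Thread_local long cached_hint;

        static long fast_alloc(void)
        {
                long got = 0;

                /*
                 * Fast path: reuse the pool and offset that served this
                 * thread last time, keeping allocations sequential.
                 */
                if (cached_pool)
                        got = pool_alloc_near(cached_pool, cached_hint);

                /*
                 * Slow path: take the global lock, pick (and rotate) a
                 * pool, and re-arm the cache for the next allocation.
                 */
                if (!got) {
                        cached_pool = pick_pool_slow(&cached_hint);
                        if (cached_pool)
                                got = pool_alloc_near(cached_pool, cached_hint);
                }
                if (got)
                        cached_hint = got + 1;  /* likely next offset */
                return got;
        }

In the patch itself the cached target is per order, the slow path is the
plist walk in get_swap_pages(), and rotation falls out naturally: the
plist is only touched when a cached cluster is exhausted.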
From nobody Thu Dec 18 08:27:53 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Hugh Dickins, Yosry Ahmed,
 "Huang, Ying", Baoquan He, Nhat Pham, Johannes Weiner, Baolin Wang,
 Kalesh Singh, Matthew Wilcox, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v2 6/7] mm, swap: remove swap slot cache
Date: Tue, 25 Feb 2025 02:02:11 +0800
Message-ID: <20250224180212.22802-7-ryncsn@gmail.com>
X-Mailer: git-send-email 2.48.1
In-Reply-To: <20250224180212.22802-1-ryncsn@gmail.com>
References: <20250224180212.22802-1-ryncsn@gmail.com>
Reply-To: Kairui Song

From: Kairui Song

The slot cache is no longer needed; remove it and all related code.

- vm-scalability with: `usemem --init-time -O -y -x -R -31 1G`,
  12G memory cgroup using simulated pmem as SWAP (32G pmem, 32 CPUs),
  16 test runs for each case, measuring the total throughput:

                        Before (KB/s) (stdev)   After (KB/s) (stdev)
  Random (4K):          424907.60 (24410.78)    414745.92 (34554.78)
  Random (64K):         163308.82 (11635.72)    167314.50 (18434.99)
  Sequential (4K, !-R): 6150056.79 (103205.90)  6321469.06 (115878.16)

  The performance changes are below the noise level.

- Build the Linux kernel with make -j96, using 4K folios with a 1.5G
  memory cgroup limit and 64K folios with a 2G memory cgroup limit, on
  top of tmpfs, 12 test runs, measuring the system time:

                   Before (s) (stdev)   After (s) (stdev)
  make -j96 (4K):  6445.69 (61.95)      6408.80 (69.46)
  make -j96 (64K): 6841.71 (409.04)     6437.99 (435.55)

  Similar to the above, the 64K mTHP case shows a slight improvement.

Signed-off-by: Kairui Song
Reviewed-by: Baoquan He
---
 include/linux/swap.h       |   2 -
 include/linux/swap_slots.h |  28 ----
 mm/Makefile                |   2 +-
 mm/swap_slots.c            | 295 -------------------------------------
 mm/swap_state.c            |   8 +-
 mm/swapfile.c              | 193 +++++++++---------------
 6 files changed, 71 insertions(+), 457 deletions(-)
 delete mode 100644 include/linux/swap_slots.h
 delete mode 100644 mm/swap_slots.c

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 374bffc87427..a0a262bcaf41 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -465,7 +465,6 @@ void free_pages_and_swap_cache(struct encoded_page **, int);
 extern atomic_long_t nr_swap_pages;
 extern long total_swap_pages;
 extern atomic_t nr_rotate_swap;
-extern bool has_usable_swap(void);
 
 /* Swap 50% full? Release swapcache more aggressively.. */
 static inline bool vm_swap_full(void)
@@ -489,7 +488,6 @@ extern void swap_shmem_alloc(swp_entry_t, int);
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t entry, int nr);
 extern void swap_free_nr(swp_entry_t entry, int nr_pages);
-extern void swapcache_free_entries(swp_entry_t *entries, int n);
 extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
 int swap_type_of(dev_t device, sector_t offset);
 int find_first_swap(dev_t *device);
diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h
deleted file mode 100644
index 840aec3523b2..000000000000
--- a/include/linux/swap_slots.h
+++ /dev/null
@@ -1,28 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _LINUX_SWAP_SLOTS_H
-#define _LINUX_SWAP_SLOTS_H
-
-#include <linux/swap.h>
-#include <linux/spinlock.h>
-#include <linux/mutex.h>
-
-#define SWAP_SLOTS_CACHE_SIZE			SWAP_BATCH
-#define THRESHOLD_ACTIVATE_SWAP_SLOTS_CACHE	(5*SWAP_SLOTS_CACHE_SIZE)
-#define THRESHOLD_DEACTIVATE_SWAP_SLOTS_CACHE	(2*SWAP_SLOTS_CACHE_SIZE)
-
-struct swap_slots_cache {
-	bool		lock_initialized;
-	struct mutex	alloc_lock; /* protects slots, nr, cur */
-	swp_entry_t	*slots;
-	int		nr;
-	int		cur;
-	int		n_ret;
-};
-
-void disable_swap_slots_cache_lock(void);
-void reenable_swap_slots_cache_unlock(void);
-void enable_swap_slots_cache(void);
-
-extern bool swap_slot_cache_enabled;
-
-#endif /* _LINUX_SWAP_SLOTS_H */
diff --git a/mm/Makefile b/mm/Makefile
index 4510a9869e77..e7f6bbf8ae5f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -75,7 +75,7 @@ ifdef CONFIG_MMU
 obj-$(CONFIG_ADVISE_SYSCALLS)	+= madvise.o
 endif
 
-obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o swap_slots.o
+obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o
 obj-$(CONFIG_ZSWAP)	+= zswap.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
deleted file mode 100644
index 9c7c171df7ba..000000000000
--- a/mm/swap_slots.c
+++ /dev/null
@@ -1,295 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/*
- * Manage cache of swap slots to be used for and returned from
- * swap.
- *
- * Copyright(c) 2016 Intel Corporation.
- *
- * Author: Tim Chen
- *
- * We allocate the swap slots from the global pool and put
- * it into local per cpu caches. This has the advantage
- * of no needing to acquire the swap_info lock every time
- * we need a new slot.
- *
- * There is also opportunity to simply return the slot
- * to local caches without needing to acquire swap_info
- * lock. We do not reuse the returned slots directly but
- * move them back to the global pool in a batch. This
- * allows the slots to coalesce and reduce fragmentation.
- *
- * The swap entry allocated is marked with SWAP_HAS_CACHE
- * flag in map_count that prevents it from being allocated
- * again from the global pool.
- *
- * The swap slots cache is protected by a mutex instead of
- * a spin lock as when we search for slots with scan_swap_map,
- * we can possibly sleep.
- */
-
-#include <linux/swap_slots.h>
-#include <linux/cpu.h>
-#include <linux/cpumask.h>
-#include <linux/slab.h>
-#include <linux/vmalloc.h>
-#include <linux/mutex.h>
-#include <linux/mm.h>
-
-static DEFINE_PER_CPU(struct swap_slots_cache, swp_slots);
-static bool	swap_slot_cache_active;
-bool	swap_slot_cache_enabled;
-static bool	swap_slot_cache_initialized;
-static DEFINE_MUTEX(swap_slots_cache_mutex);
-/* Serialize swap slots cache enable/disable operations */
-static DEFINE_MUTEX(swap_slots_cache_enable_mutex);
-
-static void __drain_swap_slots_cache(void);
-
-#define use_swap_slot_cache (swap_slot_cache_active && swap_slot_cache_enabled)
-
-static void deactivate_swap_slots_cache(void)
-{
-	mutex_lock(&swap_slots_cache_mutex);
-	swap_slot_cache_active = false;
-	__drain_swap_slots_cache();
-	mutex_unlock(&swap_slots_cache_mutex);
-}
-
-static void reactivate_swap_slots_cache(void)
-{
-	mutex_lock(&swap_slots_cache_mutex);
-	swap_slot_cache_active = true;
-	mutex_unlock(&swap_slots_cache_mutex);
-}
-
-/* Must not be called with cpu hot plug lock */
-void disable_swap_slots_cache_lock(void)
-{
-	mutex_lock(&swap_slots_cache_enable_mutex);
-	swap_slot_cache_enabled = false;
-	if (swap_slot_cache_initialized) {
-		/* serialize with cpu hotplug operations */
-		cpus_read_lock();
-		__drain_swap_slots_cache();
-		cpus_read_unlock();
-	}
-}
-
-static void __reenable_swap_slots_cache(void)
-{
-	swap_slot_cache_enabled = has_usable_swap();
-}
-
-void reenable_swap_slots_cache_unlock(void)
-{
-	__reenable_swap_slots_cache();
-	mutex_unlock(&swap_slots_cache_enable_mutex);
-}
-
-static bool check_cache_active(void)
-{
-	long pages;
-
-	if (!swap_slot_cache_enabled)
-		return false;
-
-	pages = get_nr_swap_pages();
-	if (!swap_slot_cache_active) {
-		if (pages > num_online_cpus() *
-		    THRESHOLD_ACTIVATE_SWAP_SLOTS_CACHE)
-			reactivate_swap_slots_cache();
-		goto out;
-	}
-
-	/* if global pool of slot caches too low, deactivate cache */
-	if (pages < num_online_cpus() * THRESHOLD_DEACTIVATE_SWAP_SLOTS_CACHE)
-		deactivate_swap_slots_cache();
-out:
-	return swap_slot_cache_active;
-}
-
-static int alloc_swap_slot_cache(unsigned int cpu)
-{
-	struct swap_slots_cache *cache;
-	swp_entry_t *slots;
-
-	/*
-	 * Do allocation outside swap_slots_cache_mutex
-	 * as kvzalloc could trigger reclaim and folio_alloc_swap,
-	 * which can lock swap_slots_cache_mutex.
-	 */
-	slots = kvcalloc(SWAP_SLOTS_CACHE_SIZE, sizeof(swp_entry_t),
-			 GFP_KERNEL);
-	if (!slots)
-		return -ENOMEM;
-
-	mutex_lock(&swap_slots_cache_mutex);
-	cache = &per_cpu(swp_slots, cpu);
-	if (cache->slots) {
-		/* cache already allocated */
-		mutex_unlock(&swap_slots_cache_mutex);
-
-		kvfree(slots);
-
-		return 0;
-	}
-
-	if (!cache->lock_initialized) {
-		mutex_init(&cache->alloc_lock);
-		cache->lock_initialized = true;
-	}
-	cache->nr = 0;
-	cache->cur = 0;
-	cache->n_ret = 0;
-	/*
-	 * We initialized alloc_lock and free_lock earlier. We use
-	 * !cache->slots or !cache->slots_ret to know if it is safe to acquire
-	 * the corresponding lock and use the cache. Memory barrier below
-	 * ensures the assumption.
-	 */
-	mb();
-	cache->slots = slots;
-	mutex_unlock(&swap_slots_cache_mutex);
-	return 0;
-}
-
-static void drain_slots_cache_cpu(unsigned int cpu, bool free_slots)
-{
-	struct swap_slots_cache *cache;
-
-	cache = &per_cpu(swp_slots, cpu);
-	if (cache->slots) {
-		mutex_lock(&cache->alloc_lock);
-		swapcache_free_entries(cache->slots + cache->cur, cache->nr);
-		cache->cur = 0;
-		cache->nr = 0;
-		if (free_slots && cache->slots) {
-			kvfree(cache->slots);
-			cache->slots = NULL;
-		}
-		mutex_unlock(&cache->alloc_lock);
-	}
-}
-
-static void __drain_swap_slots_cache(void)
-{
-	unsigned int cpu;
-
-	/*
-	 * This function is called during
-	 *	1) swapoff, when we have to make sure no
-	 *	   left over slots are in cache when we remove
-	 *	   a swap device;
-	 *	2) disabling of swap slot cache, when we run low
-	 *	   on swap slots when allocating memory and need
-	 *	   to return swap slots to global pool.
-	 *
-	 * We cannot acquire cpu hot plug lock here as
-	 * this function can be invoked in the cpu
-	 * hot plug path:
-	 * cpu_up -> lock cpu_hotplug -> cpu hotplug state callback
-	 *   -> memory allocation -> direct reclaim -> folio_alloc_swap
-	 *   -> drain_swap_slots_cache
-	 *
-	 * Hence the loop over current online cpu below could miss cpu that
-	 * is being brought online but not yet marked as online.
-	 * That is okay as we do not schedule and run anything on a
-	 * cpu before it has been marked online. Hence, we will not
-	 * fill any swap slots in slots cache of such cpu.
-	 * There are no slots on such cpu that need to be drained.
-	 */
-	for_each_online_cpu(cpu)
-		drain_slots_cache_cpu(cpu, false);
-}
-
-static int free_slot_cache(unsigned int cpu)
-{
-	mutex_lock(&swap_slots_cache_mutex);
-	drain_slots_cache_cpu(cpu, true);
-	mutex_unlock(&swap_slots_cache_mutex);
-	return 0;
-}
-
-void enable_swap_slots_cache(void)
-{
-	mutex_lock(&swap_slots_cache_enable_mutex);
-	if (!swap_slot_cache_initialized) {
-		int ret;
-
-		ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "swap_slots_cache",
-					alloc_swap_slot_cache, free_slot_cache);
-		if (WARN_ONCE(ret < 0, "Cache allocation failed (%s), operating "
-				       "without swap slots cache.\n", __func__))
-			goto out_unlock;
-
-		swap_slot_cache_initialized = true;
-	}
-
-	__reenable_swap_slots_cache();
-out_unlock:
-	mutex_unlock(&swap_slots_cache_enable_mutex);
-}
-
-/* called with swap slot cache's alloc lock held */
-static int refill_swap_slots_cache(struct swap_slots_cache *cache)
-{
-	if (!use_swap_slot_cache)
-		return 0;
-
-	cache->cur = 0;
-	if (swap_slot_cache_active)
-		cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE,
-					   cache->slots, 0);
-
-	return cache->nr;
-}
-
-swp_entry_t folio_alloc_swap(struct folio *folio)
-{
-	swp_entry_t entry;
-	struct swap_slots_cache *cache;
-
-	entry.val = 0;
-
-	if (folio_test_large(folio)) {
-		if (IS_ENABLED(CONFIG_THP_SWAP))
-			get_swap_pages(1, &entry, folio_order(folio));
-		goto out;
-	}
-
-	/*
-	 * Preemption is allowed here, because we may sleep
-	 * in refill_swap_slots_cache(). But it is safe, because
-	 * accesses to the per-CPU data structure are protected by the
-	 * mutex cache->alloc_lock.
-	 *
-	 * The alloc path here does not touch cache->slots_ret
-	 * so cache->free_lock is not taken.
-	 */
-	cache = raw_cpu_ptr(&swp_slots);
-
-	if (likely(check_cache_active() && cache->slots)) {
-		mutex_lock(&cache->alloc_lock);
-		if (cache->slots) {
-repeat:
-			if (cache->nr) {
-				entry = cache->slots[cache->cur];
-				cache->slots[cache->cur++].val = 0;
-				cache->nr--;
-			} else if (refill_swap_slots_cache(cache)) {
-				goto repeat;
-			}
-		}
-		mutex_unlock(&cache->alloc_lock);
-		if (entry.val)
-			goto out;
-	}
-
-	get_swap_pages(1, &entry, 0);
-out:
-	if (mem_cgroup_try_charge_swap(folio, entry)) {
-		put_swap_folio(folio, entry);
-		entry.val = 0;
-	}
-	return entry;
-}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 50840a2887a5..2b5744e211cd 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -20,7 +20,6 @@
 #include <linux/blkdev.h>
 #include <linux/migrate.h>
 #include <linux/vmalloc.h>
-#include <linux/swap_slots.h>
 #include <linux/huge_mm.h>
 #include <linux/shmem_fs.h>
 #include "internal.h"
@@ -447,13 +446,8 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 
 	/*
 	 * Just skip read ahead for unused swap slot.
-	 * During swap_off when swap_slot_cache is disabled,
-	 * we have to handle the race between putting
-	 * swap entry in swap cache and marking swap slot
-	 * as SWAP_HAS_CACHE. That's done in later part of code or
-	 * else swap_off will be aborted if we return NULL.
	 */
-	if (!swap_entry_swapped(si, entry) && swap_slot_cache_enabled)
+	if (!swap_entry_swapped(si, entry))
 		goto put_and_return;
 
 	/*
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 7caaaea95408..1ba916109d99 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -37,7 +37,6 @@
 #include <linux/oom.h>
 #include <linux/swapfile.h>
 #include <linux/export.h>
-#include <linux/swap_slots.h>
 #include <linux/sort.h>
 #include <linux/completion.h>
 #include <linux/suspend.h>
@@ -885,6 +884,13 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order, unsigned char usage)
 	struct swap_cluster_info *ci;
 	unsigned int offset, found = 0;
 
+	/*
+	 * Swapfile is not block device so unable
+	 * to allocate large entries.
+	 */
+	if (order && !(si->flags & SWP_BLKDEV))
+		return 0;
+
 	if (!(si->flags & SWP_SOLIDSTATE)) {
 		/* Serialize HDD SWAP allocation for each device. */
 		spin_lock(&si->global_cluster_lock);
@@ -1148,43 +1154,6 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 	swap_usage_sub(si, nr_entries);
 }
 
-static int scan_swap_map_slots(struct swap_info_struct *si,
-			       unsigned char usage, int nr,
-			       swp_entry_t slots[], int order)
-{
-	unsigned int nr_pages = 1 << order;
-	int n_ret = 0;
-
-	if (order > 0) {
-		/*
-		 * Should not even be attempting large allocations when huge
-		 * page swap is disabled. Warn and fail the allocation.
-		 */
-		if (!IS_ENABLED(CONFIG_THP_SWAP) ||
-		    nr_pages > SWAPFILE_CLUSTER) {
-			VM_WARN_ON_ONCE(1);
-			return 0;
-		}
-
-		/*
-		 * Swapfile is not block device so unable
-		 * to allocate large entries.
-		 */
-		if (!(si->flags & SWP_BLKDEV))
-			return 0;
-	}
-
-	while (n_ret < nr) {
-		unsigned long offset = cluster_alloc_swap_entry(si, order, usage);
-
-		if (!offset)
-			break;
-		slots[n_ret++] = swp_entry(si->type, offset);
-	}
-
-	return n_ret;
-}
-
 static bool get_swap_device_info(struct swap_info_struct *si)
 {
 	if (!percpu_ref_tryget_live(&si->users))
@@ -1205,16 +1174,13 @@ static bool get_swap_device_info(struct swap_info_struct *si)
  * Fast path try to get swap entries with specified order from current
  * CPU's swap entry pool (a cluster).
  */
-static int swap_alloc_fast(swp_entry_t entries[],
+static int swap_alloc_fast(swp_entry_t *entry,
 			   unsigned char usage,
-			   int order, int n_goal)
+			   int order)
 {
 	struct swap_cluster_info *ci;
 	struct swap_info_struct *si;
-	unsigned int offset, found;
-	int n_ret = 0;
-
-	n_goal = min(n_goal, SWAP_BATCH);
+	unsigned int offset, found = SWAP_ENTRY_INVALID;
 
 	/*
 	 * Once allocated, swap_info_struct will never be completely freed,
@@ -1223,44 +1189,48 @@ static int swap_alloc_fast(swp_entry_t entries[],
 	si = __this_cpu_read(percpu_swap_cluster.si[order]);
 	offset = __this_cpu_read(percpu_swap_cluster.offset[order]);
 	if (!si || !offset || !get_swap_device_info(si))
-		return 0;
+		return false;
 
-	while (offset) {
-		ci = lock_cluster(si, offset);
-		if (!cluster_is_usable(ci, order))
-			break;
+	ci = lock_cluster(si, offset);
+	if (cluster_is_usable(ci, order)) {
 		if (cluster_is_empty(ci))
 			offset = cluster_offset(si, ci);
 		found = alloc_swap_scan_cluster(si, ci, offset, order, usage);
-		if (!found)
-			break;
-		entries[n_ret++] = swp_entry(si->type, found);
-		if (n_ret == n_goal)
-			break;
-		offset = __this_cpu_read(percpu_swap_cluster.offset[order]);
+		if (found)
+			*entry = swp_entry(si->type, found);
+	} else {
+		unlock_cluster(ci);
 	}
 
 	put_swap_device(si);
-	return n_ret;
+	return !!found;
 }
 
-int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
+swp_entry_t folio_alloc_swap(struct folio *folio)
 {
-	int order = swap_entry_order(entry_order);
-	unsigned long size = 1 << order;
+	unsigned int order = folio_order(folio);
+	unsigned int size = 1 << order;
 	struct swap_info_struct *si, *next;
-	int n_ret = 0;
+	swp_entry_t entry = {};
+	unsigned long offset;
 	int node;
 
+	if (order) {
+		/*
+		 * Should not even be attempting large allocations when huge
+		 * page swap is disabled. Warn and fail the allocation.
+		 */
+		if (!IS_ENABLED(CONFIG_THP_SWAP) || size > SWAPFILE_CLUSTER) {
+			VM_WARN_ON_ONCE(1);
+			return entry;
+		}
+	}
+
 	/* Fast path using percpu cluster */
 	local_lock(&percpu_swap_cluster.lock);
-	n_ret = swap_alloc_fast(swp_entries,
-				SWAP_HAS_CACHE,
-				order, n_goal);
-	if (n_ret == n_goal)
-		goto out;
+	if (swap_alloc_fast(&entry, SWAP_HAS_CACHE, order))
+		goto out_alloced;
 
-	n_goal = min_t(int, n_goal - n_ret, SWAP_BATCH);
 	/* Rotate the device and switch to a new cluster */
 	spin_lock(&swap_avail_lock);
start_over:
@@ -1269,19 +1239,14 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
 		spin_unlock(&swap_avail_lock);
 		if (get_swap_device_info(si)) {
-			/*
-			 * For order 0 allocation, try best to fill the request
-			 * as it's used by slot cache.
-			 *
-			 * For mTHP allocation, it always have n_goal == 1,
-			 * and falling a mTHP swapin will just make the caller
-			 * fallback to order 0 allocation, so just bail out.
-			 */
-			n_ret += scan_swap_map_slots(si, SWAP_HAS_CACHE, n_goal,
-						     swp_entries + n_ret, order);
+			offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE);
 			put_swap_device(si);
-			if (n_ret || size > 1)
-				goto out;
+			if (offset) {
+				entry = swp_entry(si->type, offset);
+				goto out_alloced;
+			}
+			if (order)
+				goto out_failed;
 		}
 
 		spin_lock(&swap_avail_lock);
@@ -1300,10 +1265,20 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 		goto start_over;
 	}
 	spin_unlock(&swap_avail_lock);
-out:
+out_failed:
+	local_unlock(&percpu_swap_cluster.lock);
+	return entry;
+
+out_alloced:
 	local_unlock(&percpu_swap_cluster.lock);
-	atomic_long_sub(n_ret * size, &nr_swap_pages);
-	return n_ret;
+	if (mem_cgroup_try_charge_swap(folio, entry)) {
+		put_swap_folio(folio, entry);
+		entry.val = 0;
+	} else {
+		atomic_long_sub(size, &nr_swap_pages);
+	}
+
+	return entry;
 }
 
 static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
@@ -1599,25 +1574,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 	unlock_cluster(ci);
 }
 
-void swapcache_free_entries(swp_entry_t *entries, int n)
-{
-	int i;
-	struct swap_cluster_info *ci;
-	struct swap_info_struct *si = NULL;
-
-	if (n <= 0)
-		return;
-
-	for (i = 0; i < n; ++i) {
-		si = _swap_info_get(entries[i]);
-		if (si) {
-			ci = lock_cluster(si, swp_offset(entries[i]));
-			swap_entry_range_free(si, ci, entries[i], 1);
-			unlock_cluster(ci);
-		}
-	}
-}
-
 int __swap_count(swp_entry_t entry)
 {
 	struct swap_info_struct *si = swp_swap_info(entry);
@@ -1858,6 +1814,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
 swp_entry_t get_swap_page_of_type(int type)
 {
 	struct swap_info_struct *si = swap_type_to_swap_info(type);
+	unsigned long offset;
 	swp_entry_t entry = {0};
 
 	if (!si)
@@ -1865,8 +1822,13 @@ swp_entry_t get_swap_page_of_type(int type)
 
 	/* This is called for allocating swap entry, not cache */
 	if (get_swap_device_info(si)) {
-		if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0))
-			atomic_long_dec(&nr_swap_pages);
+		if (si->flags & SWP_WRITEOK) {
+			offset = cluster_alloc_swap_entry(si, 0, 1);
+			if (offset) {
+				entry = swp_entry(si->type, offset);
+				atomic_long_dec(&nr_swap_pages);
+			}
+		}
 		put_swap_device(si);
 	}
fail:
@@ -2627,21 +2589,6 @@ static void reinsert_swap_info(struct swap_info_struct *si)
 	spin_unlock(&swap_lock);
 }
 
-static bool __has_usable_swap(void)
-{
-	return !plist_head_empty(&swap_active_head);
-}
-
-bool has_usable_swap(void)
-{
-	bool ret;
-
-	spin_lock(&swap_lock);
-	ret = __has_usable_swap();
-	spin_unlock(&swap_lock);
-	return ret;
-}
-
 /*
  * Called after clearing SWP_WRITEOK, ensures cluster_alloc_range
  * see the updated flags, so there will be no more allocations.
@@ -2732,8 +2679,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 
 	wait_for_allocation(p);
 
-	disable_swap_slots_cache_lock();
-
 	set_current_oom_origin();
 	err = try_to_unuse(p->type);
 	clear_current_oom_origin();
 
@@ -2741,12 +2686,9 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	if (err) {
 		/* re-insert swap space back into swap_list */
 		reinsert_swap_info(p);
-		reenable_swap_slots_cache_unlock();
 		goto out_dput;
 	}
 
-	reenable_swap_slots_cache_unlock();
-
 	/*
 	 * Wait for swap operations protected by get/put_swap_device()
	 * to complete. Because of synchronize_rcu() here, all swap
@@ -3495,8 +3437,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	putname(name);
 	if (inode)
 		inode_unlock(inode);
-	if (!error)
-		enable_swap_slots_cache();
 	return error;
 }
 
@@ -3892,6 +3832,11 @@ static void free_swap_count_continuations(struct swap_info_struct *si)
 }
 
 #if defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP)
+static bool __has_usable_swap(void)
+{
+	return !plist_head_empty(&swap_active_head);
+}
+
 void __folio_throttle_swaprate(struct folio *folio, gfp_t gfp)
 {
 	struct swap_info_struct *si, *next;
-- 
2.48.1

From nobody Thu Dec 18 08:27:53 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Hugh Dickins, Yosry Ahmed,
 "Huang, Ying", Baoquan He, Nhat Pham, Johannes Weiner, Baolin Wang,
 Kalesh Singh, Matthew Wilcox, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v2 7/7] mm, swap: simplify folio swap allocation
Date: Tue, 25 Feb 2025 02:02:12 +0800
Message-ID: <20250224180212.22802-8-ryncsn@gmail.com>
X-Mailer: git-send-email 2.48.1
In-Reply-To: <20250224180212.22802-1-ryncsn@gmail.com>
References: <20250224180212.22802-1-ryncsn@gmail.com>
Reply-To: Kairui Song
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Kairui Song

With the slot cache gone, clean up the allocation helpers even more.
folio_alloc_swap() is now the only entry point that allocates swap space
and adds the folio to the swap cache (except for the suspend path),
making it the opposite of folio_free_swap().
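For readers skimming the series, here is a minimal sketch of the calling
convention this patch establishes (an editorial illustration, not part of
the patch; example_swap_out() is an invented name and the error handling
shown is an assumption):

static int example_swap_out(struct folio *folio,
			    struct writeback_control *wbc)
{
	int err;

	/* Allocates swap space and adds the locked folio to the swap cache. */
	err = folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOWARN);
	if (err)
		return err;	/* e.g. -ENOMEM; callers may split and retry */

	/* folio->swap is now valid; write the folio out. */
	return swap_writepage(&folio->page, wbc);
}

The point of the design is that allocation and swap cache insertion now
either both succeed or both fail, so callers no longer need a separate
put_swap_folio() cleanup path on failure.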
Signed-off-by: Kairui Song
Reviewed-by: Baoquan He
---
 include/linux/swap.h |   8 ++--
 mm/shmem.c           |  21 +++------
 mm/swap.h            |   6 ---
 mm/swap_state.c      |  57 ----------------------
 mm/swapfile.c        | 110 ++++++++++++++++++++++++++++---------------
 mm/vmscan.c          |  16 ++++++-
 6 files changed, 94 insertions(+), 124 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index a0a262bcaf41..3a68da686c4e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -478,7 +478,7 @@ static inline long get_nr_swap_pages(void)
 }
 
 extern void si_swapinfo(struct sysinfo *);
-swp_entry_t folio_alloc_swap(struct folio *folio);
+int folio_alloc_swap(struct folio *folio, gfp_t gfp_mask);
 bool folio_free_swap(struct folio *folio);
 void put_swap_folio(struct folio *folio, swp_entry_t entry);
 extern swp_entry_t get_swap_page_of_type(int);
@@ -587,11 +587,9 @@ static inline int swp_swapcount(swp_entry_t entry)
 	return 0;
 }
 
-static inline swp_entry_t folio_alloc_swap(struct folio *folio)
+static inline int folio_alloc_swap(struct folio *folio, gfp_t gfp_mask)
 {
-	swp_entry_t entry;
-	entry.val = 0;
-	return entry;
+	return -EINVAL;
 }
 
 static inline bool folio_free_swap(struct folio *folio)
diff --git a/mm/shmem.c b/mm/shmem.c
index 45dbcb69da0c..aad02132b75a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1546,7 +1546,6 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	struct inode *inode = mapping->host;
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
-	swp_entry_t swap;
 	pgoff_t index;
 	int nr_pages;
 	bool split = false;
@@ -1628,14 +1627,6 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 		folio_mark_uptodate(folio);
 	}
 
-	swap = folio_alloc_swap(folio);
-	if (!swap.val) {
-		if (nr_pages > 1)
-			goto try_split;
-
-		goto redirty;
-	}
-
 	/*
	 * Add inode to shmem_unuse()'s list of swapped-out inodes,
	 * if it's not already there. Do it now before the folio is
@@ -1648,20 +1639,20 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	if (list_empty(&info->swaplist))
 		list_add(&info->swaplist, &shmem_swaplist);
 
-	if (add_to_swap_cache(folio, swap,
-			__GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN,
-			NULL) == 0) {
+	if (!folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN)) {
 		shmem_recalc_inode(inode, 0, nr_pages);
-		swap_shmem_alloc(swap, nr_pages);
-		shmem_delete_from_page_cache(folio, swp_to_radix_entry(swap));
+		swap_shmem_alloc(folio->swap, nr_pages);
+		shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap));
 
 		mutex_unlock(&shmem_swaplist_mutex);
 		BUG_ON(folio_mapped(folio));
 		return swap_writepage(&folio->page, wbc);
 	}
 
+	list_del_init(&info->swaplist);
 	mutex_unlock(&shmem_swaplist_mutex);
-	put_swap_folio(folio, swap);
+	if (nr_pages > 1)
+		goto try_split;
redirty:
 	folio_mark_dirty(folio);
 	if (wbc->for_reclaim)
diff --git a/mm/swap.h b/mm/swap.h
index ad2f121de970..0abb68091b4f 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -50,7 +50,6 @@ static inline pgoff_t swap_cache_index(swp_entry_t entry)
 }
 
 void show_swap_cache_info(void);
-bool add_to_swap(struct folio *folio);
 void *get_shadow_from_swap_cache(swp_entry_t entry);
 int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
 		      gfp_t gfp, void **shadowp);
@@ -163,11 +162,6 @@ struct folio *filemap_get_incore_folio(struct address_space *mapping,
 	return filemap_get_folio(mapping, index);
 }
 
-static inline bool add_to_swap(struct folio *folio)
-{
-	return false;
-}
-
 static inline void *get_shadow_from_swap_cache(swp_entry_t entry)
 {
 	return NULL;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 2b5744e211cd..68fd981b514f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -166,63 +166,6 @@ void __delete_from_swap_cache(struct folio *folio,
 	__lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr);
 }
 
-/**
- * add_to_swap - allocate swap space for a folio
- * @folio: folio we want to move to swap
- *
- * Allocate swap space for the folio and add the folio to the
- * swap cache.
- *
- * Context: Caller needs to hold the folio lock.
- * Return: Whether the folio was added to the swap cache.
- */
-bool add_to_swap(struct folio *folio)
-{
-	swp_entry_t entry;
-	int err;
-
-	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
-	VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio);
-
-	entry = folio_alloc_swap(folio);
-	if (!entry.val)
-		return false;
-
-	/*
-	 * XArray node allocations from PF_MEMALLOC contexts could
-	 * completely exhaust the page allocator. __GFP_NOMEMALLOC
-	 * stops emergency reserves from being allocated.
-	 *
-	 * TODO: this could cause a theoretical memory reclaim
-	 * deadlock in the swap out path.
-	 */
-	/*
-	 * Add it to the swap cache.
-	 */
-	err = add_to_swap_cache(folio, entry,
-			__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN, NULL);
-	if (err)
-		goto fail;
-	/*
-	 * Normally the folio will be dirtied in unmap because its
-	 * pte should be dirty. A special case is MADV_FREE page. The
-	 * page's pte could have dirty bit cleared but the folio's
-	 * SwapBacked flag is still set because clearing the dirty bit
-	 * and SwapBacked flag has no lock protected. For such folio,
-	 * unmap will not set dirty bit for it, so folio reclaim will
-	 * not write the folio out. This can cause data corruption when
-	 * the folio is swapped in later. Always setting the dirty flag
-	 * for the folio solves the problem.
-	 */
-	folio_mark_dirty(folio);
-
-	return true;
-
-fail:
-	put_swap_folio(folio, entry);
-	return false;
-}
-
 /*
  * This must be called only on folios that have
  * been verified to be in the swap cache and locked.
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1ba916109d99..628f67974a7c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1174,9 +1174,9 @@ static bool get_swap_device_info(struct swap_info_struct *si)
  * Fast path try to get swap entries with specified order from current
  * CPU's swap entry pool (a cluster).
  */
-static int swap_alloc_fast(swp_entry_t *entry,
-			   unsigned char usage,
-			   int order)
+static bool swap_alloc_fast(swp_entry_t *entry,
+			    unsigned char usage,
+			    int order)
 {
 	struct swap_cluster_info *ci;
 	struct swap_info_struct *si;
@@ -1206,47 +1206,31 @@ static int swap_alloc_fast(swp_entry_t *entry,
 	return !!found;
 }
 
-swp_entry_t folio_alloc_swap(struct folio *folio)
+/* Rotate the device and switch to a new cluster */
+static bool swap_alloc_slow(swp_entry_t *entry,
+			    unsigned char usage,
+			    int order)
 {
-	unsigned int order = folio_order(folio);
-	unsigned int size = 1 << order;
-	struct swap_info_struct *si, *next;
-	swp_entry_t entry = {};
-	unsigned long offset;
 	int node;
+	unsigned long offset;
+	struct swap_info_struct *si, *next;
 
-	if (order) {
-		/*
-		 * Should not even be attempting large allocations when huge
-		 * page swap is disabled. Warn and fail the allocation.
-		 */
-		if (!IS_ENABLED(CONFIG_THP_SWAP) || size > SWAPFILE_CLUSTER) {
-			VM_WARN_ON_ONCE(1);
-			return entry;
-		}
-	}
-
-	/* Fast path using percpu cluster */
-	local_lock(&percpu_swap_cluster.lock);
-	if (swap_alloc_fast(&entry, SWAP_HAS_CACHE, order))
-		goto out_alloced;
-
-	/* Rotate the device and switch to a new cluster */
+	node = numa_node_id();
 	spin_lock(&swap_avail_lock);
start_over:
-	node = numa_node_id();
 	plist_for_each_entry_safe(si, next, &swap_avail_heads[node], avail_lists[node]) {
+		/* Rotate the device and switch to a new cluster */
 		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
 		spin_unlock(&swap_avail_lock);
 		if (get_swap_device_info(si)) {
 			offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE);
 			put_swap_device(si);
 			if (offset) {
-				entry = swp_entry(si->type, offset);
-				goto out_alloced;
+				*entry = swp_entry(si->type, offset);
+				return true;
 			}
 			if (order)
-				goto out_failed;
+				return false;
 		}
 
 		spin_lock(&swap_avail_lock);
@@ -1265,20 +1249,68 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
 		goto start_over;
 	}
 	spin_unlock(&swap_avail_lock);
-out_failed:
+	return false;
+}
+
+/**
+ * folio_alloc_swap - allocate swap space for a folio
+ * @folio: folio we want to move to swap
+ * @gfp: gfp mask for shadow nodes
+ *
+ * Allocate swap space for the folio and add the folio to the
+ * swap cache.
+ *
+ * Context: Caller needs to hold the folio lock.
+ * Return: 0 on success, or a negative error code on failure.
+ */
+int folio_alloc_swap(struct folio *folio, gfp_t gfp)
+{
+	unsigned int order = folio_order(folio);
+	unsigned int size = 1 << order;
+	swp_entry_t entry = {};
+
+	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+	VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio);
+
+	/*
+	 * Should not even be attempting large allocations when huge
+	 * page swap is disabled. Warn and fail the allocation.
+	 */
+	if (order && (!IS_ENABLED(CONFIG_THP_SWAP) || size > SWAPFILE_CLUSTER)) {
+		VM_WARN_ON_ONCE(1);
+		return -EINVAL;
+	}
+
+	local_lock(&percpu_swap_cluster.lock);
+	if (swap_alloc_fast(&entry, SWAP_HAS_CACHE, order))
+		goto out_alloced;
+	if (swap_alloc_slow(&entry, SWAP_HAS_CACHE, order))
+		goto out_alloced;
 	local_unlock(&percpu_swap_cluster.lock);
-	return entry;
+	return -ENOMEM;
 
 out_alloced:
 	local_unlock(&percpu_swap_cluster.lock);
-	if (mem_cgroup_try_charge_swap(folio, entry)) {
-		put_swap_folio(folio, entry);
-		entry.val = 0;
-	} else {
-		atomic_long_sub(size, &nr_swap_pages);
-	}
+	if (mem_cgroup_try_charge_swap(folio, entry))
+		goto out_free;
 
-	return entry;
+	/*
+	 * XArray node allocations from PF_MEMALLOC contexts could
+	 * completely exhaust the page allocator. __GFP_NOMEMALLOC
+	 * stops emergency reserves from being allocated.
+	 *
+	 * TODO: this could cause a theoretical memory reclaim
+	 * deadlock in the swap out path.
+	 */
+	if (add_to_swap_cache(folio, entry, gfp | __GFP_NOMEMALLOC, NULL))
+		goto out_free;
+
+	atomic_long_sub(size, &nr_swap_pages);
+	return 0;
+
+out_free:
+	put_swap_folio(folio, entry);
+	return -ENOMEM;
 }
 
 static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fcca38bc640f..be00af3763b5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1289,7 +1289,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 					    split_folio_to_list(folio, folio_list))
 						goto activate_locked;
 				}
-				if (!add_to_swap(folio)) {
+				if (folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOWARN)) {
 					int __maybe_unused order = folio_order(folio);
 
 					if (!folio_test_large(folio))
@@ -1305,9 +1305,21 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 					}
 #endif
 					count_mthp_stat(order, MTHP_STAT_SWPOUT_FALLBACK);
-					if (!add_to_swap(folio))
+					if (folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOWARN))
 						goto activate_locked_split;
 				}
+				/*
+				 * Normally the folio will be dirtied in unmap because its
+				 * pte should be dirty. A special case is MADV_FREE page. The
+				 * page's pte could have dirty bit cleared but the folio's
+				 * SwapBacked flag is still set because clearing the dirty bit
+				 * and SwapBacked flag has no lock protected. For such folio,
+				 * unmap will not set dirty bit for it, so folio reclaim will
+				 * not write the folio out. This can cause data corruption when
+				 * the folio is swapped in later. Always setting the dirty flag
+				 * for the folio solves the problem.
+				 */
+				folio_mark_dirty(folio);
 			}
 		}
 
-- 
2.48.1
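To make the reclaim-side change above concrete, the swap-out step of
shrink_folio_list() after this patch condenses to roughly the following
(an editorial sketch, not code from the series; example_add_anon_to_swap()
is an invented name, and the real code also updates mTHP statistics under
CONFIG_TRANSPARENT_HUGEPAGE):

static bool example_add_anon_to_swap(struct folio *folio,
				     struct list_head *folio_list)
{
	if (folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOWARN)) {
		/* Allocation failed: split a large folio and retry at order 0. */
		if (!folio_test_large(folio) ||
		    split_folio_to_list(folio, folio_list))
			return false;
		if (folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOWARN))
			return false;
	}
	/*
	 * Mark the folio dirty unconditionally: an MADV_FREE folio can
	 * have a clean pte while SwapBacked is still set, and without
	 * the dirty flag reclaim would skip writeback and a later
	 * swap-in could see stale data.
	 */
	folio_mark_dirty(folio);
	return true;
}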