From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
    Yosry Ahmed, "Huang, Ying", Baoquan He, Nhat Pham, Johannes Weiner,
    Kalesh Singh, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v4 09/13] mm, swap: reduce contention on device lock
Date: Tue, 14 Jan 2025 01:57:28 +0800
Message-ID: <20250113175732.48099-10-ryncsn@gmail.com>
X-Mailer: git-send-email 2.47.1
In-Reply-To: <20250113175732.48099-1-ryncsn@gmail.com>
References: <20250113175732.48099-1-ryncsn@gmail.com>
Reply-To: Kairui Song
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Kairui Song

Currently, swap locking is mainly composed of two locks: the cluster lock
(ci->lock) and the device lock (si->lock). The cluster lock is much more
fine-grained, so it is best to use ci->lock instead of si->lock as much as
possible.

We have cleaned up other hard dependencies on si->lock. Following the new
cluster allocator design, most operations don't need to touch si->lock at
all. In practice, we only need to take si->lock when moving clusters
between lists.

To achieve this, this commit reworks the locking pattern of all si->lock
and ci->lock users, eliminates all usage of ci->lock inside si->lock, and
introduces a new design to avoid touching si->lock unless needed.

For minimal contention and easier understanding of the system, two ideas
are introduced with the corresponding helpers: isolation and relocation.

- Clusters will be `isolated` from the list when iterating the list to
  search for an allocatable cluster. This ensures other CPUs won't walk
  into the same cluster easily, and it releases si->lock after acquiring
  ci->lock, providing the only place that handles the inversion of the
  two locks, and avoiding contention.

  Iterating the cluster list almost always moves the cluster
  (free -> nonfull, nonfull -> frag, frag -> frag tail), but it doesn't
  know where the cluster should be moved to until scanning is done. So
  keeping the cluster off-list is a good option with low overhead.

  The off-list time window of a cluster is also minimal. In the worst
  case, one CPU will return the cluster after scanning the 512 entries
  on it, for which we previously busy-waited with a spin lock.

  This is done with the new helper isolate_lock_cluster.

- Clusters will be `relocated` after allocation or freeing, according to
  their usage count and status. Allocations no longer hold si->lock, and
  may drop ci->lock for reclaim, so the cluster could be moved to any
  location while no lock is held.

  Besides, isolation clears all flags when it takes the cluster off the
  list (the flags must be in sync with the list status, so cluster users
  don't need to touch si->lock to check its list status). So the cluster
  has to be relocated to the right list according to its usage after
  allocation or freeing.

  Relocation is optional: if the cluster flags indicate it's already on
  the right list, it skips touching the list or si->lock.

  This is done with relocate_cluster after allocation, or with
  [partial_]free_cluster after freeing.

This handles all kinds of cluster usage in a clean way.

Scanning and allocation by iterating the cluster list is handled by
"isolate - <scan/alloc> - relocate".

Scanning and allocation of per-CPU clusters only involve
"<scan/alloc> - relocate", as the allocator already knows which cluster
to lock and use.

Freeing only involves "relocate".

Each CPU will keep using its per-CPU cluster until all 512 entries on it
are consumed. In the best case, freeing likewise has to free 512 entries
before it triggers a cluster movement, so si->lock is rarely touched.

Testing by building the Linux kernel with defconfig showed a huge
improvement:

time make -j96 / 768M memcg, 4K pages, 10G ZRAM, on Intel 8255C:
Before: Sys time: 73578.30, Real time: 864.05
After: (-50.7% sys time, -44.8% real time)
        Sys time: 36227.49, Real time: 476.66

time make -j96 / 1152M memcg, 64K mTHP, 10G ZRAM, on Intel 8255C
(avg of 4 test runs):
Before: Sys time: 74044.85, Real time: 846.51
        hugepages-64kB/stats/swpout: 1735216
        hugepages-64kB/stats/swpout_fallback: 430333
After: (-40.4% sys time, -37.1% real time)
        Sys time: 44160.56, Real time: 532.07
        hugepages-64kB/stats/swpout: 1786288
        hugepages-64kB/stats/swpout_fallback: 243384

time make -j32 / 512M memcg, 4K pages, 5G ZRAM, on AMD 7K62:
Before: Sys time: 8098.21, Real time: 401.3
After: (-22.6% sys time, -12.8% real time)
        Sys time: 6265.02, Real time: 349.83

The allocation success rate also improved slightly, as the usage of
clusters is now sanitized with the newly defined helpers; previously,
dropping si->lock or ci->lock during a scan could shuffle the cluster
order.

Suggested-by: Chris Li
Signed-off-by: Kairui Song
---
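For readers skimming the diff below, here is a minimal userspace sketch of
the "isolate - <scan/alloc> - relocate" pattern described above. It is not
the mm/swapfile.c implementation: struct cluster, struct device, NR_SLOTS,
the FLAG_* constants and grab_one_slot() are invented for illustration,
pthread mutexes stand in for the kernel spinlocks, and the per-CPU fast
path, reclaim and discard handling are omitted.

/*
 * Illustrative sketch only: simplified stand-ins for the kernel types.
 * pthread mutexes model si->lock and ci->lock, and a tiny circular
 * doubly-linked list models the free/nonfull/full cluster lists.
 */
#include <pthread.h>
#include <stddef.h>

#define NR_SLOTS 512    /* entries per cluster, like SWAPFILE_CLUSTER */

enum { FLAG_NONE, FLAG_FREE, FLAG_NONFULL, FLAG_FULL };

struct cluster {
        struct cluster *prev, *next;    /* list linkage */
        pthread_mutex_t lock;           /* plays the role of ci->lock */
        int count;                      /* allocated entries, 0..NR_SLOTS */
        int flags;                      /* always matches the list we are on */
};

/* Circular list with a dummy head; next/prev start pointing at head. */
struct list { struct cluster head; };

struct device {
        pthread_mutex_t lock;           /* plays the role of si->lock */
        struct list free_list, nonfull, full;
};

static void list_del(struct cluster *c)
{
        c->prev->next = c->next;
        c->next->prev = c->prev;
}

static void list_add_tail(struct list *l, struct cluster *c)
{
        c->prev = l->head.prev;
        c->next = &l->head;
        c->prev->next = c;
        l->head.prev = c;
}

/*
 * Isolation: the device lock is held only while unlinking. Contended
 * clusters are skipped with a trylock; the returned cluster is off-list,
 * its flag is cleared, and its lock is held by the caller.
 */
static struct cluster *isolate_lock_cluster(struct device *d, struct list *l)
{
        struct cluster *c, *ret = NULL;

        pthread_mutex_lock(&d->lock);
        for (c = l->head.next; c != &l->head; c = c->next) {
                if (pthread_mutex_trylock(&c->lock))
                        continue;       /* contended, try the next one */
                list_del(c);
                c->flags = FLAG_NONE;   /* off-list clusters carry no flag */
                ret = c;
                break;
        }
        pthread_mutex_unlock(&d->lock);
        return ret;
}

/*
 * Relocation: called with the cluster lock held after scan/alloc/free.
 * If the flag already matches the target list the device lock is not
 * taken at all; otherwise it is held only for the list update.
 */
static void relocate_cluster(struct device *d, struct cluster *c)
{
        struct list *target = c->count == 0 ? &d->free_list :
                              c->count < NR_SLOTS ? &d->nonfull : &d->full;
        int flag = c->count == 0 ? FLAG_FREE :
                   c->count < NR_SLOTS ? FLAG_NONFULL : FLAG_FULL;

        if (c->flags == flag)
                return;

        pthread_mutex_lock(&d->lock);
        if (c->flags != FLAG_NONE)
                list_del(c);            /* leave the current list first */
        list_add_tail(target, c);
        c->flags = flag;
        pthread_mutex_unlock(&d->lock);
}

/* Example flow: take one slot from a nonfull cluster, then put it back. */
static struct cluster *grab_one_slot(struct device *d)
{
        struct cluster *c = isolate_lock_cluster(d, &d->nonfull);

        if (c) {
                c->count++;             /* the "scan/alloc" step, simplified */
                relocate_cluster(d, c);
                pthread_mutex_unlock(&c->lock);
        }
        return c;
}

The essential property is visible in the lock scopes: the device lock is
only ever held around a list unlink or link, never while waiting on a
cluster lock (isolation uses a trylock), and a cluster whose flag already
matches its target list is relocated without touching the device lock at
all.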
 include/linux/swap.h | 3 +- mm/swapfile.c | 432 ++++++++++++++++++++++++------------------- 2 files changed, 247 insertions(+), 188 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 339d7f0192ff..c4ff31cb6bde 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -291,6 +291,7 @@ enum swap_cluster_flags { * throughput.
*/ struct percpu_cluster { + local_lock_t lock; /* Protect the percpu_cluster above */ unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */ }; =20 @@ -313,7 +314,7 @@ struct swap_info_struct { /* list of cluster that contains at least one free slot */ struct list_head frag_clusters[SWAP_NR_ORDERS]; /* list of cluster that are fragmented or contented */ - unsigned int frag_cluster_nr[SWAP_NR_ORDERS]; + atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS]; unsigned int pages; /* total of usable pages of swap */ atomic_long_t inuse_pages; /* number of those currently in use */ struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap locatio= n */ diff --git a/mm/swapfile.c b/mm/swapfile.c index b754c9e16c3b..489ac6997a0c 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -261,12 +261,10 @@ static int __try_to_reclaim_swap(struct swap_info_str= uct *si, folio_ref_sub(folio, nr_pages); folio_set_dirty(folio); =20 - spin_lock(&si->lock); /* Only sinple page folio can be backed by zswap */ if (nr_pages =3D=3D 1) zswap_invalidate(entry); swap_entry_range_free(si, entry, nr_pages); - spin_unlock(&si->lock); ret =3D nr_pages; out_unlock: folio_unlock(folio); @@ -401,9 +399,23 @@ static void discard_swap_cluster(struct swap_info_stru= ct *si, #endif #define LATENCY_LIMIT 256 =20 -static inline bool cluster_is_free(struct swap_cluster_info *info) +static inline bool cluster_is_empty(struct swap_cluster_info *info) +{ + return info->count =3D=3D 0; +} + +static inline bool cluster_is_discard(struct swap_cluster_info *info) +{ + return info->flags =3D=3D CLUSTER_FLAG_DISCARD; +} + +static inline bool cluster_is_usable(struct swap_cluster_info *ci, int ord= er) { - return info->flags =3D=3D CLUSTER_FLAG_FREE; + if (unlikely(ci->flags > CLUSTER_FLAG_USABLE)) + return false; + if (!order) + return true; + return cluster_is_empty(ci) || order =3D=3D ci->order; } =20 static inline unsigned int cluster_index(struct swap_info_struct *si, @@ -441,19 +453,20 @@ static void move_cluster(struct swap_info_struct *si, VM_WARN_ON(ci->flags =3D=3D new_flags); =20 BUILD_BUG_ON(1 << sizeof(ci->flags) * BITS_PER_BYTE < CLUSTER_FLAG_MAX); + lockdep_assert_held(&ci->lock); =20 - if (ci->flags =3D=3D CLUSTER_FLAG_NONE) { + spin_lock(&si->lock); + if (ci->flags =3D=3D CLUSTER_FLAG_NONE) list_add_tail(&ci->list, list); - } else { - if (ci->flags =3D=3D CLUSTER_FLAG_FRAG) { - VM_WARN_ON(!si->frag_cluster_nr[ci->order]); - si->frag_cluster_nr[ci->order]--; - } + else list_move_tail(&ci->list, list); - } + spin_unlock(&si->lock); + + if (ci->flags =3D=3D CLUSTER_FLAG_FRAG) + atomic_long_dec(&si->frag_cluster_nr[ci->order]); + else if (new_flags =3D=3D CLUSTER_FLAG_FRAG) + atomic_long_inc(&si->frag_cluster_nr[ci->order]); ci->flags =3D new_flags; - if (new_flags =3D=3D CLUSTER_FLAG_FRAG) - si->frag_cluster_nr[ci->order]++; } =20 /* Add a cluster to discard list and schedule it to do discard */ @@ -476,39 +489,91 @@ static void swap_cluster_schedule_discard(struct swap= _info_struct *si, =20 static void __free_cluster(struct swap_info_struct *si, struct swap_cluste= r_info *ci) { - lockdep_assert_held(&si->lock); lockdep_assert_held(&ci->lock); move_cluster(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE); ci->order =3D 0; } =20 +/* + * Isolate and lock the first cluster that is not contented on a list, + * clean its flag before taken off-list. Cluster flag must be in sync + * with list status, so cluster updaters can always know the cluster + * list status without touching si lock. 
+ * + * Note it's possible that all clusters on a list are contented so + * this returns NULL for an non-empty list. + */ +static struct swap_cluster_info *isolate_lock_cluster( + struct swap_info_struct *si, struct list_head *list) +{ + struct swap_cluster_info *ci, *ret =3D NULL; + + spin_lock(&si->lock); + + if (unlikely(!(si->flags & SWP_WRITEOK))) + goto out; + + list_for_each_entry(ci, list, list) { + if (!spin_trylock(&ci->lock)) + continue; + + /* We may only isolate and clear flags of following lists */ + VM_BUG_ON(!ci->flags); + VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE && + ci->flags !=3D CLUSTER_FLAG_FULL); + + list_del(&ci->list); + ci->flags =3D CLUSTER_FLAG_NONE; + ret =3D ci; + break; + } +out: + spin_unlock(&si->lock); + + return ret; +} + /* * Doing discard actually. After a cluster discard is finished, the cluster - * will be added to free cluster list. caller should hold si->lock. -*/ -static void swap_do_scheduled_discard(struct swap_info_struct *si) + * will be added to free cluster list. Discard cluster is a bit special as + * they don't participate in allocation or reclaim, so clusters marked as + * CLUSTER_FLAG_DISCARD must remain off-list or on discard list. + */ +static bool swap_do_scheduled_discard(struct swap_info_struct *si) { struct swap_cluster_info *ci; + bool ret =3D false; unsigned int idx; =20 + spin_lock(&si->lock); while (!list_empty(&si->discard_clusters)) { ci =3D list_first_entry(&si->discard_clusters, struct swap_cluster_info,= list); + /* + * Delete the cluster from list to prepare for discard, but keep + * the CLUSTER_FLAG_DISCARD flag, there could be percpu_cluster + * pointing to it, or ran into by relocate_cluster. + */ list_del(&ci->list); - /* Must clear flag when taking a cluster off-list */ - ci->flags =3D CLUSTER_FLAG_NONE; idx =3D cluster_index(si, ci); spin_unlock(&si->lock); - discard_swap_cluster(si, idx * SWAPFILE_CLUSTER, SWAPFILE_CLUSTER); =20 - spin_lock(&si->lock); spin_lock(&ci->lock); - __free_cluster(si, ci); + /* + * Discard is done, clear its flags as it's off-list, then + * return the cluster to allocation list. + */ + ci->flags =3D CLUSTER_FLAG_NONE; memset(si->swap_map + idx * SWAPFILE_CLUSTER, 0, SWAPFILE_CLUSTER); + __free_cluster(si, ci); spin_unlock(&ci->lock); + ret =3D true; + spin_lock(&si->lock); } + spin_unlock(&si->lock); + return ret; } =20 static void swap_discard_work(struct work_struct *work) @@ -517,9 +582,7 @@ static void swap_discard_work(struct work_struct *work) =20 si =3D container_of(work, struct swap_info_struct, discard_work); =20 - spin_lock(&si->lock); swap_do_scheduled_discard(si); - spin_unlock(&si->lock); } =20 static void swap_users_ref_free(struct percpu_ref *ref) @@ -530,10 +593,14 @@ static void swap_users_ref_free(struct percpu_ref *re= f) complete(&si->comp); } =20 +/* + * Must be called after freeing if ci->count =3D=3D 0, moves the cluster t= o free + * or discard list. + */ static void free_cluster(struct swap_info_struct *si, struct swap_cluster_= info *ci) { VM_BUG_ON(ci->count !=3D 0); - lockdep_assert_held(&si->lock); + VM_BUG_ON(ci->flags =3D=3D CLUSTER_FLAG_FREE); lockdep_assert_held(&ci->lock); =20 /* @@ -550,6 +617,48 @@ static void free_cluster(struct swap_info_struct *si, = struct swap_cluster_info * __free_cluster(si, ci); } =20 +/* + * Must be called after freeing if ci->count !=3D 0, moves the cluster to + * nonfull list. 
+ */ +static void partial_free_cluster(struct swap_info_struct *si, + struct swap_cluster_info *ci) +{ + VM_BUG_ON(!ci->count || ci->count =3D=3D SWAPFILE_CLUSTER); + lockdep_assert_held(&ci->lock); + + if (ci->flags !=3D CLUSTER_FLAG_NONFULL) + move_cluster(si, ci, &si->nonfull_clusters[ci->order], + CLUSTER_FLAG_NONFULL); +} + +/* + * Must be called after allocation, moves the cluster to full or frag list. + * Note: allocation doesn't acquire si lock, and may drop the ci lock for + * reclaim, so the cluster could be any where when called. + */ +static void relocate_cluster(struct swap_info_struct *si, + struct swap_cluster_info *ci) +{ + lockdep_assert_held(&ci->lock); + + /* Discard cluster must remain off-list or on discard list */ + if (cluster_is_discard(ci)) + return; + + if (!ci->count) { + free_cluster(si, ci); + } else if (ci->count !=3D SWAPFILE_CLUSTER) { + if (ci->flags !=3D CLUSTER_FLAG_FRAG) + move_cluster(si, ci, &si->frag_clusters[ci->order], + CLUSTER_FLAG_FRAG); + } else { + if (ci->flags !=3D CLUSTER_FLAG_FULL) + move_cluster(si, ci, &si->full_clusters, + CLUSTER_FLAG_FULL); + } +} + /* * The cluster corresponding to page_nr will be used. The cluster will not= be * added to free cluster list and its usage counter will be increased by 1. @@ -568,30 +677,6 @@ static void inc_cluster_info_page(struct swap_info_str= uct *si, VM_BUG_ON(ci->flags); } =20 -/* - * The cluster ci decreases @nr_pages usage. If the usage counter becomes = 0, - * which means no page in the cluster is in use, we can optionally discard - * the cluster and add it to free cluster list. - */ -static void dec_cluster_info_page(struct swap_info_struct *si, - struct swap_cluster_info *ci, int nr_pages) -{ - VM_BUG_ON(ci->count < nr_pages); - VM_BUG_ON(cluster_is_free(ci)); - lockdep_assert_held(&si->lock); - lockdep_assert_held(&ci->lock); - ci->count -=3D nr_pages; - - if (!ci->count) { - free_cluster(si, ci); - return; - } - - if (ci->flags !=3D CLUSTER_FLAG_NONFULL) - move_cluster(si, ci, &si->nonfull_clusters[ci->order], - CLUSTER_FLAG_NONFULL); -} - static bool cluster_reclaim_range(struct swap_info_struct *si, struct swap_cluster_info *ci, unsigned long start, unsigned long end) @@ -601,8 +686,6 @@ static bool cluster_reclaim_range(struct swap_info_stru= ct *si, int nr_reclaim; =20 spin_unlock(&ci->lock); - spin_unlock(&si->lock); - do { switch (READ_ONCE(map[offset])) { case 0: @@ -620,9 +703,7 @@ static bool cluster_reclaim_range(struct swap_info_stru= ct *si, } } while (offset < end); out: - spin_lock(&si->lock); spin_lock(&ci->lock); - /* * Recheck the range no matter reclaim succeeded or not, the slot * could have been be freed while we are not holding the lock. 
@@ -636,11 +717,11 @@ static bool cluster_reclaim_range(struct swap_info_st= ruct *si, =20 static bool cluster_scan_range(struct swap_info_struct *si, struct swap_cluster_info *ci, - unsigned long start, unsigned int nr_pages) + unsigned long start, unsigned int nr_pages, + bool *need_reclaim) { unsigned long offset, end =3D start + nr_pages; unsigned char *map =3D si->swap_map; - bool need_reclaim =3D false; =20 for (offset =3D start; offset < end; offset++) { switch (READ_ONCE(map[offset])) { @@ -649,16 +730,13 @@ static bool cluster_scan_range(struct swap_info_struc= t *si, case SWAP_HAS_CACHE: if (!vm_swap_full()) return false; - need_reclaim =3D true; + *need_reclaim =3D true; continue; default: return false; } } =20 - if (need_reclaim) - return cluster_reclaim_range(si, ci, start, end); - return true; } =20 @@ -673,23 +751,17 @@ static bool cluster_alloc_range(struct swap_info_stru= ct *si, struct swap_cluster if (!(si->flags & SWP_WRITEOK)) return false; =20 - VM_BUG_ON(ci->flags =3D=3D CLUSTER_FLAG_NONE); - VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE); - - if (cluster_is_free(ci)) { - if (nr_pages < SWAPFILE_CLUSTER) - move_cluster(si, ci, &si->nonfull_clusters[order], - CLUSTER_FLAG_NONFULL); + /* + * The first allocation in a cluster makes the + * cluster exclusive to this order + */ + if (cluster_is_empty(ci)) ci->order =3D order; - } =20 memset(si->swap_map + start, usage, nr_pages); swap_range_alloc(si, nr_pages); ci->count +=3D nr_pages; =20 - if (ci->count =3D=3D SWAPFILE_CLUSTER) - move_cluster(si, ci, &si->full_clusters, CLUSTER_FLAG_FULL); - return true; } =20 @@ -700,37 +772,55 @@ static unsigned int alloc_swap_scan_cluster(struct sw= ap_info_struct *si, unsigne unsigned long start =3D offset & ~(SWAPFILE_CLUSTER - 1); unsigned long end =3D min(start + SWAPFILE_CLUSTER, si->max); unsigned int nr_pages =3D 1 << order; + bool need_reclaim, ret; struct swap_cluster_info *ci; =20 - if (end < nr_pages) - return SWAP_NEXT_INVALID; - end -=3D nr_pages; + ci =3D &si->cluster_info[offset / SWAPFILE_CLUSTER]; + lockdep_assert_held(&ci->lock); =20 - ci =3D lock_cluster(si, offset); - if (ci->count + nr_pages > SWAPFILE_CLUSTER) { + if (end < nr_pages || ci->count + nr_pages > SWAPFILE_CLUSTER) { offset =3D SWAP_NEXT_INVALID; - goto done; + goto out; } =20 - while (offset <=3D end) { - if (cluster_scan_range(si, ci, offset, nr_pages)) { - if (!cluster_alloc_range(si, ci, offset, usage, order)) { - offset =3D SWAP_NEXT_INVALID; - goto done; - } - *foundp =3D offset; - if (ci->count =3D=3D SWAPFILE_CLUSTER) { + for (end -=3D nr_pages; offset <=3D end; offset +=3D nr_pages) { + need_reclaim =3D false; + if (!cluster_scan_range(si, ci, offset, nr_pages, &need_reclaim)) + continue; + if (need_reclaim) { + ret =3D cluster_reclaim_range(si, ci, start, end); + /* + * Reclaim drops ci->lock and cluster could be used + * by another order. Not checking flag as off-list + * cluster has no flag set, and change of list + * won't cause fragmentation. 
+ */ + if (!cluster_is_usable(ci, order)) { offset =3D SWAP_NEXT_INVALID; - goto done; + goto out; } - offset +=3D nr_pages; - break; + if (cluster_is_empty(ci)) + offset =3D start; + /* Reclaim failed but cluster is usable, try next */ + if (!ret) + continue; + } + if (!cluster_alloc_range(si, ci, offset, usage, order)) { + offset =3D SWAP_NEXT_INVALID; + goto out; + } + *foundp =3D offset; + if (ci->count =3D=3D SWAPFILE_CLUSTER) { + offset =3D SWAP_NEXT_INVALID; + goto out; } offset +=3D nr_pages; + break; } if (offset > end) offset =3D SWAP_NEXT_INVALID; -done: +out: + relocate_cluster(si, ci); unlock_cluster(ci); return offset; } @@ -747,18 +837,17 @@ static void swap_reclaim_full_clusters(struct swap_in= fo_struct *si, bool force) if (force) to_scan =3D swap_usage_in_pages(si) / SWAPFILE_CLUSTER; =20 - while (!list_empty(&si->full_clusters)) { - ci =3D list_first_entry(&si->full_clusters, struct swap_cluster_info, li= st); - list_move_tail(&ci->list, &si->full_clusters); + while ((ci =3D isolate_lock_cluster(si, &si->full_clusters))) { offset =3D cluster_offset(si, ci); end =3D min(si->max, offset + SWAPFILE_CLUSTER); to_scan--; =20 - spin_unlock(&si->lock); while (offset < end) { if (READ_ONCE(map[offset]) =3D=3D SWAP_HAS_CACHE) { + spin_unlock(&ci->lock); nr_reclaim =3D __try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT); + spin_lock(&ci->lock); if (nr_reclaim) { offset +=3D abs(nr_reclaim); continue; @@ -766,8 +855,8 @@ static void swap_reclaim_full_clusters(struct swap_info= _struct *si, bool force) } offset++; } - spin_lock(&si->lock); =20 + unlock_cluster(ci); if (to_scan <=3D 0) break; } @@ -779,9 +868,7 @@ static void swap_reclaim_work(struct work_struct *work) =20 si =3D container_of(work, struct swap_info_struct, reclaim_work); =20 - spin_lock(&si->lock); swap_reclaim_full_clusters(si, true); - spin_unlock(&si->lock); } =20 /* @@ -792,29 +879,34 @@ static void swap_reclaim_work(struct work_struct *wor= k) static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,= int order, unsigned char usage) { - struct percpu_cluster *cluster; struct swap_cluster_info *ci; unsigned int offset, found =3D 0; =20 -new_cluster: - lockdep_assert_held(&si->lock); - cluster =3D this_cpu_ptr(si->percpu_cluster); - offset =3D cluster->next[order]; + /* Fast path using per CPU cluster */ + local_lock(&si->percpu_cluster->lock); + offset =3D __this_cpu_read(si->percpu_cluster->next[order]); if (offset) { - offset =3D alloc_swap_scan_cluster(si, offset, &found, order, usage); + ci =3D lock_cluster(si, offset); + /* Cluster could have been used by another order */ + if (cluster_is_usable(ci, order)) { + if (cluster_is_empty(ci)) + offset =3D cluster_offset(si, ci); + offset =3D alloc_swap_scan_cluster(si, offset, &found, + order, usage); + } else { + unlock_cluster(ci); + } if (found) goto done; } =20 - if (!list_empty(&si->free_clusters)) { - ci =3D list_first_entry(&si->free_clusters, struct swap_cluster_info, li= st); - offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, o= rder, usage); - /* - * Either we didn't touch the cluster due to swapoff, - * or the allocation must success. 
- */ - VM_BUG_ON((si->flags & SWP_WRITEOK) && !found); - goto done; +new_cluster: + ci =3D isolate_lock_cluster(si, &si->free_clusters); + if (ci) { + offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), + &found, order, usage); + if (found) + goto done; } =20 /* Try reclaim from full clusters if free clusters list is drained */ @@ -822,49 +914,42 @@ static unsigned long cluster_alloc_swap_entry(struct = swap_info_struct *si, int o swap_reclaim_full_clusters(si, false); =20 if (order < PMD_ORDER) { - unsigned int frags =3D 0; + unsigned int frags =3D 0, frags_existing; =20 - while (!list_empty(&si->nonfull_clusters[order])) { - ci =3D list_first_entry(&si->nonfull_clusters[order], - struct swap_cluster_info, list); - move_cluster(si, ci, &si->frag_clusters[order], CLUSTER_FLAG_FRAG); + while ((ci =3D isolate_lock_cluster(si, &si->nonfull_clusters[order]))) { offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage); - frags++; if (found) goto done; + /* Clusters failed to allocate are moved to frag_clusters */ + frags++; } =20 - /* - * Nonfull clusters are moved to frag tail if we reached - * here, count them too, don't over scan the frag list. - */ - while (frags < si->frag_cluster_nr[order]) { - ci =3D list_first_entry(&si->frag_clusters[order], - struct swap_cluster_info, list); + frags_existing =3D atomic_long_read(&si->frag_cluster_nr[order]); + while (frags < frags_existing && + (ci =3D isolate_lock_cluster(si, &si->frag_clusters[order]))) { + atomic_long_dec(&si->frag_cluster_nr[order]); /* - * Rotate the frag list to iterate, they were all failing - * high order allocation or moved here due to per-CPU usage, - * this help keeping usable cluster ahead. + * Rotate the frag list to iterate, they were all + * failing high order allocation or moved here due to + * per-CPU usage, but they could contain newly released + * reclaimable (eg. lazy-freed swap cache) slots. */ - list_move_tail(&ci->list, &si->frag_clusters[order]); offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage); - frags++; if (found) goto done; + frags++; } } =20 - if (!list_empty(&si->discard_clusters)) { - /* - * we don't have free cluster but have some clusters in - * discarding, do discard now and reclaim them, then - * reread cluster_next_cpu since we dropped si->lock - */ - swap_do_scheduled_discard(si); + /* + * We don't have free cluster but have some clusters in + * discarding, do discard now and reclaim them, then + * reread cluster_next_cpu since we dropped si->lock + */ + if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si)) goto new_cluster; - } =20 if (order) goto done; @@ -875,26 +960,25 @@ static unsigned long cluster_alloc_swap_entry(struct = swap_info_struct *si, int o * Clusters here have at least one usable slots and can't fail order 0 * allocation, but reclaim may drop si->lock and race with another user. 
*/ - while (!list_empty(&si->frag_clusters[o])) { - ci =3D list_first_entry(&si->frag_clusters[o], - struct swap_cluster_info, list); + while ((ci =3D isolate_lock_cluster(si, &si->frag_clusters[o]))) { + atomic_long_dec(&si->frag_cluster_nr[o]); offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), - &found, 0, usage); + &found, order, usage); if (found) goto done; } =20 - while (!list_empty(&si->nonfull_clusters[o])) { - ci =3D list_first_entry(&si->nonfull_clusters[o], - struct swap_cluster_info, list); + while ((ci =3D isolate_lock_cluster(si, &si->nonfull_clusters[o]))) { offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), - &found, 0, usage); + &found, order, usage); if (found) goto done; } } done: - cluster->next[order] =3D offset; + __this_cpu_write(si->percpu_cluster->next[order], offset); + local_unlock(&si->percpu_cluster->lock); + return found; } =20 @@ -1158,14 +1242,11 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entr= ies[], int entry_order) plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]); spin_unlock(&swap_avail_lock); if (get_swap_device_info(si)) { - spin_lock(&si->lock); n_ret =3D scan_swap_map_slots(si, SWAP_HAS_CACHE, n_goal, swp_entries, order); - spin_unlock(&si->lock); put_swap_device(si); if (n_ret || size > 1) goto check_out; - cond_resched(); } =20 spin_lock(&swap_avail_lock); @@ -1378,9 +1459,7 @@ static bool __swap_entries_free(struct swap_info_stru= ct *si, if (!has_cache) { for (i =3D 0; i < nr; i++) zswap_invalidate(swp_entry(si->type, offset + i)); - spin_lock(&si->lock); swap_entry_range_free(si, entry, nr); - spin_unlock(&si->lock); } return has_cache; =20 @@ -1409,16 +1488,27 @@ static void swap_entry_range_free(struct swap_info_= struct *si, swp_entry_t entry unsigned char *map_end =3D map + nr_pages; struct swap_cluster_info *ci; =20 + /* It should never free entries across different clusters */ + VM_BUG_ON((offset / SWAPFILE_CLUSTER) !=3D ((offset + nr_pages - 1) / SWA= PFILE_CLUSTER)); + ci =3D lock_cluster(si, offset); + VM_BUG_ON(cluster_is_empty(ci)); + VM_BUG_ON(ci->count < nr_pages); + + ci->count -=3D nr_pages; do { VM_BUG_ON(*map !=3D SWAP_HAS_CACHE); *map =3D 0; } while (++map < map_end); - dec_cluster_info_page(si, ci, nr_pages); - unlock_cluster(ci); =20 mem_cgroup_uncharge_swap(entry, nr_pages); swap_range_free(si, offset, nr_pages); + + if (!ci->count) + free_cluster(si, ci); + else + partial_free_cluster(si, ci); + unlock_cluster(ci); } =20 static void cluster_swap_free_nr(struct swap_info_struct *si, @@ -1490,9 +1580,7 @@ void put_swap_folio(struct folio *folio, swp_entry_t = entry) ci =3D lock_cluster(si, offset); if (size > 1 && swap_is_has_cache(si, offset, size)) { unlock_cluster(ci); - spin_lock(&si->lock); swap_entry_range_free(si, entry, size); - spin_unlock(&si->lock); return; } for (int i =3D 0; i < size; i++, entry.val++) { @@ -1507,46 +1595,19 @@ void put_swap_folio(struct folio *folio, swp_entry_= t entry) unlock_cluster(ci); } =20 -static int swp_entry_cmp(const void *ent1, const void *ent2) -{ - const swp_entry_t *e1 =3D ent1, *e2 =3D ent2; - - return (int)swp_type(*e1) - (int)swp_type(*e2); -} - void swapcache_free_entries(swp_entry_t *entries, int n) { - struct swap_info_struct *si, *prev; int i; + struct swap_info_struct *si =3D NULL; =20 if (n <=3D 0) return; =20 - prev =3D NULL; - si =3D NULL; - - /* - * Sort swap entries by swap device, so each lock is only taken once. 
- * nr_swapfiles isn't absolutely correct, but the overhead of sort() is - * so low that it isn't necessary to optimize further. - */ - if (nr_swapfiles > 1) - sort(entries, n, sizeof(entries[0]), swp_entry_cmp, NULL); for (i =3D 0; i < n; ++i) { si =3D _swap_info_get(entries[i]); - - if (si !=3D prev) { - if (prev !=3D NULL) - spin_unlock(&prev->lock); - if (si !=3D NULL) - spin_lock(&si->lock); - } if (si) swap_entry_range_free(si, entries[i], 1); - prev =3D si; } - if (si) - spin_unlock(&si->lock); } =20 int __swap_count(swp_entry_t entry) @@ -1799,10 +1860,8 @@ swp_entry_t get_swap_page_of_type(int type) =20 /* This is called for allocating swap entry, not cache */ if (get_swap_device_info(si)) { - spin_lock(&si->lock); if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0= )) atomic_long_dec(&nr_swap_pages); - spin_unlock(&si->lock); put_swap_device(si); } fail: @@ -3142,6 +3201,7 @@ static struct swap_cluster_info *setup_clusters(struc= t swap_info_struct *si, cluster =3D per_cpu_ptr(si->percpu_cluster, cpu); for (i =3D 0; i < SWAP_NR_ORDERS; i++) cluster->next[i] =3D SWAP_NEXT_INVALID; + local_lock_init(&cluster->lock); } =20 /* @@ -3165,7 +3225,7 @@ static struct swap_cluster_info *setup_clusters(struc= t swap_info_struct *si, for (i =3D 0; i < SWAP_NR_ORDERS; i++) { INIT_LIST_HEAD(&si->nonfull_clusters[i]); INIT_LIST_HEAD(&si->frag_clusters[i]); - si->frag_cluster_nr[i] =3D 0; + atomic_long_set(&si->frag_cluster_nr[i], 0); } =20 /* @@ -3647,7 +3707,6 @@ int add_swap_count_continuation(swp_entry_t entry, gf= p_t gfp_mask) */ goto outer; } - spin_lock(&si->lock); =20 offset =3D swp_offset(entry); =20 @@ -3712,7 +3771,6 @@ int add_swap_count_continuation(swp_entry_t entry, gf= p_t gfp_mask) spin_unlock(&si->cont_lock); out: unlock_cluster(ci); - spin_unlock(&si->lock); put_swap_device(si); outer: if (page) --=20 2.47.1