From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song, "Huang, Ying", linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH 1/2] mm, swap: don't scan every fragment cluster
Date: Tue, 5 Aug 2025 01:24:38 +0800
Message-ID: <20250804172439.2331-2-ryncsn@gmail.com>
In-Reply-To: <20250804172439.2331-1-ryncsn@gmail.com>
References: <20250804172439.2331-1-ryncsn@gmail.com>

From: Kairui Song

Fragment clusters were mostly failing high order allocation already. The
reason we scan them now is that a swap slot may get freed without
releasing the swap cache, so a swap map entry will end up in HAS_CACHE
only status, and the cluster won't be moved back to the non-full or free
cluster list. This usually only happens for !SWP_SYNCHRONOUS_IO devices
when the swap device usage is low (!vm_swap_full()), since swap will try
to lazily free the swap cache.

It's unlikely to cause any real issue. Fragmentation is only a concern
when the device is getting full, and by that time, swap will already be
releasing the swap cache aggressively. Swap cache reclaim also happens
when the allocator scans a cluster, so scanning one fragment cluster
should be enough to reclaim these pinned slots. Besides, only high order
allocation requires iterating over a cluster list; order 0 allocation
succeeds on the first attempt, and high order allocation failure isn't a
serious problem.

So the benefit of iterating the fragment clusters is trivial, but the
iteration slows down mTHP allocation by a lot when the fragment cluster
list is long. It's better to drop this fragment cluster iteration design.
Scanning only one fragment cluster is good enough in case any cluster is
stuck in the fragment list; this ensures order 0 allocation never fails,
and large allocations still have an acceptable success rate.
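[Editor's illustration] To make the new control flow concrete, here is a
compilable toy sketch of the high order path after this patch. Every type
and helper below is a simplified stand-in invented for illustration (the
real code works on struct swap_cluster_info with per-cluster locking via
isolate_lock_cluster() and alloc_swap_scan_cluster()); only the shape of
the loops matches the patch.

/*
 * Toy model, NOT kernel code: a cluster holds 8 slots in a bitmap and
 * cluster lists are bare singly linked chains.
 */
#include <stdio.h>

struct cluster {
	unsigned char used;	/* bit i set => slot i is in use */
	struct cluster *next;
};

/* Pop the list head, like isolate_lock_cluster() minus the locking. */
static struct cluster *isolate_first(struct cluster **list)
{
	struct cluster *ci = *list;

	if (ci)
		*list = ci->next;
	return ci;
}

/* Find 1 << order contiguous free slots, a much-simplified stand-in for
 * alloc_swap_scan_cluster(). Returns slot index + 1, or 0 on failure. */
static int scan_cluster(struct cluster *ci, int order)
{
	int run = 1 << order;

	for (int i = 0; i + run <= 8; i++) {
		int j;

		for (j = 0; j < run && !(ci->used & (1u << (i + j))); j++)
			;
		if (j == run)
			return i + 1;
	}
	return 0;
}

/*
 * The shape of the high order path after this patch: exhaust the
 * nonfull list, then probe exactly ONE fragment cluster instead of
 * walking the whole fragment list.
 */
static int alloc_high_order(struct cluster **nonfull, struct cluster **frag,
			    int order)
{
	struct cluster *ci;
	int found;

	while ((ci = isolate_first(nonfull))) {
		found = scan_cluster(ci, order);
		if (found)
			return found;
	}

	/* One probe still rotates the fragment list and lets pinned
	 * HAS_CACHE-only slots get reclaimed over time. */
	ci = isolate_first(frag);
	if (ci && (found = scan_cluster(ci, order)))
		return found;

	return 0;	/* caller falls back to a smaller order */
}

int main(void)
{
	struct cluster a = { 0xff, NULL };	/* full */
	struct cluster b = { 0x0f, &a };	/* slots 4-7 free */
	struct cluster *nonfull = &b, *frag = NULL;

	/* order-2 needs 4 contiguous slots: found at index 4 in 'b' */
	printf("found at slot %d\n", alloc_high_order(&nonfull, &frag, 2) - 1);
	return 0;
}

The design point, per the commit message above, is that the single probe
keeps the fragment list rotating, so stuck clusters are still revisited
eventually without paying a full list walk on every allocation.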
Test on a 48c96t system, building the Linux kernel with make -j48,
defconfig, using 10G ZRAM, a 768M cgroup memory limit, on top of tmpfs,
4K folios only:
Before: sys time: 4407.28s
After:  sys time: 4425.22s

Change to make -j96, 2G memory limit, 64kB mTHP enabled, and 10G ZRAM:
Before: sys time: 10230.22s  64kB/swpout: 1793044  64kB/swpout_fallback: 17653
After:  sys time:  5527.90s  64kB/swpout: 1789358  64kB/swpout_fallback: 17813

Change to 8G ZRAM:
Before: sys time: 21929.17s  64kB/swpout: 1634681  64kB/swpout_fallback: 173056
After:  sys time:  6121.01s  64kB/swpout: 1638155  64kB/swpout_fallback: 189562

Change to use a 10G brd device with the SWP_SYNCHRONOUS_IO flag removed:
Before: sys time: 7368.41s  64kB/swpout: 1787599  swpout_fallback: 0
After:  sys time: 7338.27s  64kB/swpout: 1783106  swpout_fallback: 0

Change to use an 8G brd device with the SWP_SYNCHRONOUS_IO flag removed:
Before: sys time: 28139.60s  64kB/swpout: 1645421  swpout_fallback: 148408
After:  sys time:  8941.90s  64kB/swpout: 1592973  swpout_fallback: 265010

The performance is a lot better, and the large order allocation failure
rate is only very slightly higher or unchanged.

Signed-off-by: Kairui Song
Acked-by: Nhat Pham
---
 include/linux/swap.h |  1 -
 mm/swapfile.c        | 30 ++++++++----------------------
 2 files changed, 8 insertions(+), 23 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2fe6ed2cc3fd..a060d102e0d1 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -310,7 +310,6 @@ struct swap_info_struct {
 	/* list of cluster that contains at least one free slot */
 	struct list_head frag_clusters[SWAP_NR_ORDERS];
 	/* list of cluster that are fragmented or contented */
-	atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
 	unsigned int pages;		/* total of usable pages of swap */
 	atomic_long_t inuse_pages;	/* number of those currently in use */
 	struct swap_sequential_cluster *global_cluster; /* Use one global cluster for rotating device */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index b4f3cc712580..5fdb3cb2b8b7 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -470,11 +470,6 @@ static void move_cluster(struct swap_info_struct *si,
 	else
 		list_move_tail(&ci->list, list);
 	spin_unlock(&si->lock);
-
-	if (ci->flags == CLUSTER_FLAG_FRAG)
-		atomic_long_dec(&si->frag_cluster_nr[ci->order]);
-	else if (new_flags == CLUSTER_FLAG_FRAG)
-		atomic_long_inc(&si->frag_cluster_nr[ci->order]);
 	ci->flags = new_flags;
 }
 
@@ -926,32 +921,25 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		swap_reclaim_full_clusters(si, false);
 
 	if (order < PMD_ORDER) {
-		unsigned int frags = 0, frags_existing;
-
 		while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) {
 			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
 							order, usage);
 			if (found)
 				goto done;
-			/* Clusters failed to allocate are moved to frag_clusters */
-			frags++;
 		}
 
-		frags_existing = atomic_long_read(&si->frag_cluster_nr[order]);
-		while (frags < frags_existing &&
-		       (ci = isolate_lock_cluster(si, &si->frag_clusters[order]))) {
-			atomic_long_dec(&si->frag_cluster_nr[order]);
-			/*
-			 * Rotate the frag list to iterate, they were all
-			 * failing high order allocation or moved here due to
-			 * per-CPU usage, but they could contain newly released
-			 * reclaimable (eg. lazy-freed swap cache) slots.
-			 */
+		/*
+		 * Scanning only one fragment cluster is good enough. Order 0
+		 * allocation will surely succeed, and large allocation
+		 * failure is not critical.
+		 * Scanning one cluster still keeps
+		 * the list rotated and reclaimed (for HAS_CACHE).
+		 */
+		ci = isolate_lock_cluster(si, &si->frag_clusters[order]);
+		if (ci) {
 			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
 							order, usage);
 			if (found)
 				goto done;
-			frags++;
 		}
 	}
 
@@ -972,7 +960,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 	 * allocation, but reclaim may drop si->lock and race with another user.
 	 */
 	while ((ci = isolate_lock_cluster(si, &si->frag_clusters[o]))) {
-		atomic_long_dec(&si->frag_cluster_nr[o]);
 		found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
 						0, usage);
 		if (found)
@@ -3224,7 +3211,6 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	for (i = 0; i < SWAP_NR_ORDERS; i++) {
 		INIT_LIST_HEAD(&si->nonfull_clusters[i]);
 		INIT_LIST_HEAD(&si->frag_clusters[i]);
-		atomic_long_set(&si->frag_cluster_nr[i], 0);
 	}
 
 	/*
-- 
2.50.1

From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song, "Huang, Ying", linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH 2/2] mm, swap: prefer nonfull over free clusters
Date: Tue, 5 Aug 2025 01:24:39 +0800
Message-ID: <20250804172439.2331-3-ryncsn@gmail.com>
In-Reply-To: <20250804172439.2331-1-ryncsn@gmail.com>
References: <20250804172439.2331-1-ryncsn@gmail.com>

From: Kairui Song

We prefer a free cluster over a nonfull cluster whenever a CPU local
cluster is drained, to respect the SSD discard behavior [1]. That is not
the best practice for non-discarding devices, and it causes a higher
fragmentation rate. So for a non-discarding device, prefer nonfull
clusters over free clusters. This reduces the fragmentation issue by a
lot.
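[Editor's illustration] A compilable toy sketch of the preference order
after this patch follows. All names here are illustrative stand-ins (the
real check is si->flags & SWP_PAGE_DISCARD inside
cluster_alloc_swap_entry(), with proper locking and the goto done paths
shown in the diff below); only the order of the list probes matches the
patch.

/*
 * Toy model again, NOT kernel code; see the sketch under patch 1 for a
 * bitmap-based scan_cluster(). Here only the probe ORDER matters, so
 * the scan helper is a stub that always "succeeds".
 */
#include <stdbool.h>
#include <stddef.h>

struct cluster { struct cluster *next; };

/* Pop the list head, like isolate_lock_cluster() minus the locking. */
static struct cluster *isolate_first(struct cluster **list)
{
	struct cluster *ci = *list;

	if (ci)
		*list = ci->next;
	return ci;
}

/* Stub for alloc_swap_scan_cluster(): nonzero means success. */
static int scan_cluster(struct cluster *ci, int order)
{
	(void)ci;
	return order >= 0;
}

/*
 * Preference order after this patch:
 *   discarding device:     free -> nonfull -> reclaim -> one fragment
 *   non-discarding device: nonfull -> free -> reclaim -> one fragment
 * Draining free clusters first only pays off when whole clusters can
 * later be discarded; otherwise it just spreads writes out and raises
 * fragmentation.
 */
static int alloc_entry(struct cluster **free_list, struct cluster **nonfull,
		       bool page_discard, int order)
{
	struct cluster *ci;
	int found;

	if (page_discard && (ci = isolate_first(free_list)) &&
	    (found = scan_cluster(ci, order)))
		return found;

	while ((ci = isolate_first(nonfull))) {
		found = scan_cluster(ci, order);
		if (found)
			return found;
	}

	if (!page_discard && (ci = isolate_first(free_list)) &&
	    (found = scan_cluster(ci, order)))
		return found;

	/* ...then full cluster reclaim and the single fragment probe
	 * introduced by patch 1. */
	return 0;
}

int main(void)
{
	struct cluster c1 = { NULL }, c2 = { NULL };
	struct cluster *free_list = &c1, *nonfull = &c2;

	/* non-discarding device: the nonfull cluster is consumed first */
	return !alloc_entry(&free_list, &nonfull, false, 0);
}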
Testing with make -j96, defconfig, using 64k mTHP, 8G ZRAM:
Before: sys time: 6121.0s  64kB/swpout: 1638155  64kB/swpout_fallback: 189562
After:  sys time: 6145.3s  64kB/swpout: 1761110  64kB/swpout_fallback: 66071

Testing with make -j96, defconfig, using 64k mTHP, 10G ZRAM:
Before: sys time: 5527.9s  64kB/swpout: 1789358  64kB/swpout_fallback: 17813
After:  sys time: 5538.3s  64kB/swpout: 1813133  64kB/swpout_fallback: 0

Performance is basically unchanged, and the large allocation failure rate
is lower. Enabling all mTHP sizes showed a more significant result. Using
the same test setup with 10G ZRAM and all mTHP sizes enabled:

128kB swap failure rate:
Before: swpout: 449548  swpout_fallback: 55894
After:  swpout: 497519  swpout_fallback: 3204

256kB swap failure rate:
Before: swpout: 63938  swpout_fallback: 2154
After:  swpout: 65698  swpout_fallback: 324

512kB swap failure rate:
Before: swpout: 11971  swpout_fallback: 2218
After:  swpout: 14606  swpout_fallback: 4

2M swap failure rate:
Before: swpout: 12    swpout_fallback: 1578
After:  swpout: 1253  swpout_fallback: 15

The success rate of large allocations is much higher.

Link: https://lore.kernel.org/linux-mm/87v8242vng.fsf@yhuang6-desk2.ccr.corp.intel.com/ [1]
Signed-off-by: Kairui Song
---
 mm/swapfile.c | 38 ++++++++++++++++++++++++++++----------
 1 file changed, 28 insertions(+), 10 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5fdb3cb2b8b7..4a0cf4fb348d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -908,18 +908,20 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 	}
 
 new_cluster:
-	ci = isolate_lock_cluster(si, &si->free_clusters);
-	if (ci) {
-		found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
-						order, usage);
-		if (found)
-			goto done;
+	/*
+	 * If the device needs discard, prefer a new cluster over nonfull
+	 * to spread out the writes.
+	 */
+	if (si->flags & SWP_PAGE_DISCARD) {
+		ci = isolate_lock_cluster(si, &si->free_clusters);
+		if (ci) {
+			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
+							order, usage);
+			if (found)
+				goto done;
+		}
 	}
 
-	/* Try reclaim from full clusters if free clusters list is drained */
-	if (vm_swap_full())
-		swap_reclaim_full_clusters(si, false);
-
 	if (order < PMD_ORDER) {
 		while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) {
 			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
@@ -927,7 +929,23 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 			if (found)
 				goto done;
 		}
+	}
 
+	if (!(si->flags & SWP_PAGE_DISCARD)) {
+		ci = isolate_lock_cluster(si, &si->free_clusters);
+		if (ci) {
+			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
+							order, usage);
+			if (found)
+				goto done;
+		}
+	}
+
+	/* Try reclaim full clusters if free and nonfull lists are drained */
+	if (vm_swap_full())
+		swap_reclaim_full_clusters(si, false);
+
+	if (order < PMD_ORDER) {
 		/*
 		 * Scanning only one fragment cluster is good enough. Order 0
 		 * allocation will surely succeed, and large allocation
-- 
2.50.1