From: Barry Song <21cnbao@gmail.com>
To: akpm@linux-foundation.org, linux-mm@kvack.org
Cc: baolin.wang@linux.alibaba.com, chrisl@kernel.org, david@redhat.com,
	hannes@cmpxchg.org, hughd@google.com, kaleshsingh@google.com,
	kasong@tencent.com, linux-kernel@vger.kernel.org, mhocko@suse.com,
	minchan@kernel.org, nphamcs@gmail.com, ryan.roberts@arm.com,
	senozhatsky@chromium.org, shakeel.butt@linux.dev, shy828301@gmail.com,
	surenb@google.com, v-songbaohua@oppo.com, willy@infradead.org,
	xiang@kernel.org, ying.huang@intel.com, yosryahmed@google.com,
	hch@infradead.org
Subject: [PATCH v6 1/2] mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios
Date: Sat, 3 Aug 2024 00:20:30 +1200
Message-Id: <20240802122031.117548-2-21cnbao@gmail.com>
In-Reply-To: <20240802122031.117548-1-21cnbao@gmail.com>
References: <20240726094618.401593-1-21cnbao@gmail.com> <20240802122031.117548-1-21cnbao@gmail.com>

From: Barry Song

With large folio swap-in, we might need to uncharge multiple swap
entries all together, so add an nr argument to
mem_cgroup_swapin_uncharge_swap(). The two existing users simply
pass nr=1.

Signed-off-by: Barry Song
Acked-by: Chris Li
---
 include/linux/memcontrol.h | 5 +++--
 mm/memcontrol.c            | 7 ++++---
 mm/memory.c                | 2 +-
 mm/swap_state.c            | 2 +-
 4 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1b79760af685..44f7fb7dc0c8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -682,7 +682,8 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
 
 int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
 				   gfp_t gfp, swp_entry_t entry);
-void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry);
+
+void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr_pages);
 
 void __mem_cgroup_uncharge(struct folio *folio);
 
@@ -1181,7 +1182,7 @@ static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
 	return 0;
 }
 
-static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry)
+static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr)
 {
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b889a7fbf382..5d763c234c44 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4572,14 +4572,15 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
 
 /*
  * mem_cgroup_swapin_uncharge_swap - uncharge swap slot
- * @entry: swap entry for which the page is charged
+ * @entry: the first swap entry for which the pages are charged
+ * @nr_pages: number of pages which will be uncharged
  *
  * Call this function after successfully adding the charged page to swapcache.
  *
  * Note: This function assumes the page for which swap slot is being uncharged
 * is order 0 page.
 */
-void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry)
+void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
 {
 	/*
 	 * Cgroup1's unified memory+swap counter has been charged with the
@@ -4599,7 +4600,7 @@ void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry)
 		 * let's not wait for it. The page already received a
 		 * memory+swap charge, drop the swap entry duplicate.
 		 */
-		mem_cgroup_uncharge_swap(entry, 1);
+		mem_cgroup_uncharge_swap(entry, nr_pages);
 	}
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 4c8716cb306c..4cf4902db1ec 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4102,7 +4102,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			ret = VM_FAULT_OOM;
 			goto out_page;
 		}
-		mem_cgroup_swapin_uncharge_swap(entry);
+		mem_cgroup_swapin_uncharge_swap(entry, 1);
 
 		shadow = get_shadow_from_swap_cache(entry);
 		if (shadow)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 293ff1afdca4..1159e3225754 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -522,7 +522,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	if (add_to_swap_cache(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow))
 		goto fail_unlock;
 
-	mem_cgroup_swapin_uncharge_swap(entry);
+	mem_cgroup_swapin_uncharge_swap(entry, 1);
 
 	if (shadow)
 		workingset_refault(new_folio, shadow);
-- 
2.34.1
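
For context, patch 2/2 below is the first caller to pass nr > 1. A minimal
stand-alone sketch of the call pattern this widened helper enables (the
values and the user-space ALIGN_DOWN macro are illustrative only, not part
of the series):

    #include <stdio.h>

    /*
     * Illustrative only: mimic how a large-folio swap-in derives the first
     * swap slot of the run before uncharging all nr_pages entries at once.
     * This ALIGN_DOWN matches the kernel's for power-of-two alignments.
     */
    #define ALIGN_DOWN(x, a) ((x) & ~((unsigned long)(a) - 1UL))

    int main(void)
    {
            unsigned long entry_val = 0x5634; /* hypothetical faulting swap entry */
            unsigned long nr_pages = 16;      /* folio_nr_pages() of an order-4 mTHP */

            entry_val = ALIGN_DOWN(entry_val, nr_pages);
            printf("uncharge %lu entries starting at slot 0x%lx\n",
                   nr_pages, entry_val);
            return 0;
    }

The same two steps appear in do_swap_page() in patch 2/2:
nr_pages = folio_nr_pages(folio), entry.val = ALIGN_DOWN(entry.val,
nr_pages), then mem_cgroup_swapin_uncharge_swap(entry, nr_pages).
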
From: Barry Song <21cnbao@gmail.com>
To: akpm@linux-foundation.org, linux-mm@kvack.org
Cc: baolin.wang@linux.alibaba.com, chrisl@kernel.org, david@redhat.com,
	hannes@cmpxchg.org, hughd@google.com, kaleshsingh@google.com,
	kasong@tencent.com, linux-kernel@vger.kernel.org, mhocko@suse.com,
	minchan@kernel.org, nphamcs@gmail.com, ryan.roberts@arm.com,
	senozhatsky@chromium.org, shakeel.butt@linux.dev, shy828301@gmail.com,
	surenb@google.com, v-songbaohua@oppo.com, willy@infradead.org,
	xiang@kernel.org, ying.huang@intel.com, yosryahmed@google.com,
	hch@infradead.org, Chuanhua Han
Subject: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices
Date: Sat, 3 Aug 2024 00:20:31 +1200
Message-Id: <20240802122031.117548-3-21cnbao@gmail.com>
In-Reply-To: <20240802122031.117548-1-21cnbao@gmail.com>
References: <20240726094618.401593-1-21cnbao@gmail.com> <20240802122031.117548-1-21cnbao@gmail.com>

From: Chuanhua Han

Currently we have mTHP features, but unfortunately, without support for
large folio swap-in, these large folios are lost once they are swapped
out: mTHP swap is a one-way process. The lack of mTHP swap-in prevents
mTHP from being used on devices like Android that rely heavily on swap.

This patch introduces mTHP swap-in support. It starts with sync devices
such as zRAM. This is probably the simplest and most common use case,
benefiting billions of Android phones and similar devices with minimal
implementation cost.
In this straightforward scenario, large folios are always exclusive,
eliminating the need to handle complex rmap and swapcache issues.

It offers several benefits:

1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
   swap-out and swap-in. Large folios in the buddy system are also
   preserved as much as possible, rather than being fragmented due
   to swap-in.

2. Eliminates fragmentation in swap slots and supports successful
   THP_SWPOUT.

   w/o this patch (refer to the data from Chris's and Kairui's latest
   swap allocator optimization while running ./thp_swap_allocator_test
   w/o the "-a" option [1]):

   ./thp_swap_allocator_test
   Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 2: swpout inc: 131, swpout fallback inc: 101, Fallback percentage: 43.53%
   Iteration 3: swpout inc: 71, swpout fallback inc: 155, Fallback percentage: 68.58%
   Iteration 4: swpout inc: 55, swpout fallback inc: 168, Fallback percentage: 75.34%
   Iteration 5: swpout inc: 35, swpout fallback inc: 191, Fallback percentage: 84.51%
   Iteration 6: swpout inc: 25, swpout fallback inc: 199, Fallback percentage: 88.84%
   Iteration 7: swpout inc: 23, swpout fallback inc: 205, Fallback percentage: 89.91%
   Iteration 8: swpout inc: 9, swpout fallback inc: 219, Fallback percentage: 96.05%
   Iteration 9: swpout inc: 13, swpout fallback inc: 213, Fallback percentage: 94.25%
   Iteration 10: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 94.74%
   Iteration 11: swpout inc: 16, swpout fallback inc: 213, Fallback percentage: 93.01%
   Iteration 12: swpout inc: 10, swpout fallback inc: 210, Fallback percentage: 95.45%
   Iteration 13: swpout inc: 16, swpout fallback inc: 212, Fallback percentage: 92.98%
   Iteration 14: swpout inc: 12, swpout fallback inc: 212, Fallback percentage: 94.64%
   Iteration 15: swpout inc: 15, swpout fallback inc: 211, Fallback percentage: 93.36%
   Iteration 16: swpout inc: 15, swpout fallback inc: 200, Fallback percentage: 93.02%
   Iteration 17: swpout inc: 9, swpout fallback inc: 220, Fallback percentage: 96.07%

   w/ this patch (always 0%):
   Iteration 1: swpout inc: 948, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 2: swpout inc: 953, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 3: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 4: swpout inc: 952, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 5: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 6: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 7: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 8: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 9: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 10: swpout inc: 945, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 11: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00%
   ...

3. With both mTHP swap-out and swap-in supported, we offer the option to
   enable zsmalloc compression/decompression with larger granularity [2].
   The upcoming optimization in zsmalloc will significantly increase swap
   speed and improve compression efficiency.
   Tested by running 100 iterations of swapping 100MiB of anon memory,
   the swap speed improved dramatically:

               time consumption of swapin(ms)   time consumption of swapout(ms)
   lz4 4k                  45274                            90540
   lz4 64k                 22942                            55667
   zstd 4k                 85035                           186585
   zstd 64k                46558                           118533

   The compression ratio also improved, as evaluated with 1 GiB of data:

   granularity   orig_data_size   compr_data_size
   4KiB-zstd     1048576000       246876055
   64KiB-zstd    1048576000       199763892

   Without mTHP swap-in, the potential optimizations in zsmalloc cannot
   be realized.

4. Even mTHP swap-in itself can reduce swap-in page faults by a factor
   of nr_pages. Swapping in content filled with the same data 0x11, w/o
   and w/ the patch for five rounds (since the content is the same,
   decompression will be very fast; this primarily assesses the impact
   of reduced page faults):

   swp in bandwidth(bytes/ms)    w/o        w/
   round1                        624152     1127501
   round2                        631672     1127501
   round3                        620459     1139756
   round4                        606113     1139756
   round5                        624152     1152281
   avg                           621310     1137359    +83%

[1] https://lore.kernel.org/all/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
[2] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/

Signed-off-by: Chuanhua Han
Co-developed-by: Barry Song
Signed-off-by: Barry Song
Reported-by: Kairui Song
---
 mm/memory.c | 211 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 188 insertions(+), 23 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4cf4902db1ec..07029532469a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3986,6 +3986,152 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
 	return VM_FAULT_SIGBUS;
 }
 
+/*
+ * Check that a range of PTEs are completely swap entries with
+ * contiguous swap offsets and the same SWAP_HAS_CACHE status.
+ * ptep must be the first PTE in the range.
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
+{
+	struct swap_info_struct *si;
+	unsigned long addr;
+	swp_entry_t entry;
+	pgoff_t offset;
+	char has_cache;
+	int idx, i;
+	pte_t pte;
+
+	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
+	idx = (vmf->address - addr) / PAGE_SIZE;
+	pte = ptep_get(ptep);
+
+	if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx)))
+		return false;
+	entry = pte_to_swp_entry(pte);
+	offset = swp_offset(entry);
+	if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages)
+		return false;
+
+	si = swp_swap_info(entry);
+	has_cache = si->swap_map[offset] & SWAP_HAS_CACHE;
+	for (i = 1; i < nr_pages; i++) {
+		/*
+		 * While allocating a large folio and doing swap_read_folio
+		 * for the SWP_SYNCHRONOUS_IO path, the faulting PTE has no
+		 * swapcache. We need to ensure all other PTEs have no cache
+		 * as well; otherwise, we might go to the swap device while
+		 * the content is in the swapcache.
+		 */
+		if ((si->swap_map[offset + i] & SWAP_HAS_CACHE) != has_cache)
+			return false;
+	}
+
+	return true;
+}
+
+static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
+		unsigned long addr, unsigned long orders)
+{
+	int order, nr;
+
+	order = highest_order(orders);
+
+	/*
+	 * To swap in a THP with nr pages, we require that its first
+	 * swap_offset is aligned with nr. This can filter out most
+	 * invalid entries.
+	 */
+	while (orders) {
+		nr = 1 << order;
+		if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr)
+			break;
+		order = next_order(&orders, order);
+	}
+
+	return orders;
+}
+#else
+static inline bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
+{
+	return false;
+}
+#endif
+
+static struct folio *alloc_swap_folio(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	unsigned long orders;
+	struct folio *folio;
+	unsigned long addr;
+	swp_entry_t entry;
+	spinlock_t *ptl;
+	pte_t *pte;
+	gfp_t gfp;
+	int order;
+
+	/*
+	 * If uffd is active for the vma we need per-page fault fidelity to
+	 * maintain the uffd semantics.
+	 */
+	if (unlikely(userfaultfd_armed(vma)))
+		goto fallback;
+
+	/*
+	 * A large swapped-out folio could be partially or fully in zswap. We
+	 * lack handling for such cases, so fall back to swapping in an
+	 * order-0 folio.
+	 */
+	if (!zswap_never_enabled())
+		goto fallback;
+
+	entry = pte_to_swp_entry(vmf->orig_pte);
+	/*
+	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
+	 * and suitable for swapping THP.
+	 */
+	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
+			TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
+	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
+	orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders);
+
+	if (!orders)
+		goto fallback;
+
+	pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address & PMD_MASK, &ptl);
+	if (unlikely(!pte))
+		goto fallback;
+
+	/*
+	 * For do_swap_page, find the highest order where the aligned range is
+	 * completely swap entries with contiguous swap offsets.
+	 */
+	order = highest_order(orders);
+	while (orders) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+		if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order))
+			break;
+		order = next_order(&orders, order);
+	}
+
+	pte_unmap_unlock(pte, ptl);
+
+	/* Try allocating the highest of the remaining orders. */
+	gfp = vma_thp_gfp_mask(vma);
+	while (orders) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+		folio = vma_alloc_folio(gfp, order, vma, addr, true);
+		if (folio)
+			return folio;
+		order = next_order(&orders, order);
+	}
+
+fallback:
+#endif
+	return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
+}
+
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -4074,35 +4220,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (!folio) {
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
 		    __swap_count(entry) == 1) {
-			/*
-			 * Prevent parallel swapin from proceeding with
-			 * the cache flag. Otherwise, another thread may
-			 * finish swapin first, free the entry, and swapout
-			 * reusing the same entry. It's undetectable as
-			 * pte_same() returns true due to entry reuse.
-			 */
-			if (swapcache_prepare(entry, 1)) {
-				/* Relax a bit to prevent rapid repeated page faults */
-				schedule_timeout_uninterruptible(1);
-				goto out;
-			}
-			need_clear_cache = true;
-			/* skip swapcache */
-			folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
-						vma, vmf->address, false);
+			folio = alloc_swap_folio(vmf);
 			page = &folio->page;
 			if (folio) {
 				__folio_set_locked(folio);
 				__folio_set_swapbacked(folio);
 
+				nr_pages = folio_nr_pages(folio);
+				if (folio_test_large(folio))
+					entry.val = ALIGN_DOWN(entry.val, nr_pages);
+				/*
+				 * Prevent parallel swapin from proceeding with
+				 * the cache flag.
+				 * Otherwise, another thread may
+				 * finish swapin first, free the entry, and swapout
+				 * reusing the same entry. It's undetectable as
+				 * pte_same() returns true due to entry reuse.
+				 */
+				if (swapcache_prepare(entry, nr_pages)) {
+					/* Relax a bit to prevent rapid repeated page faults */
+					schedule_timeout_uninterruptible(1);
+					goto out_page;
+				}
+				need_clear_cache = true;
+
 				if (mem_cgroup_swapin_charge_folio(folio,
 							vma->vm_mm, GFP_KERNEL,
 							entry)) {
 					ret = VM_FAULT_OOM;
 					goto out_page;
 				}
-				mem_cgroup_swapin_uncharge_swap(entry, 1);
+				mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
 
 				shadow = get_shadow_from_swap_cache(entry);
 				if (shadow)
@@ -4209,6 +4357,22 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		goto out_nomap;
 	}
 
+	/* allocated large folios for SWP_SYNCHRONOUS_IO */
+	if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
+		unsigned long nr = folio_nr_pages(folio);
+		unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
+		unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
+		pte_t *folio_ptep = vmf->pte - idx;
+
+		if (!can_swapin_thp(vmf, folio_ptep, nr))
+			goto out_nomap;
+
+		page_idx = idx;
+		address = folio_start;
+		ptep = folio_ptep;
+		goto check_folio;
+	}
+
 	nr_pages = 1;
 	page_idx = 0;
 	address = vmf->address;
@@ -4340,11 +4504,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		folio_add_lru_vma(folio, vma);
 	} else if (!folio_test_anon(folio)) {
 		/*
-		 * We currently only expect small !anon folios, which are either
-		 * fully exclusive or fully shared. If we ever get large folios
-		 * here, we have to be careful.
+		 * We currently only expect small !anon folios which are either
+		 * fully exclusive or fully shared, or new allocated large folios
+		 * which are fully exclusive. If we ever get large folios within
+		 * swapcache here, we have to be careful.
 		 */
-		VM_WARN_ON_ONCE(folio_test_large(folio));
+		VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio));
 		VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
 		folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
 	} else {
@@ -4387,7 +4552,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 out:
 	/* Clear the swap cache pin for direct swapin after PTL unlock */
 	if (need_clear_cache)
-		swapcache_clear(si, entry, 1);
+		swapcache_clear(si, entry, nr_pages);
 	if (si)
 		put_swap_device(si);
 	return ret;
@@ -4403,7 +4568,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		folio_put(swapcache);
 	}
 	if (need_clear_cache)
-		swapcache_clear(si, entry, 1);
+		swapcache_clear(si, entry, nr_pages);
 	if (si)
 		put_swap_device(si);
 	return ret;
-- 
2.34.1
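
As a closing illustration of the order filter above:
thp_swap_suitable_orders() keeps an order only if the faulting virtual
page number and its swap offset are congruent modulo 1 << order, since
both the folio start address and the first swap slot of the run are later
derived with ALIGN_DOWN. A stand-alone sketch of that check (user-space,
made-up values; not kernel code):

    #include <stdio.h>

    /*
     * Mirror of the per-order test in thp_swap_suitable_orders():
     * an order survives only if
     * (addr >> PAGE_SHIFT) % nr == swp_offset % nr, with nr = 1 << order.
     */
    static int order_suitable(unsigned long vpn, unsigned long swp_offset,
                              int order)
    {
            unsigned long nr = 1UL << order;

            return (vpn % nr) == (swp_offset % nr);
    }

    int main(void)
    {
            /* hypothetical fault: virtual page 0x1234, swap offset 0x5634 */
            for (int order = 1; order <= 4; order++)
                    printf("order %d: %s\n", order,
                           order_suitable(0x1234, 0x5634, order) ?
                           "candidate" : "filtered");
            return 0;
    }

In this example every order up to 4 survives, because 0x5634 - 0x1234 =
0x4400 is a multiple of 16. Passing the congruence test is only the cheap
first filter: can_swapin_thp() must still verify that the whole range
really consists of contiguous swap entries with a consistent
SWAP_HAS_CACHE state before an order is actually used.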