From nobody Tue Nov 26 10:22:36 2024 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.14]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id CC544185954 for ; Fri, 18 Oct 2024 06:48:09 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.14 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1729234091; cv=none; b=JNurN0VVPz4+DIY2oqCvboFp6kijQ9MNblP0/sfq2opSw0ClcLxC6V4nRFOHxPeXnpAectocNp30Cf5O5TcmgbR4s5saLrIvjLXGDqWSNnIe6Bm8IJOTI63JRkJW+6gitXjdm5i4Lfqf0jHGB73j4yQo69s0O4asLMOz7SKVG2U= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1729234091; c=relaxed/simple; bh=vgbP0deIxDuw/qX2EJ50fTnXNEzCSKwaUCbRMPN4Zeg=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=HqDTNGEuWFFGV1PMo4EafTaFw5OQcjJndE+E2Bs5+fuvYzSF0FPd/rFu/+i8PxvW16sw8KpQkkUmwE4ESoVzLlAEBKRjB7lnXBpe2LQW14YC0s/SpnpOLf57SVguuoV2LYuD5l0tT5+/MsEzXRBV6/xndVJ0AdJ5Wuao+JEcLX4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=QUOsHkCl; arc=none smtp.client-ip=192.198.163.14 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="QUOsHkCl" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1729234090; x=1760770090; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=vgbP0deIxDuw/qX2EJ50fTnXNEzCSKwaUCbRMPN4Zeg=; b=QUOsHkCl/NNTY/HUeSkuPsZ0Chwa1NAhJekI5gueK1GPcBnWv82aJ7Cd njRnZCCvxsfQurlaR3jdx3mGdW5xWh5SRj1EY/MBLqyDyYzS3wqGrOHu6 bq6pSmlpjcabL7r6VzcmfoQyXHaehG8pz7kfEvIzVt3/ZRmtHIKmEyXge CO0ymVezjqDaxwjgjUUznh0jvRBNl3kifMeoVDmRq9CtBKbTlKn1aq/vm jnakC5Y+DmZOHXd1JHC83toDwmDI31xX4WYRRoOcA26+TMCr5t7jJdGGf oHTYxPntNh/vLsbsE7LMDDy24GVl+Dy6bRnsZXl1jpCUzjoA8b+S/jSNF g==; X-CSE-ConnectionGUID: 4Qdm16b8RwiaUolJ+WlKbw== X-CSE-MsgGUID: eevwZ78ASgmogHagW3nOJA== X-IronPort-AV: E=McAfee;i="6700,10204,11228"; a="28963318" X-IronPort-AV: E=Sophos;i="6.11,212,1725346800"; d="scan'208";a="28963318" Received: from fmviesa003.fm.intel.com ([10.60.135.143]) by fmvoesa108.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 17 Oct 2024 23:48:07 -0700 X-CSE-ConnectionGUID: UUU+zR47R3umRvDc5akGeQ== X-CSE-MsgGUID: nOsiFCOlRUqm01m/QxIO7g== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.11,212,1725346800"; d="scan'208";a="82744499" Received: from jf5300-b11a338t.jf.intel.com ([10.242.51.6]) by fmviesa003.fm.intel.com with ESMTP; 17 Oct 2024 23:48:06 -0700 From: Kanchana P Sridhar To: linux-kernel@vger.kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org, yosryahmed@google.com, nphamcs@gmail.com, chengming.zhou@linux.dev, usamaarif642@gmail.com, ryan.roberts@arm.com, ying.huang@intel.com, 21cnbao@gmail.com, akpm@linux-foundation.org, hughd@google.com, willy@infradead.org, bfoster@redhat.com, dchinner@redhat.com, chrisl@kernel.org, david@redhat.com Cc: wajdi.k.feghali@intel.com, vinodh.gopal@intel.com, kanchana.p.sridhar@intel.com Subject: [RFC PATCH v1 1/7] mm: zswap: Config variable to enable zswap loads with 
decompress batching. Date: Thu, 17 Oct 2024 23:47:59 -0700 Message-Id: <20241018064805.336490-2-kanchana.p.sridhar@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20241018064805.336490-1-kanchana.p.sridhar@intel.com> References: <20241018064805.336490-1-kanchana.p.sridhar@intel.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Add a new zswap config variable that controls whether zswap load will decompress a batch of 4K folios, for instance, the folios prefetched during swapin_readahead(): CONFIG_ZSWAP_LOAD_BATCHING_ENABLED The existing CONFIG_CRYPTO_DEV_IAA_CRYPTO variable added in commit ea7a5cbb4369 ("crypto: iaa - Add Intel IAA Compression Accelerator crypto driver core") is used to detect if the system has the Intel Analytics Accelerator (IAA), and the iaa_crypto module is available. If so, the kernel build will prompt for CONFIG_ZSWAP_LOAD_BATCHING_ENABLED. Hence, users have the ability to set CONFIG_ZSWAP_LOAD_BATCHING_ENABLED=3D"y" only on systems that have Intel IAA. If CONFIG_ZSWAP_LOAD_BATCHING_ENABLED is enabled, and IAA is configured as the zswap compressor, the vm.page-cluster is used to prefetch up to 32 4K folios using swapin_readahead(). The readahead folios present in zswap are then loaded as a batch using IAA decompression batching. The patch also implements a zswap API that returns the status of this config variable. Signed-off-by: Kanchana P Sridhar --- include/linux/zswap.h | 8 ++++++++ mm/Kconfig | 13 +++++++++++++ mm/zswap.c | 12 ++++++++++++ 3 files changed, 33 insertions(+) diff --git a/include/linux/zswap.h b/include/linux/zswap.h index 328a1e09d502..294d13efbfb1 100644 --- a/include/linux/zswap.h +++ b/include/linux/zswap.h @@ -118,6 +118,9 @@ static inline void zswap_store_batch(struct swap_in_mem= ory_cache_cb *simc) else __zswap_store_batch_single(simc); } + +bool zswap_load_batching_enabled(void); + unsigned long zswap_total_pages(void); bool zswap_store(struct folio *folio); bool zswap_load(struct folio *folio); @@ -145,6 +148,11 @@ static inline void zswap_store_batch(struct swap_in_me= mory_cache_cb *simc) { } =20 +static inline bool zswap_load_batching_enabled(void) +{ + return false; +} + static inline bool zswap_store(struct folio *folio) { return false; diff --git a/mm/Kconfig b/mm/Kconfig index 26d1a5cee471..98e46a3cf0e3 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -137,6 +137,19 @@ config ZSWAP_STORE_BATCHING_ENABLED in the folio in hardware, thereby improving large folio compression throughput and reducing swapout latency. =20 +config ZSWAP_LOAD_BATCHING_ENABLED + bool "Batching of zswap loads of 4K folios with Intel IAA" + depends on ZSWAP && CRYPTO_DEV_IAA_CRYPTO + default n + help + Enables zswap_load to swapin multiple 4K folios in batches of 8, + rather than a folio at a time, if the system has Intel IAA for hardware + acceleration of decompressions. swapin_readahead will be used to + prefetch a batch of folios to be swapped in along with the faulting + folio. If IAA is the zswap compressor, this will parallelize batch + decompression of upto 8 folios in hardware, thereby reducing swapin + and do_swap_page latency. 
+ choice prompt "Default allocator" depends on ZSWAP diff --git a/mm/zswap.c b/mm/zswap.c index 68ce498ad000..fe7bc2a6672e 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -136,6 +136,13 @@ module_param_named(shrinker_enabled, zswap_shrinker_en= abled, bool, 0644); static bool __zswap_store_batching_enabled =3D IS_ENABLED( CONFIG_ZSWAP_STORE_BATCHING_ENABLED); =20 +/* + * Enable/disable batching of decompressions of multiple 4K folios, if + * the system has Intel IAA. + */ +static bool __zswap_load_batching_enabled =3D IS_ENABLED( + CONFIG_ZSWAP_LOAD_BATCHING_ENABLED); + bool zswap_is_enabled(void) { return zswap_enabled; @@ -246,6 +253,11 @@ __always_inline bool zswap_store_batching_enabled(void) return __zswap_store_batching_enabled; } =20 +__always_inline bool zswap_load_batching_enabled(void) +{ + return __zswap_load_batching_enabled; +} + static void __zswap_store_batch_core( int node_id, struct folio **folios, --=20 2.27.0 From nobody Tue Nov 26 10:22:36 2024 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.14]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 63C18186E27 for ; Fri, 18 Oct 2024 06:48:10 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.14 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1729234092; cv=none; b=I/zjGn0seJMXLd5drtvb0ZE9wiE0EDeQAVCA/L5LRKkI++dEyGM7hdQqjQqA9tWOq6wVNJIf4EpLziK+yc0kwtJjknhQaCA+BsP40YAB/l3jN1R4DkFkMPwvIKL8cVQYd19MuKb3zCFE3LVj8PgiQ1By05QgbQVplwDn1Qf1BP0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1729234092; c=relaxed/simple; bh=UPuvHsF/khjGuvZn4W/MjDOW3ucu5XL70qjTmnPaytQ=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=Srst0AtrCqu8yDjYm2+hCrVGi1rE+EZ77wUCkskbJ6XBuAOPp2qoOj02CCNXE0uEgoIS3FRzHB46NtBKMORqtM22dFkAWARRaVbs3uyZWzsQfjVgiUTltw8z2UGztyssx9qVY6GnLUSKGKZF79zH5qd/U/mg0oR1BgUx+XYM+P8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=PvokcR2P; arc=none smtp.client-ip=192.198.163.14 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="PvokcR2P" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1729234090; x=1760770090; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=UPuvHsF/khjGuvZn4W/MjDOW3ucu5XL70qjTmnPaytQ=; b=PvokcR2PrC/fXzeiGW0/7O+5QmdFS8EHHDS9owCgRbug2QvDKApy94Ml 6zGYcWCd3HsMxksdipTV2LUqWDzmvQsgk0Jlq9M80gvTQaSs7n7cjS+8m HurD0wuS8FOXAYnfDpntBhbE/Ex2J6FqNGgQUp07GNDlkKD6/qIRTgEFn XmjkZCrgpRUT1Utnyli8pvukno1ZH/g0wSde4DWBgbCMoz6vHdMsFBOZi fU5SX52bHljcvy/Hbk2ANHBSJN2WkTiejBQm+B99aooCSyEEZXZVG/TAf lQzPSxxA1nhbsFNQBzHCXKjO4JD1ikHIH6wnIhHzjgiOYc+IUf3xRurDG g==; X-CSE-ConnectionGUID: /I0hDXPXQ7ChIrj59nh3Jw== X-CSE-MsgGUID: YIpKIewHTVyhMSKyrpQMMw== X-IronPort-AV: E=McAfee;i="6700,10204,11228"; a="28963326" X-IronPort-AV: E=Sophos;i="6.11,212,1725346800"; d="scan'208";a="28963326" Received: from fmviesa003.fm.intel.com ([10.60.135.143]) by fmvoesa108.fm.intel.com with 
ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 17 Oct 2024 23:48:07 -0700 X-CSE-ConnectionGUID: NWgiSYFkS0SuqFRm5HR5lA== X-CSE-MsgGUID: UXdDj26UTrKRjCKhs7vctg== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.11,212,1725346800"; d="scan'208";a="82744504" Received: from jf5300-b11a338t.jf.intel.com ([10.242.51.6]) by fmviesa003.fm.intel.com with ESMTP; 17 Oct 2024 23:48:06 -0700 From: Kanchana P Sridhar To: linux-kernel@vger.kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org, yosryahmed@google.com, nphamcs@gmail.com, chengming.zhou@linux.dev, usamaarif642@gmail.com, ryan.roberts@arm.com, ying.huang@intel.com, 21cnbao@gmail.com, akpm@linux-foundation.org, hughd@google.com, willy@infradead.org, bfoster@redhat.com, dchinner@redhat.com, chrisl@kernel.org, david@redhat.com Cc: wajdi.k.feghali@intel.com, vinodh.gopal@intel.com, kanchana.p.sridhar@intel.com Subject: [RFC PATCH v1 2/7] mm: swap: Add IAA batch decompression API swap_crypto_acomp_decompress_batch(). Date: Thu, 17 Oct 2024 23:48:00 -0700 Message-Id: <20241018064805.336490-3-kanchana.p.sridhar@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20241018064805.336490-1-kanchana.p.sridhar@intel.com> References: <20241018064805.336490-1-kanchana.p.sridhar@intel.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Added a new API swap_crypto_acomp_decompress_batch() that does batch decompression. A system that has Intel IAA can avail of this API to submit a batch of decompress jobs for parallel decompression in the hardware, to imp= rove performance. On a system without IAA, this API will process each decompress job sequentially. The purpose of this API is to be invocable from any swap module that needs to decompress multiple 4K folios, or a batch of pages in the general case. For instance, zswap would decompress up to (1UL << SWAP_RA_ORDER_CEILING) folios in batches of SWAP_CRYPTO_SUB_BATCH_SIZE (i.e. 8 if the system has IAA) pages prefetched by swapin_readahead(), which would improve readahead performance. Towards this eventual goal, the swap_crypto_acomp_decompress_batch() interface is implemented in swap_state.c and exported via mm/swap.h. It would be preferable for swap_crypto_acomp_decompress_batch() to be exported via include/linux/swap.h so that modules outside mm (for e.g. zram) can potentially use the API for batch decompressions with IAA, since the swapin_readahead() batching interface is common to all swap modules. I would appreciate RFC comments on this. Signed-off-by: Kanchana P Sridhar --- mm/swap.h | 42 +++++++++++++++++-- mm/swap_state.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 147 insertions(+), 4 deletions(-) diff --git a/mm/swap.h b/mm/swap.h index 08c04954304f..0bb386b5fdee 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -10,11 +10,12 @@ struct mempolicy; #include =20 /* - * For IAA compression batching: - * Maximum number of IAA acomp compress requests that will be processed - * in a sub-batch. + * For IAA compression/decompression batching: + * Maximum number of IAA acomp compress/decompress requests that will be + * processed in a sub-batch. 
*/ -#if defined(CONFIG_ZSWAP_STORE_BATCHING_ENABLED) +#if defined(CONFIG_ZSWAP_STORE_BATCHING_ENABLED) || \ + defined(CONFIG_ZSWAP_LOAD_BATCHING_ENABLED) #define SWAP_CRYPTO_SUB_BATCH_SIZE 8UL #else #define SWAP_CRYPTO_SUB_BATCH_SIZE 1UL @@ -60,6 +61,29 @@ void swap_crypto_acomp_compress_batch( int nr_pages, struct crypto_acomp_ctx *acomp_ctx); =20 +/** + * This API provides IAA decompress batching functionality for use by swap + * modules. + * The acomp_ctx mutex should be locked/unlocked before/after calling this + * procedure. + * + * @srcs: The src buffers to be decompressed. + * @pages: The pages to store the buffers decompressed by IAA. + * @slens: src buffers' compressed lengths. + * @errors: Will contain a 0 if the page was successfully decompressed, or= a + * non-0 error value to be processed by the calling function. + * @nr_pages: The number of pages, up to SWAP_CRYPTO_SUB_BATCH_SIZE, + * to be decompressed. + * @acomp_ctx: The acomp context for iaa_crypto/other compressor. + */ +void swap_crypto_acomp_decompress_batch( + u8 *srcs[], + struct page *pages[], + unsigned int slens[], + int errors[], + int nr_pages, + struct crypto_acomp_ctx *acomp_ctx); + /* linux/mm/vmscan.c, linux/mm/page_io.c, linux/mm/zswap.c */ /* For batching of compressions in reclaim path. */ struct swap_in_memory_cache_cb { @@ -204,6 +228,16 @@ static inline void swap_write_in_memory_cache_unplug( { } =20 +static inline void swap_crypto_acomp_decompress_batch( + u8 *srcs[], + struct page *pages[], + unsigned int slens[], + int errors[], + int nr_pages, + struct crypto_acomp_ctx *acomp_ctx) +{ +} + static inline void swap_read_folio(struct folio *folio, struct swap_iocb *= *plug) { } diff --git a/mm/swap_state.c b/mm/swap_state.c index 117c3caa5679..3cebbff40804 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -855,6 +855,115 @@ void swap_crypto_acomp_compress_batch( } EXPORT_SYMBOL_GPL(swap_crypto_acomp_compress_batch); =20 +/** + * This API provides IAA decompress batching functionality for use by swap + * modules. + * The acomp_ctx mutex should be locked/unlocked before/after calling this + * procedure. + * + * @srcs: The src buffers to be decompressed. + * @pages: The pages to store the buffers decompressed by IAA. + * @slens: src buffers' compressed lengths. + * @errors: Will contain a 0 if the page was successfully decompressed, or= a + * non-0 error value to be processed by the calling function. + * @nr_pages: The number of pages, up to SWAP_CRYPTO_SUB_BATCH_SIZE, + * to be decompressed. + * @acomp_ctx: The acomp context for iaa_crypto/other compressor. + */ +void swap_crypto_acomp_decompress_batch( + u8 *srcs[], + struct page *pages[], + unsigned int slens[], + int errors[], + int nr_pages, + struct crypto_acomp_ctx *acomp_ctx) +{ + struct scatterlist inputs[SWAP_CRYPTO_SUB_BATCH_SIZE]; + struct scatterlist outputs[SWAP_CRYPTO_SUB_BATCH_SIZE]; + unsigned int dlens[SWAP_CRYPTO_SUB_BATCH_SIZE]; + bool decompressions_done =3D false; + int i, j; + + BUG_ON(nr_pages > SWAP_CRYPTO_SUB_BATCH_SIZE); + + /* + * Prepare and submit acomp_reqs to IAA. + * IAA will process these decompress jobs in parallel in async mode. + * If the compressor does not support a poll() method, or if IAA is + * used in sync mode, the jobs will be processed sequentially using + * acomp_ctx->req[0] and acomp_ctx->wait. + */ + for (i =3D 0; i < nr_pages; ++i) { + j =3D acomp_ctx->acomp->poll ? 
i : 0; + + dlens[i] =3D PAGE_SIZE; + sg_init_one(&inputs[i], srcs[i], slens[i]); + sg_init_table(&outputs[i], 1); + sg_set_page(&outputs[i], pages[i], PAGE_SIZE, 0); + acomp_request_set_params(acomp_ctx->req[j], &inputs[i], + &outputs[i], slens[i], dlens[i]); + /* + * If the crypto_acomp provides an asynchronous poll() + * interface, submit the request to the driver now, and poll for + * a completion status later, after all descriptors have been + * submitted. If the crypto_acomp does not provide a poll() + * interface, submit the request and wait for it to complete, + * i.e., synchronously, before moving on to the next request. + */ + if (acomp_ctx->acomp->poll) { + errors[i] =3D crypto_acomp_decompress(acomp_ctx->req[j]); + + if (errors[i] !=3D -EINPROGRESS) + errors[i] =3D -EINVAL; + else + errors[i] =3D -EAGAIN; + } else { + errors[i] =3D crypto_wait_req( + crypto_acomp_decompress(acomp_ctx->req[j]), + &acomp_ctx->wait); + if (!errors[i]) { + dlens[i] =3D acomp_ctx->req[j]->dlen; + BUG_ON(dlens[i] !=3D PAGE_SIZE); + } + } + } + + /* + * If not doing async decompressions, the batch has been processed at + * this point and we can return. + */ + if (!acomp_ctx->acomp->poll) + return; + + /* + * Poll for and process IAA decompress job completions + * in out-of-order manner. + */ + while (!decompressions_done) { + decompressions_done =3D true; + + for (i =3D 0; i < nr_pages; ++i) { + /* + * Skip, if the decompression has already completed + * successfully or with an error. + */ + if (errors[i] !=3D -EAGAIN) + continue; + + errors[i] =3D crypto_acomp_poll(acomp_ctx->req[i]); + + if (errors[i]) { + if (errors[i] =3D=3D -EAGAIN) + decompressions_done =3D false; + } else { + dlens[i] =3D acomp_ctx->req[i]->dlen; + BUG_ON(dlens[i] !=3D PAGE_SIZE); + } + } + } +} +EXPORT_SYMBOL_GPL(swap_crypto_acomp_decompress_batch); + #endif /* CONFIG_SWAP */ =20 static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start, --=20 2.27.0 From nobody Tue Nov 26 10:22:36 2024 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.14]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id EA404186E3F for ; Fri, 18 Oct 2024 06:48:10 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.14 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1729234092; cv=none; b=lAUdfWxAw297eGbSoULmJDQNRHplJGGlMEmvbxva3GrnhVxSyM3JfdXPvaSH++4v+7AHx1Fc+Li6/Yf/TfZTRop5uLqLPKrwj8hALUaypr0UGnqufBQf1dHuBEuqUd60RZWLohZxYgNEnVOAk7gt0mQsO+KUBf5Zyxx4rr/91eQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1729234092; c=relaxed/simple; bh=0287k7mWEmWYuvXNZAukIKoRrD9gl2Lhv4dAhuzEN6E=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=DD12BXAojt51Lm3dgcVcyqKh/Cx/R7uxi0NUuCffHGyPOfBeWkjn87mLwCVm3Gygc6t3uaSYt8ce4YTBJWy/gy62sX07DCMXhyYfRwrrdHUBsHEIW7AGDuXP9frPmCk9aIObfLY9YcVqihCeeBKV64GQ8Rxx6RkfiEe4uEVYm8U= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=A9P0PveW; arc=none smtp.client-ip=192.198.163.14 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; 
dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="A9P0PveW" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1729234091; x=1760770091; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=0287k7mWEmWYuvXNZAukIKoRrD9gl2Lhv4dAhuzEN6E=; b=A9P0PveWAhnXrQaO70Kw2LRlhuiQYfKsB43yJLLcPMQ6tQxxpgElR9mP PLdIlZ6vS1ziw2gXNtoiPw1rI0pWZelzVa7qQq4r0hKTxFwzXwMNQbkZN 3wKAs+FcYXiCP64hay6fuL9yxqWcTFEnLBcAMe9JD3lHo1U7ivpDz38eh vBszH6tegxlRsYAQftgQL+lBAtcClDxIxuvZ6CPnk8r5k6nXH5eyaWAff I5xZrZ1az7ucMNXyqgIdkBjAMviJLm4bHi5m3RgkXFT0QQVOkfbiwU4GM ADNfxuy7nhpvaobOYGFt05m1VXGzSrxDZRWqHShM/we9eOAXq7FZzTJVy A==; X-CSE-ConnectionGUID: RGXD59muS5yOLo/Mbeustg== X-CSE-MsgGUID: KLzBWscXTymyJp5KAFo34A== X-IronPort-AV: E=McAfee;i="6700,10204,11228"; a="28963338" X-IronPort-AV: E=Sophos;i="6.11,212,1725346800"; d="scan'208";a="28963338" Received: from fmviesa003.fm.intel.com ([10.60.135.143]) by fmvoesa108.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 17 Oct 2024 23:48:08 -0700 X-CSE-ConnectionGUID: /UVWgjWZT22cO5WouhsKFg== X-CSE-MsgGUID: +vVqT/OlTbeg2bOeVG+nKg== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.11,212,1725346800"; d="scan'208";a="82744509" Received: from jf5300-b11a338t.jf.intel.com ([10.242.51.6]) by fmviesa003.fm.intel.com with ESMTP; 17 Oct 2024 23:48:07 -0700 From: Kanchana P Sridhar To: linux-kernel@vger.kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org, yosryahmed@google.com, nphamcs@gmail.com, chengming.zhou@linux.dev, usamaarif642@gmail.com, ryan.roberts@arm.com, ying.huang@intel.com, 21cnbao@gmail.com, akpm@linux-foundation.org, hughd@google.com, willy@infradead.org, bfoster@redhat.com, dchinner@redhat.com, chrisl@kernel.org, david@redhat.com Cc: wajdi.k.feghali@intel.com, vinodh.gopal@intel.com, kanchana.p.sridhar@intel.com Subject: [RFC PATCH v1 3/7] pagevec: struct folio_batch changes for decompress batching interface. Date: Thu, 17 Oct 2024 23:48:01 -0700 Message-Id: <20241018064805.336490-4-kanchana.p.sridhar@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20241018064805.336490-1-kanchana.p.sridhar@intel.com> References: <20241018064805.336490-1-kanchana.p.sridhar@intel.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Made these changes to "struct folio_batch" for use in the swapin_readahead() based zswap load batching interface for parallel decompressions with IAA: 1) Moved SWAP_RA_ORDER_CEILING definition to pagevec.h. 2) Increased PAGEVEC_SIZE to (1UL << SWAP_RA_ORDER_CEILING), because vm.page-cluster=3D5 requires capacity for 32 folios. 3) Made folio_batch_add() more fail-safe. Signed-off-by: Kanchana P Sridhar --- include/linux/pagevec.h | 13 ++++++++++--- mm/swap_state.c | 2 -- 2 files changed, 10 insertions(+), 5 deletions(-) diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h index 5d3a0cccc6bf..c9bab240fb6e 100644 --- a/include/linux/pagevec.h +++ b/include/linux/pagevec.h @@ -11,8 +11,14 @@ =20 #include =20 -/* 31 pointers + header align the folio_batch structure to a power of two = */ -#define PAGEVEC_SIZE 31 +/* + * For page-cluster of 5, I noticed that space for 31 pointers was + * insufficient. Increasing this to meet the requirements for folio_batch + * usage in the swap read decompress batching interface that is based on + * swapin_readahead(). 
+ */ +#define SWAP_RA_ORDER_CEILING 5 +#define PAGEVEC_SIZE (1UL << SWAP_RA_ORDER_CEILING) =20 struct folio; =20 @@ -74,7 +80,8 @@ static inline unsigned int folio_batch_space(struct folio= _batch *fbatch) static inline unsigned folio_batch_add(struct folio_batch *fbatch, struct folio *folio) { - fbatch->folios[fbatch->nr++] =3D folio; + if (folio_batch_space(fbatch) > 0) + fbatch->folios[fbatch->nr++] =3D folio; return folio_batch_space(fbatch); } =20 diff --git a/mm/swap_state.c b/mm/swap_state.c index 3cebbff40804..0673593d363c 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -44,8 +44,6 @@ struct address_space *swapper_spaces[MAX_SWAPFILES] __rea= d_mostly; static unsigned int nr_swapper_spaces[MAX_SWAPFILES] __read_mostly; static bool enable_vma_readahead __read_mostly =3D true; =20 -#define SWAP_RA_ORDER_CEILING 5 - #define SWAP_RA_WIN_SHIFT (PAGE_SHIFT / 2) #define SWAP_RA_HITS_MASK ((1UL << SWAP_RA_WIN_SHIFT) - 1) #define SWAP_RA_HITS_MAX SWAP_RA_HITS_MASK --=20 2.27.0 From nobody Tue Nov 26 10:22:36 2024 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.14]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 915AA18787A for ; Fri, 18 Oct 2024 06:48:11 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.14 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1729234093; cv=none; b=UA4SG9qarq2N3tq5/xYu/YgB/2KCAwamdaKPAB7A33rWo0tmnB/YrMtKLkrX3vR2tu04gdhKTTmPMebLYhqbHZwxEi6AyY/ASetgjvHNfsyJPRGuR/pUFxo3/HAQ8IpMkEXqH4bru9KwyTbKi3k4sYsWYalyd2UsWJG2/ezKyNk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1729234093; c=relaxed/simple; bh=Wk7y1NCaq5vrYXmMN8sm97ljo3rd4bPykdD24mhfkEY=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=Z/7hE8Ieitim1O6tB3IVEuZsI7LeCPGxnFh9awWb+R6PLDn50Xw/vCOOxtsBKn+flZIThcgtRqaW9F5g/tX9+6GevH7Q7GT1gjufuhhHxOukBxMsxB9aHJdm2vSc45wLEU11Bpwm+pgcADP2nVn0dKY/41h2/BCHF7fQtYKLENs= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=RjbvtdPq; arc=none smtp.client-ip=192.198.163.14 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="RjbvtdPq" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1729234091; x=1760770091; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=Wk7y1NCaq5vrYXmMN8sm97ljo3rd4bPykdD24mhfkEY=; b=RjbvtdPq7O3FWXwT+BHoZil3d99OGkHbTWuB+c91vkgOWBSN2u7FYB11 BjYPBvMqbBIOX3prs3+pMcc1AFwCZo4esdUE6zvcuaLej5bPRli8/uGWy a7ilQ+IgQnU+psif6Rv17KBo9Q0BAxY6NqR+bAt8hV5cBJesL+1Brt4DM JJ2X+msLmWtJo1+hlXjecabJF3L00m39E+PLHavUSDlcVvL14ocBAHlxE aUufQOQh6Hhff9GoIdSRUNWrUwB2s0t+x6Kzrfn1q6TSATWuDXEGM2SJ1 Dv4zO1ZWI+jS4EMiFg5Cvj5822hf3r+sShr2j4CdX3CkRHheDzA07zZDQ A==; X-CSE-ConnectionGUID: W2OYiiq8So+Ja3LkJzTzBA== X-CSE-MsgGUID: /cxUvX2BRKyEwvWoP/TBJQ== X-IronPort-AV: E=McAfee;i="6700,10204,11228"; a="28963350" X-IronPort-AV: E=Sophos;i="6.11,212,1725346800"; d="scan'208";a="28963350" Received: from 
fmviesa003.fm.intel.com ([10.60.135.143]) by fmvoesa108.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 17 Oct 2024 23:48:09 -0700 X-CSE-ConnectionGUID: 2dVk8TXESRGmpz6l97XWIg== X-CSE-MsgGUID: 3TYx2nS6SfmEhvdE+9apQQ== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.11,212,1725346800"; d="scan'208";a="82744516" Received: from jf5300-b11a338t.jf.intel.com ([10.242.51.6]) by fmviesa003.fm.intel.com with ESMTP; 17 Oct 2024 23:48:08 -0700 From: Kanchana P Sridhar To: linux-kernel@vger.kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org, yosryahmed@google.com, nphamcs@gmail.com, chengming.zhou@linux.dev, usamaarif642@gmail.com, ryan.roberts@arm.com, ying.huang@intel.com, 21cnbao@gmail.com, akpm@linux-foundation.org, hughd@google.com, willy@infradead.org, bfoster@redhat.com, dchinner@redhat.com, chrisl@kernel.org, david@redhat.com Cc: wajdi.k.feghali@intel.com, vinodh.gopal@intel.com, kanchana.p.sridhar@intel.com Subject: [RFC PATCH v1 4/7] mm: swap: swap_read_folio() can add a folio to a folio_batch if it is in zswap. Date: Thu, 17 Oct 2024 23:48:02 -0700 Message-Id: <20241018064805.336490-5-kanchana.p.sridhar@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20241018064805.336490-1-kanchana.p.sridhar@intel.com> References: <20241018064805.336490-1-kanchana.p.sridhar@intel.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" This patch modifies swap_read_folio() to check if the swap entry is present in zswap, and if so, it will be added to a "zswap_batch" folio_batch, if the caller (e.g. swapin_readahead()) has passed in a valid "zswap_batch". If the swap entry is found in zswap, it will be added the next available index in a sub-batch. This sub-batch is part of "struct zswap_decomp_batch" which progressively constructs SWAP_CRYPTO_SUB_BATCH_SIZE arrays of zswap entries/xarrays/pages/source-lengths ready for batch decompression in IAA. The function that does this, zswap_add_load_batch(), will return true to swap_read_folio(). If the entry is not found in zswap, it will return false. If the swap entry was not found in zswap, and if zswap_load_batching_enabled() and a valid "non_zswap_batch" folio_batch is passed to swap_read_folio(), the folio will be added to the "non_zswap_batch" batch. Finally, the code falls through to the usual/existing swap_read_folio() flow. 
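For reference, the sub-batch bookkeeping described above amounts to the following (a condensed, illustrative sketch of what __zswap_add_load_batch() in this patch does once a zswap_entry has been found; the helper name is hypothetical, and the swapcache/xarray handling and the zero-length-entry case are omitted here):

	/*
	 * Sketch only: a folio's position in the folio_batch selects the
	 * sub-batch, and nr_decomp[sb] is the next free decompress slot
	 * within that sub-batch. With SWAP_CRYPTO_SUB_BATCH_SIZE == 8 and
	 * PAGEVEC_SIZE == 32, a full readahead batch yields up to 4
	 * sub-batches of up to 8 IAA decompress jobs each.
	 */
	static void zd_batch_add_slot_sketch(struct zswap_decomp_batch *zd_batch,
					     struct folio *folio,
					     struct xarray *tree,
					     struct zswap_entry *entry)
	{
		unsigned int batch_idx, sb, slot;

		folio_batch_add(&zd_batch->fbatch, folio);
		batch_idx = folio_batch_count(&zd_batch->fbatch) - 1;

		sb = batch_idx / SWAP_CRYPTO_SUB_BATCH_SIZE;
		slot = zd_batch->nr_decomp[sb];

		zd_batch->trees[sb][slot] = tree;
		zd_batch->entries[sb][slot] = entry;
		zd_batch->pages[sb][slot] = &folio->page;
		zd_batch->slens[sb][slot] = entry->length;
		zd_batch->nr_decomp[sb]++;
	}
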
Signed-off-by: Kanchana P Sridhar --- include/linux/zswap.h | 35 +++++++++++++++++ mm/memory.c | 2 +- mm/page_io.c | 26 ++++++++++++- mm/swap.h | 31 ++++++++++++++- mm/swap_state.c | 10 ++--- mm/zswap.c | 89 +++++++++++++++++++++++++++++++++++++++++++ 6 files changed, 183 insertions(+), 10 deletions(-) diff --git a/include/linux/zswap.h b/include/linux/zswap.h index 294d13efbfb1..1d6de281f243 100644 --- a/include/linux/zswap.h +++ b/include/linux/zswap.h @@ -12,6 +12,8 @@ extern atomic_long_t zswap_stored_pages; #ifdef CONFIG_ZSWAP =20 struct swap_in_memory_cache_cb; +struct zswap_decomp_batch; +struct zswap_entry; =20 struct zswap_lruvec_state { /* @@ -120,6 +122,19 @@ static inline void zswap_store_batch(struct swap_in_me= mory_cache_cb *simc) } =20 bool zswap_load_batching_enabled(void); +void zswap_load_batch_init(struct zswap_decomp_batch *zd_batch); +void zswap_load_batch_reinit(struct zswap_decomp_batch *zd_batch); +bool __zswap_add_load_batch(struct zswap_decomp_batch *zd_batch, + struct folio *folio); +static inline bool zswap_add_load_batch( + struct zswap_decomp_batch *zd_batch, + struct folio *folio) +{ + if (zswap_load_batching_enabled()) + return __zswap_add_load_batch(zd_batch, folio); + + return false; +} =20 unsigned long zswap_total_pages(void); bool zswap_store(struct folio *folio); @@ -138,6 +153,8 @@ struct zswap_lruvec_state {}; struct zswap_store_sub_batch_page {}; struct zswap_store_pipeline_state {}; struct swap_in_memory_cache_cb; +struct zswap_decomp_batch; +struct zswap_entry; =20 static inline bool zswap_store_batching_enabled(void) { @@ -153,6 +170,24 @@ static inline bool zswap_load_batching_enabled(void) return false; } =20 +static inline void zswap_load_batch_init( + struct zswap_decomp_batch *zd_batch) +{ +} + +static inline void zswap_load_batch_reinit( + struct zswap_decomp_batch *zd_batch) +{ +} + +static inline bool zswap_add_load_batch( + struct folio *folio, + struct zswap_entry *entry, + struct zswap_decomp_batch *zd_batch) +{ + return false; +} + static inline bool zswap_store(struct folio *folio) { return false; diff --git a/mm/memory.c b/mm/memory.c index 0f614523b9f4..b5745b9ffdf7 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4322,7 +4322,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) =20 /* To provide entry to swap_read_folio() */ folio->swap =3D entry; - swap_read_folio(folio, NULL); + swap_read_folio(folio, NULL, NULL, NULL); folio->private =3D NULL; } } else { diff --git a/mm/page_io.c b/mm/page_io.c index 065db25309b8..9750302d193b 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -744,11 +744,17 @@ static void swap_read_folio_bdev_async(struct folio *= folio, submit_bio(bio); } =20 -void swap_read_folio(struct folio *folio, struct swap_iocb **plug) +/* + * Returns true if the folio was read, and false if the folio was added to + * the zswap_decomp_batch for batched decompression. 
+ */ +bool swap_read_folio(struct folio *folio, struct swap_iocb **plug, + struct zswap_decomp_batch *zswap_batch, + struct folio_batch *non_zswap_batch) { struct swap_info_struct *sis =3D swp_swap_info(folio->swap); bool synchronous =3D sis->flags & SWP_SYNCHRONOUS_IO; - bool workingset =3D folio_test_workingset(folio); + bool workingset; unsigned long pflags; bool in_thrashing; =20 @@ -756,11 +762,26 @@ void swap_read_folio(struct folio *folio, struct swap= _iocb **plug) VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); VM_BUG_ON_FOLIO(folio_test_uptodate(folio), folio); =20 + /* + * If entry is found in zswap xarray, and zswap load batching + * is enabled, this is a candidate for zswap batch decompression. + */ + if (zswap_batch && zswap_add_load_batch(zswap_batch, folio)) + return false; + + if (zswap_load_batching_enabled() && non_zswap_batch) { + BUG_ON(!folio_batch_space(non_zswap_batch)); + folio_batch_add(non_zswap_batch, folio); + return false; + } + /* * Count submission time as memory stall and delay. When the device * is congested, or the submitting cgroup IO-throttled, submission * can be a significant part of overall IO time. */ + workingset =3D folio_test_workingset(folio); + if (workingset) { delayacct_thrashing_start(&in_thrashing); psi_memstall_enter(&pflags); @@ -792,6 +813,7 @@ void swap_read_folio(struct folio *folio, struct swap_i= ocb **plug) psi_memstall_leave(&pflags); } delayacct_swapin_end(); + return true; } =20 void __swap_read_unplug(struct swap_iocb *sio) diff --git a/mm/swap.h b/mm/swap.h index 0bb386b5fdee..310f99007fe6 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -84,6 +84,27 @@ void swap_crypto_acomp_decompress_batch( int nr_pages, struct crypto_acomp_ctx *acomp_ctx); =20 +#if defined(CONFIG_ZSWAP_LOAD_BATCHING_ENABLED) +#define MAX_NR_ZSWAP_LOAD_SUB_BATCHES DIV_ROUND_UP(PAGEVEC_SIZE, \ + SWAP_CRYPTO_SUB_BATCH_SIZE) +#else +#define MAX_NR_ZSWAP_LOAD_SUB_BATCHES 1UL +#endif /* CONFIG_ZSWAP_LOAD_BATCHING_ENABLED */ + +/* + * Note: If PAGEVEC_SIZE or SWAP_CRYPTO_SUB_BATCH_SIZE + * exceeds 256, change the u8 to u16. + */ +struct zswap_decomp_batch { + struct folio_batch fbatch; + bool swapcache[PAGEVEC_SIZE]; + struct xarray *trees[MAX_NR_ZSWAP_LOAD_SUB_BATCHES][SWAP_CRYPTO_SUB_BATCH= _SIZE]; + struct zswap_entry *entries[MAX_NR_ZSWAP_LOAD_SUB_BATCHES][SWAP_CRYPTO_SU= B_BATCH_SIZE]; + struct page *pages[MAX_NR_ZSWAP_LOAD_SUB_BATCHES][SWAP_CRYPTO_SUB_BATCH_S= IZE]; + unsigned int slens[MAX_NR_ZSWAP_LOAD_SUB_BATCHES][SWAP_CRYPTO_SUB_BATCH_S= IZE]; + u8 nr_decomp[MAX_NR_ZSWAP_LOAD_SUB_BATCHES]; +}; + /* linux/mm/vmscan.c, linux/mm/page_io.c, linux/mm/zswap.c */ /* For batching of compressions in reclaim path. 
*/ struct swap_in_memory_cache_cb { @@ -101,7 +122,9 @@ struct swap_in_memory_cache_cb { /* linux/mm/page_io.c */ int sio_pool_init(void); struct swap_iocb; -void swap_read_folio(struct folio *folio, struct swap_iocb **plug); +bool swap_read_folio(struct folio *folio, struct swap_iocb **plug, + struct zswap_decomp_batch *zswap_batch, + struct folio_batch *non_zswap_batch); void __swap_read_unplug(struct swap_iocb *plug); static inline void swap_read_unplug(struct swap_iocb *plug) { @@ -238,8 +261,12 @@ static inline void swap_crypto_acomp_decompress_batch( { } =20 -static inline void swap_read_folio(struct folio *folio, struct swap_iocb *= *plug) +struct zswap_decomp_batch {}; +static inline bool swap_read_folio(struct folio *folio, struct swap_iocb *= *plug, + struct zswap_decomp_batch *zswap_batch, + struct folio_batch *non_zswap_batch) { + return false; } static inline void swap_write_unplug(struct swap_iocb *sio) { diff --git a/mm/swap_state.c b/mm/swap_state.c index 0673593d363c..0aa938e4c34d 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -570,7 +570,7 @@ struct folio *read_swap_cache_async(swp_entry_t entry, = gfp_t gfp_mask, mpol_cond_put(mpol); =20 if (page_allocated) - swap_read_folio(folio, plug); + swap_read_folio(folio, plug, NULL, NULL); return folio; } =20 @@ -687,7 +687,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry,= gfp_t gfp_mask, if (!folio) continue; if (page_allocated) { - swap_read_folio(folio, &splug); + swap_read_folio(folio, &splug, NULL, NULL); if (offset !=3D entry_offset) { folio_set_readahead(folio); count_vm_event(SWAP_RA); @@ -703,7 +703,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry,= gfp_t gfp_mask, folio =3D __read_swap_cache_async(entry, gfp_mask, mpol, ilx, &page_allocated, false); if (unlikely(page_allocated)) - swap_read_folio(folio, NULL); + swap_read_folio(folio, NULL, NULL, NULL); return folio; } =20 @@ -1057,7 +1057,7 @@ static struct folio *swap_vma_readahead(swp_entry_t t= arg_entry, gfp_t gfp_mask, if (!folio) continue; if (page_allocated) { - swap_read_folio(folio, &splug); + swap_read_folio(folio, &splug, NULL, NULL); if (addr !=3D vmf->address) { folio_set_readahead(folio); count_vm_event(SWAP_RA); @@ -1075,7 +1075,7 @@ static struct folio *swap_vma_readahead(swp_entry_t t= arg_entry, gfp_t gfp_mask, folio =3D __read_swap_cache_async(targ_entry, gfp_mask, mpol, targ_ilx, &page_allocated, false); if (unlikely(page_allocated)) - swap_read_folio(folio, NULL); + swap_read_folio(folio, NULL, NULL, NULL); return folio; } =20 diff --git a/mm/zswap.c b/mm/zswap.c index fe7bc2a6672e..1d293f95d525 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -2312,6 +2312,95 @@ bool zswap_load(struct folio *folio) return true; } =20 +/* Code for zswap load batch with batch decompress. */ + +__always_inline void zswap_load_batch_init(struct zswap_decomp_batch *zd_b= atch) +{ + unsigned int sb; + + folio_batch_init(&zd_batch->fbatch); + + for (sb =3D 0; sb < MAX_NR_ZSWAP_LOAD_SUB_BATCHES; ++sb) + zd_batch->nr_decomp[sb] =3D 0; +} + +__always_inline void zswap_load_batch_reinit(struct zswap_decomp_batch *zd= _batch) +{ + unsigned int sb; + + folio_batch_reinit(&zd_batch->fbatch); + + for (sb =3D 0; sb < MAX_NR_ZSWAP_LOAD_SUB_BATCHES; ++sb) + zd_batch->nr_decomp[sb] =3D 0; +} + +/* + * All folios in zd_batch are allocated into the swapcache + * in swapin_readahead(), before being added to the zd_batch + * for batch decompression. 
+ */ +bool __zswap_add_load_batch(struct zswap_decomp_batch *zd_batch, + struct folio *folio) +{ + swp_entry_t swp =3D folio->swap; + pgoff_t offset =3D swp_offset(swp); + bool swapcache =3D folio_test_swapcache(folio); + struct xarray *tree =3D swap_zswap_tree(swp); + struct zswap_entry *entry; + unsigned int batch_idx, sb; + + VM_WARN_ON_ONCE(!folio_test_locked(folio)); + + if (zswap_never_enabled()) + return false; + + /* + * Large folios should not be swapped in while zswap is being used, as + * they are not properly handled. Zswap does not properly load large + * folios, and a large folio may only be partially in zswap. + * + * Returning false here will cause the large folio to be added to + * the "non_zswap_batch" in swap_read_folio(), which will eventually + * call zswap_load() if the folio is not in the zeromap. Finally, + * zswap_load() will return true without marking the folio uptodate + * so that an IO error is emitted (e.g. do_swap_page() will sigbus). + */ + if (WARN_ON_ONCE(folio_test_large(folio))) + return false; + + /* + * When reading into the swapcache, invalidate our entry. The + * swapcache can be the authoritative owner of the page and + * its mappings, and the pressure that results from having two + * in-memory copies outweighs any benefits of caching the + * compression work. + */ + if (swapcache) + entry =3D xa_erase(tree, offset); + else + entry =3D xa_load(tree, offset); + + if (!entry) + return false; + + BUG_ON(!folio_batch_space(&zd_batch->fbatch)); + folio_batch_add(&zd_batch->fbatch, folio); + + batch_idx =3D folio_batch_count(&zd_batch->fbatch) - 1; + zd_batch->swapcache[batch_idx] =3D swapcache; + sb =3D batch_idx / SWAP_CRYPTO_SUB_BATCH_SIZE; + + if (entry->length) { + zd_batch->trees[sb][zd_batch->nr_decomp[sb]] =3D tree; + zd_batch->entries[sb][zd_batch->nr_decomp[sb]] =3D entry; + zd_batch->pages[sb][zd_batch->nr_decomp[sb]] =3D &folio->page; + zd_batch->slens[sb][zd_batch->nr_decomp[sb]] =3D entry->length; + zd_batch->nr_decomp[sb]++; + } + + return true; +} + void zswap_invalidate(swp_entry_t swp) { pgoff_t offset =3D swp_offset(swp); --=20 2.27.0 From nobody Tue Nov 26 10:22:36 2024 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.14]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7F614188713 for ; Fri, 18 Oct 2024 06:48:12 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.14 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1729234094; cv=none; b=laCxVVxYP2ekjxw1KUz92uLXUPLIKWSzFAENGWHPKPQZKRHKUppR3PAU6T2pbhp+PiHTbZp6fe5vWmVUgPaJ9vfn6AElsirCibGYyihGOFmRtjjwNiQV5dwvCCRb2xRIaspFrJVuV2nrBUPWH2mD1KaLRq/u7unVKLmJCOAYr48= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1729234094; c=relaxed/simple; bh=EVMRjfiuNxd9amTtVRjIPzNA0R9fg2+OAdURffS6kSM=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=PYLcvr5tkJzVCkRkkXOLx/LV0os4aMadepvLovOSiWacj+Bxy0ZjJZS2Bb8aEUUjiggUptFCRVON0wg1/O54qOX3NmRknm8mL3ZiRkoXFOjnvQy2LI0RLxS0Hlx3X6YgDk91FniWADE9VzWq2/yh8iPRflN8Q/rqGXzT1So6E1c= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=WWxAGnrc; arc=none smtp.client-ip=192.198.163.14 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none 
dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="WWxAGnrc" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1729234092; x=1760770092; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=EVMRjfiuNxd9amTtVRjIPzNA0R9fg2+OAdURffS6kSM=; b=WWxAGnrc9MUQmyXhbClJyVFcdMLJc5WiHOT3+N9CRCs9mOZOHg/kEh0o nbSsseT3KMxdehaFO5FX4n2CF0ItcF3K4Cm8fMf5QBQMvOddtyaC66DOC n9NcvzQyMxACcI2SDisu8yfM/WtM0gyC/ZxHMZ3lE661QOPXxuLY2r0Qg VjGZw3Fmvw+dJTFEaHvzHVA5fprGIQi4Gt59/zUvFG/dckUfEuL/vqeOT e37gJbIciiQt6ZQspqnsrVMqN0QoB75srHnJiTwJZFvemfvfsvFI+i1tA uilP7Sd9dxyLnCf4uVcEbS224IyNZfahWbuiu3bh956oC3bpPSWxBvGmW A==; X-CSE-ConnectionGUID: wG+1eC8cQMCqIjUx0O5SPA== X-CSE-MsgGUID: 144r3/6fS6ePkpFxBnAB3w== X-IronPort-AV: E=McAfee;i="6700,10204,11228"; a="28963362" X-IronPort-AV: E=Sophos;i="6.11,212,1725346800"; d="scan'208";a="28963362" Received: from fmviesa003.fm.intel.com ([10.60.135.143]) by fmvoesa108.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 17 Oct 2024 23:48:09 -0700 X-CSE-ConnectionGUID: bK28v/+wSFW12salxHZpOw== X-CSE-MsgGUID: 2L41HZYESMSBem9+UQfdBw== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.11,212,1725346800"; d="scan'208";a="82744521" Received: from jf5300-b11a338t.jf.intel.com ([10.242.51.6]) by fmviesa003.fm.intel.com with ESMTP; 17 Oct 2024 23:48:08 -0700 From: Kanchana P Sridhar To: linux-kernel@vger.kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org, yosryahmed@google.com, nphamcs@gmail.com, chengming.zhou@linux.dev, usamaarif642@gmail.com, ryan.roberts@arm.com, ying.huang@intel.com, 21cnbao@gmail.com, akpm@linux-foundation.org, hughd@google.com, willy@infradead.org, bfoster@redhat.com, dchinner@redhat.com, chrisl@kernel.org, david@redhat.com Cc: wajdi.k.feghali@intel.com, vinodh.gopal@intel.com, kanchana.p.sridhar@intel.com Subject: [RFC PATCH v1 5/7] mm: swap, zswap: zswap folio_batch processing with IAA decompression batching. Date: Thu, 17 Oct 2024 23:48:03 -0700 Message-Id: <20241018064805.336490-6-kanchana.p.sridhar@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20241018064805.336490-1-kanchana.p.sridhar@intel.com> References: <20241018064805.336490-1-kanchana.p.sridhar@intel.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" This patch provides the functionality that processes a "zswap_batch" in which swap_read_folio() had previously stored swap entries found in zswap, for batched loading. The newly added zswap_finish_load_batch() API implements the main zswap load batching functionality. This makes use of the sub-batches of zswap_entry/xarray/page/source-length readily available from zswap_add_load_batch(). These sub-batch arrays are processed one at a time, until the entire zswap folio_batch has been loaded. The existing zswap_load() functionality of deleting zswap_entries for folios found in the swapcache, is preserved. 
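At a high level, the batched load added by this patch has the following shape (a condensed, illustrative sketch of __zswap_finish_load_batch() and its helpers in this patch; the sketch function name is hypothetical, per-folio delay/thrashing accounting, ZSWPIN counters, error handling and the swapcache entry handling are omitted, and acomp_ctx->buffer is assumed to provide one scratch buffer per sub-batch slot):

	static void zswap_finish_load_batch_sketch(struct zswap_decomp_batch *zd_batch)
	{
		unsigned int nr_folios = folio_batch_count(&zd_batch->fbatch);
		unsigned int nr_sb = DIV_ROUND_UP(nr_folios, SWAP_CRYPTO_SUB_BATCH_SIZE);
		unsigned int sb, i;

		for (sb = 0; sb < nr_sb; ++sb) {
			int errors[SWAP_CRYPTO_SUB_BATCH_SIZE];
			struct crypto_acomp_ctx *acomp_ctx;

			if (!zd_batch->nr_decomp[sb])
				continue;

			acomp_ctx = raw_cpu_ptr(zd_batch->entries[sb][0]->pool->acomp_ctx);
			mutex_lock(&acomp_ctx->mutex);

			/* Copy the compressed sources out of the zpool. */
			for (i = 0; i < zd_batch->nr_decomp[sb]; ++i) {
				struct zswap_entry *entry = zd_batch->entries[sb][i];
				struct zpool *zpool = entry->pool->zpool;
				u8 *buf = zpool_map_handle(zpool, entry->handle,
							   ZPOOL_MM_RO);

				memcpy(acomp_ctx->buffer[i], buf, entry->length);
				zpool_unmap_handle(zpool, entry->handle);
			}

			/*
			 * Submit up to SWAP_CRYPTO_SUB_BATCH_SIZE decompress
			 * jobs; with IAA in async poll mode these complete in
			 * parallel in hardware.
			 */
			swap_crypto_acomp_decompress_batch(acomp_ctx->buffer,
							   zd_batch->pages[sb],
							   zd_batch->slens[sb],
							   errors,
							   zd_batch->nr_decomp[sb],
							   acomp_ctx);

			mutex_unlock(&acomp_ctx->mutex);

			/* Per-folio updates (uptodate, dirty, stats) follow here. */
		}
	}
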
Signed-off-by: Kanchana P Sridhar --- include/linux/zswap.h | 22 ++++++ mm/page_io.c | 35 +++++++++ mm/swap.h | 17 +++++ mm/zswap.c | 171 ++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 245 insertions(+) diff --git a/include/linux/zswap.h b/include/linux/zswap.h index 1d6de281f243..a0792c2b300a 100644 --- a/include/linux/zswap.h +++ b/include/linux/zswap.h @@ -110,6 +110,15 @@ struct zswap_store_pipeline_state { u8 nr_comp_pages; }; =20 +/* Note: If SWAP_CRYPTO_SUB_BATCH_SIZE exceeds 256, change the u8 to u16. = */ +struct zswap_load_sub_batch_state { + struct xarray **trees; + struct zswap_entry **entries; + struct page **pages; + unsigned int *slens; + u8 nr_decomp; +}; + bool zswap_store_batching_enabled(void); void __zswap_store_batch(struct swap_in_memory_cache_cb *simc); void __zswap_store_batch_single(struct swap_in_memory_cache_cb *simc); @@ -136,6 +145,14 @@ static inline bool zswap_add_load_batch( return false; } =20 +void __zswap_finish_load_batch(struct zswap_decomp_batch *zd_batch); +static inline void zswap_finish_load_batch( + struct zswap_decomp_batch *zd_batch) +{ + if (zswap_load_batching_enabled()) + __zswap_finish_load_batch(zd_batch); +} + unsigned long zswap_total_pages(void); bool zswap_store(struct folio *folio); bool zswap_load(struct folio *folio); @@ -188,6 +205,11 @@ static inline bool zswap_add_load_batch( return false; } =20 +static inline void zswap_finish_load_batch( + struct zswap_decomp_batch *zd_batch) +{ +} + static inline bool zswap_store(struct folio *folio) { return false; diff --git a/mm/page_io.c b/mm/page_io.c index 9750302d193b..aa83221318ef 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -816,6 +816,41 @@ bool swap_read_folio(struct folio *folio, struct swap_= iocb **plug, return true; } =20 +static void __swap_post_process_zswap_load_batch( + struct zswap_decomp_batch *zswap_batch) +{ + u8 i; + + for (i =3D 0; i < folio_batch_count(&zswap_batch->fbatch); ++i) { + struct folio *folio =3D zswap_batch->fbatch.folios[i]; + folio_unlock(folio); + } +} + +/* + * The swapin_readahead batching interface makes sure that the + * input zswap_batch consists of folios belonging to the same swap + * device type. + */ +void __swap_read_zswap_batch_unplug(struct zswap_decomp_batch *zswap_batch, + struct swap_iocb **splug) +{ + unsigned long pflags; + + if (!folio_batch_count(&zswap_batch->fbatch)) + return; + + psi_memstall_enter(&pflags); + delayacct_swapin_start(); + + /* Load the zswap batch. 
*/ + zswap_finish_load_batch(zswap_batch); + __swap_post_process_zswap_load_batch(zswap_batch); + + psi_memstall_leave(&pflags); + delayacct_swapin_end(); +} + void __swap_read_unplug(struct swap_iocb *sio) { struct iov_iter from; diff --git a/mm/swap.h b/mm/swap.h index 310f99007fe6..2b82c8ed765c 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -125,6 +125,16 @@ struct swap_iocb; bool swap_read_folio(struct folio *folio, struct swap_iocb **plug, struct zswap_decomp_batch *zswap_batch, struct folio_batch *non_zswap_batch); +void __swap_read_zswap_batch_unplug( + struct zswap_decomp_batch *zswap_batch, + struct swap_iocb **splug); +static inline void swap_read_zswap_batch_unplug( + struct zswap_decomp_batch *zswap_batch, + struct swap_iocb **splug) +{ + if (likely(zswap_batch)) + __swap_read_zswap_batch_unplug(zswap_batch, splug); +} void __swap_read_unplug(struct swap_iocb *plug); static inline void swap_read_unplug(struct swap_iocb *plug) { @@ -268,6 +278,13 @@ static inline bool swap_read_folio(struct folio *folio= , struct swap_iocb **plug, { return false; } + +static inline void swap_read_zswap_batch_unplug( + struct zswap_decomp_batch *zswap_batch, + struct swap_iocb **splug) +{ +} + static inline void swap_write_unplug(struct swap_iocb *sio) { } diff --git a/mm/zswap.c b/mm/zswap.c index 1d293f95d525..39bf7d8810e9 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -35,6 +35,7 @@ #include #include #include +#include =20 #include "swap.h" #include "internal.h" @@ -2401,6 +2402,176 @@ bool __zswap_add_load_batch(struct zswap_decomp_bat= ch *zd_batch, return true; } =20 +static __always_inline void zswap_load_sub_batch_init( + struct zswap_decomp_batch *zd_batch, + unsigned int sb, + struct zswap_load_sub_batch_state *zls) +{ + zls->trees =3D zd_batch->trees[sb]; + zls->entries =3D zd_batch->entries[sb]; + zls->pages =3D zd_batch->pages[sb]; + zls->slens =3D zd_batch->slens[sb]; + zls->nr_decomp =3D zd_batch->nr_decomp[sb]; +} + +static void zswap_load_map_sources( + struct zswap_load_sub_batch_state *zls, + u8 *srcs[]) +{ + u8 i; + + for (i =3D 0; i < zls->nr_decomp; ++i) { + struct zswap_entry *entry =3D zls->entries[i]; + struct zpool *zpool =3D entry->pool->zpool; + u8 *buf =3D zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO); + memcpy(srcs[i], buf, entry->length); + zpool_unmap_handle(zpool, entry->handle); + } +} + +static void zswap_decompress_batch( + struct zswap_load_sub_batch_state *zls, + u8 *srcs[], + int decomp_errors[]) +{ + struct crypto_acomp_ctx *acomp_ctx; + + acomp_ctx =3D raw_cpu_ptr(zls->entries[0]->pool->acomp_ctx); + + swap_crypto_acomp_decompress_batch( + srcs, + zls->pages, + zls->slens, + decomp_errors, + zls->nr_decomp, + acomp_ctx); +} + +static void zswap_load_batch_updates( + struct zswap_decomp_batch *zd_batch, + unsigned int sb, + struct zswap_load_sub_batch_state *zls, + int decomp_errors[]) +{ + unsigned int j; + u8 i; + + for (i =3D 0; i < zls->nr_decomp; ++i) { + j =3D (sb * SWAP_CRYPTO_SUB_BATCH_SIZE) + i; + struct folio *folio =3D zd_batch->fbatch.folios[j]; + struct zswap_entry *entry =3D zls->entries[i]; + + BUG_ON(decomp_errors[i]); + count_vm_event(ZSWPIN); + if (entry->objcg) + count_objcg_events(entry->objcg, ZSWPIN, 1); + + if (zd_batch->swapcache[j]) { + zswap_entry_free(entry); + folio_mark_dirty(folio); + } + + folio_mark_uptodate(folio); + } +} + +static void zswap_load_decomp_batch( + struct zswap_decomp_batch *zd_batch, + unsigned int sb, + struct zswap_load_sub_batch_state *zls) +{ + int decomp_errors[SWAP_CRYPTO_SUB_BATCH_SIZE]; + struct 
crypto_acomp_ctx *acomp_ctx; + + acomp_ctx =3D raw_cpu_ptr(zls->entries[0]->pool->acomp_ctx); + mutex_lock(&acomp_ctx->mutex); + + zswap_load_map_sources(zls, acomp_ctx->buffer); + + zswap_decompress_batch(zls, acomp_ctx->buffer, decomp_errors); + + mutex_unlock(&acomp_ctx->mutex); + + zswap_load_batch_updates(zd_batch, sb, zls, decomp_errors); +} + +static void zswap_load_start_accounting( + struct zswap_decomp_batch *zd_batch, + unsigned int sb, + struct zswap_load_sub_batch_state *zls, + bool workingset[], + bool in_thrashing[]) +{ + unsigned int j; + u8 i; + + for (i =3D 0; i < zls->nr_decomp; ++i) { + j =3D (sb * SWAP_CRYPTO_SUB_BATCH_SIZE) + i; + struct folio *folio =3D zd_batch->fbatch.folios[j]; + workingset[i] =3D folio_test_workingset(folio); + if (workingset[i]) + delayacct_thrashing_start(&in_thrashing[i]); + } +} + +static void zswap_load_end_accounting( + struct zswap_decomp_batch *zd_batch, + struct zswap_load_sub_batch_state *zls, + bool workingset[], + bool in_thrashing[]) +{ + u8 i; + + for (i =3D 0; i < zls->nr_decomp; ++i) + if (workingset[i]) + delayacct_thrashing_end(&in_thrashing[i]); +} + +/* + * All entries in a zd_batch belong to the same swap device. + */ +void __zswap_finish_load_batch(struct zswap_decomp_batch *zd_batch) +{ + struct zswap_load_sub_batch_state zls; + unsigned int nr_folios =3D folio_batch_count(&zd_batch->fbatch); + unsigned int nr_sb =3D DIV_ROUND_UP(nr_folios, SWAP_CRYPTO_SUB_BATCH_SIZE= ); + unsigned int sb; + + /* + * Process the zd_batch in sub-batches of + * SWAP_CRYPTO_SUB_BATCH_SIZE. + */ + for (sb =3D 0; sb < nr_sb; ++sb) { + bool workingset[SWAP_CRYPTO_SUB_BATCH_SIZE]; + bool in_thrashing[SWAP_CRYPTO_SUB_BATCH_SIZE]; + + zswap_load_sub_batch_init(zd_batch, sb, &zls); + + zswap_load_start_accounting(zd_batch, sb, &zls, + workingset, in_thrashing); + + /* Decompress the batch. */ + if (zls.nr_decomp) + zswap_load_decomp_batch(zd_batch, sb, &zls); + + /* + * Should we free zswap_entries, as in zswap_load(): + * With the new swapin_readahead batching interface, + * all prefetch entries are read into the swapcache. + * Freeing the zswap entries here causes segfaults, + * most probably because a page-fault occured while + * the buffer was being decompressed. + * Allowing the regular folio_free_swap() sequence + * in do_swap_page() appears to keep things stable + * without duplicated zswap-swapcache memory, as far + * as I can tell from my testing. 
+ */ + + zswap_load_end_accounting(zd_batch, &zls, + workingset, in_thrashing); + } +} + void zswap_invalidate(swp_entry_t swp) { pgoff_t offset =3D swp_offset(swp); --=20 2.27.0 From nobody Tue Nov 26 10:22:36 2024 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.14]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A29F0188722 for ; Fri, 18 Oct 2024 06:48:12 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.14 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1729234095; cv=none; b=SL4TZ8RgjlENT0dcHEVdffHJEJAfNZzy3zlviIdzdgzLoPu+EI6kFUIvHW4XAxj0bRnROiAQe+RXMn1p3atvnsFLVagVzy8QcEQ+K6SVcS6VBpSqZYrATa/qg0/VcmDrIealCj+TulmUF5HouRvYWFsRFiM2L6k63Y7fk9GDLm0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1729234095; c=relaxed/simple; bh=NVsoZ0G+h+yCIJer8I6ApzubS8LAuLTzsZjgRgzHJ0A=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=oSJavP0grXQsqquc9+z3u2stuoeHYOF1zzT6agmLxQnac7lvrYWpKcjeWk9gVKD06lhl2ykPZ9QLgTePxqmTm/74pTSY/OkZUkgVvrzI5dkN/3jQYAAZ9c2PDRd7il9XvGJsN8j7/X9gUtwVoPejeNbivooNoM13pMuNvBvpEzc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=aiGunNIg; arc=none smtp.client-ip=192.198.163.14 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="aiGunNIg" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1729234093; x=1760770093; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=NVsoZ0G+h+yCIJer8I6ApzubS8LAuLTzsZjgRgzHJ0A=; b=aiGunNIgOF6wLW7ueRED/2a/T/wgOiIBHWgou3FYtJbO9Di8/5mZS8xp rN+Qs8WU9z/gWYonQTA0Knhi+cz3Jr7kkR8A9OwT9GayhklBEjAwCgaxn uJY068CDo/hA2GE3e0Vbjo5Gn8WQPf10Slz1468CQ2lpnA15fynV/RL/f EZ66U/G4iV8Z70C5SsOB1LufSKJThpwxAqWlMjsBC29LusbRCGXO9ABRx YbOyoTv5N4P2LmmGpHvbpgUYAzci2zHZ+flV270atGPkDDRTqsuThmZgs tNALWxgcx9DFsUWK1rNZOsw+mR3yDtsEJpFrnL9y0aEsDfcF1L6NM6OZh Q==; X-CSE-ConnectionGUID: FFMTDIK3RfCsKeY2q3rfrQ== X-CSE-MsgGUID: Hxr+1EOJREesOyHFsZ6/GQ== X-IronPort-AV: E=McAfee;i="6700,10204,11228"; a="28963374" X-IronPort-AV: E=Sophos;i="6.11,212,1725346800"; d="scan'208";a="28963374" Received: from fmviesa003.fm.intel.com ([10.60.135.143]) by fmvoesa108.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 17 Oct 2024 23:48:10 -0700 X-CSE-ConnectionGUID: 9smf1lgiQ5GrHRanDYY+1g== X-CSE-MsgGUID: vZzx9IQOSxa3k9xr46CDBg== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.11,212,1725346800"; d="scan'208";a="82744530" Received: from jf5300-b11a338t.jf.intel.com ([10.242.51.6]) by fmviesa003.fm.intel.com with ESMTP; 17 Oct 2024 23:48:09 -0700 From: Kanchana P Sridhar To: linux-kernel@vger.kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org, yosryahmed@google.com, nphamcs@gmail.com, chengming.zhou@linux.dev, usamaarif642@gmail.com, ryan.roberts@arm.com, ying.huang@intel.com, 21cnbao@gmail.com, akpm@linux-foundation.org, hughd@google.com, willy@infradead.org, bfoster@redhat.com, dchinner@redhat.com, chrisl@kernel.org, 
david@redhat.com Cc: wajdi.k.feghali@intel.com, vinodh.gopal@intel.com, kanchana.p.sridhar@intel.com Subject: [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface. Date: Thu, 17 Oct 2024 23:48:04 -0700 Message-Id: <20241018064805.336490-7-kanchana.p.sridhar@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20241018064805.336490-1-kanchana.p.sridhar@intel.com> References: <20241018064805.336490-1-kanchana.p.sridhar@intel.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" This patch invokes the swapin_readahead() based batching interface to prefetch a batch of 4K folios for zswap load with batch decompressions in parallel using IAA hardware. swapin_readahead() prefetches folios based on vm.page-cluster and the usefulness of prior prefetches to the workload. As folios are created in the swapcache and the readahead code calls swap_read_folio() with a "zswap_batch" and a "non_zswap_batch", the respective folio_batches get populated with the folios to be read. Finally, the swapin_readahead() procedures will call the newly added process_ra_batch_of_same_type() which: 1) Reads all the non_zswap_batch folios sequentially by calling swap_read_folio(). 2) Calls swap_read_zswap_batch_unplug() with the zswap_batch which calls zswap_finish_load_batch() that finally decompresses each SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e. upto 8 pages in a prefetch batch of say, 32 folios) in parallel with IAA. Within do_swap_page(), we try to benefit from batch decompressions in both these scenarios: 1) single-mapped, SWP_SYNCHRONOUS_IO: We call swapin_readahead() with "single_mapped_path =3D true". This is done only in the !zswap_never_enabled() case. 2) Shared and/or non-SWP_SYNCHRONOUS_IO folios: We call swapin_readahead() with "single_mapped_path =3D false". This will place folios in the swapcache: a design choice that handles cases where a folio that is "single-mapped" in process 1 could be prefetched in process 2; and handles highly contended server scenarios with stability. There are checks added at the end of do_swap_page(), after the folio has been successfully loaded, to detect if the single-mapped swapcache folio is still single-mapped, and if so, folio_free_swap() is called on the folio. Within the swapin_readahead() functions, if single_mapped_path is true, and either the platform does not have IAA, or, if the platform has IAA and the user selects a software compressor for zswap (details of sysfs knob follow), readahead/batching are skipped and the folio is loaded using zswap_load(). A new swap parameter "singlemapped_ra_enabled" (false by default) is added for platforms that have IAA, zswap_load_batching_enabled() is true, and we want to give the user the option to run experiments with IAA and with software compressors for zswap (swap device is SWP_SYNCHRONOUS_IO): For IAA: echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled For software compressors: echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled If "singlemapped_ra_enabled" is set to false, swapin_readahead() will skip prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO" do_swap_page() path. Thanks Ying Huang for the really helpful brainstorming discussions on the swap_read_folio() plug design. 
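In outline, the single-mapped SWP_SYNCHRONOUS_IO decision in do_swap_page() becomes the following (a condensed sketch of the mm/memory.c diff below; swapcache_prepare(), error handling and the large-folio details are omitted):

    if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1) {
            if (zswap_never_enabled()) {
                    /* Existing behavior: skip the swapcache. */
                    folio = alloc_swap_folio(vmf);
                    /* ... swapcache_prepare(), charging, shadow refault ... */
                    swap_read_folio(folio, NULL, NULL, NULL);
            } else {
                    /*
                     * zswap is or was enabled: place the faulting folio in the
                     * swapcache and let swapin_readahead() batch-decompress the
                     * prefetched zswap folios with IAA, gated on the new
                     * "singlemapped_ra_enabled" swap parameter.
                     */
                    folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf, true);
                    single_mapped_swapcache = true;
                    swapcache = folio;
            }
    } else {
            /* Shared and/or non-SWP_SYNCHRONOUS_IO folios. */
            folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf, false);
            swapcache = folio;
    }

At the end of do_swap_page(), should_free_singlemap_swapcache() then checks whether the swapcache folio is still single-mapped and exclusive to this process, and if so folio_free_swap() is called on it.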
Suggested-by: Ying Huang Signed-off-by: Kanchana P Sridhar --- mm/memory.c | 187 +++++++++++++++++++++++++++++++++++++----------- mm/shmem.c | 2 +- mm/swap.h | 12 ++-- mm/swap_state.c | 157 ++++++++++++++++++++++++++++++++++++---- mm/swapfile.c | 2 +- 5 files changed, 299 insertions(+), 61 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index b5745b9ffdf7..9655b85fc243 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3924,6 +3924,42 @@ static vm_fault_t remove_device_exclusive_entry(stru= ct vm_fault *vmf) return 0; } =20 +/* + * swapin readahead based batching interface for zswap batched loads using= IAA: + * + * Should only be called for and if the faulting swap entry in do_swap_page + * is single-mapped and SWP_SYNCHRONOUS_IO. + * + * Detect if the folio is in the swapcache, is still mapped to only this + * process, and further, there are no additional references to this folio + * (for e.g. if another process simultaneously readahead this swap entry + * while this process was handling the page-fault, and got a pointer to the + * folio allocated by this process in the swapcache), besides the referenc= es + * that were obtained within __read_swap_cache_async() by this process tha= t is + * faulting in this single-mapped swap entry. + */ +static inline bool should_free_singlemap_swapcache(swp_entry_t entry, + struct folio *folio) +{ + if (!folio_test_swapcache(folio)) + return false; + + if (__swap_count(entry) !=3D 0) + return false; + + /* + * The folio ref count for a single-mapped folio that was allocated + * in __read_swap_cache_async(), can be a maximum of 3. These are the + * incrementors of the folio ref count in __read_swap_cache_async(): + * folio_alloc_mpol(), add_to_swap_cache(), folio_add_lru(). + */ + + if (folio_ref_count(folio) <=3D 3) + return true; + + return false; +} + static inline bool should_try_to_free_swap(struct folio *folio, struct vm_area_struct *vma, unsigned int fault_flags) @@ -4215,6 +4251,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) swp_entry_t entry; pte_t pte; vm_fault_t ret =3D 0; + bool single_mapped_swapcache =3D false; void *shadow =3D NULL; int nr_pages; unsigned long page_idx; @@ -4283,51 +4320,90 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (!folio) { if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) =3D=3D 1) { - /* skip swapcache */ - folio =3D alloc_swap_folio(vmf); - if (folio) { - __folio_set_locked(folio); - __folio_set_swapbacked(folio); - - nr_pages =3D folio_nr_pages(folio); - if (folio_test_large(folio)) - entry.val =3D ALIGN_DOWN(entry.val, nr_pages); - /* - * Prevent parallel swapin from proceeding with - * the cache flag. Otherwise, another thread - * may finish swapin first, free the entry, and - * swapout reusing the same entry. It's - * undetectable as pte_same() returns true due - * to entry reuse. - */ - if (swapcache_prepare(entry, nr_pages)) { + if (zswap_never_enabled()) { + /* skip swapcache */ + folio =3D alloc_swap_folio(vmf); + if (folio) { + __folio_set_locked(folio); + __folio_set_swapbacked(folio); + + nr_pages =3D folio_nr_pages(folio); + if (folio_test_large(folio)) + entry.val =3D ALIGN_DOWN(entry.val, nr_pages); /* - * Relax a bit to prevent rapid - * repeated page faults. + * Prevent parallel swapin from proceeding with + * the cache flag. Otherwise, another thread + * may finish swapin first, free the entry, and + * swapout reusing the same entry. It's + * undetectable as pte_same() returns true due + * to entry reuse. 
*/ - add_wait_queue(&swapcache_wq, &wait); - schedule_timeout_uninterruptible(1); - remove_wait_queue(&swapcache_wq, &wait); - goto out_page; + if (swapcache_prepare(entry, nr_pages)) { + /* + * Relax a bit to prevent rapid + * repeated page faults. + */ + add_wait_queue(&swapcache_wq, &wait); + schedule_timeout_uninterruptible(1); + remove_wait_queue(&swapcache_wq, &wait); + goto out_page; + } + need_clear_cache =3D true; + + mem_cgroup_swapin_uncharge_swap(entry, nr_pages); + + shadow =3D get_shadow_from_swap_cache(entry); + if (shadow) + workingset_refault(folio, shadow); + + folio_add_lru(folio); + + /* To provide entry to swap_read_folio() */ + folio->swap =3D entry; + swap_read_folio(folio, NULL, NULL, NULL); + folio->private =3D NULL; + } + } else { + /* + * zswap is enabled or was enabled at some point. + * Don't skip swapcache. + * + * swapin readahead based batching interface + * for zswap batched loads using IAA: + * + * Readahead is invoked in this path only if + * the sys swap "singlemapped_ra_enabled" swap + * parameter is set to true. By default, + * "singlemapped_ra_enabled" is set to false, + * the recommended setting for software compressors. + * For IAA, if "singlemapped_ra_enabled" is set + * to true, readahead will be deployed in this path + * as well. + * + * For single-mapped pages, the batching interface + * calls __read_swap_cache_async() to allocate and + * place the faulting page in the swapcache. This is + * to handle a scenario where the faulting page in + * this process happens to simultaneously be a + * readahead page in another process. By placing the + * single-mapped faulting page in the swapcache, + * we avoid race conditions and duplicate page + * allocations under these scenarios. + */ + folio =3D swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, + vmf, true); + if (!folio) { + ret =3D VM_FAULT_OOM; + goto out; } - need_clear_cache =3D true; - - mem_cgroup_swapin_uncharge_swap(entry, nr_pages); - - shadow =3D get_shadow_from_swap_cache(entry); - if (shadow) - workingset_refault(folio, shadow); - - folio_add_lru(folio); =20 - /* To provide entry to swap_read_folio() */ - folio->swap =3D entry; - swap_read_folio(folio, NULL, NULL, NULL); - folio->private =3D NULL; - } + single_mapped_swapcache =3D true; + nr_pages =3D folio_nr_pages(folio); + swapcache =3D folio; + } /* swapin with zswap support. */ } else { folio =3D swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, - vmf); + vmf, false); swapcache =3D folio; } =20 @@ -4528,8 +4604,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) * yet. */ swap_free_nr(entry, nr_pages); - if (should_try_to_free_swap(folio, vma, vmf->flags)) + if (should_try_to_free_swap(folio, vma, vmf->flags)) { folio_free_swap(folio); + single_mapped_swapcache =3D false; + } =20 add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages); add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages); @@ -4619,6 +4697,30 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (waitqueue_active(&swapcache_wq)) wake_up(&swapcache_wq); } + + /* + * swapin readahead based batching interface + * for zswap batched loads using IAA: + * + * Don't skip swapcache strategy for single-mapped + * pages: As described above, we place the + * single-mapped faulting page in the swapcache, + * to avoid race conditions and duplicate page + * allocations between process 1 handling a + * page-fault for a single-mapped page, while + * simultaneously, the same swap entry is a + * readahead prefetch page in another process 2. 
+ * + * One side-effect of this, is that if the race did + * not occur, we need to clean up the swapcache + * entry and free the zswap entry for the faulting + * page, iff it is still single-mapped and is + * exclusive to this process. + */ + if (single_mapped_swapcache && + data_race(should_free_singlemap_swapcache(entry, folio))) + folio_free_swap(folio); + if (si) put_swap_device(si); return ret; @@ -4638,6 +4740,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (waitqueue_active(&swapcache_wq)) wake_up(&swapcache_wq); } + + if (single_mapped_swapcache && + data_race(should_free_singlemap_swapcache(entry, folio))) + folio_free_swap(folio); + if (si) put_swap_device(si); return ret; diff --git a/mm/shmem.c b/mm/shmem.c index 66eae800ffab..e4549c04f316 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1624,7 +1624,7 @@ static struct folio *shmem_swapin_cluster(swp_entry_t= swap, gfp_t gfp, struct folio *folio; =20 mpol =3D shmem_get_pgoff_policy(info, index, 0, &ilx); - folio =3D swap_cluster_readahead(swap, gfp, mpol, ilx); + folio =3D swap_cluster_readahead(swap, gfp, mpol, ilx, false); mpol_cond_put(mpol); =20 return folio; diff --git a/mm/swap.h b/mm/swap.h index 2b82c8ed765c..2861bd8f5a96 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -199,9 +199,11 @@ struct folio *__read_swap_cache_async(swp_entry_t entr= y, gfp_t gfp_flags, struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated, bool skip_if_exists); struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag, - struct mempolicy *mpol, pgoff_t ilx); + struct mempolicy *mpol, pgoff_t ilx, + bool single_mapped_path); struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag, - struct vm_fault *vmf); + struct vm_fault *vmf, + bool single_mapped_path); =20 static inline unsigned int folio_swap_flags(struct folio *folio) { @@ -304,13 +306,15 @@ static inline void show_swap_cache_info(void) } =20 static inline struct folio *swap_cluster_readahead(swp_entry_t entry, - gfp_t gfp_mask, struct mempolicy *mpol, pgoff_t ilx) + gfp_t gfp_mask, struct mempolicy *mpol, pgoff_t ilx, + bool single_mapped_path) { return NULL; } =20 static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_ma= sk, - struct vm_fault *vmf) + struct vm_fault *vmf, + bool single_mapped_path) { return NULL; } diff --git a/mm/swap_state.c b/mm/swap_state.c index 0aa938e4c34d..66ea8f7f724c 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -44,6 +44,12 @@ struct address_space *swapper_spaces[MAX_SWAPFILES] __re= ad_mostly; static unsigned int nr_swapper_spaces[MAX_SWAPFILES] __read_mostly; static bool enable_vma_readahead __read_mostly =3D true; =20 +/* + * Enable readahead in single-mapped do_swap_page() path. + * Set to "true" for IAA. + */ +static bool enable_singlemapped_readahead __read_mostly =3D false; + #define SWAP_RA_WIN_SHIFT (PAGE_SHIFT / 2) #define SWAP_RA_HITS_MASK ((1UL << SWAP_RA_WIN_SHIFT) - 1) #define SWAP_RA_HITS_MAX SWAP_RA_HITS_MASK @@ -340,6 +346,11 @@ static inline bool swap_use_vma_readahead(void) return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap); } =20 +static inline bool swap_use_singlemapped_readahead(void) +{ + return READ_ONCE(enable_singlemapped_readahead); +} + /* * Lookup a swap entry in the swap cache. 
A found folio will be returned * unlocked and with its refcount incremented - we rely on the kernel @@ -635,12 +646,49 @@ static unsigned long swapin_nr_pages(unsigned long of= fset) return pages; } =20 +static void process_ra_batch_of_same_type( + struct zswap_decomp_batch *zswap_batch, + struct folio_batch *non_zswap_batch, + swp_entry_t targ_entry, + struct swap_iocb **splug) +{ + unsigned int i; + + for (i =3D 0; i < folio_batch_count(non_zswap_batch); ++i) { + struct folio *folio =3D non_zswap_batch->folios[i]; + swap_read_folio(folio, splug, NULL, NULL); + if (folio->swap.val !=3D targ_entry.val) { + folio_set_readahead(folio); + count_vm_event(SWAP_RA); + } + folio_put(folio); + } + + swap_read_zswap_batch_unplug(zswap_batch, splug); + + for (i =3D 0; i < folio_batch_count(&zswap_batch->fbatch); ++i) { + struct folio *folio =3D zswap_batch->fbatch.folios[i]; + if (folio->swap.val !=3D targ_entry.val) { + folio_set_readahead(folio); + count_vm_event(SWAP_RA); + } + folio_put(folio); + } + + folio_batch_reinit(non_zswap_batch); + + zswap_load_batch_reinit(zswap_batch); +} + /** * swap_cluster_readahead - swap in pages in hope we need them soon * @entry: swap entry of this memory * @gfp_mask: memory allocation flags * @mpol: NUMA memory allocation policy to be applied * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE + * @single_mapped_path: Called from do_swap_page() single-mapped path. + * Only readahead if the sys "singlemapped_ra_enabled" swap parameter + * is set to true. * * Returns the struct folio for entry and addr, after queueing swapin. * @@ -654,7 +702,8 @@ static unsigned long swapin_nr_pages(unsigned long offs= et) * are fairly likely to have been swapped out from the same node. */ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask, - struct mempolicy *mpol, pgoff_t ilx) + struct mempolicy *mpol, pgoff_t ilx, + bool single_mapped_path) { struct folio *folio; unsigned long entry_offset =3D swp_offset(entry); @@ -664,12 +713,22 @@ struct folio *swap_cluster_readahead(swp_entry_t entr= y, gfp_t gfp_mask, struct swap_info_struct *si =3D swp_swap_info(entry); struct blk_plug plug; struct swap_iocb *splug =3D NULL; + struct zswap_decomp_batch zswap_batch; + struct folio_batch non_zswap_batch; bool page_allocated; =20 + if (single_mapped_path && + (!swap_use_singlemapped_readahead() || + !zswap_load_batching_enabled())) + goto skip; + mask =3D swapin_nr_pages(offset) - 1; if (!mask) goto skip; =20 + zswap_load_batch_init(&zswap_batch); + folio_batch_init(&non_zswap_batch); + /* Read a page_cluster sized and aligned cluster around offset. */ start_offset =3D offset & ~mask; end_offset =3D offset | mask; @@ -678,6 +737,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry,= gfp_t gfp_mask, if (end_offset >=3D si->max) end_offset =3D si->max - 1; =20 + /* Note that all swap entries readahead are of the same swap type. 
*/ blk_start_plug(&plug); for (offset =3D start_offset; offset <=3D end_offset ; offset++) { /* Ok, do the async read-ahead now */ @@ -687,14 +747,22 @@ struct folio *swap_cluster_readahead(swp_entry_t entr= y, gfp_t gfp_mask, if (!folio) continue; if (page_allocated) { - swap_read_folio(folio, &splug, NULL, NULL); - if (offset !=3D entry_offset) { - folio_set_readahead(folio); - count_vm_event(SWAP_RA); + if (swap_read_folio(folio, &splug, + &zswap_batch, &non_zswap_batch)) { + if (offset !=3D entry_offset) { + folio_set_readahead(folio); + count_vm_event(SWAP_RA); + } + folio_put(folio); } + } else { + folio_put(folio); } - folio_put(folio); } + + process_ra_batch_of_same_type(&zswap_batch, &non_zswap_batch, + entry, &splug); + blk_finish_plug(&plug); swap_read_unplug(splug); lru_add_drain(); /* Push any new pages onto the LRU now */ @@ -1009,6 +1077,9 @@ static int swap_vma_ra_win(struct vm_fault *vmf, unsi= gned long *start, * @mpol: NUMA memory allocation policy to be applied * @targ_ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE * @vmf: fault information + * @single_mapped_path: Called from do_swap_page() single-mapped path. + * Only readahead if the sys "singlemapped_ra_enabled" swap parameter + * is set to true. * * Returns the struct folio for entry and addr, after queueing swapin. * @@ -1019,10 +1090,14 @@ static int swap_vma_ra_win(struct vm_fault *vmf, un= signed long *start, * */ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_= mask, - struct mempolicy *mpol, pgoff_t targ_ilx, struct vm_fault *vmf) + struct mempolicy *mpol, pgoff_t targ_ilx, struct vm_fault *vmf, + bool single_mapped_path) { struct blk_plug plug; struct swap_iocb *splug =3D NULL; + struct zswap_decomp_batch zswap_batch; + struct folio_batch non_zswap_batch; + int type =3D -1, prev_type =3D -1; struct folio *folio; pte_t *pte =3D NULL, pentry; int win; @@ -1031,10 +1106,18 @@ static struct folio *swap_vma_readahead(swp_entry_t= targ_entry, gfp_t gfp_mask, pgoff_t ilx; bool page_allocated; =20 + if (single_mapped_path && + (!swap_use_singlemapped_readahead() || + !zswap_load_batching_enabled())) + goto skip; + win =3D swap_vma_ra_win(vmf, &start, &end); if (win =3D=3D 1) goto skip; =20 + zswap_load_batch_init(&zswap_batch); + folio_batch_init(&non_zswap_batch); + ilx =3D targ_ilx - PFN_DOWN(vmf->address - start); =20 blk_start_plug(&plug); @@ -1057,16 +1140,38 @@ static struct folio *swap_vma_readahead(swp_entry_t= targ_entry, gfp_t gfp_mask, if (!folio) continue; if (page_allocated) { - swap_read_folio(folio, &splug, NULL, NULL); - if (addr !=3D vmf->address) { - folio_set_readahead(folio); - count_vm_event(SWAP_RA); + type =3D swp_type(entry); + + /* + * Process this sub-batch before switching to + * another swap device type. 
+ */ + if ((prev_type >=3D 0) && (type !=3D prev_type)) + process_ra_batch_of_same_type(&zswap_batch, + &non_zswap_batch, + targ_entry, + &splug); + + if (swap_read_folio(folio, &splug, + &zswap_batch, &non_zswap_batch)) { + if (addr !=3D vmf->address) { + folio_set_readahead(folio); + count_vm_event(SWAP_RA); + } + folio_put(folio); } + + prev_type =3D type; + } else { + folio_put(folio); } - folio_put(folio); } if (pte) pte_unmap(pte); + + process_ra_batch_of_same_type(&zswap_batch, &non_zswap_batch, + targ_entry, &splug); + blk_finish_plug(&plug); swap_read_unplug(splug); lru_add_drain(); @@ -1092,7 +1197,7 @@ static struct folio *swap_vma_readahead(swp_entry_t t= arg_entry, gfp_t gfp_mask, * or vma-based(ie, virtual address based on faulty address) readahead. */ struct folio *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask, - struct vm_fault *vmf) + struct vm_fault *vmf, bool single_mapped_path) { struct mempolicy *mpol; pgoff_t ilx; @@ -1100,8 +1205,10 @@ struct folio *swapin_readahead(swp_entry_t entry, gf= p_t gfp_mask, =20 mpol =3D get_vma_policy(vmf->vma, vmf->address, 0, &ilx); folio =3D swap_use_vma_readahead() ? - swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf) : - swap_cluster_readahead(entry, gfp_mask, mpol, ilx); + swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf, + single_mapped_path) : + swap_cluster_readahead(entry, gfp_mask, mpol, ilx, + single_mapped_path); mpol_cond_put(mpol); =20 return folio; @@ -1126,10 +1233,30 @@ static ssize_t vma_ra_enabled_store(struct kobject = *kobj, =20 return count; } +static ssize_t singlemapped_ra_enabled_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "%s\n", + enable_singlemapped_readahead ? "true" : "false"); +} +static ssize_t singlemapped_ra_enabled_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + ssize_t ret; + + ret =3D kstrtobool(buf, &enable_singlemapped_readahead); + if (ret) + return ret; + + return count; +} static struct kobj_attribute vma_ra_enabled_attr =3D __ATTR_RW(vma_ra_enab= led); +static struct kobj_attribute singlemapped_ra_enabled_attr =3D __ATTR_RW(si= nglemapped_ra_enabled); =20 static struct attribute *swap_attrs[] =3D { &vma_ra_enabled_attr.attr, + &singlemapped_ra_enabled_attr.attr, NULL, }; =20 diff --git a/mm/swapfile.c b/mm/swapfile.c index b0915f3fab31..10367eaee1ff 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2197,7 +2197,7 @@ static int unuse_pte_range(struct vm_area_struct *vma= , pmd_t *pmd, }; =20 folio =3D swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, - &vmf); + &vmf, false); } if (!folio) { swp_count =3D READ_ONCE(si->swap_map[offset]); --=20 2.27.0 From nobody Tue Nov 26 10:22:36 2024 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.14]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B760317DFEB for ; Fri, 18 Oct 2024 06:48:13 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.14 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1729234095; cv=none; b=Q1Qi4MTAvsBW8gkx4NnvSYVqH1WouljpEK7UnGFdg6/18uE3k1IqG0LgN1YOEOq5UnwxOrGUf+/W5M1W1++/V6+YeBzxvIMzthHTet5rTX9VcC0AVv+g6TtH+tZ7r5jRMlrsyvqta5cXDe6t55nLRc7GYhmHdbIjMIjPcYURzfE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1729234095; c=relaxed/simple; bh=CzMrEaQ3oHH5UP19JYpptn44NLCDei/++V3CWVkaTRg=; 
h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=K865BvJyOB1yqueuNkgYgdOHjc8/g2bnTnQUZJlefYGnlwJdSSaKFqZvwP17UGbev1C14LOPoq7N58zAfUk16SyZ48KiqaC1NUihxMYDKXe2evv1kPJaVFFbm9Lr2SLQ9OHbvc0kJb+LOgDyCwql+wHVukPZa5GxBqzszsV8+b0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=fX7xmc6w; arc=none smtp.client-ip=192.198.163.14 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="fX7xmc6w" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1729234094; x=1760770094; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=CzMrEaQ3oHH5UP19JYpptn44NLCDei/++V3CWVkaTRg=; b=fX7xmc6w8TSMoB/YcemWGNcSb30FK0VTFmptc2PgssqI2SDetauMWBGi 73yIETF1KIzZu3NhNPwQhcKOGD39GzMpn5+hrmnnm9RWgVPPANUwmivvr vHalrmgb219RdJcIcqC0jswtOSX9+iE1YBspJEcWW2bXGXjii+SdtITbo AAvtTYcx2ABCe134+xxLK5VtAe9SyBwzeZ6Ay8GXg2Ey6t/8z3MTdWdVl IuhiJ5XafQpbXLNdl/z7vxnKowKFjPUBfzHf9l/ZYxLNQyL+WxF/waVQd BRjDmTWLMQY4s7EUI6MilmeFMfa4LSJEozeHd+y+9ASkP+to/HYMzy61R g==; X-CSE-ConnectionGUID: JISYTmxUQvO7u3R+X0U+dg== X-CSE-MsgGUID: hteHiFgWRh64W3//EuLCrA== X-IronPort-AV: E=McAfee;i="6700,10204,11228"; a="28963386" X-IronPort-AV: E=Sophos;i="6.11,212,1725346800"; d="scan'208";a="28963386" Received: from fmviesa003.fm.intel.com ([10.60.135.143]) by fmvoesa108.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 17 Oct 2024 23:48:10 -0700 X-CSE-ConnectionGUID: IJqafArETn+8qdBvF+ZmKQ== X-CSE-MsgGUID: heBzjw1nQkOgWyQXAlD5Kg== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.11,212,1725346800"; d="scan'208";a="82744533" Received: from jf5300-b11a338t.jf.intel.com ([10.242.51.6]) by fmviesa003.fm.intel.com with ESMTP; 17 Oct 2024 23:48:10 -0700 From: Kanchana P Sridhar To: linux-kernel@vger.kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org, yosryahmed@google.com, nphamcs@gmail.com, chengming.zhou@linux.dev, usamaarif642@gmail.com, ryan.roberts@arm.com, ying.huang@intel.com, 21cnbao@gmail.com, akpm@linux-foundation.org, hughd@google.com, willy@infradead.org, bfoster@redhat.com, dchinner@redhat.com, chrisl@kernel.org, david@redhat.com Cc: wajdi.k.feghali@intel.com, vinodh.gopal@intel.com, kanchana.p.sridhar@intel.com Subject: [RFC PATCH v1 7/7] mm: For IAA batching, reduce SWAP_BATCH and modify swap slot cache thresholds. Date: Thu, 17 Oct 2024 23:48:05 -0700 Message-Id: <20241018064805.336490-8-kanchana.p.sridhar@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20241018064805.336490-1-kanchana.p.sridhar@intel.com> References: <20241018064805.336490-1-kanchana.p.sridhar@intel.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" When IAA is used for compress batching and decompress batching of folios, we significantly reduce the swapout-swapin path latencies, such that swap page-faults' latencies are reduced. This means swap entries will need to be freed more often, and swap slots will have to be released more often. 
The existing SWAP_BATCH and SWAP_SLOTS_CACHE_SIZE value of 64 can cause contention on the swap_info_struct lock in swapcache_free_entries(), and cpu hardlockups can result in highly contended server scenarios. To prevent this, SWAP_BATCH and SWAP_SLOTS_CACHE_SIZE have been reduced to 16 if IAA is used for compress/decompress batching. The swap_slots_cache activate/deactivate thresholds have been modified accordingly, so that we don't compromise performance for stability. Signed-off-by: Kanchana P Sridhar --- include/linux/swap.h | 7 +++++++ include/linux/swap_slots.h | 7 +++++++ 2 files changed, 14 insertions(+) diff --git a/include/linux/swap.h b/include/linux/swap.h index ca533b478c21..3987faa0a2ff 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -13,6 +13,7 @@ #include #include #include +#include #include #include =20 @@ -32,7 +33,13 @@ struct pagevec; #define SWAP_FLAGS_VALID (SWAP_FLAG_PRIO_MASK | SWAP_FLAG_PREFER | \ SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_ONCE | \ SWAP_FLAG_DISCARD_PAGES) + +#if defined(CONFIG_ZSWAP_STORE_BATCHING_ENABLED) || \ + defined(CONFIG_ZSWAP_LOAD_BATCHING_ENABLED) +#define SWAP_BATCH 16 +#else #define SWAP_BATCH 64 +#endif =20 static inline int current_is_kswapd(void) { diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h index 15adfb8c813a..1b6e4e2798bd 100644 --- a/include/linux/swap_slots.h +++ b/include/linux/swap_slots.h @@ -7,8 +7,15 @@ #include =20 #define SWAP_SLOTS_CACHE_SIZE SWAP_BATCH + +#if defined(CONFIG_ZSWAP_STORE_BATCHING_ENABLED) || \ + defined(CONFIG_ZSWAP_LOAD_BATCHING_ENABLED) +#define THRESHOLD_ACTIVATE_SWAP_SLOTS_CACHE (40*SWAP_SLOTS_CACHE_SIZE) +#define THRESHOLD_DEACTIVATE_SWAP_SLOTS_CACHE (16*SWAP_SLOTS_CACHE_SIZE) +#else #define THRESHOLD_ACTIVATE_SWAP_SLOTS_CACHE (5*SWAP_SLOTS_CACHE_SIZE) #define THRESHOLD_DEACTIVATE_SWAP_SLOTS_CACHE (2*SWAP_SLOTS_CACHE_SIZE) +#endif =20 struct swap_slots_cache { bool lock_initialized; --=20 2.27.0
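A quick arithmetic check of the definitions above: with batching enabled, SWAP_SLOTS_CACHE_SIZE is 16, so THRESHOLD_ACTIVATE_SWAP_SLOTS_CACHE works out to 40 * 16 = 640 and THRESHOLD_DEACTIVATE_SWAP_SLOTS_CACHE to 16 * 16 = 256, whereas the existing defaults give 5 * 64 = 320 and 2 * 64 = 128 with SWAP_SLOTS_CACHE_SIZE of 64. In other words, the batched configuration frees swap slots in smaller batches while raising both slots-cache thresholds.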