> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Thursday, August 14, 2025 10:28 PM
> To: Nhat Pham <nphamcs@gmail.com>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> yosry.ahmed@linux.dev; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v11 00/24] zswap compression batching with optimized
> iaa_crypto driver
>
> On Fri, Aug 08, 2025 at 04:51:14PM -0700, Nhat Pham wrote:
> >
> > Can we get some comments from crypto tree maintainers as well? I feel
> > like this patch series is more crypto patch than zswap patch, at this
> > point.
> >
> > Can we land any zswap parts without the crypto API change? Grasping at
> > straws here, in case we can parallelize the reviewing and merging
> > process.
>
> My preference is for a unified interface that caters to both
> software compression as well as parallel hardware compression.
>
> The reason is that there is clear advantage in passing a large
> batch of pages to the Crypto API even for software compression,
> the least we could do is to pack the compressed result together
> and avoid the unnecessary copying of the compressed output that
> is currently done in zswap.
>
> However, since you guys are both happy with this patch-set,
> I'm not going to stand in the way.
>
> But I do want some changes made to the proposed Crypto API interface
> so that it can be reused for IPComp.
>
> In particular, instead of passing an opaque pointer (kernel_data)
> to magically turn on batching, please add a new helper that enables
> batching.
>
> I don't think we need any extra fields in struct acomp_req apart
> from a new field called unit_size. This would be 4096 for zswap,
> it could be the MTU for IPsec.
>
> So add something like this and document that it must be called
> after acomp_request_set_callback() (which should set unit_size to 0):
>
> static inline void acomp_request_set_unit_size(struct acomp_req *req,
>                                                unsigned int du)
> {
> 	req->unit = du;
> }
>
> static inline void acomp_request_set_callback(struct acomp_req *req, ...)
> {
> 	...
> +	req->unit = 0;
> }
>
> For the source, nothing needs to be done because the folio could
> be passed in as is.
>
> For the destination, construct an SG list for them and pass that in.
> The rule should be that the SG list must contain a sufficient number
> of pages for the compression output based on the given unit size.
>
> For the output lengths, just set the lengths in the destination
> SG list after compression. If a page is incompressible (including
> an error), just set the length to a negative value (-ENOSPC could
> be used for incompressible input, as we already do). Even though
> struct scatterlist->length is unsigned, there should be no issue
> with storing a negative value there.
Hi Herbert, Nhat,

Thanks Herbert for these suggestions! I have implemented the new crypto API
and the SG list suggestion. While doing so, I was also able to consolidate the
new scatterlist-based zswap_compress() implementation for software and hardware
(i.e., batching) compressors, within the constraints of not changing anything
below the crypto layer for software compressors.

I wanted to provide some additional details so that you can review the overall
approach and let me know if things look OK. I will rebase the code to the latest
mm-unstable and start working on v12 in the meantime.
1) The zswap per-CPU acomp_ctx has two sg_tables added, one each for
inputs/outputs, with nents set to the pool->compr_batch_size (1 for software
compressors). This per-CPU data incurs additional memory overhead per-CPU,
however this is memory that will anyway be allocated on the stack in
zswap_compress(); and less memory overhead than the latter because we know
exactly how many sg_table scatterlists to allocate for the given pool
(assuming we don't kmalloc in zswap_compress()). I will make sure to quantify
the overhead in v12's commit logs.
2) I added new sg_alloc_table_node() and sg_kmalloc_node() to facilitate this.
3) I added the acomp_request_set_unit_size() helper for
batching; initialized the unit_size to 0 in
acomp_request_set_callback(). zswap_cpu_comp_prepare() will set the unit_size
to PAGE_SIZE after the call to acomp_request_set_callback().
4) Unified code in zswap_compress() for software and hardware compressors to use
   the per-CPU SG lists. Some unavoidable changes were needed in the software
   path, which uses acomp_req->dlen instead of the SG list output length.
5) A trade-off I had to make in the iaa_crypto driver to adhere to the new SG
   list based batching architecture:
   Currently, all calls to dma_map_sg() in iaa_crypto_main.c use
   sg_nents(req->src[dst]). This requires that the sg_init_marker() be set
   correctly based on the number of pages in the batch, which in turn requires
   sg_unmark_end() to be called to clear the termination marker before
   returning. All of this adds latency to zswap_compress() (i.e., per batch
   compress call) with the new approach and causes a regression wrt v11.
   To make the new approach functional and performant, I have changed all
   the calls to dma_map_sg() to use an nents of 1. This should not be a concern,
   since it eliminates the redundant work of scanning an SG list with only one
   scatterlist for existing kernel users, i.e., zswap with the zswap_compress()
   modifications described in (4). This will continue to hold true with the zram
   IAA batching support I am developing. There are no kernel use cases of the
   iaa_crypto driver that will break this assumption.
6) "For the source, nothing needs to be done because the folio could be passed
in as is.". As far as I know, this cannot be accomplished without
modifications to the crypto API for software compressors, because compressed
buffers need to be stored in the zswap/zram zs_pools at PAGE_SIZE
granularity.
I have validated the above v12 changes, applied over the v11 patch-series, on
Sapphire Rapids:

1) usemem30: Both zstd and deflate-iaa have performance comparable to v11.

2) kernel_compilation test: Mostly better performance than baseline, but worse
   than v11. Slightly worse sys time than baseline for zstd/PMD.
usemem30 with 64K folios:
=========================
zswap shrinker_enabled = N.
-----------------------------------------------------------------------
mm-unstable-7-30-2025 v11 v12
-----------------------------------------------------------------------
zswap compressor deflate-iaa deflate-iaa deflate-iaa
-----------------------------------------------------------------------
Total throughput (KB/s) 7,153,359 10,856,388 10,714,236
Avg throughput (KB/s) 238,445 361,879 357,141
elapsed time (sec) 92.61 70.50 68.87
sys time (sec) 2,193.59 1,675.32 1,613.11
-----------------------------------------------------------------------
-----------------------------------------------------------------------
mm-unstable-7-30-2025 v11 v12
-----------------------------------------------------------------------
zswap compressor zstd zstd zstd
-----------------------------------------------------------------------
Total throughput (KB/s) 6,866,411 6,874,244 6,922,818
Avg throughput (KB/s) 228,880 229,141 230,760
elapsed time (sec) 96.45 89.05 87.75
sys time (sec) 2,410.72 2,150.63 2,090.86
-----------------------------------------------------------------------
usemem30 with 2M folios:
========================
zswap shrinker_enabled = N.
-----------------------------------------------------------------------
mm-unstable-7-30-2025 v11 v12
-----------------------------------------------------------------------
zswap compressor deflate-iaa deflate-iaa deflate-iaa
-----------------------------------------------------------------------
Total throughput (KB/s) 7,268,929 11,312,195 10,943,491
Avg throughput (KB/s) 242,297 377,073 364,783
elapsed time (sec) 80.40 68.73 69.19
sys time (sec) 1,856.54 1,599.25 1,618.08
-----------------------------------------------------------------------
-----------------------------------------------------------------------
mm-unstable-7-30-2025 v11 v12
-----------------------------------------------------------------------
zswap compressor zstd zstd zstd
-----------------------------------------------------------------------
Total throughput (KB/s) 7,560,441 7,627,155 7,600,588
Avg throughput (KB/s) 252,014 254,238 253,352
elapsed time (sec) 88.89 83.22 87.55
sys time (sec) 2,132.05 1,952.98 2,079.26
-----------------------------------------------------------------------
kernel_compilation with 64K folios:
===================================
zswap shrinker_enabled = Y.
--------------------------------------------------------------------------
mm-unstable-7-30-2025 v11 v12
--------------------------------------------------------------------------
zswap compressor deflate-iaa deflate-iaa deflate-iaa
--------------------------------------------------------------------------
real_sec 901.81 840.60 895.94
sys_sec 2,672.93 2,171.17 2,584.04
zswpout 34,700,692 24,076,095 37,725,671
zswap_written_back_pages 2,612,474 1,451,961 3,050,557
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mm-unstable-7-30-2025 v11 v12
--------------------------------------------------------------------------
zswap compressor zstd zstd zstd
--------------------------------------------------------------------------
real_sec 882.67 837.21 872.98
sys_sec 3,573.31 2,593.94 3,301.67
zswpout 42,768,967 22,660,215 36,810,396
zswap_written_back_pages 2,109,739 727,919 1,475,480
--------------------------------------------------------------------------
kernel_compilation with PMD folios:
===================================
zswap shrinker_enabled = Y.
--------------------------------------------------------------------------
mm-unstable-7-30-2025 v11 v12
--------------------------------------------------------------------------
zswap compressor deflate-iaa deflate-iaa deflate-iaa
--------------------------------------------------------------------------
real_sec 838.76 804.83 826.05
sys_sec 3,173.57 2,422.63 3,128.11
zswpout 59,544,198 38,093,995 60,072,119
zswap_written_back_pages 2,726,367 929,614 2,324,707
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mm-unstable-7-30-2025 v11 v12
--------------------------------------------------------------------------
zswap compressor zstd zstd zstd
--------------------------------------------------------------------------
real_sec 831.09 813.40 827.84
sys_sec 4,251.11 3,053.95 4,406.65
zswpout 59,452,638 35,832,407 63,459,471
zswap_written_back_pages 1,041,721 423,334 1,162,913
--------------------------------------------------------------------------
I am still in the process of verifying whether modifying zswap_decompress() to
use the per-CPU SG lists improves kernel_compilation, but I thought this would
be a good sync point to get your thoughts.
I would greatly appreciate your comments on the approach and trade-offs, and
guidance on how to proceed.
"v12" zswap.c diff wrt v11:
===========================
diff --git a/mm/zswap.c b/mm/zswap.c
index c30c1f325f57..58ad257e87e8 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -152,6 +152,8 @@ struct crypto_acomp_ctx {
struct acomp_req *req;
struct crypto_wait wait;
u8 **buffers;
+ struct sg_table *sg_inputs;
+ struct sg_table *sg_outputs;
struct mutex mutex;
bool is_sleepable;
};
@@ -282,6 +284,16 @@ static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx, u8 nr_buffers)
kfree(acomp_ctx->buffers[i]);
kfree(acomp_ctx->buffers);
}
+
+ if (acomp_ctx->sg_inputs) {
+ sg_free_table(acomp_ctx->sg_inputs);
+ acomp_ctx->sg_inputs = NULL;
+ }
+
+ if (acomp_ctx->sg_outputs) {
+ sg_free_table(acomp_ctx->sg_outputs);
+ acomp_ctx->sg_outputs = NULL;
+ }
}
static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
@@ -922,6 +934,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
{
struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
+ int cpu_node = cpu_to_node(cpu);
int ret = -ENOMEM;
u8 i;
@@ -936,7 +949,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
return 0;
- acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
+ acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_node);
if (IS_ERR_OR_NULL(acomp_ctx->acomp)) {
pr_err("could not alloc crypto acomp %s : %ld\n",
pool->tfm_name, PTR_ERR(acomp_ctx->acomp));
@@ -960,13 +973,13 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
crypto_acomp_batch_size(acomp_ctx->acomp));
acomp_ctx->buffers = kcalloc_node(pool->compr_batch_size, sizeof(u8 *),
- GFP_KERNEL, cpu_to_node(cpu));
+ GFP_KERNEL, cpu_node);
if (!acomp_ctx->buffers)
goto fail;
for (i = 0; i < pool->compr_batch_size; ++i) {
acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL,
- cpu_to_node(cpu));
+ cpu_node);
if (!acomp_ctx->buffers[i])
goto fail;
}
@@ -981,6 +994,26 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
acomp_request_set_callback(acomp_ctx->req, CRYPTO_TFM_REQ_MAY_BACKLOG,
crypto_req_done, &acomp_ctx->wait);
+ acomp_request_set_unit_size(acomp_ctx->req, PAGE_SIZE);
+
+ acomp_ctx->sg_inputs = kmalloc_node(sizeof(*acomp_ctx->sg_inputs),
+ GFP_KERNEL, cpu_node);
+ if (!acomp_ctx->sg_inputs)
+ goto fail;
+
+ if (sg_alloc_table_node(acomp_ctx->sg_inputs, pool->compr_batch_size,
+ GFP_KERNEL, cpu_node))
+ goto fail;
+
+ acomp_ctx->sg_outputs = kmalloc_node(sizeof(*acomp_ctx->sg_outputs),
+ GFP_KERNEL, cpu_node);
+ if (!acomp_ctx->sg_outputs)
+ goto fail;
+
+ if (sg_alloc_table_node(acomp_ctx->sg_outputs, pool->compr_batch_size,
+ GFP_KERNEL, cpu_node))
+ goto fail;
+
mutex_init(&acomp_ctx->mutex);
return 0;
@@ -1027,17 +1060,14 @@ static bool zswap_compress(struct folio *folio, long start, unsigned int nr_page
struct zswap_entry *entries[], struct zswap_pool *pool,
int node_id)
{
+ unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
+ unsigned int dlens[ZSWAP_MAX_BATCH_SIZE];
struct crypto_acomp_ctx *acomp_ctx;
- struct scatterlist input, output;
struct zpool *zpool = pool->zpool;
-
- unsigned int dlens[ZSWAP_MAX_BATCH_SIZE];
- int errors[ZSWAP_MAX_BATCH_SIZE];
-
- unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
- unsigned int i, j;
- int err;
+ struct scatterlist *sg;
+ unsigned int i, j, k;
gfp_t gfp;
+ int err;
gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
@@ -1045,59 +1075,58 @@ static bool zswap_compress(struct folio *folio, long start, unsigned int nr_page
mutex_lock(&acomp_ctx->mutex);
+ prefetchw(acomp_ctx->sg_inputs->sgl);
+ prefetchw(acomp_ctx->sg_outputs->sgl);
+
/*
* Note:
* [i] refers to the incoming batch space and is used to
- * index into the folio pages, @entries and @errors.
+ * index into the folio pages and @entries.
+ *
+ * [k] refers to the @acomp_ctx space, as determined by
+ * @pool->compr_batch_size, and is used to index into
+ * @acomp_ctx->buffers and @dlens.
*/
for (i = 0; i < nr_pages; i += nr_comps) {
- if (nr_comps == 1) {
- sg_init_table(&input, 1);
- sg_set_page(&input, folio_page(folio, start + i), PAGE_SIZE, 0);
+ for_each_sg(acomp_ctx->sg_inputs->sgl, sg, nr_comps, k)
+ sg_set_page(sg, folio_page(folio, start + k + i), PAGE_SIZE, 0);
- /*
- * We need PAGE_SIZE * 2 here since there maybe over-compression case,
- * and hardware-accelerators may won't check the dst buffer size, so
- * giving the dst buffer with enough length to avoid buffer overflow.
- */
- sg_init_one(&output, acomp_ctx->buffers[0], PAGE_SIZE * 2);
- acomp_request_set_params(acomp_ctx->req, &input,
- &output, PAGE_SIZE, PAGE_SIZE);
-
- errors[i] = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
- &acomp_ctx->wait);
- if (unlikely(errors[i]))
- goto compress_error;
-
- dlens[i] = acomp_ctx->req->dlen;
- } else {
- struct page *pages[ZSWAP_MAX_BATCH_SIZE];
- unsigned int k;
-
- for (k = 0; k < nr_pages; ++k)
- pages[k] = folio_page(folio, start + k);
-
- struct swap_batch_comp_data batch_comp_data = {
- .pages = pages,
- .dsts = acomp_ctx->buffers,
- .dlens = dlens,
- .errors = errors,
- .nr_comps = nr_pages,
- };
-
- acomp_ctx->req->kernel_data = &batch_comp_data;
-
- if (unlikely(crypto_acomp_compress(acomp_ctx->req)))
- goto compress_error;
+ /*
+ * We need PAGE_SIZE * 2 here since there maybe over-compression case,
+ * and hardware-accelerators may won't check the dst buffer size, so
+ * giving the dst buffer with enough length to avoid buffer overflow.
+ */
+ for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k)
+ sg_set_buf(sg, acomp_ctx->buffers[k], PAGE_SIZE * 2);
+
+ acomp_request_set_params(acomp_ctx->req,
+ acomp_ctx->sg_inputs->sgl,
+ acomp_ctx->sg_outputs->sgl,
+ nr_comps * PAGE_SIZE,
+ nr_comps * PAGE_SIZE);
+
+ err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
+ &acomp_ctx->wait);
+
+ if (unlikely(err)) {
+ if (nr_comps == 1)
+ dlens[0] = err;
+ goto compress_error;
}
+ if (nr_comps == 1)
+ dlens[0] = acomp_ctx->req->dlen;
+ else
+ for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k)
+ dlens[k] = sg->length;
+
/*
* All @nr_comps pages were successfully compressed.
* Store the pages in zpool.
*
* Note:
* [j] refers to the incoming batch space and is used to
- * index into the folio pages, @entries, @dlens and @errors.
+ * index into the folio pages and @entries.
* [k] refers to the @acomp_ctx space, as determined by
* @pool->compr_batch_size, and is used to index into
* @acomp_ctx->buffers.
@@ -1113,7 +1142,7 @@ static bool zswap_compress(struct folio *folio, long start, unsigned int nr_page
* non-batching software compressors.
*/
prefetchw(entries[j]);
- err = zpool_malloc(zpool, dlens[j], gfp, &handle, node_id);
+ err = zpool_malloc(zpool, dlens[k], gfp, &handle, node_id);
if (unlikely(err)) {
if (err == -ENOSPC)
@@ -1124,9 +1153,9 @@ static bool zswap_compress(struct folio *folio, long start, unsigned int nr_page
goto err_unlock;
}
- zpool_obj_write(zpool, handle, acomp_ctx->buffers[k], dlens[j]);
+ zpool_obj_write(zpool, handle, acomp_ctx->buffers[k], dlens[k]);
entries[j]->handle = handle;
- entries[j]->length = dlens[j];
+ entries[j]->length = dlens[k];
}
} /* finished compress and store nr_pages. */
@@ -1134,9 +1163,9 @@ static bool zswap_compress(struct folio *folio, long start, unsigned int nr_page
return true;
compress_error:
- for (j = i; j < i + nr_comps; ++j) {
- if (errors[j]) {
- if (errors[j] == -ENOSPC)
+ for (k = 0; k < nr_comps; ++k) {
+ if (dlens[k] < 0) {
+ if (dlens[k] == -ENOSPC)
zswap_reject_compress_poor++;
else
zswap_reject_compress_fail++;
Thanks,
Kanchana
>
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Fri, Aug 22, 2025 at 07:26:34PM +0000, Sridhar, Kanchana P wrote:
>
> 1) The zswap per-CPU acomp_ctx has two sg_tables added, one each for
>    inputs/outputs, with nents set to the pool->compr_batch_size (1 for software
>    compressors). This per-CPU data incurs additional memory overhead per-CPU,
>    however this is memory that will anyway be allocated on the stack in
>    zswap_compress(); and less memory overhead than the latter because we know
>    exactly how many sg_table scatterlists to allocate for the given pool
>    (assuming we don't kmalloc in zswap_compress()). I will make sure to quantify
>    the overhead in v12's commit logs.

There is no need for any SG lists for the source. The folio should
be submitted as the source.

So only the destination requires an SG list.

> 6) "For the source, nothing needs to be done because the folio could be passed
>    in as is.". As far as I know, this cannot be accomplished without
>    modifications to the crypto API for software compressors, because compressed
>    buffers need to be stored in the zswap/zram zs_pools at PAGE_SIZE
>    granularity.

Sure. But all it needs is one central fallback path in the acompress
API. I can do this for you.

Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Sunday, August 24, 2025 10:39 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Nhat Pham <nphamcs@gmail.com>; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; hannes@cmpxchg.org; yosry.ahmed@linux.dev;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v11 00/24] zswap compression batching with optimized
> iaa_crypto driver
>
> On Fri, Aug 22, 2025 at 07:26:34PM +0000, Sridhar, Kanchana P wrote:
> >
> > 1) The zswap per-CPU acomp_ctx has two sg_tables added, one each for
> >    inputs/outputs, with nents set to the pool->compr_batch_size (1 for software
> >    compressors). This per-CPU data incurs additional memory overhead per-CPU,
> >    however this is memory that will anyway be allocated on the stack in
> >    zswap_compress(); and less memory overhead than the latter because we know
> >    exactly how many sg_table scatterlists to allocate for the given pool
> >    (assuming we don't kmalloc in zswap_compress()). I will make sure to quantify
> >    the overhead in v12's commit logs.
>
> There is no need for any SG lists for the source. The folio should
> be submitted as the source.
>
> So only the destination requires an SG list.
>
> > 6) "For the source, nothing needs to be done because the folio could be passed
> >    in as is.". As far as I know, this cannot be accomplished without
> >    modifications to the crypto API for software compressors, because compressed
> >    buffers need to be stored in the zswap/zram zs_pools at PAGE_SIZE
> >    granularity.
>
> Sure. But all it needs is one central fallback path in the acompress
> API. I can do this for you.

Thanks Herbert, for reviewing the approach. IIUC, we should follow
these constraints:

1) The folio should be submitted as the source.

2) For the destination, construct an SG list for them and pass that in.
   The rule should be that the SG list must contain a sufficient number
   of pages for the compression output based on the given unit size
   (PAGE_SIZE for zswap).

For PMD folios, there would be 512 compression outputs. In this case,
would we need to pass in an SG list that can contain 512 compression
outputs after calling the acompress API once?

If so, this might not be feasible for zswap since there are only
"batch-size" number of pre-allocated per-CPU output buffers, where
"batch_size" is the max number of pages that can be compressed in one
call to the algorithm (1 for software compressors). Hence, gathering
all 512 compression outputs may not be possible in a single invocation
to crypto_acomp_compress(). Is the suggestion to allocate 512 per-CPU
output buffers to overcome this? This could be memory-wise very
expensive. Please let me know if I am missing something.

Thanks for offering to make the necessary changes to the acompress API.
Hoping we can sync on the approach!

Best regards,
Kanchana

> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Mon, Aug 25, 2025 at 06:12:19PM +0000, Sridhar, Kanchana P wrote:
>
> Thanks Herbert, for reviewing the approach. IIUC, we should follow
> these constraints:
>
> 1) The folio should be submitted as the source.
>
> 2) For the destination, construct an SG list for them and pass that in.
>    The rule should be that the SG list must contain a sufficient number
>    of pages for the compression output based on the given unit size
>    (PAGE_SIZE for zswap).
>
> For PMD folios, there would be 512 compression outputs. In this case,
> would we need to pass in an SG list that can contain 512 compression
> outputs after calling the acompress API once?

Eventually yes :)

But for now we're just replicating your current patch-set, so
the folio should come with an offset and a length restriction,
and correspondingly the destination SG list should contain the
same number of pages as there are in your current patch-set.

Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Monday, August 25, 2025 6:13 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Nhat Pham <nphamcs@gmail.com>; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; hannes@cmpxchg.org; yosry.ahmed@linux.dev;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v11 00/24] zswap compression batching with optimized
> iaa_crypto driver
>
> On Mon, Aug 25, 2025 at 06:12:19PM +0000, Sridhar, Kanchana P wrote:
> >
> > Thanks Herbert, for reviewing the approach. IIUC, we should follow
> > these constraints:
> >
> > 1) The folio should be submitted as the source.
> >
> > 2) For the destination, construct an SG list for them and pass that in.
> >    The rule should be that the SG list must contain a sufficient number
> >    of pages for the compression output based on the given unit size
> >    (PAGE_SIZE for zswap).
> >
> > For PMD folios, there would be 512 compression outputs. In this case,
> > would we need to pass in an SG list that can contain 512 compression
> > outputs after calling the acompress API once?
>
> Eventually yes :)
>
> But for now we're just replicating your current patch-set, so
> the folio should come with an offset and a length restriction,
> and correspondingly the destination SG list should contain the
> same number of pages as there are in your current patch-set.

Thanks Herbert. Just want to make sure I understand this. Are you
referring to replacing sg_set_page() for the input with sg_set_folio()?
We have to pass in a scatterlist for the acomp_req->src..

This is how the converged zswap_compress() code would look for batch
compression of "nr_pages" in "folio", starting at index "start". The
input SG list will contain "nr_comps" pages: nr_comps is 1 for software
and 8 for IAA. The destination SG list will contain an equivalent
number of buffers (each is PAGE_SIZE * 2).

Based on your suggestions, I was able to come up with a unified
implementation for software and hardware compressors; the SG list for
the input is a key aspect of this (the input for_each_sg() loop below):

static bool zswap_compress(struct folio *folio, long start, unsigned int nr_pages,
			   struct zswap_entry *entries[], struct zswap_pool *pool,
			   int node_id)
{
	unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
	unsigned int dlens[ZSWAP_MAX_BATCH_SIZE];
	struct crypto_acomp_ctx *acomp_ctx;
	struct zpool *zpool = pool->zpool;
	struct scatterlist *sg;
	unsigned int i, j, k;
	gfp_t gfp;
	int err;

	gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;

	acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
	mutex_lock(&acomp_ctx->mutex);

	prefetchw(acomp_ctx->sg_inputs->sgl);
	prefetchw(acomp_ctx->sg_outputs->sgl);

	/*
	 * Note:
	 * [i] refers to the incoming batch space and is used to
	 * index into the folio pages and @entries.
	 *
	 * [k] refers to the @acomp_ctx space, as determined by
	 * @pool->compr_batch_size, and is used to index into
	 * @acomp_ctx->buffers and @dlens.
	 */
	for (i = 0; i < nr_pages; i += nr_comps) {
		for_each_sg(acomp_ctx->sg_inputs->sgl, sg, nr_comps, k)
			sg_set_folio(sg, folio, PAGE_SIZE, (start + k + i) * PAGE_SIZE);

		/*
		 * We need PAGE_SIZE * 2 here since there maybe over-compression case,
		 * and hardware-accelerators may won't check the dst buffer size, so
		 * giving the dst buffer with enough length to avoid buffer overflow.
		 */
		for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k)
			sg_set_buf(sg, acomp_ctx->buffers[k], PAGE_SIZE * 2);

		acomp_request_set_params(acomp_ctx->req,
					 acomp_ctx->sg_inputs->sgl,
					 acomp_ctx->sg_outputs->sgl,
					 nr_comps * PAGE_SIZE,
					 nr_comps * PAGE_SIZE);

		err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
				      &acomp_ctx->wait);

		if (unlikely(err)) {
			if (nr_comps == 1)
				dlens[0] = err;
			goto compress_error;
		}

		if (nr_comps == 1)
			dlens[0] = acomp_ctx->req->dlen;
		else
			for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k)
				dlens[k] = sg->length;

		[ store each compressed page in zpool ]

I quickly tested this with usemem 30 processes. Switching from
sg_set_page() to sg_set_folio() does cause a 15% throughput regression
for IAA and a 2% regression for zstd:

usemem30/64K folios/deflate-iaa/Avg throughput (KB/s):
  sg_set_page():  357,141
  sg_set_folio(): 304,696

usemem30/64K folios/zstd/Avg throughput (KB/s):
  sg_set_page():  230,760
  sg_set_folio(): 226,246

In my experience, zswap_compress() is highly performance-critical code
and the smallest compute additions can cause significant impact on
workload performance and sys time. Given the code simplification and
unification that your SG list suggestions have enabled, may I
understand better why sg_set_folio() is preferred?

Again, my apologies if I have misunderstood your suggestion, but I
think it is worth getting this clarified so we are all in agreement.

Thanks and best regards,
Kanchana

> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Tue, Aug 26, 2025 at 04:09:45AM +0000, Sridhar, Kanchana P wrote:
>
> Thanks Herbert. Just want to make sure I understand this. Are you
> referring to replacing sg_set_page() for the input with sg_set_folio()?
> We have to pass in a scatterlist for the acomp_req->src..

I'm talking about acomp_request_set_src_folio. You can pass just
a portion of a folio by specifying an offset and a length.

> 	for (i = 0; i < nr_pages; i += nr_comps) {
> 		for_each_sg(acomp_ctx->sg_inputs->sgl, sg, nr_comps, k)
> 			sg_set_folio(sg, folio, PAGE_SIZE, (start + k + i) * PAGE_SIZE);
>
> 		/*
> 		 * We need PAGE_SIZE * 2 here since there maybe over-compression case,
> 		 * and hardware-accelerators may won't check the dst buffer size, so
> 		 * giving the dst buffer with enough length to avoid buffer overflow.
> 		 */
> 		for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k)
> 			sg_set_buf(sg, acomp_ctx->buffers[k], PAGE_SIZE * 2);
>
> 		acomp_request_set_params(acomp_ctx->req,
> 					 acomp_ctx->sg_inputs->sgl,
> 					 acomp_ctx->sg_outputs->sgl,
> 					 nr_comps * PAGE_SIZE,
> 					 nr_comps * PAGE_SIZE);

I meant something more like:

	acomp_request_set_src_folio(req, folio, start_offset,
				    nr_comps * PAGE_SIZE);
	acomp_request_set_dst_sg(req, acomp_ctx_sg_outputs->sgl,
				 nr_comps * PAGE_SIZE);
	acomp_request_set_unit_size(req, PAGE_SIZE);

Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Monday, August 25, 2025 9:15 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Nhat Pham <nphamcs@gmail.com>; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; hannes@cmpxchg.org; yosry.ahmed@linux.dev;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v11 00/24] zswap compression batching with optimized
> iaa_crypto driver
>
> On Tue, Aug 26, 2025 at 04:09:45AM +0000, Sridhar, Kanchana P wrote:
> >
> > Thanks Herbert. Just want to make sure I understand this. Are you
> > referring to replacing sg_set_page() for the input with sg_set_folio()?
> > We have to pass in a scatterlist for the acomp_req->src..
>
> I'm talking about acomp_request_set_src_folio. You can pass just
> a portion of a folio by specifying an offset and a length.
>
> > 	for (i = 0; i < nr_pages; i += nr_comps) {
> > 		for_each_sg(acomp_ctx->sg_inputs->sgl, sg, nr_comps, k)
> > 			sg_set_folio(sg, folio, PAGE_SIZE, (start + k + i) * PAGE_SIZE);
> >
> > 		/*
> > 		 * We need PAGE_SIZE * 2 here since there maybe over-compression case,
> > 		 * and hardware-accelerators may won't check the dst buffer size, so
> > 		 * giving the dst buffer with enough length to avoid buffer overflow.
> > 		 */
> > 		for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k)
> > 			sg_set_buf(sg, acomp_ctx->buffers[k], PAGE_SIZE * 2);
> >
> > 		acomp_request_set_params(acomp_ctx->req,
> > 					 acomp_ctx->sg_inputs->sgl,
> > 					 acomp_ctx->sg_outputs->sgl,
> > 					 nr_comps * PAGE_SIZE,
> > 					 nr_comps * PAGE_SIZE);
>
> I meant something more like:
>
> 	acomp_request_set_src_folio(req, folio, start_offset,
> 				    nr_comps * PAGE_SIZE);
> 	acomp_request_set_dst_sg(req, acomp_ctx_sg_outputs->sgl,
> 				 nr_comps * PAGE_SIZE);
> 	acomp_request_set_unit_size(req, PAGE_SIZE);

Ok, I get it now :) Thanks. I will try this out, and pending any issues
that may arise from testing, I might be all set for putting together v12.

Thanks again Herbert, I appreciate it.

Best regards,
Kanchana

> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt