[v13] zswap compression batching with optimized iaa_crypto driver

[PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Kanchana P Sridhar 3 months, 1 week ago

This patch introduces a new unified implementation of zswap_compress()
for compressors that do and do not support batching. This eliminates
code duplication and facilitates code maintainability with the
introduction of compress batching.

The vectorized implementation of calling the earlier zswap_compress()
sequentially, one page at a time in zswap_store_pages(), is replaced
with this new version of zswap_compress() that accepts multiple pages to
compress as a batch.

If the compressor does not support batching, each page in the batch is
compressed and stored sequentially. If the compressor supports batching,
for e.g., 'deflate-iaa', the Intel IAA hardware accelerator, the batch
is compressed in parallel in hardware. If the batch is compressed
without errors, the compressed buffers are then stored in zsmalloc. In
case of compression errors, the current behavior is preserved for the
batching zswap_compress(): if the folio's memcg is writeback enabled,
pages with compression errors are store uncompressed in zsmalloc; if
not, we return an error for the folio in zswap_store().

As per Herbert's suggestion in [1] for batching to be based on SG lists
to interface with the crypto API, a "struct sg_table *sg_outputs" is
added to the per-CPU acomp_ctx. In zswap_cpu_comp_prepare(), memory is
allocated for @pool->compr_batch_size scatterlists in
@acomp_ctx->sg_outputs. The per-CPU @acomp_ctx->buffers' addresses are
statically mapped to the respective SG lists. The existing non-NUMA
sg_alloc_table() was found to give better performance than a NUMA-aware
allocation function, hence is used in this patch.

Batching compressors should initialize the output SG lengths to
PAGE_SIZE as part of the internal compress batching setup, to avoid
having to do multiple traversals over the @acomp_ctx->sg_outputs->sgl.
This is exactly how batching is implemented in the iaa_crypto driver's
compress batching procedure, iaa_comp_acompress_batch().

The batched zswap_compress() implementation is generalized as much as
possible for non-batching and batching compressors, so that the
subsequent incompressible page handling, zs_pool writes, and error
handling code is seamless for both, without the use of conditionals to
switch to specialized code for either.

The new batching implementation of zswap_compress() is called with a
batch of @nr_pages sent from zswap_store() to zswap_store_pages().
zswap_compress() steps through the batch in increments of the
compressor's batch-size, sets up the acomp_ctx->req's src/dst SG lists
to contain the folio pages and output buffers, before calling
crypto_acomp_compress().

Some important requirements of this batching architecture for batching
compressors:

  1) The output SG lengths for each sg in the acomp_req->dst should be
     intialized to PAGE_SIZE as part of other batch setup in the batch
     compression function. zswap will not take care of this in the
     interest of avoiding repetitive traversals of the
     @acomp_ctx->sg_outputs->sgl so as to not lose the benefits of
     batching.

  2) In case of a compression error for any page in the batch, the
     batching compressor should set the corresponding @sg->length to a
     negative error number, as suggested by Herbert. Otherwise, the
     @sg->length will contain the compressed output length.

  3) Batching compressors should set acomp_req->dlen to
     acomp_req->dst->length, i.e., the sg->length of the first SG in
     acomp_req->dst.

Another important change this patch makes is with the acomp_ctx mutex
locking in zswap_compress(). Earlier, the mutex was held per page's
compression. With the new code, [un]locking the mutex per page caused
regressions for software compressors when testing with 30 usemem
processes, and also kernel compilation with 'allmod' config. The
regressions were more eggregious when PMD folios were stored. The
implementation in this commit locks/unlocks the mutex once per batch,
that resolves the regression.

Architectural considerations for the zswap batching framework:
==============================================================
We have designed the zswap batching framework to be
hardware-agnostic. It has no dependencies on Intel-specific features and
can be leveraged by any hardware accelerator or software-based
compressor. In other words, the framework is open and inclusive by
design.

Other ongoing work that can use batching:
=========================================
This patch-series demonstrates the performance benefits of compress
batching when used in zswap_store() of large folios. shrink_folio_list()
"reclaim batching" of any-order folios is the major next work that uses
the zswap compress batching framework: our testing of kernel_compilation
with writeback and the zswap shrinker indicates 10X fewer pages get
written back when we reclaim 32 folios as a batch, as compared to one
folio at a time: this is with deflate-iaa and with zstd. We expect to
submit a patch-series with this data and the resulting performance
improvements shortly. Reclaim batching relieves memory pressure faster
than reclaiming one folio at a time, hence alleviates the need to scan
slab memory for writeback.

Nhat has given ideas on using batching with the ongoing kcompressd work,
as well as beneficially using decompression batching & block IO batching
to improve zswap writeback efficiency.

Experiments that combine zswap compress batching, reclaim batching,
swapin_readahead() decompression batching of prefetched pages, and
writeback batching show that 0 pages are written back with deflate-iaa
and zstd. For comparison, the baselines for these compressors see
200K-800K pages written to disk (kernel compilation 'allmod' config).

To summarize, these are future clients of the batching framework:

   - shrink_folio_list() reclaim batching of multiple folios:
       Implemented, will submit patch-series.
   - zswap writeback with decompress batching:
       Implemented, will submit patch-series.
   - zram:
       Implemented, will submit patch-series.
   - kcompressd:
       Not yet implemented.
   - file systems:
       Not yet implemented.
   - swapin_readahead() decompression batching of prefetched pages:
       Implemented, will submit patch-series.

Additionally, any place we have folios that need to be compressed, can
potentially be parallelized.

Performance data:
=================

As suggested by Barry, this is the performance data gathered on Intel
Sapphire Rapids with usemem 30 processes running at 50% memory pressure
and kernel_compilation/allmod config run with 2G limit using 32
threads. To keep comparisons simple, all testing was done without the
zswap shrinker.

  usemem30 with 64K folios:
  =========================

     zswap shrinker_enabled = N.

     -----------------------------------------------------------------------
                     mm-unstable-10-24-2025             v13
     -----------------------------------------------------------------------
     zswap compressor          deflate-iaa     deflate-iaa   IAA Batching
                                                                 vs.
                                                             IAA Sequential
     -----------------------------------------------------------------------
     Total throughput (KB/s)     6,118,675       9,901,216       62%
     Average throughput (KB/s)     203,955         330,040       62%
     elapsed time (sec)              98.94           70.90      -28%
     sys time (sec)               2,379.29        1,686.18      -29%
     -----------------------------------------------------------------------

     -----------------------------------------------------------------------
                     mm-unstable-10-24-2025             v13
     -----------------------------------------------------------------------
     zswap compressor                 zstd            zstd   v13 zstd
                                                             improvement
     -----------------------------------------------------------------------
     Total throughput (KB/s)     5,983,561       6,003,851      0.3%
     Average throughput (KB/s)     199,452         200,128      0.3%
     elapsed time (sec)             100.93           96.62     -4.3%
     sys time (sec)               2,532.49        2,395.83       -5%
     -----------------------------------------------------------------------

  usemem30 with 2M folios:
  ========================

     -----------------------------------------------------------------------
                     mm-unstable-10-24-2025             v13
     -----------------------------------------------------------------------
     zswap compressor          deflate-iaa     deflate-iaa   IAA Batching
                                                                 vs.
                                                             IAA Sequential
     -----------------------------------------------------------------------
     Total throughput (KB/s)     6,309,635      10,558,225       67%
     Average throughput (KB/s)     210,321         351,940       67%
     elapsed time (sec)              88.70           67.84      -24%
     sys time (sec)               2,059.83        1,581.07      -23%
     -----------------------------------------------------------------------

     -----------------------------------------------------------------------
                     mm-unstable-10-24-2025             v13
     -----------------------------------------------------------------------
     zswap compressor                 zstd            zstd   v13 zstd
                                                             improvement
     -----------------------------------------------------------------------
     Total throughput (KB/s)     6,562,687       6,567,946      0.1%
     Average throughput (KB/s)     218,756         218,931      0.1%
     elapsed time (sec)              94.69           88.79       -6%
     sys time (sec)               2,253.97        2,083.43       -8%
     -----------------------------------------------------------------------

    The main takeaway from usemem, a workload that is mostly compression
    dominated (very few swapins) is that the higher the number of batches,
    such as with larger folios, the more the benefit of batching cost
    amortization, as shown by the PMD usemem data. This aligns well
    with the future direction for batching.

kernel_compilation/allmodconfig, 64K folios:
============================================

     --------------------------------------------------------------------------
               mm-unstable-10-24-2025             v13
     --------------------------------------------------------------------------
     zswap compressor    deflate-iaa     deflate-iaa    IAA Batching
                                                             vs.
                                                        IAA Sequential
     --------------------------------------------------------------------------
     real_sec                 836.64          806.94      -3.5%
     sys_sec                3,897.57        3,661.83        -6%
     --------------------------------------------------------------------------

     --------------------------------------------------------------------------
               mm-unstable-10-24-2025             v13
     --------------------------------------------------------------------------
     zswap compressor           zstd            zstd    Improvement
     --------------------------------------------------------------------------
     real_sec                 880.62          850.41      -3.4%
     sys_sec                5,171.90        5,076.51      -1.8%
     --------------------------------------------------------------------------

kernel_compilation/allmodconfig, PMD folios:
============================================

     --------------------------------------------------------------------------
               mm-unstable-10-24-2025             v13
     --------------------------------------------------------------------------
     zswap compressor    deflate-iaa     deflate-iaa    IAA Batching
                                                             vs.
                                                        IAA Sequential
     --------------------------------------------------------------------------
     real_sec                 818.48          779.67      -4.7%
     sys_sec                4,226.52        4,245.18       0.4%
     --------------------------------------------------------------------------

     --------------------------------------------------------------------------
              mm-unstable-10-24-2025             v13
     --------------------------------------------------------------------------
     zswap compressor          zstd             zstd    Improvement
     --------------------------------------------------------------------------
     real_sec                888.45           849.54      -4.4%
     sys_sec               5,866.72         5,847.17      -0.3%
     --------------------------------------------------------------------------

[1]: https://lore.kernel.org/all/aJ7Fk6RpNc815Ivd@gondor.apana.org.au/T/#m99aea2ce3d284e6c5a3253061d97b08c4752a798

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 249 ++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 181 insertions(+), 68 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 257567edc587..c5487dd69ec6 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -143,6 +143,7 @@ struct crypto_acomp_ctx {
 	struct acomp_req *req;
 	struct crypto_wait wait;
 	u8 **buffers;
+	struct sg_table *sg_outputs;
 	struct mutex mutex;
 	bool is_sleepable;
 };
@@ -271,6 +272,11 @@ static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx, u8 nr_buffers)
 			kfree(acomp_ctx->buffers[i]);
 		kfree(acomp_ctx->buffers);
 	}
+
+	if (acomp_ctx->sg_outputs) {
+		sg_free_table(acomp_ctx->sg_outputs);
+		kfree(acomp_ctx->sg_outputs);
+	}
 }
 
 static struct zswap_pool *zswap_pool_create(char *compressor)
@@ -804,6 +810,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
 	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
 	int nid = cpu_to_node(cpu);
+	struct scatterlist *sg;
 	int ret = -ENOMEM;
 	u8 i;
 
@@ -849,6 +856,22 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 			goto fail;
 	}
 
+	acomp_ctx->sg_outputs = kmalloc(sizeof(*acomp_ctx->sg_outputs),
+					GFP_KERNEL);
+	if (!acomp_ctx->sg_outputs)
+		goto fail;
+
+	if (sg_alloc_table(acomp_ctx->sg_outputs, pool->compr_batch_size,
+			   GFP_KERNEL))
+		goto fail;
+
+	/*
+	 * Statically map the per-CPU destination buffers to the per-CPU
+	 * SG lists.
+	 */
+	for_each_sg(acomp_ctx->sg_outputs->sgl, sg, pool->compr_batch_size, i)
+		sg_set_buf(sg, acomp_ctx->buffers[i], PAGE_SIZE);
+
 	/*
 	 * if the backend of acomp is async zip, crypto_req_done() will wakeup
 	 * crypto_wait_req(); if the backend of acomp is scomp, the callback
@@ -869,84 +892,177 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	return ret;
 }
 
-static bool zswap_compress(struct page *page, struct zswap_entry *entry,
-			   struct zswap_pool *pool, bool wb_enabled)
+/*
+ * Unified code path for compressors that do and do not support batching. This
+ * procedure will compress multiple @nr_pages in @folio starting from the
+ * @start index.
+ *
+ * It is assumed that @nr_pages <= ZSWAP_MAX_BATCH_SIZE. zswap_store() makes
+ * sure of this by design and zswap_store_pages() warns if this is not
+ * true.
+ *
+ * @nr_pages can be in (1, ZSWAP_MAX_BATCH_SIZE] even if the compressor does not
+ * support batching.
+ *
+ * If @pool->compr_batch_size is 1, each page is processed sequentially.
+ *
+ * If @pool->compr_batch_size is > 1, compression batching is invoked within
+ * the algorithm's driver, except if @nr_pages is 1: if so, the driver can
+ * choose to call the sequential/non-batching compress API.
+ *
+ * In both cases, if all compressions are successful, the compressed buffers
+ * are stored in zsmalloc.
+ *
+ * Traversing multiple SG lists when @nr_comps is > 1 is expensive, and impacts
+ * batching performance if we were to repeat this operation multiple times,
+ * such as:
+ *   - to map destination buffers to each SG list in the @acomp_ctx->sg_outputs
+ *     sg_table.
+ *   - to initialize each output SG list's @sg->length to PAGE_SIZE.
+ *   - to get the compressed output length in each @sg->length.
+ *
+ * These are some design choices made to optimize batching with SG lists:
+ *
+ * 1) The source folio pages in the batch are directly submitted to
+ *    crypto_acomp via acomp_request_set_src_folio().
+ *
+ * 2) The per-CPU @acomp_ctx->sg_outputs scatterlists are used to set up
+ *    destination buffers for interfacing with crypto_acomp.
+ *
+ * 3) To optimize performance, we map the per-CPU @acomp_ctx->buffers to the
+ *    @acomp_ctx->sg_outputs->sgl SG lists at pool creation time. The only task
+ *    remaining to be done for the output SG lists in zswap_compress() is to
+ *    set each @sg->length to PAGE_SIZE. This is done in zswap_compress()
+ *    for non-batching compressors. This needs to be done within the compress
+ *    batching driver procedure as part of iterating through the SG lists for
+ *    batch setup, so as to minimize expensive traversals through the SG lists.
+ *
+ * 4) Important requirements for batching compressors:
+ *    - Each @sg->length in @acomp_ctx->req->sg_outputs->sgl should reflect the
+ *      compression outcome for that specific page, and be set to:
+ *      - the page's compressed length, or
+ *      - the compression error value for that page.
+ *    - The @acomp_ctx->req->dlen should be set to the first page's
+ *      @sg->length. This enables code generalization in zswap_compress()
+ *      for non-batching and batching compressors.
+ *
+ * acomp_ctx mutex locking:
+ *    Earlier, the mutex was held per page compression. With the new code,
+ *    [un]locking the mutex per page caused regressions for software
+ *    compressors. We now lock the mutex once per batch, which resolves the
+ *    regression.
+ */
+static bool zswap_compress(struct folio *folio, long start, unsigned int nr_pages,
+			   struct zswap_entry *entries[], struct zswap_pool *pool,
+			   int nid, bool wb_enabled)
 {
+	gfp_t gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
+	unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
+	unsigned int slen = nr_comps * PAGE_SIZE;
 	struct crypto_acomp_ctx *acomp_ctx;
-	struct scatterlist input, output;
-	int comp_ret = 0, alloc_ret = 0;
-	unsigned int dlen = PAGE_SIZE;
+	int err = 0, err_sg = 0;
+	struct scatterlist *sg;
+	unsigned int i, j, k;
 	unsigned long handle;
-	gfp_t gfp;
-	u8 *dst;
-	bool mapped = false;
+	int *errp, dlen;
+	void *dst;
 
 	acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
 	mutex_lock(&acomp_ctx->mutex);
 
-	dst = acomp_ctx->buffers[0];
-	sg_init_table(&input, 1);
-	sg_set_page(&input, page, PAGE_SIZE, 0);
-
-	sg_init_one(&output, dst, PAGE_SIZE);
-	acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
+	errp = (pool->compr_batch_size == 1) ? &err : &err_sg;
 
 	/*
-	 * it maybe looks a little bit silly that we send an asynchronous request,
-	 * then wait for its completion synchronously. This makes the process look
-	 * synchronous in fact.
-	 * Theoretically, acomp supports users send multiple acomp requests in one
-	 * acomp instance, then get those requests done simultaneously. but in this
-	 * case, zswap actually does store and load page by page, there is no
-	 * existing method to send the second page before the first page is done
-	 * in one thread doing zswap.
-	 * but in different threads running on different cpu, we have different
-	 * acomp instance, so multiple threads can do (de)compression in parallel.
+	 * [i] refers to the incoming batch space and is used to
+	 *     index into the folio pages.
+	 *
+	 * [j] refers to the incoming batch space and is used to
+	 *     index into the @entries for the folio's pages in this
+	 *     batch, per compress call while iterating over the output SG
+	 *     lists. Also used to index into the folio's pages from @start,
+	 *     in case of compress errors.
+	 *
+	 * [k] refers to the @acomp_ctx space, as determined by
+	 *     @pool->compr_batch_size, and is used to index into
+	 *     @acomp_ctx->sg_outputs->sgl and @acomp_ctx->buffers.
 	 */
-	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
-	dlen = acomp_ctx->req->dlen;
+	for (i = 0; i < nr_pages; i += nr_comps) {
+		acomp_request_set_src_folio(acomp_ctx->req, folio,
+					    (start + i) * PAGE_SIZE,
+					    slen);
 
-	/*
-	 * If a page cannot be compressed into a size smaller than PAGE_SIZE,
-	 * save the content as is without a compression, to keep the LRU order
-	 * of writebacks.  If writeback is disabled, reject the page since it
-	 * only adds metadata overhead.  swap_writeout() will put the page back
-	 * to the active LRU list in the case.
-	 */
-	if (comp_ret || !dlen || dlen >= PAGE_SIZE) {
-		if (!wb_enabled) {
-			comp_ret = comp_ret ? comp_ret : -EINVAL;
-			goto unlock;
-		}
-		comp_ret = 0;
-		dlen = PAGE_SIZE;
-		dst = kmap_local_page(page);
-		mapped = true;
-	}
+		acomp_ctx->sg_outputs->sgl->length = slen;
 
-	gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
-	handle = zs_malloc(pool->zs_pool, dlen, gfp, page_to_nid(page));
-	if (IS_ERR_VALUE(handle)) {
-		alloc_ret = PTR_ERR((void *)handle);
-		goto unlock;
-	}
+		acomp_request_set_dst_sg(acomp_ctx->req,
+					 acomp_ctx->sg_outputs->sgl,
+					 slen);
+
+		err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
+				      &acomp_ctx->wait);
+
+		acomp_ctx->sg_outputs->sgl->length = acomp_ctx->req->dlen;
+
+		/*
+		 * If a page cannot be compressed into a size smaller than
+		 * PAGE_SIZE, save the content as is without a compression, to
+		 * keep the LRU order of writebacks.  If writeback is disabled,
+		 * reject the page since it only adds metadata overhead.
+		 * swap_writeout() will put the page back to the active LRU list
+		 * in the case.
+		 *
+		 * It is assumed that any compressor that sets the output length
+		 * to 0 or a value >= PAGE_SIZE will also return a negative
+		 * error status in @err; i.e, will not return a successful
+		 * compression status in @err in this case.
+		 */
+		if (err && !wb_enabled)
+			goto compress_error;
+
+		for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
+			j = k + i;
+			dst = acomp_ctx->buffers[k];
+			dlen = sg->length | *errp;
+
+			if (dlen < 0) {
+				dlen = PAGE_SIZE;
+				dst = kmap_local_page(folio_page(folio, start + j));
+			}
+
+			handle = zs_malloc(pool->zs_pool, dlen, gfp, nid);
 
-	zs_obj_write(pool->zs_pool, handle, dst, dlen);
-	entry->handle = handle;
-	entry->length = dlen;
+			if (IS_ERR_VALUE(handle)) {
+				if (PTR_ERR((void *)handle) == -ENOSPC)
+					zswap_reject_compress_poor++;
+				else
+					zswap_reject_alloc_fail++;
 
-unlock:
-	if (mapped)
-		kunmap_local(dst);
-	if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
-		zswap_reject_compress_poor++;
-	else if (comp_ret)
-		zswap_reject_compress_fail++;
-	else if (alloc_ret)
-		zswap_reject_alloc_fail++;
+				goto err_unlock;
+			}
+
+			zs_obj_write(pool->zs_pool, handle, dst, dlen);
+			entries[j]->handle = handle;
+			entries[j]->length = dlen;
+			if (dst != acomp_ctx->buffers[k])
+				kunmap_local(dst);
+		}
+	} /* finished compress and store nr_pages. */
+
+	mutex_unlock(&acomp_ctx->mutex);
+	return true;
+
+compress_error:
+	for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
+		if ((int)sg->length < 0) {
+			if ((int)sg->length == -ENOSPC)
+				zswap_reject_compress_poor++;
+			else
+				zswap_reject_compress_fail++;
+		}
+	}
 
+err_unlock:
 	mutex_unlock(&acomp_ctx->mutex);
-	return comp_ret == 0 && alloc_ret == 0;
+	return false;
 }
 
 static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
@@ -1488,12 +1604,9 @@ static bool zswap_store_pages(struct folio *folio,
 		INIT_LIST_HEAD(&entries[i]->lru);
 	}
 
-	for (i = 0; i < nr_pages; ++i) {
-		struct page *page = folio_page(folio, start + i);
-
-		if (!zswap_compress(page, entries[i], pool, wb_enabled))
-			goto store_pages_failed;
-	}
+	if (unlikely(!zswap_compress(folio, start, nr_pages, entries, pool,
+				     nid, wb_enabled)))
+		goto store_pages_failed;
 
 	for (i = 0; i < nr_pages; ++i) {
 		struct zswap_entry *old, *entry = entries[i];
-- 
2.27.0

Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Yosry Ahmed 2 months, 3 weeks ago

On Tue, Nov 04, 2025 at 01:12:35AM -0800, Kanchana P Sridhar wrote:
> This patch introduces a new unified implementation of zswap_compress()
> for compressors that do and do not support batching. This eliminates
> code duplication and facilitates code maintainability with the
> introduction of compress batching.
> 
> The vectorized implementation of calling the earlier zswap_compress()
> sequentially, one page at a time in zswap_store_pages(), is replaced
> with this new version of zswap_compress() that accepts multiple pages to
> compress as a batch.
> 
> If the compressor does not support batching, each page in the batch is
> compressed and stored sequentially. If the compressor supports batching,
> for e.g., 'deflate-iaa', the Intel IAA hardware accelerator, the batch
> is compressed in parallel in hardware. If the batch is compressed
> without errors, the compressed buffers are then stored in zsmalloc. In
> case of compression errors, the current behavior is preserved for the
> batching zswap_compress(): if the folio's memcg is writeback enabled,
> pages with compression errors are store uncompressed in zsmalloc; if
> not, we return an error for the folio in zswap_store().
> 
> As per Herbert's suggestion in [1] for batching to be based on SG lists
> to interface with the crypto API, a "struct sg_table *sg_outputs" is
> added to the per-CPU acomp_ctx. In zswap_cpu_comp_prepare(), memory is
> allocated for @pool->compr_batch_size scatterlists in
> @acomp_ctx->sg_outputs. The per-CPU @acomp_ctx->buffers' addresses are
> statically mapped to the respective SG lists. The existing non-NUMA
> sg_alloc_table() was found to give better performance than a NUMA-aware
> allocation function, hence is used in this patch.
> 
> Batching compressors should initialize the output SG lengths to
> PAGE_SIZE as part of the internal compress batching setup, to avoid
> having to do multiple traversals over the @acomp_ctx->sg_outputs->sgl.
> This is exactly how batching is implemented in the iaa_crypto driver's
> compress batching procedure, iaa_comp_acompress_batch().
> 
> The batched zswap_compress() implementation is generalized as much as
> possible for non-batching and batching compressors, so that the
> subsequent incompressible page handling, zs_pool writes, and error
> handling code is seamless for both, without the use of conditionals to
> switch to specialized code for either.
> 
> The new batching implementation of zswap_compress() is called with a
> batch of @nr_pages sent from zswap_store() to zswap_store_pages().
> zswap_compress() steps through the batch in increments of the
> compressor's batch-size, sets up the acomp_ctx->req's src/dst SG lists
> to contain the folio pages and output buffers, before calling
> crypto_acomp_compress().
> 
> Some important requirements of this batching architecture for batching
> compressors:
> 
>   1) The output SG lengths for each sg in the acomp_req->dst should be
>      intialized to PAGE_SIZE as part of other batch setup in the batch
>      compression function. zswap will not take care of this in the
>      interest of avoiding repetitive traversals of the
>      @acomp_ctx->sg_outputs->sgl so as to not lose the benefits of
>      batching.
> 
>   2) In case of a compression error for any page in the batch, the
>      batching compressor should set the corresponding @sg->length to a
>      negative error number, as suggested by Herbert. Otherwise, the
>      @sg->length will contain the compressed output length.
> 
>   3) Batching compressors should set acomp_req->dlen to
>      acomp_req->dst->length, i.e., the sg->length of the first SG in
>      acomp_req->dst.
> 
> Another important change this patch makes is with the acomp_ctx mutex
> locking in zswap_compress(). Earlier, the mutex was held per page's
> compression. With the new code, [un]locking the mutex per page caused
> regressions for software compressors when testing with 30 usemem
> processes, and also kernel compilation with 'allmod' config. The
> regressions were more eggregious when PMD folios were stored. The
> implementation in this commit locks/unlocks the mutex once per batch,
> that resolves the regression.
> 
> Architectural considerations for the zswap batching framework:
> ==============================================================
> We have designed the zswap batching framework to be
> hardware-agnostic. It has no dependencies on Intel-specific features and
> can be leveraged by any hardware accelerator or software-based
> compressor. In other words, the framework is open and inclusive by
> design.
> 
> Other ongoing work that can use batching:
> =========================================
> This patch-series demonstrates the performance benefits of compress
> batching when used in zswap_store() of large folios. shrink_folio_list()
> "reclaim batching" of any-order folios is the major next work that uses
> the zswap compress batching framework: our testing of kernel_compilation
> with writeback and the zswap shrinker indicates 10X fewer pages get
> written back when we reclaim 32 folios as a batch, as compared to one
> folio at a time: this is with deflate-iaa and with zstd. We expect to
> submit a patch-series with this data and the resulting performance
> improvements shortly. Reclaim batching relieves memory pressure faster
> than reclaiming one folio at a time, hence alleviates the need to scan
> slab memory for writeback.
> 
> Nhat has given ideas on using batching with the ongoing kcompressd work,
> as well as beneficially using decompression batching & block IO batching
> to improve zswap writeback efficiency.
> 
> Experiments that combine zswap compress batching, reclaim batching,
> swapin_readahead() decompression batching of prefetched pages, and
> writeback batching show that 0 pages are written back with deflate-iaa
> and zstd. For comparison, the baselines for these compressors see
> 200K-800K pages written to disk (kernel compilation 'allmod' config).
> 
> To summarize, these are future clients of the batching framework:
> 
>    - shrink_folio_list() reclaim batching of multiple folios:
>        Implemented, will submit patch-series.
>    - zswap writeback with decompress batching:
>        Implemented, will submit patch-series.
>    - zram:
>        Implemented, will submit patch-series.
>    - kcompressd:
>        Not yet implemented.
>    - file systems:
>        Not yet implemented.
>    - swapin_readahead() decompression batching of prefetched pages:
>        Implemented, will submit patch-series.
> 
> Additionally, any place we have folios that need to be compressed, can
> potentially be parallelized.
> 
> Performance data:
> =================
> 
> As suggested by Barry, this is the performance data gathered on Intel
> Sapphire Rapids with usemem 30 processes running at 50% memory pressure
> and kernel_compilation/allmod config run with 2G limit using 32
> threads. To keep comparisons simple, all testing was done without the
> zswap shrinker.
> 
>   usemem30 with 64K folios:
>   =========================
> 
>      zswap shrinker_enabled = N.
> 
>      -----------------------------------------------------------------------
>                      mm-unstable-10-24-2025             v13
>      -----------------------------------------------------------------------
>      zswap compressor          deflate-iaa     deflate-iaa   IAA Batching
>                                                                  vs.
>                                                              IAA Sequential
>      -----------------------------------------------------------------------
>      Total throughput (KB/s)     6,118,675       9,901,216       62%
>      Average throughput (KB/s)     203,955         330,040       62%
>      elapsed time (sec)              98.94           70.90      -28%
>      sys time (sec)               2,379.29        1,686.18      -29%
>      -----------------------------------------------------------------------
> 
>      -----------------------------------------------------------------------
>                      mm-unstable-10-24-2025             v13
>      -----------------------------------------------------------------------
>      zswap compressor                 zstd            zstd   v13 zstd
>                                                              improvement
>      -----------------------------------------------------------------------
>      Total throughput (KB/s)     5,983,561       6,003,851      0.3%
>      Average throughput (KB/s)     199,452         200,128      0.3%
>      elapsed time (sec)             100.93           96.62     -4.3%
>      sys time (sec)               2,532.49        2,395.83       -5%
>      -----------------------------------------------------------------------
> 
>   usemem30 with 2M folios:
>   ========================
> 
>      -----------------------------------------------------------------------
>                      mm-unstable-10-24-2025             v13
>      -----------------------------------------------------------------------
>      zswap compressor          deflate-iaa     deflate-iaa   IAA Batching
>                                                                  vs.
>                                                              IAA Sequential
>      -----------------------------------------------------------------------
>      Total throughput (KB/s)     6,309,635      10,558,225       67%
>      Average throughput (KB/s)     210,321         351,940       67%
>      elapsed time (sec)              88.70           67.84      -24%
>      sys time (sec)               2,059.83        1,581.07      -23%
>      -----------------------------------------------------------------------
> 
>      -----------------------------------------------------------------------
>                      mm-unstable-10-24-2025             v13
>      -----------------------------------------------------------------------
>      zswap compressor                 zstd            zstd   v13 zstd
>                                                              improvement
>      -----------------------------------------------------------------------
>      Total throughput (KB/s)     6,562,687       6,567,946      0.1%
>      Average throughput (KB/s)     218,756         218,931      0.1%
>      elapsed time (sec)              94.69           88.79       -6%
>      sys time (sec)               2,253.97        2,083.43       -8%
>      -----------------------------------------------------------------------
> 
>     The main takeaway from usemem, a workload that is mostly compression
>     dominated (very few swapins) is that the higher the number of batches,
>     such as with larger folios, the more the benefit of batching cost
>     amortization, as shown by the PMD usemem data. This aligns well
>     with the future direction for batching.
> 
> kernel_compilation/allmodconfig, 64K folios:
> ============================================
> 
>      --------------------------------------------------------------------------
>                mm-unstable-10-24-2025             v13
>      --------------------------------------------------------------------------
>      zswap compressor    deflate-iaa     deflate-iaa    IAA Batching
>                                                              vs.
>                                                         IAA Sequential
>      --------------------------------------------------------------------------
>      real_sec                 836.64          806.94      -3.5%
>      sys_sec                3,897.57        3,661.83        -6%
>      --------------------------------------------------------------------------
> 
>      --------------------------------------------------------------------------
>                mm-unstable-10-24-2025             v13
>      --------------------------------------------------------------------------
>      zswap compressor           zstd            zstd    Improvement
>      --------------------------------------------------------------------------
>      real_sec                 880.62          850.41      -3.4%
>      sys_sec                5,171.90        5,076.51      -1.8%
>      --------------------------------------------------------------------------
> 
> kernel_compilation/allmodconfig, PMD folios:
> ============================================
> 
>      --------------------------------------------------------------------------
>                mm-unstable-10-24-2025             v13
>      --------------------------------------------------------------------------
>      zswap compressor    deflate-iaa     deflate-iaa    IAA Batching
>                                                              vs.
>                                                         IAA Sequential
>      --------------------------------------------------------------------------
>      real_sec                 818.48          779.67      -4.7%
>      sys_sec                4,226.52        4,245.18       0.4%
>      --------------------------------------------------------------------------
> 
>      --------------------------------------------------------------------------
>               mm-unstable-10-24-2025             v13
>      --------------------------------------------------------------------------
>      zswap compressor          zstd             zstd    Improvement
>      --------------------------------------------------------------------------
>      real_sec                888.45           849.54      -4.4%
>      sys_sec               5,866.72         5,847.17      -0.3%
>      --------------------------------------------------------------------------
> 
> [1]: https://lore.kernel.org/all/aJ7Fk6RpNc815Ivd@gondor.apana.org.au/T/#m99aea2ce3d284e6c5a3253061d97b08c4752a798
> 
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>

I won't go through the commit log and rewrite for this one too, but
please do so similar to how I did for the previous patches. Do not
describe the code, give a high-level overview of what is happening and
why it's happeneing, as well as very concise performance results.

Do not include things that only make sense in the context of a patch and
won't make sense as part of git histroy.

That being said, I'd like Herbert to review this patch and make sure the
scatterlist and crypto APIs are being used correctly as he advised
earlier. I do have some comments on the zswap side though.

> ---
>  mm/zswap.c | 249 ++++++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 181 insertions(+), 68 deletions(-)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 257567edc587..c5487dd69ec6 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -143,6 +143,7 @@ struct crypto_acomp_ctx {
>  	struct acomp_req *req;
>  	struct crypto_wait wait;
>  	u8 **buffers;
> +	struct sg_table *sg_outputs;
>  	struct mutex mutex;
>  	bool is_sleepable;
>  };
> @@ -271,6 +272,11 @@ static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx, u8 nr_buffers)
>  			kfree(acomp_ctx->buffers[i]);
>  		kfree(acomp_ctx->buffers);
>  	}
> +
> +	if (acomp_ctx->sg_outputs) {
> +		sg_free_table(acomp_ctx->sg_outputs);
> +		kfree(acomp_ctx->sg_outputs);
> +	}
>  }
>  
>  static struct zswap_pool *zswap_pool_create(char *compressor)
> @@ -804,6 +810,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
>  	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
>  	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
>  	int nid = cpu_to_node(cpu);
> +	struct scatterlist *sg;
>  	int ret = -ENOMEM;
>  	u8 i;
>  
> @@ -849,6 +856,22 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
>  			goto fail;
>  	}
>  
> +	acomp_ctx->sg_outputs = kmalloc(sizeof(*acomp_ctx->sg_outputs),
> +					GFP_KERNEL);
> +	if (!acomp_ctx->sg_outputs)
> +		goto fail;
> +
> +	if (sg_alloc_table(acomp_ctx->sg_outputs, pool->compr_batch_size,
> +			   GFP_KERNEL))
> +		goto fail;
> +
> +	/*
> +	 * Statically map the per-CPU destination buffers to the per-CPU
> +	 * SG lists.
> +	 */
> +	for_each_sg(acomp_ctx->sg_outputs->sgl, sg, pool->compr_batch_size, i)
> +		sg_set_buf(sg, acomp_ctx->buffers[i], PAGE_SIZE);
> +
>  	/*
>  	 * if the backend of acomp is async zip, crypto_req_done() will wakeup
>  	 * crypto_wait_req(); if the backend of acomp is scomp, the callback
> @@ -869,84 +892,177 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
>  	return ret;
>  }
>  
> -static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> -			   struct zswap_pool *pool, bool wb_enabled)
> +/*
> + * Unified code path for compressors that do and do not support batching. This
> + * procedure will compress multiple @nr_pages in @folio starting from the
> + * @start index.
> + *
> + * It is assumed that @nr_pages <= ZSWAP_MAX_BATCH_SIZE. zswap_store() makes
> + * sure of this by design and zswap_store_pages() warns if this is not
> + * true.
> + *
> + * @nr_pages can be in (1, ZSWAP_MAX_BATCH_SIZE] even if the compressor does not
> + * support batching.
> + *
> + * If @pool->compr_batch_size is 1, each page is processed sequentially.
> + *
> + * If @pool->compr_batch_size is > 1, compression batching is invoked within
> + * the algorithm's driver, except if @nr_pages is 1: if so, the driver can
> + * choose to call the sequential/non-batching compress API.
> + *
> + * In both cases, if all compressions are successful, the compressed buffers
> + * are stored in zsmalloc.
> + *
> + * Traversing multiple SG lists when @nr_comps is > 1 is expensive, and impacts
> + * batching performance if we were to repeat this operation multiple times,
> + * such as:
> + *   - to map destination buffers to each SG list in the @acomp_ctx->sg_outputs
> + *     sg_table.
> + *   - to initialize each output SG list's @sg->length to PAGE_SIZE.
> + *   - to get the compressed output length in each @sg->length.
> + *
> + * These are some design choices made to optimize batching with SG lists:
> + *
> + * 1) The source folio pages in the batch are directly submitted to
> + *    crypto_acomp via acomp_request_set_src_folio().
> + *
> + * 2) The per-CPU @acomp_ctx->sg_outputs scatterlists are used to set up
> + *    destination buffers for interfacing with crypto_acomp.
> + *
> + * 3) To optimize performance, we map the per-CPU @acomp_ctx->buffers to the
> + *    @acomp_ctx->sg_outputs->sgl SG lists at pool creation time. The only task
> + *    remaining to be done for the output SG lists in zswap_compress() is to
> + *    set each @sg->length to PAGE_SIZE. This is done in zswap_compress()
> + *    for non-batching compressors. This needs to be done within the compress
> + *    batching driver procedure as part of iterating through the SG lists for
> + *    batch setup, so as to minimize expensive traversals through the SG lists.
> + *
> + * 4) Important requirements for batching compressors:
> + *    - Each @sg->length in @acomp_ctx->req->sg_outputs->sgl should reflect the
> + *      compression outcome for that specific page, and be set to:
> + *      - the page's compressed length, or
> + *      - the compression error value for that page.
> + *    - The @acomp_ctx->req->dlen should be set to the first page's
> + *      @sg->length. This enables code generalization in zswap_compress()
> + *      for non-batching and batching compressors.
> + *
> + * acomp_ctx mutex locking:
> + *    Earlier, the mutex was held per page compression. With the new code,
> + *    [un]locking the mutex per page caused regressions for software
> + *    compressors. We now lock the mutex once per batch, which resolves the
> + *    regression.
> + */

Please, no huge comments describing what the code is doing. If there's
anything that is not clear from reading the code or needs to be
explained or documented, please do so **concisely** in the relevant part
of the function.

> +static bool zswap_compress(struct folio *folio, long start, unsigned int nr_pages,
> +			   struct zswap_entry *entries[], struct zswap_pool *pool,
> +			   int nid, bool wb_enabled)
>  {
> +	gfp_t gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
> +	unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
> +	unsigned int slen = nr_comps * PAGE_SIZE;
>  	struct crypto_acomp_ctx *acomp_ctx;
> -	struct scatterlist input, output;
> -	int comp_ret = 0, alloc_ret = 0;
> -	unsigned int dlen = PAGE_SIZE;
> +	int err = 0, err_sg = 0;
> +	struct scatterlist *sg;
> +	unsigned int i, j, k;
>  	unsigned long handle;
> -	gfp_t gfp;
> -	u8 *dst;
> -	bool mapped = false;
> +	int *errp, dlen;
> +	void *dst;
>  
>  	acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
>  	mutex_lock(&acomp_ctx->mutex);
>  
> -	dst = acomp_ctx->buffers[0];
> -	sg_init_table(&input, 1);
> -	sg_set_page(&input, page, PAGE_SIZE, 0);
> -
> -	sg_init_one(&output, dst, PAGE_SIZE);
> -	acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
> +	errp = (pool->compr_batch_size == 1) ? &err : &err_sg;

err_sg is not used anywhere, so *errp could end up being garbage. Why do
we need this?

>  
>  	/*
> -	 * it maybe looks a little bit silly that we send an asynchronous request,
> -	 * then wait for its completion synchronously. This makes the process look
> -	 * synchronous in fact.
> -	 * Theoretically, acomp supports users send multiple acomp requests in one
> -	 * acomp instance, then get those requests done simultaneously. but in this
> -	 * case, zswap actually does store and load page by page, there is no
> -	 * existing method to send the second page before the first page is done
> -	 * in one thread doing zswap.
> -	 * but in different threads running on different cpu, we have different
> -	 * acomp instance, so multiple threads can do (de)compression in parallel.
> +	 * [i] refers to the incoming batch space and is used to
> +	 *     index into the folio pages.
> +	 *
> +	 * [j] refers to the incoming batch space and is used to
> +	 *     index into the @entries for the folio's pages in this
> +	 *     batch, per compress call while iterating over the output SG
> +	 *     lists. Also used to index into the folio's pages from @start,
> +	 *     in case of compress errors.
> +	 *
> +	 * [k] refers to the @acomp_ctx space, as determined by
> +	 *     @pool->compr_batch_size, and is used to index into
> +	 *     @acomp_ctx->sg_outputs->sgl and @acomp_ctx->buffers.
>  	 */
> -	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> -	dlen = acomp_ctx->req->dlen;
> +	for (i = 0; i < nr_pages; i += nr_comps) {

What are looping over here? I thought zswap_compress() takes in exactly
one batch.

> +		acomp_request_set_src_folio(acomp_ctx->req, folio,
> +					    (start + i) * PAGE_SIZE,
> +					    slen);
>  
> -	/*
> -	 * If a page cannot be compressed into a size smaller than PAGE_SIZE,
> -	 * save the content as is without a compression, to keep the LRU order
> -	 * of writebacks.  If writeback is disabled, reject the page since it
> -	 * only adds metadata overhead.  swap_writeout() will put the page back
> -	 * to the active LRU list in the case.
> -	 */
> -	if (comp_ret || !dlen || dlen >= PAGE_SIZE) {
> -		if (!wb_enabled) {
> -			comp_ret = comp_ret ? comp_ret : -EINVAL;
> -			goto unlock;
> -		}
> -		comp_ret = 0;
> -		dlen = PAGE_SIZE;
> -		dst = kmap_local_page(page);
> -		mapped = true;
> -	}
> +		acomp_ctx->sg_outputs->sgl->length = slen;
>  
> -	gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
> -	handle = zs_malloc(pool->zs_pool, dlen, gfp, page_to_nid(page));
> -	if (IS_ERR_VALUE(handle)) {
> -		alloc_ret = PTR_ERR((void *)handle);
> -		goto unlock;
> -	}
> +		acomp_request_set_dst_sg(acomp_ctx->req,
> +					 acomp_ctx->sg_outputs->sgl,
> +					 slen);
> +
> +		err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
> +				      &acomp_ctx->wait);
> +
> +		acomp_ctx->sg_outputs->sgl->length = acomp_ctx->req->dlen;
> +
> +		/*
> +		 * If a page cannot be compressed into a size smaller than
> +		 * PAGE_SIZE, save the content as is without a compression, to
> +		 * keep the LRU order of writebacks.  If writeback is disabled,
> +		 * reject the page since it only adds metadata overhead.
> +		 * swap_writeout() will put the page back to the active LRU list
> +		 * in the case.
> +		 *
> +		 * It is assumed that any compressor that sets the output length
> +		 * to 0 or a value >= PAGE_SIZE will also return a negative
> +		 * error status in @err; i.e, will not return a successful
> +		 * compression status in @err in this case.
> +		 */

Ugh, checking the compression error and checking the compression length
are now in separate places so we need to check if writeback is disabled
in separate places and store the page as-is. It's ugly, and I think the
current code is not correct.

> +		if (err && !wb_enabled)
> +			goto compress_error;
> +
> +		for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> +			j = k + i;

Please use meaningful iterator names rather than i, j, and k and the huge
comment explaining what they are.

> +			dst = acomp_ctx->buffers[k];
> +			dlen = sg->length | *errp;

Why are we doing this?

> +
> +			if (dlen < 0) {

We should do the incompressible page handling also if dlen is PAGE_SIZE,
or if the compression failed (I guess that's the intention of bit OR'ing
with *errp?)

> +				dlen = PAGE_SIZE;
> +				dst = kmap_local_page(folio_page(folio, start + j));
> +			}
> +
> +			handle = zs_malloc(pool->zs_pool, dlen, gfp, nid);
>  
> -	zs_obj_write(pool->zs_pool, handle, dst, dlen);
> -	entry->handle = handle;
> -	entry->length = dlen;
> +			if (IS_ERR_VALUE(handle)) {
> +				if (PTR_ERR((void *)handle) == -ENOSPC)
> +					zswap_reject_compress_poor++;
> +				else
> +					zswap_reject_alloc_fail++;
>  
> -unlock:
> -	if (mapped)
> -		kunmap_local(dst);
> -	if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
> -		zswap_reject_compress_poor++;
> -	else if (comp_ret)
> -		zswap_reject_compress_fail++;
> -	else if (alloc_ret)
> -		zswap_reject_alloc_fail++;
> +				goto err_unlock;
> +			}
> +
> +			zs_obj_write(pool->zs_pool, handle, dst, dlen);
> +			entries[j]->handle = handle;
> +			entries[j]->length = dlen;
> +			if (dst != acomp_ctx->buffers[k])
> +				kunmap_local(dst);
> +		}
> +	} /* finished compress and store nr_pages. */
> +
> +	mutex_unlock(&acomp_ctx->mutex);
> +	return true;
> +
> +compress_error:
> +	for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> +		if ((int)sg->length < 0) {
> +			if ((int)sg->length == -ENOSPC)
> +				zswap_reject_compress_poor++;
> +			else
> +				zswap_reject_compress_fail++;
> +		}
> +	}
>  
> +err_unlock:
>  	mutex_unlock(&acomp_ctx->mutex);
> -	return comp_ret == 0 && alloc_ret == 0;
> +	return false;
>  }
>  
>  static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> @@ -1488,12 +1604,9 @@ static bool zswap_store_pages(struct folio *folio,
>  		INIT_LIST_HEAD(&entries[i]->lru);
>  	}
>  
> -	for (i = 0; i < nr_pages; ++i) {
> -		struct page *page = folio_page(folio, start + i);
> -
> -		if (!zswap_compress(page, entries[i], pool, wb_enabled))
> -			goto store_pages_failed;
> -	}
> +	if (unlikely(!zswap_compress(folio, start, nr_pages, entries, pool,
> +				     nid, wb_enabled)))
> +		goto store_pages_failed;
>  
>  	for (i = 0; i < nr_pages; ++i) {
>  		struct zswap_entry *old, *entry = entries[i];
> -- 
> 2.27.0
>

RE: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Sridhar, Kanchana P 2 months, 3 weeks ago

> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Thursday, November 13, 2025 1:35 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
> compress batching of large folios.
> 
> On Tue, Nov 04, 2025 at 01:12:35AM -0800, Kanchana P Sridhar wrote:
> > This patch introduces a new unified implementation of zswap_compress()
> > for compressors that do and do not support batching. This eliminates
> > code duplication and facilitates code maintainability with the
> > introduction of compress batching.
> >
> > The vectorized implementation of calling the earlier zswap_compress()
> > sequentially, one page at a time in zswap_store_pages(), is replaced
> > with this new version of zswap_compress() that accepts multiple pages to
> > compress as a batch.
> >
> > If the compressor does not support batching, each page in the batch is
> > compressed and stored sequentially. If the compressor supports batching,
> > for e.g., 'deflate-iaa', the Intel IAA hardware accelerator, the batch
> > is compressed in parallel in hardware. If the batch is compressed
> > without errors, the compressed buffers are then stored in zsmalloc. In
> > case of compression errors, the current behavior is preserved for the
> > batching zswap_compress(): if the folio's memcg is writeback enabled,
> > pages with compression errors are store uncompressed in zsmalloc; if
> > not, we return an error for the folio in zswap_store().
> >
> > As per Herbert's suggestion in [1] for batching to be based on SG lists
> > to interface with the crypto API, a "struct sg_table *sg_outputs" is
> > added to the per-CPU acomp_ctx. In zswap_cpu_comp_prepare(), memory
> is
> > allocated for @pool->compr_batch_size scatterlists in
> > @acomp_ctx->sg_outputs. The per-CPU @acomp_ctx->buffers' addresses
> are
> > statically mapped to the respective SG lists. The existing non-NUMA
> > sg_alloc_table() was found to give better performance than a NUMA-aware
> > allocation function, hence is used in this patch.
> >
> > Batching compressors should initialize the output SG lengths to
> > PAGE_SIZE as part of the internal compress batching setup, to avoid
> > having to do multiple traversals over the @acomp_ctx->sg_outputs->sgl.
> > This is exactly how batching is implemented in the iaa_crypto driver's
> > compress batching procedure, iaa_comp_acompress_batch().
> >
> > The batched zswap_compress() implementation is generalized as much as
> > possible for non-batching and batching compressors, so that the
> > subsequent incompressible page handling, zs_pool writes, and error
> > handling code is seamless for both, without the use of conditionals to
> > switch to specialized code for either.
> >
> > The new batching implementation of zswap_compress() is called with a
> > batch of @nr_pages sent from zswap_store() to zswap_store_pages().
> > zswap_compress() steps through the batch in increments of the
> > compressor's batch-size, sets up the acomp_ctx->req's src/dst SG lists
> > to contain the folio pages and output buffers, before calling
> > crypto_acomp_compress().
> >
> > Some important requirements of this batching architecture for batching
> > compressors:
> >
> >   1) The output SG lengths for each sg in the acomp_req->dst should be
> >      intialized to PAGE_SIZE as part of other batch setup in the batch
> >      compression function. zswap will not take care of this in the
> >      interest of avoiding repetitive traversals of the
> >      @acomp_ctx->sg_outputs->sgl so as to not lose the benefits of
> >      batching.
> >
> >   2) In case of a compression error for any page in the batch, the
> >      batching compressor should set the corresponding @sg->length to a
> >      negative error number, as suggested by Herbert. Otherwise, the
> >      @sg->length will contain the compressed output length.
> >
> >   3) Batching compressors should set acomp_req->dlen to
> >      acomp_req->dst->length, i.e., the sg->length of the first SG in
> >      acomp_req->dst.
> >
> > Another important change this patch makes is with the acomp_ctx mutex
> > locking in zswap_compress(). Earlier, the mutex was held per page's
> > compression. With the new code, [un]locking the mutex per page caused
> > regressions for software compressors when testing with 30 usemem
> > processes, and also kernel compilation with 'allmod' config. The
> > regressions were more eggregious when PMD folios were stored. The
> > implementation in this commit locks/unlocks the mutex once per batch,
> > that resolves the regression.
> >
> > Architectural considerations for the zswap batching framework:
> >
> ==============================================================
> > We have designed the zswap batching framework to be
> > hardware-agnostic. It has no dependencies on Intel-specific features and
> > can be leveraged by any hardware accelerator or software-based
> > compressor. In other words, the framework is open and inclusive by
> > design.
> >
> > Other ongoing work that can use batching:
> > =========================================
> > This patch-series demonstrates the performance benefits of compress
> > batching when used in zswap_store() of large folios. shrink_folio_list()
> > "reclaim batching" of any-order folios is the major next work that uses
> > the zswap compress batching framework: our testing of kernel_compilation
> > with writeback and the zswap shrinker indicates 10X fewer pages get
> > written back when we reclaim 32 folios as a batch, as compared to one
> > folio at a time: this is with deflate-iaa and with zstd. We expect to
> > submit a patch-series with this data and the resulting performance
> > improvements shortly. Reclaim batching relieves memory pressure faster
> > than reclaiming one folio at a time, hence alleviates the need to scan
> > slab memory for writeback.
> >
> > Nhat has given ideas on using batching with the ongoing kcompressd work,
> > as well as beneficially using decompression batching & block IO batching
> > to improve zswap writeback efficiency.
> >
> > Experiments that combine zswap compress batching, reclaim batching,
> > swapin_readahead() decompression batching of prefetched pages, and
> > writeback batching show that 0 pages are written back with deflate-iaa
> > and zstd. For comparison, the baselines for these compressors see
> > 200K-800K pages written to disk (kernel compilation 'allmod' config).
> >
> > To summarize, these are future clients of the batching framework:
> >
> >    - shrink_folio_list() reclaim batching of multiple folios:
> >        Implemented, will submit patch-series.
> >    - zswap writeback with decompress batching:
> >        Implemented, will submit patch-series.
> >    - zram:
> >        Implemented, will submit patch-series.
> >    - kcompressd:
> >        Not yet implemented.
> >    - file systems:
> >        Not yet implemented.
> >    - swapin_readahead() decompression batching of prefetched pages:
> >        Implemented, will submit patch-series.
> >
> > Additionally, any place we have folios that need to be compressed, can
> > potentially be parallelized.
> >
> > Performance data:
> > =================
> >
> > As suggested by Barry, this is the performance data gathered on Intel
> > Sapphire Rapids with usemem 30 processes running at 50% memory
> pressure
> > and kernel_compilation/allmod config run with 2G limit using 32
> > threads. To keep comparisons simple, all testing was done without the
> > zswap shrinker.
> >
> >   usemem30 with 64K folios:
> >   =========================
> >
> >      zswap shrinker_enabled = N.
> >
> >      -----------------------------------------------------------------------
> >                      mm-unstable-10-24-2025             v13
> >      -----------------------------------------------------------------------
> >      zswap compressor          deflate-iaa     deflate-iaa   IAA Batching
> >                                                                  vs.
> >                                                              IAA Sequential
> >      -----------------------------------------------------------------------
> >      Total throughput (KB/s)     6,118,675       9,901,216       62%
> >      Average throughput (KB/s)     203,955         330,040       62%
> >      elapsed time (sec)              98.94           70.90      -28%
> >      sys time (sec)               2,379.29        1,686.18      -29%
> >      -----------------------------------------------------------------------
> >
> >      -----------------------------------------------------------------------
> >                      mm-unstable-10-24-2025             v13
> >      -----------------------------------------------------------------------
> >      zswap compressor                 zstd            zstd   v13 zstd
> >                                                              improvement
> >      -----------------------------------------------------------------------
> >      Total throughput (KB/s)     5,983,561       6,003,851      0.3%
> >      Average throughput (KB/s)     199,452         200,128      0.3%
> >      elapsed time (sec)             100.93           96.62     -4.3%
> >      sys time (sec)               2,532.49        2,395.83       -5%
> >      -----------------------------------------------------------------------
> >
> >   usemem30 with 2M folios:
> >   ========================
> >
> >      -----------------------------------------------------------------------
> >                      mm-unstable-10-24-2025             v13
> >      -----------------------------------------------------------------------
> >      zswap compressor          deflate-iaa     deflate-iaa   IAA Batching
> >                                                                  vs.
> >                                                              IAA Sequential
> >      -----------------------------------------------------------------------
> >      Total throughput (KB/s)     6,309,635      10,558,225       67%
> >      Average throughput (KB/s)     210,321         351,940       67%
> >      elapsed time (sec)              88.70           67.84      -24%
> >      sys time (sec)               2,059.83        1,581.07      -23%
> >      -----------------------------------------------------------------------
> >
> >      -----------------------------------------------------------------------
> >                      mm-unstable-10-24-2025             v13
> >      -----------------------------------------------------------------------
> >      zswap compressor                 zstd            zstd   v13 zstd
> >                                                              improvement
> >      -----------------------------------------------------------------------
> >      Total throughput (KB/s)     6,562,687       6,567,946      0.1%
> >      Average throughput (KB/s)     218,756         218,931      0.1%
> >      elapsed time (sec)              94.69           88.79       -6%
> >      sys time (sec)               2,253.97        2,083.43       -8%
> >      -----------------------------------------------------------------------
> >
> >     The main takeaway from usemem, a workload that is mostly compression
> >     dominated (very few swapins) is that the higher the number of batches,
> >     such as with larger folios, the more the benefit of batching cost
> >     amortization, as shown by the PMD usemem data. This aligns well
> >     with the future direction for batching.
> >
> > kernel_compilation/allmodconfig, 64K folios:
> > ============================================
> >
> >      --------------------------------------------------------------------------
> >                mm-unstable-10-24-2025             v13
> >      --------------------------------------------------------------------------
> >      zswap compressor    deflate-iaa     deflate-iaa    IAA Batching
> >                                                              vs.
> >                                                         IAA Sequential
> >      --------------------------------------------------------------------------
> >      real_sec                 836.64          806.94      -3.5%
> >      sys_sec                3,897.57        3,661.83        -6%
> >      --------------------------------------------------------------------------
> >
> >      --------------------------------------------------------------------------
> >                mm-unstable-10-24-2025             v13
> >      --------------------------------------------------------------------------
> >      zswap compressor           zstd            zstd    Improvement
> >      --------------------------------------------------------------------------
> >      real_sec                 880.62          850.41      -3.4%
> >      sys_sec                5,171.90        5,076.51      -1.8%
> >      --------------------------------------------------------------------------
> >
> > kernel_compilation/allmodconfig, PMD folios:
> > ============================================
> >
> >      --------------------------------------------------------------------------
> >                mm-unstable-10-24-2025             v13
> >      --------------------------------------------------------------------------
> >      zswap compressor    deflate-iaa     deflate-iaa    IAA Batching
> >                                                              vs.
> >                                                         IAA Sequential
> >      --------------------------------------------------------------------------
> >      real_sec                 818.48          779.67      -4.7%
> >      sys_sec                4,226.52        4,245.18       0.4%
> >      --------------------------------------------------------------------------
> >
> >      --------------------------------------------------------------------------
> >               mm-unstable-10-24-2025             v13
> >      --------------------------------------------------------------------------
> >      zswap compressor          zstd             zstd    Improvement
> >      --------------------------------------------------------------------------
> >      real_sec                888.45           849.54      -4.4%
> >      sys_sec               5,866.72         5,847.17      -0.3%
> >      --------------------------------------------------------------------------
> >
> > [1]:
> https://lore.kernel.org/all/aJ7Fk6RpNc815Ivd@gondor.apana.org.au/T/#m99
> aea2ce3d284e6c5a3253061d97b08c4752a798
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> 
> I won't go through the commit log and rewrite for this one too, but
> please do so similar to how I did for the previous patches. Do not
> describe the code, give a high-level overview of what is happening and
> why it's happeneing, as well as very concise performance results.

With all due respect, I am not describing the code. zswap compress batching
is a major architectural change and I am documenting the changes from the
status quo, for other zswap developers. Yes, some of this might involve
weaving in repetition of current behavior, again to stress the backward
compatibility of main concepts.

I believe there is not one redundant datapoint when it comes to performance
metrics in this summary - please elaborate. Thanks.

> 
> Do not include things that only make sense in the context of a patch and
> won't make sense as part of git histroy.

This makes sense, duly noted and will be addressed.

> 
> That being said, I'd like Herbert to review this patch and make sure the
> scatterlist and crypto APIs are being used correctly as he advised
> earlier. I do have some comments on the zswap side though.
> 
> > ---
> >  mm/zswap.c | 249 ++++++++++++++++++++++++++++++++++++++----------
> -----
> >  1 file changed, 181 insertions(+), 68 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 257567edc587..c5487dd69ec6 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -143,6 +143,7 @@ struct crypto_acomp_ctx {
> >  	struct acomp_req *req;
> >  	struct crypto_wait wait;
> >  	u8 **buffers;
> > +	struct sg_table *sg_outputs;
> >  	struct mutex mutex;
> >  	bool is_sleepable;
> >  };
> > @@ -271,6 +272,11 @@ static void acomp_ctx_dealloc(struct
> crypto_acomp_ctx *acomp_ctx, u8 nr_buffers)
> >  			kfree(acomp_ctx->buffers[i]);
> >  		kfree(acomp_ctx->buffers);
> >  	}
> > +
> > +	if (acomp_ctx->sg_outputs) {
> > +		sg_free_table(acomp_ctx->sg_outputs);
> > +		kfree(acomp_ctx->sg_outputs);
> > +	}
> >  }
> >
> >  static struct zswap_pool *zswap_pool_create(char *compressor)
> > @@ -804,6 +810,7 @@ static int zswap_cpu_comp_prepare(unsigned int
> cpu, struct hlist_node *node)
> >  	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool,
> node);
> >  	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool-
> >acomp_ctx, cpu);
> >  	int nid = cpu_to_node(cpu);
> > +	struct scatterlist *sg;
> >  	int ret = -ENOMEM;
> >  	u8 i;
> >
> > @@ -849,6 +856,22 @@ static int zswap_cpu_comp_prepare(unsigned int
> cpu, struct hlist_node *node)
> >  			goto fail;
> >  	}
> >
> > +	acomp_ctx->sg_outputs = kmalloc(sizeof(*acomp_ctx->sg_outputs),
> > +					GFP_KERNEL);
> > +	if (!acomp_ctx->sg_outputs)
> > +		goto fail;
> > +
> > +	if (sg_alloc_table(acomp_ctx->sg_outputs, pool->compr_batch_size,
> > +			   GFP_KERNEL))
> > +		goto fail;
> > +
> > +	/*
> > +	 * Statically map the per-CPU destination buffers to the per-CPU
> > +	 * SG lists.
> > +	 */
> > +	for_each_sg(acomp_ctx->sg_outputs->sgl, sg, pool-
> >compr_batch_size, i)
> > +		sg_set_buf(sg, acomp_ctx->buffers[i], PAGE_SIZE);
> > +
> >  	/*
> >  	 * if the backend of acomp is async zip, crypto_req_done() will
> wakeup
> >  	 * crypto_wait_req(); if the backend of acomp is scomp, the callback
> > @@ -869,84 +892,177 @@ static int zswap_cpu_comp_prepare(unsigned
> int cpu, struct hlist_node *node)
> >  	return ret;
> >  }
> >
> > -static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> > -			   struct zswap_pool *pool, bool wb_enabled)
> > +/*
> > + * Unified code path for compressors that do and do not support batching.
> This
> > + * procedure will compress multiple @nr_pages in @folio starting from the
> > + * @start index.
> > + *
> > + * It is assumed that @nr_pages <= ZSWAP_MAX_BATCH_SIZE.
> zswap_store() makes
> > + * sure of this by design and zswap_store_pages() warns if this is not
> > + * true.
> > + *
> > + * @nr_pages can be in (1, ZSWAP_MAX_BATCH_SIZE] even if the
> compressor does not
> > + * support batching.
> > + *
> > + * If @pool->compr_batch_size is 1, each page is processed sequentially.
> > + *
> > + * If @pool->compr_batch_size is > 1, compression batching is invoked
> within
> > + * the algorithm's driver, except if @nr_pages is 1: if so, the driver can
> > + * choose to call the sequential/non-batching compress API.
> > + *
> > + * In both cases, if all compressions are successful, the compressed buffers
> > + * are stored in zsmalloc.
> > + *
> > + * Traversing multiple SG lists when @nr_comps is > 1 is expensive, and
> impacts
> > + * batching performance if we were to repeat this operation multiple
> times,
> > + * such as:
> > + *   - to map destination buffers to each SG list in the @acomp_ctx-
> >sg_outputs
> > + *     sg_table.
> > + *   - to initialize each output SG list's @sg->length to PAGE_SIZE.
> > + *   - to get the compressed output length in each @sg->length.
> > + *
> > + * These are some design choices made to optimize batching with SG lists:
> > + *
> > + * 1) The source folio pages in the batch are directly submitted to
> > + *    crypto_acomp via acomp_request_set_src_folio().
> > + *
> > + * 2) The per-CPU @acomp_ctx->sg_outputs scatterlists are used to set up
> > + *    destination buffers for interfacing with crypto_acomp.
> > + *
> > + * 3) To optimize performance, we map the per-CPU @acomp_ctx->buffers
> to the
> > + *    @acomp_ctx->sg_outputs->sgl SG lists at pool creation time. The only
> task
> > + *    remaining to be done for the output SG lists in zswap_compress() is to
> > + *    set each @sg->length to PAGE_SIZE. This is done in zswap_compress()
> > + *    for non-batching compressors. This needs to be done within the
> compress
> > + *    batching driver procedure as part of iterating through the SG lists for
> > + *    batch setup, so as to minimize expensive traversals through the SG
> lists.
> > + *
> > + * 4) Important requirements for batching compressors:
> > + *    - Each @sg->length in @acomp_ctx->req->sg_outputs->sgl should
> reflect the
> > + *      compression outcome for that specific page, and be set to:
> > + *      - the page's compressed length, or
> > + *      - the compression error value for that page.
> > + *    - The @acomp_ctx->req->dlen should be set to the first page's
> > + *      @sg->length. This enables code generalization in zswap_compress()
> > + *      for non-batching and batching compressors.
> > + *
> > + * acomp_ctx mutex locking:
> > + *    Earlier, the mutex was held per page compression. With the new code,
> > + *    [un]locking the mutex per page caused regressions for software
> > + *    compressors. We now lock the mutex once per batch, which resolves
> the
> > + *    regression.
> > + */
> 
> Please, no huge comments describing what the code is doing. If there's
> anything that is not clear from reading the code or needs to be
> explained or documented, please do so **concisely** in the relevant part
> of the function.

Again, these are important requirements related to the major change, i.e.,
batching, wrt why/how. I think it is important to note considerations for the
next batching algorithm, just like I have done within the IAA driver. To be very
clear, I am not describing code.

If questions arise as to why the mutex is being locked per batch as against
per page, I think the comment above is helpful and saves time for folks to
understand the "why".

> 
> > +static bool zswap_compress(struct folio *folio, long start, unsigned int
> nr_pages,
> > +			   struct zswap_entry *entries[], struct zswap_pool
> *pool,
> > +			   int nid, bool wb_enabled)
> >  {
> > +	gfp_t gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM |
> __GFP_MOVABLE;
> > +	unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
> > +	unsigned int slen = nr_comps * PAGE_SIZE;
> >  	struct crypto_acomp_ctx *acomp_ctx;
> > -	struct scatterlist input, output;
> > -	int comp_ret = 0, alloc_ret = 0;
> > -	unsigned int dlen = PAGE_SIZE;
> > +	int err = 0, err_sg = 0;
> > +	struct scatterlist *sg;
> > +	unsigned int i, j, k;
> >  	unsigned long handle;
> > -	gfp_t gfp;
> > -	u8 *dst;
> > -	bool mapped = false;
> > +	int *errp, dlen;
> > +	void *dst;
> >
> >  	acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> >  	mutex_lock(&acomp_ctx->mutex);
> >
> > -	dst = acomp_ctx->buffers[0];
> > -	sg_init_table(&input, 1);
> > -	sg_set_page(&input, page, PAGE_SIZE, 0);
> > -
> > -	sg_init_one(&output, dst, PAGE_SIZE);
> > -	acomp_request_set_params(acomp_ctx->req, &input, &output,
> PAGE_SIZE, dlen);
> > +	errp = (pool->compr_batch_size == 1) ? &err : &err_sg;
> 
> err_sg is not used anywhere, so *errp could end up being garbage. Why do
> we need this?

err_sg is initialized to 0 and never changes. It can never be garbage.
We need this because of the current dichotomy between software compressors
and IAA in the sg->length based error handling per Herbert's suggestions,
included in the huge function comment block. It is needed to avoid branches
and have the zswap_compress() code look seamless for all compressors.

> 
> >
> >  	/*
> > -	 * it maybe looks a little bit silly that we send an asynchronous
> request,
> > -	 * then wait for its completion synchronously. This makes the process
> look
> > -	 * synchronous in fact.
> > -	 * Theoretically, acomp supports users send multiple acomp requests
> in one
> > -	 * acomp instance, then get those requests done simultaneously. but
> in this
> > -	 * case, zswap actually does store and load page by page, there is no
> > -	 * existing method to send the second page before the first page is
> done
> > -	 * in one thread doing zswap.
> > -	 * but in different threads running on different cpu, we have different
> > -	 * acomp instance, so multiple threads can do (de)compression in
> parallel.
> > +	 * [i] refers to the incoming batch space and is used to
> > +	 *     index into the folio pages.
> > +	 *
> > +	 * [j] refers to the incoming batch space and is used to
> > +	 *     index into the @entries for the folio's pages in this
> > +	 *     batch, per compress call while iterating over the output SG
> > +	 *     lists. Also used to index into the folio's pages from @start,
> > +	 *     in case of compress errors.
> > +	 *
> > +	 * [k] refers to the @acomp_ctx space, as determined by
> > +	 *     @pool->compr_batch_size, and is used to index into
> > +	 *     @acomp_ctx->sg_outputs->sgl and @acomp_ctx->buffers.
> >  	 */
> > -	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx-
> >req), &acomp_ctx->wait);
> > -	dlen = acomp_ctx->req->dlen;
> > +	for (i = 0; i < nr_pages; i += nr_comps) {
> 
> What are looping over here? I thought zswap_compress() takes in exactly
> one batch.

We are iterating once over one batch for batching compressors, and one
page at a time for software.

> 
> > +		acomp_request_set_src_folio(acomp_ctx->req, folio,
> > +					    (start + i) * PAGE_SIZE,
> > +					    slen);
> >
> > -	/*
> > -	 * If a page cannot be compressed into a size smaller than PAGE_SIZE,
> > -	 * save the content as is without a compression, to keep the LRU
> order
> > -	 * of writebacks.  If writeback is disabled, reject the page since it
> > -	 * only adds metadata overhead.  swap_writeout() will put the page
> back
> > -	 * to the active LRU list in the case.
> > -	 */
> > -	if (comp_ret || !dlen || dlen >= PAGE_SIZE) {
> > -		if (!wb_enabled) {
> > -			comp_ret = comp_ret ? comp_ret : -EINVAL;
> > -			goto unlock;
> > -		}
> > -		comp_ret = 0;
> > -		dlen = PAGE_SIZE;
> > -		dst = kmap_local_page(page);
> > -		mapped = true;
> > -	}
> > +		acomp_ctx->sg_outputs->sgl->length = slen;
> >
> > -	gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM |
> __GFP_MOVABLE;
> > -	handle = zs_malloc(pool->zs_pool, dlen, gfp, page_to_nid(page));
> > -	if (IS_ERR_VALUE(handle)) {
> > -		alloc_ret = PTR_ERR((void *)handle);
> > -		goto unlock;
> > -	}
> > +		acomp_request_set_dst_sg(acomp_ctx->req,
> > +					 acomp_ctx->sg_outputs->sgl,
> > +					 slen);
> > +
> > +		err = crypto_wait_req(crypto_acomp_compress(acomp_ctx-
> >req),
> > +				      &acomp_ctx->wait);
> > +
> > +		acomp_ctx->sg_outputs->sgl->length = acomp_ctx->req-
> >dlen;
> > +
> > +		/*
> > +		 * If a page cannot be compressed into a size smaller than
> > +		 * PAGE_SIZE, save the content as is without a compression,
> to
> > +		 * keep the LRU order of writebacks.  If writeback is disabled,
> > +		 * reject the page since it only adds metadata overhead.
> > +		 * swap_writeout() will put the page back to the active LRU
> list
> > +		 * in the case.
> > +		 *
> > +		 * It is assumed that any compressor that sets the output
> length
> > +		 * to 0 or a value >= PAGE_SIZE will also return a negative
> > +		 * error status in @err; i.e, will not return a successful
> > +		 * compression status in @err in this case.
> > +		 */
> 
> Ugh, checking the compression error and checking the compression length
> are now in separate places so we need to check if writeback is disabled
> in separate places and store the page as-is. It's ugly, and I think the
> current code is not correct.

The code is 100% correct. You need to spend more time understanding
the code. I have stated my assumption above in the comments to
help in understanding the "why".

From a maintainer, I would expect more responsible statements than
this. A flippant remark made without understanding the code (and,
disparaging the comments intended to help you do this), can impact
someone's career. I am held accountable in my job based on your
comments.

That said, I have worked tirelessly and innovated to make the code
compliant with Herbert's suggestions (which btw have enabled an
elegant batching implementation and code commonality for IAA and
software compressors), validated it thoroughly for IAA and ZSTD to
ensure that both demonstrate performance improvements, which
are crucial for memory savings. I am proud of this work.


> 
> > +		if (err && !wb_enabled)
> > +			goto compress_error;
> > +
> > +		for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> > +			j = k + i;
> 
> Please use meaningful iterator names rather than i, j, and k and the huge
> comment explaining what they are.

I happen to have a different view: having longer iterator names firstly makes
code seem "verbose" and detracts from readability, not to mention exceeding the
80-character line limit. The comments are essential for code maintainability
and avoid out-of-bounds errors when the next zswap developer wants to
optimize the code.

One drawback of i/j/k iterators is mis-typing errors which cannot be caught
at compile time. Let me think some more about how to strike a good balance.

> 
> > +			dst = acomp_ctx->buffers[k];
> > +			dlen = sg->length | *errp;
> 
> Why are we doing this?
> 
> > +
> > +			if (dlen < 0) {
> 
> We should do the incompressible page handling also if dlen is PAGE_SIZE,
> or if the compression failed (I guess that's the intention of bit OR'ing
> with *errp?)

Yes, indeed: that's the intention of bit OR'ing with *errp.

> 
> > +				dlen = PAGE_SIZE;
> > +				dst = kmap_local_page(folio_page(folio, start
> + j));
> > +			}
> > +
> > +			handle = zs_malloc(pool->zs_pool, dlen, gfp, nid);
> >
> > -	zs_obj_write(pool->zs_pool, handle, dst, dlen);
> > -	entry->handle = handle;
> > -	entry->length = dlen;
> > +			if (IS_ERR_VALUE(handle)) {
> > +				if (PTR_ERR((void *)handle) == -ENOSPC)
> > +					zswap_reject_compress_poor++;
> > +				else
> > +					zswap_reject_alloc_fail++;
> >
> > -unlock:
> > -	if (mapped)
> > -		kunmap_local(dst);
> > -	if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
> > -		zswap_reject_compress_poor++;
> > -	else if (comp_ret)
> > -		zswap_reject_compress_fail++;
> > -	else if (alloc_ret)
> > -		zswap_reject_alloc_fail++;
> > +				goto err_unlock;
> > +			}
> > +
> > +			zs_obj_write(pool->zs_pool, handle, dst, dlen);
> > +			entries[j]->handle = handle;
> > +			entries[j]->length = dlen;
> > +			if (dst != acomp_ctx->buffers[k])
> > +				kunmap_local(dst);
> > +		}
> > +	} /* finished compress and store nr_pages. */
> > +
> > +	mutex_unlock(&acomp_ctx->mutex);
> > +	return true;
> > +
> > +compress_error:
> > +	for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> > +		if ((int)sg->length < 0) {
> > +			if ((int)sg->length == -ENOSPC)
> > +				zswap_reject_compress_poor++;
> > +			else
> > +				zswap_reject_compress_fail++;
> > +		}
> > +	}
> >
> > +err_unlock:
> >  	mutex_unlock(&acomp_ctx->mutex);
> > -	return comp_ret == 0 && alloc_ret == 0;
> > +	return false;
> >  }
> >
> >  static bool zswap_decompress(struct zswap_entry *entry, struct folio
> *folio)
> > @@ -1488,12 +1604,9 @@ static bool zswap_store_pages(struct folio
> *folio,
> >  		INIT_LIST_HEAD(&entries[i]->lru);
> >  	}
> >
> > -	for (i = 0; i < nr_pages; ++i) {
> > -		struct page *page = folio_page(folio, start + i);
> > -
> > -		if (!zswap_compress(page, entries[i], pool, wb_enabled))
> > -			goto store_pages_failed;
> > -	}
> > +	if (unlikely(!zswap_compress(folio, start, nr_pages, entries, pool,
> > +				     nid, wb_enabled)))
> > +		goto store_pages_failed;
> >
> >  	for (i = 0; i < nr_pages; ++i) {
> >  		struct zswap_entry *old, *entry = entries[i];
> > --
> > 2.27.0
> >

Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Yosry Ahmed 2 months, 3 weeks ago

On Thu, Nov 13, 2025 at 11:55:10PM +0000, Sridhar, Kanchana P wrote:
> 
> > -----Original Message-----
> > From: Yosry Ahmed <yosry.ahmed@linux.dev>
> > Sent: Thursday, November 13, 2025 1:35 PM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> > ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> > senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> > crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> > <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
> > compress batching of large folios.
> > 
[..]
> > > +		/*
> > > +		 * If a page cannot be compressed into a size smaller than
> > > +		 * PAGE_SIZE, save the content as is without a compression,
> > to
> > > +		 * keep the LRU order of writebacks.  If writeback is disabled,
> > > +		 * reject the page since it only adds metadata overhead.
> > > +		 * swap_writeout() will put the page back to the active LRU
> > list
> > > +		 * in the case.
> > > +		 *
> > > +		 * It is assumed that any compressor that sets the output
> > length
> > > +		 * to 0 or a value >= PAGE_SIZE will also return a negative
> > > +		 * error status in @err; i.e, will not return a successful
> > > +		 * compression status in @err in this case.
> > > +		 */
> > 
> > Ugh, checking the compression error and checking the compression length
> > are now in separate places so we need to check if writeback is disabled
> > in separate places and store the page as-is. It's ugly, and I think the
> > current code is not correct.
> 
> The code is 100% correct. You need to spend more time understanding
> the code. I have stated my assumption above in the comments to
> help in understanding the "why".
> 
> From a maintainer, I would expect more responsible statements than
> this. A flippant remark made without understanding the code (and,
> disparaging the comments intended to help you do this), can impact
> someone's career. I am held accountable in my job based on your
> comments.
> 
> That said, I have worked tirelessly and innovated to make the code
> compliant with Herbert's suggestions (which btw have enabled an
> elegant batching implementation and code commonality for IAA and
> software compressors), validated it thoroughly for IAA and ZSTD to
> ensure that both demonstrate performance improvements, which
> are crucial for memory savings. I am proud of this work.
> 
> 
> > 
> > > +		if (err && !wb_enabled)
> > > +			goto compress_error;
> > > +
> > > +		for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> > > +			j = k + i;
> > 
> > Please use meaningful iterator names rather than i, j, and k and the huge
> > comment explaining what they are.
> 
> I happen to have a different view: having longer iterator names firstly makes
> code seem "verbose" and detracts from readability, not to mention exceeding the
> 80-character line limit. The comments are essential for code maintainability
> and avoid out-of-bounds errors when the next zswap developer wants to
> optimize the code.
> 
> One drawback of i/j/k iterators is mis-typing errors which cannot be caught
> at compile time. Let me think some more about how to strike a good balance.
> 
> > 
> > > +			dst = acomp_ctx->buffers[k];
> > > +			dlen = sg->length | *errp;
> > 
> > Why are we doing this?
> > 
> > > +
> > > +			if (dlen < 0) {
> > 
> > We should do the incompressible page handling also if dlen is PAGE_SIZE,
> > or if the compression failed (I guess that's the intention of bit OR'ing
> > with *errp?)
> 
> Yes, indeed: that's the intention of bit OR'ing with *errp.

..and you never really answered my question. In the exising code we
store the page as incompressible if writeback is enabled AND
crypto_wait_req() fails or dlen is zero or PAGE_SIZE. We check above
if crypto_wait_req() fails and writeback is disabled, but what about the
rest?

We don't check again if writeback is enabled before storing the page is
incompressible, and we do not check if dlen is zero or PAGE_SIZE. Are
these cases no longer possible?

Also, why use errp, why not explicitly use the appropriate error code?
It's also unclear to me why the error code is always zero with HW
compression?

> 
> > 
> > > +				dlen = PAGE_SIZE;
> > > +				dst = kmap_local_page(folio_page(folio, start
> > + j));
> > > +			}
> > > +
> > > +			handle = zs_malloc(pool->zs_pool, dlen, gfp, nid);
> > >
> > > -	zs_obj_write(pool->zs_pool, handle, dst, dlen);
> > > -	entry->handle = handle;
> > > -	entry->length = dlen;
> > > +			if (IS_ERR_VALUE(handle)) {
> > > +				if (PTR_ERR((void *)handle) == -ENOSPC)
> > > +					zswap_reject_compress_poor++;
> > > +				else
> > > +					zswap_reject_alloc_fail++;
> > >
> > > -unlock:
> > > -	if (mapped)
> > > -		kunmap_local(dst);
> > > -	if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
> > > -		zswap_reject_compress_poor++;
> > > -	else if (comp_ret)
> > > -		zswap_reject_compress_fail++;
> > > -	else if (alloc_ret)
> > > -		zswap_reject_alloc_fail++;
> > > +				goto err_unlock;
> > > +			}
> > > +
> > > +			zs_obj_write(pool->zs_pool, handle, dst, dlen);
> > > +			entries[j]->handle = handle;
> > > +			entries[j]->length = dlen;
> > > +			if (dst != acomp_ctx->buffers[k])
> > > +				kunmap_local(dst);
> > > +		}
> > > +	} /* finished compress and store nr_pages. */
> > > +
> > > +	mutex_unlock(&acomp_ctx->mutex);
> > > +	return true;
> > > +
> > > +compress_error:
> > > +	for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> > > +		if ((int)sg->length < 0) {
> > > +			if ((int)sg->length == -ENOSPC)
> > > +				zswap_reject_compress_poor++;
> > > +			else
> > > +				zswap_reject_compress_fail++;
> > > +		}
> > > +	}
> > >
> > > +err_unlock:
> > >  	mutex_unlock(&acomp_ctx->mutex);
> > > -	return comp_ret == 0 && alloc_ret == 0;
> > > +	return false;
> > >  }
> > >
> > >  static bool zswap_decompress(struct zswap_entry *entry, struct folio
> > *folio)
> > > @@ -1488,12 +1604,9 @@ static bool zswap_store_pages(struct folio
> > *folio,
> > >  		INIT_LIST_HEAD(&entries[i]->lru);
> > >  	}
> > >
> > > -	for (i = 0; i < nr_pages; ++i) {
> > > -		struct page *page = folio_page(folio, start + i);
> > > -
> > > -		if (!zswap_compress(page, entries[i], pool, wb_enabled))
> > > -			goto store_pages_failed;
> > > -	}
> > > +	if (unlikely(!zswap_compress(folio, start, nr_pages, entries, pool,
> > > +				     nid, wb_enabled)))
> > > +		goto store_pages_failed;
> > >
> > >  	for (i = 0; i < nr_pages; ++i) {
> > >  		struct zswap_entry *old, *entry = entries[i];
> > > --
> > > 2.27.0
> > >

RE: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Sridhar, Kanchana P 2 months, 3 weeks ago

> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Thursday, November 13, 2025 9:52 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
> compress batching of large folios.
> 
> On Thu, Nov 13, 2025 at 11:55:10PM +0000, Sridhar, Kanchana P wrote:
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > Sent: Thursday, November 13, 2025 1:35 PM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > hannes@cmpxchg.org; nphamcs@gmail.com;
> chengming.zhou@linux.dev;
> > > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> > > ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> > > senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> > > crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> > > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > > <kristen.c.accardi@intel.com>; Gomes, Vinicius
> <vinicius.gomes@intel.com>;
> > > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> > > <vinodh.gopal@intel.com>
> > > Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress()
> with
> > > compress batching of large folios.
> > >
> [..]
> > > > +		/*
> > > > +		 * If a page cannot be compressed into a size smaller than
> > > > +		 * PAGE_SIZE, save the content as is without a compression,
> > > to
> > > > +		 * keep the LRU order of writebacks.  If writeback is disabled,
> > > > +		 * reject the page since it only adds metadata overhead.
> > > > +		 * swap_writeout() will put the page back to the active LRU
> > > list
> > > > +		 * in the case.
> > > > +		 *
> > > > +		 * It is assumed that any compressor that sets the output
> > > length
> > > > +		 * to 0 or a value >= PAGE_SIZE will also return a negative
> > > > +		 * error status in @err; i.e, will not return a successful
> > > > +		 * compression status in @err in this case.
> > > > +		 */
> > >
> > > Ugh, checking the compression error and checking the compression length
> > > are now in separate places so we need to check if writeback is disabled
> > > in separate places and store the page as-is. It's ugly, and I think the
> > > current code is not correct.
> >
> > The code is 100% correct. You need to spend more time understanding
> > the code. I have stated my assumption above in the comments to
> > help in understanding the "why".
> >
> > From a maintainer, I would expect more responsible statements than
> > this. A flippant remark made without understanding the code (and,
> > disparaging the comments intended to help you do this), can impact
> > someone's career. I am held accountable in my job based on your
> > comments.
> >
> > That said, I have worked tirelessly and innovated to make the code
> > compliant with Herbert's suggestions (which btw have enabled an
> > elegant batching implementation and code commonality for IAA and
> > software compressors), validated it thoroughly for IAA and ZSTD to
> > ensure that both demonstrate performance improvements, which
> > are crucial for memory savings. I am proud of this work.
> >
> >
> > >
> > > > +		if (err && !wb_enabled)
> > > > +			goto compress_error;
> > > > +
> > > > +		for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> > > > +			j = k + i;
> > >
> > > Please use meaningful iterator names rather than i, j, and k and the huge
> > > comment explaining what they are.
> >
> > I happen to have a different view: having longer iterator names firstly makes
> > code seem "verbose" and detracts from readability, not to mention
> exceeding the
> > 80-character line limit. The comments are essential for code maintainability
> > and avoid out-of-bounds errors when the next zswap developer wants to
> > optimize the code.
> >
> > One drawback of i/j/k iterators is mis-typing errors which cannot be caught
> > at compile time. Let me think some more about how to strike a good
> balance.
> >
> > >
> > > > +			dst = acomp_ctx->buffers[k];
> > > > +			dlen = sg->length | *errp;
> > >
> > > Why are we doing this?
> > >
> > > > +
> > > > +			if (dlen < 0) {
> > >
> > > We should do the incompressible page handling also if dlen is PAGE_SIZE,
> > > or if the compression failed (I guess that's the intention of bit OR'ing
> > > with *errp?)
> >
> > Yes, indeed: that's the intention of bit OR'ing with *errp.
> 
> ..and you never really answered my question. In the exising code we
> store the page as incompressible if writeback is enabled AND
> crypto_wait_req() fails or dlen is zero or PAGE_SIZE. We check above
> if crypto_wait_req() fails and writeback is disabled, but what about the
> rest?

Let me explain this some more. The new code only relies on the assumption
that if dlen is zero or >= PAGE_SIZE, the compressor will not return a 0
("successful status"). In other words, the compressor will return an error status
in this case, which is expected to be a negative error code.

Under these (hopefully valid) assumptions, the code handles the simple case
of an error compression return status and writeback is disabled, by the
"goto compress_error".

The rest is handled by these:

1) First, I need to adapt the use of sg_outputs->sgl->length to represent the
compress length for software compressors, so I do this after crypto_wait_req()
returns:

                acomp_ctx->sg_outputs->sgl->length = acomp_ctx->req->dlen;

I did not want to propose any changes to crypto software compressors protocols.

2) After the check for the "if (err && !wb_enabled)" case, the new code has this:

                for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
                        j = k + i;
                        dst = acomp_ctx->buffers[k];
                        dlen = sg->length | *errp;

                        if (dlen < 0) {
                                dlen = PAGE_SIZE;
                                dst = kmap_local_page(folio_page(folio, start + j));
                        }

For batching compressors, namely, iaa_crypto, the individual output SG
lists sg->length follows the requirements from Herbert: each sg->length
is the compressed length or the error status (a negative error code).

Then all I need to know whether to store the page as incompressible
is to either directly test if sg->length is negative (for batching compressors),
or sg->length bit-OR'ed with the crypto_wait_req() return status ("err")
is negative. This is accomplished by the "dlen = sg->length | *errp;".

I believe this maintains backward compatibility with the existing code.
Please let me know if you agree.

> 
> We don't check again if writeback is enabled before storing the page is
> incompressible, and we do not check if dlen is zero or PAGE_SIZE. Are
> these cases no longer possible?

Hope the above explanation clarifies things some more? These case
are possible, and as long as they return an error status, they should be
correctly handled by the new code.

> 
> Also, why use errp, why not explicitly use the appropriate error code?
> It's also unclear to me why the error code is always zero with HW
> compression?

This is because of the sg->length requirements (compressed length/error)
for the batching interface suggested by Herbert. Hence, I upfront define
err_sg to 0, and, set errp to &err_sg for batching compressors. For software
compressors, errp is set to &err, namely, the above check will always apply
the software compressor's error status to the compressed length via
the bit-OR to determine if the page needs to be stored uncompressed.


> 
> >
> > >
> > > > +				dlen = PAGE_SIZE;
> > > > +				dst = kmap_local_page(folio_page(folio, start
> > > + j));
> > > > +			}
> > > > +
> > > > +			handle = zs_malloc(pool->zs_pool, dlen, gfp, nid);
> > > >
> > > > -	zs_obj_write(pool->zs_pool, handle, dst, dlen);
> > > > -	entry->handle = handle;
> > > > -	entry->length = dlen;
> > > > +			if (IS_ERR_VALUE(handle)) {
> > > > +				if (PTR_ERR((void *)handle) == -ENOSPC)
> > > > +					zswap_reject_compress_poor++;
> > > > +				else
> > > > +					zswap_reject_alloc_fail++;
> > > >
> > > > -unlock:
> > > > -	if (mapped)
> > > > -		kunmap_local(dst);
> > > > -	if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
> > > > -		zswap_reject_compress_poor++;
> > > > -	else if (comp_ret)
> > > > -		zswap_reject_compress_fail++;
> > > > -	else if (alloc_ret)
> > > > -		zswap_reject_alloc_fail++;
> > > > +				goto err_unlock;
> > > > +			}
> > > > +
> > > > +			zs_obj_write(pool->zs_pool, handle, dst, dlen);
> > > > +			entries[j]->handle = handle;
> > > > +			entries[j]->length = dlen;
> > > > +			if (dst != acomp_ctx->buffers[k])
> > > > +				kunmap_local(dst);
> > > > +		}
> > > > +	} /* finished compress and store nr_pages. */
> > > > +
> > > > +	mutex_unlock(&acomp_ctx->mutex);
> > > > +	return true;
> > > > +
> > > > +compress_error:
> > > > +	for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> > > > +		if ((int)sg->length < 0) {
> > > > +			if ((int)sg->length == -ENOSPC)
> > > > +				zswap_reject_compress_poor++;
> > > > +			else
> > > > +				zswap_reject_compress_fail++;
> > > > +		}
> > > > +	}
> > > >
> > > > +err_unlock:
> > > >  	mutex_unlock(&acomp_ctx->mutex);
> > > > -	return comp_ret == 0 && alloc_ret == 0;
> > > > +	return false;
> > > >  }
> > > >
> > > >  static bool zswap_decompress(struct zswap_entry *entry, struct folio
> > > *folio)
> > > > @@ -1488,12 +1604,9 @@ static bool zswap_store_pages(struct folio
> > > *folio,
> > > >  		INIT_LIST_HEAD(&entries[i]->lru);
> > > >  	}
> > > >
> > > > -	for (i = 0; i < nr_pages; ++i) {
> > > > -		struct page *page = folio_page(folio, start + i);
> > > > -
> > > > -		if (!zswap_compress(page, entries[i], pool, wb_enabled))
> > > > -			goto store_pages_failed;
> > > > -	}
> > > > +	if (unlikely(!zswap_compress(folio, start, nr_pages, entries, pool,
> > > > +				     nid, wb_enabled)))
> > > > +		goto store_pages_failed;
> > > >
> > > >  	for (i = 0; i < nr_pages; ++i) {
> > > >  		struct zswap_entry *old, *entry = entries[i];
> > > > --
> > > > 2.27.0
> > > >

Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Yosry Ahmed 2 months, 3 weeks ago

On Fri, Nov 14, 2025 at 06:43:21AM +0000, Sridhar, Kanchana P wrote:
> 
> > -----Original Message-----
> > From: Yosry Ahmed <yosry.ahmed@linux.dev>
> > Sent: Thursday, November 13, 2025 9:52 PM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> > ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> > senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> > crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> > <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
> > compress batching of large folios.
> > 
> > On Thu, Nov 13, 2025 at 11:55:10PM +0000, Sridhar, Kanchana P wrote:
> > >
> > > > -----Original Message-----
> > > > From: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > > Sent: Thursday, November 13, 2025 1:35 PM
> > > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > > hannes@cmpxchg.org; nphamcs@gmail.com;
> > chengming.zhou@linux.dev;
> > > > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> > > > ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> > > > senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> > > > crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> > > > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > > > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > > > <kristen.c.accardi@intel.com>; Gomes, Vinicius
> > <vinicius.gomes@intel.com>;
> > > > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> > > > <vinodh.gopal@intel.com>
> > > > Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress()
> > with
> > > > compress batching of large folios.
> > > >
> > [..]
> > > > > +		/*
> > > > > +		 * If a page cannot be compressed into a size smaller than
> > > > > +		 * PAGE_SIZE, save the content as is without a compression,
> > > > to
> > > > > +		 * keep the LRU order of writebacks.  If writeback is disabled,
> > > > > +		 * reject the page since it only adds metadata overhead.
> > > > > +		 * swap_writeout() will put the page back to the active LRU
> > > > list
> > > > > +		 * in the case.
> > > > > +		 *
> > > > > +		 * It is assumed that any compressor that sets the output
> > > > length
> > > > > +		 * to 0 or a value >= PAGE_SIZE will also return a negative
> > > > > +		 * error status in @err; i.e, will not return a successful
> > > > > +		 * compression status in @err in this case.
> > > > > +		 */
> > > >
> > > > Ugh, checking the compression error and checking the compression length
> > > > are now in separate places so we need to check if writeback is disabled
> > > > in separate places and store the page as-is. It's ugly, and I think the
> > > > current code is not correct.
> > >
> > > The code is 100% correct. You need to spend more time understanding
> > > the code. I have stated my assumption above in the comments to
> > > help in understanding the "why".
> > >
> > > From a maintainer, I would expect more responsible statements than
> > > this. A flippant remark made without understanding the code (and,
> > > disparaging the comments intended to help you do this), can impact
> > > someone's career. I am held accountable in my job based on your
> > > comments.
> > >
> > > That said, I have worked tirelessly and innovated to make the code
> > > compliant with Herbert's suggestions (which btw have enabled an
> > > elegant batching implementation and code commonality for IAA and
> > > software compressors), validated it thoroughly for IAA and ZSTD to
> > > ensure that both demonstrate performance improvements, which
> > > are crucial for memory savings. I am proud of this work.
> > >
> > >
> > > >
> > > > > +		if (err && !wb_enabled)
> > > > > +			goto compress_error;
> > > > > +
> > > > > +		for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> > > > > +			j = k + i;
> > > >
> > > > Please use meaningful iterator names rather than i, j, and k and the huge
> > > > comment explaining what they are.
> > >
> > > I happen to have a different view: having longer iterator names firstly makes
> > > code seem "verbose" and detracts from readability, not to mention
> > exceeding the
> > > 80-character line limit. The comments are essential for code maintainability
> > > and avoid out-of-bounds errors when the next zswap developer wants to
> > > optimize the code.
> > >
> > > One drawback of i/j/k iterators is mis-typing errors which cannot be caught
> > > at compile time. Let me think some more about how to strike a good
> > balance.
> > >
> > > >
> > > > > +			dst = acomp_ctx->buffers[k];
> > > > > +			dlen = sg->length | *errp;
> > > >
> > > > Why are we doing this?
> > > >
> > > > > +
> > > > > +			if (dlen < 0) {
> > > >
> > > > We should do the incompressible page handling also if dlen is PAGE_SIZE,
> > > > or if the compression failed (I guess that's the intention of bit OR'ing
> > > > with *errp?)
> > >
> > > Yes, indeed: that's the intention of bit OR'ing with *errp.
> > 
> > ..and you never really answered my question. In the exising code we
> > store the page as incompressible if writeback is enabled AND
> > crypto_wait_req() fails or dlen is zero or PAGE_SIZE. We check above
> > if crypto_wait_req() fails and writeback is disabled, but what about the
> > rest?
> 
> Let me explain this some more. The new code only relies on the assumption
> that if dlen is zero or >= PAGE_SIZE, the compressor will not return a 0
> ("successful status"). In other words, the compressor will return an error status
> in this case, which is expected to be a negative error code.

I am not sure if all compressors do that, especially for the case where
dlen >= PAGE_SIZE. The existing code does not assume that there will be
an error code in these cases.

For the dlen == 0 case, the check was introduced recently by commit
dca4437a5861 ("mm/zswap: store <PAGE_SIZE compression failed page
as-is"). Looking through the history it seems like it was introduced in
v4 of that patch but I don't see the reasoning.

SeongJae, did you observe any compressors returning dlen == 0 but no
error code? What was the reasoning behind the dlen == 0 check?

> 
> Under these (hopefully valid) assumptions, the code handles the simple case
> of an error compression return status and writeback is disabled, by the
> "goto compress_error".
> 
> The rest is handled by these:
> 
> 1) First, I need to adapt the use of sg_outputs->sgl->length to represent the
> compress length for software compressors, so I do this after crypto_wait_req()
> returns:
> 
>                 acomp_ctx->sg_outputs->sgl->length = acomp_ctx->req->dlen;

For SW compressors, why is acomp_ctx->sg_outputs->sgl->length not set?
IIUC we are using the same API for SW and HW compressors, why is the
output length in different places for each of them?

> 
> I did not want to propose any changes to crypto software compressors protocols.
> 
> 2) After the check for the "if (err && !wb_enabled)" case, the new code has this:
> 
>                 for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
>                         j = k + i;
>                         dst = acomp_ctx->buffers[k];
>                         dlen = sg->length | *errp;
> 
>                         if (dlen < 0) {
>                                 dlen = PAGE_SIZE;
>                                 dst = kmap_local_page(folio_page(folio, start + j));
>                         }
> 
> For batching compressors, namely, iaa_crypto, the individual output SG
> lists sg->length follows the requirements from Herbert: each sg->length
> is the compressed length or the error status (a negative error code).
> 
> Then all I need to know whether to store the page as incompressible
> is to either directly test if sg->length is negative (for batching compressors),
> or sg->length bit-OR'ed with the crypto_wait_req() return status ("err")
> is negative. This is accomplished by the "dlen = sg->length | *errp;".
> 
> I believe this maintains backward compatibility with the existing code.
> Please let me know if you agree.

For batching compressors, will 'err' be set as well, or just sg->length?
If it's just sg->length, then we need to check again if WB is enabled
here before storing the page uncompressed. Right?

> 
> > 
> > We don't check again if writeback is enabled before storing the page is
> > incompressible, and we do not check if dlen is zero or PAGE_SIZE. Are
> > these cases no longer possible?
> 
> Hope the above explanation clarifies things some more? These case
> are possible, and as long as they return an error status, they should be
> correctly handled by the new code.

As mentioned above, I am not sure if that's correct for dlen >=
PAGE_SIZE.

> 
> > 
> > Also, why use errp, why not explicitly use the appropriate error code?
> > It's also unclear to me why the error code is always zero with HW
> > compression?
> 
> This is because of the sg->length requirements (compressed length/error)
> for the batching interface suggested by Herbert. Hence, I upfront define
> err_sg to 0, and, set errp to &err_sg for batching compressors. For software
> compressors, errp is set to &err, namely, the above check will always apply
> the software compressor's error status to the compressed length via
> the bit-OR to determine if the page needs to be stored uncompressed.

Thanks for the clarification. I understand that the error code has
different sources for SW and HW compressors, but I do not like using
errp as an indirection. It makes the code unclear. I would rather we
explicitly check err for SW compressors and dlen for HW compressors.

That being said, I thought what Herbert suggested was that the same API
is used for both SW and HW compressors. IOW, either way we submit a
batch of pages (8 pages for SW compressors), and then the crypto API
would either give the entire batch to the compressor if it supports
batching, or loop over them internally and hand them page-by-page to
the compressor.

This would simplify usage as we do not have to handle the differences in
zswap.

If that is not doable, at the very least the API should be consistent.
Right now the error code and length are propagated differently to the
caller based on whether or not the compressor support batching.

> 
> 
> > 
> > >
> > > >
> > > > > +				dlen = PAGE_SIZE;
> > > > > +				dst = kmap_local_page(folio_page(folio, start
> > > > + j));
> > > > > +			}
> > > > > +
> > > > > +			handle = zs_malloc(pool->zs_pool, dlen, gfp, nid);
> > > > >
> > > > > -	zs_obj_write(pool->zs_pool, handle, dst, dlen);
> > > > > -	entry->handle = handle;
> > > > > -	entry->length = dlen;
> > > > > +			if (IS_ERR_VALUE(handle)) {
> > > > > +				if (PTR_ERR((void *)handle) == -ENOSPC)
> > > > > +					zswap_reject_compress_poor++;
> > > > > +				else
> > > > > +					zswap_reject_alloc_fail++;
> > > > >
> > > > > -unlock:
> > > > > -	if (mapped)
> > > > > -		kunmap_local(dst);
> > > > > -	if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
> > > > > -		zswap_reject_compress_poor++;
> > > > > -	else if (comp_ret)
> > > > > -		zswap_reject_compress_fail++;
> > > > > -	else if (alloc_ret)
> > > > > -		zswap_reject_alloc_fail++;
> > > > > +				goto err_unlock;
> > > > > +			}
> > > > > +
> > > > > +			zs_obj_write(pool->zs_pool, handle, dst, dlen);
> > > > > +			entries[j]->handle = handle;
> > > > > +			entries[j]->length = dlen;
> > > > > +			if (dst != acomp_ctx->buffers[k])
> > > > > +				kunmap_local(dst);
> > > > > +		}
> > > > > +	} /* finished compress and store nr_pages. */
> > > > > +
> > > > > +	mutex_unlock(&acomp_ctx->mutex);
> > > > > +	return true;
> > > > > +
> > > > > +compress_error:
> > > > > +	for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> > > > > +		if ((int)sg->length < 0) {
> > > > > +			if ((int)sg->length == -ENOSPC)
> > > > > +				zswap_reject_compress_poor++;
> > > > > +			else
> > > > > +				zswap_reject_compress_fail++;
> > > > > +		}
> > > > > +	}
> > > > >
> > > > > +err_unlock:
> > > > >  	mutex_unlock(&acomp_ctx->mutex);
> > > > > -	return comp_ret == 0 && alloc_ret == 0;
> > > > > +	return false;
> > > > >  }
> > > > >
> > > > >  static bool zswap_decompress(struct zswap_entry *entry, struct folio
> > > > *folio)
> > > > > @@ -1488,12 +1604,9 @@ static bool zswap_store_pages(struct folio
> > > > *folio,
> > > > >  		INIT_LIST_HEAD(&entries[i]->lru);
> > > > >  	}
> > > > >
> > > > > -	for (i = 0; i < nr_pages; ++i) {
> > > > > -		struct page *page = folio_page(folio, start + i);
> > > > > -
> > > > > -		if (!zswap_compress(page, entries[i], pool, wb_enabled))
> > > > > -			goto store_pages_failed;
> > > > > -	}
> > > > > +	if (unlikely(!zswap_compress(folio, start, nr_pages, entries, pool,
> > > > > +				     nid, wb_enabled)))
> > > > > +		goto store_pages_failed;
> > > > >
> > > > >  	for (i = 0; i < nr_pages; ++i) {
> > > > >  		struct zswap_entry *old, *entry = entries[i];
> > > > > --
> > > > > 2.27.0
> > > > >

Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Herbert Xu 2 months, 2 weeks ago

On Fri, Nov 14, 2025 at 03:37:53PM +0000, Yosry Ahmed wrote:
>
> Thanks for the clarification. I understand that the error code has
> different sources for SW and HW compressors, but I do not like using
> errp as an indirection. It makes the code unclear. I would rather we
> explicitly check err for SW compressors and dlen for HW compressors.
> 
> That being said, I thought what Herbert suggested was that the same API
> is used for both SW and HW compressors. IOW, either way we submit a
> batch of pages (8 pages for SW compressors), and then the crypto API
> would either give the entire batch to the compressor if it supports
> batching, or loop over them internally and hand them page-by-page to
> the compressor.
> 
> This would simplify usage as we do not have to handle the differences in
> zswap.
> 
> If that is not doable, at the very least the API should be consistent.
> Right now the error code and length are propagated differently to the
> caller based on whether or not the compressor support batching.

Yes we should only have one code path in zswap, regardless of whether
batching is used or not.

The degenerate case of a batch with a single page should be handled
by the Crypto API.

So I will change crypto_acomp to take care of this case.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Yosry Ahmed 2 months, 2 weeks ago

On Wed, Nov 26, 2025 at 01:46:57PM +0800, Herbert Xu wrote:
> On Fri, Nov 14, 2025 at 03:37:53PM +0000, Yosry Ahmed wrote:
> >
> > Thanks for the clarification. I understand that the error code has
> > different sources for SW and HW compressors, but I do not like using
> > errp as an indirection. It makes the code unclear. I would rather we
> > explicitly check err for SW compressors and dlen for HW compressors.
> > 
> > That being said, I thought what Herbert suggested was that the same API
> > is used for both SW and HW compressors. IOW, either way we submit a
> > batch of pages (8 pages for SW compressors), and then the crypto API
> > would either give the entire batch to the compressor if it supports
> > batching, or loop over them internally and hand them page-by-page to
> > the compressor.
> > 
> > This would simplify usage as we do not have to handle the differences in
> > zswap.
> > 
> > If that is not doable, at the very least the API should be consistent.
> > Right now the error code and length are propagated differently to the
> > caller based on whether or not the compressor support batching.
> 
> Yes we should only have one code path in zswap, regardless of whether
> batching is used or not.
> 
> The degenerate case of a batch with a single page should be handled
> by the Crypto API.
> 
> So I will change crypto_acomp to take care of this case.

Nice :)

> 
> Cheers,
> -- 
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

RE: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Sridhar, Kanchana P 2 months, 2 weeks ago

> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Tuesday, November 25, 2025 10:35 PM
> To: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; SeongJae Park
> <sj@kernel.org>; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; kasong@tencent.com; linux-
> crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> Kristen C <kristen.c.accardi@intel.com>; Gomes, Vinicius
> <vinicius.gomes@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
> compress batching of large folios.
> 
> On Wed, Nov 26, 2025 at 01:46:57PM +0800, Herbert Xu wrote:
> > On Fri, Nov 14, 2025 at 03:37:53PM +0000, Yosry Ahmed wrote:
> > >
> > > Thanks for the clarification. I understand that the error code has
> > > different sources for SW and HW compressors, but I do not like using
> > > errp as an indirection. It makes the code unclear. I would rather we
> > > explicitly check err for SW compressors and dlen for HW compressors.
> > >
> > > That being said, I thought what Herbert suggested was that the same API
> > > is used for both SW and HW compressors. IOW, either way we submit a
> > > batch of pages (8 pages for SW compressors), and then the crypto API
> > > would either give the entire batch to the compressor if it supports
> > > batching, or loop over them internally and hand them page-by-page to
> > > the compressor.
> > >
> > > This would simplify usage as we do not have to handle the differences in
> > > zswap.
> > >
> > > If that is not doable, at the very least the API should be consistent.
> > > Right now the error code and length are propagated differently to the
> > > caller based on whether or not the compressor support batching.
> >
> > Yes we should only have one code path in zswap, regardless of whether
> > batching is used or not.
> >
> > The degenerate case of a batch with a single page should be handled
> > by the Crypto API.
> >
> > So I will change crypto_acomp to take care of this case.
> 
> Nice :)

Thanks Herbert and Yosry!

Herbert, to make sure I understand, will you be implementing all of these
features in crypto_acomp for software compressors? I would appreciate it
if you can clarify:

1) Error & compressed length propagation to the dst sg->length only for
    non-batching compressors.
    a) For batching compressors, this wouldn't apply since errors could occur
        for any page in the batch, and the first page (dst sg->length) could have
        successfully compressed.

2) Will you also be handling the case where zswap can send an SG list batch
     with multiple pages to a non-batching compressor, and the crypto_acomp
     API will internally compress each page sequentially, propagate
     errors/compress lengths before returning?
        
If so, this would really standardize the code in zswap for batching and
non-batching compressors.

Thanks,
Kanchana

> 
> >
> > Cheers,
> > --
> > Email: Herbert Xu <herbert@gondor.apana.org.au>
> > Home Page: http://gondor.apana.org.au/~herbert/
> > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Herbert Xu 2 months ago

On Wed, Nov 26, 2025 at 08:05:40PM +0000, Sridhar, Kanchana P wrote:
>
> Herbert, to make sure I understand, will you be implementing all of these
> features in crypto_acomp for software compressors? I would appreciate it
> if you can clarify:
> 
> 1) Error & compressed length propagation to the dst sg->length only for
>     non-batching compressors.
>     a) For batching compressors, this wouldn't apply since errors could occur
>         for any page in the batch, and the first page (dst sg->length) could have
>         successfully compressed.

This would be the first step.

> 2) Will you also be handling the case where zswap can send an SG list batch
>      with multiple pages to a non-batching compressor, and the crypto_acomp
>      API will internally compress each page sequentially, propagate
>      errors/compress lengths before returning?
>         
> If so, this would really standardize the code in zswap for batching and
> non-batching compressors.

Yes this will be done as the next step.  My understanding is that
your patch-set doesn't require this yet as all non-batching compressors
will have a batch size of 1.

But yes this certainly will be extended, not just with sequential
processing, but we could also use pcrypt/cryptd to parallelise the
compression across CPUs.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

RE: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Sridhar, Kanchana P 2 months ago

> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Sunday, December 7, 2025 7:24 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Yosry Ahmed <yosry.ahmed@linux.dev>; SeongJae Park <sj@kernel.org>;
> linux-kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; kasong@tencent.com; linux-
> crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> Kristen C <kristen.c.accardi@intel.com>; Gomes, Vinicius
> <vinicius.gomes@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
> compress batching of large folios.
> 
> On Wed, Nov 26, 2025 at 08:05:40PM +0000, Sridhar, Kanchana P wrote:
> >
> > Herbert, to make sure I understand, will you be implementing all of these
> > features in crypto_acomp for software compressors? I would appreciate it
> > if you can clarify:
> >
> > 1) Error & compressed length propagation to the dst sg->length only for
> >     non-batching compressors.
> >     a) For batching compressors, this wouldn't apply since errors could occur
> >         for any page in the batch, and the first page (dst sg->length) could have
> >         successfully compressed.
> 
> This would be the first step.

Hi Herbert,

Thanks for these clarifications! This sounds like a great first step.

> 
> > 2) Will you also be handling the case where zswap can send an SG list batch
> >      with multiple pages to a non-batching compressor, and the crypto_acomp
> >      API will internally compress each page sequentially, propagate
> >      errors/compress lengths before returning?
> >
> > If so, this would really standardize the code in zswap for batching and
> > non-batching compressors.
> 
> Yes this will be done as the next step.  My understanding is that
> your patch-set doesn't require this yet as all non-batching compressors
> will have a batch size of 1.

I see. So the way my patch-set tries to standardize batching in
zswap_compress() is to call it with a batch of 8 pages, regardless of batching
or non-batching compressors. In zswap_compress(), I presently iterate
through each page in the batch for sequential processing for non-batching
compressors whose batch size is 1. For batching compressors, the iteration
happens just once: the whole batch is compressed in one call to
crypto_acomp_compress().

When the next step is ready, I will no longer need this for loop that
iterates over the batch in "batch_size" increments. If Yosry and Nhat are
Ok with staging it as you've described, this should all be good.

Also, I have incorporated your suggestion to implement batching within
iaa_crypto in a manner that adheres to the acomp API. I was planning to
start creating an updated patch-set with this. Please let me know if it would
be a good idea to wait to sync with the first step you are working on before
submitting the updated patch-set. Thanks for collaboration!

> 
> But yes this certainly will be extended, not just with sequential
> processing, but we could also use pcrypt/cryptd to parallelise the
> compression across CPUs.

Sounds great!

Best regards,
Kanchana

> 
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Herbert Xu 2 months ago

On Mon, Dec 08, 2025 at 04:17:38AM +0000, Sridhar, Kanchana P wrote:
>
> I see. So the way my patch-set tries to standardize batching in
> zswap_compress() is to call it with a batch of 8 pages, regardless of batching
> or non-batching compressors. In zswap_compress(), I presently iterate
> through each page in the batch for sequential processing for non-batching
> compressors whose batch size is 1. For batching compressors, the iteration
> happens just once: the whole batch is compressed in one call to
> crypto_acomp_compress().

Oh I wasn't aware of this.  In that case there is no need for me
to delay the next step and we can do it straight away.

I had thought that the batch size was to limit the batching size
to acomp.  But if it's not, perhaps we can remove the batch size
exposure altogether.  IOW it would only be visible internally to
the acomp API while the users such as zswap would simply batch
things in whatever size that suits them.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Yosry Ahmed 2 months ago

On Mon, Dec 08, 2025 at 12:24:01PM +0800, Herbert Xu wrote:
> On Mon, Dec 08, 2025 at 04:17:38AM +0000, Sridhar, Kanchana P wrote:
> >
> > I see. So the way my patch-set tries to standardize batching in
> > zswap_compress() is to call it with a batch of 8 pages, regardless of batching
> > or non-batching compressors. In zswap_compress(), I presently iterate
> > through each page in the batch for sequential processing for non-batching
> > compressors whose batch size is 1. For batching compressors, the iteration
> > happens just once: the whole batch is compressed in one call to
> > crypto_acomp_compress().
> 
> Oh I wasn't aware of this.  In that case there is no need for me
> to delay the next step and we can do it straight away.
> 
> I had thought that the batch size was to limit the batching size
> to acomp.  But if it's not, perhaps we can remove the batch size
> exposure altogether.  IOW it would only be visible internally to
> the acomp API while the users such as zswap would simply batch
> things in whatever size that suits them.

Just to clarify, does this mean that zswap can pass a batch of (eight)
pages to the acomp API, and get the results for the batch uniformly
whether or not the underlying compressor supports batching?

If yes, then that's exactly what we want for zswap, because it will
simplify the interface significantly vs. what this batch is currently
doing to handle SW non-batching compressors vs HW batching compressors.

> 
> Thanks,
> -- 
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Herbert Xu 2 months ago

On Tue, Dec 09, 2025 at 01:15:02AM +0000, Yosry Ahmed wrote:
> 
> Just to clarify, does this mean that zswap can pass a batch of (eight)
> pages to the acomp API, and get the results for the batch uniformly
> whether or not the underlying compressor supports batching?

Correct.  In fact I'd like to remove the batch size exposure to zswap
altogether.  zswap should just pass along whatever maximum number of
pages that is convenient to itself.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Yosry Ahmed 2 months ago

On Tue, Dec 09, 2025 at 10:32:20AM +0800, Herbert Xu wrote:
> On Tue, Dec 09, 2025 at 01:15:02AM +0000, Yosry Ahmed wrote:
> > 
> > Just to clarify, does this mean that zswap can pass a batch of (eight)
> > pages to the acomp API, and get the results for the batch uniformly
> > whether or not the underlying compressor supports batching?
> 
> Correct.  In fact I'd like to remove the batch size exposure to zswap
> altogether.  zswap should just pass along whatever maximum number of
> pages that is convenient to itself.

I think exposing the batch size is still useful as a hint for zswap. In
the current series, zswap allocates as many per-CPU buffers as the
compressor's batch size, so no extra buffers for non-batching
compressors (including SW compressors).

If we use the same batch size regardless, we'll have to always allocate
8 (or N) per-CPU buffers, for little to no benefit on non-batching
compressors.

So we still want the batch size on the zswap side, but we want the
crypto API to be uniform whether or not the compressor supports
batching.

> 
> Cheers,
> -- 
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

RE: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Sridhar, Kanchana P 2 months ago

> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Tuesday, December 9, 2025 8:55 AM
> To: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; SeongJae Park
> <sj@kernel.org>; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; kasong@tencent.com; linux-
> crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> Kristen C <kristen.c.accardi@intel.com>; Gomes, Vinicius
> <vinicius.gomes@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
> compress batching of large folios.
> 
> On Tue, Dec 09, 2025 at 10:32:20AM +0800, Herbert Xu wrote:
> > On Tue, Dec 09, 2025 at 01:15:02AM +0000, Yosry Ahmed wrote:
> > >
> > > Just to clarify, does this mean that zswap can pass a batch of (eight)
> > > pages to the acomp API, and get the results for the batch uniformly
> > > whether or not the underlying compressor supports batching?
> >
> > Correct.  In fact I'd like to remove the batch size exposure to zswap
> > altogether.  zswap should just pass along whatever maximum number of
> > pages that is convenient to itself.
> 
> I think exposing the batch size is still useful as a hint for zswap. In
> the current series, zswap allocates as many per-CPU buffers as the
> compressor's batch size, so no extra buffers for non-batching
> compressors (including SW compressors).
> 
> If we use the same batch size regardless, we'll have to always allocate
> 8 (or N) per-CPU buffers, for little to no benefit on non-batching
> compressors.
> 
> So we still want the batch size on the zswap side, but we want the
> crypto API to be uniform whether or not the compressor supports
> batching.

Thanks Yosry, you bring up a good point. I currently have the outer for
loop in zswap_compress() due to the above constraint. For non-batching
compressors, we allocate only one per-CPU buffer. Hence, we need to
call crypto_acomp_compress() and write the compressed data to the
zs_poll for each page in the batch. Wouldn't we need to allocate
8 per-CPU buffers for non-batching compressors if we want zswap to
send a batch of 8 pages uniformly to the crypto API, so that
zswap_compress() can store the 8 pages in zs_pool after the crypto
API returns?

Thanks,
Kanchana

> 
> >
> > Cheers,
> > --
> > Email: Herbert Xu <herbert@gondor.apana.org.au>
> > Home Page: http://gondor.apana.org.au/~herbert/
> > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Yosry Ahmed 2 months ago

On Tue, Dec 09, 2025 at 05:21:06PM +0000, Sridhar, Kanchana P wrote:
> 
> > -----Original Message-----
> > From: Yosry Ahmed <yosry.ahmed@linux.dev>
> > Sent: Tuesday, December 9, 2025 8:55 AM
> > To: Herbert Xu <herbert@gondor.apana.org.au>
> > Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; SeongJae Park
> > <sj@kernel.org>; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> > ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> > senozhatsky@chromium.org; kasong@tencent.com; linux-
> > crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> > ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> > Kristen C <kristen.c.accardi@intel.com>; Gomes, Vinicius
> > <vinicius.gomes@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> > Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
> > compress batching of large folios.
> > 
> > On Tue, Dec 09, 2025 at 10:32:20AM +0800, Herbert Xu wrote:
> > > On Tue, Dec 09, 2025 at 01:15:02AM +0000, Yosry Ahmed wrote:
> > > >
> > > > Just to clarify, does this mean that zswap can pass a batch of (eight)
> > > > pages to the acomp API, and get the results for the batch uniformly
> > > > whether or not the underlying compressor supports batching?
> > >
> > > Correct.  In fact I'd like to remove the batch size exposure to zswap
> > > altogether.  zswap should just pass along whatever maximum number of
> > > pages that is convenient to itself.
> > 
> > I think exposing the batch size is still useful as a hint for zswap. In
> > the current series, zswap allocates as many per-CPU buffers as the
> > compressor's batch size, so no extra buffers for non-batching
> > compressors (including SW compressors).
> > 
> > If we use the same batch size regardless, we'll have to always allocate
> > 8 (or N) per-CPU buffers, for little to no benefit on non-batching
> > compressors.
> > 
> > So we still want the batch size on the zswap side, but we want the
> > crypto API to be uniform whether or not the compressor supports
> > batching.
> 
> Thanks Yosry, you bring up a good point. I currently have the outer for
> loop in zswap_compress() due to the above constraint. For non-batching
> compressors, we allocate only one per-CPU buffer. Hence, we need to
> call crypto_acomp_compress() and write the compressed data to the
> zs_poll for each page in the batch. Wouldn't we need to allocate
> 8 per-CPU buffers for non-batching compressors if we want zswap to
> send a batch of 8 pages uniformly to the crypto API, so that
> zswap_compress() can store the 8 pages in zs_pool after the crypto
> API returns?

Ugh, yes.. I don't think we want to burn 7 extra pages per-CPU for SW
compressors.

I think the cleanest way to handle this would be to:
- Rename zswap_compress() to __zswap_compress(), and make it handle a
  given batch size (which would be 1 or 8).
- Introduce zswap_compress() as a wrapper that breaks down the folio
  into batches and loops over them, passing them to __zswap_compress().
- __zswap_compress() has a single unified path (e.g. for compressed
  length and error handling), regardless of the batch size.

Can this be done with the current acomp API? I think all we really need
is to be able to pass in a batch of size N (which can be 1), and read
the error and compressed length in a single way. This is my main problem
with the current patch.

In the future, if it's beneifical for some SW compressors to batch
compressions, we can look into optimizations for the per-CPU buffers to
avoid allocating 8 pages per-CPU (e.g. shared page pool), or make this
opt-in for certain SW compressors that justify the cost.

> 
> Thanks,
> Kanchana
> 
> > 
> > >
> > > Cheers,
> > > --
> > > Email: Herbert Xu <herbert@gondor.apana.org.au>
> > > Home Page: http://gondor.apana.org.au/~herbert/
> > > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Herbert Xu 2 months ago

On Tue, Dec 09, 2025 at 05:31:35PM +0000, Yosry Ahmed wrote:
>
> Ugh, yes.. I don't think we want to burn 7 extra pages per-CPU for SW
> compressors.

OK so the consensus is that we're keeping the visible batch size
attribute for now, which will be set to 1 for everything but iaa.

So for now I'm going to just provide a trivial acomp fallback so
that non-batching algorithms conform to the batching calling
convention for a batch size of 1.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Yosry Ahmed 2 months ago

On Wed, Dec 10, 2025 at 12:28:11PM +0800, Herbert Xu wrote:
> On Tue, Dec 09, 2025 at 05:31:35PM +0000, Yosry Ahmed wrote:
> >
> > Ugh, yes.. I don't think we want to burn 7 extra pages per-CPU for SW
> > compressors.
> 
> OK so the consensus is that we're keeping the visible batch size
> attribute for now, which will be set to 1 for everything but iaa.
> 
> So for now I'm going to just provide a trivial acomp fallback so
> that non-batching algorithms conform to the batching calling
> convention for a batch size of 1.

Thank you!

> 
> Cheers,
> -- 
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

RE: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Sridhar, Kanchana P 2 months ago

> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Tuesday, December 9, 2025 8:28 PM
> To: Yosry Ahmed <yosry.ahmed@linux.dev>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; SeongJae Park
> <sj@kernel.org>; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; kasong@tencent.com; linux-
> crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> Kristen C <kristen.c.accardi@intel.com>; Gomes, Vinicius
> <vinicius.gomes@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
> compress batching of large folios.
> 
> On Tue, Dec 09, 2025 at 05:31:35PM +0000, Yosry Ahmed wrote:
> >
> > Ugh, yes.. I don't think we want to burn 7 extra pages per-CPU for SW
> > compressors.
> 
> OK so the consensus is that we're keeping the visible batch size
> attribute for now, which will be set to 1 for everything but iaa.
> 
> So for now I'm going to just provide a trivial acomp fallback so
> that non-batching algorithms conform to the batching calling
> convention for a batch size of 1.

Thanks Herbert, this sounds good.

Thanks,
Kanchana

> 
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

RE: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Sridhar, Kanchana P 2 months ago

> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Tuesday, December 9, 2025 9:32 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>; SeongJae Park
> <sj@kernel.org>; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; kasong@tencent.com; linux-
> crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> Kristen C <kristen.c.accardi@intel.com>; Gomes, Vinicius
> <vinicius.gomes@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
> compress batching of large folios.
> 
> On Tue, Dec 09, 2025 at 05:21:06PM +0000, Sridhar, Kanchana P wrote:
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > Sent: Tuesday, December 9, 2025 8:55 AM
> > > To: Herbert Xu <herbert@gondor.apana.org.au>
> > > Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; SeongJae Park
> > > <sj@kernel.org>; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > hannes@cmpxchg.org; nphamcs@gmail.com;
> chengming.zhou@linux.dev;
> > > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> > > ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> > > senozhatsky@chromium.org; kasong@tencent.com; linux-
> > > crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> > > ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> > > Kristen C <kristen.c.accardi@intel.com>; Gomes, Vinicius
> > > <vinicius.gomes@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>;
> > > Gopal, Vinodh <vinodh.gopal@intel.com>
> > > Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress()
> with
> > > compress batching of large folios.
> > >
> > > On Tue, Dec 09, 2025 at 10:32:20AM +0800, Herbert Xu wrote:
> > > > On Tue, Dec 09, 2025 at 01:15:02AM +0000, Yosry Ahmed wrote:
> > > > >
> > > > > Just to clarify, does this mean that zswap can pass a batch of (eight)
> > > > > pages to the acomp API, and get the results for the batch uniformly
> > > > > whether or not the underlying compressor supports batching?
> > > >
> > > > Correct.  In fact I'd like to remove the batch size exposure to zswap
> > > > altogether.  zswap should just pass along whatever maximum number of
> > > > pages that is convenient to itself.
> > >
> > > I think exposing the batch size is still useful as a hint for zswap. In
> > > the current series, zswap allocates as many per-CPU buffers as the
> > > compressor's batch size, so no extra buffers for non-batching
> > > compressors (including SW compressors).
> > >
> > > If we use the same batch size regardless, we'll have to always allocate
> > > 8 (or N) per-CPU buffers, for little to no benefit on non-batching
> > > compressors.
> > >
> > > So we still want the batch size on the zswap side, but we want the
> > > crypto API to be uniform whether or not the compressor supports
> > > batching.
> >
> > Thanks Yosry, you bring up a good point. I currently have the outer for
> > loop in zswap_compress() due to the above constraint. For non-batching
> > compressors, we allocate only one per-CPU buffer. Hence, we need to
> > call crypto_acomp_compress() and write the compressed data to the
> > zs_poll for each page in the batch. Wouldn't we need to allocate
> > 8 per-CPU buffers for non-batching compressors if we want zswap to
> > send a batch of 8 pages uniformly to the crypto API, so that
> > zswap_compress() can store the 8 pages in zs_pool after the crypto
> > API returns?
> 
> Ugh, yes.. I don't think we want to burn 7 extra pages per-CPU for SW
> compressors.
> 
> I think the cleanest way to handle this would be to:
> - Rename zswap_compress() to __zswap_compress(), and make it handle a
>   given batch size (which would be 1 or 8).
> - Introduce zswap_compress() as a wrapper that breaks down the folio
>   into batches and loops over them, passing them to __zswap_compress().
> - __zswap_compress() has a single unified path (e.g. for compressed
>   length and error handling), regardless of the batch size.
> 
> Can this be done with the current acomp API? I think all we really need
> is to be able to pass in a batch of size N (which can be 1), and read
> the error and compressed length in a single way. This is my main problem
> with the current patch.

Once Herbert gives us the crypto_acomp modification for non-batching
compressors to set the acomp_req->dst->length to the
compressed length/error value, I think the same could be accomplished
with the current patch, since I will be able to delete the "errp". IOW, I think
a simplification is possible without introducing __zswap_compress(). The
code will look seamless for non-batching and batching compressors, and the
distinction will be made apparent by the outer for loop that iterates over
the batch based on the pool->compr_batch_size in the current patch.

Alternately, we could introduce the __zswap_compress() that abstracts
one single iteration through the outer for loop: it compresses 1 or 8 pages
as a "batch". However, the distinction would still need to be made for
non-batching vs. batching compressors in the zswap_compress() wrapper:
both for sending the pool->compr_batch_size # of pages to
__zswap_compress() and for iterating over the single/multiple dst buffers
to write to zs_pool (the latter could be done within __zswap_compress(),
but the point remains: we would need to distinguish in one or the other).

It could be argued that keeping the seamless-ness in handling the calls to
crypto based on the pool->compr_batch_size and the logical distinctions
imposed by this in iterating over the output SG lists/buffers, would be
cleaner being self-contained in zswap_compress(). We already have a
zswap_store_pages() that processes the folio in batches. Maybe minimizing
the functions that do batch processing could be cleaner?

In any case, let me know which would be preferable.

Thanks,
Kanchana

> 
> In the future, if it's beneifical for some SW compressors to batch
> compressions, we can look into optimizations for the per-CPU buffers to
> avoid allocating 8 pages per-CPU (e.g. shared page pool), or make this
> opt-in for certain SW compressors that justify the cost.
> 
> >
> > Thanks,
> > Kanchana
> >
> > >
> > > >
> > > > Cheers,
> > > > --
> > > > Email: Herbert Xu <herbert@gondor.apana.org.au>
> > > > Home Page: http://gondor.apana.org.au/~herbert/
> > > > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Yosry Ahmed 2 months ago

On Tue, Dec 09, 2025 at 07:38:20PM +0000, Sridhar, Kanchana P wrote:
> 
> > -----Original Message-----
> > From: Yosry Ahmed <yosry.ahmed@linux.dev>
> > Sent: Tuesday, December 9, 2025 9:32 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: Herbert Xu <herbert@gondor.apana.org.au>; SeongJae Park
> > <sj@kernel.org>; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> > ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> > senozhatsky@chromium.org; kasong@tencent.com; linux-
> > crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> > ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> > Kristen C <kristen.c.accardi@intel.com>; Gomes, Vinicius
> > <vinicius.gomes@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> > Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
> > compress batching of large folios.
> > 
> > On Tue, Dec 09, 2025 at 05:21:06PM +0000, Sridhar, Kanchana P wrote:
> > >
> > > > -----Original Message-----
> > > > From: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > > Sent: Tuesday, December 9, 2025 8:55 AM
> > > > To: Herbert Xu <herbert@gondor.apana.org.au>
> > > > Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; SeongJae Park
> > > > <sj@kernel.org>; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > > hannes@cmpxchg.org; nphamcs@gmail.com;
> > chengming.zhou@linux.dev;
> > > > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> > > > ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> > > > senozhatsky@chromium.org; kasong@tencent.com; linux-
> > > > crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> > > > ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> > > > Kristen C <kristen.c.accardi@intel.com>; Gomes, Vinicius
> > > > <vinicius.gomes@intel.com>; Feghali, Wajdi K
> > <wajdi.k.feghali@intel.com>;
> > > > Gopal, Vinodh <vinodh.gopal@intel.com>
> > > > Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress()
> > with
> > > > compress batching of large folios.
> > > >
> > > > On Tue, Dec 09, 2025 at 10:32:20AM +0800, Herbert Xu wrote:
> > > > > On Tue, Dec 09, 2025 at 01:15:02AM +0000, Yosry Ahmed wrote:
> > > > > >
> > > > > > Just to clarify, does this mean that zswap can pass a batch of (eight)
> > > > > > pages to the acomp API, and get the results for the batch uniformly
> > > > > > whether or not the underlying compressor supports batching?
> > > > >
> > > > > Correct.  In fact I'd like to remove the batch size exposure to zswap
> > > > > altogether.  zswap should just pass along whatever maximum number of
> > > > > pages that is convenient to itself.
> > > >
> > > > I think exposing the batch size is still useful as a hint for zswap. In
> > > > the current series, zswap allocates as many per-CPU buffers as the
> > > > compressor's batch size, so no extra buffers for non-batching
> > > > compressors (including SW compressors).
> > > >
> > > > If we use the same batch size regardless, we'll have to always allocate
> > > > 8 (or N) per-CPU buffers, for little to no benefit on non-batching
> > > > compressors.
> > > >
> > > > So we still want the batch size on the zswap side, but we want the
> > > > crypto API to be uniform whether or not the compressor supports
> > > > batching.
> > >
> > > Thanks Yosry, you bring up a good point. I currently have the outer for
> > > loop in zswap_compress() due to the above constraint. For non-batching
> > > compressors, we allocate only one per-CPU buffer. Hence, we need to
> > > call crypto_acomp_compress() and write the compressed data to the
> > > zs_poll for each page in the batch. Wouldn't we need to allocate
> > > 8 per-CPU buffers for non-batching compressors if we want zswap to
> > > send a batch of 8 pages uniformly to the crypto API, so that
> > > zswap_compress() can store the 8 pages in zs_pool after the crypto
> > > API returns?
> > 
> > Ugh, yes.. I don't think we want to burn 7 extra pages per-CPU for SW
> > compressors.
> > 
> > I think the cleanest way to handle this would be to:
> > - Rename zswap_compress() to __zswap_compress(), and make it handle a
> >   given batch size (which would be 1 or 8).
> > - Introduce zswap_compress() as a wrapper that breaks down the folio
> >   into batches and loops over them, passing them to __zswap_compress().
> > - __zswap_compress() has a single unified path (e.g. for compressed
> >   length and error handling), regardless of the batch size.
> > 
> > Can this be done with the current acomp API? I think all we really need
> > is to be able to pass in a batch of size N (which can be 1), and read
> > the error and compressed length in a single way. This is my main problem
> > with the current patch.
> 
> Once Herbert gives us the crypto_acomp modification for non-batching
> compressors to set the acomp_req->dst->length to the
> compressed length/error value, I think the same could be accomplished
> with the current patch, since I will be able to delete the "errp". IOW, I think
> a simplification is possible without introducing __zswap_compress(). The
> code will look seamless for non-batching and batching compressors, and the
> distinction will be made apparent by the outer for loop that iterates over
> the batch based on the pool->compr_batch_size in the current patch.

I think moving the outer loop outside to a wrapper could make the
function digestable without nested loops.

> 
> Alternately, we could introduce the __zswap_compress() that abstracts
> one single iteration through the outer for loop: it compresses 1 or 8 pages
> as a "batch". However, the distinction would still need to be made for
> non-batching vs. batching compressors in the zswap_compress() wrapper:
> both for sending the pool->compr_batch_size # of pages to
> __zswap_compress() and for iterating over the single/multiple dst buffers
> to write to zs_pool (the latter could be done within __zswap_compress(),
> but the point remains: we would need to distinguish in one or the other).

Not sure what you mean by the latter. IIUC, for all compressors
__zswap_compress() would iterate over the dst buffers and write to
zs_pool, whether the number of dst buffers is 1 or 8. So there wouldn't
be any different handling in __zswap_compress(), right?

That's my whole motivation for introducing a wrapper that abstracts away
the batching size.

> 
> It could be argued that keeping the seamless-ness in handling the calls to
> crypto based on the pool->compr_batch_size and the logical distinctions
> imposed by this in iterating over the output SG lists/buffers, would be
> cleaner being self-contained in zswap_compress(). We already have a
> zswap_store_pages() that processes the folio in batches. Maybe minimizing
> the functions that do batch processing could be cleaner?

Yeah it's not great that we'll end up with zswap_store_pages() splitting
the folio into batches of 8, then zswap_compress() further splitting
them into compression batches -- but we'll have that anyway. Whether
it's inside zswap_compress() or a wrapper doesn't make things much
different imo.

Also, splitting the folio differently at different levels make semantic
sense. zswap_store_pages() splits it into batches of 8, because this is
what zswap handles (mainly to avoid dynamically allocating things like
entries). zswap_compress() will split it further if the underlying
compressor prefers that, to avoid allocating many buffer pages. So I
think it kinda makes sense.

In the future, we can revisit the split in zswap_compress() if we have a
good case for batching compression for SW (e.g. compress every 8 pages
as a single unit), or if we can optimize the per-CPU buffers somehow.

> 
> In any case, let me know which would be preferable.
> 
> Thanks,
> Kanchana
> 
> > 
> > In the future, if it's beneifical for some SW compressors to batch
> > compressions, we can look into optimizations for the per-CPU buffers to
> > avoid allocating 8 pages per-CPU (e.g. shared page pool), or make this
> > opt-in for certain SW compressors that justify the cost.
> > 
> > >
> > > Thanks,
> > > Kanchana
> > >
> > > >
> > > > >
> > > > > Cheers,
> > > > > --
> > > > > Email: Herbert Xu <herbert@gondor.apana.org.au>
> > > > > Home Page: http://gondor.apana.org.au/~herbert/
> > > > > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

RE: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Sridhar, Kanchana P 2 months ago

> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Wednesday, December 10, 2025 8:02 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>; SeongJae Park
> <sj@kernel.org>; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; kasong@tencent.com; linux-
> crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> Kristen C <kristen.c.accardi@intel.com>; Gomes, Vinicius
> <vinicius.gomes@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
> compress batching of large folios.
> 
> On Tue, Dec 09, 2025 at 07:38:20PM +0000, Sridhar, Kanchana P wrote:
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > Sent: Tuesday, December 9, 2025 9:32 AM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > Cc: Herbert Xu <herbert@gondor.apana.org.au>; SeongJae Park
> > > <sj@kernel.org>; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > hannes@cmpxchg.org; nphamcs@gmail.com;
> chengming.zhou@linux.dev;
> > > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> > > ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> > > senozhatsky@chromium.org; kasong@tencent.com; linux-
> > > crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> > > ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> > > Kristen C <kristen.c.accardi@intel.com>; Gomes, Vinicius
> > > <vinicius.gomes@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>;
> > > Gopal, Vinodh <vinodh.gopal@intel.com>
> > > Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress()
> with
> > > compress batching of large folios.
> > >
> > > On Tue, Dec 09, 2025 at 05:21:06PM +0000, Sridhar, Kanchana P wrote:
> > > >
> > > > > -----Original Message-----
> > > > > From: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > > > Sent: Tuesday, December 9, 2025 8:55 AM
> > > > > To: Herbert Xu <herbert@gondor.apana.org.au>
> > > > > Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; SeongJae
> Park
> > > > > <sj@kernel.org>; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > > > hannes@cmpxchg.org; nphamcs@gmail.com;
> > > chengming.zhou@linux.dev;
> > > > > usamaarif642@gmail.com; ryan.roberts@arm.com;
> 21cnbao@gmail.com;
> > > > > ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> > > > > senozhatsky@chromium.org; kasong@tencent.com; linux-
> > > > > crypto@vger.kernel.org; davem@davemloft.net;
> clabbe@baylibre.com;
> > > > > ardb@kernel.org; ebiggers@google.com; surenb@google.com;
> Accardi,
> > > > > Kristen C <kristen.c.accardi@intel.com>; Gomes, Vinicius
> > > > > <vinicius.gomes@intel.com>; Feghali, Wajdi K
> > > <wajdi.k.feghali@intel.com>;
> > > > > Gopal, Vinodh <vinodh.gopal@intel.com>
> > > > > Subject: Re: [PATCH v13 22/22] mm: zswap: Batched
> zswap_compress()
> > > with
> > > > > compress batching of large folios.
> > > > >
> > > > > On Tue, Dec 09, 2025 at 10:32:20AM +0800, Herbert Xu wrote:
> > > > > > On Tue, Dec 09, 2025 at 01:15:02AM +0000, Yosry Ahmed wrote:
> > > > > > >
> > > > > > > Just to clarify, does this mean that zswap can pass a batch of
> (eight)
> > > > > > > pages to the acomp API, and get the results for the batch uniformly
> > > > > > > whether or not the underlying compressor supports batching?
> > > > > >
> > > > > > Correct.  In fact I'd like to remove the batch size exposure to zswap
> > > > > > altogether.  zswap should just pass along whatever maximum
> number of
> > > > > > pages that is convenient to itself.
> > > > >
> > > > > I think exposing the batch size is still useful as a hint for zswap. In
> > > > > the current series, zswap allocates as many per-CPU buffers as the
> > > > > compressor's batch size, so no extra buffers for non-batching
> > > > > compressors (including SW compressors).
> > > > >
> > > > > If we use the same batch size regardless, we'll have to always allocate
> > > > > 8 (or N) per-CPU buffers, for little to no benefit on non-batching
> > > > > compressors.
> > > > >
> > > > > So we still want the batch size on the zswap side, but we want the
> > > > > crypto API to be uniform whether or not the compressor supports
> > > > > batching.
> > > >
> > > > Thanks Yosry, you bring up a good point. I currently have the outer for
> > > > loop in zswap_compress() due to the above constraint. For non-batching
> > > > compressors, we allocate only one per-CPU buffer. Hence, we need to
> > > > call crypto_acomp_compress() and write the compressed data to the
> > > > zs_poll for each page in the batch. Wouldn't we need to allocate
> > > > 8 per-CPU buffers for non-batching compressors if we want zswap to
> > > > send a batch of 8 pages uniformly to the crypto API, so that
> > > > zswap_compress() can store the 8 pages in zs_pool after the crypto
> > > > API returns?
> > >
> > > Ugh, yes.. I don't think we want to burn 7 extra pages per-CPU for SW
> > > compressors.
> > >
> > > I think the cleanest way to handle this would be to:
> > > - Rename zswap_compress() to __zswap_compress(), and make it handle
> a
> > >   given batch size (which would be 1 or 8).
> > > - Introduce zswap_compress() as a wrapper that breaks down the folio
> > >   into batches and loops over them, passing them to __zswap_compress().
> > > - __zswap_compress() has a single unified path (e.g. for compressed
> > >   length and error handling), regardless of the batch size.
> > >
> > > Can this be done with the current acomp API? I think all we really need
> > > is to be able to pass in a batch of size N (which can be 1), and read
> > > the error and compressed length in a single way. This is my main problem
> > > with the current patch.
> >
> > Once Herbert gives us the crypto_acomp modification for non-batching
> > compressors to set the acomp_req->dst->length to the
> > compressed length/error value, I think the same could be accomplished
> > with the current patch, since I will be able to delete the "errp". IOW, I think
> > a simplification is possible without introducing __zswap_compress(). The
> > code will look seamless for non-batching and batching compressors, and the
> > distinction will be made apparent by the outer for loop that iterates over
> > the batch based on the pool->compr_batch_size in the current patch.
> 
> I think moving the outer loop outside to a wrapper could make the
> function digestable without nested loops.

Sure. We would still iterate over the output SG lists in __zswap_compress(),
but yes, there wouldn't be nested loops.

> 
> >
> > Alternately, we could introduce the __zswap_compress() that abstracts
> > one single iteration through the outer for loop: it compresses 1 or 8 pages
> > as a "batch". However, the distinction would still need to be made for
> > non-batching vs. batching compressors in the zswap_compress() wrapper:
> > both for sending the pool->compr_batch_size # of pages to
> > __zswap_compress() and for iterating over the single/multiple dst buffers
> > to write to zs_pool (the latter could be done within __zswap_compress(),
> > but the point remains: we would need to distinguish in one or the other).
> 
> Not sure what you mean by the latter. IIUC, for all compressors
> __zswap_compress() would iterate over the dst buffers and write to
> zs_pool, whether the number of dst buffers is 1 or 8. So there wouldn't
> be any different handling in __zswap_compress(), right?

Yes, this is correct.

> 
> That's my whole motivation for introducing a wrapper that abstracts away
> the batching size.

Yes, you're right.

> 
> >
> > It could be argued that keeping the seamless-ness in handling the calls to
> > crypto based on the pool->compr_batch_size and the logical distinctions
> > imposed by this in iterating over the output SG lists/buffers, would be
> > cleaner being self-contained in zswap_compress(). We already have a
> > zswap_store_pages() that processes the folio in batches. Maybe minimizing
> > the functions that do batch processing could be cleaner?
> 
> Yeah it's not great that we'll end up with zswap_store_pages() splitting
> the folio into batches of 8, then zswap_compress() further splitting
> them into compression batches -- but we'll have that anyway. Whether
> it's inside zswap_compress() or a wrapper doesn't make things much
> different imo.
> 
> Also, splitting the folio differently at different levels make semantic
> sense. zswap_store_pages() splits it into batches of 8, because this is
> what zswap handles (mainly to avoid dynamically allocating things like
> entries). zswap_compress() will split it further if the underlying
> compressor prefers that, to avoid allocating many buffer pages. So I
> think it kinda makes sense.

Agreed.

> 
> In the future, we can revisit the split in zswap_compress() if we have a
> good case for batching compression for SW (e.g. compress every 8 pages
> as a single unit), or if we can optimize the per-CPU buffers somehow.

Yes. Let me see how best the __zswap_compress() API can support this.

Thanks!
Kanchana

> 
> >
> > In any case, let me know which would be preferable.
> >
> > Thanks,
> > Kanchana
> >
> > >
> > > In the future, if it's beneifical for some SW compressors to batch
> > > compressions, we can look into optimizations for the per-CPU buffers to
> > > avoid allocating 8 pages per-CPU (e.g. shared page pool), or make this
> > > opt-in for certain SW compressors that justify the cost.
> > >
> > > >
> > > > Thanks,
> > > > Kanchana
> > > >
> > > > >
> > > > > >
> > > > > > Cheers,
> > > > > > --
> > > > > > Email: Herbert Xu <herbert@gondor.apana.org.au>
> > > > > > Home Page: http://gondor.apana.org.au/~herbert/
> > > > > > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

RE: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Sridhar, Kanchana P 2 months ago

> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Sunday, December 7, 2025 8:24 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Yosry Ahmed <yosry.ahmed@linux.dev>; SeongJae Park <sj@kernel.org>;
> linux-kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; kasong@tencent.com; linux-
> crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> Kristen C <kristen.c.accardi@intel.com>; Gomes, Vinicius
> <vinicius.gomes@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
> compress batching of large folios.
> 
> On Mon, Dec 08, 2025 at 04:17:38AM +0000, Sridhar, Kanchana P wrote:
> >
> > I see. So the way my patch-set tries to standardize batching in
> > zswap_compress() is to call it with a batch of 8 pages, regardless of batching
> > or non-batching compressors. In zswap_compress(), I presently iterate
> > through each page in the batch for sequential processing for non-batching
> > compressors whose batch size is 1. For batching compressors, the iteration
> > happens just once: the whole batch is compressed in one call to
> > crypto_acomp_compress().
> 
> Oh I wasn't aware of this.  In that case there is no need for me
> to delay the next step and we can do it straight away.

Sure, makes sense, thanks!

> 
> I had thought that the batch size was to limit the batching size
> to acomp.  But if it's not, perhaps we can remove the batch size
> exposure altogether.  IOW it would only be visible internally to
> the acomp API while the users such as zswap would simply batch
> things in whatever size that suits them.

Yes, I think this can be done. In case zswap sends a batch that is not
an integral multiple of the acomp algorithm's batch-size, we might
have to trade-off one sub-optimal batch (fewer pages than the alg's
batch-size) for a cleaner solution.

Thanks,
Kanchana

> 
> Thanks,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

RE: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Sridhar, Kanchana P 2 months, 3 weeks ago

> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Friday, November 14, 2025 7:38 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; SeongJae Park
> <sj@kernel.org>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
> compress batching of large folios.
> 
> On Fri, Nov 14, 2025 at 06:43:21AM +0000, Sridhar, Kanchana P wrote:
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > Sent: Thursday, November 13, 2025 9:52 PM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > hannes@cmpxchg.org; nphamcs@gmail.com;
> chengming.zhou@linux.dev;
> > > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> > > ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> > > senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> > > crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> > > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > > <kristen.c.accardi@intel.com>; Gomes, Vinicius
> <vinicius.gomes@intel.com>;
> > > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> > > <vinodh.gopal@intel.com>
> > > Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress()
> with
> > > compress batching of large folios.
> > >
> > > On Thu, Nov 13, 2025 at 11:55:10PM +0000, Sridhar, Kanchana P wrote:
> > > >
> > > > > -----Original Message-----
> > > > > From: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > > > Sent: Thursday, November 13, 2025 1:35 PM
> > > > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > > > hannes@cmpxchg.org; nphamcs@gmail.com;
> > > chengming.zhou@linux.dev;
> > > > > usamaarif642@gmail.com; ryan.roberts@arm.com;
> 21cnbao@gmail.com;
> > > > > ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> > > > > senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com;
> linux-
> > > > > crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> > > > > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > > > > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > > > > <kristen.c.accardi@intel.com>; Gomes, Vinicius
> > > <vinicius.gomes@intel.com>;
> > > > > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> > > > > <vinodh.gopal@intel.com>
> > > > > Subject: Re: [PATCH v13 22/22] mm: zswap: Batched
> zswap_compress()
> > > with
> > > > > compress batching of large folios.
> > > > >
> > > [..]
> > > > > > +		/*
> > > > > > +		 * If a page cannot be compressed into a size smaller
> than
> > > > > > +		 * PAGE_SIZE, save the content as is without a
> compression,
> > > > > to
> > > > > > +		 * keep the LRU order of writebacks.  If writeback is
> disabled,
> > > > > > +		 * reject the page since it only adds metadata
> overhead.
> > > > > > +		 * swap_writeout() will put the page back to the
> active LRU
> > > > > list
> > > > > > +		 * in the case.
> > > > > > +		 *
> > > > > > +		 * It is assumed that any compressor that sets the
> output
> > > > > length
> > > > > > +		 * to 0 or a value >= PAGE_SIZE will also return a
> negative
> > > > > > +		 * error status in @err; i.e, will not return a successful
> > > > > > +		 * compression status in @err in this case.
> > > > > > +		 */
> > > > >
> > > > > Ugh, checking the compression error and checking the compression
> length
> > > > > are now in separate places so we need to check if writeback is
> disabled
> > > > > in separate places and store the page as-is. It's ugly, and I think the
> > > > > current code is not correct.
> > > >
> > > > The code is 100% correct. You need to spend more time understanding
> > > > the code. I have stated my assumption above in the comments to
> > > > help in understanding the "why".
> > > >
> > > > From a maintainer, I would expect more responsible statements than
> > > > this. A flippant remark made without understanding the code (and,
> > > > disparaging the comments intended to help you do this), can impact
> > > > someone's career. I am held accountable in my job based on your
> > > > comments.
> > > >
> > > > That said, I have worked tirelessly and innovated to make the code
> > > > compliant with Herbert's suggestions (which btw have enabled an
> > > > elegant batching implementation and code commonality for IAA and
> > > > software compressors), validated it thoroughly for IAA and ZSTD to
> > > > ensure that both demonstrate performance improvements, which
> > > > are crucial for memory savings. I am proud of this work.
> > > >
> > > >
> > > > >
> > > > > > +		if (err && !wb_enabled)
> > > > > > +			goto compress_error;
> > > > > > +
> > > > > > +		for_each_sg(acomp_ctx->sg_outputs->sgl, sg,
> nr_comps, k) {
> > > > > > +			j = k + i;
> > > > >
> > > > > Please use meaningful iterator names rather than i, j, and k and the
> huge
> > > > > comment explaining what they are.
> > > >
> > > > I happen to have a different view: having longer iterator names firstly
> makes
> > > > code seem "verbose" and detracts from readability, not to mention
> > > exceeding the
> > > > 80-character line limit. The comments are essential for code
> maintainability
> > > > and avoid out-of-bounds errors when the next zswap developer wants
> to
> > > > optimize the code.
> > > >
> > > > One drawback of i/j/k iterators is mis-typing errors which cannot be
> caught
> > > > at compile time. Let me think some more about how to strike a good
> > > balance.
> > > >
> > > > >
> > > > > > +			dst = acomp_ctx->buffers[k];
> > > > > > +			dlen = sg->length | *errp;
> > > > >
> > > > > Why are we doing this?
> > > > >
> > > > > > +
> > > > > > +			if (dlen < 0) {
> > > > >
> > > > > We should do the incompressible page handling also if dlen is
> PAGE_SIZE,
> > > > > or if the compression failed (I guess that's the intention of bit OR'ing
> > > > > with *errp?)
> > > >
> > > > Yes, indeed: that's the intention of bit OR'ing with *errp.
> > >
> > > ..and you never really answered my question. In the exising code we
> > > store the page as incompressible if writeback is enabled AND
> > > crypto_wait_req() fails or dlen is zero or PAGE_SIZE. We check above
> > > if crypto_wait_req() fails and writeback is disabled, but what about the
> > > rest?
> >
> > Let me explain this some more. The new code only relies on the assumption
> > that if dlen is zero or >= PAGE_SIZE, the compressor will not return a 0
> > ("successful status"). In other words, the compressor will return an error
> status
> > in this case, which is expected to be a negative error code.
> 
> I am not sure if all compressors do that, especially for the case where
> dlen >= PAGE_SIZE. The existing code does not assume that there will be
> an error code in these cases.
> 
> For the dlen == 0 case, the check was introduced recently by commit
> dca4437a5861 ("mm/zswap: store <PAGE_SIZE compression failed page
> as-is"). Looking through the history it seems like it was introduced in
> v4 of that patch but I don't see the reasoning.

The existing code did not check for dlen == 0 and dlen >= PAGE_SIZE
prior to commit dca4437a5861 ("mm/zswap: store <PAGE_SIZE compression
failed page as-is") either. We need SeongJae or Herbert to clarify whether
this check is needed, or if it is sufficient to rely on comp_ret, the return from
crypto_wait_req().

> 
> SeongJae, did you observe any compressors returning dlen == 0 but no
> error code? What was the reasoning behind the dlen == 0 check?
> 
> >
> > Under these (hopefully valid) assumptions, the code handles the simple case
> > of an error compression return status and writeback is disabled, by the
> > "goto compress_error".
> >
> > The rest is handled by these:
> >
> > 1) First, I need to adapt the use of sg_outputs->sgl->length to represent the
> > compress length for software compressors, so I do this after
> crypto_wait_req()
> > returns:
> >
> >                 acomp_ctx->sg_outputs->sgl->length = acomp_ctx->req->dlen;
> 
> For SW compressors, why is acomp_ctx->sg_outputs->sgl->length not set?
> IIUC we are using the same API for SW and HW compressors, why is the
> output length in different places for each of them?

This is to first implement the SG lists batching interface in iaa_crypto, while
maintaining backward compatibility for SW compressors with the new API.
I believe we may want to adapt the crypto API to SW compressors
at a later point. I also believe this would be outside the scope of this patch.
It would be nice if Herbert can share his vision on this aspect.

> 
> >
> > I did not want to propose any changes to crypto software compressors
> protocols.
> >
> > 2) After the check for the "if (err && !wb_enabled)" case, the new code has
> this:
> >
> >                 for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> >                         j = k + i;
> >                         dst = acomp_ctx->buffers[k];
> >                         dlen = sg->length | *errp;
> >
> >                         if (dlen < 0) {
> >                                 dlen = PAGE_SIZE;
> >                                 dst = kmap_local_page(folio_page(folio, start + j));
> >                         }
> >
> > For batching compressors, namely, iaa_crypto, the individual output SG
> > lists sg->length follows the requirements from Herbert: each sg->length
> > is the compressed length or the error status (a negative error code).
> >
> > Then all I need to know whether to store the page as incompressible
> > is to either directly test if sg->length is negative (for batching compressors),
> > or sg->length bit-OR'ed with the crypto_wait_req() return status ("err")
> > is negative. This is accomplished by the "dlen = sg->length | *errp;".
> >
> > I believe this maintains backward compatibility with the existing code.
> > Please let me know if you agree.
> 
> For batching compressors, will 'err' be set as well, or just sg->length?
> If it's just sg->length, then we need to check again if WB is enabled
> here before storing the page uncompressed. Right?

iaa_crypto will set 'err' and set the sg->length as per the batching interface
spec from Herbert.

> 
> >
> > >
> > > We don't check again if writeback is enabled before storing the page is
> > > incompressible, and we do not check if dlen is zero or PAGE_SIZE. Are
> > > these cases no longer possible?
> >
> > Hope the above explanation clarifies things some more? These case
> > are possible, and as long as they return an error status, they should be
> > correctly handled by the new code.
> 
> As mentioned above, I am not sure if that's correct for dlen >=
> PAGE_SIZE.

We need to get clarity on this from SeongJae/Herbert.

> 
> >
> > >
> > > Also, why use errp, why not explicitly use the appropriate error code?
> > > It's also unclear to me why the error code is always zero with HW
> > > compression?
> >
> > This is because of the sg->length requirements (compressed length/error)
> > for the batching interface suggested by Herbert. Hence, I upfront define
> > err_sg to 0, and, set errp to &err_sg for batching compressors. For software
> > compressors, errp is set to &err, namely, the above check will always apply
> > the software compressor's error status to the compressed length via
> > the bit-OR to determine if the page needs to be stored uncompressed.
> 
> Thanks for the clarification. I understand that the error code has
> different sources for SW and HW compressors, but I do not like using
> errp as an indirection. It makes the code unclear. I would rather we
> explicitly check err for SW compressors and dlen for HW compressors.
> 
> That being said, I thought what Herbert suggested was that the same API
> is used for both SW and HW compressors. IOW, either way we submit a
> batch of pages (8 pages for SW compressors), and then the crypto API
> would either give the entire batch to the compressor if it supports
> batching, or loop over them internally and hand them page-by-page to
> the compressor.

That was not how I understood Herbert's suggestion for the batching interface.
He did suggest the following:

"Before the call to acomp, the destination SG list should contain as
many elements as the number of units.  On return, the dst lengths
should be stored in each destination SG entry."

I have incorporated this suggestion in the iaa_crypto driver. For SW
compressors, I have tried not to propose any API changes, while making
sure that the zswap changes for the SG lists batching API work as expected
for SW without too much special-casing code.

I suppose I always assumed that we would update SW compressors later,
and not as part of this patch-set.

> 
> This would simplify usage as we do not have to handle the differences in
> zswap.

That's the nice thing about SG lists - I think the zswap_compress() calls to
the new batching API appears agnostic to SW and HW compressors.
Other than the upfront "errp = (pool->compr_batch_size == 1) ? &err : &err_sg;"
the logical code organization of the new zswap_compress() is quite similar to
the existing code. The post-compress "dlen = sg->length | *errp;" handles the rest.

> 
> If that is not doable, at the very least the API should be consistent.
> Right now the error code and length are propagated differently to the
> caller based on whether or not the compressor support batching.

Hopefully this minor difference is transitional while we move zswap to
use the new batching interface, with the assumption that crypto SW API
can be updated later. We would need to get Herbert's thoughts on this.

> 
> >
> >
> > >
> > > >
> > > > >
> > > > > > +				dlen = PAGE_SIZE;
> > > > > > +				dst =
> kmap_local_page(folio_page(folio, start
> > > > > + j));
> > > > > > +			}
> > > > > > +
> > > > > > +			handle = zs_malloc(pool->zs_pool, dlen, gfp,
> nid);
> > > > > >
> > > > > > -	zs_obj_write(pool->zs_pool, handle, dst, dlen);
> > > > > > -	entry->handle = handle;
> > > > > > -	entry->length = dlen;
> > > > > > +			if (IS_ERR_VALUE(handle)) {
> > > > > > +				if (PTR_ERR((void *)handle) == -
> ENOSPC)
> > > > > > +
> 	zswap_reject_compress_poor++;
> > > > > > +				else
> > > > > > +					zswap_reject_alloc_fail++;
> > > > > >
> > > > > > -unlock:
> > > > > > -	if (mapped)
> > > > > > -		kunmap_local(dst);
> > > > > > -	if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
> > > > > > -		zswap_reject_compress_poor++;
> > > > > > -	else if (comp_ret)
> > > > > > -		zswap_reject_compress_fail++;
> > > > > > -	else if (alloc_ret)
> > > > > > -		zswap_reject_alloc_fail++;
> > > > > > +				goto err_unlock;
> > > > > > +			}
> > > > > > +
> > > > > > +			zs_obj_write(pool->zs_pool, handle, dst,
> dlen);
> > > > > > +			entries[j]->handle = handle;
> > > > > > +			entries[j]->length = dlen;
> > > > > > +			if (dst != acomp_ctx->buffers[k])
> > > > > > +				kunmap_local(dst);
> > > > > > +		}
> > > > > > +	} /* finished compress and store nr_pages. */
> > > > > > +
> > > > > > +	mutex_unlock(&acomp_ctx->mutex);
> > > > > > +	return true;
> > > > > > +
> > > > > > +compress_error:
> > > > > > +	for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> > > > > > +		if ((int)sg->length < 0) {
> > > > > > +			if ((int)sg->length == -ENOSPC)
> > > > > > +				zswap_reject_compress_poor++;
> > > > > > +			else
> > > > > > +				zswap_reject_compress_fail++;
> > > > > > +		}
> > > > > > +	}
> > > > > >
> > > > > > +err_unlock:
> > > > > >  	mutex_unlock(&acomp_ctx->mutex);
> > > > > > -	return comp_ret == 0 && alloc_ret == 0;
> > > > > > +	return false;
> > > > > >  }
> > > > > >
> > > > > >  static bool zswap_decompress(struct zswap_entry *entry, struct
> folio
> > > > > *folio)
> > > > > > @@ -1488,12 +1604,9 @@ static bool zswap_store_pages(struct
> folio
> > > > > *folio,
> > > > > >  		INIT_LIST_HEAD(&entries[i]->lru);
> > > > > >  	}
> > > > > >
> > > > > > -	for (i = 0; i < nr_pages; ++i) {
> > > > > > -		struct page *page = folio_page(folio, start + i);
> > > > > > -
> > > > > > -		if (!zswap_compress(page, entries[i], pool,
> wb_enabled))
> > > > > > -			goto store_pages_failed;
> > > > > > -	}
> > > > > > +	if (unlikely(!zswap_compress(folio, start, nr_pages, entries,
> pool,
> > > > > > +				     nid, wb_enabled)))
> > > > > > +		goto store_pages_failed;
> > > > > >
> > > > > >  	for (i = 0; i < nr_pages; ++i) {
> > > > > >  		struct zswap_entry *old, *entry = entries[i];
> > > > > > --
> > > > > > 2.27.0
> > > > > >

Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Yosry Ahmed 2 months, 3 weeks ago

On Fri, Nov 14, 2025 at 07:23:42PM +0000, Sridhar, Kanchana P wrote:
[..]
 > > > > >
> > > > > > > +		if (err && !wb_enabled)
> > > > > > > +			goto compress_error;
> > > > > > > +
> > > > > > > +		for_each_sg(acomp_ctx->sg_outputs->sgl, sg,
> > nr_comps, k) {
> > > > > > > +			j = k + i;
> > > > > >
[..]
> > > > >
> > > > > >
> > > > > > > +			dst = acomp_ctx->buffers[k];
> > > > > > > +			dlen = sg->length | *errp;
> > > > > >
> > > > > > Why are we doing this?
> > > > > >
> > > > > > > +
> > > > > > > +			if (dlen < 0) {
> > > > > >
> > > > > > We should do the incompressible page handling also if dlen is
> > PAGE_SIZE,
> > > > > > or if the compression failed (I guess that's the intention of bit OR'ing
> > > > > > with *errp?)
> > > > >
> > > > > Yes, indeed: that's the intention of bit OR'ing with *errp.
> > > >
> > > > ..and you never really answered my question. In the exising code we
> > > > store the page as incompressible if writeback is enabled AND
> > > > crypto_wait_req() fails or dlen is zero or PAGE_SIZE. We check above
> > > > if crypto_wait_req() fails and writeback is disabled, but what about the
> > > > rest?
> > >
> > > Let me explain this some more. The new code only relies on the assumption
> > > that if dlen is zero or >= PAGE_SIZE, the compressor will not return a 0
> > > ("successful status"). In other words, the compressor will return an error
> > status
> > > in this case, which is expected to be a negative error code.
> > 
> > I am not sure if all compressors do that, especially for the case where
> > dlen >= PAGE_SIZE. The existing code does not assume that there will be
> > an error code in these cases.
> > 
> > For the dlen == 0 case, the check was introduced recently by commit
> > dca4437a5861 ("mm/zswap: store <PAGE_SIZE compression failed page
> > as-is"). Looking through the history it seems like it was introduced in
> > v4 of that patch but I don't see the reasoning.
> 
> The existing code did not check for dlen == 0 and dlen >= PAGE_SIZE
> prior to commit dca4437a5861 ("mm/zswap: store <PAGE_SIZE compression
> failed page as-is") either. We need SeongJae or Herbert to clarify whether
> this check is needed, or if it is sufficient to rely on comp_ret, the return from
> crypto_wait_req().
> 
> > 
> > SeongJae, did you observe any compressors returning dlen == 0 but no
> > error code? What was the reasoning behind the dlen == 0 check?
> > 
> > >
> > > Under these (hopefully valid) assumptions, the code handles the simple case
> > > of an error compression return status and writeback is disabled, by the
> > > "goto compress_error".
> > >
> > > The rest is handled by these:
> > >
> > > 1) First, I need to adapt the use of sg_outputs->sgl->length to represent the
> > > compress length for software compressors, so I do this after
> > crypto_wait_req()
> > > returns:
> > >
> > >                 acomp_ctx->sg_outputs->sgl->length = acomp_ctx->req->dlen;
> > 
> > For SW compressors, why is acomp_ctx->sg_outputs->sgl->length not set?
> > IIUC we are using the same API for SW and HW compressors, why is the
> > output length in different places for each of them?
> 
> This is to first implement the SG lists batching interface in iaa_crypto, while
> maintaining backward compatibility for SW compressors with the new API.
> I believe we may want to adapt the crypto API to SW compressors
> at a later point. I also believe this would be outside the scope of this patch.
> It would be nice if Herbert can share his vision on this aspect.
> 
> > 
> > >
> > > I did not want to propose any changes to crypto software compressors
> > protocols.
> > >
> > > 2) After the check for the "if (err && !wb_enabled)" case, the new code has
> > this:
> > >
> > >                 for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> > >                         j = k + i;
> > >                         dst = acomp_ctx->buffers[k];
> > >                         dlen = sg->length | *errp;
> > >
> > >                         if (dlen < 0) {
> > >                                 dlen = PAGE_SIZE;
> > >                                 dst = kmap_local_page(folio_page(folio, start + j));
> > >                         }
> > >
> > > For batching compressors, namely, iaa_crypto, the individual output SG
> > > lists sg->length follows the requirements from Herbert: each sg->length
> > > is the compressed length or the error status (a negative error code).
> > >
> > > Then all I need to know whether to store the page as incompressible
> > > is to either directly test if sg->length is negative (for batching compressors),
> > > or sg->length bit-OR'ed with the crypto_wait_req() return status ("err")
> > > is negative. This is accomplished by the "dlen = sg->length | *errp;".
> > >
> > > I believe this maintains backward compatibility with the existing code.
> > > Please let me know if you agree.
> > 
> > For batching compressors, will 'err' be set as well, or just sg->length?
> > If it's just sg->length, then we need to check again if WB is enabled
> > here before storing the page uncompressed. Right?
> 
> iaa_crypto will set 'err' and set the sg->length as per the batching interface
> spec from Herbert.

So both 'err' and sg->length will contain the same error? In this case
why do we need to check if dlen < 0? Shouldn't checking 'err' be
sufficient? and it would work for both SW and HW and we wouldn't need
errp. Did I miss something?

> 
> > 
> > >
> > > >
> > > > We don't check again if writeback is enabled before storing the page is
> > > > incompressible, and we do not check if dlen is zero or PAGE_SIZE. Are
> > > > these cases no longer possible?
> > >
> > > Hope the above explanation clarifies things some more? These case
> > > are possible, and as long as they return an error status, they should be
> > > correctly handled by the new code.
> > 
> > As mentioned above, I am not sure if that's correct for dlen >=
> > PAGE_SIZE.
> 
> We need to get clarity on this from SeongJae/Herbert.
> 
> > 
> > >
> > > >
> > > > Also, why use errp, why not explicitly use the appropriate error code?
> > > > It's also unclear to me why the error code is always zero with HW
> > > > compression?
> > >
> > > This is because of the sg->length requirements (compressed length/error)
> > > for the batching interface suggested by Herbert. Hence, I upfront define
> > > err_sg to 0, and, set errp to &err_sg for batching compressors. For software
> > > compressors, errp is set to &err, namely, the above check will always apply
> > > the software compressor's error status to the compressed length via
> > > the bit-OR to determine if the page needs to be stored uncompressed.
> > 
> > Thanks for the clarification. I understand that the error code has
> > different sources for SW and HW compressors, but I do not like using
> > errp as an indirection. It makes the code unclear. I would rather we
> > explicitly check err for SW compressors and dlen for HW compressors.
> > 
> > That being said, I thought what Herbert suggested was that the same API
> > is used for both SW and HW compressors. IOW, either way we submit a
> > batch of pages (8 pages for SW compressors), and then the crypto API
> > would either give the entire batch to the compressor if it supports
> > batching, or loop over them internally and hand them page-by-page to
> > the compressor.
> 
> That was not how I understood Herbert's suggestion for the batching interface.
> He did suggest the following:
> 
> "Before the call to acomp, the destination SG list should contain as
> many elements as the number of units.  On return, the dst lengths
> should be stored in each destination SG entry."
> 
> I have incorporated this suggestion in the iaa_crypto driver. For SW
> compressors, I have tried not to propose any API changes, while making
> sure that the zswap changes for the SG lists batching API work as expected
> for SW without too much special-casing code.
> 
> I suppose I always assumed that we would update SW compressors later,
> and not as part of this patch-set.

I am not sure I understand what changes lie in the crypto layer and what
changes lie in the SW compressors. I am not suggesting we do any
modification to the SW compressors.

I imagined that the crypto layer would present a uniform API regardless
of whether or not the compressor supports batching. Ideally zswap would
pass in a batch to crypto and it would figure out if it needs to break
them down or not. Then the output length and errors would be presented
uniformly to the caller.

That being said, I am not at all familiar with how crypto works and how
straightforward that would be. Herbert, WDYT?

> 
> > 
> > This would simplify usage as we do not have to handle the differences in
> > zswap.
> 
> That's the nice thing about SG lists - I think the zswap_compress() calls to
> the new batching API appears agnostic to SW and HW compressors.
> Other than the upfront "errp = (pool->compr_batch_size == 1) ? &err : &err_sg;"
> the logical code organization of the new zswap_compress() is quite similar to
> the existing code. The post-compress "dlen = sg->length | *errp;" handles the rest.

It would be even nicer if the batches are also abstracted by SG lists.

Also, I don't like how the error codes and output lengths are presented
differently for HW and SW compressors.

RE: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Sridhar, Kanchana P 2 months, 3 weeks ago

> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Friday, November 14, 2025 11:44 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: SeongJae Park <sj@kernel.org>; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org; hannes@cmpxchg.org; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; kasong@tencent.com; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
> compress batching of large folios.
> 
> On Fri, Nov 14, 2025 at 07:23:42PM +0000, Sridhar, Kanchana P wrote:
> [..]
>  > > > > >
> > > > > > > > +		if (err && !wb_enabled)
> > > > > > > > +			goto compress_error;
> > > > > > > > +
> > > > > > > > +		for_each_sg(acomp_ctx->sg_outputs->sgl, sg,
> > > nr_comps, k) {
> > > > > > > > +			j = k + i;
> > > > > > >
> [..]
> > > > > >
> > > > > > >
> > > > > > > > +			dst = acomp_ctx->buffers[k];
> > > > > > > > +			dlen = sg->length | *errp;
> > > > > > >
> > > > > > > Why are we doing this?
> > > > > > >
> > > > > > > > +
> > > > > > > > +			if (dlen < 0) {
> > > > > > >
> > > > > > > We should do the incompressible page handling also if dlen is
> > > PAGE_SIZE,
> > > > > > > or if the compression failed (I guess that's the intention of bit
> OR'ing
> > > > > > > with *errp?)
> > > > > >
> > > > > > Yes, indeed: that's the intention of bit OR'ing with *errp.
> > > > >
> > > > > ..and you never really answered my question. In the exising code we
> > > > > store the page as incompressible if writeback is enabled AND
> > > > > crypto_wait_req() fails or dlen is zero or PAGE_SIZE. We check above
> > > > > if crypto_wait_req() fails and writeback is disabled, but what about the
> > > > > rest?
> > > >
> > > > Let me explain this some more. The new code only relies on the
> assumption
> > > > that if dlen is zero or >= PAGE_SIZE, the compressor will not return a 0
> > > > ("successful status"). In other words, the compressor will return an error
> > > status
> > > > in this case, which is expected to be a negative error code.
> > >
> > > I am not sure if all compressors do that, especially for the case where
> > > dlen >= PAGE_SIZE. The existing code does not assume that there will be
> > > an error code in these cases.
> > >
> > > For the dlen == 0 case, the check was introduced recently by commit
> > > dca4437a5861 ("mm/zswap: store <PAGE_SIZE compression failed page
> > > as-is"). Looking through the history it seems like it was introduced in
> > > v4 of that patch but I don't see the reasoning.
> >
> > The existing code did not check for dlen == 0 and dlen >= PAGE_SIZE
> > prior to commit dca4437a5861 ("mm/zswap: store <PAGE_SIZE compression
> > failed page as-is") either. We need SeongJae or Herbert to clarify whether
> > this check is needed, or if it is sufficient to rely on comp_ret, the return from
> > crypto_wait_req().
> >
> > >
> > > SeongJae, did you observe any compressors returning dlen == 0 but no
> > > error code? What was the reasoning behind the dlen == 0 check?
> > >
> > > >
> > > > Under these (hopefully valid) assumptions, the code handles the simple
> case
> > > > of an error compression return status and writeback is disabled, by the
> > > > "goto compress_error".
> > > >
> > > > The rest is handled by these:
> > > >
> > > > 1) First, I need to adapt the use of sg_outputs->sgl->length to represent
> the
> > > > compress length for software compressors, so I do this after
> > > crypto_wait_req()
> > > > returns:
> > > >
> > > >                 acomp_ctx->sg_outputs->sgl->length = acomp_ctx->req->dlen;
> > >
> > > For SW compressors, why is acomp_ctx->sg_outputs->sgl->length not set?
> > > IIUC we are using the same API for SW and HW compressors, why is the
> > > output length in different places for each of them?
> >
> > This is to first implement the SG lists batching interface in iaa_crypto, while
> > maintaining backward compatibility for SW compressors with the new API.
> > I believe we may want to adapt the crypto API to SW compressors
> > at a later point. I also believe this would be outside the scope of this patch.
> > It would be nice if Herbert can share his vision on this aspect.
> >
> > >
> > > >
> > > > I did not want to propose any changes to crypto software compressors
> > > protocols.
> > > >
> > > > 2) After the check for the "if (err && !wb_enabled)" case, the new code
> has
> > > this:
> > > >
> > > >                 for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> > > >                         j = k + i;
> > > >                         dst = acomp_ctx->buffers[k];
> > > >                         dlen = sg->length | *errp;
> > > >
> > > >                         if (dlen < 0) {
> > > >                                 dlen = PAGE_SIZE;
> > > >                                 dst = kmap_local_page(folio_page(folio, start + j));
> > > >                         }
> > > >
> > > > For batching compressors, namely, iaa_crypto, the individual output SG
> > > > lists sg->length follows the requirements from Herbert: each sg->length
> > > > is the compressed length or the error status (a negative error code).
> > > >
> > > > Then all I need to know whether to store the page as incompressible
> > > > is to either directly test if sg->length is negative (for batching
> compressors),
> > > > or sg->length bit-OR'ed with the crypto_wait_req() return status ("err")
> > > > is negative. This is accomplished by the "dlen = sg->length | *errp;".
> > > >
> > > > I believe this maintains backward compatibility with the existing code.
> > > > Please let me know if you agree.
> > >
> > > For batching compressors, will 'err' be set as well, or just sg->length?
> > > If it's just sg->length, then we need to check again if WB is enabled
> > > here before storing the page uncompressed. Right?
> >
> > iaa_crypto will set 'err' and set the sg->length as per the batching interface
> > spec from Herbert.
> 
> So both 'err' and sg->length will contain the same error? In this case
> why do we need to check if dlen < 0? Shouldn't checking 'err' be
> sufficient? and it would work for both SW and HW and we wouldn't need
> errp. Did I miss something?

Great question. For a batching compressor, 'err' will contain an error if any
page in the batch had a compression error. This allows the early bail-out
path for SW and HW compressors if writeback is not enabled for the folio.

Only the specific pages' sg->length will contain an error code. The other
batch pages that compressed fine will have the compressed length in
sg->length. This enables the post-compression loop with the errp check
bit-ORed with the sg->length, which for SW, has been brought up to date
with the acomp_req->dlen before we get to the wb_enabled code path.

> 
> >
> > >
> > > >
> > > > >
> > > > > We don't check again if writeback is enabled before storing the page is
> > > > > incompressible, and we do not check if dlen is zero or PAGE_SIZE. Are
> > > > > these cases no longer possible?
> > > >
> > > > Hope the above explanation clarifies things some more? These case
> > > > are possible, and as long as they return an error status, they should be
> > > > correctly handled by the new code.
> > >
> > > As mentioned above, I am not sure if that's correct for dlen >=
> > > PAGE_SIZE.
> >
> > We need to get clarity on this from SeongJae/Herbert.
> >
> > >
> > > >
> > > > >
> > > > > Also, why use errp, why not explicitly use the appropriate error code?
> > > > > It's also unclear to me why the error code is always zero with HW
> > > > > compression?
> > > >
> > > > This is because of the sg->length requirements (compressed
> length/error)
> > > > for the batching interface suggested by Herbert. Hence, I upfront define
> > > > err_sg to 0, and, set errp to &err_sg for batching compressors. For
> software
> > > > compressors, errp is set to &err, namely, the above check will always
> apply
> > > > the software compressor's error status to the compressed length via
> > > > the bit-OR to determine if the page needs to be stored uncompressed.
> > >
> > > Thanks for the clarification. I understand that the error code has
> > > different sources for SW and HW compressors, but I do not like using
> > > errp as an indirection. It makes the code unclear. I would rather we
> > > explicitly check err for SW compressors and dlen for HW compressors.
> > >
> > > That being said, I thought what Herbert suggested was that the same API
> > > is used for both SW and HW compressors. IOW, either way we submit a
> > > batch of pages (8 pages for SW compressors), and then the crypto API
> > > would either give the entire batch to the compressor if it supports
> > > batching, or loop over them internally and hand them page-by-page to
> > > the compressor.
> >
> > That was not how I understood Herbert's suggestion for the batching
> interface.
> > He did suggest the following:
> >
> > "Before the call to acomp, the destination SG list should contain as
> > many elements as the number of units.  On return, the dst lengths
> > should be stored in each destination SG entry."
> >
> > I have incorporated this suggestion in the iaa_crypto driver. For SW
> > compressors, I have tried not to propose any API changes, while making
> > sure that the zswap changes for the SG lists batching API work as expected
> > for SW without too much special-casing code.
> >
> > I suppose I always assumed that we would update SW compressors later,
> > and not as part of this patch-set.
> 
> I am not sure I understand what changes lie in the crypto layer and what
> changes lie in the SW compressors. I am not suggesting we do any
> modification to the SW compressors.
> 
> I imagined that the crypto layer would present a uniform API regardless
> of whether or not the compressor supports batching. Ideally zswap would
> pass in a batch to crypto and it would figure out if it needs to break
> them down or not. Then the output length and errors would be presented
> uniformly to the caller.

From my understanding, this would require changes to the crypto layer for
SW compressors, which again IIUC, does not set the sg->length, only sets
the acomp_req->dlen (IIUC, a temporary state until crypto for SW also uses
SG lists).

Ideally, batching could be handled similarly by crypto for SW. I believe we
will get there, albeit outside the scope of this patch.

> 
> That being said, I am not at all familiar with how crypto works and how
> straightforward that would be. Herbert, WDYT?
> 
> >
> > >
> > > This would simplify usage as we do not have to handle the differences in
> > > zswap.
> >
> > That's the nice thing about SG lists - I think the zswap_compress() calls to
> > the new batching API appears agnostic to SW and HW compressors.
> > Other than the upfront "errp = (pool->compr_batch_size == 1) ? &err :
> &err_sg;"
> > the logical code organization of the new zswap_compress() is quite similar to
> > the existing code. The post-compress "dlen = sg->length | *errp;" handles
> the rest.
> 
> It would be even nicer if the batches are also abstracted by SG lists.
> 
> Also, I don't like how the error codes and output lengths are presented
> differently for HW and SW compressors.

I do believe this is short-term and is the first step in implementing batching
in zswap. We should get Herbert's thoughts on this.

Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Yosry Ahmed 2 months, 3 weeks ago

On Fri, Nov 14, 2025 at 07:59:57PM +0000, Sridhar, Kanchana P wrote:
> 
> > -----Original Message-----
> > From: Yosry Ahmed <yosry.ahmed@linux.dev>
> > Sent: Friday, November 14, 2025 11:44 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: SeongJae Park <sj@kernel.org>; linux-kernel@vger.kernel.org; linux-
> > mm@kvack.org; hannes@cmpxchg.org; nphamcs@gmail.com;
> > chengming.zhou@linux.dev; usamaarif642@gmail.com;
> > ryan.roberts@arm.com; 21cnbao@gmail.com;
> > ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> > senozhatsky@chromium.org; kasong@tencent.com; linux-
> > crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> > <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
> > compress batching of large folios.
> > 
> > On Fri, Nov 14, 2025 at 07:23:42PM +0000, Sridhar, Kanchana P wrote:
> > [..]
> >  > > > > >
> > > > > > > > > +		if (err && !wb_enabled)
> > > > > > > > > +			goto compress_error;
> > > > > > > > > +
> > > > > > > > > +		for_each_sg(acomp_ctx->sg_outputs->sgl, sg,
> > > > nr_comps, k) {
> > > > > > > > > +			j = k + i;
> > > > > > > >
> > [..]
> > > > > > >
> > > > > > > >
> > > > > > > > > +			dst = acomp_ctx->buffers[k];
> > > > > > > > > +			dlen = sg->length | *errp;
> > > > > > > >
> > > > > > > > Why are we doing this?
> > > > > > > >
> > > > > > > > > +
> > > > > > > > > +			if (dlen < 0) {
> > > > > > > >
> > > > > > > > We should do the incompressible page handling also if dlen is
> > > > PAGE_SIZE,
> > > > > > > > or if the compression failed (I guess that's the intention of bit
> > OR'ing
> > > > > > > > with *errp?)
> > > > > > >
> > > > > > > Yes, indeed: that's the intention of bit OR'ing with *errp.
> > > > > >
> > > > > > ..and you never really answered my question. In the exising code we
> > > > > > store the page as incompressible if writeback is enabled AND
> > > > > > crypto_wait_req() fails or dlen is zero or PAGE_SIZE. We check above
> > > > > > if crypto_wait_req() fails and writeback is disabled, but what about the
> > > > > > rest?
> > > > >
> > > > > Let me explain this some more. The new code only relies on the
> > assumption
> > > > > that if dlen is zero or >= PAGE_SIZE, the compressor will not return a 0
> > > > > ("successful status"). In other words, the compressor will return an error
> > > > status
> > > > > in this case, which is expected to be a negative error code.
> > > >
> > > > I am not sure if all compressors do that, especially for the case where
> > > > dlen >= PAGE_SIZE. The existing code does not assume that there will be
> > > > an error code in these cases.
> > > >
> > > > For the dlen == 0 case, the check was introduced recently by commit
> > > > dca4437a5861 ("mm/zswap: store <PAGE_SIZE compression failed page
> > > > as-is"). Looking through the history it seems like it was introduced in
> > > > v4 of that patch but I don't see the reasoning.
> > >
> > > The existing code did not check for dlen == 0 and dlen >= PAGE_SIZE
> > > prior to commit dca4437a5861 ("mm/zswap: store <PAGE_SIZE compression
> > > failed page as-is") either. We need SeongJae or Herbert to clarify whether
> > > this check is needed, or if it is sufficient to rely on comp_ret, the return from
> > > crypto_wait_req().
> > >
> > > >
> > > > SeongJae, did you observe any compressors returning dlen == 0 but no
> > > > error code? What was the reasoning behind the dlen == 0 check?
> > > >
> > > > >
> > > > > Under these (hopefully valid) assumptions, the code handles the simple
> > case
> > > > > of an error compression return status and writeback is disabled, by the
> > > > > "goto compress_error".
> > > > >
> > > > > The rest is handled by these:
> > > > >
> > > > > 1) First, I need to adapt the use of sg_outputs->sgl->length to represent
> > the
> > > > > compress length for software compressors, so I do this after
> > > > crypto_wait_req()
> > > > > returns:
> > > > >
> > > > >                 acomp_ctx->sg_outputs->sgl->length = acomp_ctx->req->dlen;
> > > >
> > > > For SW compressors, why is acomp_ctx->sg_outputs->sgl->length not set?
> > > > IIUC we are using the same API for SW and HW compressors, why is the
> > > > output length in different places for each of them?
> > >
> > > This is to first implement the SG lists batching interface in iaa_crypto, while
> > > maintaining backward compatibility for SW compressors with the new API.
> > > I believe we may want to adapt the crypto API to SW compressors
> > > at a later point. I also believe this would be outside the scope of this patch.
> > > It would be nice if Herbert can share his vision on this aspect.
> > >
> > > >
> > > > >
> > > > > I did not want to propose any changes to crypto software compressors
> > > > protocols.
> > > > >
> > > > > 2) After the check for the "if (err && !wb_enabled)" case, the new code
> > has
> > > > this:
> > > > >
> > > > >                 for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> > > > >                         j = k + i;
> > > > >                         dst = acomp_ctx->buffers[k];
> > > > >                         dlen = sg->length | *errp;
> > > > >
> > > > >                         if (dlen < 0) {
> > > > >                                 dlen = PAGE_SIZE;
> > > > >                                 dst = kmap_local_page(folio_page(folio, start + j));
> > > > >                         }
> > > > >
> > > > > For batching compressors, namely, iaa_crypto, the individual output SG
> > > > > lists sg->length follows the requirements from Herbert: each sg->length
> > > > > is the compressed length or the error status (a negative error code).
> > > > >
> > > > > Then all I need to know whether to store the page as incompressible
> > > > > is to either directly test if sg->length is negative (for batching
> > compressors),
> > > > > or sg->length bit-OR'ed with the crypto_wait_req() return status ("err")
> > > > > is negative. This is accomplished by the "dlen = sg->length | *errp;".
> > > > >
> > > > > I believe this maintains backward compatibility with the existing code.
> > > > > Please let me know if you agree.
> > > >
> > > > For batching compressors, will 'err' be set as well, or just sg->length?
> > > > If it's just sg->length, then we need to check again if WB is enabled
> > > > here before storing the page uncompressed. Right?
> > >
> > > iaa_crypto will set 'err' and set the sg->length as per the batching interface
> > > spec from Herbert.
> > 
> > So both 'err' and sg->length will contain the same error? In this case
> > why do we need to check if dlen < 0? Shouldn't checking 'err' be
> > sufficient? and it would work for both SW and HW and we wouldn't need
> > errp. Did I miss something?
> 
> Great question. For a batching compressor, 'err' will contain an error if any
> page in the batch had a compression error. This allows the early bail-out
> path for SW and HW compressors if writeback is not enabled for the folio.
> 
> Only the specific pages' sg->length will contain an error code. The other
> batch pages that compressed fine will have the compressed length in
> sg->length. This enables the post-compression loop with the errp check
> bit-ORed with the sg->length, which for SW, has been brought up to date
> with the acomp_req->dlen before we get to the wb_enabled code path.
> 
> > 
> > >
> > > >
> > > > >
> > > > > >
> > > > > > We don't check again if writeback is enabled before storing the page is
> > > > > > incompressible, and we do not check if dlen is zero or PAGE_SIZE. Are
> > > > > > these cases no longer possible?
> > > > >
> > > > > Hope the above explanation clarifies things some more? These case
> > > > > are possible, and as long as they return an error status, they should be
> > > > > correctly handled by the new code.
> > > >
> > > > As mentioned above, I am not sure if that's correct for dlen >=
> > > > PAGE_SIZE.
> > >
> > > We need to get clarity on this from SeongJae/Herbert.
> > >
> > > >
> > > > >
> > > > > >
> > > > > > Also, why use errp, why not explicitly use the appropriate error code?
> > > > > > It's also unclear to me why the error code is always zero with HW
> > > > > > compression?
> > > > >
> > > > > This is because of the sg->length requirements (compressed
> > length/error)
> > > > > for the batching interface suggested by Herbert. Hence, I upfront define
> > > > > err_sg to 0, and, set errp to &err_sg for batching compressors. For
> > software
> > > > > compressors, errp is set to &err, namely, the above check will always
> > apply
> > > > > the software compressor's error status to the compressed length via
> > > > > the bit-OR to determine if the page needs to be stored uncompressed.
> > > >
> > > > Thanks for the clarification. I understand that the error code has
> > > > different sources for SW and HW compressors, but I do not like using
> > > > errp as an indirection. It makes the code unclear. I would rather we
> > > > explicitly check err for SW compressors and dlen for HW compressors.
> > > >
> > > > That being said, I thought what Herbert suggested was that the same API
> > > > is used for both SW and HW compressors. IOW, either way we submit a
> > > > batch of pages (8 pages for SW compressors), and then the crypto API
> > > > would either give the entire batch to the compressor if it supports
> > > > batching, or loop over them internally and hand them page-by-page to
> > > > the compressor.
> > >
> > > That was not how I understood Herbert's suggestion for the batching
> > interface.
> > > He did suggest the following:
> > >
> > > "Before the call to acomp, the destination SG list should contain as
> > > many elements as the number of units.  On return, the dst lengths
> > > should be stored in each destination SG entry."
> > >
> > > I have incorporated this suggestion in the iaa_crypto driver. For SW
> > > compressors, I have tried not to propose any API changes, while making
> > > sure that the zswap changes for the SG lists batching API work as expected
> > > for SW without too much special-casing code.
> > >
> > > I suppose I always assumed that we would update SW compressors later,
> > > and not as part of this patch-set.
> > 
> > I am not sure I understand what changes lie in the crypto layer and what
> > changes lie in the SW compressors. I am not suggesting we do any
> > modification to the SW compressors.
> > 
> > I imagined that the crypto layer would present a uniform API regardless
> > of whether or not the compressor supports batching. Ideally zswap would
> > pass in a batch to crypto and it would figure out if it needs to break
> > them down or not. Then the output length and errors would be presented
> > uniformly to the caller.
> 
> From my understanding, this would require changes to the crypto layer for
> SW compressors, which again IIUC, does not set the sg->length, only sets
> the acomp_req->dlen (IIUC, a temporary state until crypto for SW also uses
> SG lists).
> 
> Ideally, batching could be handled similarly by crypto for SW. I believe we
> will get there, albeit outside the scope of this patch.
> 
> > 
> > That being said, I am not at all familiar with how crypto works and how
> > straightforward that would be. Herbert, WDYT?
> > 
> > >
> > > >
> > > > This would simplify usage as we do not have to handle the differences in
> > > > zswap.
> > >
> > > That's the nice thing about SG lists - I think the zswap_compress() calls to
> > > the new batching API appears agnostic to SW and HW compressors.
> > > Other than the upfront "errp = (pool->compr_batch_size == 1) ? &err :
> > &err_sg;"
> > > the logical code organization of the new zswap_compress() is quite similar to
> > > the existing code. The post-compress "dlen = sg->length | *errp;" handles
> > the rest.
> > 
> > It would be even nicer if the batches are also abstracted by SG lists.
> > 
> > Also, I don't like how the error codes and output lengths are presented
> > differently for HW and SW compressors.
> 
> I do believe this is short-term and is the first step in implementing batching
> in zswap. We should get Herbert's thoughts on this.

If we have to keep the different approaches for now I would still like
to simplify the error handling. We should remove errp and explicitly
check sg->length or 'err' based on the batch size.

Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Yosry Ahmed 2 months, 3 weeks ago

On Thu, Nov 13, 2025 at 11:55:10PM +0000, Sridhar, Kanchana P wrote:
> 
> > -----Original Message-----
> > From: Yosry Ahmed <yosry.ahmed@linux.dev>
> > Sent: Thursday, November 13, 2025 1:35 PM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> > ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> > senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> > crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> > <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
> > compress batching of large folios.
> > 
> > On Tue, Nov 04, 2025 at 01:12:35AM -0800, Kanchana P Sridhar wrote:
> > > This patch introduces a new unified implementation of zswap_compress()
> > > for compressors that do and do not support batching. This eliminates
> > > code duplication and facilitates code maintainability with the
> > > introduction of compress batching.
> > >
> > > The vectorized implementation of calling the earlier zswap_compress()
> > > sequentially, one page at a time in zswap_store_pages(), is replaced
> > > with this new version of zswap_compress() that accepts multiple pages to
> > > compress as a batch.
> > >
> > > If the compressor does not support batching, each page in the batch is
> > > compressed and stored sequentially. If the compressor supports batching,
> > > for e.g., 'deflate-iaa', the Intel IAA hardware accelerator, the batch
> > > is compressed in parallel in hardware. If the batch is compressed
> > > without errors, the compressed buffers are then stored in zsmalloc. In
> > > case of compression errors, the current behavior is preserved for the
> > > batching zswap_compress(): if the folio's memcg is writeback enabled,
> > > pages with compression errors are store uncompressed in zsmalloc; if
> > > not, we return an error for the folio in zswap_store().
> > >
> > > As per Herbert's suggestion in [1] for batching to be based on SG lists
> > > to interface with the crypto API, a "struct sg_table *sg_outputs" is
> > > added to the per-CPU acomp_ctx. In zswap_cpu_comp_prepare(), memory
> > is
> > > allocated for @pool->compr_batch_size scatterlists in
> > > @acomp_ctx->sg_outputs. The per-CPU @acomp_ctx->buffers' addresses
> > are
> > > statically mapped to the respective SG lists. The existing non-NUMA
> > > sg_alloc_table() was found to give better performance than a NUMA-aware
> > > allocation function, hence is used in this patch.
> > >
> > > Batching compressors should initialize the output SG lengths to
> > > PAGE_SIZE as part of the internal compress batching setup, to avoid
> > > having to do multiple traversals over the @acomp_ctx->sg_outputs->sgl.
> > > This is exactly how batching is implemented in the iaa_crypto driver's
> > > compress batching procedure, iaa_comp_acompress_batch().
> > >
> > > The batched zswap_compress() implementation is generalized as much as
> > > possible for non-batching and batching compressors, so that the
> > > subsequent incompressible page handling, zs_pool writes, and error
> > > handling code is seamless for both, without the use of conditionals to
> > > switch to specialized code for either.
> > >
> > > The new batching implementation of zswap_compress() is called with a
> > > batch of @nr_pages sent from zswap_store() to zswap_store_pages().
> > > zswap_compress() steps through the batch in increments of the
> > > compressor's batch-size, sets up the acomp_ctx->req's src/dst SG lists
> > > to contain the folio pages and output buffers, before calling
> > > crypto_acomp_compress().
> > >
> > > Some important requirements of this batching architecture for batching
> > > compressors:
> > >
> > >   1) The output SG lengths for each sg in the acomp_req->dst should be
> > >      intialized to PAGE_SIZE as part of other batch setup in the batch
> > >      compression function. zswap will not take care of this in the
> > >      interest of avoiding repetitive traversals of the
> > >      @acomp_ctx->sg_outputs->sgl so as to not lose the benefits of
> > >      batching.
> > >
> > >   2) In case of a compression error for any page in the batch, the
> > >      batching compressor should set the corresponding @sg->length to a
> > >      negative error number, as suggested by Herbert. Otherwise, the
> > >      @sg->length will contain the compressed output length.
> > >
> > >   3) Batching compressors should set acomp_req->dlen to
> > >      acomp_req->dst->length, i.e., the sg->length of the first SG in
> > >      acomp_req->dst.
> > >
> > > Another important change this patch makes is with the acomp_ctx mutex
> > > locking in zswap_compress(). Earlier, the mutex was held per page's
> > > compression. With the new code, [un]locking the mutex per page caused
> > > regressions for software compressors when testing with 30 usemem
> > > processes, and also kernel compilation with 'allmod' config. The
> > > regressions were more eggregious when PMD folios were stored. The
> > > implementation in this commit locks/unlocks the mutex once per batch,
> > > that resolves the regression.
> > >
> > > Architectural considerations for the zswap batching framework:
> > >
> > ==============================================================
> > > We have designed the zswap batching framework to be
> > > hardware-agnostic. It has no dependencies on Intel-specific features and
> > > can be leveraged by any hardware accelerator or software-based
> > > compressor. In other words, the framework is open and inclusive by
> > > design.
> > >
> > > Other ongoing work that can use batching:
> > > =========================================
> > > This patch-series demonstrates the performance benefits of compress
> > > batching when used in zswap_store() of large folios. shrink_folio_list()
> > > "reclaim batching" of any-order folios is the major next work that uses
> > > the zswap compress batching framework: our testing of kernel_compilation
> > > with writeback and the zswap shrinker indicates 10X fewer pages get
> > > written back when we reclaim 32 folios as a batch, as compared to one
> > > folio at a time: this is with deflate-iaa and with zstd. We expect to
> > > submit a patch-series with this data and the resulting performance
> > > improvements shortly. Reclaim batching relieves memory pressure faster
> > > than reclaiming one folio at a time, hence alleviates the need to scan
> > > slab memory for writeback.
> > >
> > > Nhat has given ideas on using batching with the ongoing kcompressd work,
> > > as well as beneficially using decompression batching & block IO batching
> > > to improve zswap writeback efficiency.
> > >
> > > Experiments that combine zswap compress batching, reclaim batching,
> > > swapin_readahead() decompression batching of prefetched pages, and
> > > writeback batching show that 0 pages are written back with deflate-iaa
> > > and zstd. For comparison, the baselines for these compressors see
> > > 200K-800K pages written to disk (kernel compilation 'allmod' config).
> > >
> > > To summarize, these are future clients of the batching framework:
> > >
> > >    - shrink_folio_list() reclaim batching of multiple folios:
> > >        Implemented, will submit patch-series.
> > >    - zswap writeback with decompress batching:
> > >        Implemented, will submit patch-series.
> > >    - zram:
> > >        Implemented, will submit patch-series.
> > >    - kcompressd:
> > >        Not yet implemented.
> > >    - file systems:
> > >        Not yet implemented.
> > >    - swapin_readahead() decompression batching of prefetched pages:
> > >        Implemented, will submit patch-series.
> > >
> > > Additionally, any place we have folios that need to be compressed, can
> > > potentially be parallelized.
> > >
> > > Performance data:
> > > =================
> > >
> > > As suggested by Barry, this is the performance data gathered on Intel
> > > Sapphire Rapids with usemem 30 processes running at 50% memory
> > pressure
> > > and kernel_compilation/allmod config run with 2G limit using 32
> > > threads. To keep comparisons simple, all testing was done without the
> > > zswap shrinker.
> > >
> > >   usemem30 with 64K folios:
> > >   =========================
> > >
> > >      zswap shrinker_enabled = N.
> > >
> > >      -----------------------------------------------------------------------
> > >                      mm-unstable-10-24-2025             v13
> > >      -----------------------------------------------------------------------
> > >      zswap compressor          deflate-iaa     deflate-iaa   IAA Batching
> > >                                                                  vs.
> > >                                                              IAA Sequential
> > >      -----------------------------------------------------------------------
> > >      Total throughput (KB/s)     6,118,675       9,901,216       62%
> > >      Average throughput (KB/s)     203,955         330,040       62%
> > >      elapsed time (sec)              98.94           70.90      -28%
> > >      sys time (sec)               2,379.29        1,686.18      -29%
> > >      -----------------------------------------------------------------------
> > >
> > >      -----------------------------------------------------------------------
> > >                      mm-unstable-10-24-2025             v13
> > >      -----------------------------------------------------------------------
> > >      zswap compressor                 zstd            zstd   v13 zstd
> > >                                                              improvement
> > >      -----------------------------------------------------------------------
> > >      Total throughput (KB/s)     5,983,561       6,003,851      0.3%
> > >      Average throughput (KB/s)     199,452         200,128      0.3%
> > >      elapsed time (sec)             100.93           96.62     -4.3%
> > >      sys time (sec)               2,532.49        2,395.83       -5%
> > >      -----------------------------------------------------------------------
> > >
> > >   usemem30 with 2M folios:
> > >   ========================
> > >
> > >      -----------------------------------------------------------------------
> > >                      mm-unstable-10-24-2025             v13
> > >      -----------------------------------------------------------------------
> > >      zswap compressor          deflate-iaa     deflate-iaa   IAA Batching
> > >                                                                  vs.
> > >                                                              IAA Sequential
> > >      -----------------------------------------------------------------------
> > >      Total throughput (KB/s)     6,309,635      10,558,225       67%
> > >      Average throughput (KB/s)     210,321         351,940       67%
> > >      elapsed time (sec)              88.70           67.84      -24%
> > >      sys time (sec)               2,059.83        1,581.07      -23%
> > >      -----------------------------------------------------------------------
> > >
> > >      -----------------------------------------------------------------------
> > >                      mm-unstable-10-24-2025             v13
> > >      -----------------------------------------------------------------------
> > >      zswap compressor                 zstd            zstd   v13 zstd
> > >                                                              improvement
> > >      -----------------------------------------------------------------------
> > >      Total throughput (KB/s)     6,562,687       6,567,946      0.1%
> > >      Average throughput (KB/s)     218,756         218,931      0.1%
> > >      elapsed time (sec)              94.69           88.79       -6%
> > >      sys time (sec)               2,253.97        2,083.43       -8%
> > >      -----------------------------------------------------------------------
> > >
> > >     The main takeaway from usemem, a workload that is mostly compression
> > >     dominated (very few swapins) is that the higher the number of batches,
> > >     such as with larger folios, the more the benefit of batching cost
> > >     amortization, as shown by the PMD usemem data. This aligns well
> > >     with the future direction for batching.
> > >
> > > kernel_compilation/allmodconfig, 64K folios:
> > > ============================================
> > >
> > >      --------------------------------------------------------------------------
> > >                mm-unstable-10-24-2025             v13
> > >      --------------------------------------------------------------------------
> > >      zswap compressor    deflate-iaa     deflate-iaa    IAA Batching
> > >                                                              vs.
> > >                                                         IAA Sequential
> > >      --------------------------------------------------------------------------
> > >      real_sec                 836.64          806.94      -3.5%
> > >      sys_sec                3,897.57        3,661.83        -6%
> > >      --------------------------------------------------------------------------
> > >
> > >      --------------------------------------------------------------------------
> > >                mm-unstable-10-24-2025             v13
> > >      --------------------------------------------------------------------------
> > >      zswap compressor           zstd            zstd    Improvement
> > >      --------------------------------------------------------------------------
> > >      real_sec                 880.62          850.41      -3.4%
> > >      sys_sec                5,171.90        5,076.51      -1.8%
> > >      --------------------------------------------------------------------------
> > >
> > > kernel_compilation/allmodconfig, PMD folios:
> > > ============================================
> > >
> > >      --------------------------------------------------------------------------
> > >                mm-unstable-10-24-2025             v13
> > >      --------------------------------------------------------------------------
> > >      zswap compressor    deflate-iaa     deflate-iaa    IAA Batching
> > >                                                              vs.
> > >                                                         IAA Sequential
> > >      --------------------------------------------------------------------------
> > >      real_sec                 818.48          779.67      -4.7%
> > >      sys_sec                4,226.52        4,245.18       0.4%
> > >      --------------------------------------------------------------------------
> > >
> > >      --------------------------------------------------------------------------
> > >               mm-unstable-10-24-2025             v13
> > >      --------------------------------------------------------------------------
> > >      zswap compressor          zstd             zstd    Improvement
> > >      --------------------------------------------------------------------------
> > >      real_sec                888.45           849.54      -4.4%
> > >      sys_sec               5,866.72         5,847.17      -0.3%
> > >      --------------------------------------------------------------------------
> > >
> > > [1]:
> > https://lore.kernel.org/all/aJ7Fk6RpNc815Ivd@gondor.apana.org.au/T/#m99
> > aea2ce3d284e6c5a3253061d97b08c4752a798
> > >
> > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > 
> > I won't go through the commit log and rewrite for this one too, but
> > please do so similar to how I did for the previous patches. Do not
> > describe the code, give a high-level overview of what is happening and
> > why it's happeneing, as well as very concise performance results.
> 
> With all due respect, I am not describing the code. zswap compress batching
> is a major architectural change and I am documenting the changes from the
> status quo, for other zswap developers. Yes, some of this might involve
> weaving in repetition of current behavior, again to stress the backward
> compatibility of main concepts.

As I said, I did not go through the commit log as I did for previous
ones, which did include unnecessary description of the code. What I
asked is for you to do similar changes here, if needed, because the
commit log is too big.

For example, you should remove mentions of ongoing work and future work,
simply because things change and they may not land. Just briefly
mentioning that there are future use cases (with maybe an example) is
sufficient.

> 
> I believe there is not one redundant datapoint when it comes to performance
> metrics in this summary - please elaborate. Thanks.

I never said they were redundant, I said we should make them more
concise. For example, the first table can be replaced by stating that
throughput improves by ~62% and the time is reduced by 28-29% and so on.

> 
> > 
> > Do not include things that only make sense in the context of a patch and
> > won't make sense as part of git histroy.
> 
> This makes sense, duly noted and will be addressed.
> 
> > 
> > That being said, I'd like Herbert to review this patch and make sure the
> > scatterlist and crypto APIs are being used correctly as he advised
> > earlier. I do have some comments on the zswap side though.
> > 
[..]
> > > @@ -869,84 +892,177 @@ static int zswap_cpu_comp_prepare(unsigned
> > int cpu, struct hlist_node *node)
> > >  	return ret;
> > >  }
> > >
> > > -static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> > > -			   struct zswap_pool *pool, bool wb_enabled)
> > > +/*
> > > + * Unified code path for compressors that do and do not support batching.
> > This
> > > + * procedure will compress multiple @nr_pages in @folio starting from the
> > > + * @start index.
> > > + *
> > > + * It is assumed that @nr_pages <= ZSWAP_MAX_BATCH_SIZE.
> > zswap_store() makes
> > > + * sure of this by design and zswap_store_pages() warns if this is not
> > > + * true.
> > > + *
> > > + * @nr_pages can be in (1, ZSWAP_MAX_BATCH_SIZE] even if the
> > compressor does not
> > > + * support batching.
> > > + *
> > > + * If @pool->compr_batch_size is 1, each page is processed sequentially.
> > > + *
> > > + * If @pool->compr_batch_size is > 1, compression batching is invoked
> > within
> > > + * the algorithm's driver, except if @nr_pages is 1: if so, the driver can
> > > + * choose to call the sequential/non-batching compress API.
> > > + *
> > > + * In both cases, if all compressions are successful, the compressed buffers
> > > + * are stored in zsmalloc.
> > > + *
> > > + * Traversing multiple SG lists when @nr_comps is > 1 is expensive, and
> > impacts
> > > + * batching performance if we were to repeat this operation multiple
> > times,
> > > + * such as:
> > > + *   - to map destination buffers to each SG list in the @acomp_ctx-
> > >sg_outputs
> > > + *     sg_table.
> > > + *   - to initialize each output SG list's @sg->length to PAGE_SIZE.
> > > + *   - to get the compressed output length in each @sg->length.
> > > + *
> > > + * These are some design choices made to optimize batching with SG lists:
> > > + *
> > > + * 1) The source folio pages in the batch are directly submitted to
> > > + *    crypto_acomp via acomp_request_set_src_folio().
> > > + *
> > > + * 2) The per-CPU @acomp_ctx->sg_outputs scatterlists are used to set up
> > > + *    destination buffers for interfacing with crypto_acomp.
> > > + *
> > > + * 3) To optimize performance, we map the per-CPU @acomp_ctx->buffers
> > to the
> > > + *    @acomp_ctx->sg_outputs->sgl SG lists at pool creation time. The only
> > task
> > > + *    remaining to be done for the output SG lists in zswap_compress() is to
> > > + *    set each @sg->length to PAGE_SIZE. This is done in zswap_compress()
> > > + *    for non-batching compressors. This needs to be done within the
> > compress
> > > + *    batching driver procedure as part of iterating through the SG lists for
> > > + *    batch setup, so as to minimize expensive traversals through the SG
> > lists.
> > > + *
> > > + * 4) Important requirements for batching compressors:
> > > + *    - Each @sg->length in @acomp_ctx->req->sg_outputs->sgl should
> > reflect the
> > > + *      compression outcome for that specific page, and be set to:
> > > + *      - the page's compressed length, or
> > > + *      - the compression error value for that page.
> > > + *    - The @acomp_ctx->req->dlen should be set to the first page's
> > > + *      @sg->length. This enables code generalization in zswap_compress()
> > > + *      for non-batching and batching compressors.
> > > + *
> > > + * acomp_ctx mutex locking:
> > > + *    Earlier, the mutex was held per page compression. With the new code,
> > > + *    [un]locking the mutex per page caused regressions for software
> > > + *    compressors. We now lock the mutex once per batch, which resolves
> > the
> > > + *    regression.
> > > + */
> > 
> > Please, no huge comments describing what the code is doing. If there's
> > anything that is not clear from reading the code or needs to be
> > explained or documented, please do so **concisely** in the relevant part
> > of the function.
> 
> Again, these are important requirements related to the major change, i.e.,
> batching, wrt why/how. I think it is important to note considerations for the
> next batching algorithm, just like I have done within the IAA driver. To be very
> clear, I am not describing code.
> 
> If questions arise as to why the mutex is being locked per batch as against
> per page, I think the comment above is helpful and saves time for folks to
> understand the "why".

Having a huge comment above the function does not help. For things like
this, you should add a brief comment above the mutex locking (where it's
relevant). Otherwise it's easy for someone to move the mutex locking
without reading this comment.

Same applies for other things. I am not saying we should throw away the
entire comment, but it's not helpful in its current form. Concise
comments in the relevant parts are much more helpful. Keep comments
above the function to general notes and things that are important to
callers, not implementation details.

> 
> > 
> > > +static bool zswap_compress(struct folio *folio, long start, unsigned int
> > nr_pages,
> > > +			   struct zswap_entry *entries[], struct zswap_pool
> > *pool,
> > > +			   int nid, bool wb_enabled)
> > >  {
> > > +	gfp_t gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM |
> > __GFP_MOVABLE;
> > > +	unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
> > > +	unsigned int slen = nr_comps * PAGE_SIZE;
> > >  	struct crypto_acomp_ctx *acomp_ctx;
> > > -	struct scatterlist input, output;
> > > -	int comp_ret = 0, alloc_ret = 0;
> > > -	unsigned int dlen = PAGE_SIZE;
> > > +	int err = 0, err_sg = 0;
> > > +	struct scatterlist *sg;
> > > +	unsigned int i, j, k;
> > >  	unsigned long handle;
> > > -	gfp_t gfp;
> > > -	u8 *dst;
> > > -	bool mapped = false;
> > > +	int *errp, dlen;
> > > +	void *dst;
> > >
> > >  	acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> > >  	mutex_lock(&acomp_ctx->mutex);
> > >
> > > -	dst = acomp_ctx->buffers[0];
> > > -	sg_init_table(&input, 1);
> > > -	sg_set_page(&input, page, PAGE_SIZE, 0);
> > > -
> > > -	sg_init_one(&output, dst, PAGE_SIZE);
> > > -	acomp_request_set_params(acomp_ctx->req, &input, &output,
> > PAGE_SIZE, dlen);
> > > +	errp = (pool->compr_batch_size == 1) ? &err : &err_sg;
> > 
> > err_sg is not used anywhere, so *errp could end up being garbage. Why do
> > we need this?
> 
> err_sg is initialized to 0 and never changes. It can never be garbage.
> We need this because of the current dichotomy between software compressors
> and IAA in the sg->length based error handling per Herbert's suggestions,
> included in the huge function comment block. It is needed to avoid branches
> and have the zswap_compress() code look seamless for all compressors.

This is exactly what I meant by saying the huge comment doesn't help. It
should be documented where it is implemented.

That being said, the code is confusing and not readable, why do we need
to do such manuevring with the error codes? It's really hard to track.

> 
> > 
> > >
> > >  	/*
> > > -	 * it maybe looks a little bit silly that we send an asynchronous
> > request,
> > > -	 * then wait for its completion synchronously. This makes the process
> > look
> > > -	 * synchronous in fact.
> > > -	 * Theoretically, acomp supports users send multiple acomp requests
> > in one
> > > -	 * acomp instance, then get those requests done simultaneously. but
> > in this
> > > -	 * case, zswap actually does store and load page by page, there is no
> > > -	 * existing method to send the second page before the first page is
> > done
> > > -	 * in one thread doing zswap.
> > > -	 * but in different threads running on different cpu, we have different
> > > -	 * acomp instance, so multiple threads can do (de)compression in
> > parallel.
> > > +	 * [i] refers to the incoming batch space and is used to
> > > +	 *     index into the folio pages.
> > > +	 *
> > > +	 * [j] refers to the incoming batch space and is used to
> > > +	 *     index into the @entries for the folio's pages in this
> > > +	 *     batch, per compress call while iterating over the output SG
> > > +	 *     lists. Also used to index into the folio's pages from @start,
> > > +	 *     in case of compress errors.
> > > +	 *
> > > +	 * [k] refers to the @acomp_ctx space, as determined by
> > > +	 *     @pool->compr_batch_size, and is used to index into
> > > +	 *     @acomp_ctx->sg_outputs->sgl and @acomp_ctx->buffers.
> > >  	 */
> > > -	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx-
> > >req), &acomp_ctx->wait);
> > > -	dlen = acomp_ctx->req->dlen;
> > > +	for (i = 0; i < nr_pages; i += nr_comps) {
> > 
> > What are looping over here? I thought zswap_compress() takes in exactly
> > one batch.
> 
> We are iterating once over one batch for batching compressors, and one
> page at a time for software.

I thought we wanted to have a single acomp API that takes in a batch of
pages, and then either hands them over to HW compressors, or loops over
them for SW compressors. This would simplify the users like zswap
because the differences between SW and HW compressors would be handled
internally.

> 
> > 
> > > +		acomp_request_set_src_folio(acomp_ctx->req, folio,
> > > +					    (start + i) * PAGE_SIZE,
> > > +					    slen);
> > >
> > > -	/*
> > > -	 * If a page cannot be compressed into a size smaller than PAGE_SIZE,
> > > -	 * save the content as is without a compression, to keep the LRU
> > order
> > > -	 * of writebacks.  If writeback is disabled, reject the page since it
> > > -	 * only adds metadata overhead.  swap_writeout() will put the page
> > back
> > > -	 * to the active LRU list in the case.
> > > -	 */
> > > -	if (comp_ret || !dlen || dlen >= PAGE_SIZE) {
> > > -		if (!wb_enabled) {
> > > -			comp_ret = comp_ret ? comp_ret : -EINVAL;
> > > -			goto unlock;
> > > -		}
> > > -		comp_ret = 0;
> > > -		dlen = PAGE_SIZE;
> > > -		dst = kmap_local_page(page);
> > > -		mapped = true;
> > > -	}
> > > +		acomp_ctx->sg_outputs->sgl->length = slen;
> > >
> > > -	gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM |
> > __GFP_MOVABLE;
> > > -	handle = zs_malloc(pool->zs_pool, dlen, gfp, page_to_nid(page));
> > > -	if (IS_ERR_VALUE(handle)) {
> > > -		alloc_ret = PTR_ERR((void *)handle);
> > > -		goto unlock;
> > > -	}
> > > +		acomp_request_set_dst_sg(acomp_ctx->req,
> > > +					 acomp_ctx->sg_outputs->sgl,
> > > +					 slen);
> > > +
> > > +		err = crypto_wait_req(crypto_acomp_compress(acomp_ctx-
> > >req),
> > > +				      &acomp_ctx->wait);
> > > +
> > > +		acomp_ctx->sg_outputs->sgl->length = acomp_ctx->req-
> > >dlen;
> > > +
> > > +		/*
> > > +		 * If a page cannot be compressed into a size smaller than
> > > +		 * PAGE_SIZE, save the content as is without a compression,
> > to
> > > +		 * keep the LRU order of writebacks.  If writeback is disabled,
> > > +		 * reject the page since it only adds metadata overhead.
> > > +		 * swap_writeout() will put the page back to the active LRU
> > list
> > > +		 * in the case.
> > > +		 *
> > > +		 * It is assumed that any compressor that sets the output
> > length
> > > +		 * to 0 or a value >= PAGE_SIZE will also return a negative
> > > +		 * error status in @err; i.e, will not return a successful
> > > +		 * compression status in @err in this case.
> > > +		 */
> > 
> > Ugh, checking the compression error and checking the compression length
> > are now in separate places so we need to check if writeback is disabled
> > in separate places and store the page as-is. It's ugly, and I think the
> > current code is not correct.
> 
> The code is 100% correct. You need to spend more time understanding
> the code. I have stated my assumption above in the comments to
> help in understanding the "why".
> 
> From a maintainer, I would expect more responsible statements than
> this. A flippant remark made without understanding the code (and,
> disparaging the comments intended to help you do this), can impact
> someone's career. I am held accountable in my job based on your
> comments.
> 
> That said, I have worked tirelessly and innovated to make the code
> compliant with Herbert's suggestions (which btw have enabled an
> elegant batching implementation and code commonality for IAA and
> software compressors), validated it thoroughly for IAA and ZSTD to
> ensure that both demonstrate performance improvements, which
> are crucial for memory savings. I am proud of this work.

I really do NOT appreciate the personal attack here. I am not sure why
my comment came across as a "flippant remark".

Let me be clear, I never said anything bad about "this work", or
expressed that I do not want to see it merged. You did a good job and
you should be proud of your work.

That being said, code review is part of the process, and you should know
better than anyone given how much this series evolved over 13 revisions
of careful reviews. I spent a considerable amount of time reviewing
previous revisions, pointing out problems, and helping this series
evolve. Telling me that I "should spend more time understanding the
code" is enraging at this point.

To be even more clear, I gain NOTHING by reviewing your code and helping
you land this work. I also have a job, and it's not reviewing your code.
I would tread very carefully if I were you.

Let's keep the discussion technical and civil. I will NOT tolerate such
comments going forward.

> 
> 
> > 
> > > +		if (err && !wb_enabled)
> > > +			goto compress_error;
> > > +
> > > +		for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> > > +			j = k + i;
> > 
> > Please use meaningful iterator names rather than i, j, and k and the huge
> > comment explaining what they are.
> 
> I happen to have a different view: having longer iterator names firstly makes
> code seem "verbose" and detracts from readability, not to mention exceeding the
> 80-character line limit. The comments are essential for code maintainability
> and avoid out-of-bounds errors when the next zswap developer wants to
> optimize the code.
> 
> One drawback of i/j/k iterators is mis-typing errors which cannot be caught
> at compile time. Let me think some more about how to strike a good balance.

I think if we get rid of the outer loop things will get much simpler. I
initially thought the acomp API will handle the looping internally for
SW compressors.

> 
> > 
> > > +			dst = acomp_ctx->buffers[k];
> > > +			dlen = sg->length | *errp;
> > 
> > Why are we doing this?
> > 
> > > +
> > > +			if (dlen < 0) {
> > 
> > We should do the incompressible page handling also if dlen is PAGE_SIZE,
> > or if the compression failed (I guess that's the intention of bit OR'ing
> > with *errp?)
> 
> Yes, indeed: that's the intention of bit OR'ing with *errp.

This is not very readable.

RE: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Sridhar, Kanchana P 1 month, 3 weeks ago

> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Thursday, November 13, 2025 4:46 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
> compress batching of large folios.
[...]
> > > > Architectural considerations for the zswap batching framework:
> > > >
> > >
> ==============================================================
> > > > We have designed the zswap batching framework to be
> > > > hardware-agnostic. It has no dependencies on Intel-specific features
> and
> > > > can be leveraged by any hardware accelerator or software-based
> > > > compressor. In other words, the framework is open and inclusive by
> > > > design.
> > > >
> > > > Other ongoing work that can use batching:
> > > > =========================================
> > > > This patch-series demonstrates the performance benefits of compress
> > > > batching when used in zswap_store() of large folios. shrink_folio_list()
> > > > "reclaim batching" of any-order folios is the major next work that uses
> > > > the zswap compress batching framework: our testing of
> kernel_compilation
> > > > with writeback and the zswap shrinker indicates 10X fewer pages get
> > > > written back when we reclaim 32 folios as a batch, as compared to one
> > > > folio at a time: this is with deflate-iaa and with zstd. We expect to
> > > > submit a patch-series with this data and the resulting performance
> > > > improvements shortly. Reclaim batching relieves memory pressure
> faster
> > > > than reclaiming one folio at a time, hence alleviates the need to scan
> > > > slab memory for writeback.
> > > >
> > > > Nhat has given ideas on using batching with the ongoing kcompressd
> work,
> > > > as well as beneficially using decompression batching & block IO batching
> > > > to improve zswap writeback efficiency.
> > > >
> > > > Experiments that combine zswap compress batching, reclaim batching,
> > > > swapin_readahead() decompression batching of prefetched pages, and
> > > > writeback batching show that 0 pages are written back with deflate-iaa
> > > > and zstd. For comparison, the baselines for these compressors see
> > > > 200K-800K pages written to disk (kernel compilation 'allmod' config).
> > > >
> > > > To summarize, these are future clients of the batching framework:
> > > >
> > > >    - shrink_folio_list() reclaim batching of multiple folios:
> > > >        Implemented, will submit patch-series.
> > > >    - zswap writeback with decompress batching:
> > > >        Implemented, will submit patch-series.
> > > >    - zram:
> > > >        Implemented, will submit patch-series.
> > > >    - kcompressd:
> > > >        Not yet implemented.
> > > >    - file systems:
> > > >        Not yet implemented.
> > > >    - swapin_readahead() decompression batching of prefetched pages:
> > > >        Implemented, will submit patch-series.
> > > >
> > > > Additionally, any place we have folios that need to be compressed, can
> > > > potentially be parallelized.

[...]

> For example, you should remove mentions of ongoing work and future work,
> simply because things change and they may not land. Just briefly
> mentioning that there are future use cases (with maybe an example) is
> sufficient.

Hi Yosry,

The mentions of ongoing/future work were included as per Andrew's suggestion.
Hence, I would like to keep these in the commit log. Hope this is Ok with you?

Thanks,
Kanchana

Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Yosry Ahmed 1 month, 3 weeks ago

On Fri, Dec 19, 2025 at 02:29:15AM +0000, Sridhar, Kanchana P wrote:
> 
> > -----Original Message-----
> > From: Yosry Ahmed <yosry.ahmed@linux.dev>
> > Sent: Thursday, November 13, 2025 4:46 PM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> > ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> > senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> > crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> > <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
> > compress batching of large folios.
> [...]
> > > > > Architectural considerations for the zswap batching framework:
> > > > >
> > > >
> > ==============================================================
> > > > > We have designed the zswap batching framework to be
> > > > > hardware-agnostic. It has no dependencies on Intel-specific features
> > and
> > > > > can be leveraged by any hardware accelerator or software-based
> > > > > compressor. In other words, the framework is open and inclusive by
> > > > > design.
> > > > >
> > > > > Other ongoing work that can use batching:
> > > > > =========================================
> > > > > This patch-series demonstrates the performance benefits of compress
> > > > > batching when used in zswap_store() of large folios. shrink_folio_list()
> > > > > "reclaim batching" of any-order folios is the major next work that uses
> > > > > the zswap compress batching framework: our testing of
> > kernel_compilation
> > > > > with writeback and the zswap shrinker indicates 10X fewer pages get
> > > > > written back when we reclaim 32 folios as a batch, as compared to one
> > > > > folio at a time: this is with deflate-iaa and with zstd. We expect to
> > > > > submit a patch-series with this data and the resulting performance
> > > > > improvements shortly. Reclaim batching relieves memory pressure
> > faster
> > > > > than reclaiming one folio at a time, hence alleviates the need to scan
> > > > > slab memory for writeback.
> > > > >
> > > > > Nhat has given ideas on using batching with the ongoing kcompressd
> > work,
> > > > > as well as beneficially using decompression batching & block IO batching
> > > > > to improve zswap writeback efficiency.
> > > > >
> > > > > Experiments that combine zswap compress batching, reclaim batching,
> > > > > swapin_readahead() decompression batching of prefetched pages, and
> > > > > writeback batching show that 0 pages are written back with deflate-iaa
> > > > > and zstd. For comparison, the baselines for these compressors see
> > > > > 200K-800K pages written to disk (kernel compilation 'allmod' config).
> > > > >
> > > > > To summarize, these are future clients of the batching framework:
> > > > >
> > > > >    - shrink_folio_list() reclaim batching of multiple folios:
> > > > >        Implemented, will submit patch-series.
> > > > >    - zswap writeback with decompress batching:
> > > > >        Implemented, will submit patch-series.
> > > > >    - zram:
> > > > >        Implemented, will submit patch-series.
> > > > >    - kcompressd:
> > > > >        Not yet implemented.
> > > > >    - file systems:
> > > > >        Not yet implemented.
> > > > >    - swapin_readahead() decompression batching of prefetched pages:
> > > > >        Implemented, will submit patch-series.
> > > > >
> > > > > Additionally, any place we have folios that need to be compressed, can
> > > > > potentially be parallelized.
> 
> [...]
> 
> > For example, you should remove mentions of ongoing work and future work,
> > simply because things change and they may not land. Just briefly
> > mentioning that there are future use cases (with maybe an example) is
> > sufficient.
> 
> Hi Yosry,
> 
> The mentions of ongoing/future work were included as per Andrew's suggestion.
> Hence, I would like to keep these in the commit log. Hope this is Ok with you?

We can keep them, but not in the detail they are currently in, and
avoiding mentioning what is implemented or not implemented yet because
it's not very relevant to the patch imo.

So maybe focus on the fact that the compression batching can be used for
other use cases like batching decompression in zswap writeback, batching
compression in zram, batch compression of different folios during
reclaim, etc -- without going too much into detail because these details
will probably change when these extensions are proposed.


> 
> Thanks,
> Kanchana
>

RE: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.

Posted by Sridhar, Kanchana P 1 month, 3 weeks ago

> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Friday, December 19, 2025 7:26 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
> compress batching of large folios.
> 
> On Fri, Dec 19, 2025 at 02:29:15AM +0000, Sridhar, Kanchana P wrote:
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > Sent: Thursday, November 13, 2025 4:46 PM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > hannes@cmpxchg.org; nphamcs@gmail.com;
> chengming.zhou@linux.dev;
> > > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> > > ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> > > senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> > > crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> > > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > > <kristen.c.accardi@intel.com>; Gomes, Vinicius
> <vinicius.gomes@intel.com>;
> > > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> > > <vinodh.gopal@intel.com>
> > > Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress()
> with
> > > compress batching of large folios.
> > [...]
> > > > > > Architectural considerations for the zswap batching framework:
> > > > > >
> > > > >
> > >
> ==============================================================
> > > > > > We have designed the zswap batching framework to be
> > > > > > hardware-agnostic. It has no dependencies on Intel-specific features
> > > and
> > > > > > can be leveraged by any hardware accelerator or software-based
> > > > > > compressor. In other words, the framework is open and inclusive by
> > > > > > design.
> > > > > >
> > > > > > Other ongoing work that can use batching:
> > > > > > =========================================
> > > > > > This patch-series demonstrates the performance benefits of
> compress
> > > > > > batching when used in zswap_store() of large folios.
> shrink_folio_list()
> > > > > > "reclaim batching" of any-order folios is the major next work that
> uses
> > > > > > the zswap compress batching framework: our testing of
> > > kernel_compilation
> > > > > > with writeback and the zswap shrinker indicates 10X fewer pages get
> > > > > > written back when we reclaim 32 folios as a batch, as compared to
> one
> > > > > > folio at a time: this is with deflate-iaa and with zstd. We expect to
> > > > > > submit a patch-series with this data and the resulting performance
> > > > > > improvements shortly. Reclaim batching relieves memory pressure
> > > faster
> > > > > > than reclaiming one folio at a time, hence alleviates the need to scan
> > > > > > slab memory for writeback.
> > > > > >
> > > > > > Nhat has given ideas on using batching with the ongoing kcompressd
> > > work,
> > > > > > as well as beneficially using decompression batching & block IO
> batching
> > > > > > to improve zswap writeback efficiency.
> > > > > >
> > > > > > Experiments that combine zswap compress batching, reclaim
> batching,
> > > > > > swapin_readahead() decompression batching of prefetched pages,
> and
> > > > > > writeback batching show that 0 pages are written back with deflate-
> iaa
> > > > > > and zstd. For comparison, the baselines for these compressors see
> > > > > > 200K-800K pages written to disk (kernel compilation 'allmod' config).
> > > > > >
> > > > > > To summarize, these are future clients of the batching framework:
> > > > > >
> > > > > >    - shrink_folio_list() reclaim batching of multiple folios:
> > > > > >        Implemented, will submit patch-series.
> > > > > >    - zswap writeback with decompress batching:
> > > > > >        Implemented, will submit patch-series.
> > > > > >    - zram:
> > > > > >        Implemented, will submit patch-series.
> > > > > >    - kcompressd:
> > > > > >        Not yet implemented.
> > > > > >    - file systems:
> > > > > >        Not yet implemented.
> > > > > >    - swapin_readahead() decompression batching of prefetched
> pages:
> > > > > >        Implemented, will submit patch-series.
> > > > > >
> > > > > > Additionally, any place we have folios that need to be compressed,
> can
> > > > > > potentially be parallelized.
> >
> > [...]
> >
> > > For example, you should remove mentions of ongoing work and future
> work,
> > > simply because things change and they may not land. Just briefly
> > > mentioning that there are future use cases (with maybe an example) is
> > > sufficient.
> >
> > Hi Yosry,
> >
> > The mentions of ongoing/future work were included as per Andrew's
> suggestion.
> > Hence, I would like to keep these in the commit log. Hope this is Ok with
> you?
> 
> We can keep them, but not in the detail they are currently in, and
> avoiding mentioning what is implemented or not implemented yet because
> it's not very relevant to the patch imo.
> 
> So maybe focus on the fact that the compression batching can be used for
> other use cases like batching decompression in zswap writeback, batching
> compression in zram, batch compression of different folios during
> reclaim, etc -- without going too much into detail because these details
> will probably change when these extensions are proposed.

Sure, this sounds good, thanks!

> 
> 
> >
> > Thanks,
> > Kanchana
> >