[v9] zswap compression batching

[PATCH v9 10/19] crypto: acomp - New interfaces to facilitate batching support in acomp & drivers.

Posted by Kanchana P Sridhar 9 months, 2 weeks ago

This commit adds get_batch_size(), batch_compress() and batch_decompress()
interfaces to:

  struct acomp_alg
  struct crypto_acomp

A crypto_acomp compression algorithm that supports batching of compressions
and decompressions must provide implementations for these API, so that a
higher level module based on crypto_acomp, such as zswap, can allocate
resources for submitting multiple compress/decompress jobs that can be
batched, and invoke batching of [de]compressions.

A new helper function acomp_has_async_batching() can be invoked to query if
a crypto_acomp has registered these batching interfaces.

Further, zswap can invoke the newly added "crypto_acomp_batch_size()"
API to query the maximum number of requests that can be batch
[de]compressed. crypto_acomp_batch_size() returns 1 if the acomp has not
provided an implementation for get_batch_size(). Based on this, zswap
can use the minimum of any zswap-specific upper limits for batch-size
and the compressor's max batch-size, to allocate batching resources.

This allows the iaa_crypto Intel IAA driver to register implementations for
the get_batch_size(), batch_compress() and batch_decompress() acomp_alg
interfaces, that can subsequently be invoked from zswap to
compress/decompress pages in parallel in the IAA hardware accelerator to
improve swapout/swapin performance, through these newly added
corresponding crypto_acomp API:

  crypto_acomp_batch_size()
  crypto_acomp_batch_compress()
  crypto_acomp_batch_decompress()

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 crypto/acompress.c                  |   3 +
 include/crypto/acompress.h          | 107 ++++++++++++++++++++++++++++
 include/crypto/internal/acompress.h |  20 ++++++
 3 files changed, 130 insertions(+)

diff --git a/crypto/acompress.c b/crypto/acompress.c
index d08e0fe8cd9e..c7cca5596acf 100644
--- a/crypto/acompress.c
+++ b/crypto/acompress.c
@@ -95,6 +95,9 @@ static int crypto_acomp_init_tfm(struct crypto_tfm *tfm)
 
 	acomp->compress = alg->compress;
 	acomp->decompress = alg->decompress;
+	acomp->get_batch_size = alg->get_batch_size;
+	acomp->batch_compress = alg->batch_compress;
+	acomp->batch_decompress = alg->batch_decompress;
 	acomp->reqsize = alg->reqsize;
 
 	acomp->base.exit = crypto_acomp_exit_tfm;
diff --git a/include/crypto/acompress.h b/include/crypto/acompress.h
index 939e51d122b0..e50f3e71ba58 100644
--- a/include/crypto/acompress.h
+++ b/include/crypto/acompress.h
@@ -120,6 +120,10 @@ struct acomp_req {
  *
  * @compress:		Function performs a compress operation
  * @decompress:		Function performs a de-compress operation
+ * @get_batch_size:	Maximum batch-size for batching compress/decompress
+ *			operations.
+ * @batch_compress:	Function performs a batch compress operation.
+ * @batch_decompress:	Function performs a batch decompress operation.
  * @reqsize:		Context size for (de)compression requests
  * @fb:			Synchronous fallback tfm
  * @base:		Common crypto API algorithm data structure
@@ -127,6 +131,22 @@ struct acomp_req {
 struct crypto_acomp {
 	int (*compress)(struct acomp_req *req);
 	int (*decompress)(struct acomp_req *req);
+	unsigned int (*get_batch_size)(void);
+	bool (*batch_compress)(
+		struct acomp_req *reqs[],
+		struct page *pages[],
+		u8 *dsts[],
+		unsigned int dlens[],
+		int errors[],
+		int nr_reqs);
+	bool (*batch_decompress)(
+		struct acomp_req *reqs[],
+		u8 *srcs[],
+		struct page *pages[],
+		unsigned int slens[],
+		unsigned int dlens[],
+		int errors[],
+		int nr_reqs);
 	unsigned int reqsize;
 	struct crypto_acomp *fb;
 	struct crypto_tfm base;
@@ -224,6 +244,13 @@ static inline bool acomp_is_async(struct crypto_acomp *tfm)
 	       CRYPTO_ALG_ASYNC;
 }
 
+static inline bool acomp_has_async_batching(struct crypto_acomp *tfm)
+{
+	return (acomp_is_async(tfm) &&
+		(crypto_comp_alg_common(tfm)->base.cra_flags & CRYPTO_ALG_TYPE_ACOMPRESS) &&
+		tfm->get_batch_size && tfm->batch_compress && tfm->batch_decompress);
+}
+
 static inline struct crypto_acomp *crypto_acomp_reqtfm(struct acomp_req *req)
 {
 	return __crypto_acomp_tfm(req->base.tfm);
@@ -595,4 +622,84 @@ static inline struct acomp_req *acomp_request_on_stack_init(
 	return req;
 }
 
+/**
+ * crypto_acomp_batch_size() -- Get the algorithm's batch size
+ *
+ * Function returns the algorithm's batch size for batching operations
+ *
+ * @tfm:	ACOMPRESS tfm handle allocated with crypto_alloc_acomp()
+ *
+ * Return:	crypto_acomp's batch size.
+ */
+static inline unsigned int crypto_acomp_batch_size(struct crypto_acomp *tfm)
+{
+	if (acomp_has_async_batching(tfm))
+		return tfm->get_batch_size();
+
+	return 1;
+}
+
+/**
+ * crypto_acomp_batch_compress() -- Invoke asynchronous compress of a batch
+ * of requests.
+ *
+ * @reqs: @nr_reqs asynchronous compress requests.
+ * @pages: Pages to be compressed by IAA.
+ * @dsts: Pre-allocated destination buffers to store results of compression.
+ *        Each element of @dsts must be of size "PAGE_SIZE * 2".
+ * @dlens: Will contain the compressed lengths for @pages.
+ * @errors: zero on successful compression of the corresponding
+ *          req, or error code in case of error.
+ * @nr_reqs: The number of requests in @reqs, up to IAA_CRYPTO_MAX_BATCH_SIZE,
+ *           to be compressed.
+ *
+ * Returns true if all compress requests in the batch complete successfully,
+ * false otherwise.
+ */
+static inline bool crypto_acomp_batch_compress(
+	struct acomp_req *reqs[],
+	struct page *pages[],
+	u8 *dsts[],
+	unsigned int dlens[],
+	int errors[],
+	int nr_reqs)
+{
+	struct crypto_acomp *tfm = crypto_acomp_reqtfm(reqs[0]);
+
+	return tfm->batch_compress(reqs, pages, dsts, dlens, errors, nr_reqs);
+}
+
+/**
+ * crypto_acomp_batch_decompress() -- Invoke asynchronous decompress of a batch
+ * of requests.
+ *
+ * @reqs: @nr_reqs asynchronous decompress requests.
+ * @srcs: Source buffers to to be decompressed.
+ * @pages: Destination pages corresponding to @srcs.
+ * @slens: Buffer lengths for @srcs.
+ * @dlens: Will contain the decompressed lengths for @srcs.
+ *	   For batch decompressions, the caller must enforce additional
+ *	   semantics such as, BUG_ON(dlens[i] != PAGE_SIZE) assertions.
+ * @errors: zero on successful decompression of the corresponding
+ *          req, or error code in case of error.
+ * @nr_reqs: The number of requests in @reqs, up to IAA_CRYPTO_MAX_BATCH_SIZE,
+ *           to be decompressed.
+ *
+ * Returns true if all decompress requests in the batch complete successfully,
+ * false otherwise.
+ */
+static inline bool crypto_acomp_batch_decompress(
+	struct acomp_req *reqs[],
+	u8 *srcs[],
+	struct page *pages[],
+	unsigned int slens[],
+	unsigned int dlens[],
+	int errors[],
+	int nr_reqs)
+{
+	struct crypto_acomp *tfm = crypto_acomp_reqtfm(reqs[0]);
+
+	return tfm->batch_decompress(reqs, srcs, pages, slens, dlens, errors, nr_reqs);
+}
+
 #endif
diff --git a/include/crypto/internal/acompress.h b/include/crypto/internal/acompress.h
index b69d818d7e68..891e40831af8 100644
--- a/include/crypto/internal/acompress.h
+++ b/include/crypto/internal/acompress.h
@@ -23,6 +23,10 @@
  *
  * @compress:	Function performs a compress operation
  * @decompress:	Function performs a de-compress operation
+ * @get_batch_size:	Maximum batch-size for batching compress/decompress
+ *			operations.
+ * @batch_compress:	Function performs a batch compress operation.
+ * @batch_decompress:	Function performs a batch decompress operation.
  * @init:	Initialize the cryptographic transformation object.
  *		This function is used to initialize the cryptographic
  *		transformation object. This function is called only once at
@@ -43,6 +47,22 @@
 struct acomp_alg {
 	int (*compress)(struct acomp_req *req);
 	int (*decompress)(struct acomp_req *req);
+	unsigned int (*get_batch_size)(void);
+	bool (*batch_compress)(
+		struct acomp_req *reqs[],
+		struct page *pages[],
+		u8 *dsts[],
+		unsigned int dlens[],
+		int errors[],
+		int nr_reqs);
+	bool (*batch_decompress)(
+		struct acomp_req *reqs[],
+		u8 *srcs[],
+		struct page *pages[],
+		unsigned int slens[],
+		unsigned int dlens[],
+		int errors[],
+		int nr_reqs);
 	int (*init)(struct crypto_acomp *tfm);
 	void (*exit)(struct crypto_acomp *tfm);
 
-- 
2.27.0

Re: [PATCH v9 10/19] crypto: acomp - New interfaces to facilitate batching support in acomp & drivers.

Posted by Herbert Xu 9 months, 2 weeks ago

On Wed, Apr 30, 2025 at 01:52:56PM -0700, Kanchana P Sridhar wrote:
>
> @@ -127,6 +131,22 @@ struct acomp_req {
>  struct crypto_acomp {
>  	int (*compress)(struct acomp_req *req);
>  	int (*decompress)(struct acomp_req *req);
> +	unsigned int (*get_batch_size)(void);
> +	bool (*batch_compress)(
> +		struct acomp_req *reqs[],
> +		struct page *pages[],
> +		u8 *dsts[],
> +		unsigned int dlens[],
> +		int errors[],
> +		int nr_reqs);
> +	bool (*batch_decompress)(
> +		struct acomp_req *reqs[],
> +		u8 *srcs[],
> +		struct page *pages[],
> +		unsigned int slens[],
> +		unsigned int dlens[],
> +		int errors[],
> +		int nr_reqs);

I shelved request chaining because allocating one request per page
is actively harmful to performance.  So we should not add any
interface that is based on one request per page.

My plan is to supply a whole folio through acomp_request_set_src_folio
and mark it as a batch request with a data unit size of 4K, e.g.:

	acomp_request_set_src_folio(req, folio, 0, len);
	acomp_request_set_data_unit(req, 4096);

Then the algorithm can dice it up in whatever way it sees fit.  For
algorithms that don't support batching, the acompress API should dice
it up and feed it to the algorithm piece-meal.

IOW the folio loop in zswap_store would be moved into the Crypto API.

This is contingent on one API change, bringing back NULL dst support
to acompress.  This way zswap does not need to worry about allocating
memory that might not even be needed (when pages compress well).

This won't look like the useless NULL dst we had before which simply
pre-allocated memory rather than allocating them on demand.

What acompress should do is allocate one dst page at a time, once that
is filled up, then allocate one more.  They should be chained up in an
SG list.  Pages that do not compress can be marked as a zero-length
entry in the SG list.

If the allocation fails at any point in time, simply stop the
batching at that point and return the SG list of what has been
compressed so far.  After processing the returned pages, zswap
can then call acompress again with an offset into the folio to
continue compression.

To prevent pathological cases of zero progress, zswap can provide
one pre-allocated page to seed the process.  For iaa, it should
just allocate as many pages as it needs for batching, and if that
fails, simply fall back to no batching and do things one page at
a time (or however many pages you manage to allocate).

I'll whip up a quick POC and we can work on top of it.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

RE: [PATCH v9 10/19] crypto: acomp - New interfaces to facilitate batching support in acomp & drivers.

Posted by Sridhar, Kanchana P 9 months, 1 week ago

> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Wednesday, April 30, 2025 6:41 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosry.ahmed@linux.dev; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org; linux-
> crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> Kristen C <kristen.c.accardi@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v9 10/19] crypto: acomp - New interfaces to facilitate
> batching support in acomp & drivers.
> 
> On Wed, Apr 30, 2025 at 01:52:56PM -0700, Kanchana P Sridhar wrote:
> >
> > @@ -127,6 +131,22 @@ struct acomp_req {
> >  struct crypto_acomp {
> >  	int (*compress)(struct acomp_req *req);
> >  	int (*decompress)(struct acomp_req *req);
> > +	unsigned int (*get_batch_size)(void);
> > +	bool (*batch_compress)(
> > +		struct acomp_req *reqs[],
> > +		struct page *pages[],
> > +		u8 *dsts[],
> > +		unsigned int dlens[],
> > +		int errors[],
> > +		int nr_reqs);
> > +	bool (*batch_decompress)(
> > +		struct acomp_req *reqs[],
> > +		u8 *srcs[],
> > +		struct page *pages[],
> > +		unsigned int slens[],
> > +		unsigned int dlens[],
> > +		int errors[],
> > +		int nr_reqs);
> 
> I shelved request chaining because allocating one request per page
> is actively harmful to performance.  So we should not add any
> interface that is based on one request per page.

Hi Herbert,

My cover letter presents data that I've gathered that indicates significant
performance improvements with the crypto_acomp_batch_compress()
interface that allocates one request per page.

In addition, I would also like to share the p50/p99 latency of just the calls
to crypto_acomp_compress() and crypto_acomp_batch_compress() that
I gathered using the silesia.tar dataset (http://wanos.co/assets/silesia.tar)
and a simple madvise test that reads the dataset into memory, then
swaps out all pages and swaps them back in.

This data is on Sapphire Rapids, core frequency fixed at 2500 MHz.
I have enabled 64K folios.
The "count" refers to the number of calls to the crypto_acomp API.
As expected, the deflate-iaa "count" values in v9 are much lower
because zswap_compress() in v9 uses compression batching, i.e.,
invokes crypto_acomp_batch_compress() with batches of 8 pages,
while storing the 64K folios.

     -------------------------------------------------------------------------
     64K folios:    Normalized Per-Page Latency of crypto_acomp
                           calls in zswap_compress() (ns)
     ------------+------------------------------+----------------------------
                 |     mm-unstable-4-21-2025    |              v9
     ------------+------------------------------+----------------------------
                 |   count       p50       p99  |   count      p50       p99
     ------------+------------------------------+----------------------------
     deflate-iaa |  50,459     3,396     3,663  |   6,379      717       774
                 |                              |
     zstd        |  50,631    27,610    33,006  |  50,631   27,253    32,516
     ------------+------------------------------+----------------------------

There is no indication of sending one acomp request per page
harming performance, with either deflate-iaa or zstd. We see a
4.7X speedup with IAA that uses the crypto_acomp_batch_compress()
interface in question.

> 
> My plan is to supply a whole folio through acomp_request_set_src_folio
> and mark it as a batch request with a data unit size of 4K, e.g.:
> 
> 	acomp_request_set_src_folio(req, folio, 0, len);
> 	acomp_request_set_data_unit(req, 4096);
> 
> Then the algorithm can dice it up in whatever way it sees fit.  For
> algorithms that don't support batching, the acompress API should dice
> it up and feed it to the algorithm piece-meal.
> 
> IOW the folio loop in zswap_store would be moved into the Crypto API.
> 
> This is contingent on one API change, bringing back NULL dst support
> to acompress.  This way zswap does not need to worry about allocating
> memory that might not even be needed (when pages compress well).
> 
> This won't look like the useless NULL dst we had before which simply
> pre-allocated memory rather than allocating them on demand.
> 
> What acompress should do is allocate one dst page at a time, once that
> is filled up, then allocate one more.  They should be chained up in an
> SG list.  Pages that do not compress can be marked as a zero-length
> entry in the SG list.
> 
> If the allocation fails at any point in time, simply stop the
> batching at that point and return the SG list of what has been
> compressed so far.  After processing the returned pages, zswap
> can then call acompress again with an offset into the folio to
> continue compression.

I am not sure if this would be feasible, because zswap_store()
incrementally does other book-keeping necessary for mm/swap
consistency as pages get compressed, such as allocating entries,
storing compressed buffers in zpool, updating the xarray of swap
offsets stored in zswap, adding the entry to the zswap memcg LRU
list, etc.

As soon as an error is encountered in zswap_compress(),
zswap_store() has to cleanup any prior zpool stores for the batch,
and delete any swap offsets for the folio from xarray.

Imo, handing over the folio store loop to crypto might make this
"maintaining consistency of mm/swap incrementally with each
page compressed/stored" somewhat messy. However, I would like
to request the zswap maintainers to weigh in with their insights
on pros/cons of what you are proposing.

> 
> To prevent pathological cases of zero progress, zswap can provide
> one pre-allocated page to seed the process.  For iaa, it should
> just allocate as many pages as it needs for batching, and if that
> fails, simply fall back to no batching and do things one page at
> a time (or however many pages you manage to allocate).

I'm somewhat concerned about a) allocating memory and b) adding
computes in the zswap_store() path. It is not clear what is the
motivating factor for doing so, especially because the solution and
data presented in v9 have indicated only performance upside with
the crypto_acomp_batch_compress() API and its implementation
in iaa_crypto, and modest performance gains with zstd using the
existing crypto_acomp_compress() API to compress one page at a
time in a large folio. Please let me know if I am missing something.

Thanks,
Kanchana

> 
> I'll whip up a quick POC and we can work on top of it.
> 
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt