Add a new zswap config variable that controls whether zswap_store() will
compress a batch of pages, for instance, the pages in a large folio:

  CONFIG_ZSWAP_STORE_BATCHING_ENABLED

The existing CONFIG_CRYPTO_DEV_IAA_CRYPTO variable, added in commit
ea7a5cbb4369 ("crypto: iaa - Add Intel IAA Compression Accelerator crypto
driver core"), is used to detect whether the system has the Intel
In-Memory Analytics Accelerator (IAA) and whether the iaa_crypto module
is available. If so, the kernel build will prompt for
CONFIG_ZSWAP_STORE_BATCHING_ENABLED. Hence, users can set
CONFIG_ZSWAP_STORE_BATCHING_ENABLED=y only on systems that have Intel
IAA.
If CONFIG_ZSWAP_STORE_BATCHING_ENABLED is enabled and IAA is configured
as the zswap compressor, zswap_store() will process the pages in a large
folio in batches, i.e., multiple pages at a time. Pages in a batch are
compressed in parallel in hardware, then stored. On systems without
Intel IAA, or if zswap uses a software compressor, pages in the batch
are compressed sequentially and stored.
The patch also implements a zswap API, zswap_store_batching_enabled(),
that returns the status of this config variable.
Suggested-by: Ying Huang <ying.huang@intel.com>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
include/linux/zswap.h | 6 ++++++
mm/Kconfig | 12 ++++++++++++
mm/zswap.c | 14 ++++++++++++++
3 files changed, 32 insertions(+)
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index d961ead91bf1..74ad2a24b309 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -24,6 +24,7 @@ struct zswap_lruvec_state {
atomic_long_t nr_disk_swapins;
};
+bool zswap_store_batching_enabled(void);
unsigned long zswap_total_pages(void);
bool zswap_store(struct folio *folio);
bool zswap_load(struct folio *folio);
@@ -39,6 +40,11 @@ bool zswap_never_enabled(void);
struct zswap_lruvec_state {};
+static inline bool zswap_store_batching_enabled(void)
+{
+ return false;
+}
+
static inline bool zswap_store(struct folio *folio)
{
return false;
diff --git a/mm/Kconfig b/mm/Kconfig
index 33fa51d608dc..26d1a5cee471 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -125,6 +125,18 @@ config ZSWAP_COMPRESSOR_DEFAULT
default "zstd" if ZSWAP_COMPRESSOR_DEFAULT_ZSTD
default ""
+config ZSWAP_STORE_BATCHING_ENABLED
+ bool "Batching of zswap stores with Intel IAA"
+ depends on ZSWAP && CRYPTO_DEV_IAA_CRYPTO
+ default n
+ help
+ Enables zswap_store() to swap out large folios in batches of up to
+ 8 pages, rather than a page at a time, if the system has Intel IAA
+ for hardware acceleration of compression. If IAA is configured as
+ the zswap compressor, batch compression of up to 8 pages of the
+ folio is parallelized in hardware, improving large folio compression
+ throughput and reducing swapout latency.
+
choice
prompt "Default allocator"
depends on ZSWAP
diff --git a/mm/zswap.c b/mm/zswap.c
index 948c9745ee57..4893302d8c34 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -127,6 +127,15 @@ static bool zswap_shrinker_enabled = IS_ENABLED(
CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
+/*
+ * Enable/disable batching of compressions when zswap_store() is called
+ * with a large folio. If enabled, and if IAA is the zswap compressor,
+ * pages are compressed in parallel in batches of, say, 8 pages.
+ * Otherwise, every page is compressed sequentially.
+ */
+static bool __zswap_store_batching_enabled = IS_ENABLED(
+ CONFIG_ZSWAP_STORE_BATCHING_ENABLED);
+
bool zswap_is_enabled(void)
{
return zswap_enabled;
@@ -241,6 +250,11 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
pr_debug("%s pool %s/%s\n", msg, (p)->tfm_name, \
zpool_get_type((p)->zpool))
+__always_inline bool zswap_store_batching_enabled(void)
+{
+ return __zswap_store_batching_enabled;
+}
+
/*********************************
* pool functions
**********************************/
--
2.27.0
On Thu, Oct 17, 2024 at 11:41 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> Add a new zswap config variable that controls whether zswap_store() will
> compress a batch of pages, for instance, the pages in a large folio:
>
> CONFIG_ZSWAP_STORE_BATCHING_ENABLED
>
[...]
>
> The patch also implements a zswap API that returns the status of this
> config variable.

If we are compressing a large folio and batching is an option, is not
batching ever the correct thing to do? Why is the config option
needed?
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Tuesday, October 22, 2024 5:50 PM
> Subject: Re: [RFC PATCH v1 09/13] mm: zswap: Config variable to enable
> compress batching in zswap_store().
>
[...]
>
> If we are compressing a large folio and batching is an option, is not
> batching ever the correct thing to do? Why is the config option
> needed?

Thanks Yosry, for the code review comments! This is a good point. The
main consideration here was not to impact software compressors run on
non-Intel platforms, and to only incur the memory footprint cost of
multiple acomp_req/buffers in "struct crypto_acomp_ctx" if there is IAA
to reduce latency with parallel compressions.

If the memory footprint cost is acceptable, there is no reason not to do
batching, even if compressions are sequential. We could amortize the
cost of the cgroup charging/objcg/stats updates.

Thanks,
Kanchana
On Tue, Oct 22, 2024 at 7:17 PM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
[...]
>
> If the memory footprint cost is acceptable, there is no reason not to do
> batching, even if compressions are sequential. We could amortize the
> cost of the cgroup charging/objcg/stats updates.

Hmm yeah, based on the next patch it seems like we allocate 7 extra
buffers, each sized 2 * PAGE_SIZE, percpu. That's 56KB percpu (with 4K
page size), which is non-trivial.

Making it a config option seems inconvenient though. Users have to sign
up for the memory overhead even if some of them won't use IAA batching,
or disable batching altogether. I would assume this would be especially
annoying for distros, but also for anyone who wants to experiment with
IAA batching.

The first thing that comes to mind is making this a boot option. But I
think we can make it even more convenient and support enabling it at
runtime. We just need to allocate the additional buffers the first time
batching is enabled. This shouldn't be too complicated: we have an array
of buffers on each CPU, but we only allocate the first one initially
(unless batching is enabled at boot). When batching is enabled, we can
allocate the remaining buffers.

The only shortcoming of this approach is that if we enable batching and
then disable it, we can't free the buffers without significant
complexity, but I think that should be fine. I don't see this being a
common pattern.

WDYT?
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, October 23, 2024 11:12 AM
> Subject: Re: [RFC PATCH v1 09/13] mm: zswap: Config variable to enable
> compress batching in zswap_store().
>
[...]
>
> The first thing that comes to mind is making this a boot option. But I
> think we can make it even more convenient and support enabling it at
> runtime. We just need to allocate the additional buffers the first time
> batching is enabled.
>
[...]
>
> WDYT?

Thanks for these suggestions, Yosry. Sure, let me give this a try, and
share updates.

Thanks,
Kanchana
On Wed, Oct 23, 2024 at 02:17:06AM +0000, Sridhar, Kanchana P wrote:
>
> Thanks Yosry, for the code review comments! This is a good point. The
> main consideration here was not to impact software compressors run on
> non-Intel platforms, and to only incur the memory footprint cost of
> multiple acomp_req/buffers in "struct crypto_acomp_ctx" if there is IAA
> to reduce latency with parallel compressions.

I'm working on a batching mechanism for the crypto_ahash interface,
where the requests are simply chained together and then submitted. The
same mechanism should work for crypto_acomp as well:

+	for (i = 0; i < num_mb; ++i) {
+		if (testmgr_alloc_buf(data[i].xbuf))
+			goto out;
+
+		crypto_init_wait(&data[i].wait);
+
+		data[i].req = ahash_request_alloc(tfm, GFP_KERNEL);
+		if (!data[i].req) {
+			pr_err("alg: hash: Failed to allocate request for %s\n",
+			       algo);
+			goto out;
+		}
+
+		if (i)
+			ahash_request_chain(data[i].req, data[0].req);
+		else
+			ahash_reqchain_init(data[i].req, 0, crypto_req_done,
+					    &data[i].wait);
+
+		sg_init_table(data[i].sg, XBUFSIZE);
+		for (j = 0; j < XBUFSIZE; j++) {
+			sg_set_buf(data[i].sg + j, data[i].xbuf[j], PAGE_SIZE);
+			memset(data[i].xbuf[j], 0xff, PAGE_SIZE);
+		}
+	}

Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Tuesday, October 22, 2024 7:58 PM
> Subject: Re: [RFC PATCH v1 09/13] mm: zswap: Config variable to enable
> compress batching in zswap_store().
>
> I'm working on a batching mechanism for the crypto_ahash interface,
> where the requests are simply chained together and then submitted. The
> same mechanism should work for crypto_acomp as well:
>
[...]

Thanks Herbert, for letting us know! Sure, we will look forward to these
changes when they are ready, and look into incorporating them in the
iaa_crypto driver.

Thanks,
Kanchana