zswap IAA decompress batching

[RFC PATCH v1 0/7] zswap IAA decompress batching
Posted by Kanchana P Sridhar 1 month, 1 week ago
IAA Decompression Batching:
===========================

This patch-series applies over [1], the IAA compress batching patch-series.

[1] https://patchwork.kernel.org/project/linux-mm/list/?series=900537

This RFC patch-series introduces the use of the Intel Analytics Accelerator
(IAA) for parallel decompression of 4K folios prefetched by
swapin_readahead(). We have developed zswap batched loading of these
prefetched folios, which uses IAA to decompress them in parallel.

swapin_readahead() provides a natural batching interface because it adapts
the readahead window to the usefulness of prior prefetches. Hence, it
allows the page-cluster to be set based on workload characteristics. For
workloads that are prefetching friendly, this can form the basis for
reading ahead up to 32 folios with zswap load batching to significantly
reduce swap-in latency, major page-faults and sys time, thereby improving
workload performance.
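
As a rough illustration of the window sizing above: vm.page-cluster is a
base-2 logarithm, so a value of 5 allows up to 2^5 = 32 pages per readahead
attempt. The helper below is a userspace sketch only;
max_readahead_window() is a hypothetical name and not a kernel function,
and the upper bound of 5 is an assumption taken from the 32-folio limit
mentioned above.

```c
#include <assert.h>

/*
 * Hypothetical helper (not a kernel function): upper bound on the number
 * of pages in one swap readahead attempt for a given vm.page-cluster.
 * vm.page-cluster is a base-2 logarithm: 0 means 1 page, 3 means 8 pages.
 */
unsigned int max_readahead_window(unsigned int page_cluster)
{
	/* Assumption of this sketch: batches are capped at 32 folios. */
	assert(page_cluster <= 5);
	return 1u << page_cluster;
}
```

With the page-cluster values used for testing in this letter (3 and 4),
this gives upper bounds of 8 and 16 pages respectively.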

The patch-series builds upon the IAA compress batching patch-series [1],
and is organized as follows:

 1) A centralized batch decompression API that can be used by swap modules.
 2) "struct folio_batch" modifications, e.g., PAGEVEC_SIZE is increased to
    2^5. 
 3) Addition of "zswap_batch" and "non_zswap_batch" folio_batches in
    swap_read_folio() to serve the purposes of a plug.
 4) swap_read_zswap_batch_unplug() API in page_io.c to process a read
    batch of entries found in zswap.
 5) zswap API to add a swap entry to a load batch, init/reinit the batch,
    process the batch using the batch decompression API. 
 6) Modifications to the swapin_readahead() functions,
    swap_vma_readahead() and swap_cluster_readahead() to:
    a) Call swap_read_folio() to add prefetch swap entries to "zswap_batch"
       and "non_zswap_batch" folio_batches.
    b) Process the two readahead folio batches: "non_zswap_batch" folios
       will be read sequentially; "zswap_batch" folios will be batch
       decompressed with IAA.
 7) Modifications to do_swap_page() to invoke swapin_readahead() from both
    the single-mapped SWP_SYNCHRONOUS_IO branch and the
    shared/non-SWP_SYNCHRONOUS_IO branch. In the former path, we call
    swapin_readahead() only in the !zswap_never_enabled() case.
    a) This causes folios to be read into the swapcache in both paths. This
       design choice was motivated by stability: it handles races such as
       process 1 faulting in a single-mapped folio while process 2 is
       simultaneously prefetching it as a "readahead" folio.
    b) If the single-mapped folio was successfully read and the race did
       not occur, there are checks added to free the swapcache entry for
       the folio, before do_swap_page() returns.
 8) Finally, for IAA batching, we reduce SWAP_BATCH to 16 and modify the
    swap slots cache thresholds to alleviate lock contention on the
    swap_info_struct lock due to reduced swap page-fault latencies.
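
Step 6a above can be pictured with the following userspace sketch. It is
not code from the series: struct ra_entry, struct ra_batch,
ra_batch_add() and ra_partition() are hypothetical stand-ins for the swap
entries, "struct folio_batch" and the swap_read_folio() plug logic, and
the in_zswap flag is assumed to come from a zswap lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BATCH_MAX 32	/* mirrors PAGEVEC_SIZE raised to 2^5 in this series */

/* Hypothetical stand-in for a prefetched swap entry. */
struct ra_entry {
	unsigned long offset;
	bool in_zswap;		/* assumed result of a zswap lookup */
};

/* Hypothetical stand-in for struct folio_batch. */
struct ra_batch {
	size_t nr;
	unsigned long offsets[BATCH_MAX];
};

/* Add one entry to a batch; returns false when the batch is full. */
bool ra_batch_add(struct ra_batch *b, unsigned long offset)
{
	assert(b != NULL);
	if (b->nr >= BATCH_MAX)
		return false;
	b->offsets[b->nr++] = offset;
	return true;
}

/*
 * Sort each readahead entry into "zswap_batch" or "non_zswap_batch".
 * Stops at the first entry that does not fit; returns the number of
 * entries queued.
 */
size_t ra_partition(const struct ra_entry *entries, size_t n,
		    struct ra_batch *zswap_batch,
		    struct ra_batch *non_zswap_batch)
{
	size_t i, queued = 0;

	zswap_batch->nr = 0;
	non_zswap_batch->nr = 0;
	for (i = 0; i < n; i++) {
		struct ra_batch *b = entries[i].in_zswap ?
				     zswap_batch : non_zswap_batch;

		if (!ra_batch_add(b, entries[i].offset))
			break;
		queued++;
	}
	return queued;
}
```

In terms of step 6b, the entries collected in non_zswap_batch would then
be read one at a time, while those in zswap_batch would be handed to the
batch decompression API together.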

IAA decompress batching can be enabled only on platforms that have IAA, by
setting this config variable:

 CONFIG_ZSWAP_LOAD_BATCHING_ENABLED="y"

A new swap parameter "singlemapped_ra_enabled" (false by default) is added
for use on platforms that have IAA. When zswap_load_batching_enabled() is
true, this parameter gives the user the option to run experiments with IAA
and with software compressors for zswap.

These are the recommended settings for "singlemapped_ra_enabled", which
takes effect only in the do_swap_page() single-mapped SWP_SYNCHRONOUS_IO
path:

 For IAA:
   echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled

 For software compressors:
   echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled

If "singlemapped_ra_enabled" is set to false, swapin_readahead() will skip
prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO" do_swap_page()
path.
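
One way to read the rules above is as a single predicate. The sketch
below is a userspace model and not code from the series: struct
swapin_ctx and prefetch_readahead_folios() are hypothetical, and the
three flags stand in for the do_swap_page() branch taken,
zswap_never_enabled() and the sysfs parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the inputs that matter for this decision. */
struct swapin_ctx {
	bool single_mapped_sync_io;	/* single-mapped SWP_SYNCHRONOUS_IO path */
	bool zswap_never_enabled;	/* models zswap_never_enabled() */
	bool singlemapped_ra_enabled;	/* models the sysfs parameter */
};

/* Returns true if readahead folios would be prefetched for this fault. */
bool prefetch_readahead_folios(const struct swapin_ctx *c)
{
	assert(c != NULL);

	/* The shared/non-SWP_SYNCHRONOUS_IO path is not gated by the knob. */
	if (!c->single_mapped_sync_io)
		return true;

	/* The single-mapped path needs zswap in use and the knob set. */
	return !c->zswap_never_enabled && c->singlemapped_ra_enabled;
}
```

Under this reading, the recommended setting for software compressors
(false) leaves the single-mapped path without prefetching, as described
above.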

IAA decompress batching performance testing was done using the kernel
compilation test "allmodconfig" run in tmpfs, which demonstrates a
significant amount of readahead activity. vm-scalability usemem is not
ideal for decompress batching because there is very little readahead
activity even with page-cluster of 5 (swap_ra is < 150 with 4k/16k/32k/64k
folios).

The kernel compilation experiments with decompress batching demonstrate
significant latency reductions: up to 4% lower elapsed time and 14% lower
sys time than mm-unstable/zstd. When combined with compress batching, we
see a reduction of 5% in elapsed time and 20% in sys time as compared to
mm-unstable commit 817952b8be34 with zstd.
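
For reference, these percentages can be reproduced from the real_sec and
sys_sec rows of the tables in this letter with a helper such as the one
below. pct_reduction() is a hypothetical name used only for this sketch.

```c
#include <assert.h>

/* Percent reduction of `after` relative to `before`. */
double pct_reduction(double before, double after)
{
	assert(before > 0.0);
	return 100.0 * (before - after) / before;
}
```

Against the mm-unstable/zstd baseline (real_sec 783.87, sys_sec 6,522.32),
the decompress batching run (752.99, 5,638.16) works out to about 3.9% and
13.6%. The 5% and 20% figures appear to correspond to the 2WQ run with
page-cluster 4 (745.42, 5,247.86), which works out to about 4.9% and
19.5%.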

Our internal validation of IAA compress/decompress batching in highly
contended Sapphire Rapids server setups, with workloads running on 72 cores
for ~25 minutes under stringent memory limit constraints, has shown up to
50% reduction in sys time and 3.5% reduction in workload run time as
compared to software compressors.


System setup for testing:
=========================
Testing of this patch-series was done with mm-unstable as of 10-16-2024,
commit 817952b8be34, without and with this patch-series ("this
patch-series" includes [1]). Data was gathered on an Intel Sapphire Rapids
server, dual-socket 56 cores per socket, 4 IAA devices per socket, 503 GiB
RAM and 525G SSD disk partition swap. Core frequency was fixed at 2500MHz.

The kernel compilation test was run in tmpfs, using "allmodconfig", so
that significant swapout and readahead activity can be observed to quantify
decompress batching.

Other kernel configuration parameters:

    zswap compressor  : deflate-iaa
    zswap allocator   : zsmalloc
    vm.page-cluster   : 3,4

IAA "compression verification" is disabled and the async poll acomp
interface is used in the iaa_crypto driver (the defaults with this
series).


Performance testing (Kernel compilation):
=========================================

As mentioned earlier, for workloads that see a lot of swapout activity, we
can benefit from configuring 2 WQs per IAA device, with compress jobs from
all same-socket cores being distributed to the wq.1 of all IAAs on the
socket, with the "global_wq" developed in this patch-series.
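
The distribution of compress jobs can be pictured with the sketch below.
It is a userspace model only: struct global_wq_cursor and
next_global_wq() are hypothetical names, and the round-robin policy is an
assumption of this sketch rather than a statement of how the driver picks
a work queue. The device count matches the test system described in this
letter (4 IAA devices per socket).

```c
#include <assert.h>
#include <stddef.h>

#define IAAS_PER_SOCKET 4	/* 4 IAA devices per socket on the test system */

/* Hypothetical per-socket cursor for picking the next global WQ. */
struct global_wq_cursor {
	unsigned int next;
};

/*
 * Return the index of the IAA device whose wq.1 receives the next
 * compress job. Round-robin selection is an assumption of this sketch.
 */
unsigned int next_global_wq(struct global_wq_cursor *c)
{
	unsigned int dev;

	assert(c != NULL);
	dev = c->next;
	c->next = (c->next + 1) % IAAS_PER_SOCKET;
	return dev;
}
```

With such a policy, consecutive compress jobs from cores on one socket
would land on devices 0, 1, 2, 3 and then wrap around to 0.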

Although this data includes IAA decompress batching, which will be
submitted as a separate RFC patch-series, I am listing it here to quantify
the benefit of distributing compress jobs among all IAAs. The kernel
compilation test with "allmodconfig" is able to quantify this well:


 4K folios: deflate-iaa: kernel compilation
 ==========================================

 ------------------------------------------------------------------------------
                   mm-unstable-10-16-2024       zswap_load_batch with
                                              IAA decompress batching
 ------------------------------------------------------------------------------
 zswap compressor                    zstd         deflate-iaa
 vm.compress-batchsize                n/a                   1
 vm.page-cluster                        3                   3
 ------------------------------------------------------------------------------
 real_sec                          783.87              752.99
 user_sec                       15,750.07           15,746.37
 sys_sec                         6,522.32            5,638.16
 Max_Res_Set_Size_KB            1,872,640           1,872,640

 ------------------------------------------------------------------------------
 zswpout                       82,364,991         105,190,461
 zswpin                        21,303,393          29,684,653
 pswpout                               13                   1
 pswpin                                12                   1
 pgmajfault                    17,114,339          24,034,146
 swap_ra                        4,596,035           6,219,484
 swap_ra_hit                    2,903,249           3,876,195
 ------------------------------------------------------------------------------


 Progression of kernel compilation latency improvements with
 compress/decompress batching:
 ============================================================

 -------------------------------------------------------------------------------
               mm-unstable-10-16-2024   shrink_folio_       zswap_load_batch
                                        list()              w/ IAA decompress
                                        batching            batching 
                                        of folios           
 -------------------------------------------------------------------------------
 zswap compr       zstd   deflate-iaa   deflate-iaa    deflate-iaa   deflate-iaa
 vm.compress-       n/a           n/a            32              1            32
 batchsize
 vm.page-             3             3             3              3             3
  cluster
 -------------------------------------------------------------------------------
 real_sec        783.87        761.69        747.32         752.99        749.25
 user_sec     15,750.07     15,716.69     15,728.39      15,746.37     15,741.71
 sys_sec       6,522.32      5,725.28      5,399.44       5,638.16      5,482.12
 Max_RSS_KB   1,872,640     1,870,848     1,874,432      1,872,640     1,872,640
                                                                                       
 zswpout     82,364,991    97,739,600   102,780,612    105,190,461   106,729,372
 zswpin      21,303,393    27,684,166    29,016,252     29,684,653    30,717,819
 pswpout             13           222           213              1            12
 pswpin              12           209           202              1             8
 pgmajfault  17,114,339    22,421,211    23,378,161     24,034,146    24,852,985
 swap_ra      4,596,035     5,840,082     6,231,646      6,219,484     6,504,878
 swap_ra_hit  2,903,249     3,682,444     3,940,420      3,876,195     4,092,852
 -------------------------------------------------------------------------------


The last 2 columns of the latency reduction progression are as follows:


 IAA decompress batching combined with distributing compress jobs to all
 same-socket IAA devices: 
 =======================================================================

 ------------------------------------------------------------------------------
                   IAA shrink_folio_list() compress batching and
                       swapin_readahead() decompress batching

                                      1WQ      2WQ (distribute compress jobs)

                        1 local WQ (wq.0)    1 local WQ (wq.0) +
                                  per IAA    1 global WQ (wq.1) per IAA
                        
 ------------------------------------------------------------------------------
 zswap compressor             deflate-iaa         deflate-iaa
 vm.compress-batchsize                 32                  32
 vm.page-cluster                        4                   4
 ------------------------------------------------------------------------------
 real_sec                          746.77              745.42  
 user_sec                       15,732.66           15,738.85
 sys_sec                         5,384.14            5,247.86
 Max_Res_Set_Size_KB            1,874,432           1,872,640

 ------------------------------------------------------------------------------
 zswpout                      101,648,460         104,882,982
 zswpin                        27,418,319          29,428,515
 pswpout                              213                  22
 pswpin                               207                   6
 pgmajfault                    21,896,616          23,629,768
 swap_ra                        6,054,409           6,385,080
 swap_ra_hit                    3,791,628           3,985,141
 ------------------------------------------------------------------------------


I would greatly appreciate code review comments for this RFC series!

[1] https://patchwork.kernel.org/project/linux-mm/list/?series=900537


Thanks,
Kanchana



Kanchana P Sridhar (7):
  mm: zswap: Config variable to enable zswap loads with decompress
    batching.
  mm: swap: Add IAA batch decompression API
    swap_crypto_acomp_decompress_batch().
  pagevec: struct folio_batch changes for decompress batching interface.
  mm: swap: swap_read_folio() can add a folio to a folio_batch if it is
    in zswap.
  mm: swap, zswap: zswap folio_batch processing with IAA decompression
    batching.
  mm: do_swap_page() calls swapin_readahead() zswap load batching
    interface.
  mm: For IAA batching, reduce SWAP_BATCH and modify swap slot cache
    thresholds.

 include/linux/pagevec.h    |  13 +-
 include/linux/swap.h       |   7 +
 include/linux/swap_slots.h |   7 +
 include/linux/zswap.h      |  65 +++++++++
 mm/Kconfig                 |  13 ++
 mm/memory.c                | 187 +++++++++++++++++++------
 mm/page_io.c               |  61 ++++++++-
 mm/shmem.c                 |   2 +-
 mm/swap.h                  | 102 ++++++++++++--
 mm/swap_state.c            | 272 ++++++++++++++++++++++++++++++++++---
 mm/swapfile.c              |   2 +-
 mm/zswap.c                 | 272 +++++++++++++++++++++++++++++++++++++
 12 files changed, 927 insertions(+), 76 deletions(-)

-- 
2.27.0