RE: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios

Sridhar, Kanchana P posted 3 patches 1 year, 3 months ago
Only 0 patches received!
There is a newer version of this series
RE: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
Posted by Sridhar, Kanchana P 1 year, 3 months ago
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, August 28, 2024 3:37 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
> 
> On Wed, Aug 28, 2024 at 2:35 AM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Hi All,
> >
> > This patch-series enables zswap_store() to accept and store mTHP
> > folios. The most significant contribution in this series is from the
> > earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> > migrated to v6.11-rc3 in patch 2/4 of this series.
> >
> > [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> >      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-
> ryan.roberts@arm.com/T/#u
> >
> > Additionally, there is an attempt to modularize some of the functionality
> > in zswap_store(), to make it more amenable to supporting any-order
> > mTHPs. For instance, the function zswap_store_entry() stores a
> zswap_entry
> > in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
> > delete all offsets corresponding to a higher order folio stored in zswap.
> >
> > For accounting purposes, the patch-series adds per-order mTHP sysfs
> > "zswpout" counters that get incremented upon successful zswap_store of
> > an mTHP folio:
> >
> > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
> >
> > This patch-series is a precursor to ZSWAP compress batching of mTHP
> > swap-out and decompress batching of swap-ins based on
> swapin_readahead(),
> > using Intel IAA hardware acceleration, which we would like to submit in
> > subsequent RFC patch-series, with performance improvement data.
> >
> > Thanks to Ying Huang for pre-posting review feedback and suggestions!
> >
> > Changes since v4:
> > =================
> > 1) Published before/after data with zstd, as suggested by Nhat (Thanks
> >    Nhat for the data reviews!).
> > 2) Rebased to mm-unstable from 8/27/2024,
> >    commit b659edec079c90012cf8d05624e312d1062b8b87.
> > 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
> >    CONFIG_MEMCG is not defined, to resolve build errors reported by kernel
> >    robot; as per Nhat's and Michal's suggestion to not require a separate
> >    patch to fix the build errors (thanks both!).
> > 4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
> >    suggested by Yosry (Thanks Yosry!).
> > 5) Squashed the commits that define new mthp zswpout stat counters, and
> >    invoke count_mthp_stat() after successful zswap_store()s; into a single
> >    commit. Thanks Yosry for this suggestion!
> >
> > Changes since v3:
> > =================
> > 1) Rebased to mm-unstable commit
> 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
> >    Thanks to Barry for suggesting aligning with Ryan Roberts' latest
> >    changes to count_mthp_stat() so that it's always defined, even when THP
> >    is disabled. Barry, I have also made one other change in page_io.c
> >    where count_mthp_stat() is called by count_swpout_vm_event(). I would
> >    appreciate it if you can review this. Thanks!
> >    Hopefully this should resolve the kernel robot build errors.
> >
> > Changes since v2:
> > =================
> > 1) Gathered usemem data using SSD as the backing swap device for zswap,
> >    as suggested by Ying Huang. Ying, I would appreciate it if you can
> >    review the latest data. Thanks!
> > 2) Generated the base commit info in the patches to attempt to address
> >    the kernel test robot build errors.
> > 3) No code changes to the individual patches themselves.
> >
> > Changes since RFC v1:
> > =====================
> >
> > 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
> >    Thanks Barry!
> > 2) Addressed some of the code review comments that Nhat Pham provided
> in
> >    Ryan's initial RFC [1]:
> >    - Added a comment about the cgroup zswap limit checks occuring once
> per
> >      folio at the beginning of zswap_store().
> >      Nhat, Ryan, please do let me know if the comments convey the summary
> >      from the RFC discussion. Thanks!
> >    - Posted data on running the cgroup suite's zswap kselftest.
> > 3) Rebased to v6.11-rc3.
> > 4) Gathered performance data with usemem and the rebased patch-series.
> >
> > Performance Testing:
> > ====================
> > Testing of this patch-series was done with the v6.11-rc3 mainline, without
> > and with this patch-series, on an Intel Sapphire Rapids server,
> > dual-socket 56 cores per socket, 4 IAA devices per socket.
> >
> > The system has 503 GiB RAM, with 176GiB ZRAM (35% of available RAM) as
> the
> > backing swap device for ZSWAP. zstd is configured as the ZRAM compressor.
> > Core frequency was fixed at 2500MHz.
> >
> > The vm-scalability "usemem" test was run in a cgroup whose memory.high
> > was fixed at 40G. The is no swap limit set for the cgroup. Following a
> > similar methodology as in Ryan Roberts' "Swap-out mTHP without splitting"
> > series [2], 70 usemem processes were run, each allocating and writing 1G of
> > memory:
> >
> >     usemem --init-time -w -O -n 70 1g
> >
> > The vm/sysfs mTHP stats included with the performance data provide
> details
> > on the swapout activity to ZSWAP/swap.
> >
> > Other kernel configuration parameters:
> >
> >     ZSWAP Compressors : zstd, deflate-iaa
> >     ZSWAP Allocator   : zsmalloc
> >     SWAP page-cluster : 2
> >
> > In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> > IAA "compression verification" is enabled. Hence each IAA compression
> > will be decompressed internally by the "iaa_crypto" driver, the crc-s
> > returned by the hardware will be compared and errors reported in case of
> > mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> > compared to the software compressors.
> >
> > Throughput is derived by averaging the individual 70 processes' throughputs
> > reported by usemem. sys time is measured with perf. All data points are
> > averaged across 3 runs.
> >
> >  64KB mTHP (cgroup memory.high set to 40G):
> >  ==========================================
> >
> >  ------------------------------------------------------------------------------
> >                      v6.11-rc3 mainline              zswap-mTHP      Change wrt
> >                                Baseline                                Baseline
> >  ------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd   deflate-        zstd    deflate-   zstd deflate-
> >                                     iaa                     iaa             iaa
> >  ------------------------------------------------------------------------------
> >  Throughput (KB/s)   161,496    156,343     140,363     151,938   -13%      -3%
> >  sys time (sec)       771.68     802.08      954.85      735.47   -24%       8%
> >  memcg_high          111,223    110,889     138,651     133,884
> >  memcg_swap_high           0          0           0           0
> >  memcg_swap_fail           0          0           0           0
> >  pswpin                   16         16           0           0
> >  pswpout           7,471,472  7,527,963           0           0
> >  zswpin                  635        605         624         639
> >  zswpout               1,509      1,478   9,453,761   9,385,910
> >  thp_swpout                0          0           0           0
> >  thp_swpout_               0          0           0           0
> >   fallback
> >  pgmajfault            3,616      3,430       4,633       3,611
> >  ZSWPOUT-64kB            n/a        n/a     590,768     586,521
> >  SWPOUT-64kB         466,967    470,498           0           0
> >  ------------------------------------------------------------------------------
> >
> >  2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
> >  =======================================================
> >
> >  ------------------------------------------------------------------------------
> >                       v6.11-rc3 mainline              zswap-mTHP     Change wrt
> >                                 Baseline                               Baseline
> >  ------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd    deflate-        zstd    deflate-  zstd deflate-
> >                                      iaa                     iaa            iaa
> >  ------------------------------------------------------------------------------
> >  Throughput (KB/s)    192,164    194,643     165,005     174,536  -14%     -10%
> >  sys time (sec)        823.55     830.42      801.72      676.65    3%      19%
> >  memcg_high            16,054     15,936      14,951      16,096
> >  memcg_swap_high            0          0           0           0
> >  memcg_swap_fail            0          0           0           0
> >  pswpin                     0          0           0           0
> >  pswpout            8,629,248  8,628,907           0           0
> >  zswpin                   560        645       5,333         781
> >  zswpout                1,416      1,503   8,546,895   9,355,760
> >  thp_swpout            16,854     16,853           0           0
> >  thp_swpout_                0          0           0           0
> >   fallback
> >  pgmajfault             3,341      3,574       8,139       3,582
> >  ZSWPOUT-2048kB           n/a        n/a      16,684      18,270
> >  SWPOUT-2048kB         16,854     16,853           0           0
> >  ------------------------------------------------------------------------------
> >
> > In the "Before" scenario, when zswap does not store mTHP, only allocations
> > count towards the cgroup memory limit. However, in the "After" scenario,
> > with the introduction of zswap_store() mTHP, both, allocations as well as
> > the zswap compressed pool usage from all 70 processes are counted
> towards
> > the memory limit. As a result, we see higher swapout activity in the
> > "After" data. Hence, more time is spent doing reclaim as the zswap cgroup
> > charge leads to more frequent memory.high breaches.
> >
> > This causes degradation in throughput and sys time with zswap mTHP, more
> so
> > in case of zstd than deflate-iaa. Compress latency could play a part in
> > this - when there is more swapout activity happening, a slower compressor
> > would cause allocations to stall for any/all of the 70 processes.
> >
> > In my opinion, even though the test set up does not provide an accurate
> > way for a direct before/after comparison (because of zswap usage being
> > counted in cgroup, hence towards the memory.high), it still seems
> > reasonable for zswap_store to support (m)THP, so that further performance
> > improvements can be implemented.
> 
> Are you saying that in the "Before" data we end up skipping zswap
> completely because of using mTHPs?

That's right, Yosry.

> 
> Does it make more sense to turn CONFIG_THP_SWAP in the "Before" data

We could do this, however I am not sure if turning off CONFIG_THP_SWAP
will have other side-effects in terms of disabling mm code paths outside of
zswap that are intended to be mTHP optimizations that could again skew
the before/after comparisons.

Will wait for Nhat's comments as well.

Thanks,
Kanchana

> to force the mTHPs to be split and for the data to be stored in zswap?
> This would be a more fair Before/After comparison where the memory
> goes to zswap in both cases, but "Before" has to be split because of
> zswap's lack of support for mTHP. I assume most setups relying on
> zswap will be turning CONFIG_THP_SWAP off today anyway, but maybe not.
> Nhat, is this something you can share?
Re: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
Posted by Yosry Ahmed 1 year, 3 months ago
[..]
> > > In the "Before" scenario, when zswap does not store mTHP, only allocations
> > > count towards the cgroup memory limit. However, in the "After" scenario,
> > > with the introduction of zswap_store() mTHP, both, allocations as well as
> > > the zswap compressed pool usage from all 70 processes are counted
> > towards
> > > the memory limit. As a result, we see higher swapout activity in the
> > > "After" data. Hence, more time is spent doing reclaim as the zswap cgroup
> > > charge leads to more frequent memory.high breaches.
> > >
> > > This causes degradation in throughput and sys time with zswap mTHP, more
> > so
> > > in case of zstd than deflate-iaa. Compress latency could play a part in
> > > this - when there is more swapout activity happening, a slower compressor
> > > would cause allocations to stall for any/all of the 70 processes.
> > >
> > > In my opinion, even though the test set up does not provide an accurate
> > > way for a direct before/after comparison (because of zswap usage being
> > > counted in cgroup, hence towards the memory.high), it still seems
> > > reasonable for zswap_store to support (m)THP, so that further performance
> > > improvements can be implemented.
> >
> > Are you saying that in the "Before" data we end up skipping zswap
> > completely because of using mTHPs?
>
> That's right, Yosry.
>
> >
> > Does it make more sense to turn CONFIG_THP_SWAP in the "Before" data
>
> We could do this, however I am not sure if turning off CONFIG_THP_SWAP
> will have other side-effects in terms of disabling mm code paths outside of
> zswap that are intended to be mTHP optimizations that could again skew
> the before/after comparisons.

Yeah that's possible, but right now we are testing mTHP swapout that
does not go through zswap at all vs. mTHP swapout going through zswap.

I think what we really want to measure is 4K swapout going through
zswap vs. mTHP swapout going through zswap. This assumes that current
zswap setups disable CONFIG_THP_SWAP, so we would be measuring the
benefit of allowing them to enable CONFIG_THP_SWAP by supporting it
properly in zswap.

If some setups with zswap have CONFIG_THP_SWAP enabled then that's a
different story, but we already have the data for this case as well
right now in case this is a legitimate setup.

Adding Chris Li here from Google. We have CONFIG_THP_SWAP disabled
with zswap, so for us we would want to know the benefit of supporting
CONFIG_THP_SWAP properly in zswap. At least I think so :)

>
> Will wait for Nhat's comments as well.
>
> Thanks,
> Kanchana
>
> > to force the mTHPs to be split and for the data to be stored in zswap?
> > This would be a more fair Before/After comparison where the memory
> > goes to zswap in both cases, but "Before" has to be split because of
> > zswap's lack of support for mTHP. I assume most setups relying on
> > zswap will be turning CONFIG_THP_SWAP off today anyway, but maybe not.
> > Nhat, is this something you can share?
RE: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
Posted by Sridhar, Kanchana P 1 year, 3 months ago
Hi Yosry,

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, August 28, 2024 6:02 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>; Chris
> Li <chrisl@kernel.org>
> Subject: Re: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
> 
> [..]
> > > > In the "Before" scenario, when zswap does not store mTHP, only
> allocations
> > > > count towards the cgroup memory limit. However, in the "After"
> scenario,
> > > > with the introduction of zswap_store() mTHP, both, allocations as well as
> > > > the zswap compressed pool usage from all 70 processes are counted
> > > towards
> > > > the memory limit. As a result, we see higher swapout activity in the
> > > > "After" data. Hence, more time is spent doing reclaim as the zswap
> cgroup
> > > > charge leads to more frequent memory.high breaches.
> > > >
> > > > This causes degradation in throughput and sys time with zswap mTHP,
> more
> > > so
> > > > in case of zstd than deflate-iaa. Compress latency could play a part in
> > > > this - when there is more swapout activity happening, a slower
> compressor
> > > > would cause allocations to stall for any/all of the 70 processes.
> > > >
> > > > In my opinion, even though the test set up does not provide an accurate
> > > > way for a direct before/after comparison (because of zswap usage being
> > > > counted in cgroup, hence towards the memory.high), it still seems
> > > > reasonable for zswap_store to support (m)THP, so that further
> performance
> > > > improvements can be implemented.
> > >
> > > Are you saying that in the "Before" data we end up skipping zswap
> > > completely because of using mTHPs?
> >
> > That's right, Yosry.
> >
> > >
> > > Does it make more sense to turn CONFIG_THP_SWAP in the "Before" data
> >
> > We could do this, however I am not sure if turning off CONFIG_THP_SWAP
> > will have other side-effects in terms of disabling mm code paths outside of
> > zswap that are intended to be mTHP optimizations that could again skew
> > the before/after comparisons.
> 
> Yeah that's possible, but right now we are testing mTHP swapout that
> does not go through zswap at all vs. mTHP swapout going through zswap.
> 
> I think what we really want to measure is 4K swapout going through
> zswap vs. mTHP swapout going through zswap. This assumes that current
> zswap setups disable CONFIG_THP_SWAP, so we would be measuring the
> benefit of allowing them to enable CONFIG_THP_SWAP by supporting it
> properly in zswap.
> 
> If some setups with zswap have CONFIG_THP_SWAP enabled then that's a
> different story, but we already have the data for this case as well
> right now in case this is a legitimate setup.
> 
> Adding Chris Li here from Google. We have CONFIG_THP_SWAP disabled
> with zswap, so for us we would want to know the benefit of supporting
> CONFIG_THP_SWAP properly in zswap. At least I think so :)

Sure, this makes sense. Here's the data that I gathered with CONFIG_THP_SWAP
disabled. We see improvements overall in throughput and sys time for zstd and
deflate-iaa, when comparing before (THP_SWAP=N) vs. after (THP_SWAP=Y):

64K mTHP:
=========

 -------------------------------------------------------------------------------
                       v6.11-rc3 mainline              zswap-mTHP     Change wrt
                                 Baseline                               Baseline
                        CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y
--------------------------------------------------------------------------------
 ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
                                      iaa                     iaa            iaa
 -------------------------------------------------------------------------------
 Throughput (KB/s)   136,113      140,044     140,363     151,938    3%       8%
 sys time (sec)       986.78       951.95      954.85      735.47    3%      23%
 memcg_high          124,183      127,513     138,651     133,884
 memcg_swap_high           0            0           0           0
 memcg_swap_fail     619,020      751,099           0           0
 pswpin                    0            0           0           0
 pswpout                   0            0           0           0
 zswpin                  656          569         624         639
 zswpout           9,413,603   11,284,812   9,453,761   9,385,910
 thp_swpout                0            0           0           0
 thp_swpout_               0            0           0           0
  fallback
 pgmajfault            3,470        3,382       4,633       3,611
 ZSWPOUT-64kB            n/a          n/a     590,768     586,521
 SWPOUT-64kB               0            0           0           0
 -------------------------------------------------------------------------------


2M THP:
=======

 ------------------------------------------------------------------------------
                       v6.11-rc3 mainline              zswap-mTHP    Change wrt
                                 Baseline                              Baseline
                        CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y
 ------------------------------------------------------------------------------
 ZSWAP compressor       zstd    deflate-        zstd    deflate-  zstd deflate-
                                     iaa                     iaa            iaa
 ------------------------------------------------------------------------------
 Throughput (KB/s)    164,220    172,523      165,005     174,536  0.5%      1%
 sys time (sec)        855.76     686.94       801.72      676.65    6%      1%
 memcg_high            14,628     16,247       14,951      16,096
 memcg_swap_high            0          0            0           0
 memcg_swap_fail       18,698     21,114            0           0
 pswpin                     0          0            0           0
 pswpout                    0          0            0           0
 zswpin                   663        665        5,333         781
 zswpout            8,419,458  8,992,065    8,546,895   9,355,760
 thp_swpout                 0          0            0           0
 thp_swpout_           18,697     21,113            0           0
  fallback
 pgmajfault             3,439      3,496        8,139       3,582
 ZSWPOUT-2048kB           n/a        n/a       16,684      18,270
 SWPOUT-2048kB              0          0            0           0
 -----------------------------------------------------------------------------

Thanks,
Kanchana

> 
> >
> > Will wait for Nhat's comments as well.
> >
> > Thanks,
> > Kanchana
> >
> > > to force the mTHPs to be split and for the data to be stored in zswap?
> > > This would be a more fair Before/After comparison where the memory
> > > goes to zswap in both cases, but "Before" has to be split because of
> > > zswap's lack of support for mTHP. I assume most setups relying on
> > > zswap will be turning CONFIG_THP_SWAP off today anyway, but maybe
> not.
> > > Nhat, is this something you can share?