> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, August 28, 2024 3:37 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
>
> On Wed, Aug 28, 2024 at 2:35 AM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Hi All,
> >
> > This patch-series enables zswap_store() to accept and store mTHP
> > folios. The most significant contribution in this series is from the
> > earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> > migrated to v6.11-rc3 in patch 2/4 of this series.
> >
> > [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> >      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> >
> > Additionally, some of the functionality in zswap_store() has been
> > modularized to make it more amenable to supporting any-order mTHPs.
> > For instance, the function zswap_store_entry() stores a zswap_entry
> > in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
> > delete all offsets corresponding to a higher-order folio stored in zswap.
> >
> > For accounting purposes, the patch-series adds per-order mTHP sysfs
> > "zswpout" counters that get incremented upon each successful zswap_store()
> > of an mTHP folio:
> >
> > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
> >
> > This patch-series is a precursor to ZSWAP compress batching of mTHP
> > swap-out and decompress batching of swap-ins based on swapin_readahead(),
> > using Intel IAA hardware acceleration, which we would like to submit in
> > subsequent RFC patch-series, with performance improvement data.
> >
> > Thanks to Ying Huang for pre-posting review feedback and suggestions!
> >
> > Changes since v4:
> > =================
> > 1) Published before/after data with zstd, as suggested by Nhat (Thanks
> >    Nhat for the data reviews!).
> > 2) Rebased to mm-unstable from 8/27/2024,
> >    commit b659edec079c90012cf8d05624e312d1062b8b87.
> > 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get()
> >    if CONFIG_MEMCG is not defined, to resolve build errors reported by
> >    the kernel robot, as per Nhat's and Michal's suggestion to not require
> >    a separate patch to fix the build errors (thanks both!).
> > 4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
> >    suggested by Yosry (Thanks Yosry!).
> > 5) Squashed the commits that define new mthp zswpout stat counters and
> >    invoke count_mthp_stat() after successful zswap_store()s into a single
> >    commit. Thanks Yosry for this suggestion!
> >
> > Changes since v3:
> > =================
> > 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
> >    Thanks to Barry for suggesting aligning with Ryan Roberts' latest
> >    changes to count_mthp_stat() so that it's always defined, even when THP
> >    is disabled. Barry, I have also made one other change in page_io.c,
> >    where count_mthp_stat() is called by count_swpout_vm_event(). I would
> >    appreciate it if you can review this. Thanks!
> >    Hopefully this should resolve the kernel robot build errors.
> >
> > Changes since v2:
> > =================
> > 1) Gathered usemem data using SSD as the backing swap device for zswap,
> >    as suggested by Ying Huang. Ying, I would appreciate it if you can
> >    review the latest data. Thanks!
> > 2) Generated the base commit info in the patches to attempt to address
> >    the kernel test robot build errors.
> > 3) No code changes to the individual patches themselves.
> >
> > Changes since RFC v1:
> > =====================
> >
> > 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
> >    Thanks Barry!
> > 2) Addressed some of the code review comments that Nhat Pham provided in
> >    Ryan's initial RFC [1]:
> >    - Added a comment about the cgroup zswap limit checks occurring once
> >      per folio at the beginning of zswap_store().
> >      Nhat, Ryan, please do let me know if the comments convey the summary
> >      from the RFC discussion. Thanks!
> >    - Posted data on running the cgroup suite's zswap kselftest.
> > 3) Rebased to v6.11-rc3.
> > 4) Gathered performance data with usemem and the rebased patch-series.
> >
> > Performance Testing:
> > ====================
> > Testing of this patch-series was done with the v6.11-rc3 mainline, without
> > and with this patch-series, on an Intel Sapphire Rapids server,
> > dual-socket, 56 cores per socket, with 4 IAA devices per socket.
> >
> > The system has 503 GiB RAM, with 176 GiB ZRAM (35% of available RAM) as
> > the backing swap device for ZSWAP. zstd is configured as the ZRAM
> > compressor. Core frequency was fixed at 2500 MHz.
> >
> > The vm-scalability "usemem" test was run in a cgroup whose memory.high
> > was fixed at 40G. There is no swap limit set for the cgroup. Following a
> > similar methodology as in Ryan Roberts' "Swap-out mTHP without splitting"
> > series [2], 70 usemem processes were run, each allocating and writing 1G
> > of memory:
> >
> >     usemem --init-time -w -O -n 70 1g
> >
> > The vm/sysfs mTHP stats included with the performance data provide
> > details on the swapout activity to ZSWAP/swap.
> >
> > Other kernel configuration parameters:
> >
> >     ZSWAP Compressors : zstd, deflate-iaa
> >     ZSWAP Allocator   : zsmalloc
> >     SWAP page-cluster : 2
> >
> > In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> > IAA "compression verification" is enabled. Hence each IAA compression
> > will be decompressed internally by the "iaa_crypto" driver, the CRCs
> > returned by the hardware will be compared, and errors reported in case
> > of mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> > compared to the software compressors.
> >
> > Throughput is derived by averaging the individual 70 processes'
> > throughputs reported by usemem. sys time is measured with perf. All data
> > points are averaged across 3 runs.
> >
> > 64KB mTHP (cgroup memory.high set to 40G):
> > ==========================================
> >
> >  ------------------------------------------------------------------------------
> >                      v6.11-rc3 mainline       zswap-mTHP         Change wrt
> >                      Baseline                                    Baseline
> >  ------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd    deflate-      zstd   deflate-   zstd  deflate-
> >                                 iaa                  iaa               iaa
> >  ------------------------------------------------------------------------------
> >  Throughput (KB/s)   161,496     156,343   140,363    151,938   -13%       -3%
> >  sys time (sec)       771.68      802.08    954.85     735.47   -24%        8%
> >  memcg_high          111,223     110,889   138,651    133,884
> >  memcg_swap_high           0           0         0          0
> >  memcg_swap_fail           0           0         0          0
> >  pswpin                   16          16         0          0
> >  pswpout           7,471,472   7,527,963         0          0
> >  zswpin                  635         605       624        639
> >  zswpout               1,509       1,478 9,453,761  9,385,910
> >  thp_swpout                0           0         0          0
> >  thp_swpout_fallback       0           0         0          0
> >  pgmajfault            3,616       3,430     4,633      3,611
> >  ZSWPOUT-64kB            n/a         n/a   590,768    586,521
> >  SWPOUT-64kB         466,967     470,498         0          0
> >  ------------------------------------------------------------------------------
> >
> > 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
> > =======================================================
> >
> >  ------------------------------------------------------------------------------
> >                      v6.11-rc3 mainline       zswap-mTHP         Change wrt
> >                      Baseline                                    Baseline
> >  ------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd    deflate-      zstd   deflate-   zstd  deflate-
> >                                 iaa                  iaa               iaa
> >  ------------------------------------------------------------------------------
> >  Throughput (KB/s)   192,164     194,643   165,005    174,536   -14%      -10%
> >  sys time (sec)       823.55      830.42    801.72     676.65     3%       19%
> >  memcg_high           16,054      15,936    14,951     16,096
> >  memcg_swap_high           0           0         0          0
> >  memcg_swap_fail           0           0         0          0
> >  pswpin                    0           0         0          0
> >  pswpout           8,629,248   8,628,907         0          0
> >  zswpin                  560         645     5,333        781
> >  zswpout               1,416       1,503 8,546,895  9,355,760
> >  thp_swpout           16,854      16,853         0          0
> >  thp_swpout_fallback       0           0         0          0
> >  pgmajfault            3,341       3,574     8,139      3,582
> >  ZSWPOUT-2048kB          n/a         n/a    16,684     18,270
> >  SWPOUT-2048kB        16,854      16,853         0          0
> >  ------------------------------------------------------------------------------
> >
> > In the "Before" scenario, when zswap does not store mTHP, only
> > allocations count towards the cgroup memory limit. However, in the
> > "After" scenario, with the introduction of zswap_store() of mTHP,
> > both the allocations and the zswap compressed pool usage from all
> > 70 processes are counted towards the memory limit. As a result, we
> > see higher swapout activity in the "After" data. Hence, more time
> > is spent doing reclaim, as the zswap cgroup charge leads to more
> > frequent memory.high breaches.
> >
> > This causes degradation in throughput and sys time with zswap mTHP,
> > more so in case of zstd than deflate-iaa. Compress latency could
> > play a part in this - when there is more swapout activity happening,
> > a slower compressor would cause allocations to stall for any/all of
> > the 70 processes.
> >
> > In my opinion, even though the test setup does not provide an
> > accurate way for a direct before/after comparison (because of zswap
> > usage being counted in the cgroup, hence towards the memory.high),
> > it still seems reasonable for zswap_store to support (m)THP, so that
> > further performance improvements can be implemented.
>
> Are you saying that in the "Before" data we end up skipping zswap
> completely because of using mTHPs?

That's right, Yosry.
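For context on the above analysis: the extra memory.high breaches in the
"After" runs come from zswap's existing objcg accounting, which charges the
compressed size back to the folio's memcg. A paraphrased, simplified sketch
of that shape in mainline mm/zswap.c (pre-existing behavior, not part of
this patch-series):

	objcg = get_obj_cgroup_from_folio(folio);
	if (objcg && !obj_cgroup_may_zswap(objcg))
		goto reject;			/* cgroup zswap limit check */

	/* ... compress the page; entry->length is the compressed size ... */

	if (objcg) {
		/* charged to the memcg, so it counts towards memory.high */
		obj_cgroup_charge_zswap(objcg, entry->length);
		count_objcg_event(objcg, ZSWPOUT);
	}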
>
> Does it make more sense to turn off CONFIG_THP_SWAP in the "Before" data

We could do this, however I am not sure if turning off CONFIG_THP_SWAP
will have other side-effects in terms of disabling mm code paths outside of
zswap that are intended as mTHP optimizations, which could again skew
the before/after comparisons.

Will wait for Nhat's comments as well.

Thanks,
Kanchana

> to force the mTHPs to be split and for the data to be stored in zswap?
> This would be a more fair Before/After comparison where the memory
> goes to zswap in both cases, but "Before" has to be split because of
> zswap's lack of support for mTHP. I assume most setups relying on
> zswap will be turning CONFIG_THP_SWAP off today anyway, but maybe not.
> Nhat, is this something you can share?
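To make the modularization described in the cover letter concrete, here is a
minimal illustrative sketch of a store loop over all offsets of a large
folio. zswap_store_entry() and zswap_delete_stored_offsets() are the helper
names from the cover letter; zswap_store_page() and MTHP_STAT_ZSWPOUT are
hypothetical names used only for this sketch, and error handling is
simplified:

	static bool zswap_store_folio_sketch(struct folio *folio)
	{
		long nr_pages = folio_nr_pages(folio);
		long i;

		for (i = 0; i < nr_pages; i++) {
			/*
			 * Compress subpage i, then use zswap_store_entry()
			 * to insert its zswap_entry into the xarray.
			 */
			if (!zswap_store_page(folio, i))
				goto fail;
		}

		/* per-order sysfs "zswpout" counter from the cover letter */
		count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
		return true;

	fail:
		/* delete any offsets already stored for this folio */
		zswap_delete_stored_offsets(folio);
		return false;
	}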
[..]
> > > In the "Before" scenario, when zswap does not store mTHP, only
> > > allocations count towards the cgroup memory limit. However, in the
> > > "After" scenario, with the introduction of zswap_store() of mTHP,
> > > both the allocations and the zswap compressed pool usage from all
> > > 70 processes are counted towards the memory limit. As a result, we
> > > see higher swapout activity in the "After" data. Hence, more time
> > > is spent doing reclaim, as the zswap cgroup charge leads to more
> > > frequent memory.high breaches.
> > >
> > > This causes degradation in throughput and sys time with zswap mTHP,
> > > more so in case of zstd than deflate-iaa. Compress latency could
> > > play a part in this - when there is more swapout activity happening,
> > > a slower compressor would cause allocations to stall for any/all of
> > > the 70 processes.
> > >
> > > In my opinion, even though the test setup does not provide an
> > > accurate way for a direct before/after comparison (because of zswap
> > > usage being counted in the cgroup, hence towards the memory.high),
> > > it still seems reasonable for zswap_store to support (m)THP, so that
> > > further performance improvements can be implemented.
> >
> > Are you saying that in the "Before" data we end up skipping zswap
> > completely because of using mTHPs?
>
> That's right, Yosry.
>
> > Does it make more sense to turn off CONFIG_THP_SWAP in the "Before" data
>
> We could do this, however I am not sure if turning off CONFIG_THP_SWAP
> will have other side-effects in terms of disabling mm code paths outside of
> zswap that are intended as mTHP optimizations, which could again skew
> the before/after comparisons.

Yeah that's possible, but right now we are testing mTHP swapout that
does not go through zswap at all vs. mTHP swapout going through zswap.

I think what we really want to measure is 4K swapout going through
zswap vs. mTHP swapout going through zswap. This assumes that current
zswap setups disable CONFIG_THP_SWAP, so we would be measuring the
benefit of allowing them to enable CONFIG_THP_SWAP by supporting it
properly in zswap.

If some setups with zswap have CONFIG_THP_SWAP enabled then that's a
different story, but we already have the data for this case as well
right now in case this is a legitimate setup.

Adding Chris Li here from Google. We have CONFIG_THP_SWAP disabled
with zswap, so for us we would want to know the benefit of supporting
CONFIG_THP_SWAP properly in zswap. At least I think so :)

> Will wait for Nhat's comments as well.
>
> Thanks,
> Kanchana
>
> > to force the mTHPs to be split and for the data to be stored in zswap?
> > This would be a more fair Before/After comparison where the memory
> > goes to zswap in both cases, but "Before" has to be split because of
> > zswap's lack of support for mTHP. I assume most setups relying on
> > zswap will be turning CONFIG_THP_SWAP off today anyway, but maybe not.
> > Nhat, is this something you can share?
Hi Yosry,
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, August 28, 2024 6:02 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>; Chris
> Li <chrisl@kernel.org>
> Subject: Re: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
>
> [..]
> > > > In the "Before" scenario, when zswap does not store mTHP, only
> > > > allocations count towards the cgroup memory limit. However, in the
> > > > "After" scenario, with the introduction of zswap_store() of mTHP,
> > > > both the allocations and the zswap compressed pool usage from all
> > > > 70 processes are counted towards the memory limit. As a result, we
> > > > see higher swapout activity in the "After" data. Hence, more time
> > > > is spent doing reclaim, as the zswap cgroup charge leads to more
> > > > frequent memory.high breaches.
> > > >
> > > > This causes degradation in throughput and sys time with zswap mTHP,
> > > > more so in case of zstd than deflate-iaa. Compress latency could
> > > > play a part in this - when there is more swapout activity happening,
> > > > a slower compressor would cause allocations to stall for any/all of
> > > > the 70 processes.
> > > >
> > > > In my opinion, even though the test setup does not provide an
> > > > accurate way for a direct before/after comparison (because of zswap
> > > > usage being counted in the cgroup, hence towards the memory.high),
> > > > it still seems reasonable for zswap_store to support (m)THP, so that
> > > > further performance improvements can be implemented.
> > >
> > > Are you saying that in the "Before" data we end up skipping zswap
> > > completely because of using mTHPs?
> >
> > That's right, Yosry.
> >
> > >
> > > Does it make more sense to turn off CONFIG_THP_SWAP in the "Before" data
> >
> > We could do this, however I am not sure if turning off CONFIG_THP_SWAP
> > will have other side-effects in terms of disabling mm code paths outside of
> > zswap that are intended as mTHP optimizations, which could again skew
> > the before/after comparisons.
>
> Yeah that's possible, but right now we are testing mTHP swapout that
> does not go through zswap at all vs. mTHP swapout going through zswap.
>
> I think what we really want to measure is 4K swapout going through
> zswap vs. mTHP swapout going through zswap. This assumes that current
> zswap setups disable CONFIG_THP_SWAP, so we would be measuring the
> benefit of allowing them to enable CONFIG_THP_SWAP by supporting it
> properly in zswap.
>
> If some setups with zswap have CONFIG_THP_SWAP enabled then that's a
> different story, but we already have the data for this case as well
> right now in case this is a legitimate setup.
>
> Adding Chris Li here from Google. We have CONFIG_THP_SWAP disabled
> with zswap, so for us we would want to know the benefit of supporting
> CONFIG_THP_SWAP properly in zswap. At least I think so :)
Sure, this makes sense. Here's the data that I gathered with CONFIG_THP_SWAP
disabled. We see improvements overall in throughput and sys time for zstd and
deflate-iaa, when comparing before (THP_SWAP=N) vs. after (THP_SWAP=Y):
64K mTHP:
=========
 -------------------------------------------------------------------------------
                     v6.11-rc3 mainline       zswap-mTHP            Change wrt
                     Baseline                                       Baseline
                     CONFIG_THP_SWAP=N        CONFIG_THP_SWAP=Y
 -------------------------------------------------------------------------------
 ZSWAP compressor       zstd    deflate-      zstd    deflate-    zstd  deflate-
                                iaa                   iaa                iaa
 -------------------------------------------------------------------------------
 Throughput (KB/s)   136,113     140,044   140,363     151,938      3%        8%
 sys time (sec)       986.78      951.95    954.85      735.47      3%       23%
 memcg_high          124,183     127,513   138,651     133,884
 memcg_swap_high           0           0         0           0
 memcg_swap_fail     619,020     751,099         0           0
 pswpin                    0           0         0           0
 pswpout                   0           0         0           0
 zswpin                  656         569       624         639
 zswpout           9,413,603  11,284,812 9,453,761   9,385,910
 thp_swpout                0           0         0           0
 thp_swpout_fallback       0           0         0           0
 pgmajfault            3,470       3,382     4,633       3,611
 ZSWPOUT-64kB            n/a         n/a   590,768     586,521
 SWPOUT-64kB               0           0         0           0
 -------------------------------------------------------------------------------
2M THP:
=======
 -------------------------------------------------------------------------------
                     v6.11-rc3 mainline       zswap-mTHP            Change wrt
                     Baseline                                       Baseline
                     CONFIG_THP_SWAP=N        CONFIG_THP_SWAP=Y
 -------------------------------------------------------------------------------
 ZSWAP compressor       zstd    deflate-      zstd    deflate-    zstd  deflate-
                                iaa                   iaa                iaa
 -------------------------------------------------------------------------------
 Throughput (KB/s)   164,220     172,523   165,005     174,536    0.5%        1%
 sys time (sec)       855.76      686.94    801.72      676.65      6%        1%
 memcg_high           14,628      16,247    14,951      16,096
 memcg_swap_high           0           0         0           0
 memcg_swap_fail      18,698      21,114         0           0
 pswpin                    0           0         0           0
 pswpout                   0           0         0           0
 zswpin                  663         665     5,333         781
 zswpout           8,419,458   8,992,065 8,546,895   9,355,760
 thp_swpout                0           0         0           0
 thp_swpout_fallback  18,697      21,113         0           0
 pgmajfault            3,439       3,496     8,139       3,582
 ZSWPOUT-2048kB          n/a         n/a    16,684      18,270
 SWPOUT-2048kB             0           0         0           0
 -------------------------------------------------------------------------------
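
For reference, the memcg_swap_fail and thp_swpout_fallback counts in the
Baseline columns follow directly from disabling CONFIG_THP_SWAP. A
simplified paraphrase of the mainline folio_alloc_swap() logic (not verbatim
kernel code):

	/*
	 * With CONFIG_THP_SWAP=N, a large folio never gets a swap slot:
	 * entry stays 0, add_to_swap() fails (recorded as memcg_swap_fail),
	 * and reclaim falls back to splitting the folio (bumping
	 * thp_swpout_fallback for PMD-size THPs), after which the
	 * individual 4K pages take the normal zswap path.
	 */
	entry.val = 0;
	if (folio_test_large(folio)) {
		if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
			get_swap_pages(1, &entry, folio_nr_pages(folio));
		goto out;
	}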
Thanks,
Kanchana
>
> >
> > Will wait for Nhat's comments as well.
> >
> > Thanks,
> > Kanchana
> >
> > > to force the mTHPs to be split and for the data to be stored in zswap?
> > > This would be a more fair Before/After comparison where the memory
> > > goes to zswap in both cases, but "Before" has to be split because of
> > > zswap's lack of support for mTHP. I assume most setups relying on
> > > zswap will be turning CONFIG_THP_SWAP off today anyway, but maybe
> not.
> > > Nhat, is this something you can share?
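
As a closing note on the accounting: the page_io.c change mentioned in the
cover letter's "Changes since v3" follows the pattern of the existing
count_swpout_vm_event(), roughly (paraphrased from mainline, simplified):

	static inline void count_swpout_vm_event(struct folio *folio)
	{
	#ifdef CONFIG_TRANSPARENT_HUGEPAGE
		/* legacy counter, PMD-mappable THPs only */
		if (unlikely(folio_test_pmd_mappable(folio)))
			count_vm_event(THP_SWPOUT);
	#endif
		/* per-order mTHP counter, always defined per Ryan's changes */
		count_mthp_stat(folio_order(folio), MTHP_STAT_SWPOUT);
		count_vm_events(PSWPOUT, folio_nr_pages(folio));
	}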