Hi All,

This patch-series enables zswap_store() to accept and store mTHP
folios. The most significant contribution in this series is from the
earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
migrated to mm-unstable as of 9-23-2024 in patches 5,6 of this series.

[1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
     https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u

Additionally, there is an attempt to modularize some of the functionality
in zswap_store(), to make it more amenable to supporting any-order
mTHPs. For instance, the function zswap_store_entry() stores a zswap_entry
in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
delete all offsets corresponding to a higher order folio stored in zswap.

For accounting purposes, the patch-series adds per-order mTHP sysfs
"zswpout" counters that get incremented upon successful zswap_store of
an mTHP folio:

/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout

A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
will enable/disable zswap storing of (m)THP. When disabled, zswap will
fall back to rejecting the mTHP folio, to be processed by the backing
swap device.

This patch-series is a pre-requisite for ZSWAP compress batching of mTHP
swap-out and decompress batching of swap-ins based on swapin_readahead(),
using Intel IAA hardware acceleration, which we would like to submit in
subsequent patch-series, with performance improvement data.

Thanks to Ying Huang for pre-posting review feedback and suggestions!

Thanks also to Nhat, Yosry, Barry, Chengming, Usama and Ying for their
helpful feedback, data reviews and suggestions!

Co-development signoff request:
===============================
I would like to request Ryan Roberts' co-developer signoff on patches
5 and 6 in this series. Thanks Ryan!
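For anyone who wants to watch the per-order "zswpout" counters described above from userspace, here is a minimal sketch. It is not part of the series. It only assumes the sysfs path quoted above and that each file holds a single integer; the helper name read_mthp_zswpout() and its "base" parameter are made up for illustration.

```python
import glob
import os
import re

# Default location quoted in the cover letter; "base" is a hypothetical
# parameter so the helper can also be pointed at a test directory.
THP_SYSFS = "/sys/kernel/mm/transparent_hugepage"


def read_mthp_zswpout(base=THP_SYSFS):
    """Return {folio size in kB: zswpout count} for every order found.

    Assumes each hugepages-<N>kB/stats/zswpout file contains one integer.
    Orders whose file is missing or unreadable are skipped.
    """
    counters = {}
    pattern = os.path.join(base, "hugepages-*kB", "stats", "zswpout")
    for path in glob.glob(pattern):
        # path looks like <base>/hugepages-64kB/stats/zswpout
        size_dir = os.path.basename(os.path.dirname(os.path.dirname(path)))
        match = re.fullmatch(r"hugepages-(\d+)kB", size_dir)
        if match is None:
            continue
        try:
            with open(path) as f:
                counters[int(match.group(1))] = int(f.read().strip())
        except (OSError, ValueError):
            continue
    return counters


if __name__ == "__main__":
    for size_kb, count in sorted(read_mthp_zswpout().items()):
        print(f"{size_kb:>6} kB  zswpout={count}")
```

On a kernel without this series the zswpout files do not exist, so the helper returns an empty dict.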
Changes since v6:
=================
1) Rebased to mm-unstable as of 9-23-2024,
   commit acfabf7e197f7a5bedf4749dac1f39551417b049.
2) Refactored into smaller commits, as suggested by Yosry and
   Chengming. Thanks both!
3) Reworded the commit log for patches 5 and 6 as per Yosry's
   suggestion. Thanks Yosry!
4) Gathered data on a Sapphire Rapids server that has an 823GiB SSD swap
   disk partition. Also, all experiments are run with usemem --sleep 10,
   so that the memory allocated by the 70 processes remains in memory
   longer. Posted elapsed and sys times. Thanks to Yosry, Nhat and Ying for
   their help with refining the performance characterization methodology.
5) Updated Documentation/admin-guide/mm/transhuge.rst as suggested by
   Nhat. Thanks Nhat!

Changes since v5:
=================
1) Rebased to mm-unstable as of 8/29/2024,
   commit 9287e4adbc6ab8fa04d25eb82e097fed877a4642.
2) Added CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default) to
   enable/disable zswap_store() of mTHP folios. Thanks Nhat for the
   suggestion to add a knob by which users can enable/disable this
   change. Nhat, I hope this is along the lines of what you were
   thinking.
3) Added vm-scalability usemem data with 4K folios with
   CONFIG_ZSWAP_STORE_THP_DEFAULT_ON off, that I gathered to make sure
   there is no regression with this change.
4) Added data with usemem with 64K and 2M THP for an alternate view of
   before/after, as suggested by Yosry, so we can understand the impact
   of when mTHPs are split into 4K folios in shrink_folio_list()
   (CONFIG_THP_SWAP off) vs. not split (CONFIG_THP_SWAP on) and stored
   in zswap. Thanks Yosry for this suggestion.

Changes since v4:
=================
1) Published before/after data with zstd, as suggested by Nhat (Thanks
   Nhat for the data reviews!).
2) Rebased to mm-unstable from 8/27/2024,
   commit b659edec079c90012cf8d05624e312d1062b8b87.
3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
   CONFIG_MEMCG is not defined, to resolve build errors reported by kernel
   robot; as per Nhat's and Michal's suggestion to not require a separate
   patch to fix the build errors (thanks both!).
4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
   suggested by Yosry (Thanks Yosry!).
5) Squashed the commits that define the new mthp zswpout stat counters and
   invoke count_mthp_stat() after successful zswap_store()s into a single
   commit. Thanks Yosry for this suggestion!

Changes since v3:
=================
1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
   Thanks to Barry for suggesting aligning with Ryan Roberts' latest
   changes to count_mthp_stat() so that it's always defined, even when THP
   is disabled. Barry, I have also made one other change in page_io.c
   where count_mthp_stat() is called by count_swpout_vm_event(). I would
   appreciate it if you can review this. Thanks!
   Hopefully this should resolve the kernel robot build errors.

Changes since v2:
=================
1) Gathered usemem data using SSD as the backing swap device for zswap,
   as suggested by Ying Huang. Ying, I would appreciate it if you can
   review the latest data. Thanks!
2) Generated the base commit info in the patches to attempt to address
   the kernel test robot build errors.
3) No code changes to the individual patches themselves.

Changes since RFC v1:
=====================
1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
   Thanks Barry!
2) Addressed some of the code review comments that Nhat Pham provided in
   Ryan's initial RFC [1]:
   - Added a comment about the cgroup zswap limit checks occurring once
     per folio at the beginning of zswap_store().
     Nhat, Ryan, please do let me know if the comments convey the summary
     from the RFC discussion. Thanks!
   - Posted data on running the cgroup suite's zswap kselftest.
3) Rebased to v6.11-rc3.
4) Gathered performance data with usemem and the rebased patch-series.


Regression Testing:
===================
I ran vm-scalability usemem 70 processes without mTHP, i.e., only 4K
folios with mm-unstable and with this patch-series. The main goal was
to make sure that there is no functional or performance regression
wrt the earlier zswap behavior for 4K folios when
CONFIG_ZSWAP_STORE_THP_DEFAULT_ON is not set and zswap_store() of 4K
pages goes through the newly added code path [zswap_store(),
zswap_store_page()].

The data indicates there is no regression.

----------------------------------------------------------------------------
                             mm-unstable 8-28-2024       zswap-mTHP v6
                                           CONFIG_ZSWAP_STORE_THP_DEFAULT_ON
                                                           is not set
----------------------------------------------------------------------------
ZSWAP compressor                    zstd deflate-iaa        zstd deflate-iaa
----------------------------------------------------------------------------
Throughput (KB/s)                110,775     113,010     111,550     121,937
sys time (sec)                  1,141.72      954.87    1,131.95      828.47
memcg_high                       140,500     153,737     139,772     134,129
memcg_swap_high                        0           0           0           0
memcg_swap_fail                        0           0           0           0
pswpin                                 0           0           0           0
pswpout                                0           0           0           0
zswpin                               675         690         682         684
zswpout                        9,552,298  10,603,271   9,566,392   9,267,213
thp_swpout                             0           0           0           0
thp_swpout_fallback                    0           0           0           0
pgmajfault                         3,453       3,468       3,841       3,487
ZSWPOUT-64kB-mTHP                    n/a         n/a           0           0
SWPOUT-64kB-mTHP                       0           0           0           0
----------------------------------------------------------------------------


Performance Testing:
====================
Testing of this patch-series was done with mm-unstable as of 9-23-2024,
commit acfabf7e197f7a5bedf4749dac1f39551417b049. Data was gathered
without/with this patch-series, on an Intel Sapphire Rapids server,
dual-socket 56 cores per socket, 4 IAA devices per socket, 503 GiB RAM and
823G SSD disk partition swap. Core frequency was fixed at 2500MHz.

The vm-scalability "usemem" test was run in a cgroup whose memory.high
was fixed at 40G. There is no swap limit set for the cgroup.
Following a similar methodology as in Ryan Roberts' "Swap-out mTHP without
splitting" series [2], 70 usemem processes were run, each allocating and
writing 1G of memory, and sleeping for 10 sec before exiting:

    usemem --init-time -w -O -s 10 -n 70 1g

The vm/sysfs mTHP stats included with the performance data provide details
on the swapout activity to ZSWAP/swap.

Other kernel configuration parameters:

    ZSWAP Compressors : zstd, deflate-iaa
    ZSWAP Allocator   : zsmalloc
    SWAP page-cluster : 2

In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
IAA "compression verification" is enabled. Hence each IAA compression
will be decompressed internally by the "iaa_crypto" driver, the crc-s
returned by the hardware will be compared and errors reported in case of
mismatches. Thus "deflate-iaa" helps ensure better data integrity as
compared to the software compressors.

Throughput is derived by averaging the individual 70 processes' throughputs
reported by usemem. elapsed/sys times are measured with perf. All data
points per compressor/kernel/mTHP configuration are averaged across 3 runs.

Case 1: Comparing zswap 4K vs. zswap mTHP
=========================================

In this scenario, the "before" is CONFIG_THP_SWAP set to off, which results
in 64K/2M (m)THP being split into 4K folios that get processed by zswap.

The "after" is CONFIG_THP_SWAP set to on, plus this patch-series, which
results in 64K/2M (m)THP not being split, and being processed by zswap.
64KB mTHP (cgroup memory.high set to 40G):
==========================================

----------------------------------------------------------------------------------------------------
                             mm-unstable 9-23-2024         zswap-mTHP              Change wrt
                               CONFIG_THP_SWAP=N        CONFIG_THP_SWAP=Y           Baseline
                                    Baseline
----------------------------------------------------------------------------------------------------
ZSWAP compressor                    zstd deflate-iaa        zstd deflate-iaa        zstd deflate-iaa
----------------------------------------------------------------------------------------------------
Throughput (KB/s)                143,323     125,485     153,550     129,609          7%          3%
elapsed time (sec)                 24.97       25.42       23.90       25.19          4%          1%
sys time (sec)                    822.72      750.96      757.70      731.13          8%          3%
memcg_high                       132,743     169,825     148,075     192,744
memcg_swap_fail                  639,067     841,553       2,204       2,215
pswpin                                 0           0           0           0
pswpout                                0           0           0           0
zswpin                               795         873         760         902
zswpout                       10,011,266  13,195,137  10,010,017  13,193,554
thp_swpout                             0           0           0           0
thp_swpout_fallback                    0           0           0           0
64kB-mthp_swpout_fallback        639,065     841,553       2,204       2,215
pgmajfault                         2,861       2,924       3,054       3,259
ZSWPOUT-64kB                         n/a         n/a     623,451     822,268
SWPOUT-64kB                            0           0           0           0
----------------------------------------------------------------------------------------------------


2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
=======================================================

----------------------------------------------------------------------------------------------------
                             mm-unstable 9-23-2024         zswap-mTHP              Change wrt
                               CONFIG_THP_SWAP=N        CONFIG_THP_SWAP=Y           Baseline
                                    Baseline
----------------------------------------------------------------------------------------------------
ZSWAP compressor                    zstd deflate-iaa        zstd deflate-iaa        zstd deflate-iaa
----------------------------------------------------------------------------------------------------
Throughput (KB/s)                145,616     139,640     169,404     141,168         16%          1%
elapsed time (sec)                 25.05       23.85       23.02       23.37          8%          2%
sys time (sec)                    790.53      676.34      613.26      677.83         22%       -0.2%
memcg_high                        16,702      25,197      17,374      23,890
memcg_swap_fail                   21,485      27,814         114         144
pswpin                                 0           0           0           0
pswpout                                0           0           0           0
zswpin                               793         852         778         922
zswpout                       10,011,709  13,186,882  10,010,893  13,195,600
thp_swpout                             0           0           0           0
thp_swpout_fallback               21,485      27,814         114         144
2048kB-mthp_swpout_fallback          n/a         n/a           0           0
pgmajfault                         2,701       2,822       4,151       5,066
ZSWPOUT-2048kB                       n/a         n/a      19,442      25,615
SWPOUT-2048kB                          0           0           0           0
----------------------------------------------------------------------------------------------------

We mostly see improvements in throughput, elapsed and sys time for zstd and
deflate-iaa, when comparing before (THP_SWAP=N) vs. after (THP_SWAP=Y).


Case 2: Comparing SSD swap mTHP vs. zswap mTHP
==============================================

In this scenario, CONFIG_THP_SWAP is enabled in "before" and "after"
experiments. The "before" represents zswap rejecting mTHP, and the mTHP
being stored by the 823G SSD swap. The "after" represents data with this
patch-series, which results in 64K/2M (m)THP being processed and stored by
zswap.

64KB mTHP (cgroup memory.high set to 40G):
==========================================

----------------------------------------------------------------------------------------------------
                             mm-unstable 9-23-2024         zswap-mTHP              Change wrt
                               CONFIG_THP_SWAP=Y        CONFIG_THP_SWAP=Y           Baseline
                                    Baseline
----------------------------------------------------------------------------------------------------
ZSWAP compressor                    zstd deflate-iaa        zstd deflate-iaa        zstd deflate-iaa
----------------------------------------------------------------------------------------------------
Throughput (KB/s)                 20,265      20,696     153,550     129,609        658%        526%
elapsed time (sec)                 72.44       70.86       23.90       25.19         67%         64%
sys time (sec)                     77.95       77.99      757.70      731.13       -872%       -837%
memcg_high                       115,811     113,277     148,075     192,744
memcg_swap_fail                    2,386       2,425       2,204       2,215
pswpin                                16          16           0           0
pswpout                        7,774,235   7,616,069           0           0
zswpin                               728         749         760         902
zswpout                           38,424      39,022  10,010,017  13,193,554
thp_swpout                             0           0           0           0
thp_swpout_fallback                    0           0           0           0
64kB-mthp_swpout_fallback          2,386       2,425       2,204       2,215
pgmajfault                         2,757       2,860       3,054       3,259
ZSWPOUT-64kB                         n/a         n/a     623,451     822,268
SWPOUT-64kB                      485,890     476,004           0           0
----------------------------------------------------------------------------------------------------


2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
=======================================================

----------------------------------------------------------------------------------------------------
                             mm-unstable 9-23-2024         zswap-mTHP              Change wrt
                               CONFIG_THP_SWAP=Y        CONFIG_THP_SWAP=Y           Baseline
                                    Baseline
----------------------------------------------------------------------------------------------------
ZSWAP compressor                    zstd deflate-iaa        zstd deflate-iaa        zstd deflate-iaa
----------------------------------------------------------------------------------------------------
Throughput (KB/s)                 24,347      35,971     169,404     141,168        596%        292%
elapsed time (sec)                 63.52       64.59       23.02       23.37         64%         64%
sys time (sec)                     27.91       27.01      613.26      677.83      -2098%      -2410%
memcg_high                        13,576      13,467      17,374      23,890
memcg_swap_fail                      162         124         114         144
pswpin                                 0           0           0           0
pswpout                        7,003,307   7,168,853           0           0
zswpin                               741         722         778         922
zswpout                           84,429      65,315  10,010,893  13,195,600
thp_swpout                        13,678      14,002           0           0
thp_swpout_fallback                  162         124         114         144
2048kB-mthp_swpout_fallback          n/a         n/a           0           0
pgmajfault                         3,345       2,903       4,151       5,066
ZSWPOUT-2048kB                       n/a         n/a      19,442      25,615
SWPOUT-2048kB                     13,678      14,002           0           0
----------------------------------------------------------------------------------------------------

We see significant improvements in throughput and elapsed time for zstd and
deflate-iaa, when comparing before (mTHP-SSD) vs. after (mTHP-ZSWAP). The
sys time increases with mTHP-ZSWAP as expected, due to the CPU compression
time vs. asynchronous disk write times, as pointed out by Ying and Yosry.

In the "Before" scenario, when zswap does not store mTHP, only allocations
count towards the cgroup memory limit. However, in the "After" scenario,
with the introduction of zswap_store() mTHP, both the allocations and the
zswap compressed pool usage from all 70 processes are counted towards the
memory limit. As a result, we see higher swapout activity in the "After"
data. Hence, more time is spent doing reclaim as the zswap cgroup charge
leads to more frequent memory.high breaches.
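The tables above do not spell out how the "Change wrt Baseline" columns are derived. The sketch below shows one convention that reproduces the zstd 64K figures in Case 2 after rounding: for throughput (higher is better) the change is (after - before) / before, and for elapsed/sys time (lower is better) it is (before - after) / before, so a negative value means a regression. The function name and the higher_is_better flag are illustrative assumptions, not something taken from the series.

```python
def change_wrt_baseline(before, after, higher_is_better):
    """Percent improvement of `after` over the `before` baseline.

    Assumed convention (not stated in the cover letter): positive means
    better, negative means worse, regardless of the metric's direction.
    """
    if before == 0:
        raise ValueError("baseline must be non-zero")
    if higher_is_better:
        return (after - before) / before * 100.0
    return (before - after) / before * 100.0


if __name__ == "__main__":
    # Case 2, 64KB mTHP, zstd column: SSD swap (before) vs. zswap (after).
    print(round(change_wrt_baseline(20265, 153550, True)))   # throughput
    print(round(change_wrt_baseline(72.44, 23.90, False)))   # elapsed time
    print(round(change_wrt_baseline(77.95, 757.70, False)))  # sys time
```

Because the posted data points are averages over 3 runs, recomputing a percentage from the rounded table entries can differ from the posted value by about a percentage point.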
Summary:
========
The v7 data presented above comparing zswap-mTHP with a conventional 823G
SSD swap demonstrates good performance improvements with zswap-mTHP. Hence,
it seems reasonable for zswap_store to support (m)THP, so that further
performance improvements can be implemented.

Some of the ideas that have shown promise in our experiments are:

1) IAA compress/decompress batching.
2) Distributing compress jobs across all IAA devices on the socket.

In the experimental setup used in this patchset, we have enabled
IAA compress verification to ensure additional hardware data integrity CRC
checks not currently done by the software compressors. The tests run for
this patchset are also using only 1 IAA device per core, that avails of 2
compress engines on the device. In our experiments with IAA batching, we
distribute compress jobs from all cores to the 8 compress engines available
per socket. We further compress the pages in each mTHP in parallel in the
accelerator. As a result, we improve compress latency and reclaim
throughput.

The following compares the same usemem workload characteristics between:

1) zstd (v7 experiments)
2) deflate-iaa "Fixed mode" (v7 experiments)
3) deflate-iaa with batching
4) deflate-iaa-canned "Canned mode" [3] with batching

vm.page-cluster is set to "2" for all runs.

64K mTHP ZSWAP:
===============

----------------------------------------------------------------------------------------------------------
ZSWAP                               zstd   IAA Fixed   IAA Fixed  IAA Canned       IAA       IAA       IAA
compressor                          (v7)        (v7)  + Batching  + Batching     Batch    Canned    Canned
                                                                                   vs.       vs.     Batch
64K mTHP                                                                         Seqtl     Fixed       vs.
                                                                                                      ZSTD
----------------------------------------------------------------------------------------------------------
Throughput (KB/s)                153,550     129,609     156,215     166,975       21%        7%        9%
elapsed time (sec)                 23.90       25.19       22.46       21.38       11%        5%       11%
sys time (sec)                    757.70      731.13      715.62      648.83        2%        9%       14%
memcg_high                       148,075     192,744     197,548     181,734
memcg_swap_fail                    2,204       2,215       2,293       2,263
pswpin                                 0           0           0           0
pswpout                                0           0           0           0
zswpin                               760         902         774         833
zswpout                       10,010,017  13,193,554  13,193,176  12,125,616
thp_swpout                             0           0           0           0
thp_swpout_fallback                    0           0           0           0
64kB-mthp_swpout_fallback          2,204       2,215       2,293       2,263
pgmajfault                         3,054       3,259       3,545       3,516
ZSWPOUT-64kB                     623,451     822,268     822,176     755,480
SWPOUT-64kB                            0           0           0           0
swap_ra                              146         161         152         159
swap_ra_hit                           64         121          68          88
----------------------------------------------------------------------------------------------------------


2M THP ZSWAP:
=============

----------------------------------------------------------------------------------------------------------
ZSWAP                               zstd   IAA Fixed   IAA Fixed  IAA Canned       IAA       IAA       IAA
compressor                          (v7)        (v7)  + Batching  + Batching     Batch    Canned    Canned
                                                                                   vs.       vs.     Batch
2M THP                                                                           Seqtl     Fixed       vs.
                                                                                                      ZSTD
----------------------------------------------------------------------------------------------------------
Throughput (KB/s)                169,404     141,168     175,089     193,407       24%       10%       14%
elapsed time (sec)                 23.02       23.37       21.13       19.97       10%        5%       13%
sys time (sec)                    613.26      677.83      630.51      533.80        7%       15%       13%
memcg_high                        17,374      23,890      24,349      22,374
memcg_swap_fail                      114         144         102          88
pswpin                                 0           0           0           0
pswpout                                0           0           0           0
zswpin                               778         922       6,492       6,642
zswpout                       10,010,893  13,195,600  13,199,907  12,132,265
thp_swpout                             0           0           0           0
thp_swpout_fallback                  114         144         102          88
pgmajfault                         4,151       5,066       5,032       4,999
ZSWPOUT-2MB                       19,442      25,615      25,666      23,594
SWPOUT-2MB                             0           0           0           0
swap_ra                                3           9       4,383       4,494
swap_ra_hit                            2           6       4,298       4,412
----------------------------------------------------------------------------------------------------------

With ZSWAP IAA compress/decompress batching, we are able to demonstrate
significant performance improvements and memory savings in scalability
experiments under memory pressure, as compared to software compressors. We
hope to submit this work in subsequent patch series.

Thanks,
Kanchana

[1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
[2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/
[3] https://patchwork.kernel.org/project/linux-crypto/cover/cover.1710969449.git.andre.glover@linux.intel.com/


Kanchana P Sridhar (8):
  mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
  mm: zswap: Modify zswap_compress() to accept a page instead of a
    folio.
  mm: zswap: Refactor code to store an entry in zswap xarray.
  mm: zswap: Refactor code to delete stored offsets in case of errors.
  mm: zswap: Compress and store a specific page in a folio.
  mm: zswap: Support mTHP swapout in zswap_store().
  mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout
    stats.
  mm: Document the newly added mTHP zswpout stats, clarify swpout
    semantics.

 Documentation/admin-guide/mm/transhuge.rst |   8 +-
 include/linux/huge_mm.h                    |   1 +
 include/linux/memcontrol.h                 |   4 +
 mm/Kconfig                                 |   8 +
 mm/huge_memory.c                           |   3 +
 mm/page_io.c                               |   1 +
 mm/zswap.c                                 | 248 ++++++++++++++++-----
 7 files changed, 210 insertions(+), 63 deletions(-)


base-commit: acfabf7e197f7a5bedf4749dac1f39551417b049
--
2.27.0
Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes: [snip] > > Case 1: Comparing zswap 4K vs. zswap mTHP > ========================================= > > In this scenario, the "before" is CONFIG_THP_SWAP set to off, that results in > 64K/2M (m)THP to be split into 4K folios that get processed by zswap. > > The "after" is CONFIG_THP_SWAP set to on, and this patch-series, that results > in 64K/2M (m)THP to not be split, and processed by zswap. > > 64KB mTHP (cgroup memory.high set to 40G): > ========================================== > > ------------------------------------------------------------------------------- > mm-unstable 9-23-2024 zswap-mTHP Change wrt > CONFIG_THP_SWAP=N CONFIG_THP_SWAP=Y Baseline > Baseline > ------------------------------------------------------------------------------- > ZSWAP compressor zstd deflate- zstd deflate- zstd deflate- > iaa iaa iaa > ------------------------------------------------------------------------------- > Throughput (KB/s) 143,323 125,485 153,550 129,609 7% 3% > elapsed time (sec) 24.97 25.42 23.90 25.19 4% 1% > sys time (sec) 822.72 750.96 757.70 731.13 8% 3% > memcg_high 132,743 169,825 148,075 192,744 > memcg_swap_fail 639,067 841,553 2,204 2,215 > pswpin 0 0 0 0 > pswpout 0 0 0 0 > zswpin 795 873 760 902 > zswpout 10,011,266 13,195,137 10,010,017 13,193,554 > thp_swpout 0 0 0 0 > thp_swpout_ 0 0 0 0 > fallback > 64kB-mthp_ 639,065 841,553 2,204 2,215 > swpout_fallback > pgmajfault 2,861 2,924 3,054 3,259 > ZSWPOUT-64kB n/a n/a 623,451 822,268 > SWPOUT-64kB 0 0 0 0 > ------------------------------------------------------------------------------- > IIUC, the throughput is the sum of throughput of all usemem processes? One possible issue of usemem test case is the "imbalance" issue. That is, some usemem processes may swap-out/swap-in less, so the score is very high; while some other processes may swap-out/swap-in more, so the score is very low. 
Sometimes the total score decreases but the scores of the usemem
processes are more balanced, in which case the performance should be
considered better. In general, we should make the usemem scores balanced
among processes, say via a longer test time. Can you check this in your
test results?

[snip]

--
Best Regards,
Huang, Ying
> -----Original Message----- > From: Huang, Ying <ying.huang@intel.com> > Sent: Tuesday, September 24, 2024 11:35 PM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com; > chengming.zhou@linux.dev; usamaarif642@gmail.com; > shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com; > akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, > Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh > <vinodh.gopal@intel.com> > Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios > > Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes: > > [snip] > > > > > Case 1: Comparing zswap 4K vs. zswap mTHP > > ========================================= > > > > In this scenario, the "before" is CONFIG_THP_SWAP set to off, that results in > > 64K/2M (m)THP to be split into 4K folios that get processed by zswap. > > > > The "after" is CONFIG_THP_SWAP set to on, and this patch-series, that > results > > in 64K/2M (m)THP to not be split, and processed by zswap. 
> > > > 64KB mTHP (cgroup memory.high set to 40G): > > ========================================== > > > > ------------------------------------------------------------------------------- > > mm-unstable 9-23-2024 zswap-mTHP Change wrt > > CONFIG_THP_SWAP=N CONFIG_THP_SWAP=Y Baseline > > Baseline > > ------------------------------------------------------------------------------- > > ZSWAP compressor zstd deflate- zstd deflate- zstd deflate- > > iaa iaa iaa > > ------------------------------------------------------------------------------- > > Throughput (KB/s) 143,323 125,485 153,550 129,609 7% 3% > > elapsed time (sec) 24.97 25.42 23.90 25.19 4% 1% > > sys time (sec) 822.72 750.96 757.70 731.13 8% 3% > > memcg_high 132,743 169,825 148,075 192,744 > > memcg_swap_fail 639,067 841,553 2,204 2,215 > > pswpin 0 0 0 0 > > pswpout 0 0 0 0 > > zswpin 795 873 760 902 > > zswpout 10,011,266 13,195,137 10,010,017 13,193,554 > > thp_swpout 0 0 0 0 > > thp_swpout_ 0 0 0 0 > > fallback > > 64kB-mthp_ 639,065 841,553 2,204 2,215 > > swpout_fallback > > pgmajfault 2,861 2,924 3,054 3,259 > > ZSWPOUT-64kB n/a n/a 623,451 822,268 > > SWPOUT-64kB 0 0 0 0 > > ------------------------------------------------------------------------------- > > > > IIUC, the throughput is the sum of throughput of all usemem processes? > > One possible issue of usemem test case is the "imbalance" issue. That > is, some usemem processes may swap-out/swap-in less, so the score is > very high; while some other processes may swap-out/swap-in more, so the > score is very low. Sometimes, the total score decreases, but the scores > of usemem processes are more balanced, so that the performance should be > considered better. And, in general, we should make usemem score > balanced among processes via say longer test time. Can you check this > in your test results? Actually, the throughput data listed in the cover-letter is the average of all the usemem processes. 
Your observation about the "imbalance" issue is right. Some processes see
a higher throughput than others. I have noticed that the throughputs
progressively reduce as the individual processes exit and print their
stats.

Listed below are the stats from two runs of usemem70: sleep 10 and sleep 30.
Both are run with a cgroup mem-limit of 40G. Data is with v7, 64K folios are
enabled, zswap uses zstd.

--------------------------------------------------------
                              sleep 10          sleep 30
                     Throughput (KB/s) Throughput (KB/s)
--------------------------------------------------------
                               181,540           191,686
                               179,651           191,459
                               179,068           188,834
                               177,244           187,568
                               177,215           186,703
                               176,565           185,584
                               176,546           185,370
                               176,470           185,021
                               176,214           184,303
                               176,128           184,040
                               175,279           183,932
                               174,745           180,831
                               173,935           179,418
                               161,546           168,014
                               160,332           167,540
                               160,122           167,364
                               159,613           167,020
                               159,546           166,590
                               159,021           166,483
                               158,845           166,418
                               158,426           166,264
                               158,396           166,066
                               158,371           165,944
                               158,298           165,866
                               158,250           165,884
                               158,057           165,533
                               158,011           165,532
                               157,899           165,457
                               157,894           165,424
                               157,839           165,410
                               157,731           165,407
                               157,629           165,273
                               157,626           164,867
                               157,581           164,636
                               157,471           164,266
                               157,430           164,225
                               157,287           163,290
                               156,289           153,597
                               153,970           147,494
                               148,244           147,102
                               142,907           146,111
                               142,811           145,789
                               139,171           141,168
                               136,314           140,714
                               133,616           140,111
                               132,881           139,636
                               132,729           136,943
                               132,680           136,844
                               132,248           135,726
                               132,027           135,384
                               131,929           135,270
                               131,766           134,748
                               131,667           134,733
                               131,576           134,582
                               131,396           134,302
                               131,351           134,160
                               131,135           134,102
                               130,885           134,097
                               130,854           134,058
                               130,767           134,006
                               130,666           133,960
                               130,647           133,894
                               130,152           133,837
                               130,006           133,747
                               129,921           133,679
                               129,856           133,666
                               129,377           133,564
                               128,366           133,331
                               127,988           132,938
                               126,903           132,746
--------------------------------------------------------
sum                         10,526,916        10,919,561
average                        150,385           155,994
stddev                          17,551            19,633
--------------------------------------------------------
elapsed time (sec)               24.40             43.66
sys time (sec)                  806.25            766.05
zswpout                     10,008,713        10,008,407
64K folio swpout               623,463           623,629
--------------------------------------------------------

As we increase the time for which allocations are maintained,
there seems to be a slight improvement in throughput, but the
variance increases as well. The processes with lower throughput
could be the ones that handle the memcg being over limit by
doing reclaim, possibly before they can allocate.

Interestingly, the longer test time does seem to reduce the amount
of reclaim (hence lower sys time), but more 64K large folios seem to
be reclaimed. Could this mean that with longer test time (sleep 30),
more cold memory residing in large folios is getting reclaimed, as
against memory just relinquished by the exiting processes?

Thanks,
Kanchana

>
> [snip]
>
> --
> Best Regards,
> Huang, Ying
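To put a number on the "imbalance" discussed above, here is a small sketch that summarizes a list of per-process usemem throughputs. It is only an illustration: the helper name balance_stats() is made up, the population standard deviation is an assumption (the message does not say which stddev was used), and the coefficient of variation and min/max ratio are extra metrics not reported in the thread.

```python
from statistics import mean, pstdev


def balance_stats(throughputs_kbps):
    """Summarize per-process throughputs (KB/s) and how balanced they are.

    cv is stddev / average (lower means more balanced); min_over_max is
    the slowest process relative to the fastest (closer to 1 is better).
    """
    if not throughputs_kbps:
        raise ValueError("need at least one per-process throughput")
    avg = mean(throughputs_kbps)
    sd = pstdev(throughputs_kbps)  # assumption: population stddev
    return {
        "sum": sum(throughputs_kbps),
        "average": avg,
        "stddev": sd,
        "cv": sd / avg if avg else float("nan"),
        "min_over_max": min(throughputs_kbps) / max(throughputs_kbps),
    }


if __name__ == "__main__":
    # A few values copied from the "sleep 10" column, for illustration
    # only; pass all 70 values to summarize a full run.
    sample = [181540, 158298, 131135, 126903]
    for name, value in balance_stats(sample).items():
        print(f"{name:>12}: {value:,.3f}")
```

Comparing cv between two runs (for example sleep 10 vs. sleep 30) gives a scale-free view of whether the scores became more or less balanced, independent of the change in the average.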
"Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes: >> -----Original Message----- >> From: Huang, Ying <ying.huang@intel.com> >> Sent: Tuesday, September 24, 2024 11:35 PM >> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> >> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; >> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com; >> chengming.zhou@linux.dev; usamaarif642@gmail.com; >> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com; >> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, >> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh >> <vinodh.gopal@intel.com> >> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios >> >> Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes: >> >> [snip] >> >> > >> > Case 1: Comparing zswap 4K vs. zswap mTHP >> > ========================================= >> > >> > In this scenario, the "before" is CONFIG_THP_SWAP set to off, that results in >> > 64K/2M (m)THP to be split into 4K folios that get processed by zswap. >> > >> > The "after" is CONFIG_THP_SWAP set to on, and this patch-series, that >> results >> > in 64K/2M (m)THP to not be split, and processed by zswap. 
>> >
>> > 64KB mTHP (cgroup memory.high set to 40G):
>> > ==========================================
>> >
>> > -------------------------------------------------------------------------------
>> >                         mm-unstable 9-23-2024                zswap-mTHP     Change wrt
>> >                             CONFIG_THP_SWAP=N         CONFIG_THP_SWAP=Y       Baseline
>> >                                      Baseline
>> > -------------------------------------------------------------------------------
>> > ZSWAP compressor            zstd     deflate-         zstd     deflate-  zstd deflate-
>> >                                           iaa                       iaa            iaa
>> > -------------------------------------------------------------------------------
>> > Throughput (KB/s)        143,323      125,485      153,550      129,609    7%       3%
>> > elapsed time (sec)         24.97        25.42        23.90        25.19    4%       1%
>> > sys time (sec)            822.72       750.96       757.70       731.13    8%       3%
>> > memcg_high               132,743      169,825      148,075      192,744
>> > memcg_swap_fail          639,067      841,553        2,204        2,215
>> > pswpin                         0            0            0            0
>> > pswpout                        0            0            0            0
>> > zswpin                       795          873          760          902
>> > zswpout               10,011,266   13,195,137   10,010,017   13,193,554
>> > thp_swpout                     0            0            0            0
>> > thp_swpout_                    0            0            0            0
>> >  fallback
>> > 64kB-mthp_               639,065      841,553        2,204        2,215
>> >  swpout_fallback
>> > pgmajfault                 2,861        2,924        3,054        3,259
>> > ZSWPOUT-64kB                 n/a          n/a      623,451      822,268
>> > SWPOUT-64kB                    0            0            0            0
>> > -------------------------------------------------------------------------------
>> >
>>
>> IIUC, the throughput is the sum of throughput of all usemem processes?
>>
>> One possible issue of usemem test case is the "imbalance" issue.  That
>> is, some usemem processes may swap-out/swap-in less, so the score is
>> very high; while some other processes may swap-out/swap-in more, so the
>> score is very low.  Sometimes, the total score decreases, but the scores
>> of usemem processes are more balanced, so that the performance should be
>> considered better.  And, in general, we should make usemem score
>> balanced among processes via say longer test time.  Can you check this
>> in your test results?
>
> Actually, the throughput data listed in the cover-letter is the average of
> all the usemem processes. Your observation about the "imbalance" issue is
> right. Some processes see a higher throughput than others. I have noticed
> that the throughputs progressively reduce as the individual processes exit
> and print their stats.
>
> Listed below are the stats from two runs of usemem70: sleep 10 and sleep 30.
> Both are run with a cgroup mem-limit of 40G. Data is with v7, 64K folios are
> enabled, zswap uses zstd.
>
>
>  -----------------------------------------------
>                 sleep 10              sleep 30
>        Throughput (KB/s)     Throughput (KB/s)
>  -----------------------------------------------
>                  181,540               191,686
>                  179,651               191,459
>                  179,068               188,834
>                  177,244               187,568
>                  177,215               186,703
>                  176,565               185,584
>                  176,546               185,370
>                  176,470               185,021
>                  176,214               184,303
>                  176,128               184,040
>                  175,279               183,932
>                  174,745               180,831
>                  173,935               179,418
>                  161,546               168,014
>                  160,332               167,540
>                  160,122               167,364
>                  159,613               167,020
>                  159,546               166,590
>                  159,021               166,483
>                  158,845               166,418
>                  158,426               166,264
>                  158,396               166,066
>                  158,371               165,944
>                  158,298               165,866
>                  158,250               165,884
>                  158,057               165,533
>                  158,011               165,532
>                  157,899               165,457
>                  157,894               165,424
>                  157,839               165,410
>                  157,731               165,407
>                  157,629               165,273
>                  157,626               164,867
>                  157,581               164,636
>                  157,471               164,266
>                  157,430               164,225
>                  157,287               163,290
>                  156,289               153,597
>                  153,970               147,494
>                  148,244               147,102
>                  142,907               146,111
>                  142,811               145,789
>                  139,171               141,168
>                  136,314               140,714
>                  133,616               140,111
>                  132,881               139,636
>                  132,729               136,943
>                  132,680               136,844
>                  132,248               135,726
>                  132,027               135,384
>                  131,929               135,270
>                  131,766               134,748
>                  131,667               134,733
>                  131,576               134,582
>                  131,396               134,302
>                  131,351               134,160
>                  131,135               134,102
>                  130,885               134,097
>                  130,854               134,058
>                  130,767               134,006
>                  130,666               133,960
>                  130,647               133,894
>                  130,152               133,837
>                  130,006               133,747
>                  129,921               133,679
>                  129,856               133,666
>                  129,377               133,564
>                  128,366               133,331
>                  127,988               132,938
>                  126,903               132,746
>  -----------------------------------------------
>  sum          10,526,916            10,919,561
>  average         150,385               155,994
>  stddev           17,551                19,633
>  -----------------------------------------------
>  elapsed           24.40                 43.66
>  time (sec)
>  sys time         806.25                766.05
>  (sec)
>  zswpout      10,008,713            10,008,407
>  64K folio       623,463               623,629
>  swpout
>  -----------------------------------------------

Although there is some imbalance, I don't think it is too much.  So, I
think the test result is reasonable.  Please pay attention to the
imbalance issue in future tests.

> As we increase the time for which allocations are maintained,
> there seems to be a slight improvement in throughput, but the
> variance increases as well. The processes with lower throughput
> could be the ones that handle the memcg being over limit by
> doing reclaim, possibly before they can allocate.
>
> Interestingly, the longer test time does seem to reduce the amount
> of reclaim (hence lower sys time), but more 64K large folios seem to
> be reclaimed. Could this mean that with longer test time (sleep 30),
> more cold memory residing in large folios is getting reclaimed, as
> against memory just relinquished by the exiting processes?

I don't think a longer sleep time in the test helps much with balance.
Can you try with fewer processes and a larger memory size per process?
I guess that this will improve balance.

--
Best Regards,
Huang, Ying
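As a reading aid for the imbalance discussion above, here is a minimal sketch (not part of the patch series or of usemem) that expresses each run's stddev as a percentage of its average throughput. The helper name `coefficient_of_variation` and the `runs` dictionary are illustrative assumptions; only the average and stddev figures are taken from the usemem70 table.

```python
def coefficient_of_variation(stddev: float, mean: float) -> float:
    """Return stddev as a percentage of the mean (lower means more balanced)."""
    if mean <= 0:
        raise ValueError("mean must be positive")
    return 100.0 * stddev / mean


# Summary rows copied from the usemem70 table (throughput in KB/s).
# The dictionary layout is an assumption made for this sketch.
runs = {
    "sleep 10": {"average": 150_385, "stddev": 17_551},
    "sleep 30": {"average": 155_994, "stddev": 19_633},
}

imbalance = {
    name: coefficient_of_variation(row["stddev"], row["average"])
    for name, row in runs.items()
}

for name, cv in imbalance.items():
    print(f"{name}: stddev is {cv:.1f}% of the average throughput")
```

Normalizing by the average makes the two runs comparable even though their absolute throughputs differ; it is one possible way to quantify "imbalance", not a metric the thread itself defines.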
Hi Ying,

> -----Original Message-----
> From: Huang, Ying <ying.huang@intel.com>
> Sent: Wednesday, September 25, 2024 5:45 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali,
> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
>
> [snip]
>
> Although there is some imbalance, I don't think it is too much.  So, I
> think the test result is reasonable.  Please pay attention to the
> imbalance issue in future tests.

Sure, will do so.

>
> > As we increase the time for which allocations are maintained,
> > there seems to be a slight improvement in throughput, but the
> > variance increases as well. The processes with lower throughput
> > could be the ones that handle the memcg being over limit by
> > doing reclaim, possibly before they can allocate.
> >
> > Interestingly, the longer test time does seem to reduce the amount
> > of reclaim (hence lower sys time), but more 64K large folios seem to
> > be reclaimed. Could this mean that with longer test time (sleep 30),
> > more cold memory residing in large folios is getting reclaimed, as
> > against memory just relinquished by the exiting processes?
>
> I don't think a longer sleep time in the test helps much with balance.
> Can you try with fewer processes and a larger memory size per process?
> I guess that this will improve balance.

I tried this, and the data is listed below:

 usemem options:
 ---------------
 30 processes allocate 10G each
 cgroup memory limit = 150G
 sleep 10
 525Gi SSD disk swap partition
 64K large folios enabled

 Throughput (KB/s) of each of the 30 processes:
 ---------------------------------------------------------------
                      mm-unstable    zswap_store of large folios
                        9-25-2024                 v7
 zswap compressor:           zstd          zstd     deflate-iaa
 ---------------------------------------------------------------
                           38,393       234,485         374,427
                           37,283       215,528         314,225
                           37,156       214,942         304,413
                           37,143       213,073         304,146
                           36,814       212,904         290,186
                           36,277       212,304         288,212
                           36,104       212,207         285,682
                           36,000       210,173         270,661
                           35,994       208,487         256,960
                           35,979       207,788         248,313
                           35,967       207,714         235,338
                           35,966       207,703         229,335
                           35,835       207,690         221,697
                           35,793       207,418         221,600
                           35,692       206,160         219,346
                           35,682       206,128         219,162
                           35,681       205,817         219,155
                           35,678       205,546         214,862
                           35,678       205,523         214,710
                           35,677       204,951         214,282
                           35,677       204,283         213,441
                           35,677       203,348         213,011
                           35,675       203,028         212,923
                           35,673       201,922         212,492
                           35,672       201,660         212,225
                           35,672       200,724         211,808
                           35,672       200,324         211,420
                           35,671       199,686         211,413
                           35,667       198,858         211,346
                           35,667       197,590         211,209
 ---------------------------------------------------------------
 sum                    1,081,515     6,217,964       7,268,000
 average                   36,051       207,265         242,267
 stddev                       655         7,010          42,234
 elapsed time (sec)        343.70        107.40           84.34
 sys time (sec)            269.30      2,520.13        1,696.20
 memcg.high breaches      443,672       475,074         623,333
 zswpout                   22,605    48,931,249      54,777,100
 pswpout               40,004,528             0               0
 hugepages-64K zswpout          0     3,057,090       3,421,855
 hugepages-64K swpout   2,500,283             0               0
 ---------------------------------------------------------------

As you can see, this is quite a memory-constrained scenario: the memory
limit for the cgroup in which the 30 processes run is 50% of the total
memory they require (150G of 300G). This causes significantly more
reclaim activity than the setup I was using thus far (70 processes,
1G each, 40G limit).

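The summary rows of the usemem30-10G table above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names `pct_gain`, `pct_reduction` and `cv_percent`, and the dictionary layout, are assumptions made here; the numbers are copied from the zstd and deflate-iaa columns of the "zswap_store of large folios v7" runs.

```python
def pct_gain(baseline: float, candidate: float) -> float:
    """Percent increase of candidate over baseline (for higher-is-better metrics)."""
    return 100.0 * (candidate - baseline) / baseline


def pct_reduction(baseline: float, candidate: float) -> float:
    """Percent decrease of candidate versus baseline (for lower-is-better metrics)."""
    return 100.0 * (baseline - candidate) / baseline


def cv_percent(stddev: float, mean: float) -> float:
    """Stddev as a percentage of the mean, used here as a rough imbalance measure."""
    return 100.0 * stddev / mean


# v7 columns of the table: average throughput (KB/s), stddev, elapsed and sys time (sec).
zstd = {"average": 207_265, "stddev": 7_010, "elapsed": 107.40, "sys": 2520.13}
iaa = {"average": 242_267, "stddev": 42_234, "elapsed": 84.34, "sys": 1696.20}

throughput_gain = pct_gain(zstd["average"], iaa["average"])
elapsed_reduction = pct_reduction(zstd["elapsed"], iaa["elapsed"])
sys_reduction = pct_reduction(zstd["sys"], iaa["sys"])

print(f"deflate-iaa vs zstd throughput: +{throughput_gain:.1f}%")
print(f"deflate-iaa vs zstd elapsed time: -{elapsed_reduction:.1f}%")
print(f"deflate-iaa vs zstd sys time: -{sys_reduction:.1f}%")
print(f"zstd imbalance (stddev/average): {cv_percent(zstd['stddev'], zstd['average']):.1f}%")
print(f"deflate-iaa imbalance (stddev/average): {cv_percent(iaa['stddev'], iaa['average']):.1f}%")

# Memory pressure: 30 processes x 10G against a 150G cgroup limit.
limit_fraction = 150 / (30 * 10)
print(f"cgroup limit covers {limit_fraction:.0%} of the requested memory")
```

Treating throughput as higher-is-better and the two time metrics as lower-is-better keeps the sign of each improvement unambiguous.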
The variance or "imbalance" reduces somewhat for zstd, but not for IAA.

With zswap_store of large folios, IAA shows really good improvement
over zstd in throughput (17%), elapsed time (21%) and sys time (33%).
These are the memory-constrained scenarios in which IAA typically
does really well. IAA verify_compress is enabled, so data integrity
checking is an added benefit we get with IAA.

I would like to get your and the maintainers' feedback on whether
I should switch to this "usemem30-10G" setup for v8?

Thanks,
Kanchana

"Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes:

> Hi Ying,
>
> [snip]
>
> I would like to get your and the maintainers' feedback on whether
> I should switch to this "usemem30-10G" setup for v8?

The results look good to me.  I suggest you use it.

--
Best Regards,
Huang, Ying
> -----Original Message----- > From: Huang, Ying <ying.huang@intel.com> > Sent: Wednesday, September 25, 2024 11:48 PM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com; > chengming.zhou@linux.dev; usamaarif642@gmail.com; > shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com; > akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, > Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh > <vinodh.gopal@intel.com> > Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios > > "Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes: > > > Hi Ying, > > > >> -----Original Message----- > >> From: Huang, Ying <ying.huang@intel.com> > >> Sent: Wednesday, September 25, 2024 5:45 PM > >> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > >> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > >> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com; > >> chengming.zhou@linux.dev; usamaarif642@gmail.com; > >> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com; > >> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; > Feghali, > >> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh > >> <vinodh.gopal@intel.com> > >> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios > >> > >> "Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes: > >> > >> >> -----Original Message----- > >> >> From: Huang, Ying <ying.huang@intel.com> > >> >> Sent: Tuesday, September 24, 2024 11:35 PM > >> >> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > >> >> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > >> >> hannes@cmpxchg.org; yosryahmed@google.com; > nphamcs@gmail.com; > >> >> chengming.zhou@linux.dev; usamaarif642@gmail.com; > >> >> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com; > >> >> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; > 
>> Feghali, > >> >> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh > >> >> <vinodh.gopal@intel.com> > >> >> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios > >> >> > >> >> Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes: > >> >> > >> >> [snip] > >> >> > >> >> > > >> >> > Case 1: Comparing zswap 4K vs. zswap mTHP > >> >> > ========================================= > >> >> > > >> >> > In this scenario, the "before" is CONFIG_THP_SWAP set to off, that > >> results in > >> >> > 64K/2M (m)THP to be split into 4K folios that get processed by zswap. > >> >> > > >> >> > The "after" is CONFIG_THP_SWAP set to on, and this patch-series, > that > >> >> results > >> >> > in 64K/2M (m)THP to not be split, and processed by zswap. > >> >> > > >> >> > 64KB mTHP (cgroup memory.high set to 40G): > >> >> > ========================================== > >> >> > > >> >> > ------------------------------------------------------------------------------- > >> >> > mm-unstable 9-23-2024 zswap-mTHP Change > wrt > >> >> > CONFIG_THP_SWAP=N CONFIG_THP_SWAP=Y > >> Baseline > >> >> > Baseline > >> >> > ------------------------------------------------------------------------------- > >> >> > ZSWAP compressor zstd deflate- zstd deflate- zstd > deflate- > >> >> > iaa iaa iaa > >> >> > ------------------------------------------------------------------------------- > >> >> > Throughput (KB/s) 143,323 125,485 153,550 129,609 7% > >> 3% > >> >> > elapsed time (sec) 24.97 25.42 23.90 25.19 4% 1% > >> >> > sys time (sec) 822.72 750.96 757.70 731.13 8% 3% > >> >> > memcg_high 132,743 169,825 148,075 192,744 > >> >> > memcg_swap_fail 639,067 841,553 2,204 2,215 > >> >> > pswpin 0 0 0 0 > >> >> > pswpout 0 0 0 0 > >> >> > zswpin 795 873 760 902 > >> >> > zswpout 10,011,266 13,195,137 10,010,017 13,193,554 > >> >> > thp_swpout 0 0 0 0 > >> >> > thp_swpout_ 0 0 0 0 > >> >> > fallback > >> >> > 64kB-mthp_ 639,065 841,553 2,204 2,215 > >> >> > swpout_fallback > >> >> > pgmajfault 2,861 
2,924 3,054 3,259 > >> >> > ZSWPOUT-64kB n/a n/a 623,451 822,268 > >> >> > SWPOUT-64kB 0 0 0 0 > >> >> > ------------------------------------------------------------------------------- > >> >> > > >> >> > >> >> IIUC, the throughput is the sum of throughput of all usemem processes? > >> >> > >> >> One possible issue of usemem test case is the "imbalance" issue. That > >> >> is, some usemem processes may swap-out/swap-in less, so the score is > >> >> very high; while some other processes may swap-out/swap-in more, so > the > >> >> score is very low. Sometimes, the total score decreases, but the scores > >> >> of usemem processes are more balanced, so that the performance > should > >> be > >> >> considered better. And, in general, we should make usemem score > >> >> balanced among processes via say longer test time. Can you check this > >> >> in your test results? > >> > > >> > Actually, the throughput data listed in the cover-letter is the average of > >> > all the usemem processes. Your observation about the "imbalance" issue > is > >> > right. Some processes see a higher throughput than others. I have > noticed > >> > that the throughputs progressively reduce as the individual processes > exit > >> > and print their stats. > >> > > >> > Listed below are the stats from two runs of usemem70: sleep 10 and > sleep > >> 30. > >> > Both are run with a cgroup mem-limit of 40G. Data is with v7, 64K folios > are > >> > enabled, zswap uses zstd. 
> >> > > >> > > >> > ----------------------------------------------- > >> > sleep 10 sleep 30 > >> > Throughput (KB/s) Throughput (KB/s) > >> > ----------------------------------------------- > >> > 181,540 191,686 > >> > 179,651 191,459 > >> > 179,068 188,834 > >> > 177,244 187,568 > >> > 177,215 186,703 > >> > 176,565 185,584 > >> > 176,546 185,370 > >> > 176,470 185,021 > >> > 176,214 184,303 > >> > 176,128 184,040 > >> > 175,279 183,932 > >> > 174,745 180,831 > >> > 173,935 179,418 > >> > 161,546 168,014 > >> > 160,332 167,540 > >> > 160,122 167,364 > >> > 159,613 167,020 > >> > 159,546 166,590 > >> > 159,021 166,483 > >> > 158,845 166,418 > >> > 158,426 166,264 > >> > 158,396 166,066 > >> > 158,371 165,944 > >> > 158,298 165,866 > >> > 158,250 165,884 > >> > 158,057 165,533 > >> > 158,011 165,532 > >> > 157,899 165,457 > >> > 157,894 165,424 > >> > 157,839 165,410 > >> > 157,731 165,407 > >> > 157,629 165,273 > >> > 157,626 164,867 > >> > 157,581 164,636 > >> > 157,471 164,266 > >> > 157,430 164,225 > >> > 157,287 163,290 > >> > 156,289 153,597 > >> > 153,970 147,494 > >> > 148,244 147,102 > >> > 142,907 146,111 > >> > 142,811 145,789 > >> > 139,171 141,168 > >> > 136,314 140,714 > >> > 133,616 140,111 > >> > 132,881 139,636 > >> > 132,729 136,943 > >> > 132,680 136,844 > >> > 132,248 135,726 > >> > 132,027 135,384 > >> > 131,929 135,270 > >> > 131,766 134,748 > >> > 131,667 134,733 > >> > 131,576 134,582 > >> > 131,396 134,302 > >> > 131,351 134,160 > >> > 131,135 134,102 > >> > 130,885 134,097 > >> > 130,854 134,058 > >> > 130,767 134,006 > >> > 130,666 133,960 > >> > 130,647 133,894 > >> > 130,152 133,837 > >> > 130,006 133,747 > >> > 129,921 133,679 > >> > 129,856 133,666 > >> > 129,377 133,564 > >> > 128,366 133,331 > >> > 127,988 132,938 > >> > 126,903 132,746 > >> > ----------------------------------------------- > >> > sum 10,526,916 10,919,561 > >> > average 150,385 155,994 > >> > stddev 17,551 19,633 > >> > 
----------------------------------------------- > >> > elapsed 24.40 43.66 > >> > time (sec) > >> > sys time 806.25 766.05 > >> > (sec) > >> > zswpout 10,008,713 10,008,407 > >> > 64K folio 623,463 623,629 > >> > swpout > >> > ----------------------------------------------- > >> > >> Although there are some imbalance, I don't find it's too much. So, I > >> think the test result is reasonable. Please pay attention to the > >> imbalance issue in the future tests. > > > > Sure, will do so. > > > >> > >> > As we increase the time for which allocations are maintained, > >> > there seems to be a slight improvement in throughput, but the > >> > variance increases as well. The processes with lower throughput > >> > could be the ones that handle the memcg being over limit by > >> > doing reclaim, possibly before they can allocate. > >> > > >> > Interestingly, the longer test time does seem to reduce the amount > >> > of reclaim (hence lower sys time), but more 64K large folios seem to > >> > be reclaimed. Could this mean that with longer test time (sleep 30), > >> > more cold memory residing in large folios is getting reclaimed, as > >> > against memory just relinquished by the exiting processes? > >> > >> I don't think longer sleep time in test helps much to balance. Can you > >> try with less process, and larger memory size per process? I guess that > >> this will improve balance. 
> >
> > I tried this, and the data is listed below:
> >
> > usemem options:
> > ---------------
> > 30 processes allocate 10G each
> > cgroup memory limit = 150G
> > sleep 10
> > 525Gi SSD disk swap partition
> > 64K large folios enabled
> >
> > Throughput (KB/s) of each of the 30 processes:
> > ---------------------------------------------------------------
> >                        mm-unstable   zswap_store of large folios
> >                          9-25-2024                v7
> > zswap compressor:             zstd          zstd   deflate-iaa
> > ---------------------------------------------------------------
> >                             38,393       234,485       374,427
> >                             37,283       215,528       314,225
> >                             37,156       214,942       304,413
> >                             37,143       213,073       304,146
> >                             36,814       212,904       290,186
> >                             36,277       212,304       288,212
> >                             36,104       212,207       285,682
> >                             36,000       210,173       270,661
> >                             35,994       208,487       256,960
> >                             35,979       207,788       248,313
> >                             35,967       207,714       235,338
> >                             35,966       207,703       229,335
> >                             35,835       207,690       221,697
> >                             35,793       207,418       221,600
> >                             35,692       206,160       219,346
> >                             35,682       206,128       219,162
> >                             35,681       205,817       219,155
> >                             35,678       205,546       214,862
> >                             35,678       205,523       214,710
> >                             35,677       204,951       214,282
> >                             35,677       204,283       213,441
> >                             35,677       203,348       213,011
> >                             35,675       203,028       212,923
> >                             35,673       201,922       212,492
> >                             35,672       201,660       212,225
> >                             35,672       200,724       211,808
> >                             35,672       200,324       211,420
> >                             35,671       199,686       211,413
> >                             35,667       198,858       211,346
> >                             35,667       197,590       211,209
> > ---------------------------------------------------------------
> > sum                      1,081,515     6,217,964     7,268,000
> > average                     36,051       207,265       242,267
> > stddev                         655         7,010        42,234
> > elapsed time (sec)          343.70        107.40         84.34
> > sys time (sec)              269.30      2,520.13      1,696.20
> > memcg.high breaches        443,672       475,074       623,333
> > zswpout                     22,605    48,931,249    54,777,100
> > pswpout                 40,004,528             0             0
> > hugepages-64K zswpout            0     3,057,090     3,421,855
> > hugepages-64K swpout     2,500,283             0             0
> > ---------------------------------------------------------------
> >
> > As you can see, this is quite a memory-constrained scenario, where we
> > are giving 50% of the total memory required as the memory limit for the
> > cgroup in which the 30 processes are run. This causes significantly more
> > reclaim activity than the setup I was using thus far (70 processes, 1G,
> > 40G limit).
> >
> > The variance or "imbalance" reduces somewhat for zstd, but not for IAA.
> >
> > IAA shows really good throughput (17%), elapsed time (21%) and
> > sys time (33%) improvement wrt zstd with zswap_store of large folios.
> > These are the memory-constrained scenarios in which IAA typically
> > does really well. IAA verify_compress is enabled, so this is an added
> > data integrity check benefit we get with IAA.
> >
> > I would like to get your and the maintainers' feedback on whether
> > I should switch to this "usemem30-10G" setup for v8?
>
> The results look good to me. I suggest you use it.

Ok, sure, thanks Ying.

Thanks,
Kanchana

>
> --
> Best Regards,
> Huang, Ying
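
For readers who want to re-derive the percentages quoted in the message above, here is a minimal sketch. It assumes the gains were computed from the "average", "elapsed time" and "sys time" rows of the usemem30-10G table, comparing the zstd and deflate-iaa columns with zswap_store of large folios. The helper names are illustrative and do not come from the thread.

```python
# Sketch only: re-derives the IAA-vs-zstd figures from the table above.
# Function and variable names are hypothetical, not part of the patch-series.

def gain_higher_is_better(baseline, new):
    """Percent improvement for a metric where larger is better (throughput)."""
    return (new - baseline) / baseline * 100.0


def gain_lower_is_better(baseline, new):
    """Percent improvement for a metric where smaller is better (time)."""
    return (baseline - new) / baseline * 100.0


# Values copied from the "zswap_store of large folios v7" columns.
zstd = {"throughput_kbs": 207_265, "elapsed_s": 107.40, "sys_s": 2_520.13}
iaa = {"throughput_kbs": 242_267, "elapsed_s": 84.34, "sys_s": 1_696.20}

throughput_gain = gain_higher_is_better(zstd["throughput_kbs"], iaa["throughput_kbs"])
elapsed_gain = gain_lower_is_better(zstd["elapsed_s"], iaa["elapsed_s"])
sys_gain = gain_lower_is_better(zstd["sys_s"], iaa["sys_s"])

# 30 processes x 10G each against a 150G cgroup limit.
memory_fraction = 150 / (30 * 10)

print(f"throughput improvement: {throughput_gain:.0f}%")
print(f"elapsed time improvement: {elapsed_gain:.0f}%")
print(f"sys time improvement: {sys_gain:.0f}%")
print(f"limit as fraction of total allocation: {memory_fraction:.0%}")
```

Time metrics are treated as "lower is better", which is why the baseline appears first in the numerator for those two rows.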
On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> Hi All,
>
> This patch-series enables zswap_store() to accept and store mTHP
> folios. The most significant contribution in this series is from the
> earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> migrated to mm-unstable as of 9-23-2024 in patches 5,6 of this series.
>
> [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
>      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> Additionally, there is an attempt to modularize some of the functionality
> in zswap_store(), to make it more amenable to supporting any-order
> mTHPs. For instance, the function zswap_store_entry() stores a zswap_entry
> in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
> delete all offsets corresponding to a higher order folio stored in zswap.

These are implementation details that are not very useful here, you
can just mention that the first few patches do refactoring prep work.

>
> For accounting purposes, the patch-series adds per-order mTHP sysfs
> "zswpout" counters that get incremented upon successful zswap_store of
> an mTHP folio:
>
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
>
> A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
> will enable/disable zswap storing of (m)THP. When disabled, zswap will
> fallback to rejecting the mTHP folio, to be processed by the backing
> swap device.

Why is this needed? Do we just not have enough confidence in the
feature yet, or are there some cases that regress from enabling mTHP
for zswapout?

Does generic mTHP swapout/swapin also use config options?

>
> This patch-series is a pre-requisite for ZSWAP compress batching of mTHP
> swap-out and decompress batching of swap-ins based on swapin_readahead(),
> using Intel IAA hardware acceleration, which we would like to submit in
> subsequent patch-series, with performance improvement data.
>
> Thanks to Ying Huang for pre-posting review feedback and suggestions!
>
> Thanks also to Nhat, Yosry, Barry, Chengming, Usama and Ying for their
> helpful feedback, data reviews and suggestions!
>
> Co-development signoff request:
> ===============================
> I would like to request Ryan Roberts' co-developer signoff on patches
> 5 and 6 in this series. Thanks Ryan!
>
> Changes since v6:
> =================

Please put the changelog at the very end, I almost missed the
performance evaluation.

> 1) Rebased to mm-unstable as of 9-23-2024,
>    commit acfabf7e197f7a5bedf4749dac1f39551417b049.
> 2) Refactored into smaller commits, as suggested by Yosry and
>    Chengming. Thanks both!
> 3) Reworded the commit log for patches 5 and 6 as per Yosry's
>    suggestion. Thanks Yosry!
> 4) Gathered data on a Sapphire Rapids server that has 823GiB SSD swap disk
>    partition. Also, all experiments are run with usemem --sleep 10, so that
>    the memory allocated by the 70 processes remains in memory
>    longer. Posted elapsed and sys times. Thanks to Yosry, Nhat and Ying for
>    their help with refining the performance characterization methodology.
> 5) Updated Documentation/admin-guide/mm/transhuge.rst as suggested by
>    Nhat. Thanks Nhat!
>
> Changes since v5:
> =================
> 1) Rebased to mm-unstable as of 8/29/2024,
>    commit 9287e4adbc6ab8fa04d25eb82e097fed877a4642.
> 2) Added CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default) to
>    enable/disable zswap_store() of mTHP folios. Thanks Nhat for the
>    suggestion to add a knob by which users can enable/disable this
>    change. Nhat, I hope this is along the lines of what you were
>    thinking.
> 3) Added vm-scalability usemem data with 4K folios with
>    CONFIG_ZSWAP_STORE_THP_DEFAULT_ON off, that I gathered to make sure
>    there is no regression with this change.
> 4) Added data with usemem with 64K and 2M THP for an alternate view of
>    before/after, as suggested by Yosry, so we can understand the impact
>    of when mTHPs are split into 4K folios in shrink_folio_list()
>    (CONFIG_THP_SWAP off) vs. not split (CONFIG_THP_SWAP on) and stored
>    in zswap. Thanks Yosry for this suggestion.
>
> Changes since v4:
> =================
> 1) Published before/after data with zstd, as suggested by Nhat (Thanks
>    Nhat for the data reviews!).
> 2) Rebased to mm-unstable from 8/27/2024,
>    commit b659edec079c90012cf8d05624e312d1062b8b87.
> 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
>    CONFIG_MEMCG is not defined, to resolve build errors reported by kernel
>    robot; as per Nhat's and Michal's suggestion to not require a separate
>    patch to fix the build errors (thanks both!).
> 4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
>    suggested by Yosry (Thanks Yosry!).
> 5) Squashed the commits that define new mthp zswpout stat counters, and
>    invoke count_mthp_stat() after successful zswap_store()s; into a single
>    commit. Thanks Yosry for this suggestion!
>
> Changes since v3:
> =================
> 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
>    Thanks to Barry for suggesting aligning with Ryan Roberts' latest
>    changes to count_mthp_stat() so that it's always defined, even when THP
>    is disabled. Barry, I have also made one other change in page_io.c
>    where count_mthp_stat() is called by count_swpout_vm_event(). I would
>    appreciate it if you can review this. Thanks!
>    Hopefully this should resolve the kernel robot build errors.
>
> Changes since v2:
> =================
> 1) Gathered usemem data using SSD as the backing swap device for zswap,
>    as suggested by Ying Huang. Ying, I would appreciate it if you can
>    review the latest data. Thanks!
> 2) Generated the base commit info in the patches to attempt to address
>    the kernel test robot build errors.
> 3) No code changes to the individual patches themselves.
>
> Changes since RFC v1:
> =====================
>
> 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
>    Thanks Barry!
> 2) Addressed some of the code review comments that Nhat Pham provided in
>    Ryan's initial RFC [1]:
>    - Added a comment about the cgroup zswap limit checks occurring once per
>      folio at the beginning of zswap_store().
>      Nhat, Ryan, please do let me know if the comments convey the summary
>      from the RFC discussion. Thanks!
>    - Posted data on running the cgroup suite's zswap kselftest.
> 3) Rebased to v6.11-rc3.
> 4) Gathered performance data with usemem and the rebased patch-series.
>
>
> Regression Testing:
> ===================
> I ran vm-scalability usemem 70 processes without mTHP, i.e., only 4K
> folios with mm-unstable and with this patch-series. The main goal was
> to make sure that there is no functional or performance regression
> wrt the earlier zswap behavior for 4K folios when
> CONFIG_ZSWAP_STORE_THP_DEFAULT_ON is not set, and zswap_store() of 4K
> pages goes through the newly added code path [zswap_store(),
> zswap_store_page()].
>
> The data indicates there is no regression.
>
> ------------------------------------------------------------------------------
>                    mm-unstable 8-28-2024  zswap-mTHP v6
>                                           CONFIG_ZSWAP_STORE_THP_DEFAULT_ON
>                                           is not set
> ------------------------------------------------------------------------------
> ZSWAP compressor          zstd   deflate-       zstd   deflate-
>                                       iaa                   iaa
> ------------------------------------------------------------------------------
> Throughput (KB/s)      110,775    113,010    111,550    121,937
> sys time (sec)        1,141.72     954.87   1,131.95     828.47
> memcg_high             140,500    153,737    139,772    134,129
> memcg_swap_high              0          0          0          0
> memcg_swap_fail              0          0          0          0
> pswpin                       0          0          0          0
> pswpout                      0          0          0          0
> zswpin                     675        690        682        684
> zswpout              9,552,298 10,603,271  9,566,392  9,267,213
> thp_swpout                   0          0          0          0
> thp_swpout_                  0          0          0          0
>  fallback
> pgmajfault               3,453      3,468      3,841      3,487
> ZSWPOUT-64kB-mTHP          n/a        n/a          0          0
> SWPOUT-64kB-mTHP             0          0          0          0
> ------------------------------------------------------------------------------

It's probably better to put the zstd columns next to each other, and
the deflate-iaa columns next to each other, for easier visual
comparisons.

>
>
> Performance Testing:
> ====================
> Testing of this patch-series was done with mm-unstable as of 9-23-2024,
> commit acfabf7e197f7a5bedf4749dac1f39551417b049. Data was gathered
> without/with this patch-series, on an Intel Sapphire Rapids server,
> dual-socket 56 cores per socket, 4 IAA devices per socket, 503 GiB RAM and
> 823G SSD disk partition swap. Core frequency was fixed at 2500MHz.
>
> The vm-scalability "usemem" test was run in a cgroup whose memory.high
> was fixed at 40G. There is no swap limit set for the cgroup. Following a
> similar methodology as in Ryan Roberts' "Swap-out mTHP without splitting"
> series [2], 70 usemem processes were run, each allocating and writing 1G of
> memory, and sleeping for 10 sec before exiting:
>
> usemem --init-time -w -O -s 10 -n 70 1g
>
> The vm/sysfs mTHP stats included with the performance data provide details
> on the swapout activity to ZSWAP/swap.
>
> Other kernel configuration parameters:
>
> ZSWAP Compressors : zstd, deflate-iaa
> ZSWAP Allocator   : zsmalloc
> SWAP page-cluster : 2
>
> In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> IAA "compression verification" is enabled. Hence each IAA compression
> will be decompressed internally by the "iaa_crypto" driver, the crc-s
> returned by the hardware will be compared and errors reported in case of
> mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> compared to the software compressors.
>
> Throughput is derived by averaging the individual 70 processes' throughputs
> reported by usemem. elapsed/sys times are measured with perf. All data
> points per compressor/kernel/mTHP configuration are averaged across 3 runs.
>
> Case 1: Comparing zswap 4K vs. zswap mTHP
> =========================================
>
> In this scenario, the "before" is CONFIG_THP_SWAP set to off, that results in
> 64K/2M (m)THP to be split into 4K folios that get processed by zswap.
>
> The "after" is CONFIG_THP_SWAP set to on, and this patch-series, that results
> in 64K/2M (m)THP to not be split, and processed by zswap.
>
> 64KB mTHP (cgroup memory.high set to 40G):
> ==========================================
>
> -------------------------------------------------------------------------------
>                      mm-unstable 9-23-2024  zswap-mTHP            Change wrt
>                      CONFIG_THP_SWAP=N      CONFIG_THP_SWAP=Y     Baseline
>                      Baseline
> -------------------------------------------------------------------------------
> ZSWAP compressor          zstd   deflate-       zstd   deflate-   zstd deflate-
>                                       iaa                   iaa             iaa
> -------------------------------------------------------------------------------
> Throughput (KB/s)      143,323    125,485    153,550    129,609     7%       3%
> elapsed time (sec)       24.97      25.42      23.90      25.19     4%       1%
> sys time (sec)          822.72     750.96     757.70     731.13     8%       3%
> memcg_high             132,743    169,825    148,075    192,744
> memcg_swap_fail        639,067    841,553      2,204      2,215
> pswpin                       0          0          0          0
> pswpout                      0          0          0          0
> zswpin                     795        873        760        902
> zswpout             10,011,266 13,195,137 10,010,017 13,193,554
> thp_swpout                   0          0          0          0
> thp_swpout_                  0          0          0          0
>  fallback
> 64kB-mthp_             639,065    841,553      2,204      2,215
>  swpout_fallback
> pgmajfault               2,861      2,924      3,054      3,259
> ZSWPOUT-64kB               n/a        n/a    623,451    822,268
> SWPOUT-64kB                  0          0          0          0
> -------------------------------------------------------------------------------
>
>
> 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
> =======================================================
>
> -------------------------------------------------------------------------------
>                      mm-unstable 9-23-2024  zswap-mTHP            Change wrt
>                      CONFIG_THP_SWAP=N      CONFIG_THP_SWAP=Y     Baseline
>                      Baseline
> -------------------------------------------------------------------------------
> ZSWAP compressor          zstd   deflate-       zstd   deflate-   zstd deflate-
>                                       iaa                   iaa             iaa
> -------------------------------------------------------------------------------
> Throughput (KB/s)      145,616    139,640    169,404    141,168    16%       1%
> elapsed time (sec)       25.05      23.85      23.02      23.37     8%       2%
> sys time (sec)          790.53     676.34     613.26     677.83    22%    -0.2%
> memcg_high              16,702     25,197     17,374     23,890
> memcg_swap_fail         21,485     27,814        114        144
> pswpin                       0          0          0          0
> pswpout                      0          0          0          0
> zswpin                     793        852        778        922
> zswpout             10,011,709 13,186,882 10,010,893 13,195,600
> thp_swpout                   0          0          0          0
> thp_swpout_             21,485     27,814        114        144
>  fallback
> 2048kB-mthp_               n/a        n/a          0          0
>  swpout_fallback
> pgmajfault               2,701      2,822      4,151      5,066
> ZSWPOUT-2048kB             n/a        n/a     19,442     25,615
> SWPOUT-2048kB                0          0          0          0
> -------------------------------------------------------------------------------
>
> We mostly see improvements in throughput, elapsed and sys time for zstd and
> deflate-iaa, when comparing before (THP_SWAP=N) vs. after (THP_SWAP=Y).
>
>
> Case 2: Comparing SSD swap mTHP vs. zswap mTHP
> ==============================================
>
> In this scenario, CONFIG_THP_SWAP is enabled in "before" and "after"
> experiments. The "before" represents zswap rejecting mTHP, and the mTHP
> being stored by the 823G SSD swap. The "after" represents data with this
> patch-series, that results in 64K/2M (m)THP being processed and stored by
> zswap.
>
> 64KB mTHP (cgroup memory.high set to 40G):
> ==========================================
>
> -------------------------------------------------------------------------------
>                      mm-unstable 9-23-2024  zswap-mTHP            Change wrt
>                      CONFIG_THP_SWAP=Y      CONFIG_THP_SWAP=Y     Baseline
>                      Baseline
> -------------------------------------------------------------------------------
> ZSWAP compressor          zstd   deflate-       zstd   deflate-   zstd deflate-
>                                       iaa                   iaa             iaa
> -------------------------------------------------------------------------------
> Throughput (KB/s)       20,265     20,696    153,550    129,609   658%     526%
> elapsed time (sec)       72.44      70.86      23.90      25.19    67%      64%
> sys time (sec)           77.95      77.99     757.70     731.13  -872%    -837%
> memcg_high             115,811    113,277    148,075    192,744
> memcg_swap_fail          2,386      2,425      2,204      2,215
> pswpin                      16         16          0          0
> pswpout              7,774,235  7,616,069          0          0
> zswpin                     728        749        760        902
> zswpout                 38,424     39,022 10,010,017 13,193,554
> thp_swpout                   0          0          0          0
> thp_swpout_                  0          0          0          0
>  fallback
> 64kB-mthp_               2,386      2,425      2,204      2,215
>  swpout_fallback
> pgmajfault               2,757      2,860      3,054      3,259
> ZSWPOUT-64kB               n/a        n/a    623,451    822,268
> SWPOUT-64kB            485,890    476,004          0          0
> -------------------------------------------------------------------------------
>
>
> 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
> =======================================================
>
> -------------------------------------------------------------------------------
>                      mm-unstable 9-23-2024  zswap-mTHP            Change wrt
>                      CONFIG_THP_SWAP=Y      CONFIG_THP_SWAP=Y     Baseline
>                      Baseline
> -------------------------------------------------------------------------------
> ZSWAP compressor          zstd   deflate-       zstd   deflate-   zstd deflate-
>                                       iaa                   iaa             iaa
> -------------------------------------------------------------------------------
> Throughput (KB/s)       24,347     35,971    169,404    141,168   596%     292%
> elapsed time (sec)       63.52      64.59      23.02      23.37    64%      64%
> sys time (sec)           27.91      27.01     613.26     677.83 -2098%   -2410%
> memcg_high              13,576     13,467     17,374     23,890
> memcg_swap_fail            162        124        114        144
> pswpin                       0          0          0          0
> pswpout              7,003,307  7,168,853          0          0
> zswpin                     741        722        778        922
> zswpout                 84,429     65,315 10,010,893 13,195,600
> thp_swpout              13,678     14,002          0          0
> thp_swpout_                162        124        114        144
>  fallback
> 2048kB-mthp_               n/a        n/a          0          0
>  swpout_fallback
> pgmajfault               3,345      2,903      4,151      5,066
> ZSWPOUT-2048kB             n/a        n/a     19,442     25,615
> SWPOUT-2048kB           13,678     14,002          0          0
> -------------------------------------------------------------------------------
>
> We see significant improvements in throughput and elapsed time for zstd and
> deflate-iaa, when comparing before (mTHP-SSD) vs. after (mTHP-ZSWAP). The
> sys time increases with mTHP-ZSWAP as expected, due to the CPU compression
> time vs. asynchronous disk write times, as pointed out by Ying and Yosry.
>
> In the "Before" scenario, when zswap does not store mTHP, only allocations
> count towards the cgroup memory limit. However, in the "After" scenario,
> with the introduction of zswap_store() mTHP, both, allocations as well as
> the zswap compressed pool usage from all 70 processes are counted towards
> the memory limit. As a result, we see higher swapout activity in the
> "After" data. Hence, more time is spent doing reclaim as the zswap cgroup
> charge leads to more frequent memory.high breaches.
>
> Summary:
> ========
> The v7 data presented above comparing zswap-mTHP with a conventional 823G
> SSD swap demonstrates good performance improvements with zswap-mTHP. Hence,
> it seems reasonable for zswap_store to support (m)THP, so that further
> performance improvements can be implemented.
>
> Some of the ideas that have shown promise in our experiments are:
>
> 1) IAA compress/decompress batching.
> 2) Distributing compress jobs across all IAA devices on the socket.
>
> In the experimental setup used in this patchset, we have enabled
> IAA compress verification to ensure additional hardware data integrity CRC
> checks not currently done by the software compressors. The tests run for
> this patchset are also using only 1 IAA device per core, that avails of 2
> compress engines on the device. In our experiments with IAA batching, we
> distribute compress jobs from all cores to the 8 compress engines available
> per socket. We further compress the pages in each mTHP in parallel in the
> accelerator. As a result, we improve compress latency and reclaim
> throughput.
>
> The following compares the same usemem workload characteristics between:
>
> 1) zstd (v7 experiments)
> 2) deflate-iaa "Fixed mode" (v7 experiments)
> 3) deflate-iaa with batching
> 4) deflate-iaa-canned "Canned mode" [3] with batching
>
> vm.page-cluster is set to "2" for all runs.
>
> 64K mTHP ZSWAP:
> ===============
>
> -------------------------------------------------------------------------------
> ZSWAP               zstd  IAA Fixed  IAA Fixed IAA Canned    IAA    IAA    IAA
> compressor          (v7)       (v7) + Batching + Batching  Batch Canned Canned
>                                                              vs.    vs.  Batch
> 64K mTHP                                                   Seqtl  Fixed    vs.
>                                                                           ZSTD
> -------------------------------------------------------------------------------
> Throughput       153,550    129,609    156,215    166,975    21%     7%     9%
>  (KB/s)
> elapsed time       23.90      25.19      22.46      21.38    11%     5%    11%
>  (sec)
> sys time          757.70     731.13     715.62     648.83     2%     9%    14%
>  (sec)
> memcg_high       148,075    192,744    197,548    181,734
> memcg_swap_        2,204      2,215      2,293      2,263
>  fail
> pswpin                 0          0          0          0
> pswpout                0          0          0          0
> zswpin               760        902        774        833
> zswpout       10,010,017 13,193,554 13,193,176 12,125,616
> thp_swpout             0          0          0          0
> thp_swpout_            0          0          0          0
>  fallback
> 64kB-mthp_         2,204      2,215      2,293      2,263
>  swpout_
>  fallback
> pgmajfault         3,054      3,259      3,545      3,516
> ZSWPOUT-64kB     623,451    822,268    822,176    755,480
> SWPOUT-64kB            0          0          0          0
> swap_ra              146        161        152        159
> swap_ra_hit           64        121         68         88
> -------------------------------------------------------------------------------
>
>
> 2M THP ZSWAP:
> =============
>
> -------------------------------------------------------------------------------
> ZSWAP               zstd  IAA Fixed  IAA Fixed IAA Canned    IAA    IAA    IAA
> compressor          (v7)       (v7) + Batching + Batching  Batch Canned Canned
>                                                              vs.    vs.  Batch
> 2M THP                                                     Seqtl  Fixed    vs.
>                                                                           ZSTD
> -------------------------------------------------------------------------------
> Throughput       169,404    141,168    175,089    193,407    24%    10%    14%
>  (KB/s)
> elapsed time       23.02      23.37      21.13      19.97    10%     5%    13%
>  (sec)
> sys time          613.26     677.83     630.51     533.80     7%    15%    13%
>  (sec)
> memcg_high        17,374     23,890     24,349     22,374
> memcg_swap_          114        144        102         88
>  fail
> pswpin                 0          0          0          0
> pswpout                0          0          0          0
> zswpin               778        922      6,492      6,642
> zswpout       10,010,893 13,195,600 13,199,907 12,132,265
> thp_swpout             0          0          0          0
> thp_swpout_          114        144        102         88
>  fallback
> pgmajfault         4,151      5,066      5,032      4,999
> ZSWPOUT-2MB       19,442     25,615     25,666     23,594
> SWPOUT-2MB             0          0          0          0
> swap_ra                3          9      4,383      4,494
> swap_ra_hit            2          6      4,298      4,412
> -------------------------------------------------------------------------------
>
>
> With ZSWAP IAA compress/decompress batching, we are able to demonstrate
> significant performance improvements and memory savings in scalability
> experiments under memory pressure, as compared to software compressors. We
> hope to submit this work in subsequent patch series.

Honestly I would remove the detailed results of the followup series
for batching, it should be enough to mention a single figure for
further expected improvement from ongoing work that depends on this.

>
> Thanks,
> Kanchana
>
> [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> [2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/
> [3] https://patchwork.kernel.org/project/linux-crypto/cover/cover.1710969449.git.andre.glover@linux.intel.com/
>
>
> Kanchana P Sridhar (8):
>   mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
>   mm: zswap: Modify zswap_compress() to accept a page instead of a
>     folio.
>   mm: zswap: Refactor code to store an entry in zswap xarray.
>   mm: zswap: Refactor code to delete stored offsets in case of errors.
>   mm: zswap: Compress and store a specific page in a folio.
>   mm: zswap: Support mTHP swapout in zswap_store().
>   mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout
>     stats.
>   mm: Document the newly added mTHP zswpout stats, clarify swpout
>     semantics.
>
>  Documentation/admin-guide/mm/transhuge.rst |   8 +-
>  include/linux/huge_mm.h                    |   1 +
>  include/linux/memcontrol.h                 |   4 +
>  mm/Kconfig                                 |   8 +
>  mm/huge_memory.c                           |   3 +
>  mm/page_io.c                               |   1 +
>  mm/zswap.c                                 | 248 ++++++++++++++++-----
>  7 files changed, 210 insertions(+), 63 deletions(-)
>
>
> base-commit: acfabf7e197f7a5bedf4749dac1f39551417b049
> --
> 2.27.0
>
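
The per-order "zswpout" counters described in the cover letter live under /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout. The sketch below shows one way they could be collected from userspace. It assumes each counter file holds a single decimal integer, which is how the existing per-order mTHP stats are exposed; the function name and the `thp_root` parameter are illustrative and not part of the series.

```python
# Sketch only: collect per-order mTHP zswpout counters from sysfs.
# read_mthp_zswpout() and thp_root are hypothetical names for illustration.
import glob
import os
import re


def read_mthp_zswpout(thp_root="/sys/kernel/mm/transparent_hugepage"):
    """Return {folio size in kB: zswpout count} for each hugepages-*kB dir."""
    counters = {}
    pattern = os.path.join(thp_root, "hugepages-*kB", "stats", "zswpout")
    for path in glob.glob(pattern):
        # path is <root>/hugepages-<N>kB/stats/zswpout; recover <N>.
        size_dir = os.path.basename(os.path.dirname(os.path.dirname(path)))
        match = re.fullmatch(r"hugepages-(\d+)kB", size_dir)
        if match is None:
            continue
        with open(path) as counter_file:
            counters[int(match.group(1))] = int(counter_file.read().strip())
    return counters
```

On a kernel without these counters the function simply returns an empty dict. Passing a different `thp_root` makes it possible to exercise the parsing against a scratch directory instead of the live sysfs tree.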
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Tuesday, September 24, 2024 12:35 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
>
> On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Hi All,
> >
> > This patch-series enables zswap_store() to accept and store mTHP
> > folios. The most significant contribution in this series is from the
> > earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> > migrated to mm-unstable as of 9-23-2024 in patches 5,6 of this series.
> >
> > [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> >      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> >
> > Additionally, there is an attempt to modularize some of the functionality
> > in zswap_store(), to make it more amenable to supporting any-order
> > mTHPs. For instance, the function zswap_store_entry() stores a zswap_entry
> > in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
> > delete all offsets corresponding to a higher order folio stored in zswap.
>
> These are implementation details that are not very useful here, you
> can just mention that the first few patches do refactoring prep work.

Thanks Yosry for the comments! Sure, I will reword this as you've
suggested in v8.

>
> >
> > For accounting purposes, the patch-series adds per-order mTHP sysfs
> > "zswpout" counters that get incremented upon successful zswap_store of
> > an mTHP folio:
> >
> > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
> >
> > A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
> > will enable/disable zswap storing of (m)THP. When disabled, zswap will
> > fallback to rejecting the mTHP folio, to be processed by the backing
> > swap device.
>
> Why is this needed? Do we just not have enough confidence in the
> feature yet, or are there some cases that regress from enabling mTHP
> for zswapout?
>
> Does generic mTHP swapout/swapin also use config options?

As discussed in the other comments' follow-up, I will delete the config
option and runtime knob.

>
> >
> > This patch-series is a pre-requisite for ZSWAP compress batching of mTHP
> > swap-out and decompress batching of swap-ins based on swapin_readahead(),
> > using Intel IAA hardware acceleration, which we would like to submit in
> > subsequent patch-series, with performance improvement data.
> >
> > Thanks to Ying Huang for pre-posting review feedback and suggestions!
> >
> > Thanks also to Nhat, Yosry, Barry, Chengming, Usama and Ying for their
> > helpful feedback, data reviews and suggestions!
> >
> > Co-development signoff request:
> > ===============================
> > I would like to request Ryan Roberts' co-developer signoff on patches
> > 5 and 6 in this series. Thanks Ryan!
> >
> > Changes since v6:
> > =================
>
> Please put the changelog at the very end, I almost missed the
> performance evaluation.

Sure, will fix this.

>
> > 1) Rebased to mm-unstable as of 9-23-2024,
> >    commit acfabf7e197f7a5bedf4749dac1f39551417b049.
> > 2) Refactored into smaller commits, as suggested by Yosry and
> >    Chengming. Thanks both!
> > 3) Reworded the commit log for patches 5 and 6 as per Yosry's
> >    suggestion. Thanks Yosry!
> > 4) Gathered data on a Sapphire Rapids server that has 823GiB SSD swap disk
> >    partition. Also, all experiments are run with usemem --sleep 10, so that
> >    the memory allocated by the 70 processes remains in memory
> >    longer. Posted elapsed and sys times. Thanks to Yosry, Nhat and Ying for
> >    their help with refining the performance characterization methodology.
> > 5) Updated Documentation/admin-guide/mm/transhuge.rst as suggested by
> >    Nhat. Thanks Nhat!
> >
> > Changes since v5:
> > =================
> > 1) Rebased to mm-unstable as of 8/29/2024,
> >    commit 9287e4adbc6ab8fa04d25eb82e097fed877a4642.
> > 2) Added CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default) to
> >    enable/disable zswap_store() of mTHP folios. Thanks Nhat for the
> >    suggestion to add a knob by which users can enable/disable this
> >    change. Nhat, I hope this is along the lines of what you were
> >    thinking.
> > 3) Added vm-scalability usemem data with 4K folios with
> >    CONFIG_ZSWAP_STORE_THP_DEFAULT_ON off, that I gathered to make sure
> >    there is no regression with this change.
> > 4) Added data with usemem with 64K and 2M THP for an alternate view of
> >    before/after, as suggested by Yosry, so we can understand the impact
> >    of when mTHPs are split into 4K folios in shrink_folio_list()
> >    (CONFIG_THP_SWAP off) vs. not split (CONFIG_THP_SWAP on) and stored
> >    in zswap. Thanks Yosry for this suggestion.
> >
> > Changes since v4:
> > =================
> > 1) Published before/after data with zstd, as suggested by Nhat (Thanks
> >    Nhat for the data reviews!).
> > 2) Rebased to mm-unstable from 8/27/2024,
> >    commit b659edec079c90012cf8d05624e312d1062b8b87.
> > 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
> >    CONFIG_MEMCG is not defined, to resolve build errors reported by kernel
> >    robot; as per Nhat's and Michal's suggestion to not require a separate
> >    patch to fix the build errors (thanks both!).
> > 4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
> >    suggested by Yosry (Thanks Yosry!).
> > 5) Squashed the commits that define new mthp zswpout stat counters, and
> >    invoke count_mthp_stat() after successful zswap_store()s; into a single
> >    commit. Thanks Yosry for this suggestion!
> >
> > Changes since v3:
> > =================
> > 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
> >    Thanks to Barry for suggesting aligning with Ryan Roberts' latest
> >    changes to count_mthp_stat() so that it's always defined, even when THP
> >    is disabled. Barry, I have also made one other change in page_io.c
> >    where count_mthp_stat() is called by count_swpout_vm_event(). I would
> >    appreciate it if you can review this. Thanks!
> >    Hopefully this should resolve the kernel robot build errors.
> >
> > Changes since v2:
> > =================
> > 1) Gathered usemem data using SSD as the backing swap device for zswap,
> >    as suggested by Ying Huang. Ying, I would appreciate it if you can
> >    review the latest data. Thanks!
> > 2) Generated the base commit info in the patches to attempt to address
> >    the kernel test robot build errors.
> > 3) No code changes to the individual patches themselves.
> >
> > Changes since RFC v1:
> > =====================
> >
> > 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
> >    Thanks Barry!
> > 2) Addressed some of the code review comments that Nhat Pham provided in
> >    Ryan's initial RFC [1]:
> >    - Added a comment about the cgroup zswap limit checks occurring once per
> >      folio at the beginning of zswap_store().
> >      Nhat, Ryan, please do let me know if the comments convey the summary
> >      from the RFC discussion. Thanks!
> >    - Posted data on running the cgroup suite's zswap kselftest.
> > 3) Rebased to v6.11-rc3.
> > 4) Gathered performance data with usemem and the rebased patch-series.
> >
> >
> > Regression Testing:
> > ===================
> > I ran vm-scalability usemem 70 processes without mTHP, i.e., only 4K
> > folios with mm-unstable and with this patch-series. The main goal was
> > to make sure that there is no functional or performance regression
> > wrt the earlier zswap behavior for 4K folios when
> > CONFIG_ZSWAP_STORE_THP_DEFAULT_ON is not set and zswap_store() of 4K
> > pages goes through the newly added code path [zswap_store(),
> > zswap_store_page()].
> >
> > The data indicates there is no regression.
> >
> >  ------------------------------------------------------------------------------
> >                          mm-unstable 8-28-2024           zswap-mTHP v6
> >                                                CONFIG_ZSWAP_STORE_THP_DEFAULT_ON
> >                                                           is not set
> >  ------------------------------------------------------------------------------
> >  ZSWAP compressor              zstd   deflate-iaa          zstd   deflate-iaa
> >  ------------------------------------------------------------------------------
> >  Throughput (KB/s)          110,775       113,010       111,550       121,937
> >  sys time (sec)            1,141.72        954.87      1,131.95        828.47
> >  memcg_high                 140,500       153,737       139,772       134,129
> >  memcg_swap_high                  0             0             0             0
> >  memcg_swap_fail                  0             0             0             0
> >  pswpin                           0             0             0             0
> >  pswpout                          0             0             0             0
> >  zswpin                         675           690           682           684
> >  zswpout                  9,552,298    10,603,271     9,566,392     9,267,213
> >  thp_swpout                       0             0             0             0
> >  thp_swpout_fallback              0             0             0             0
> >  pgmajfault                   3,453         3,468         3,841         3,487
> >  ZSWPOUT-64kB-mTHP              n/a           n/a             0             0
> >  SWPOUT-64kB-mTHP                 0             0             0             0
> >  ------------------------------------------------------------------------------
>
> It's probably better to put the zstd columns next to each other, and
> the deflate-iaa columns next to each other, for easier visual
> comparisons.

Sure. Will change this accordingly, in v8.

> >
> >
> > Performance Testing:
> > ====================
> > Testing of this patch-series was done with mm-unstable as of 9-23-2024,
> > commit acfabf7e197f7a5bedf4749dac1f39551417b049. Data was gathered
> > without/with this patch-series, on an Intel Sapphire Rapids server,
> > dual-socket 56 cores per socket, 4 IAA devices per socket, 503 GiB RAM and
> > 823G SSD disk partition swap. Core frequency was fixed at 2500MHz.
> >
> > The vm-scalability "usemem" test was run in a cgroup whose memory.high
> > was fixed at 40G. There is no swap limit set for the cgroup. Following a
> > similar methodology as in Ryan Roberts' "Swap-out mTHP without splitting"
> > series [2], 70 usemem processes were run, each allocating and writing 1G of
> > memory, and sleeping for 10 sec before exiting:
> >
> >     usemem --init-time -w -O -s 10 -n 70 1g
> >
> > The vm/sysfs mTHP stats included with the performance data provide details
> > on the swapout activity to ZSWAP/swap.
> >
> > Other kernel configuration parameters:
> >
> >     ZSWAP Compressors : zstd, deflate-iaa
> >     ZSWAP Allocator   : zsmalloc
> >     SWAP page-cluster : 2
> >
> > In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> > IAA "compression verification" is enabled. Hence each IAA compression
> > will be decompressed internally by the "iaa_crypto" driver, the crc-s
> > returned by the hardware will be compared and errors reported in case of
> > mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> > compared to the software compressors.
> >
> > Throughput is derived by averaging the individual 70 processes' throughputs
> > reported by usemem. elapsed/sys times are measured with perf. All data
> > points per compressor/kernel/mTHP configuration are averaged across 3 runs.
> >
> > Case 1: Comparing zswap 4K vs. zswap mTHP
> > =========================================
> >
> > In this scenario, the "before" is CONFIG_THP_SWAP set to off, which results
> > in 64K/2M (m)THP being split into 4K folios that get processed by zswap.
> >
> > The "after" is CONFIG_THP_SWAP set to on, and this patch-series, which
> > results in 64K/2M (m)THP not being split, and being processed by zswap.
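As a rough illustration of the averaging described in the quoted methodology (per-process usemem throughputs averaged within a run, then data points averaged across 3 runs), here is a minimal Python sketch. The function names are hypothetical, parsing of the usemem output is not shown, and taking the mean of per-run means is an assumption about the order of aggregation; the input numbers are made up for the example and are not measurements.

```python
from statistics import mean


def run_throughput(per_process_kbps):
    """Average the per-process throughputs (KB/s) reported within one run."""
    return mean(per_process_kbps)


def reported_throughput(runs):
    """Average the per-run figures across all runs (3 runs in the cover letter)."""
    return mean(run_throughput(r) for r in runs)


if __name__ == "__main__":
    # Made-up data: 3 runs with 2 processes each (the real test uses 70).
    runs = [[100, 200], [300, 500], [200, 200]]
    print(reported_throughput(runs))
```

With equal process counts per run, as in the test setup described, the mean of per-run means equals the mean over all samples, so the aggregation order does not change the result.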
> >
> > 64KB mTHP (cgroup memory.high set to 40G):
> > ==========================================
> >
> >  -------------------------------------------------------------------------------
> >                        mm-unstable 9-23-2024        zswap-mTHP        Change wrt
> >                          CONFIG_THP_SWAP=N      CONFIG_THP_SWAP=Y      Baseline
> >                              Baseline
> >  -------------------------------------------------------------------------------
> >  ZSWAP compressor            zstd deflate-iaa        zstd deflate-iaa    zstd deflate-
> >                                                                                    iaa
> >  -------------------------------------------------------------------------------
> >  Throughput (KB/s)        143,323     125,485     153,550     129,609      7%      3%
> >  elapsed time (sec)         24.97       25.42       23.90       25.19      4%      1%
> >  sys time (sec)            822.72      750.96      757.70      731.13      8%      3%
> >  memcg_high               132,743     169,825     148,075     192,744
> >  memcg_swap_fail          639,067     841,553       2,204       2,215
> >  pswpin                         0           0           0           0
> >  pswpout                        0           0           0           0
> >  zswpin                       795         873         760         902
> >  zswpout               10,011,266  13,195,137  10,010,017  13,193,554
> >  thp_swpout                     0           0           0           0
> >  thp_swpout_fallback            0           0           0           0
> >  64kB-mthp_               639,065     841,553       2,204       2,215
> >   swpout_fallback
> >  pgmajfault                 2,861       2,924       3,054       3,259
> >  ZSWPOUT-64kB                 n/a         n/a     623,451     822,268
> >  SWPOUT-64kB                    0           0           0           0
> >  -------------------------------------------------------------------------------
> >
> >
> > 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
> > =======================================================
> >
> >  -------------------------------------------------------------------------------
> >                        mm-unstable 9-23-2024        zswap-mTHP        Change wrt
> >                          CONFIG_THP_SWAP=N      CONFIG_THP_SWAP=Y      Baseline
> >                              Baseline
> >  -------------------------------------------------------------------------------
> >  ZSWAP compressor            zstd deflate-iaa        zstd deflate-iaa    zstd deflate-
> >                                                                                    iaa
> >  -------------------------------------------------------------------------------
> >  Throughput (KB/s)        145,616     139,640     169,404     141,168     16%      1%
> >  elapsed time (sec)         25.05       23.85       23.02       23.37      8%      2%
> >  sys time (sec)            790.53      676.34      613.26      677.83     22%   -0.2%
> >  memcg_high                16,702      25,197      17,374      23,890
> >  memcg_swap_fail           21,485      27,814         114         144
> >  pswpin                         0           0           0           0
> >  pswpout                        0           0           0           0
> >  zswpin                       793         852         778         922
> >  zswpout               10,011,709  13,186,882  10,010,893  13,195,600
> >  thp_swpout                     0           0           0           0
> >  thp_swpout_fallback       21,485      27,814         114         144
> >  2048kB-mthp_                 n/a         n/a           0           0
> >   swpout_fallback
> >  pgmajfault                 2,701       2,822       4,151       5,066
> >  ZSWPOUT-2048kB               n/a         n/a      19,442      25,615
> >  SWPOUT-2048kB                  0           0           0           0
> >  -------------------------------------------------------------------------------
> >
> > We mostly see improvements in throughput, elapsed and sys time for zstd and
> > deflate-iaa, when comparing before (THP_SWAP=N) vs. after (THP_SWAP=Y).
> >
> >
> > Case 2: Comparing SSD swap mTHP vs. zswap mTHP
> > ==============================================
> >
> > In this scenario, CONFIG_THP_SWAP is enabled in "before" and "after"
> > experiments. The "before" represents zswap rejecting mTHP, and the mTHP
> > being stored by the 823G SSD swap. The "after" represents data with this
> > patch-series, that results in 64K/2M (m)THP being processed and stored by
> > zswap.
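The "Change wrt Baseline" columns in these tables are consistent with a plain relative-change calculation in which a positive value means an improvement: higher is better for throughput, lower is better for elapsed and sys time. The Python sketch below reproduces a few of the quoted percentages under that assumption; the function names are hypothetical and the formula is inferred from the numbers, not stated in the cover letter.

```python
def throughput_gain_pct(before, after):
    """Relative throughput increase over the baseline, in percent."""
    return (after - before) / before * 100.0


def time_reduction_pct(before, after):
    """Relative time reduction versus the baseline, in percent.

    A negative result means the time went up."""
    return (before - after) / before * 100.0


if __name__ == "__main__":
    # Case 1, 64KB mTHP, zstd: table reports 7%, 4%, 8%.
    print(round(throughput_gain_pct(143_323, 153_550)))  # 7
    print(round(time_reduction_pct(24.97, 23.90)))       # 4
    print(round(time_reduction_pct(822.72, 757.70)))     # 8
    # Case 2, 64KB mTHP, zstd sys time: table reports -872%.
    print(round(time_reduction_pct(77.95, 757.70)))      # -872
```

Read this way, the large negative sys time entries in Case 2 mean that sys time grew by roughly 9x to 25x relative to the SSD baseline, while elapsed time still dropped by about two thirds.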
> >
> > 64KB mTHP (cgroup memory.high set to 40G):
> > ==========================================
> >
> >  -------------------------------------------------------------------------------
> >                        mm-unstable 9-23-2024        zswap-mTHP        Change wrt
> >                          CONFIG_THP_SWAP=Y      CONFIG_THP_SWAP=Y      Baseline
> >                              Baseline
> >  -------------------------------------------------------------------------------
> >  ZSWAP compressor            zstd deflate-iaa        zstd deflate-iaa    zstd deflate-
> >                                                                                    iaa
> >  -------------------------------------------------------------------------------
> >  Throughput (KB/s)         20,265      20,696     153,550     129,609    658%    526%
> >  elapsed time (sec)         72.44       70.86       23.90       25.19     67%     64%
> >  sys time (sec)             77.95       77.99      757.70      731.13   -872%   -837%
> >  memcg_high               115,811     113,277     148,075     192,744
> >  memcg_swap_fail            2,386       2,425       2,204       2,215
> >  pswpin                        16          16           0           0
> >  pswpout                7,774,235   7,616,069           0           0
> >  zswpin                       728         749         760         902
> >  zswpout                   38,424      39,022  10,010,017  13,193,554
> >  thp_swpout                     0           0           0           0
> >  thp_swpout_fallback            0           0           0           0
> >  64kB-mthp_                 2,386       2,425       2,204       2,215
> >   swpout_fallback
> >  pgmajfault                 2,757       2,860       3,054       3,259
> >  ZSWPOUT-64kB                 n/a         n/a     623,451     822,268
> >  SWPOUT-64kB              485,890     476,004           0           0
> >  -------------------------------------------------------------------------------
> >
> >
> > 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
> > =======================================================
> >
> >  -------------------------------------------------------------------------------
> >                        mm-unstable 9-23-2024        zswap-mTHP        Change wrt
> >                          CONFIG_THP_SWAP=Y      CONFIG_THP_SWAP=Y      Baseline
> >                              Baseline
> >  -------------------------------------------------------------------------------
> >  ZSWAP compressor            zstd deflate-iaa        zstd deflate-iaa    zstd deflate-
> >                                                                                    iaa
> >  -------------------------------------------------------------------------------
> >  Throughput (KB/s)         24,347      35,971     169,404     141,168    596%    292%
> >  elapsed time (sec)         63.52       64.59       23.02       23.37     64%     64%
> >  sys time (sec)             27.91       27.01      613.26      677.83  -2098%  -2410%
> >  memcg_high                13,576      13,467      17,374      23,890
> >  memcg_swap_fail              162         124         114         144
> >  pswpin                         0           0           0           0
> >  pswpout                7,003,307   7,168,853           0           0
> >  zswpin                       741         722         778         922
> >  zswpout                   84,429      65,315  10,010,893  13,195,600
> >  thp_swpout                13,678      14,002           0           0
> >  thp_swpout_fallback          162         124         114         144
> >  2048kB-mthp_                 n/a         n/a           0           0
> >   swpout_fallback
> >  pgmajfault                 3,345       2,903       4,151       5,066
> >  ZSWPOUT-2048kB               n/a         n/a      19,442      25,615
> >  SWPOUT-2048kB             13,678      14,002           0           0
> >  -------------------------------------------------------------------------------
> >
> > We see significant improvements in throughput and elapsed time for zstd and
> > deflate-iaa, when comparing before (mTHP-SSD) vs. after (mTHP-ZSWAP). The
> > sys time increases with mTHP-ZSWAP as expected, due to the CPU compression
> > time vs. asynchronous disk write times, as pointed out by Ying and Yosry.
> >
> > In the "Before" scenario, when zswap does not store mTHP, only allocations
> > count towards the cgroup memory limit. However, in the "After" scenario,
> > with the introduction of zswap_store() mTHP, both allocations and the
> > zswap compressed pool usage from all 70 processes are counted towards
> > the memory limit. As a result, we see higher swapout activity in the
> > "After" data. Hence, more time is spent doing reclaim as the zswap cgroup
> > charge leads to more frequent memory.high breaches.
> >
> > Summary:
> > ========
> > The v7 data presented above comparing zswap-mTHP with a conventional 823G
> > SSD swap demonstrates good performance improvements with zswap-mTHP. Hence,
> > it seems reasonable for zswap_store to support (m)THP, so that further
> > performance improvements can be implemented.
> >
> > Some of the ideas that have shown promise in our experiments are:
> >
> > 1) IAA compress/decompress batching.
> > 2) Distributing compress jobs across all IAA devices on the socket.
> >
> > In the experimental setup used in this patchset, we have enabled
> > IAA compress verification to ensure additional hardware data integrity CRC
> > checks not currently done by the software compressors. The tests run for
> > this patchset are also using only 1 IAA device per core, which makes use of
> > 2 compress engines on the device. In our experiments with IAA batching, we
> > distribute compress jobs from all cores to the 8 compress engines available
> > per socket. We further compress the pages in each mTHP in parallel in the
> > accelerator. As a result, we improve compress latency and reclaim
> > throughput.
> >
> > The following compares the same usemem workload characteristics between:
> >
> > 1) zstd (v7 experiments)
> > 2) deflate-iaa "Fixed mode" (v7 experiments)
> > 3) deflate-iaa with batching
> > 4) deflate-iaa-canned "Canned mode" [3] with batching
> >
> > vm.page-cluster is set to "2" for all runs.
> >
> > 64K mTHP ZSWAP:
> > ===============
> >
> >  -------------------------------------------------------------------------------
> >  ZSWAP                 zstd   IAA Fixed   IAA Fixed  IAA Canned     IAA     IAA     IAA
> >  compressor            (v7)        (v7)  + Batching  + Batching   Batch  Canned  Canned
> >                                                                     vs.     vs.   Batch
> >  64K mTHP                                                         Seqtl   Fixed     vs.
> >                                                                                    ZSTD
> >  -------------------------------------------------------------------------------
> >  Throughput         153,550     129,609     156,215     166,975     21%      7%      9%
> >   (KB/s)
> >  elapsed time         23.90       25.19       22.46       21.38     11%      5%     11%
> >   (sec)
> >  sys time            757.70      731.13      715.62      648.83      2%      9%     14%
> >   (sec)
> >  memcg_high         148,075     192,744     197,548     181,734
> >  memcg_swap_          2,204       2,215       2,293       2,263
> >   fail
> >  pswpin                   0           0           0           0
> >  pswpout                  0           0           0           0
> >  zswpin                 760         902         774         833
> >  zswpout         10,010,017  13,193,554  13,193,176  12,125,616
> >  thp_swpout               0           0           0           0
> >  thp_swpout_              0           0           0           0
> >   fallback
> >  64kB-mthp_           2,204       2,215       2,293       2,263
> >   swpout_
> >   fallback
> >  pgmajfault           3,054       3,259       3,545       3,516
> >  ZSWPOUT-64kB       623,451     822,268     822,176     755,480
> >  SWPOUT-64kB              0           0           0           0
> >  swap_ra                146         161         152         159
> >  swap_ra_hit             64         121          68          88
> >  -------------------------------------------------------------------------------
> >
> >
> > 2M THP ZSWAP:
> > =============
> >
> >  -------------------------------------------------------------------------------
> >  ZSWAP                 zstd   IAA Fixed   IAA Fixed  IAA Canned     IAA     IAA     IAA
> >  compressor            (v7)        (v7)  + Batching  + Batching   Batch  Canned  Canned
> >                                                                     vs.     vs.   Batch
> >  2M THP                                                           Seqtl   Fixed     vs.
> >                                                                                    ZSTD
> >  -------------------------------------------------------------------------------
> >  Throughput         169,404     141,168     175,089     193,407     24%     10%     14%
> >   (KB/s)
> >  elapsed time         23.02       23.37       21.13       19.97     10%      5%     13%
> >   (sec)
> >  sys time            613.26      677.83      630.51      533.80      7%     15%     13%
> >   (sec)
> >  memcg_high          17,374      23,890      24,349      22,374
> >  memcg_swap_            114         144         102          88
> >   fail
> >  pswpin                   0           0           0           0
> >  pswpout                  0           0           0           0
> >  zswpin                 778         922       6,492       6,642
> >  zswpout         10,010,893  13,195,600  13,199,907  12,132,265
> >  thp_swpout               0           0           0           0
> >  thp_swpout_            114         144         102          88
> >   fallback
> >  pgmajfault           4,151       5,066       5,032       4,999
> >  ZSWPOUT-2MB         19,442      25,615      25,666      23,594
> >  SWPOUT-2MB               0           0           0           0
> >  swap_ra                  3           9       4,383       4,494
> >  swap_ra_hit              2           6       4,298       4,412
> >  -------------------------------------------------------------------------------
> >
> >
> > With ZSWAP IAA compress/decompress batching, we are able to demonstrate
> > significant performance improvements and memory savings in scalability
> > experiments under memory pressure, as compared to software compressors. We
> > hope to submit this work in subsequent patch series.
>
> Honestly I would remove the detailed results of the followup series
> for batching, it should be enough to mention a single figure for
> further expected improvement from ongoing work that depends on this.

Definitely, will summarize the results of batching in the cover letter
for v8.

Thanks,
Kanchana

> >
> > Thanks,
> > Kanchana
> >
> > [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> > [2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/
> > [3] https://patchwork.kernel.org/project/linux-crypto/cover/cover.1710969449.git.andre.glover@linux.intel.com/
> >
> >
> > Kanchana P Sridhar (8):
> >   mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
> >   mm: zswap: Modify zswap_compress() to accept a page instead of a
> >     folio.
> >   mm: zswap: Refactor code to store an entry in zswap xarray.
> >   mm: zswap: Refactor code to delete stored offsets in case of errors.
> >   mm: zswap: Compress and store a specific page in a folio.
> >   mm: zswap: Support mTHP swapout in zswap_store().
> >   mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout
> >     stats.
> >   mm: Document the newly added mTHP zswpout stats, clarify swpout
> >     semantics.
> >
> >  Documentation/admin-guide/mm/transhuge.rst |   8 +-
> >  include/linux/huge_mm.h                    |   1 +
> >  include/linux/memcontrol.h                 |   4 +
> >  mm/Kconfig                                 |   8 +
> >  mm/huge_memory.c                           |   3 +
> >  mm/page_io.c                               |   1 +
> >  mm/zswap.c                                 | 248 ++++++++++++++++-----
> >  7 files changed, 210 insertions(+), 63 deletions(-)
> >
> >
> > base-commit: acfabf7e197f7a5bedf4749dac1f39551417b049
> > --
> > 2.27.0
> >