Hi Ying,

> -----Original Message-----
> From: Huang, Ying <ying.huang@intel.com>
> Sent: Wednesday, September 25, 2024 5:45 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali,
> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
>
> "Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes:
>
> >> -----Original Message-----
> >> From: Huang, Ying <ying.huang@intel.com>
> >> Sent: Tuesday, September 24, 2024 11:35 PM
> >> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> >> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> >> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> >> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> >> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com;
> >> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali,
> >> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> >> <vinodh.gopal@intel.com>
> >> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
> >>
> >> Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes:
> >>
> >> [snip]
> >>
> >> >
> >> > Case 1: Comparing zswap 4K vs. zswap mTHP
> >> > =========================================
> >> >
> >> > In this scenario, the "before" is CONFIG_THP_SWAP set to off, which results
> >> > in 64K/2M (m)THP being split into 4K folios that get processed by zswap.
> >> >
> >> > The "after" is CONFIG_THP_SWAP set to on, plus this patch-series, which
> >> > results in 64K/2M (m)THP not being split, and instead processed by zswap.
> >> >
> >> > 64KB mTHP (cgroup memory.high set to 40G):
> >> > ==========================================
> >> >
> >> >  -------------------------------------------------------------------------------
> >> >                      mm-unstable 9-23-2024      zswap-mTHP              Change wrt
> >> >                      CONFIG_THP_SWAP=N          CONFIG_THP_SWAP=Y       Baseline
> >> >                      Baseline
> >> >  -------------------------------------------------------------------------------
> >> >  ZSWAP compressor    zstd        deflate-       zstd        deflate-    zstd  deflate-
> >> >                                  iaa                        iaa               iaa
> >> >  -------------------------------------------------------------------------------
> >> >  Throughput (KB/s)      143,323     125,485        153,550     129,609   7%    3%
> >> >  elapsed time (sec)       24.97       25.42          23.90       25.19   4%    1%
> >> >  sys time (sec)          822.72      750.96         757.70      731.13   8%    3%
> >> >  memcg_high             132,743     169,825        148,075     192,744
> >> >  memcg_swap_fail        639,067     841,553          2,204       2,215
> >> >  pswpin                       0           0              0           0
> >> >  pswpout                      0           0              0           0
> >> >  zswpin                     795         873            760         902
> >> >  zswpout             10,011,266  13,195,137     10,010,017  13,193,554
> >> >  thp_swpout                   0           0              0           0
> >> >  thp_swpout_                  0           0              0           0
> >> >   fallback
> >> >  64kB-mthp_             639,065     841,553          2,204       2,215
> >> >   swpout_fallback
> >> >  pgmajfault               2,861       2,924          3,054       3,259
> >> >  ZSWPOUT-64kB               n/a         n/a        623,451     822,268
> >> >  SWPOUT-64kB                  0           0              0           0
> >> >  -------------------------------------------------------------------------------
> >> >
> >>
> >> IIUC, the throughput is the sum of the throughput of all usemem processes?
> >>
> >> One possible issue with the usemem test case is the "imbalance" issue.
> >> That is, some usemem processes may swap-out/swap-in less, so their score
> >> is very high, while some other processes may swap-out/swap-in more, so
> >> their score is very low.  Sometimes the total score decreases, but the
> >> scores of the usemem processes are more balanced, so the performance
> >> should be considered better.  And, in general, we should make the usemem
> >> scores balanced among processes, via, say, a longer test time.  Can you
> >> check this in your test results?
> >
> > Actually, the throughput data listed in the cover-letter is the average of
> > all the usemem processes. Your observation about the "imbalance" issue is
> > right. Some processes see a higher throughput than others. I have noticed
> > that the throughputs progressively reduce as the individual processes exit
> > and print their stats.
> >
> > Listed below are the stats from two runs of usemem70: sleep 10 and sleep 30.
> > Both are run with a cgroup mem-limit of 40G. Data is with v7, 64K folios
> > are enabled, zswap uses zstd.
> >
> >
> >  -----------------------------------------------
> >       sleep 10             sleep 30
> >  Throughput (KB/s)    Throughput (KB/s)
> >  -----------------------------------------------
> >       181,540              191,686
> >       179,651              191,459
> >       179,068              188,834
> >       177,244              187,568
> >       177,215              186,703
> >       176,565              185,584
> >       176,546              185,370
> >       176,470              185,021
> >       176,214              184,303
> >       176,128              184,040
> >       175,279              183,932
> >       174,745              180,831
> >       173,935              179,418
> >       161,546              168,014
> >       160,332              167,540
> >       160,122              167,364
> >       159,613              167,020
> >       159,546              166,590
> >       159,021              166,483
> >       158,845              166,418
> >       158,426              166,264
> >       158,396              166,066
> >       158,371              165,944
> >       158,298              165,866
> >       158,250              165,884
> >       158,057              165,533
> >       158,011              165,532
> >       157,899              165,457
> >       157,894              165,424
> >       157,839              165,410
> >       157,731              165,407
> >       157,629              165,273
> >       157,626              164,867
> >       157,581              164,636
> >       157,471              164,266
> >       157,430              164,225
> >       157,287              163,290
> >       156,289              153,597
> >       153,970              147,494
> >       148,244              147,102
> >       142,907              146,111
> >       142,811              145,789
> >       139,171              141,168
> >       136,314              140,714
> >       133,616              140,111
> >       132,881              139,636
> >       132,729              136,943
> >       132,680              136,844
> >       132,248              135,726
> >       132,027              135,384
> >       131,929              135,270
> >       131,766              134,748
> >       131,667              134,733
> >       131,576              134,582
> >       131,396              134,302
> >       131,351              134,160
> >       131,135              134,102
> >       130,885              134,097
> >       130,854              134,058
> >       130,767              134,006
> >       130,666              133,960
> >       130,647              133,894
> >       130,152              133,837
> >       130,006              133,747
> >       129,921              133,679
> >       129,856              133,666
> >       129,377              133,564
> >       128,366              133,331
> >       127,988              132,938
> >       126,903              132,746
> >  -----------------------------------------------
> >  sum       10,526,916           10,919,561
> >  average      150,385              155,994
> >  stddev        17,551               19,633
> >  -----------------------------------------------
> >  elapsed        24.40                43.66
> >  time (sec)
> >  sys time      806.25               766.05
> >  (sec)
> >  zswpout   10,008,713           10,008,407
> >  64K folio    623,463              623,629
> >  swpout
> >  -----------------------------------------------
>
> Although there is some imbalance, I don't find it to be too much.  So, I
> think the test result is reasonable.  Please pay attention to the
> imbalance issue in future tests.

Sure, will do so.

>
> > As we increase the time for which allocations are maintained,
> > there seems to be a slight improvement in throughput, but the
> > variance increases as well. The processes with lower throughput
> > could be the ones that handle the memcg being over its limit by
> > doing reclaim, possibly before they can allocate.
> >
> > Interestingly, the longer test time does seem to reduce the amount
> > of reclaim (hence the lower sys time), but more 64K large folios seem to
> > be reclaimed. Could this mean that with the longer test time (sleep 30),
> > more cold memory residing in large folios is getting reclaimed, as
> > against memory just relinquished by the exiting processes?
>
> I don't think a longer sleep time in the test helps much with balance.  Can
> you try with fewer processes, and a larger memory size per process?  I guess
> that this will improve balance.

I tried this, and the data is listed below:

usemem options:
---------------
30 processes allocate 10G each
cgroup memory limit = 150G
sleep 10
525Gi SSD disk swap partition
64K large folios enabled

Throughput (KB/s) of each of the 30 processes:

 ---------------------------------------------------------------
                         mm-unstable   zswap_store of large folios
                         9-25-2024     v7
 zswap compressor:       zstd          zstd          deflate-iaa
 ---------------------------------------------------------------
                         38,393        234,485       374,427
                         37,283        215,528       314,225
                         37,156        214,942       304,413
                         37,143        213,073       304,146
                         36,814        212,904       290,186
                         36,277        212,304       288,212
                         36,104        212,207       285,682
                         36,000        210,173       270,661
                         35,994        208,487       256,960
                         35,979        207,788       248,313
                         35,967        207,714       235,338
                         35,966        207,703       229,335
                         35,835        207,690       221,697
                         35,793        207,418       221,600
                         35,692        206,160       219,346
                         35,682        206,128       219,162
                         35,681        205,817       219,155
                         35,678        205,546       214,862
                         35,678        205,523       214,710
                         35,677        204,951       214,282
                         35,677        204,283       213,441
                         35,677        203,348       213,011
                         35,675        203,028       212,923
                         35,673        201,922       212,492
                         35,672        201,660       212,225
                         35,672        200,724       211,808
                         35,672        200,324       211,420
                         35,671        199,686       211,413
                         35,667        198,858       211,346
                         35,667        197,590       211,209
 ---------------------------------------------------------------
 sum                     1,081,515     6,217,964     7,268,000
 average                 36,051        207,265       242,267
 stddev                  655           7,010         42,234
 elapsed time (sec)      343.70        107.40        84.34
 sys time (sec)          269.30        2,520.13      1,696.20
 memcg.high breaches     443,672       475,074       623,333
 zswpout                 22,605        48,931,249    54,777,100
 pswpout                 40,004,528    0             0
 hugepages-64K zswpout   0             3,057,090     3,421,855
 hugepages-64K swpout    2,500,283     0             0
 ---------------------------------------------------------------

As you can see, this is quite a memory-constrained scenario, where we
are giving 50% of the total memory required as the memory limit for the
cgroup in which the 30 processes are run. This causes significantly more
reclaim activity than the setup I was using thus far (70 processes, 1G,
40G limit).

The variance or "imbalance" reduces somewhat for zstd, but not for IAA.

IAA shows really good throughput (17%), elapsed time (21%) and
sys time (33%) improvements wrt zstd with zswap_store of large folios.
These are the memory-constrained scenarios in which IAA typically
does really well. IAA verify_compress is enabled, so this is an added
data-integrity-check benefit we get with IAA.

I would like to get your and the maintainers' feedback on whether
I should switch to this "usemem30-10G" setup for v8?

Thanks,
Kanchana

>
> --
> Best Regards,
> Huang, Ying
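For reference, the "usemem30-10G" configuration described in the mail above
can be sketched roughly as follows, assuming cgroup v2, a recent kernel with
64K mTHP and zswap support, and the usemem tool from vm-scalability; the
cgroup name and the exact usemem flags are illustrative assumptions, not
commands taken from this thread:

  # Enable 64K mTHP and select the zswap compressor under test.
  echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
  echo zstd   > /sys/module/zswap/parameters/compressor
  echo Y      > /sys/module/zswap/parameters/enabled

  # cgroup v2 memory limit: 150G for the 30 x 10G run (40G for usemem70).
  mkdir -p /sys/fs/cgroup/zswap-test
  echo 150G > /sys/fs/cgroup/zswap-test/memory.high
  echo $$   > /sys/fs/cgroup/zswap-test/cgroup.procs

  # 30 usemem processes, 10G each, hold the allocations for 10 seconds
  # before exiting (flag set follows common vm-scalability usage).
  ./usemem --init-time -w -O -s 10 -n 30 10g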
"Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes: > Hi Ying, > >> -----Original Message----- >> From: Huang, Ying <ying.huang@intel.com> >> Sent: Wednesday, September 25, 2024 5:45 PM >> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> >> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; >> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com; >> chengming.zhou@linux.dev; usamaarif642@gmail.com; >> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com; >> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, >> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh >> <vinodh.gopal@intel.com> >> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios >> >> "Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes: >> >> >> -----Original Message----- >> >> From: Huang, Ying <ying.huang@intel.com> >> >> Sent: Tuesday, September 24, 2024 11:35 PM >> >> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> >> >> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; >> >> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com; >> >> chengming.zhou@linux.dev; usamaarif642@gmail.com; >> >> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com; >> >> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; >> Feghali, >> >> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh >> >> <vinodh.gopal@intel.com> >> >> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios >> >> >> >> Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes: >> >> >> >> [snip] >> >> >> >> > >> >> > Case 1: Comparing zswap 4K vs. zswap mTHP >> >> > ========================================= >> >> > >> >> > In this scenario, the "before" is CONFIG_THP_SWAP set to off, that >> results in >> >> > 64K/2M (m)THP to be split into 4K folios that get processed by zswap. >> >> > >> >> > The "after" is CONFIG_THP_SWAP set to on, and this patch-series, that >> >> results >> >> > in 64K/2M (m)THP to not be split, and processed by zswap. >> >> > >> >> > 64KB mTHP (cgroup memory.high set to 40G): >> >> > ========================================== >> >> > >> >> > ------------------------------------------------------------------------------- >> >> > mm-unstable 9-23-2024 zswap-mTHP Change wrt >> >> > CONFIG_THP_SWAP=N CONFIG_THP_SWAP=Y >> Baseline >> >> > Baseline >> >> > ------------------------------------------------------------------------------- >> >> > ZSWAP compressor zstd deflate- zstd deflate- zstd deflate- >> >> > iaa iaa iaa >> >> > ------------------------------------------------------------------------------- >> >> > Throughput (KB/s) 143,323 125,485 153,550 129,609 7% >> 3% >> >> > elapsed time (sec) 24.97 25.42 23.90 25.19 4% 1% >> >> > sys time (sec) 822.72 750.96 757.70 731.13 8% 3% >> >> > memcg_high 132,743 169,825 148,075 192,744 >> >> > memcg_swap_fail 639,067 841,553 2,204 2,215 >> >> > pswpin 0 0 0 0 >> >> > pswpout 0 0 0 0 >> >> > zswpin 795 873 760 902 >> >> > zswpout 10,011,266 13,195,137 10,010,017 13,193,554 >> >> > thp_swpout 0 0 0 0 >> >> > thp_swpout_ 0 0 0 0 >> >> > fallback >> >> > 64kB-mthp_ 639,065 841,553 2,204 2,215 >> >> > swpout_fallback >> >> > pgmajfault 2,861 2,924 3,054 3,259 >> >> > ZSWPOUT-64kB n/a n/a 623,451 822,268 >> >> > SWPOUT-64kB 0 0 0 0 >> >> > ------------------------------------------------------------------------------- >> >> > >> >> >> >> IIUC, the throughput is the sum of throughput of all usemem processes? >> >> >> >> One possible issue of usemem test case is the "imbalance" issue. 
That >> >> is, some usemem processes may swap-out/swap-in less, so the score is >> >> very high; while some other processes may swap-out/swap-in more, so the >> >> score is very low. Sometimes, the total score decreases, but the scores >> >> of usemem processes are more balanced, so that the performance should >> be >> >> considered better. And, in general, we should make usemem score >> >> balanced among processes via say longer test time. Can you check this >> >> in your test results? >> > >> > Actually, the throughput data listed in the cover-letter is the average of >> > all the usemem processes. Your observation about the "imbalance" issue is >> > right. Some processes see a higher throughput than others. I have noticed >> > that the throughputs progressively reduce as the individual processes exit >> > and print their stats. >> > >> > Listed below are the stats from two runs of usemem70: sleep 10 and sleep >> 30. >> > Both are run with a cgroup mem-limit of 40G. Data is with v7, 64K folios are >> > enabled, zswap uses zstd. >> > >> > >> > ----------------------------------------------- >> > sleep 10 sleep 30 >> > Throughput (KB/s) Throughput (KB/s) >> > ----------------------------------------------- >> > 181,540 191,686 >> > 179,651 191,459 >> > 179,068 188,834 >> > 177,244 187,568 >> > 177,215 186,703 >> > 176,565 185,584 >> > 176,546 185,370 >> > 176,470 185,021 >> > 176,214 184,303 >> > 176,128 184,040 >> > 175,279 183,932 >> > 174,745 180,831 >> > 173,935 179,418 >> > 161,546 168,014 >> > 160,332 167,540 >> > 160,122 167,364 >> > 159,613 167,020 >> > 159,546 166,590 >> > 159,021 166,483 >> > 158,845 166,418 >> > 158,426 166,264 >> > 158,396 166,066 >> > 158,371 165,944 >> > 158,298 165,866 >> > 158,250 165,884 >> > 158,057 165,533 >> > 158,011 165,532 >> > 157,899 165,457 >> > 157,894 165,424 >> > 157,839 165,410 >> > 157,731 165,407 >> > 157,629 165,273 >> > 157,626 164,867 >> > 157,581 164,636 >> > 157,471 164,266 >> > 157,430 164,225 >> > 157,287 163,290 >> > 156,289 153,597 >> > 153,970 147,494 >> > 148,244 147,102 >> > 142,907 146,111 >> > 142,811 145,789 >> > 139,171 141,168 >> > 136,314 140,714 >> > 133,616 140,111 >> > 132,881 139,636 >> > 132,729 136,943 >> > 132,680 136,844 >> > 132,248 135,726 >> > 132,027 135,384 >> > 131,929 135,270 >> > 131,766 134,748 >> > 131,667 134,733 >> > 131,576 134,582 >> > 131,396 134,302 >> > 131,351 134,160 >> > 131,135 134,102 >> > 130,885 134,097 >> > 130,854 134,058 >> > 130,767 134,006 >> > 130,666 133,960 >> > 130,647 133,894 >> > 130,152 133,837 >> > 130,006 133,747 >> > 129,921 133,679 >> > 129,856 133,666 >> > 129,377 133,564 >> > 128,366 133,331 >> > 127,988 132,938 >> > 126,903 132,746 >> > ----------------------------------------------- >> > sum 10,526,916 10,919,561 >> > average 150,385 155,994 >> > stddev 17,551 19,633 >> > ----------------------------------------------- >> > elapsed 24.40 43.66 >> > time (sec) >> > sys time 806.25 766.05 >> > (sec) >> > zswpout 10,008,713 10,008,407 >> > 64K folio 623,463 623,629 >> > swpout >> > ----------------------------------------------- >> >> Although there are some imbalance, I don't find it's too much. So, I >> think the test result is reasonable. Please pay attention to the >> imbalance issue in the future tests. > > Sure, will do so. > >> >> > As we increase the time for which allocations are maintained, >> > there seems to be a slight improvement in throughput, but the >> > variance increases as well. 
The processes with lower throughput >> > could be the ones that handle the memcg being over limit by >> > doing reclaim, possibly before they can allocate. >> > >> > Interestingly, the longer test time does seem to reduce the amount >> > of reclaim (hence lower sys time), but more 64K large folios seem to >> > be reclaimed. Could this mean that with longer test time (sleep 30), >> > more cold memory residing in large folios is getting reclaimed, as >> > against memory just relinquished by the exiting processes? >> >> I don't think longer sleep time in test helps much to balance. Can you >> try with less process, and larger memory size per process? I guess that >> this will improve balance. > > I tried this, and the data is listed below: > > usemem options: > --------------- > 30 processes allocate 10G each > cgroup memory limit = 150G > sleep 10 > 525Gi SSD disk swap partition > 64K large folios enabled > > Throughput (KB/s) of each of the 30 processes: > --------------------------------------------------------------- > mm-unstable zswap_store of large folios > 9-25-2024 v7 > zswap compressor: zstd zstd deflate-iaa > --------------------------------------------------------------- > 38,393 234,485 374,427 > 37,283 215,528 314,225 > 37,156 214,942 304,413 > 37,143 213,073 304,146 > 36,814 212,904 290,186 > 36,277 212,304 288,212 > 36,104 212,207 285,682 > 36,000 210,173 270,661 > 35,994 208,487 256,960 > 35,979 207,788 248,313 > 35,967 207,714 235,338 > 35,966 207,703 229,335 > 35,835 207,690 221,697 > 35,793 207,418 221,600 > 35,692 206,160 219,346 > 35,682 206,128 219,162 > 35,681 205,817 219,155 > 35,678 205,546 214,862 > 35,678 205,523 214,710 > 35,677 204,951 214,282 > 35,677 204,283 213,441 > 35,677 203,348 213,011 > 35,675 203,028 212,923 > 35,673 201,922 212,492 > 35,672 201,660 212,225 > 35,672 200,724 211,808 > 35,672 200,324 211,420 > 35,671 199,686 211,413 > 35,667 198,858 211,346 > 35,667 197,590 211,209 > --------------------------------------------------------------- > sum 1,081,515 6,217,964 7,268,000 > average 36,051 207,265 242,267 > stddev 655 7,010 42,234 > elapsed time (sec) 343.70 107.40 84.34 > sys time (sec) 269.30 2,520.13 1,696.20 > memcg.high breaches 443,672 475,074 623,333 > zswpout 22,605 48,931,249 54,777,100 > pswpout 40,004,528 0 0 > hugepages-64K zswpout 0 3,057,090 3,421,855 > hugepages-64K swpout 2,500,283 0 0 > --------------------------------------------------------------- > > As you can see, this is quite a memory-constrained scenario, where we > are giving a 50% of total memory required, as the memory limit for the > cgroup in which the 30 processes are run. This causes significantly more > reclaim activity than the setup I was using thus far (70 processes, 1G, > 40G limit). > > The variance or "imbalance" reduces somewhat for zstd, but not for IAA. > > IAA shows really good throughput (17%) and elapsed time (21%) and > sys time (33%) improvement wrt zstd with zswap_store of large folios. > These are the memory-constrained scenarios in which IAA typically > does really well. IAA verify_compress is enabled, so this is an added > data integrity checks benefit we get with IAA. > > I would like to get your and the maintainers' feedback on whether > I should switch to this "usemem30-10G" setup for v8? The results looks good to me. I suggest you to use it. -- Best Regards, Huang, Ying
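Since the "imbalance" question above comes down to the spread of the
per-process throughputs, the sum/average/stddev rows in the tables can be
recomputed from the values usemem prints; a small sketch, assuming the
per-process throughputs have been collected one per line into a file named
throughput.txt (an illustrative name):

  # Mean and population standard deviation of per-process throughput (KB/s).
  # gsub() strips thousands separators such as "181,540" before summing.
  awk '{ gsub(/,/, "", $1); n++; sum += $1; sumsq += $1 * $1 }
       END { mean = sum / n;
             printf "n=%d sum=%.0f mean=%.0f stddev=%.0f\n",
                    n, sum, mean, sqrt(sumsq / n - mean * mean) }' throughput.txt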
> -----Original Message-----
> From: Huang, Ying <ying.huang@intel.com>
> Sent: Wednesday, September 25, 2024 11:48 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali,
> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
>
> "Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes:
>
> > Hi Ying,
> >
[snip]
> >
> > I would like to get your and the maintainers' feedback on whether
> > I should switch to this "usemem30-10G" setup for v8?
>
> The results look good to me.  I suggest you use it.

Ok, sure, thanks Ying.

Thanks,
Kanchana

>
> --
> Best Regards,
> Huang, Ying
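For reference, the counters reported in the tables above are sampled from
standard interfaces; a rough sketch of collecting them before and after a
run, assuming the same hypothetical cgroup path used in the earlier sketch
(the per-size zswpout counter under hugepages-64kB/stats is the one
introduced by this patch series):

  # Global swap/zswap counters used in the tables.
  grep -E '^(pswpin|pswpout|zswpin|zswpout|thp_swpout|thp_swpout_fallback|pgmajfault) ' /proc/vmstat

  # Per-size (64K mTHP) counters: swpout/swpout_fallback exist in recent
  # kernels; zswpout is added by this series.
  grep . /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout \
         /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout_fallback \
         /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/zswpout

  # memcg.high breaches ("high" events) for the test cgroup.
  grep '^high ' /sys/fs/cgroup/zswap-test/memory.events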