> -----Original Message-----
> From: Huang, Ying <ying.huang@intel.com>
> Sent: Wednesday, September 25, 2024 11:48 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali,
> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
>
> "Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes:
>
> > Hi Ying,
> >
> >> -----Original Message-----
> >> From: Huang, Ying <ying.huang@intel.com>
> >> Sent: Wednesday, September 25, 2024 5:45 PM
> >> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> >> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> >> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> >> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> >> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com;
> >> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali,
> >> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> >> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
> >>
> >> "Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes:
> >>
> >> >> -----Original Message-----
> >> >> From: Huang, Ying <ying.huang@intel.com>
> >> >> Sent: Tuesday, September 24, 2024 11:35 PM
> >> >> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> >> >> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> >> >> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> >> >> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> >> >> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com;
> >> >> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali,
> >> >> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> >> >> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
> >> >>
> >> >> Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes:
> >> >>
> >> >> [snip]
> >> >>
> >> >> >
> >> >> > Case 1: Comparing zswap 4K vs. zswap mTHP
> >> >> > =========================================
> >> >> >
> >> >> > In this scenario, the "before" is CONFIG_THP_SWAP set to off, which results
> >> >> > in 64K/2M (m)THP being split into 4K folios that get processed by zswap.
> >> >> >
> >> >> > The "after" is CONFIG_THP_SWAP set to on, plus this patch-series, which
> >> >> > results in 64K/2M (m)THP not being split, and instead being processed
> >> >> > whole by zswap.
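[ Aside for reviewers: CONFIG_THP_SWAP is a build-time option, so the
  "before"/"after" kernels differ only in that config; the remaining knobs
  are set at runtime. A minimal sketch of how those runtime knobs might be
  set before a run is below. The sysfs paths assume a kernel with per-size
  mTHP controls and the zswap module parameters; the write_knob() helper is
  purely illustrative and not part of the patch-series or the test harness. ]

    # Illustrative sketch only; needs root to write the sysfs knobs.
    def write_knob(path, value):
        # sysfs knobs take a short string; open/write/close is sufficient.
        with open(path, "w") as f:
            f.write(value)

    write_knob("/sys/module/zswap/parameters/enabled", "Y")
    # Compressor under test: "zstd" or "deflate-iaa".
    write_knob("/sys/module/zswap/parameters/compressor", "zstd")
    # Allow 64K mTHP allocations so reclaim sees large folios instead of 4K pages.
    write_knob("/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled", "always")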
> >> >> >
> >> >> > 64KB mTHP (cgroup memory.high set to 40G):
> >> >> > ==========================================
> >> >> >
> >> >> > -------------------------------------------------------------------------------
> >> >> >                     mm-unstable 9-23-2024     zswap-mTHP          Change wrt
> >> >> >                     CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y      Baseline
> >> >> >                     Baseline
> >> >> > -------------------------------------------------------------------------------
> >> >> > ZSWAP compressor      zstd   deflate-         zstd   deflate-    zstd  deflate-
> >> >> >                              iaa                     iaa               iaa
> >> >> > -------------------------------------------------------------------------------
> >> >> > Throughput (KB/s)  143,323    125,485      153,550    129,609      7%     3%
> >> >> > elapsed time (sec)   24.97      25.42        23.90      25.19      4%     1%
> >> >> > sys time (sec)      822.72     750.96       757.70     731.13      8%     3%
> >> >> > memcg_high         132,743    169,825      148,075    192,744
> >> >> > memcg_swap_fail    639,067    841,553        2,204      2,215
> >> >> > pswpin                   0          0            0          0
> >> >> > pswpout                  0          0            0          0
> >> >> > zswpin                 795        873          760        902
> >> >> > zswpout         10,011,266 13,195,137   10,010,017 13,193,554
> >> >> > thp_swpout               0          0            0          0
> >> >> > thp_swpout_              0          0            0          0
> >> >> >  fallback
> >> >> > 64kB-mthp_         639,065    841,553        2,204      2,215
> >> >> >  swpout_fallback
> >> >> > pgmajfault           2,861      2,924        3,054      3,259
> >> >> > ZSWPOUT-64kB           n/a        n/a      623,451    822,268
> >> >> > SWPOUT-64kB              0          0            0          0
> >> >> > -------------------------------------------------------------------------------
> >> >> >
> >> >>
> >> >> IIUC, the throughput is the sum of the throughput of all usemem processes?
> >> >>
> >> >> One possible issue with the usemem test case is the "imbalance" issue. That
> >> >> is, some usemem processes may swap-out/swap-in less, so their score is
> >> >> very high; while some other processes may swap-out/swap-in more, so their
> >> >> score is very low. Sometimes the total score decreases, but the scores
> >> >> of the usemem processes are more balanced, so the performance should be
> >> >> considered better. In general, we should make the usemem scores balanced
> >> >> among processes via, say, a longer test time. Can you check this
> >> >> in your test results?
> >> >
> >> > Actually, the throughput data listed in the cover-letter is the average of
> >> > all the usemem processes. Your observation about the "imbalance" issue is
> >> > right. Some processes see a higher throughput than others. I have noticed
> >> > that the throughputs progressively reduce as the individual processes exit
> >> > and print their stats.
> >> >
> >> > Listed below are the stats from two runs of usemem70: sleep 10 and sleep 30.
> >> > Both are run with a cgroup mem-limit of 40G. Data is with v7, 64K folios are
> >> > enabled, and zswap uses zstd.
> >> >
> >> >
> >> > -----------------------------------------------
> >> >      sleep 10              sleep 30
> >> >  Throughput (KB/s)     Throughput (KB/s)
> >> > -----------------------------------------------
> >> >       181,540               191,686
> >> >       179,651               191,459
> >> >       179,068               188,834
> >> >       177,244               187,568
> >> >       177,215               186,703
> >> >       176,565               185,584
> >> >       176,546               185,370
> >> >       176,470               185,021
> >> >       176,214               184,303
> >> >       176,128               184,040
> >> >       175,279               183,932
> >> >       174,745               180,831
> >> >       173,935               179,418
> >> >       161,546               168,014
> >> >       160,332               167,540
> >> >       160,122               167,364
> >> >       159,613               167,020
> >> >       159,546               166,590
> >> >       159,021               166,483
> >> >       158,845               166,418
> >> >       158,426               166,264
> >> >       158,396               166,066
> >> >       158,371               165,944
> >> >       158,298               165,866
> >> >       158,250               165,884
> >> >       158,057               165,533
> >> >       158,011               165,532
> >> >       157,899               165,457
> >> >       157,894               165,424
> >> >       157,839               165,410
> >> >       157,731               165,407
> >> >       157,629               165,273
> >> >       157,626               164,867
> >> >       157,581               164,636
> >> >       157,471               164,266
> >> >       157,430               164,225
> >> >       157,287               163,290
> >> >       156,289               153,597
> >> >       153,970               147,494
> >> >       148,244               147,102
> >> >       142,907               146,111
> >> >       142,811               145,789
> >> >       139,171               141,168
> >> >       136,314               140,714
> >> >       133,616               140,111
> >> >       132,881               139,636
> >> >       132,729               136,943
> >> >       132,680               136,844
> >> >       132,248               135,726
> >> >       132,027               135,384
> >> >       131,929               135,270
> >> >       131,766               134,748
> >> >       131,667               134,733
> >> >       131,576               134,582
> >> >       131,396               134,302
> >> >       131,351               134,160
> >> >       131,135               134,102
> >> >       130,885               134,097
> >> >       130,854               134,058
> >> >       130,767               134,006
> >> >       130,666               133,960
> >> >       130,647               133,894
> >> >       130,152               133,837
> >> >       130,006               133,747
> >> >       129,921               133,679
> >> >       129,856               133,666
> >> >       129,377               133,564
> >> >       128,366               133,331
> >> >       127,988               132,938
> >> >       126,903               132,746
> >> > -----------------------------------------------
> >> >  sum      10,526,916            10,919,561
> >> >  average     150,385               155,994
> >> >  stddev       17,551                19,633
> >> > -----------------------------------------------
> >> >  elapsed       24.40                 43.66
> >> >   time (sec)
> >> >  sys time     806.25                766.05
> >> >   (sec)
> >> >  zswpout  10,008,713            10,008,407
> >> >  64K folio   623,463               623,629
> >> >   swpout
> >> > -----------------------------------------------
> >>
> >> Although there is some imbalance, I don't find it to be too much. So, I
> >> think the test result is reasonable. Please pay attention to the
> >> imbalance issue in future tests.
> >
> > Sure, will do so.
> >
> >>
> >> > As we increase the time for which allocations are maintained,
> >> > there seems to be a slight improvement in throughput, but the
> >> > variance increases as well. The processes with lower throughput
> >> > could be the ones that handle the memcg being over its limit by
> >> > doing reclaim, possibly before they can allocate.
> >> >
> >> > Interestingly, the longer test time does seem to reduce the amount
> >> > of reclaim (hence the lower sys time), but more 64K large folios seem to
> >> > be reclaimed. Could this mean that with the longer test time (sleep 30),
> >> > more cold memory residing in large folios is getting reclaimed, as
> >> > against memory just relinquished by the exiting processes?
> >>
> >> I don't think a longer sleep time in the test helps much with balance. Can you
> >> try with fewer processes, and a larger memory size per process? I guess that
> >> this will improve balance.
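[ Aside: the sum/average/stddev rows in the tables above are derived from the
  per-process throughputs that usemem prints as each process exits. A minimal
  sketch of that summary step is below; it is illustrative only, and the list
  it takes would come from parsing the test log rather than from usemem
  itself. ]

    import statistics

    def summarize(throughputs_kb_per_s):
        """Summarize per-process usemem throughputs (KB/s)."""
        return {
            "sum": sum(throughputs_kb_per_s),
            "average": statistics.mean(throughputs_kb_per_s),
            # The spread across processes is what shows the "imbalance".
            "stddev": statistics.stdev(throughputs_kb_per_s),
        }

    # e.g., summarize([181540, 179651, ..., 126903]) for the "sleep 10" column.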
> >
> > I tried this, and the data is listed below:
> >
> > usemem options:
> > ---------------
> > 30 processes allocate 10G each
> > cgroup memory limit = 150G
> > sleep 10
> > 525Gi SSD disk swap partition
> > 64K large folios enabled
> >
> > Throughput (KB/s) of each of the 30 processes:
> > ---------------------------------------------------------------
> >                      mm-unstable   zswap_store of large folios
> >                        9-25-2024               v7
> > zswap compressor:           zstd        zstd     deflate-iaa
> > ---------------------------------------------------------------
> >                           38,393     234,485         374,427
> >                           37,283     215,528         314,225
> >                           37,156     214,942         304,413
> >                           37,143     213,073         304,146
> >                           36,814     212,904         290,186
> >                           36,277     212,304         288,212
> >                           36,104     212,207         285,682
> >                           36,000     210,173         270,661
> >                           35,994     208,487         256,960
> >                           35,979     207,788         248,313
> >                           35,967     207,714         235,338
> >                           35,966     207,703         229,335
> >                           35,835     207,690         221,697
> >                           35,793     207,418         221,600
> >                           35,692     206,160         219,346
> >                           35,682     206,128         219,162
> >                           35,681     205,817         219,155
> >                           35,678     205,546         214,862
> >                           35,678     205,523         214,710
> >                           35,677     204,951         214,282
> >                           35,677     204,283         213,441
> >                           35,677     203,348         213,011
> >                           35,675     203,028         212,923
> >                           35,673     201,922         212,492
> >                           35,672     201,660         212,225
> >                           35,672     200,724         211,808
> >                           35,672     200,324         211,420
> >                           35,671     199,686         211,413
> >                           35,667     198,858         211,346
> >                           35,667     197,590         211,209
> > ---------------------------------------------------------------
> > sum                    1,081,515   6,217,964       7,268,000
> > average                   36,051     207,265         242,267
> > stddev                       655       7,010          42,234
> > elapsed time (sec)        343.70      107.40           84.34
> > sys time (sec)            269.30    2,520.13        1,696.20
> > memcg.high breaches      443,672     475,074         623,333
> > zswpout                   22,605  48,931,249      54,777,100
> > pswpout               40,004,528           0               0
> > hugepages-64K zswpout          0   3,057,090       3,421,855
> > hugepages-64K swpout   2,500,283           0               0
> > ---------------------------------------------------------------
> >
> > As you can see, this is quite a memory-constrained scenario, where we
> > are giving 50% of the total memory required as the memory limit for the
> > cgroup in which the 30 processes are run. This causes significantly more
> > reclaim activity than the setup I was using thus far (70 processes, 1G,
> > 40G limit).
> >
> > The variance or "imbalance" reduces somewhat for zstd, but not for IAA.
> >
> > IAA shows really good improvements in throughput (17%), elapsed time (21%)
> > and sys time (33%) wrt zstd with zswap_store of large folios. These are the
> > memory-constrained scenarios in which IAA typically does really well. IAA
> > verify_compress is enabled, so this is an added data-integrity-check benefit
> > we get with IAA.
> >
> > I would like to get your and the maintainers' feedback on whether
> > I should switch to this "usemem30-10G" setup for v8?
> >
> The results look good to me.  I suggest you use it.

Ok, sure, thanks Ying.

Thanks,
Kanchana

>
> --
> Best Regards,
> Huang, Ying