RE: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios

Sridhar, Kanchana P posted 3 patches 2 months, 1 week ago
Posted by Sridhar, Kanchana P 2 months, 1 week ago
> -----Original Message-----
> From: Huang, Ying <ying.huang@intel.com>
> Sent: Friday, September 20, 2024 2:29 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Yosry Ahmed <yosryahmed@google.com>; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; hannes@cmpxchg.org; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
> 
> "Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes:
> 
> [snip]
> 
> >
> > Thanks, these are good points. I ran this experiment with mm-unstable
> > 9-17-2024, commit 248ba8004e76eb335d7e6079724c3ee89a011389.
> >
> > Data is based on average of 3 runs of the vm-scalability "usemem" test.
> >
> >  4G SSD backing zswap, each process sleeps before exiting
> >  ========================================================
> >
> >  64KB mTHP (cgroup memory.high set to 60G, no swap limit):
> >  =========================================================
> >  CONFIG_THP_SWAP=Y
> >  Sapphire Rapids server with 503 GiB RAM and 4G SSD swap backing device
> >  for zswap.
> >
> >  Experiment 1: Each process sleeps for 0 sec after allocating memory
> >  (usemem --init-time -w -O --sleep 0 -n 70 1g):
> >
> >  -------------------------------------------------------------------------------
> >                     mm-unstable 9-17-2024           zswap-mTHP v6     Change wrt
> >                                  Baseline                               Baseline
> >                                  "before"                 "after"      (sleep 0)
> >  -------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
> >                                       iaa                     iaa            iaa
> >  -------------------------------------------------------------------------------
> >  Throughput (KB/s)   296,684      274,207     359,722     390,162    21%     42%
> >  sys time (sec)        92.67        93.33      251.06      237.56  -171%   -155%
> >  memcg_high            3,503        3,769      44,425      27,154
> >  memcg_swap_fail           0            0     115,814     141,936
> >  pswpin                   17            0           0           0
> >  pswpout             370,853      393,232           0           0
> >  zswpin                  693          123         666         667
> >  zswpout               1,484          123   1,366,680   1,199,645
> >  thp_swpout                0            0           0           0
> >  thp_swpout_               0            0           0           0
> >   fallback
> >  pgmajfault            3,384        2,951       3,656       3,468
> >  ZSWPOUT-64kB            n/a          n/a      82,940      73,121
> >  SWPOUT-64kB          23,178       24,577           0           0
> >  -------------------------------------------------------------------------------
> >
> >
> >  Experiment 2: Each process sleeps for 10 sec after allocating memory
> >  (usemem --init-time -w -O --sleep 10 -n 70 1g):
> >
> >  -------------------------------------------------------------------------------
> >                     mm-unstable 9-17-2024           zswap-mTHP v6     Change wrt
> >                                  Baseline                               Baseline
> >                                  "before"                 "after"     (sleep 10)
> >  -------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
> >                                       iaa                     iaa            iaa
> >  -------------------------------------------------------------------------------
> >  Throughput (KB/s)    86,744       93,730     157,528     113,110    82%     21%
> >  sys time (sec)       308.87       315.29      477.55      629.98   -55%   -100%
> 
> What is the elapsed time for all cases?

Sure, listed below is the data for both experiments with elapsed time in row 2:

 4G SSD backing zswap, each process sleeps before exiting
 ========================================================

 64KB mTHP (cgroup memory.high set to 60G, no swap limit):
 =========================================================
 CONFIG_THP_SWAP=Y
 Sapphire Rapids server with 503 GiB RAM and 4G SSD swap backing device
 for zswap.

 Experiment 1: Each process sleeps for 0 sec after allocating memory
 (usemem --init-time -w -O --sleep 0 -n 70 1g):

 -------------------------------------------------------------------------------
                    mm-unstable 9-17-2024           zswap-mTHP v6     Change wrt
                                 Baseline                               Baseline
                                 "before"                 "after"      (sleep 0)
 -------------------------------------------------------------------------------
 ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
                                      iaa                     iaa            iaa
 -------------------------------------------------------------------------------
 Throughput (KB/s)   296,684      274,207     359,722     390,162    21%     42%
 elapsed time (sec)     4.91         4.80        4.42        5.08    10%     -6%
 sys time (sec)        92.67        93.33      251.06      237.56  -171%   -155%
 memcg_high            3,503        3,769      44,425      27,154
 memcg_swap_fail           0            0     115,814     141,936
 pswpin                   17            0           0           0
 pswpout             370,853      393,232           0           0
 zswpin                  693          123         666         667
 zswpout               1,484          123   1,366,680   1,199,645
 thp_swpout                0            0           0           0
 thp_swpout_               0            0           0           0
  fallback
 pgmajfault            3,384        2,951       3,656       3,468
 ZSWPOUT-64kB            n/a          n/a      82,940      73,121
 SWPOUT-64kB          23,178       24,577           0           0
 -------------------------------------------------------------------------------


 Experiment 2: Each process sleeps for 10 sec after allocating memory
 (usemem --init-time -w -O --sleep 10 -n 70 1g):

 -------------------------------------------------------------------------------
                    mm-unstable 9-17-2024           zswap-mTHP v6     Change wrt
                                 Baseline                               Baseline
                                 "before"                 "after"     (sleep 10)
 -------------------------------------------------------------------------------
 ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
                                      iaa                     iaa            iaa
 -------------------------------------------------------------------------------
 Throughput (KB/s)    86,744       93,730     157,528     113,110    82%     21%
 elapsed time (sec)    30.24        31.73       33.39       32.50   -10%     -2%
 sys time (sec)       308.87       315.29      477.55      629.98   -55%   -100%
 memcg_high          169,450      188,700     143,691     177,887
 memcg_swap_fail  10,131,859    9,740,646  18,738,715  19,528,110
 pswpin                   17           16           0           0
 pswpout           1,154,779    1,210,485           0           0
 zswpin                  711          659       1,016         736
 zswpout              70,212       50,128   1,235,560   1,275,917
 thp_swpout                0            0           0           0
 thp_swpout_               0            0           0           0
  fallback
 pgmajfault            6,120        6,291       8,789       6,474
 ZSWPOUT-64kB            n/a          n/a      67,587      68,912
 SWPOUT-64kB          72,174       75,655           0           0
 -------------------------------------------------------------------------------
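For reference, the "Change wrt Baseline" columns can be reproduced with a
small sketch like the one below. This is only an illustration: the helper
names are hypothetical, and it assumes the convention that a positive value
is an improvement (higher throughput, lower elapsed/sys time).

```python
def change_higher_is_better(before, after):
    """Percent change for metrics where larger is better (throughput)."""
    return round((after - before) / before * 100)


def change_lower_is_better(before, after):
    """Percent change for metrics where smaller is better (elapsed/sys time)."""
    return round((before - after) / before * 100)


# Experiment 1 ("sleep 0"), zstd columns from the table above.
throughput = change_higher_is_better(296_684, 359_722)
elapsed = change_lower_is_better(4.91, 4.42)
sys_time = change_lower_is_better(92.67, 251.06)

print(throughput, elapsed, sys_time)
```

With this sign convention, a negative sys time change means the "after"
kernel spent more time in the kernel than the baseline.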


> 
> >  memcg_high          169,450      188,700     143,691     177,887
> >  memcg_swap_fail  10,131,859    9,740,646  18,738,715  19,528,110
> >  pswpin                   17           16           0           0
> >  pswpout           1,154,779    1,210,485           0           0
> >  zswpin                  711          659       1,016         736
> >  zswpout              70,212       50,128   1,235,560   1,275,917
> >  thp_swpout                0            0           0           0
> >  thp_swpout_               0            0           0           0
> >   fallback
> >  pgmajfault            6,120        6,291       8,789       6,474
> >  ZSWPOUT-64kB            n/a          n/a      67,587      68,912
> >  SWPOUT-64kB          72,174       75,655           0           0
> >  -------------------------------------------------------------------------------
> >
> >
> > Conclusions from the experiments:
> > =================================
> > 1) zswap-mTHP improves throughput as compared to the baseline, for zstd
> >    and deflate-iaa.
> >
> > 2) Yosry's theory is proved correct in the 4G constrained swap setup.
> >    When the processes are constrained to sleep 10 sec after allocating
> >    memory, thereby keeping the memory allocated longer, the "Baseline" or
> >    "before" with mTHP getting stored in SSD shows a degradation of 71% in
> >    throughput and 238% in sys time, as compared to the "Baseline" with
> 
> Higher sys time may come from compression with CPU vs. disk writing?
> 

Here, I was comparing the "before" sys times between the "sleep 10" and
"sleep 0" experiments, where mTHP folios get stored to SSD. I was trying
to understand the increase in "before" sys time in "sleep 10", and my
analysis was that this could be due to the following cycle of events:

  memory remaining allocated longer, any reclaimed memory per process
  is mostly cold memory and is not paged back in (17 pswpin for zstd),
  swap slots are not released,
  swap slot allocation failures,
  folios in the reclaim list returned to being active,
  more swapout activity in "before"/"sleep 10" (1,224,991 zstd) as
   compared to "before"/"sleep 0" (372,337 zstd),
  more sys time in "before"/"sleep 10" as compared to "before"/"sleep 0".

IOW, my takeaway from only the "before" experiments with sleep 10
vs. sleep 0 was the higher swapout activity resulting in increased
sys time.
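
The swapout totals for the zstd "before" runs are simply pswpout + zswpout
from the tables. A quick sketch of that arithmetic, with illustrative
variable names that are not part of any tool:

```python
# Counters for the zstd "before" (mm-unstable baseline) runs, from the tables.
before_zstd = {
    "sleep 0": {"pswpout": 370_853, "zswpout": 1_484},
    "sleep 10": {"pswpout": 1_154_779, "zswpout": 70_212},
}


def total_swapouts(counters):
    """Pages swapped out to either the SSD (pswpout) or zswap (zswpout)."""
    return counters["pswpout"] + counters["zswpout"]


totals = {name: total_swapouts(c) for name, c in before_zstd.items()}
ratio = totals["sleep 10"] / totals["sleep 0"]

print(totals, f"{ratio:.1f}x")
```

So "sleep 10" sees roughly 3.3x the swapout activity of "sleep 0" in the
baseline, which is about the same factor as the sys time increase.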

The zswap-mTHP "after" experiments don't show significantly higher
successful swapout activity between "sleep 10" vs. "sleep 0". This is
not to say that the above cycle of events does not occur here as well,
as indicated by the higher memcg_swap_fail counts, signifying
attempted swapouts.

However, the zswap-mTHP "after" sys time increase going from
"sleep 0" to "sleep 10" is not as bad as that for "before":


   "before" = 4G SSD mTHP
   "after" = zswap-mTHP

 -------------------------------------------------------------------------
                           mm-unstable 9-17-2024             zswap-mTHP v6 
                                        Baseline
                                        "before"                   "after" 
 -------------------------------------------------------------------------
 ZSWAP compressor              zstd  deflate-iaa       zstd    deflate-iaa
 -------------------------------------------------------------------------
 "sleep 0"  sys time (sec)    92.67        93.33     251.06         237.56
 "sleep 10" sys time (sec)   308.87       315.29     477.55         629.98
 -------------------------------------------------------------------------
 "sleep 10" sys time          -233%        -238%       -90%          -165%
  vs. "sleep 0"
 -------------------------------------------------------------------------
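
The last row of this table can be recomputed as shown below. This is a
sketch only; the dictionary layout and function name are made up for
illustration. It reports the increase as a positive percentage, whereas the
table shows the same magnitudes with a negative sign to mark a degradation.

```python
# sys time in seconds as (sleep 0, sleep 10), keyed by (kernel, compressor).
sys_time = {
    ("before", "zstd"): (92.67, 308.87),
    ("before", "deflate-iaa"): (93.33, 315.29),
    ("after", "zstd"): (251.06, 477.55),
    ("after", "deflate-iaa"): (237.56, 629.98),
}


def increase_pct(sleep0, sleep10):
    """Percent increase in sys time going from "sleep 0" to "sleep 10"."""
    return round((sleep10 - sleep0) / sleep0 * 100)


increase = {key: increase_pct(*times) for key, times in sys_time.items()}

for (kernel, compressor), pct in increase.items():
    print(f"{kernel:>6} {compressor:<12} +{pct}%")
```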


> >    sleep 0 that benefits from serialization of disk IO not allowing all
> >    processes to allocate memory at the same time.
> >
> > 3) In the 4G SSD "sleep 0" case, zswap-mTHP shows an increase in sys time
> >    due to the cgroup charging and consequently higher memcg.high breaches
> >    and swapout activity.
> >
> >    However, the "sleep 10" case's sys time seems to degrade less, and the
> >    memcg.high breaches and swapout activity are almost similar between
> >    the before/after (confirming Yosry's hypothesis). Further, the
> >    memcg_swap_fail activity in the "after" scenario is almost 2X that of
> >    the "before". This indicates failure to obtain swap offsets, resulting
> >    in the folio remaining active in memory.
> >
> >    I tried to better understand this through the 64k mTHP swpout_fallback
> >    stats in the "sleep 10" zstd experiments:
> >
> >    --------------------------------------------------------------
> >                                            "before"       "after"
> >    --------------------------------------------------------------
> >    64k mTHP swpout_fallback                 627,308       897,407
> >    64k folio swapouts                        72,174        67,587
> >    [p|z]swpout events due to 64k mTHP     1,154,779     1,081,397
> >    4k folio swapouts                         70,212       154,163
> >    --------------------------------------------------------------
> >
> >    The data indicates a higher # of 64k folio swpout_fallback with
> >    zswap-mTHP, which correlates with the higher memcg_swap_fail counts
> >    and 4k folio swapouts with zswap-mTHP. Could the root-cause be
> >    fragmentation of the swap space due to zswap swapout being faster
> >    than SSD swapout?
> >
> 
> [snip]
> 
> --
> Best Regards,
> Huang, Ying