RE: [PATCH v8 0/8] mm: zswap swap-out of large folios

Posted by Sridhar, Kanchana P 2 months ago
Hi Yosry,

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Friday, September 27, 2024 7:25 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; Ryan Roberts
> <ryan.roberts@arm.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v8 0/8] mm: zswap swap-out of large folios
> 
> On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Hi All,
> >
> > This patch-series enables zswap_store() to accept and store large
> > folios. The most significant contribution in this series is from the
> > earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> > migrated to mm-unstable as of 9-27-2024 in patch 6 of this series, and
> > adapted based on code review comments received for v7 of the current
> > patch-series.
> >
> > [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> >      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> >
> > The first few patches do the prep work for supporting large folios in
> > zswap_store. Patch 6 provides the main functionality to swap-out large
> > folios in zswap. Patch 7 adds sysfs per-order hugepages "zswpout" counters
> > that get incremented upon successful zswap_store of large folios:
> >
> > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
> >
> > Patch 8 updates the documentation for the new sysfs "zswpout" counters.
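> >
> > For example, the per-order counters for the 64K and 2M folio sizes used in
> > the testing below can be read with (other enabled hugepage sizes have
> > analogous files):
> >
> >     cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/zswpout
> >     cat /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/stats/zswpout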
> >
> > This patch-series is a pre-requisite for zswap compress batching of large
> > folio swap-out and decompress batching of swap-ins based on
> > swapin_readahead(), using Intel IAA hardware acceleration, which we would
> > like to submit in subsequent patch-series, with performance improvement
> > data.
> >
> > Thanks to Ying Huang for pre-posting review feedback and suggestions!
> >
> > Thanks also to Nhat, Yosry, Johannes, Barry, Chengming, Usama and Ying for
> > their helpful feedback, data reviews and suggestions!
> >
> > Co-development signoff request:
> > ===============================
> > I would like to thank Ryan Roberts for his original RFC [1] and request
> > his co-developer signoff on patch 6 in this series. Thanks Ryan!
> 
> 
> Ryan, could you help Kanchana out with a Signed-off-by please :)
> 
> >
> >
> >
> > System setup for testing:
> > =========================
> > Testing of this patch-series was done with mm-unstable as of 9-27-2024,
> > commit de2fbaa6d9c3576ec7133ed02a370ec9376bf000. Data was gathered
> > without/with this patch-series, on an Intel Sapphire Rapids server,
> > dual-socket 56 cores per socket, 4 IAA devices per socket, 503 GiB RAM and
> > 525G SSD disk partition swap. Core frequency was fixed at 2500MHz.
> >
> > The vm-scalability "usemem" test was run in a cgroup whose memory.high
> > was fixed at 150G. There is no swap limit set for the cgroup. 30 usemem
> > processes were run, each allocating and writing 10G of memory, and sleeping
> > for 10 sec before exiting:
> >
> > usemem --init-time -w -O -s 10 -n 30 10g
> >
> > Other kernel configuration parameters:
> >
> >     zswap compressors : zstd, deflate-iaa
> >     zswap allocator   : zsmalloc
> >     vm.page-cluster   : 2
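> >
> > A minimal sketch of how this configuration can be applied (the cgroup-v2
> > path and cgroup name are illustrative; the zswap parameters are set via the
> > standard module-parameter interface):
> >
> >     echo zstd > /sys/module/zswap/parameters/compressor    # or deflate-iaa
> >     echo zsmalloc > /sys/module/zswap/parameters/zpool
> >     sysctl vm.page-cluster=2
> >
> >     mkdir /sys/fs/cgroup/usemem-test
> >     echo 150G > /sys/fs/cgroup/usemem-test/memory.high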
> >
> > In the experiments where "deflate-iaa" is used as the zswap compressor,
> > IAA "compression verification" is enabled by default
> > (cat /sys/bus/dsa/drivers/crypto/verify_compress). Hence each IAA
> > compression will be decompressed internally by the "iaa_crypto" driver, the
> > CRCs returned by the hardware will be compared, and errors reported in case
> > of mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> > compared to the software compressors, and the experimental data listed
> > below is with verify_compress set to "1".
> >
> > Total and average throughput are derived from the individual 30 processes'
> > throughputs reported by usemem. Elapsed/sys times are measured with perf.
> >
> > The vm stats and sysfs hugepages stats included with the performance data
> > provide details on the swapout activity to zswap/swap device.
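> >
> > These can be sampled before and after each run, along the lines of (the
> > cgroup path follows the sketch above; memcg_high corresponds to the "high"
> > field in memory.events):
> >
> >     grep -E 'pswpout|pswpin|zswpout|zswpin|thp_swpout|swap_ra|pgmajfault' /proc/vmstat
> >     grep high /sys/fs/cgroup/usemem-test/memory.events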
> >
> >
> > Testing labels used in data summaries:
> > ======================================
> > The data refers to these test configurations and the before/after
> > comparisons that they do:
> >
> >  before-case1:
> >  -------------
> >  mm-unstable 9-27-2024, CONFIG_THP_SWAP=N (compares zswap 4K vs. zswap 64K)
> >
> >  In this scenario, CONFIG_THP_SWAP=N results in 64K/2M folios being split
> >  into 4K folios that get processed by zswap.
> >
> >  before-case2:
> >  -------------
> >  mm-unstable 9-27-2024, CONFIG_THP_SWAP=Y (compares SSD swap large folios
> >  vs. zswap large folios)
> >
> >  In this scenario, CONFIG_THP_SWAP=Y results in zswap rejecting large
> >  folios, which will then be stored by the SSD swap device.
> >
> >  after:
> >  ------
> >  v8 of this patch-series, CONFIG_THP_SWAP=Y
> >
> >  The "after" is CONFIG_THP_SWAP=Y and v8 of this patch-series, that
> results
> >  in 64K/2M folios to not be split, and to be processed by zswap_store.
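> >
> >  The only kernel-config difference between the cases is CONFIG_THP_SWAP,
> >  e.g. (scripts/config is part of the kernel tree; a rebuild is needed after
> >  toggling):
> >
> >      scripts/config --disable THP_SWAP    # before-case1
> >      scripts/config --enable THP_SWAP     # before-case2 and after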
> >
> >
> > Regression Testing:
> > ===================
> > I ran vm-scalability usemem without large folios, i.e., only 4K folios with
> > mm-unstable and this patch-series. The main goal was to make sure that
> > there is no functional or performance regression wrt the earlier zswap
> > behavior for 4K folios, now that 4K folios will be processed by the new
> > zswap_store() code.
> >
> > The data indicates there is no significant regression.
> >
> >  -------------------------------------------------------------------------------
> >  4K folios:
> >  ==========
> >
> >  zswap compressor                zstd          zstd        zstd  zstd v8 zstd v8
> >                          before-case1  before-case2       after      vs.     vs.
> >                                                                    case1   case2
> >  -------------------------------------------------------------------------------
> >  Total throughput (KB/s)    4,793,363     4,880,978   4,813,151       0%     -1%
> >  Average throughput (KB/s)    159,778       162,699     160,438       0%     -1%
> >  elapsed time (sec)            130.14        123.17      127.21       2%     -3%
> >  sys time (sec)              3,135.53      2,985.64    3,110.53       1%     -4%
> >
> >  memcg_high                   446,826       444,626     448,231
> >  memcg_swap_fail                    0             0           0
> >  pswpout                            0             0           0
> >  pswpin                             0             0           0
> >  zswpout                   48,932,107    48,931,971  48,931,584
> >  zswpin                           383           386         388
> >  thp_swpout                         0             0           0
> >  thp_swpout_fallback                0             0           0
> >  64kB-mthp_swpout_fallback          0             0           0
> >  pgmajfault                     3,063         3,077       3,082
> >  swap_ra                           93            94          93
> >  swap_ra_hit                       47            47          47
> >  ZSWPOUT-64kB                     n/a           n/a           0
> >  SWPOUT-64kB                        0             0           0
> >  -------------------------------------------------------------------------------
> >
> >
> > Performance Testing:
> > ====================
> >
> > We list the data for 64K folios with before/after data per-compressor,
> > followed by the same for 2M pmd-mappable folios.
> >
> >
> >  -------------------------------------------------------------------------------
> >  64K folios: zstd:
> >  =================
> >
> >  zswap compressor                zstd          zstd         zstd      zstd v8
> >                          before-case1  before-case2        after     vs.    vs.
> >                                                                    case1  case2
> >  -------------------------------------------------------------------------------
> >  Total throughput (KB/s)    5,222,213     1,076,611    6,227,367     19%    478%
> >  Average throughput (KB/s)    174,073        35,887      207,578     19%    478%
> >  elapsed time (sec)            120.50        347.16       109.21      9%     69%
> 
> 
> The diff here is supposed to be negative, right?
> (Same for the below results)

So this is supposed to be positive, to indicate the throughput improvement
[(new-old)/old] with v8 as compared to before-case1 and before-case2.
For latency, a positive value indicates a reduction in latency, since I
calculate [(old-new)/old]. This is the convention used throughout.

Based on this convention, positive percentages are improvements in both
throughput and latency.
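
For example, with the 64K zstd data quoted above: total throughput improves
from 5,222,213 to 6,227,367 KB/s, i.e. (6,227,367 - 5,222,213) / 5,222,213
= ~19% vs. case1; and elapsed time drops from 120.50 to 109.21 sec, i.e.
(120.50 - 109.21) / 120.50 = ~9% vs. case1.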

> 
> Otherwise the results are looking really good, we have come a long way
> since the first version :)
> 
> Thanks for working on this! I will look at individual patches later
> today or early next week.

Many thanks Yosry :)  I immensely appreciate your, Nhat's, Johannes', Ying's
and others' help in getting here!

Sure, this sounds good.

Thanks,
Kanchana

> 
> >
> >  sys time (sec)              2,930.33        248.16     2,609.22     11%   -951%
> >  memcg_high                   416,773       552,200      482,703
> >  memcg_swap_fail            3,192,906         1,293          944
> >  pswpout                            0    40,778,448            0
> >  pswpin                             0            16            0
> >  zswpout                   48,931,583        20,903   48,931,271
> >  zswpin                           384           363          392
> >  thp_swpout                         0             0            0
> >  thp_swpout_fallback                0             0            0
> >  64kB-mthp_swpout_fallback  3,192,906         1,293          944
> >  pgmajfault                     3,452         3,072        3,095
> >  swap_ra                           90            87          100
> >  swap_ra_hit                       42            43           56
> >  ZSWPOUT-64kB                     n/a           n/a    3,057,260
> >  SWPOUT-64kB                        0     2,548,653            0
> >  -------------------------------------------------------------------------------
> >
> >
> >  -------------------------------------------------------------------------------
> >  64K folios: deflate-iaa:
> >  ========================
> >
> >  zswap compressor         deflate-iaa   deflate-iaa  deflate-iaa  deflate-iaa v8
> >                          before-case1  before-case2        after      vs.    vs.
> >                                                                    case1   case2
> >  -------------------------------------------------------------------------------
> >  Total throughput (KB/s)    5,652,608     1,089,180    6,315,000     12%    480%
> >  Average throughput (KB/s)    188,420        36,306      210,500     12%    480%
> >  elapsed time (sec)            102.90        343.35        91.11     11%     73%
> >
> >
> >  sys time (sec)              2,246.86        213.53     1,939.31     14%   -808%
> >  memcg_high                   576,104       502,907      612,505
> >  memcg_swap_fail            4,016,117         1,407        1,660
> >  pswpout                            0    40,862,080            0
> >  pswpin                             0            20            0
> >  zswpout                   61,163,423        22,444   57,317,607
> >  zswpin                           401           368          449
> >  thp_swpout                         0             0            0
> >  thp_swpout_fallback                0             0            0
> >  64kB-mthp_swpout_fallback  4,016,117         1,407        1,660
> >  pgmajfault                     3,063         3,153        3,167
> >  swap_ra                           96            93          149
> >  swap_ra_hit                       46            45           89
> >  ZSWPOUT-64kB                     n/a           n/a    3,580,673
> >  SWPOUT-64kB                        0     2,553,880            0
> >  -------------------------------------------------------------------------------
> >
> >
> >  -------------------------------------------------------------------------------
> >  2M folios: zstd:
> >  ================
> >
> >  zswap compressor                zstd          zstd         zstd      zstd v8
> >                          before-case1  before-case2        after     vs.    vs.
> >                                                                    case1  case2
> >  -------------------------------------------------------------------------------
> >  Total throughput (KB/s)    5,895,500     1,109,694     6,460,111    10%    482%
> >  Average throughput (KB/s)    196,516        36,989       215,337    10%    482%
> >  elapsed time (sec)            108.77        334.28        105.92     3%     68%
> >
> >
> >  sys time (sec)              2,657.14         94.88      2,436.24     8%  -2468%
> >  memcg_high                    64,200        66,316        60,300
> >  memcg_swap_fail              101,182            70            30
> >  pswpout                            0    40,166,400             0
> >  pswpin                             0             0             0
> >  zswpout                   48,931,499        36,507    48,869,236
> >  zswpin                           380           379           397
> >  thp_swpout                         0        78,450             0
> >  thp_swpout_fallback          101,182            70            30
> >  2MB-mthp_swpout_fallback           0             0             0
> >  pgmajfault                     3,067         3,417         4,765
> >  swap_ra                           91            90         5,073
> >  swap_ra_hit                       45            45         5,024
> >  ZSWPOUT-2MB                      n/a           n/a        95,408
> >  SWPOUT-2MB                         0        78,450             0
> >  -------------------------------------------------------------------------------
> >
> >
> >  -------------------------------------------------------------------------------
> >  2M folios: deflate-iaa:
> >  =======================
> >
> >  zswap compressor         deflate-iaa   deflate-iaa  deflate-iaa  deflate-iaa v8
> >                          before-case1  before-case2        after      vs.    vs.
> >                                                                    case1   case2
> >  -------------------------------------------------------------------------------
> >  Total throughput (KB/s)   6,286,587      1,126,785     7,569,560    20%    572%
> >  Average throughput (KB/s)   209,552         37,559       252,318    20%    572%
> >  elapsed time (sec)            96.19         333.03         81.96    15%     75%
> >
> >  sys time (sec)             2,141.44          99.96      1,768.41    17%  -1669%
> >  memcg_high                   99,253         64,666        75,139
> >  memcg_swap_fail             129,074             53            73
> >  pswpout                           0     40,048,128             0
> >  pswpin                            0              0             0
> >  zswpout                  61,312,794         28,321    57,083,119
> >  zswpin                          383            406           447
> >  thp_swpout                        0         78,219             0
> >  thp_swpout_fallback         129,074             53            73
> >  2MB-mthp_swpout_fallback          0              0             0
> >  pgmajfault                    3,430          3,077         7,133
> >  swap_ra                          91            103        11,978
> >  swap_ra_hit                      47             46        11,920
> >  ZSWPOUT-2MB                     n/a            n/a       111,390
> >  SWPOUT-2MB                        0         78,219             0
> >  -------------------------------------------------------------------------------
> >
> > And finally, this is a comparison of deflate-iaa vs. zstd with v8 of this
> > patch-series:
> >
> >  ---------------------------------------------
> >                    zswap_store large folios v8
> >                   Impr w/ deflate-iaa vs. zstd
> >
> >                        64K folios    2M folios
> >  ---------------------------------------------
> >  Throughput (KB/s)             1%          17%
> >  elapsed time (sec)           17%          23%
> >  sys time (sec)               26%          27%
> >  ---------------------------------------------
> >
> >
> > Conclusions based on the performance results:
> > =============================================
> >
> >  v8 wrt before-case1:
> >  --------------------
> >  We see significant improvements in throughput, elapsed and sys time for
> >  zstd and deflate-iaa, when comparing before-case1 (THP_SWAP=N) vs. after
> >  (THP_SWAP=Y) with zswap_store of large folios.
> >
> >  v8 wrt before-case2:
> >  --------------------
> >  We see even more significant improvements in throughput and elapsed time
> >  for zstd and deflate-iaa, when comparing before-case2 (large-folio-SSD)
> >  vs. after (large-folio-zswap). The sys time increases with
> >  large-folio-zswap as expected, due to the CPU compression time
> >  vs. asynchronous disk write times, as pointed out by Ying and Yosry.
> >
> >  In before-case2, when zswap does not store large folios, only allocations
> >  and cgroup charging due to 4K folio zswap stores count towards the cgroup
> >  memory limit. However, in the after scenario, with the introduction of
> >  zswap_store() of large folios, there is an added component of the zswap
> >  compressed pool usage from large folio stores from potentially all 30
> >  processes, that gets counted towards the memory limit. As a result, we see
> >  higher swapout activity in the "after" data.
> >
> >
> > Summary:
> > ========
> > The v8 data presented above shows that zswap_store of large folios
> > demonstrates good throughput/performance improvements compared to
> > conventional SSD swap of large folios with a sufficiently large 525G SSD
> > swap device. Hence, it seems reasonable for zswap_store to support large
> > folios, so that further performance improvements can be implemented.
> >
> > In the experimental setup used in this patchset, we have enabled IAA
> > compress verification to ensure additional hardware data integrity CRC
> > checks not currently done by the software compressors. We see good
> > throughput/latency improvements with deflate-iaa vs. zstd with zswap_store
> > of large folios.
> >
> > Some of the ideas for further reducing latency that have shown promise in
> > our experiments are:
> >
> > 1) IAA compress/decompress batching.
> > 2) Distributing compress jobs across all IAA devices on the socket.
> >
> > The tests run for this patchset use only 1 IAA device per core, which
> > avails of the 2 compress engines on the device. In our experiments with IAA
> > batching, we distribute compress jobs from all cores to the 8 compress
> > engines available per socket. We further compress the pages in each folio
> > in parallel in the accelerator. As a result, we improve compress latency
> > and reclaim throughput.
> >
> > In decompress batching, we use swapin_readahead to generate a prefetch
> > batch of 4K folios that we decompress in parallel in IAA.
> >
> >  ------------------------------------------------------------------------------
> >                           IAA compress/decompress batching
> >               Further improvements wrt v8 zswap_store Sequential
> >                           subpage store using "deflate-iaa":
> >
> >                       "deflate-iaa" Batching  "deflate-iaa-canned" [2] Batching
> >                           Additional Impr               Additional Impr
> >                      64K folios    2M folios     64K folios    2M folios
> >  ------------------------------------------------------------------------------
> >  Throughput (KB/s)          35%          34%            44%          44%
> >  elapsed time (sec)          9%          10%            14%          17%
> >  sys time (sec)            0.4%           4%             8%          15%
> >  ------------------------------------------------------------------------------
> >
> >
> > With zswap IAA compress/decompress batching, we are able to demonstrate
> > significant performance improvements and memory savings in server
> > scalability experiments in highly contended system scenarios under
> > significant memory pressure, as compared to software compressors. We hope
> > to submit this work in subsequent patch series. The current patch-series is
> > a prerequisite for these future submissions.
> >
> > Thanks,
> > Kanchana
> >
> >
> > [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> > [2] https://patchwork.kernel.org/project/linux-crypto/cover/cover.1710969449.git.andre.glover@linux.intel.com/
> >
> >
> > Changes since v7:
> > =================
> > 1) Rebased to mm-unstable as of 9-27-2024,
> >    commit de2fbaa6d9c3576ec7133ed02a370ec9376bf000.
> > 2) Added Nhat's 'Reviewed-by' to patches 1 and 2. Thanks Nhat!
> > 3) Implemented one-time obj_cgroup_may_zswap and zswap_check_limits at the
> >    start of zswap_store. Implemented one-time batch updates to cgroup zswap
> >    charging (with total compressed bytes), zswap_stored_pages and the
> >    memcg/vm zswpout event stats (with folio_nr_pages()) only for successful
> >    stores at the end of zswap_store. Thanks Yosry and Johannes for guidance
> >    on this!
> > 4) Changed the existing zswap_pool_get() to zswap_pool_tryget(). Modified
> >    zswap_pool_current_get() and zswap_pool_find_get() to call
> >    zswap_pool_tryget(). Furthermore, zswap_store() obtains a reference to a
> >    valid zswap_pool upfront by calling zswap_pool_tryget(), and errors out
> >    if the tryget fails. Added a new zswap_pool_get() that calls
> >    "percpu_ref_get(&pool->ref)" and is called in zswap_store_page(), as
> >    suggested by Johannes & Yosry. Thanks both!
> > 5) Provided a new count_objcg_events() API for batch event updates.
> > 6) Changed "zswap_stored_pages" to atomic_long_t to support adding
> >    folio_nr_pages() to it once a large folio is stored successfully.
> > 7) Deleted the refactoring done in v7 for the xarray updates in
> >    zswap_store_page(); and unwinding of stored offsets in zswap_store() in
> >    case of errors, as suggested by Johannes.
> > 8) Deleted the CONFIG_ZSWAP_STORE_THP_DEFAULT_ON config option and
> >    "zswap_mthp_enabled" tunable, as recommended by Yosry, Johannes and
> >    Nhat.
> > 9) Replaced references to "mTHP" with "large folios"; organized
> >    before/after data per-compressor for easier visual comparisons;
> >    incorporated Nhat's feedback in the documentation updates; moved
> >    changelog to the end. Thanks Johannes, Yosry and Nhat!
> > 10) Moved the usemem testing configuration to 30 processes, each allocating
> >     10G within a 150G memory-limit constrained cgroup, maintaining the
> >     allocated memory for 10 sec before exiting. Thanks Ying for this
> >     suggestion!
> >
> > Changes since v6:
> > =================
> > 1) Rebased to mm-unstable as of 9-23-2024,
> >    commit acfabf7e197f7a5bedf4749dac1f39551417b049.
> > 2) Refactored into smaller commits, as suggested by Yosry and
> >    Chengming. Thanks both!
> > 3) Reworded the commit log for patches 5 and 6 as per Yosry's
> >    suggestion. Thanks Yosry!
> > 4) Gathered data on a Sapphire Rapids server that has 823GiB SSD swap disk
> >    partition. Also, all experiments are run with usemem --sleep 10, so that
> >    the memory allocated by the 70 processes remains in memory
> >    longer. Posted elapsed and sys times. Thanks to Yosry, Nhat and Ying for
> >    their help with refining the performance characterization methodology.
> > 5) Updated Documentation/admin-guide/mm/transhuge.rst as suggested by
> >    Nhat. Thanks Nhat!
> >
> > Changes since v5:
> > =================
> > 1) Rebased to mm-unstable as of 8/29/2024,
> >    commit 9287e4adbc6ab8fa04d25eb82e097fed877a4642.
> > 2) Added CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default) to
> >    enable/disable zswap_store() of mTHP folios. Thanks Nhat for the
> >    suggestion to add a knob by which users can enable/disable this
> >    change. Nhat, I hope this is along the lines of what you were
> >    thinking.
> > 3) Added vm-scalability usemem data with 4K folios with
> >    CONFIG_ZSWAP_STORE_THP_DEFAULT_ON off, that I gathered to make sure
> >    there is no regression with this change.
> > 4) Added data with usemem with 64K and 2M THP for an alternate view of
> >    before/after, as suggested by Yosry, so we can understand the impact
> >    of when mTHPs are split into 4K folios in shrink_folio_list()
> >    (CONFIG_THP_SWAP off) vs. not split (CONFIG_THP_SWAP on) and stored
> >    in zswap. Thanks Yosry for this suggestion.
> >
> > Changes since v4:
> > =================
> > 1) Published before/after data with zstd, as suggested by Nhat (Thanks
> >    Nhat for the data reviews!).
> > 2) Rebased to mm-unstable from 8/27/2024,
> >    commit b659edec079c90012cf8d05624e312d1062b8b87.
> > 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
> >    CONFIG_MEMCG is not defined, to resolve build errors reported by kernel
> >    robot; as per Nhat's and Michal's suggestion to not require a separate
> >    patch to fix the build errors (thanks both!).
> > 4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
> >    suggested by Yosry (Thanks Yosry!).
> > 5) Squashed the commits that define new mthp zswpout stat counters, and
> >    invoke count_mthp_stat() after successful zswap_store()s; into a single
> >    commit. Thanks Yosry for this suggestion!
> >
> > Changes since v3:
> > =================
> > 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
> >    Thanks to Barry for suggesting aligning with Ryan Roberts' latest
> >    changes to count_mthp_stat() so that it's always defined, even when THP
> >    is disabled. Barry, I have also made one other change in page_io.c
> >    where count_mthp_stat() is called by count_swpout_vm_event(). I would
> >    appreciate it if you can review this. Thanks!
> >    Hopefully this should resolve the kernel robot build errors.
> >
> > Changes since v2:
> > =================
> > 1) Gathered usemem data using SSD as the backing swap device for zswap,
> >    as suggested by Ying Huang. Ying, I would appreciate it if you can
> >    review the latest data. Thanks!
> > 2) Generated the base commit info in the patches to attempt to address
> >    the kernel test robot build errors.
> > 3) No code changes to the individual patches themselves.
> >
> > Changes since RFC v1:
> > =====================
> >
> > 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
> >    Thanks Barry!
> > 2) Addressed some of the code review comments that Nhat Pham provided in
> >    Ryan's initial RFC [1]:
> >    - Added a comment about the cgroup zswap limit checks occurring once per
> >      folio at the beginning of zswap_store().
> >      Nhat, Ryan, please do let me know if the comments convey the summary
> >      from the RFC discussion. Thanks!
> >    - Posted data on running the cgroup suite's zswap kselftest.
> > 3) Rebased to v6.11-rc3.
> > 4) Gathered performance data with usemem and the rebased patch-series.
> >
> >
> >
> > Kanchana P Sridhar (8):
> >   mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
> >   mm: zswap: Modify zswap_compress() to accept a page instead of a
> >     folio.
> >   mm: zswap: Rename zswap_pool_get() to zswap_pool_tryget().
> >   mm: Provide a new count_objcg_events() API for batch event updates.
> >   mm: zswap: Modify zswap_stored_pages to be atomic_long_t.
> >   mm: zswap: Support large folios in zswap_store().
> >   mm: swap: Count successful large folio zswap stores in hugepage
> >     zswpout stats.
> >   mm: Document the newly added sysfs large folios zswpout stats.
> >
> >  Documentation/admin-guide/mm/transhuge.rst |   8 +-
> >  fs/proc/meminfo.c                          |   2 +-
> >  include/linux/huge_mm.h                    |   1 +
> >  include/linux/memcontrol.h                 |  24 ++
> >  include/linux/zswap.h                      |   2 +-
> >  mm/huge_memory.c                           |   3 +
> >  mm/page_io.c                               |   1 +
> >  mm/zswap.c                                 | 254 +++++++++++++++------
> >  8 files changed, 219 insertions(+), 76 deletions(-)
> >
> >
> > base-commit: de2fbaa6d9c3576ec7133ed02a370ec9376bf000
> > --
> > 2.27.0
> >
Re: [PATCH v8 0/8] mm: zswap swap-out of large folios
Posted by Yosry Ahmed 2 months ago
[..]
> > > Performance Testing:
> > > ====================
> > >
> > > We list the data for 64K folios with before/after data per-compressor,
> > > followed by the same for 2M pmd-mappable folios.
> > >
> > >
> > >  -------------------------------------------------------------------------------
> > >  64K folios: zstd:
> > >  =================
> > >
> > >  zswap compressor                zstd          zstd         zstd      zstd v8
> > >                          before-case1  before-case2        after     vs.    vs.
> > >                                                                    case1  case2
> > >  -------------------------------------------------------------------------------
> > >  Total throughput (KB/s)    5,222,213     1,076,611    6,227,367     19%    478%
> > >  Average throughput (KB/s)    174,073        35,887      207,578     19%    478%
> > >  elapsed time (sec)            120.50        347.16       109.21      9%     69%
> >
> >
> > The diff here is supposed to be negative, right?
> > (Same for the below results)
>
> So this is supposed to be positive, to indicate the throughput improvement
> [(new-old)/old] with v8 as compared to before-case1 and before-case2.
> For latency, a positive value indicates a reduction in latency, since I
> calculate [(old-new)/old]. This is the convention used throughout.
>
> Based on this convention, positive percentages are improvements in both
> throughput and latency.

But you use negative percentages for sys time; we should at least be
consistent with this.