[v4] RE: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios

Posted by Sridhar, Kanchana P 1 year, 5 months ago

Hi Nhat,

> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Monday, August 26, 2024 7:12 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> 
> On Fri, Aug 23, 2024 at 11:21 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Hi Nhat,
> >
> >
> > I started out with 2 main hypotheses to explain why zswap incurs more
> > reclaim wrt SSD:
> >
> > 1) The cgroup zswap charge, that hastens the memory.high limit to be
> >    breached, and adds to the reclaim being triggered in
> >    mem_cgroup_handle_over_high().
> >
> > 2) Does a faster reclaim path somehow cause less allocation stalls; thereby
> >    causing more breaches of memory.high, hence more reclaim -- and does
> >    this cycle repeat, potentially leading to higher swapout activity with
> >    zswap?
> 
> By faster reclaim path, do you mean zswap has a lower reclaim latency?

Thanks for your follow-up comments/suggestions. Yes, I was characterizing
lower zswap reclaim latency as faster reclaim path.

> 
> >
> > I focused on gathering data with lz4 for this debug, under the reasonable
> > assumption that results with deflate-iaa will be better. Once we figure out
> > an overall direction on next steps, I will publish results with zswap lz4,
> > deflate-iaa, etc.
> >
> > All experiments except "Exp 1.A" are run with
> > usemem --init-time -w -O -n 70 1g.
> >
> > General settings for all data presented in this patch-series:
> >
> > vm.swappiness = 100
> > zswap shrinker_enabled = N
> >
> >  Experiment 1 - impact of not doing cgroup zswap charge:
> >  -------------------------------------------------------
> >
> > I wanted to first understand by how much we improve without the cgroup
> > zswap charge. I commented out both, the calls to obj_cgroup_charge_zswap()
> > and obj_cgroup_uncharge_zswap() in zswap.c in my patch-set.
> > We improve throughput by quite a bit with this change, and are now better
> > than mTHP getting swapped out to SSD. We have also slightly improved on
> > the sys time, though this is still a regression as compared to SSD. If you
> > recall, we were worse on throughput and sys time with v4.
> 
> I'm not 100% sure about the validity of this pair of experiments.
> 
> The thing is, you cannot ignore zswap's memory footprint altogether.
> That's the whole point of the trade-off. It's probably gigabytes worth
> of unaccounted memory usage - I see that your SSD size is 4G, and
> since compression ratio is less than 2, that's potentially 2G worth of
> memory give or take you are not charging to the cgroup, which can
> altogether alter the memory pressure and reclaim dynamics.

I agree, the zswap memory utilization charging to the cgroup is the right
thing to do (assuming we solve the temporary double-charging, as Yosry
and you have pointed out). I have summarized the zswap memory footprint
with different compressors in the results towards the end of this email.

> 
> The zswap charging itself is not the problem - that's fair and
> healthy. It might be the overreaction by the memory reclaim subsystem
> that seems anomalous?

I think so too, about the anomalous behavior.

> 
> >
> > Averages over 3 runs are summarized in each case.
> >
> >  Exp 1.A: usemem -n 1 58g:
> >  -------------------------
> >
> >  64KB mTHP (cgroup memory.high set to 60G):
> >  ==========================================
> >
> >                 SSD mTHP    zswap mTHP v4   zswap mTHP no_charge
> >  ----------------------------------------------------------------
> >  pswpout          586,352                0                      0
> >  zswpout            1,005        1,042,963                587,181
> >  ----------------------------------------------------------------
> >  Total swapout    587,357        1,042,963                587,181
> >  ----------------------------------------------------------------
> >
> > Without the zswap charge to cgroup, the total swapout activity for
> > zswap-mTHP is on par with that of SSD-mTHP for the single process case.
> >
> >
> >  Exp 1.B: usemem -n 70 1g:
> >  -------------------------
> >  v4 results with cgroup zswap charge:
> >  ------------------------------------
> >
> >  64KB mTHP (cgroup memory.high set to 60G):
> >  ==========================================
> >   ------------------------------------------------------------------
> >  |                    |                   |            |            |
> >  |Kernel              | mTHP SWAP-OUT     | Throughput |     Change |
> >  |                    |                   |       KB/s |            |
> >  |--------------------|-------------------|------------|------------|
> >  |v6.11-rc3 mainline  | SSD               |    335,346 |   Baseline |
> >  |zswap-mTHP-Store    | ZSWAP lz4         |    271,558 |       -19% |
> >  |------------------------------------------------------------------|
> >  |                    |                   |            |            |
> >  |Kernel              | mTHP SWAP-OUT     |   Sys time |     Change |
> >  |                    |                   |        sec |            |
> >  |--------------------|-------------------|------------|------------|
> >  |v6.11-rc3 mainline  | SSD               |      91.37 |   Baseline |
> >  |zswap-mTHP-Store    | ZSWAP lz4         |     265.43 |      -191% |
> >   ------------------------------------------------------------------
> >
> >   -----------------------------------------------------------------------
> >  | VMSTATS, mTHP ZSWAP/SSD stats|  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
> >  |                              |   mainline |       Store |       Store |
> >  |                              |            |         lz4 | deflate-iaa |
> >  |-----------------------------------------------------------------------|
> >  | pswpout                      |    174,432 |           0 |           0 |
> >  | zswpout                      |      1,501 |   1,491,654 |   1,398,805 |
> >  |-----------------------------------------------------------------------|
> >  | hugepages-64kB/stats/zswpout |            |      63,200 |      63,244 |
> >  |-----------------------------------------------------------------------|
> >  | hugepages-64kB/stats/swpout  |     10,902 |           0 |           0 |
> >   -----------------------------------------------------------------------
> >
> >  Debug results without cgroup zswap charge in both, "Before" and "After":
> >  ------------------------------------------------------------------------
> >
> >  64KB mTHP (cgroup memory.high set to 60G):
> >  ==========================================
> >   ------------------------------------------------------------------
> >  |                    |                   |            |            |
> >  |Kernel              | mTHP SWAP-OUT     | Throughput |     Change |
> >  |                    |                   |       KB/s |            |
> >  |--------------------|-------------------|------------|------------|
> >  |v6.11-rc3 mainline  | SSD               |    300,565 |   Baseline |
> >  |zswap-mTHP-Store    | ZSWAP lz4         |    420,125 |        40% |
> >  |------------------------------------------------------------------|
> >  |                    |                   |            |            |
> >  |Kernel              | mTHP SWAP-OUT     |   Sys time |     Change |
> >  |                    |                   |        sec |            |
> >  |--------------------|-------------------|------------|------------|
> >  |v6.11-rc3 mainline  | SSD               |      90.76 |   Baseline |
> >  |zswap-mTHP-Store    | ZSWAP lz4         |     213.09 |      -135% |
> >   ------------------------------------------------------------------
> >
> >   ---------------------------------------------------------
> >  | VMSTATS, mTHP ZSWAP/SSD stats|  v6.11-rc3 |  zswap-mTHP |
> >  |                              |   mainline |       Store |
> >  |                              |            |         lz4 |
> >  |----------------------------------------------------------
> >  | pswpout                      |    330,640 |           0 |
> >  | zswpout                      |      1,527 |   1,384,725 |
> >  |----------------------------------------------------------
> >  | hugepages-64kB/stats/zswpout |            |      63,335 |
> >  |----------------------------------------------------------
> >  | hugepages-64kB/stats/swpout  |     18,242 |           0 |
> >   ---------------------------------------------------------
> >
> 
> Hmm, in the 70 processes case, it looks like we're still seeing
> latency regression, and that same pattern of overreclaiming, even
> without zswap cgroup charging?
> 
> That seems like a hint - concurrency exacerbates the problem?

Agreed, that was my conclusion as well.

> 
> >
> > Based on these results, I kept the cgroup zswap charging commented out in
> > subsequent debug steps, so as to not place zswap at a disadvantage when
> > trying to determine further causes for hypothesis (1).
> >
> >
> > Experiment 2 - swap latency/reclamation with 64K mTHP:
> > ------------------------------------------------------
> >
> > Number of swap_writepage    Total swap_writepage  Average swap_writepage
> >     calls from all cores      Latency (millisec)      Latency (microsec)
> > ---------------------------------------------------------------------------
> > SSD               21,373               165,434.9                   7,740
> > zswap            344,109                55,446.8                     161
> > ---------------------------------------------------------------------------
> >
> >
> > Reclamation analysis: 64k mTHP swapout:
> > ---------------------------------------
> > "Before":
> >   Total SSD compressed data size   =  1,362,296,832  bytes
> >   Total SSD write IO latency       =        887,861  milliseconds
> >
> >   Average SSD compressed data size =      1,089,837  bytes
> >   Average SSD write IO latency     =        710,289  microseconds
> >
> > "After":
> >   Total ZSWAP compressed pool size =  2,610,657,430  bytes
> >   Total ZSWAP compress latency     =         55,984  milliseconds
> >
> >   Average ZSWAP compress length    =          2,055  bytes
> >   Average ZSWAP compress latency   =             44  microseconds
> >
> >   zswap-LZ4 mTHP compression ratio =  1.99
> >   All moderately compressible pages. 0 zswap_store errors.
> >   84% of pages compress to 2056 bytes.
> 
> Hmm this ratio isn't very good indeed - it is less than 2-to-1 memory saving...
> 
> Internally, we often see 1-3 or 1-4 saving ratio (or even more).

Agree with this as well. In our experiments with other workloads, we
typically see much higher ratios.

> 
> Probably does not explain everything, but worth double checking -
> could you check with zstd to see if the ratio improves.

Sure. I gathered ratio and compressed memory footprint data today with
64K mTHP, the 4G SSD swapfile and different zswap compressors.

 This patch-series and no zswap charging, 64K mTHP:
---------------------------------------------------------------------------
                       Total         Total     Average      Average   Comp
                  compressed   compression  compressed  compression  ratio
                      length       latency      length      latency
                       bytes  milliseconds       bytes  nanoseconds
---------------------------------------------------------------------------
SSD (no zswap) 1,362,296,832       887,861  
lz4            2,610,657,430        55,984       2,055      44,065    1.99
zstd             729,129,528        50,986         565      39,510    7.25
deflate-iaa    1,286,533,438        44,785       1,415      49,252    2.89
---------------------------------------------------------------------------

zstd does very well on ratio, as expected.
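
As a sanity check, the "Comp ratio" column can be reproduced from the
"Average compressed length" column. The sketch below assumes 4 KiB base
pages (PAGE_SIZE = 4096 is an assumption; the table does not state it) and
that the ratio is simply page size divided by average compressed length.

```python
# Assumption: 4 KiB base pages (PAGE_SIZE = 4096), not stated in the table.
PAGE_SIZE = 4096

# Average compressed length in bytes, copied from the table above.
avg_compressed_len = {"lz4": 2055, "zstd": 565, "deflate-iaa": 1415}


def comp_ratio(avg_len_bytes, page_size=PAGE_SIZE):
    """Uncompressed page size divided by the average compressed length."""
    return page_size / avg_len_bytes


# Round to two decimals, matching the precision used in the table.
ratios = {name: round(comp_ratio(n), 2) for name, n in avg_compressed_len.items()}

for name, ratio in ratios.items():
    print(f"{name:12s} {ratio:.2f}")
```

Under these assumptions the computed values agree with the table (1.99,
7.25 and 2.89), which suggests the ratio column is per-page rather than
derived from the total pool size.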

> 
> >
> >
> >  Experiment 3 - 4K folios swap characteristics SSD vs. ZSWAP:
> >  ------------------------------------------------------------
> >
> >  I wanted to take a step back and understand how the mainline v6.11-rc3
> >  handles 4K folios when swapped out to SSD (CONFIG_ZSWAP is off) and when
> >  swapped out to ZSWAP. Interestingly, higher swapout activity is observed
> >  with 4K folios and v6.11-rc3 (with the debug change to not charge zswap to
> >  cgroup).
> >
> >  v6.11-rc3 with no zswap charge, only 4K folios, no (m)THP:
> >
> >  -------------------------------------------------------------
> >  SSD (CONFIG_ZSWAP is OFF)       ZSWAP          lz4    lzo-rle
> >  -------------------------------------------------------------
> >  cgroup memory.events:           cgroup memory.events:
> >
> >  low                 0           low              0          0
> >  high            5,068           high       321,923    375,116
> >  max                 0           max              0          0
> >  oom                 0           oom              0          0
> >  oom_kill            0           oom_kill         0          0
> >  oom_group_kill      0           oom_group_kill   0          0
> >  -------------------------------------------------------------
> >
> >  SSD (CONFIG_ZSWAP is OFF):
> >  --------------------------
> >  pswpout            415,709
> >  sys time (sec)      301.02
> >  Throughput KB/s    155,970
> >  memcg_high events    5,068
> >  --------------------------
> >
> >
> >  ZSWAP                  lz4         lz4         lz4     lzo-rle
> >  --------------------------------------------------------------
> >  zswpout          1,598,550   1,515,151   1,449,432   1,493,917
> >  sys time (sec)      889.36      481.21      581.22      635.75
> >  Throughput KB/s     35,176      14,765      20,253      21,407
> >  memcg_high events  321,923     412,733     369,976     375,116
> >  --------------------------------------------------------------
> >
> >  This shows that there is a performance regression of -60% to -195% with
> >  zswap as compared to SSD with 4K folios. The higher swapout activity with
> >  zswap is seen here too (i.e., this doesn't appear to be mTHP-specific).
> >
> >  I verified this to be the case even with the v6.7 kernel, which also
> >  showed a 2.3X throughput improvement when we don't charge zswap:
> >
> >  ZSWAP lz4                 v6.7      v6.7 with no cgroup zswap charge
> >  --------------------------------------------------------------------
> >  zswpout              1,419,802       1,398,620
> >  sys time (sec)           535.4          613.41
> 
> systime increases without zswap cgroup charging? That's strange...

Additional data gathered with v6.11-rc3 (listed below) based on your suggestion
to investigate potential swap.high breaches should hopefully provide some
explanation.

> 
> >  Throughput KB/s          8,671          20,045
> >  memcg_high events      574,046         451,859
> 
> So, on 4k folio setup, even without cgroup charge, we are still seeing:
> 
> 1. More zswpout (than observed in SSD)
> 2. 40-50% worse latency - in fact it is worse without zswap cgroup charging.
> 3. 100 times the amount of memcg_high events? This is perhaps the
> *strangest* to me. You're already removing zswap cgroup charging, then
> where does this come from? How can we have memory.high violation when
> zswap does *not* contribute to memory usage?
> 
> Is this due to swap limit charging? Do you have a cgroup swap limit?
> 
> mem_high = page_counter_read(&memcg->memory) >
>            READ_ONCE(memcg->memory.high);
> swap_high = page_counter_read(&memcg->swap) >
>            READ_ONCE(memcg->swap.high);
> [...]
> 
> if (mem_high || swap_high) {
>     /*
>     * The allocating tasks in this cgroup will need to do
>     * reclaim or be throttled to prevent further growth
>     * of the memory or swap footprints.
>     *
>     * Target some best-effort fairness between the tasks,
>     * and distribute reclaim work and delay penalties
>     * based on how much each task is actually allocating.
>     */
>     current->memcg_nr_pages_over_high += batch;
>     set_notify_resume(current);
>     break;
> }
> 

I don't have a swap.high limit set on the cgroup; it is set to "max".

I ran experiments with v6.11-rc3, no zswap charging, 4K folios and different
zswap compressors to verify if swap.high is breached with the 4G SSD swapfile.

 SSD (CONFIG_ZSWAP is OFF):

                                SSD          SSD          SSD
 ------------------------------------------------------------ 
 pswpout                    415,709    1,032,170      636,582
 sys time (sec)              301.02       328.15       306.98
 Throughput KB/s            155,970       89,621      122,219
 memcg_high events            5,068       15,072        8,344
 memcg_swap_high events           0            0            0
 memcg_swap_fail events           0            0            0
 ------------------------------------------------------------
                                    
 ZSWAP                               zstd         zstd       zstd
 ----------------------------------------------------------------
 zswpout                        1,391,524    1,382,965  1,417,307
 sys time (sec)                    474.68       568.24     489.80
 Throughput KB/s                   26,099       23,404    111,115
 memcg_high events                335,112      340,335    162,260
 memcg_swap_high events                 0            0          0
 memcg_swap_fail events         1,226,899    5,742,153
  (mem_cgroup_try_charge_swap)
 memcg_memory_stat_pgactivate   1,259,547
  (shrink_folio_list)
 ----------------------------------------------------------------

 ZSWAP                      lzo-rle      lzo-rle     lzo-rle
 -----------------------------------------------------------
 zswpout                  1,493,917    1,363,040   1,428,133
 sys time (sec)              635.75       498.63      484.65
 Throughput KB/s             21,407       23,827      20,237
 memcg_high events          375,116      352,814     373,667
 memcg_swap_high events           0            0           0
 memcg_swap_fail events     715,211      
 -----------------------------------------------------------
                                    
 ZSWAP                         lz4         lz4        lz4          lz4
 ---------------------------------------------------------------------
 zswpout                 1,378,781   1,598,550   1,515,151   1,449,432
 sys time (sec)             495.45      889.36      481.21      581.22
 Throughput KB/s            26,248      35,176      14,765      20,253
 memcg_high events         347,209     321,923     412,733     369,976
 memcg_swap_high events          0           0           0           0
 memcg_swap_fail events    580,103           0 
 ---------------------------------------------------------------------

 ZSWAP                  deflate-iaa   deflate-iaa    deflate-iaa
 ----------------------------------------------------------------
 zswpout                    380,471     1,440,902      1,397,965
 sys time (sec)              329.06        570.77         467.41
 Throughput KB/s            283,867        28,403        190,600
 memcg_high events            5,551       422,831         28,154
 memcg_swap_high events           0             0              0
 memcg_swap_fail events           0     2,686,758        438,562
 ----------------------------------------------------------------

There are no swap.high memcg events recorded in any of the SSD/zswap
 experiments. However, I do see a significant number of memcg_swap_fail
 events in some of the zswap runs, for all of the compressors tested. This
 is not consistent, because there are some runs with 0 memcg_swap_fail for
 all compressors.

 There is a possible correlation between memcg_swap_fail events
 (/sys/fs/cgroup/test/memory.swap.events) and the high # of memcg_high
 events. The root cause appears to be the following sequence: no swap
 slots are available, memcg_swap_fail is incremented, add_to_swap() fails
 in shrink_folio_list(), and the folio then goes to "activate_locked:".
 The folio re-activation is recorded in cgroup memory.stat pgactivate
 events. The failure to swap out folios due to lack of swap slots could
 contribute towards memory.high breaches.

swp_entry_t folio_alloc_swap(struct folio *folio)
{
...
	get_swap_pages(1, &entry, 0);
out:
	if (mem_cgroup_try_charge_swap(folio, entry)) {
		put_swap_folio(folio, entry);
		entry.val = 0;
	}
	return entry;
}

int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry)
{
...
	if (!entry.val) {
		WARN_ONCE(1, "__mem_cgroup_try_charge_swap: MEMCG_SWAP_FAIL entry.val is 0");
		memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
		return 0;
	}

...
}

This is the call stack (v6.11-rc3 mainline) as reference for the above
analysis:

[  109.130504] __mem_cgroup_try_charge_swap: MEMCG_SWAP_FAIL entry.val is 0
[  109.130515] WARNING: CPU: 143 PID: 5200 at mm/memcontrol.c:5011 __mem_cgroup_try_charge_swap (mm/memcontrol.c:5011 (discriminator 3)) 
[  109.130652] RIP: 0010:__mem_cgroup_try_charge_swap (mm/memcontrol.c:5011 (discriminator 3)) 
[  109.130682] Call Trace:
[  109.130686]  <TASK>
[  109.130689] ? __warn (kernel/panic.c:735) 
[  109.130695] ? __mem_cgroup_try_charge_swap (mm/memcontrol.c:5011 (discriminator 3)) 
[  109.130698] ? report_bug (lib/bug.c:201 lib/bug.c:219) 
[  109.130705] ? prb_read_valid (kernel/printk/printk_ringbuffer.c:2183) 
[  109.130710] ? handle_bug (arch/x86/kernel/traps.c:239) 
[  109.130715] ? exc_invalid_op (arch/x86/kernel/traps.c:260 (discriminator 1)) 
[  109.130718] ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:621) 
[  109.130722] ? __mem_cgroup_try_charge_swap (mm/memcontrol.c:5011 (discriminator 3)) 
[  109.130725] ? __mem_cgroup_try_charge_swap (mm/memcontrol.c:5011 (discriminator 3)) 
[  109.130728] folio_alloc_swap (mm/swap_slots.c:348) 
[  109.130734] add_to_swap (mm/swap_state.c:189) 
[  109.130737] shrink_folio_list (mm/vmscan.c:1235) 
[  109.130744] ? __mod_zone_page_state (mm/vmstat.c:367) 
[  109.130748] ? isolate_lru_folios (mm/vmscan.c:1598 mm/vmscan.c:1736) 
[  109.130753] shrink_inactive_list (./include/linux/spinlock.h:376 mm/vmscan.c:1961) 
[  109.130758] shrink_lruvec (mm/vmscan.c:2194 mm/vmscan.c:5706) 
[  109.130763] shrink_node (mm/vmscan.c:5910 mm/vmscan.c:5948) 
[  109.130768] do_try_to_free_pages (mm/vmscan.c:6134 mm/vmscan.c:6254) 
[  109.130772] try_to_free_mem_cgroup_pages (./include/linux/sched/mm.h:355 ./include/linux/sched/mm.h:456 mm/vmscan.c:6588) 
[  109.130778] reclaim_high (mm/memcontrol.c:1906) 
[  109.130783] mem_cgroup_handle_over_high (./include/linux/memcontrol.h:556 mm/memcontrol.c:2001 mm/memcontrol.c:2108) 
[  109.130787] irqentry_exit_to_user_mode (./include/linux/resume_user_mode.h:60 kernel/entry/common.c:114 ./include/linux/entry-common.h:328 kernel/entry/common.c:231) 
[  109.130792] asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623) 


 However, this is probably not the only cause for either the high # of
 memory.high breaches or the over-reclaim with zswap, as seen in the lz4
 data where the # of memcg_high events is significant even in runs with no
 memcg_swap_fail events.
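
For anyone wanting to reproduce the counters discussed above: in cgroup v2,
memory.events and memory.swap.events are flat-keyed files with one
"key value" pair per line. The sketch below is a hypothetical helper (the
function names are illustrative, not from any existing tool) that parses
such a file so the "fail" and "high" counters can be compared before and
after a run.

```python
from pathlib import Path


def parse_flat_keyed(text):
    """Parse cgroup v2 flat-keyed content: one 'key value' pair per line."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def read_events(cgroup_dir):
    """Return (memory.events, memory.swap.events) counters for one cgroup."""
    base = Path(cgroup_dir)
    mem = parse_flat_keyed((base / "memory.events").read_text())
    swap = parse_flat_keyed((base / "memory.swap.events").read_text())
    return mem, swap


# Sample content modelled on the first zstd run above. The "max 0" line is
# an assumed value, included only to show all three keys of the file.
sample_swap_events = "high 0\nmax 0\nfail 1226899\n"
sample = parse_flat_keyed(sample_swap_events)
print(sample)
```

On a test machine, read_events("/sys/fs/cgroup/test") could be called
before and after the usemem run, and the difference in the "fail" and
"high" counters compared across compressors.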

Some observations/questions based on the above 4K folios swapout data:

1) There are more memcg_high events as the swapout latency reduces
   (i.e. faster swap-write path). This is even without charging zswap
   utilization to the cgroup.

2) There appears to be a direct correlation between a higher # of
   memcg_swap_fail events, an increase in memcg_high breaches, and a
   reduction in usemem throughput. This, combined with the observation in
   (1), suggests that with a faster compressor we need more swap slots,
   which increases the probability of running out of swap slots with the 4G
   SSD backing device.

3) Could the data shared earlier on reduction in memcg_high breaches with
   64K mTHP swapout provide some more clues, if we agree with (1) and (2):

   "Interestingly, the # of memcg_high events reduces significantly with 64K
   mTHP as compared to the above 4K memcg_high events data, when tested
   with v4 and no zswap charge: 3,069 (SSD-mTHP) and 19,656 (ZSWAP-mTHP)."

4) In the case of each zswap compressor, there are some runs that go
   through with 0 memcg_swap_fail events. These runs generally have
   fewer memcg_high breaches and better sys time/throughput.

5) For a given swap setup, there is some amount of variance in
   sys time for this workload.

6) All this suggests that the primary root cause is the concurrency setup,
   where there could be randomness between runs as to the # of processes
   that observe the memory.high breach due to other factors such as
   availability of swap slots for alloc.

To summarize, I believe the root cause is the 4G SSD swapfile resulting in
running out of swap slots, and anomalous behavior with over-reclaim when 70
concurrent processes are working with the 60G memory limit while trying to
allocate 1G each; with randomness in processes reacting to the breach.
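
A back-of-envelope check of the swap slot budget is sketched below. It
rests on several assumptions that are not stated in this thread: "4G",
"60G" and "1g" are binary GiB, base pages are 4 KiB, all 70 processes hold
their full 1G at the same time, and swap slots are not recycled during the
run. It is only meant to show the order of magnitude.

```python
GiB = 1024 ** 3
PAGE_SIZE = 4096            # assumption: 4 KiB base pages

nr_procs = 70               # usemem -n 70
per_proc_bytes = 1 * GiB    # usemem ... 1g (assumed to mean 1 GiB)
memory_high = 60 * GiB      # cgroup memory.high
swapfile = 4 * GiB          # SSD swapfile that provides the swap slots

# One swap slot per 4 KiB page of swapfile.
swap_slots = swapfile // PAGE_SIZE

# Pages that must leave memory for the workload to fit under memory.high.
excess_pages = (nr_procs * per_proc_bytes - memory_high) // PAGE_SIZE

print(f"swap slots available  : {swap_slots:,}")
print(f"pages over memory.high: {excess_pages:,}")
print(f"excess / slots        : {excess_pages / swap_slots:.2f}")
```

Under these assumptions the pages over the limit exceed the available
slots by 2.5x, which is at least consistent with the memcg_swap_fail
events seen in the 4K runs.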

The cgroup zswap charging exacerbates this situation, but is not a problem
in and of itself.

Nhat, as you pointed out, this is somewhat of an unrealistic scenario that
doesn't seem to indicate any specific problems to be solved, other than the
temporary cgroup zswap double-charging.

Would it be fair to evaluate this patch-series based on a more realistic
swapfile configuration based on 176G ZRAM, for which I had shared the data
in v2? There weren't any problems with swap slots availability or any
anomalies that I can think of with this setup, other than the fact that the
"Before" and "After" sys times could not be directly compared for 2 key
reasons:

 - ZRAM compressed data is not charged to the cgroup, similar to SSD.
 - ZSWAP compressed data is charged to the cgroup.

This disparity causes fewer swapouts, better sys time/throughput in the
"Before" experiments.

In the "After" experiments, this disparity causes more swapouts only with
zswap-lz4 due to the poorer compression ratio combined with the cgroup
charge; and hence a regression in sys time/throughput.

However, the better compression ratio with deflate-iaa results in
comparable # of swapouts as "Before", with better sys time/throughput.

My main rationale for suggesting the v2 ZRAM swapfile data is that the
disparities are the same as with the 4G SSD swapfile, but there are no
anomalies, and the data have reasonable explanations.

I would appreciate everyone's thoughts on this. If this sounds OK, then I can
submit a v5 with the changes suggested by Yosry.

I am listing here the v2 data with 176G ZRAM swapfile again, just for reference.

v2 data with cgroup zswap charging:
-----------------------------------

 64KB mTHP:
 ==========
  ------------------------------------------------------------------
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     | Throughput |     Change |
 |                    |                   |       KB/s |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | ZRAM lzo-rle      |    118,928 |   Baseline |
 |zswap-mTHP-Store    | ZSWAP lz4         |     82,665 |       -30% |
 |zswap-mTHP-Store    | ZSWAP deflate-iaa |    176,210 |        48% |
 |------------------------------------------------------------------|
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     |   Sys time |     Change |
 |                    |                   |        sec |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | ZRAM lzo-rle      |   1,032.20 |   Baseline |
 |zswap-mTHP-Store    | ZSWAP lz4         |   1,854.51 |       -80% |
 |zswap-mTHP-Store    | ZSWAP deflate-iaa |     582.71 |        44% |
  ------------------------------------------------------------------
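
The "Change" column in the table above appears to use a sign convention
where positive means better: higher throughput is an improvement, and lower
sys time is an improvement. This convention is inferred from the numbers,
not stated explicitly. A minimal sketch that reproduces the 64KB mTHP
values under that assumption (change_pct is a hypothetical helper name):

```python
def change_pct(baseline, value, higher_is_better):
    """Signed percent change vs. baseline; positive means an improvement."""
    delta = (value - baseline) if higher_is_better else (baseline - value)
    return round(100.0 * delta / baseline)


# 64KB mTHP rows from the v2 table above (baseline: ZRAM lzo-rle).
throughput = {"ZSWAP lz4": 82_665, "ZSWAP deflate-iaa": 176_210}   # KB/s
sys_time = {"ZSWAP lz4": 1_854.51, "ZSWAP deflate-iaa": 582.71}    # sec

for name, value in throughput.items():
    print(f"throughput {name:18s} {change_pct(118_928, value, True):+d}%")
for name, value in sys_time.items():
    print(f"sys time   {name:18s} {change_pct(1_032.20, value, False):+d}%")
```

The same helper applied to the 2MB rows gives the values listed in that
table as well, so the sign convention looks consistent across both.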

  -----------------------------------------------------------------------
 | VMSTATS, mTHP ZSWAP stats,   |  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
 | mTHP ZRAM stats:             |   mainline |       Store |       Store |
 |                              |            |         lz4 | deflate-iaa |
 |-----------------------------------------------------------------------|
 | pswpin                       |         16 |           0 |           0 |
 | pswpout                      |  7,770,720 |           0 |           0 |
 | zswpin                       |        547 |         695 |         579 |
 | zswpout                      |      1,394 |  15,462,778 |   7,284,554 |
 |-----------------------------------------------------------------------|
 | thp_swpout                   |          0 |           0 |           0 |
 | thp_swpout_fallback          |          0 |           0 |           0 |
 | pgmajfault                   |      3,786 |       3,541 |       3,367 |
 |-----------------------------------------------------------------------|
 | hugepages-64kB/stats/zswpout |            |     966,328 |     455,196 |
 |-----------------------------------------------------------------------|
 | hugepages-64kB/stats/swpout  |    485,670 |           0 |           0 |
  -----------------------------------------------------------------------


 2MB PMD-THP/2048K mTHP:
 =======================
  ------------------------------------------------------------------
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     | Throughput |     Change |
 |                    |                   |       KB/s |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | ZRAM lzo-rle      |    177,340 |   Baseline |
 |zswap-mTHP-Store    | ZSWAP lz4         |     84,030 |       -53% |
 |zswap-mTHP-Store    | ZSWAP deflate-iaa |    185,691 |         5% |
 |------------------------------------------------------------------|
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     |   Sys time |     Change |
 |                    |                   |        sec |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | ZRAM lzo-rle      |     876.29 |   Baseline |
 |zswap-mTHP-Store    | ZSWAP lz4         |   1,740.55 |       -99% |
 |zswap-mTHP-Store    | ZSWAP deflate-iaa |     650.33 |        26% |
  ------------------------------------------------------------------

  ------------------------------------------------------------------------- 
 | VMSTATS, mTHP ZSWAP stats,     |  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
 | mTHP ZRAM stats:               |   mainline |       Store |       Store |
 |                                |            |         lz4 | deflate-iaa |
 |-------------------------------------------------------------------------|
 | pswpin                         |          0 |           0 |           0 |
 | pswpout                        |  8,628,224 |           0 |           0 |
 | zswpin                         |        678 |      22,733 |       1,641 |
 | zswpout                        |      1,481 |  14,828,597 |   9,404,937 |
 |-------------------------------------------------------------------------|
 | thp_swpout                     |     16,852 |           0 |           0 |
 | thp_swpout_fallback            |          0 |           0 |           0 |
 | pgmajfault                     |      3,467 |      25,550 |       4,800 |
 |-------------------------------------------------------------------------|
 | hugepages-2048kB/stats/zswpout |            |      28,924 |      18,366 |
 |-------------------------------------------------------------------------|
 | hugepages-2048kB/stats/swpout  |     16,852 |           0 |           0 |
  -------------------------------------------------------------------------
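
For anyone wanting to re-derive the "Change" columns in the two tables above, here is a minimal Python sketch. The sign convention (positive means better, so higher KB/s and lower sys time both count as positive) is an assumption inferred from the tabulated values, and the helper names are purely illustrative.

```python
# Reproduce the "Change" columns from the raw table values.
# Assumed sign convention: positive = better than the ZRAM lzo-rle baseline.

def throughput_change(baseline_kbps, new_kbps):
    """Percent change in KB/s relative to baseline (higher is better)."""
    return round((new_kbps - baseline_kbps) / baseline_kbps * 100)

def sys_time_change(baseline_sec, new_sec):
    """Percent change in sys time relative to baseline (lower is better)."""
    return round((baseline_sec - new_sec) / baseline_sec * 100)

# v6.11-rc3 mainline, ZRAM lzo-rle (the "Baseline" rows)
zram_kbps, zram_sys_sec = 177_340, 876.29

# zswap-mTHP-Store rows: compressor -> (KB/s, sys time in seconds)
zswap_rows = {
    "lz4": (84_030, 1_740.55),
    "deflate-iaa": (185_691, 650.33),
}

changes = {
    name: (throughput_change(zram_kbps, kbps), sys_time_change(zram_sys_sec, sec))
    for name, (kbps, sec) in zswap_rows.items()
}

for name, (tput_pct, sys_pct) in changes.items():
    print(f"{name:12s} KB/s change: {tput_pct:+d}%  sys time change: {sys_pct:+d}%")
```

Under that convention, the computed percentages match the tabulated ones after rounding to the nearest integer.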

> 
> >  --------------------------------------------------------------------
> >
> >
> > Summary from the debug:
> > -----------------------
> > 1) Excess reclaim is exacerbated by zswap charge to cgroup. Without the
> >    charge, reclaim is on par with SSD for mTHP in the single process
> >    case. The multiple process excess reclaim seems to be most likely
> >    resulting from over-reclaim done by the cores, in their respective calls
> >    to mem_cgroup_handle_over_high().
> 
> Exacerbate, yes. I'm not 100% sure it's the sole or even the main cause.
> 
> You still see a degree of overreclaiming without zswap cgroup charging in:
> 
> 1. 70 processes, with mTHP
> 2. 70 processes, with 4K folios.

That's correct, although the over-reclaiming is not as bad with mTHP.

> 
> >
> > 2) The higher swapout activity with zswap as compared to SSD does not
> >    appear to be specific to mTHP. Higher reclaim activity and sys time
> >    regression with zswap (as compared to a setup where there is only SSD
> >    configured as swap) exists with 4K pages as far back as v6.7.
> 
> Yeah I can believe that without mthp, the same-ish workload would
> cause the same regression.

This makes sense.

> 
> >
> > 3) The debug indicates the hypothesis (2) is worth more investigation:
> >    Does a faster reclaim path somehow cause fewer allocation stalls, thereby
> >    causing more breaches of memory.high, hence more reclaim -- and does
> >    this cycle repeat, potentially leading to higher swapout activity with
> >    zswap? Any advice on this being a possibility, and suggestions/pointers
> >    to verify this, would be greatly appreciated.
> 
> Add stalls along the zswap path? :)

Yes, possibly! Hopefully, the findings on swap slot availability from
today's experiments make things a little clearer.

> 
> >
> > 4) Interestingly, the # of memcg_high events reduces significantly with 64K
> >    mTHP as compared to the above 4K high events data, when tested with v4
> >    and no zswap charge: 3,069 (SSD-mTHP) and 19,656 (ZSWAP-mTHP). This
> >    potentially indicates something to do with allocation efficiency
> >    countering the higher reclaim that seems to be caused by swapout
> >    efficiency.
> >
> > 5) Nhat, Yosry: would it be possible for you to run the 4K folios
> >    usemem -n 70 1g (with 60G memory.high) experiment with 4G and some
> >    higher value SSD configuration in your setup and, say, v6.11-rc3? I
> >    would like to rule out the memory constrained 4G SSD in my setup
> >    somehow skewing the behavior of zswap vis-a-vis
> >    allocation/memcg_handle_over_high/reclaim. I realize your time is
> >    valuable, however I think an independent confirmation of what I have
> >    been observing would be really helpful for us to figure out potential
> >    root-causes and solutions.
> 
> It might take a while for me to set up your benchmark, but yeah 4G
> swapfile seems small on a 64G host - of course it depends on the
> workload, but this has a lot of memory usage. In fact the total memory
> usage (70G?) is slightly above memory.high + 4G swapfile - note that
> this is exacerbated by, once again, zswap's less-than-100% memory
> saving ratio.

I agree, this is somewhat of an unrealistic setup. Hopefully the data and
learnings from the experiments I ran today provide some insight into
possible root-causes for the anomalous over-reclaim behavior.

Thanks,
Kanchana

> 
> >
> > 6) I tried a small change in memcontrol.c::mem_cgroup_handle_over_high()
> >    to break out of the loop if we have reclaimed a total of at least
> >    "nr_pages":
> >
> >         nr_reclaimed = reclaim_high(memcg,
> >                                     in_retry ? SWAP_CLUSTER_MAX : nr_pages,
> >                                     gfp_mask);
> >
> > +       nr_reclaimed_total += nr_reclaimed;
> > +
> > +       if (nr_reclaimed_total >= nr_pages)
> > +               goto out;
> >
> >
> >    This was only for debug purposes, and did seem to mitigate the higher
> >    reclaim behavior for 4K folios:
> >
> >  ZSWAP                  lz4             lz4             lz4
> >  ----------------------------------------------------------
> >  zswpout          1,305,367       1,349,195       1,529,235
> >  sys time (sec)      472.06          507.76          646.39
> >  Throughput KB/s     55,144          21,811          88,310
> >  memcg_high events  257,890         343,213         172,351
> >  ----------------------------------------------------------
> >
> > On average, this change results in 17% improvement in sys time, 2.35X
> > improvement in throughput and 30% fewer memcg_high events.
> >
> > I look forward to further inputs on next steps.
> >
> > Thanks,
> > Kanchana
> >
> >
> > >
> > > Thanks for this analysis. I will debug this some more, so we can better
> > > understand these results.
> > >
> > > Thanks,
> > > Kanchana

Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios

Posted by Nhat Pham 1 year, 5 months ago

On Mon, Aug 26, 2024 at 11:08 PM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
> > Internally, we often see 1-3 or 1-4 saving ratio (or even more).
>
> Agree with this as well. In our experiments with other workloads, we
> typically see much higher ratios.
>
> >
> > Probably does not explain everything, but worth double checking -
> > could you check with zstd to see if the ratio improves.
>
> Sure. I gathered ratio and compressed memory footprint data today with
> 64K mTHP, the 4G SSD swapfile and different zswap compressors.
>
>  This patch-series and no zswap charging, 64K mTHP:
> ---------------------------------------------------------------------------
>                        Total         Total     Average      Average   Comp
>                   compressed   compression  compressed  compression  ratio
>                       length       latency      length      latency
>                        bytes  milliseconds       bytes  nanoseconds
> ---------------------------------------------------------------------------
> SSD (no zswap) 1,362,296,832       887,861
> lz4            2,610,657,430        55,984       2,055      44,065    1.99
> zstd             729,129,528        50,986         565      39,510    7.25
> deflate-iaa    1,286,533,438        44,785       1,415      49,252    2.89
> ---------------------------------------------------------------------------
>
> zstd does very well on ratio, as expected.

Wait. So zstd is displaying 7-to-1 compression ratio? And has *lower*
average latency?

Why are we running benchmark on lz4 again? Sure there is no free lunch
and no compressor that works well on all kinds of data, but lz4's
performance here is so bad that it's borderline justifiable to
disable/bypass zswap with this kind of compression ratio...

Can I ask you to run benchmarking on zstd from now on?
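
As a sanity check on the table quoted above, the "Comp ratio" column looks like the page size divided by the average compressed length, and the average latency looks like total latency divided by the number of compressed pages. A minimal Python sketch of that reading follows; the 4096-byte page size and both derivations are assumptions, not something the table states.

```python
PAGE_SIZE = 4096  # assumed: 4K pages

# compressor -> (total compressed bytes, total latency in ms, average compressed bytes)
rows = {
    "lz4": (2_610_657_430, 55_984, 2_055),
    "zstd": (729_129_528, 50_986, 565),
    "deflate-iaa": (1_286_533_438, 44_785, 1_415),
}

def comp_ratio(avg_len):
    """Compression ratio, assuming each compressed object is one 4K page."""
    return round(PAGE_SIZE / avg_len, 2)

def avg_latency_ns(total_bytes, total_ms, avg_len):
    """Average per-page latency: total latency over the estimated page count."""
    pages = total_bytes / avg_len          # estimated number of compressed pages
    return total_ms * 1_000_000 / pages    # ms -> ns, then per page

ratios = {name: comp_ratio(avg) for name, (_, _, avg) in rows.items()}
latencies = {name: avg_latency_ns(*vals) for name, vals in rows.items()}

for name in rows:
    print(f"{name:12s} ratio {ratios[name]:.2f}  avg latency ~{latencies[name]:,.0f} ns")
```

If this reading is right, the ratios come out the same as the table's, and the derived per-page latencies land within a fraction of a percent of the tabulated averages.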

>
> >
> > >
> > >
> > >  Experiment 3 - 4K folios swap characteristics SSD vs. ZSWAP:
> > >  ------------------------------------------------------------
> > >
> > >  I wanted to take a step back and understand how the mainline v6.11-rc3
> > >  handles 4K folios when swapped out to SSD (CONFIG_ZSWAP is off) and
> > >  when swapped out to ZSWAP. Interestingly, higher swapout activity is
> > >  observed with 4K folios and v6.11-rc3 (with the debug change to not
> > >  charge zswap to cgroup).
> > >
> > >  v6.11-rc3 with no zswap charge, only 4K folios, no (m)THP:
> > >
> > >  -------------------------------------------------------------
> > >  SSD (CONFIG_ZSWAP is OFF)       ZSWAP          lz4    lzo-rle
> > >  -------------------------------------------------------------
> > >  cgroup memory.events:           cgroup memory.events:
> > >
> > >  low                 0           low              0          0
> > >  high            5,068           high       321,923    375,116
> > >  max                 0           max              0          0
> > >  oom                 0           oom              0          0
> > >  oom_kill            0           oom_kill         0          0
> > >  oom_group_kill      0           oom_group_kill   0          0
> > >  -------------------------------------------------------------
> > >
> > >  SSD (CONFIG_ZSWAP is OFF):
> > >  --------------------------
> > >  pswpout            415,709
> > >  sys time (sec)      301.02
> > >  Throughput KB/s    155,970
> > >  memcg_high events    5,068
> > >  --------------------------
> > >
> > >
> > >  ZSWAP                  lz4         lz4         lz4     lzo-rle
> > >  --------------------------------------------------------------
> > >  zswpout          1,598,550   1,515,151   1,449,432   1,493,917
> > >  sys time (sec)      889.36      481.21      581.22      635.75
> > >  Throughput KB/s     35,176      14,765      20,253      21,407
> > >  memcg_high events  321,923     412,733     369,976     375,116
> > >  --------------------------------------------------------------
> > >
> > >  This shows that there is a performance regression of -60% to -195% with
> > >  zswap as compared to SSD with 4K folios. The higher swapout activity with
> > >  zswap is seen here too (i.e., this doesn't appear to be mTHP-specific).
> > >
> > >  I verified this to be the case even with the v6.7 kernel, which also
> > >  showed a 2.3X throughput improvement when we don't charge zswap:
> > >
> > >  ZSWAP lz4                 v6.7      v6.7 with no cgroup zswap charge
> > >  --------------------------------------------------------------------
> > >  zswpout              1,419,802       1,398,620
> > >  sys time (sec)           535.4          613.41
> >
> > systime increases without zswap cgroup charging? That's strange...
>
> Additional data gathered with v6.11-rc3 (listed below) based on your suggestion
> to investigate potential swap.high breaches should hopefully provide some
> explanation.
>
> >
> > >  Throughput KB/s          8,671          20,045
> > >  memcg_high events      574,046         451,859
> >
> > So, on 4k folio setup, even without cgroup charge, we are still seeing:
> >
> > 1. More zswpout (than observed in SSD)
> > 2. 40-50% worse latency - in fact it is worse without zswap cgroup charging.
> > 3. 100 times the amount of memcg_high events? This is perhaps the
> > *strangest* to me. You're already removing zswap cgroup charging, then
> > where does this come from? How can we have memory.high violation when
> > zswap does *not* contribute to memory usage?
> >
> > Is this due to swap limit charging? Do you have a cgroup swap limit?
> >
> > mem_high = page_counter_read(&memcg->memory) >
> >            READ_ONCE(memcg->memory.high);
> > swap_high = page_counter_read(&memcg->swap) >
> >            READ_ONCE(memcg->swap.high);
> > [...]
> >
> > if (mem_high || swap_high) {
> >     /*
> >     * The allocating tasks in this cgroup will need to do
> >     * reclaim or be throttled to prevent further growth
> >     * of the memory or swap footprints.
> >     *
> >     * Target some best-effort fairness between the tasks,
> >     * and distribute reclaim work and delay penalties
> >     * based on how much each task is actually allocating.
> >     */
> >     current->memcg_nr_pages_over_high += batch;
> >     set_notify_resume(current);
> >     break;
> > }
> >
>
> I don't have a swap.high limit set on the cgroup; it is set to "max".
>
> I ran experiments with v6.11-rc3, no zswap charging, 4K folios and different
> zswap compressors to verify if swap.high is breached with the 4G SSD swapfile.
>
>  SSD (CONFIG_ZSWAP is OFF):
>
>                                 SSD          SSD          SSD
>  ------------------------------------------------------------
>  pswpout                    415,709    1,032,170      636,582
>  sys time (sec)              301.02       328.15       306.98
>  Throughput KB/s            155,970       89,621      122,219
>  memcg_high events            5,068       15,072        8,344
>  memcg_swap_high events           0            0            0
>  memcg_swap_fail events           0            0            0
>  ------------------------------------------------------------
>
>  ZSWAP                               zstd         zstd       zstd
>  ----------------------------------------------------------------
>  zswpout                        1,391,524    1,382,965  1,417,307
>  sys time (sec)                    474.68       568.24     489.80
>  Throughput KB/s                   26,099       23,404    111,115
>  memcg_high events                335,112      340,335    162,260
>  memcg_swap_high events                 0            0          0
>  memcg_swap_fail events         1,226,899    5,742,153
>   (mem_cgroup_try_charge_swap)
>  memcg_memory_stat_pgactivate   1,259,547
>   (shrink_folio_list)
>  ----------------------------------------------------------------
>
>  ZSWAP                      lzo-rle      lzo-rle     lzo-rle
>  -----------------------------------------------------------
>  zswpout                  1,493,917    1,363,040   1,428,133
>  sys time (sec)              635.75       498.63      484.65
>  Throughput KB/s             21,407       23,827      20,237
>  memcg_high events          375,116      352,814     373,667
>  memcg_swap_high events           0            0           0
>  memcg_swap_fail events     715,211
>  -----------------------------------------------------------
>
>  ZSWAP                         lz4         lz4        lz4          lz4
>  ---------------------------------------------------------------------
>  zswpout                 1,378,781   1,598,550   1,515,151   1,449,432
>  sys time (sec)             495.45      889.36      481.21      581.22
>  Throughput KB/s            26,248      35,176      14,765      20,253
>  memcg_high events         347,209     321,923     412,733     369,976
>  memcg_swap_high events          0           0           0           0
>  memcg_swap_fail events    580,103           0
>  ---------------------------------------------------------------------
>
>  ZSWAP                  deflate-iaa   deflate-iaa    deflate-iaa
>  ----------------------------------------------------------------
>  zswpout                    380,471     1,440,902      1,397,965
>  sys time (sec)              329.06        570.77         467.41
>  Throughput KB/s            283,867        28,403        190,600
>  memcg_high events            5,551       422,831         28,154
>  memcg_swap_high events           0             0              0
>  memcg_swap_fail events           0     2,686,758        438,562
>  ----------------------------------------------------------------

Why are there 3 columns for each of the compressors? Is this different
runs of the same workload?

And why do some columns have missing cells?

>
> There are no swap.high memcg events recorded in any of the SSD/zswap
>  experiments. However, I do see a significant number of memcg_swap_fail
>  events in some of the zswap runs, for all 3 compressors. This is not
>  consistent, because there are some runs with 0 memcg_swap_fail for all
>  compressors.
>
>  There is a possible correlation between memcg_swap_fail events
>  (/sys/fs/cgroup/test/memory.swap.events) and the high # of memcg_high
>  events. The root-cause appears to be that there are no available swap
>  slots, memcg_swap_fail is incremented, add_to_swap() fails in
>  shrink_folio_list(), followed by "activate_locked:" for the folio.
>  The folio re-activation is recorded in cgroup memory.stat pgactivate
>  events. The failure to swap out folios due to lack of swap slots could
>  contribute towards memory.high breaches.

Yeah FWIW, that was gonna be my first suggestion. This swapfile size
is wayyyy too small...

But that said, the link is not clear to me at all. The only thing I
can think of is lz4's performance sucks so bad that it's not saving
enough memory, leading to regression. And since it's still taking up
swap slot, we cannot use swap either?

>
>  However, this is probably not the only cause for either the high # of
>  memory.high breaches or the over-reclaim with zswap, as seen in the lz4
>  data where the memory.high is significant even in cases where there are no
>  memcg_swap_fails.
>
> Some observations/questions based on the above 4K folios swapout data:
>
> 1) There are more memcg_high events as the swapout latency reduces
>    (i.e. faster swap-write path). This is even without charging zswap
>    utilization to the cgroup.

This is still inexplicable to me. If we are not charging zswap usage,
we shouldn't even be triggering the reclaim_high() path, no?

I'm curious - can you use bpftrace to track where/when reclaim_high
is being called?

>
> 2) There appears to be a direct correlation between higher # of
>    memcg_swap_fail events, and an increase in memcg_high breaches and
>    reduction in usemem throughput. This combined with the observation in
>    (1) suggests that with a faster compressor, we need more swap slots,
>    that increases the probability of running out of swap slots with the 4G
>    SSD backing device.
>
> 3) Could the data shared earlier on reduction in memcg_high breaches with
>    64K mTHP swapout provide some more clues, if we agree with (1) and (2):
>
>    "Interestingly, the # of memcg_high events reduces significantly with 64K
>    mTHP as compared to the above 4K memcg_high events data, when tested
>    with v4 and no zswap charge: 3,069 (SSD-mTHP) and 19,656 (ZSWAP-mTHP)."
>
> 4) In the case of each zswap compressor, there are some runs that go
>    through with 0 memcg_swap_fail events. These runs generally have
>    fewer memcg_high breaches and better sys time/throughput.
>
> 5) For a given swap setup, there is some amount of variance in
>    sys time for this workload.
>
> 6) All this suggests that the primary root cause is the concurrency setup,
>    where there could be randomness between runs as to the # of processes
>    that observe the memory.high breach due to other factors such as
>    availability of swap slots for alloc.
>
> To summarize, I believe the root-cause is the 4G SSD swapfile resulting in
> running out of swap slots, and anomalous behavior with over-reclaim when 70
> concurrent processes are working with the 60G memory limit while trying to
> allocate 1G each; with randomness in processes reacting to the breach.
>
> The cgroup zswap charging exacerbates this situation, but is not a problem
> in and of itself.
>
> Nhat, as you pointed out, this is somewhat of an unrealistic scenario that
> doesn't seem to indicate any specific problems to be solved, other than the
> temporary cgroup zswap double-charging.
>
> Would it be fair to evaluate this patch-series based on a more realistic
> swapfile configuration based on 176G ZRAM, for which I had shared the data
> in v2? There weren't any problems with swap slots availability or any
> anomalies that I can think of with this setup, other than the fact that the
> "Before" and "After" sys times could not be directly compared for 2 key
> reasons:
>
>  - ZRAM compressed data is not charged to the cgroup, similar to SSD.
>  - ZSWAP compressed data is charged to the cgroup.

Yeah that's a bit unfair still. Wild idea, but what if we compare
SSD without zswap (or SSD with zswap, but without this patch series so
that mTHP are not zswapped) vs. zswap-on-zram (i.e., with a backing
swapfile on zram block device)?

It is stupid, I know. But let's take advantage of the fact that zram
is not charged to cgroup, pretending that its memory footprint is
empty?

I don't know how zram works though, so my apologies if it's a stupid
suggestion :)

RE: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios

Posted by Sridhar, Kanchana P 1 year, 5 months ago

> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Tuesday, August 27, 2024 8:24 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> 
> On Mon, Aug 26, 2024 at 11:08 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > > Internally, we often see 1-3 or 1-4 saving ratio (or even more).
> >
> > Agree with this as well. In our experiments with other workloads, we
> > typically see much higher ratios.
> >
> > >
> > > Probably does not explain everything, but worth double checking -
> > > could you check with zstd to see if the ratio improves.
> >
> > Sure. I gathered ratio and compressed memory footprint data today with
> > 64K mTHP, the 4G SSD swapfile and different zswap compressors.
> >
> >  This patch-series and no zswap charging, 64K mTHP:
> > ---------------------------------------------------------------------------
> >                        Total         Total     Average      Average   Comp
> >                   compressed   compression  compressed  compression  ratio
> >                       length       latency      length      latency
> >                        bytes  milliseconds       bytes  nanoseconds
> > ---------------------------------------------------------------------------
> > SSD (no zswap) 1,362,296,832       887,861
> > lz4            2,610,657,430        55,984       2,055      44,065    1.99
> > zstd             729,129,528        50,986         565      39,510    7.25
> > deflate-iaa    1,286,533,438        44,785       1,415      49,252    2.89
> > ---------------------------------------------------------------------------
> >
> > zstd does very well on ratio, as expected.
> 
> Wait. So zstd is displaying 7-to-1 compression ratio? And has *lower*
> average latency?
> 
> Why are we running benchmark on lz4 again? Sure there is no free lunch
> and no compressor that works well on all kinds of data, but lz4's
> performance here is so bad that it's borderline justifiable to
> disable/bypass zswap with this kind of compression ratio...
> 
> Can I ask you to run benchmarking on zstd from now on?

Sure, will do.

> 
> >
> > >
> > > >
> > > >
> > > >  Experiment 3 - 4K folios swap characteristics SSD vs. ZSWAP:
> > > >  ------------------------------------------------------------
> > > >
> > > >  I wanted to take a step back and understand how the mainline v6.11-rc3
> > > >  handles 4K folios when swapped out to SSD (CONFIG_ZSWAP is off) and
> > > >  when swapped out to ZSWAP. Interestingly, higher swapout activity is
> > > >  observed with 4K folios and v6.11-rc3 (with the debug change to not
> > > >  charge zswap to cgroup).
> > > >
> > > >  v6.11-rc3 with no zswap charge, only 4K folios, no (m)THP:
> > > >
> > > >  -------------------------------------------------------------
> > > >  SSD (CONFIG_ZSWAP is OFF)       ZSWAP          lz4    lzo-rle
> > > >  -------------------------------------------------------------
> > > >  cgroup memory.events:           cgroup memory.events:
> > > >
> > > >  low                 0           low              0          0
> > > >  high            5,068           high       321,923    375,116
> > > >  max                 0           max              0          0
> > > >  oom                 0           oom              0          0
> > > >  oom_kill            0           oom_kill         0          0
> > > >  oom_group_kill      0           oom_group_kill   0          0
> > > >  -------------------------------------------------------------
> > > >
> > > >  SSD (CONFIG_ZSWAP is OFF):
> > > >  --------------------------
> > > >  pswpout            415,709
> > > >  sys time (sec)      301.02
> > > >  Throughput KB/s    155,970
> > > >  memcg_high events    5,068
> > > >  --------------------------
> > > >
> > > >
> > > >  ZSWAP                  lz4         lz4         lz4     lzo-rle
> > > >  --------------------------------------------------------------
> > > >  zswpout          1,598,550   1,515,151   1,449,432   1,493,917
> > > >  sys time (sec)      889.36      481.21      581.22      635.75
> > > >  Throughput KB/s     35,176      14,765      20,253      21,407
> > > >  memcg_high events  321,923     412,733     369,976     375,116
> > > >  --------------------------------------------------------------
> > > >
> > > >  This shows that there is a performance regression of -60% to -195%
> > > >  with zswap as compared to SSD with 4K folios. The higher swapout
> > > >  activity with zswap is seen here too (i.e., this doesn't appear to be
> > > >  mTHP-specific).
> > > >
> > > >  I verified this to be the case even with the v6.7 kernel, which also
> > > >  showed a 2.3X throughput improvement when we don't charge zswap:
> > > >
> > > >  ZSWAP lz4                 v6.7      v6.7 with no cgroup zswap charge
> > > >  --------------------------------------------------------------------
> > > >  zswpout              1,419,802       1,398,620
> > > >  sys time (sec)           535.4          613.41
> > >
> > > systime increases without zswap cgroup charging? That's strange...
> >
> > Additional data gathered with v6.11-rc3 (listed below) based on your
> > suggestion to investigate potential swap.high breaches should hopefully
> > provide some explanation.
> >
> > >
> > > >  Throughput KB/s          8,671          20,045
> > > >  memcg_high events      574,046         451,859
> > >
> > > So, on 4k folio setup, even without cgroup charge, we are still seeing:
> > >
> > > 1. More zswpout (than observed in SSD)
> > > 2. 40-50% worse latency - in fact it is worse without zswap cgroup
> > > charging.
> > > 3. 100 times the amount of memcg_high events? This is perhaps the
> > > *strangest* to me. You're already removing zswap cgroup charging, then
> > > where does this come from? How can we have memory.high violation
> > > when zswap does *not* contribute to memory usage?
> > >
> > > Is this due to swap limit charging? Do you have a cgroup swap limit?
> > >
> > > mem_high = page_counter_read(&memcg->memory) >
> > >            READ_ONCE(memcg->memory.high);
> > > swap_high = page_counter_read(&memcg->swap) >
> > >            READ_ONCE(memcg->swap.high);
> > > [...]
> > >
> > > if (mem_high || swap_high) {
> > >     /*
> > >     * The allocating tasks in this cgroup will need to do
> > >     * reclaim or be throttled to prevent further growth
> > >     * of the memory or swap footprints.
> > >     *
> > >     * Target some best-effort fairness between the tasks,
> > >     * and distribute reclaim work and delay penalties
> > >     * based on how much each task is actually allocating.
> > >     */
> > >     current->memcg_nr_pages_over_high += batch;
> > >     set_notify_resume(current);
> > >     break;
> > > }
> > >
> >
> > I don't have a swap.high limit set on the cgroup; it is set to "max".
> >
> > I ran experiments with v6.11-rc3, no zswap charging, 4K folios and different
> > zswap compressors to verify if swap.high is breached with the 4G SSD
> > swapfile.
> >
> >  SSD (CONFIG_ZSWAP is OFF):
> >
> >                                 SSD          SSD          SSD
> >  ------------------------------------------------------------
> >  pswpout                    415,709    1,032,170      636,582
> >  sys time (sec)              301.02       328.15       306.98
> >  Throughput KB/s            155,970       89,621      122,219
> >  memcg_high events            5,068       15,072        8,344
> >  memcg_swap_high events           0            0            0
> >  memcg_swap_fail events           0            0            0
> >  ------------------------------------------------------------
> >
> >  ZSWAP                               zstd         zstd       zstd
> >  ----------------------------------------------------------------
> >  zswpout                        1,391,524    1,382,965  1,417,307
> >  sys time (sec)                    474.68       568.24     489.80
> >  Throughput KB/s                   26,099       23,404    111,115
> >  memcg_high events                335,112      340,335    162,260
> >  memcg_swap_high events                 0            0          0
> >  memcg_swap_fail events         1,226,899    5,742,153
> >   (mem_cgroup_try_charge_swap)
> >  memcg_memory_stat_pgactivate   1,259,547
> >   (shrink_folio_list)
> >  ----------------------------------------------------------------
> >
> >  ZSWAP                      lzo-rle      lzo-rle     lzo-rle
> >  -----------------------------------------------------------
> >  zswpout                  1,493,917    1,363,040   1,428,133
> >  sys time (sec)              635.75       498.63      484.65
> >  Throughput KB/s             21,407       23,827      20,237
> >  memcg_high events          375,116      352,814     373,667
> >  memcg_swap_high events           0            0           0
> >  memcg_swap_fail events     715,211
> >  -----------------------------------------------------------
> >
> >  ZSWAP                         lz4         lz4        lz4          lz4
> >  ---------------------------------------------------------------------
> >  zswpout                 1,378,781   1,598,550   1,515,151   1,449,432
> >  sys time (sec)             495.45      889.36      481.21      581.22
> >  Throughput KB/s            26,248      35,176      14,765      20,253
> >  memcg_high events         347,209     321,923     412,733     369,976
> >  memcg_swap_high events          0           0           0           0
> >  memcg_swap_fail events    580,103           0
> >  ---------------------------------------------------------------------
> >
> >  ZSWAP                  deflate-iaa   deflate-iaa    deflate-iaa
> >  ----------------------------------------------------------------
> >  zswpout                    380,471     1,440,902      1,397,965
> >  sys time (sec)              329.06        570.77         467.41
> >  Throughput KB/s            283,867        28,403        190,600
> >  memcg_high events            5,551       422,831         28,154
> >  memcg_swap_high events           0             0              0
> >  memcg_swap_fail events           0     2,686,758        438,562
> >  ----------------------------------------------------------------
> 
> Why are there 3 columns for each of the compressors? Is this different
> runs of the same workload?
> 
> And why do some columns have missing cells?

Yes, these are different runs of the same workload. Since there is some
amount of variance seen in the data, I figured it is best to publish the
metrics from the individual runs rather than averaging.

Some of these runs were gathered earlier with the same code base,
however, I wasn't monitoring/logging the memcg_swap_high/memcg_swap_fail
events at that time. For those runs, just these two counters have missing
column entries; the rest of the data is still valid.
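
To put a rough number on the run-to-run variance, one option is the coefficient of variation (sample standard deviation over mean) of the per-run sys times. The sketch below uses the lz4 and zstd sys time values from the tables above; the choice of metric and the helper name are illustrative assumptions, and with only 3-4 runs per compressor the result is indicative at best.

```python
from statistics import mean, stdev

# Per-run sys time (seconds), copied from the 4K folios ZSWAP tables above.
sys_time_sec = {
    "lz4": [495.45, 889.36, 481.21, 581.22],
    "zstd": [474.68, 568.24, 489.80],
}

def summarize(samples):
    """Return (mean, coefficient of variation) for a list of run results."""
    m = mean(samples)
    return m, stdev(samples) / m  # stdev() is the sample standard deviation

summary = {name: summarize(runs) for name, runs in sys_time_sec.items()}

for name, (m, cv) in summary.items():
    print(f"{name:5s} mean sys time {m:8.2f} s, CV {cv:.1%}")
```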

> 
> >
> > There are no swap.high memcg events recorded in any of the SSD/zswap
> >  experiments. However, I do see a significant number of memcg_swap_fail
> >  events in some of the zswap runs, for all 3 compressors. This is not
> >  consistent, because there are some runs with 0 memcg_swap_fail for all
> >  compressors.
> >
> >  There is a possible correlation between memcg_swap_fail events
> >  (/sys/fs/cgroup/test/memory.swap.events) and the high # of memcg_high
> >  events. The root-cause appears to be that there are no available swap
> >  slots, memcg_swap_fail is incremented, add_to_swap() fails in
> >  shrink_folio_list(), followed by "activate_locked:" for the folio.
> >  The folio re-activation is recorded in cgroup memory.stat pgactivate
> >  events. The failure to swap out folios due to lack of swap slots could
> >  contribute towards memory.high breaches.
> 
> Yeah FWIW, that was gonna be my first suggestion. This swapfile size
> is wayyyy too small...
> 
> But that said, the link is not clear to me at all. The only thing I
> can think of is lz4's performance sucks so bad that it's not saving
> enough memory, leading to regression. And since it's still taking up
> swap slot, we cannot use swap either?

The occurrence of memcg_swap_fail events establishes that swap slots
are not available with 4G of swap space. This causes those 4K folios to
remain in memory, which can worsen an existing problem with memory.high
breaches.

However, swap slot exhaustion is not the only contributor to the memcg_high
events that still occur without zswap charging. Col 2 of the lz4 table shows
321,923 memcg_high events alongside 0 memcg_swap_fail events in the cgroup
stats.
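
As a rough illustration of this point, the per-run lz4 numbers can be checked
with a short script. This is only a sketch: the `runs` list is transcribed by
hand from the lz4 table, `None` stands for runs where the swap counters were
not logged, and the helper name `high_without_swap_fail` is made up for this
example.

```python
# Per-run lz4 data transcribed from the lz4 table (v6.11-rc3, no zswap
# charging, 4K folios). None = counter was not logged for that run.
runs = [
    {"memcg_high": 347_209, "memcg_swap_fail": 580_103},
    {"memcg_high": 321_923, "memcg_swap_fail": 0},
    {"memcg_high": 412_733, "memcg_swap_fail": None},
    {"memcg_high": 369_976, "memcg_swap_fail": None},
]


def high_without_swap_fail(runs):
    """Return memcg_high counts for runs that logged zero memcg_swap_fail."""
    return [r["memcg_high"] for r in runs if r["memcg_swap_fail"] == 0]


# Runs with missing counters are skipped rather than treated as zero.
print(high_without_swap_fail(runs))
```

Only one logged run qualifies, so this is a single data point rather than a
trend.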

> 
> >
> >  However, this is probably not the only cause for either the high # of
> >  memory.high breaches or the over-reclaim with zswap, as seen in the lz4
> >  data where the memory.high is significant even in cases where there are no
> >  memcg_swap_fails.
> >
> > Some observations/questions based on the above 4K folios swapout data:
> >
> > 1) There are more memcg_high events as the swapout latency reduces
> >    (i.e. faster swap-write path). This is even without charging zswap
> >    utilization to the cgroup.
> 
> This is still inexplicable to me. If we are not charging zswap usage,
> we shouldn't even be triggering the reclaim_high() path, no?
> 
> I'm curious - can you use bpftrace to track where/when reclaim_high
> is being called?

I had confirmed earlier with counters that all calls to reclaim_high()
were from include/linux/resume_user_mode.h::resume_user_mode_work().
I will confirm this with zstd and bpftrace and share.

Thanks,
Kanchana

> 
> >
> > 2) There appears to be a direct co-relation between higher # of
> >    memcg_swap_fail events, and an increase in memcg_high breaches and
> >    reduction in usemem throughput. This combined with the observation in
> >    (1) suggests that with a faster compressor, we need more swap slots,
> >    that increases the probability of running out of swap slots with the 4G
> >    SSD backing device.
> >
> > 3) Could the data shared earlier on reduction in memcg_high breaches with
> >    64K mTHP swapout provide some more clues, if we agree with (1) and (2):
> >
> >    "Interestingly, the # of memcg_high events reduces significantly with 64K
> >    mTHP as compared to the above 4K memcg_high events data, when tested
> >    with v4 and no zswap charge: 3,069 (SSD-mTHP) and 19,656 (ZSWAP-mTHP)."
> >
> > 4) In the case of each zswap compressor, there are some runs that go
> >    through with 0 memcg_swap_fail events. These runs generally have
> >    fewer memcg_high breaches and better sys time/throughput.
> >
> > 5) For a given swap setup, there is some amount of variance in
> >    sys time for this workload.
> >
> > 6) All this suggests that the primary root cause is the concurrency setup,
> >    where there could be randomness between runs as to the # of processes
> >    that observe the memory.high breach due to other factors such as
> >    availability of swap slots for alloc.
> >
> > To summarize, I believe the root-cause is the 4G SSD swapfile resulting in
> > running out of swap slots, and anomalous behavior with over-reclaim when 70
> > concurrent processes are working with the 60G memory limit while trying to
> > allocate 1G each; with randomness in processes reacting to the breach.
> >
> > The cgroup zswap charging exacerbates this situation, but is not a problem
> > in and of itself.
> >
> > Nhat, as you pointed out, this is somewhat of an unrealistic scenario that
> > doesn't seem to indicate any specific problems to be solved, other than the
> > temporary cgroup zswap double-charging.
> >
> > Would it be fair to evaluate this patch-series based on a more realistic
> > swapfile configuration based on 176G ZRAM, for which I had shared the data
> > in v2? There weren't any problems with swap slots availability or any
> > anomalies that I can think of with this setup, other than the fact that the
> > "Before" and "After" sys times could not be directly compared for 2 key
> > reasons:
> >
> >  - ZRAM compressed data is not charged to the cgroup, similar to SSD.
> >  - ZSWAP compressed data is charged to the cgroup.
> 
> Yeah that's a bit unfair still. Wild idea, but what about we compare
> SSD without zswap (or SSD with zswap, but without this patch series so
> that mTHP are not zswapped) v.s zswap-on-zram (i.e with a backing
> swapfile on zram block device).
> 
> It is stupid, I know. But let's take advantage of the fact that zram
> is not charged to cgroup, pretending that its memory foot print is
> empty?
> 
> I don't know how zram works though, so my apologies if it's a stupid
> suggestion :)

RE: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios

Posted by Sridhar, Kanchana P 1 year, 5 months ago


> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Sent: Tuesday, August 27, 2024 11:42 AM
> To: Nhat Pham <nphamcs@gmail.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>;
> Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Subject: RE: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> 
> 
> > -----Original Message-----
> > From: Nhat Pham <nphamcs@gmail.com>
> > Sent: Tuesday, August 27, 2024 8:24 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com;
> > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> >
> > On Mon, Aug 26, 2024 at 11:08 PM Sridhar, Kanchana P
> > <kanchana.p.sridhar@intel.com> wrote:
> > >
> > > > Internally, we often see 1-3 or 1-4 saving ratio (or even more).
> > >
> > > Agree with this as well. In our experiments with other workloads, we
> > > typically see much higher ratios.
> > >
> > > >
> > > > Probably does not explain everything, but worth double checking -
> > > > could you check with zstd to see if the ratio improves.
> > >
> > > Sure. I gathered ratio and compressed memory footprint data today with
> > > 64K mTHP, the 4G SSD swapfile and different zswap compressors.
> > >
> > >  This patch-series and no zswap charging, 64K mTHP:
> > > ---------------------------------------------------------------------------
> > >                        Total         Total     Average      Average   Comp
> > >                   compressed   compression  compressed  compression  ratio
> > >                       length       latency      length      latency
> > >                        bytes  milliseconds       bytes  nanoseconds
> > > ---------------------------------------------------------------------------
> > > SSD (no zswap) 1,362,296,832       887,861
> > > lz4            2,610,657,430        55,984       2,055      44,065    1.99
> > > zstd             729,129,528        50,986         565      39,510    7.25
> > > deflate-iaa    1,286,533,438        44,785       1,415      49,252    2.89
> > > ---------------------------------------------------------------------------
> > >
> > > zstd does very well on ratio, as expected.
> >
> > Wait. So zstd is displaying 7-to-1 compression ratio? And has *lower*
> > average latency?
> >
> > Why are we running benchmark on lz4 again? Sure there is no free lunch
> > and no compressor that works well on all kind of data, but lz4's
> > performance here is so bad that it's borderline justifiable to
> > disable/bypass zswap with this kind of compression ratio...
> >
> > Can I ask you to run benchmarking on zstd from now on?
> 
> Sure, will do.
> 
> >
> > >
> > > >
> > > > >
> > > > >
> > > > >  Experiment 3 - 4K folios swap characteristics SSD vs. ZSWAP:
> > > > >  ------------------------------------------------------------
> > > > >
> > > > >  I wanted to take a step back and understand how the mainline v6.11-rc3
> > > > >  handles 4K folios when swapped out to SSD (CONFIG_ZSWAP is off) and when
> > > > >  swapped out to ZSWAP. Interestingly, higher swapout activity is observed
> > > > >  with 4K folios and v6.11-rc3 (with the debug change to not charge zswap to
> > > > >  cgroup).
> > > > >
> > > > >  v6.11-rc3 with no zswap charge, only 4K folios, no (m)THP:
> > > > >
> > > > >  -------------------------------------------------------------
> > > > >  SSD (CONFIG_ZSWAP is OFF)       ZSWAP          lz4    lzo-rle
> > > > >  -------------------------------------------------------------
> > > > >  cgroup memory.events:           cgroup memory.events:
> > > > >
> > > > >  low                 0           low              0          0
> > > > >  high            5,068           high       321,923    375,116
> > > > >  max                 0           max              0          0
> > > > >  oom                 0           oom              0          0
> > > > >  oom_kill            0           oom_kill         0          0
> > > > >  oom_group_kill      0           oom_group_kill   0          0
> > > > >  -------------------------------------------------------------
> > > > >
> > > > >  SSD (CONFIG_ZSWAP is OFF):
> > > > >  --------------------------
> > > > >  pswpout            415,709
> > > > >  sys time (sec)      301.02
> > > > >  Throughput KB/s    155,970
> > > > >  memcg_high events    5,068
> > > > >  --------------------------
> > > > >
> > > > >
> > > > >  ZSWAP                  lz4         lz4         lz4     lzo-rle
> > > > >  --------------------------------------------------------------
> > > > >  zswpout          1,598,550   1,515,151   1,449,432   1,493,917
> > > > >  sys time (sec)      889.36      481.21      581.22      635.75
> > > > >  Throughput KB/s     35,176      14,765      20,253      21,407
> > > > >  memcg_high events  321,923     412,733     369,976     375,116
> > > > >  --------------------------------------------------------------
> > > > >
> > > > >  This shows that there is a performance regression of -60% to -195% with
> > > > >  zswap as compared to SSD with 4K folios. The higher swapout activity with
> > > > >  zswap is seen here too (i.e., this doesn't appear to be mTHP-specific).
> > > > >
> > > > >  I verified this to be the case even with the v6.7 kernel, which also
> > > > >  showed a 2.3X throughput improvement when we don't charge zswap:
> > > > >
> > > > >  ZSWAP lz4                 v6.7      v6.7 with no cgroup zswap charge
> > > > >  --------------------------------------------------------------------
> > > > >  zswpout              1,419,802       1,398,620
> > > > >  sys time (sec)           535.4          613.41
> > > >
> > > > systime increases without zswap cgroup charging? That's strange...
> > >
> > > Additional data gathered with v6.11-rc3 (listed below) based on your suggestion
> > > to investigate potential swap.high breaches should hopefully provide some
> > > explanation.
> > >
> > > >
> > > > >  Throughput KB/s          8,671          20,045
> > > > >  memcg_high events      574,046         451,859
> > > >
> > > > So, on 4k folio setup, even without cgroup charge, we are still seeing:
> > > >
> > > > 1. More zswpout (than observed in SSD)
> > > > 2. 40-50% worse latency - in fact it is worse without zswap cgroup charging.
> > > > 3. 100 times the amount of memcg_high events? This is perhaps the
> > > > *strangest* to me. You're already removing zswap cgroup charging, then
> > > > where does this come from? How can we have memory.high violation when
> > > > zswap does *not* contribute to memory usage?
> > > >
> > > > Is this due to swap limit charging? Do you have a cgroup swap limit?
> > > >
> > > > mem_high = page_counter_read(&memcg->memory) >
> > > >            READ_ONCE(memcg->memory.high);
> > > > swap_high = page_counter_read(&memcg->swap) >
> > > >            READ_ONCE(memcg->swap.high);
> > > > [...]
> > > >
> > > > if (mem_high || swap_high) {
> > > >     /*
> > > >     * The allocating tasks in this cgroup will need to do
> > > >     * reclaim or be throttled to prevent further growth
> > > >     * of the memory or swap footprints.
> > > >     *
> > > >     * Target some best-effort fairness between the tasks,
> > > >     * and distribute reclaim work and delay penalties
> > > >     * based on how much each task is actually allocating.
> > > >     */
> > > >     current->memcg_nr_pages_over_high += batch;
> > > >     set_notify_resume(current);
> > > >     break;
> > > > }
> > > >
> > >
> > > I don't have a swap.high limit set on the cgroup; it is set to "max".
> > >
> > > I ran experiments with v6.11-rc3, no zswap charging, 4K folios and different
> > > zswap compressors to verify if swap.high is breached with the 4G SSD swapfile.
> > >
> > >  SSD (CONFIG_ZSWAP is OFF):
> > >
> > >                                 SSD          SSD          SSD
> > >  ------------------------------------------------------------
> > >  pswpout                    415,709    1,032,170      636,582
> > >  sys time (sec)              301.02       328.15       306.98
> > >  Throughput KB/s            155,970       89,621      122,219
> > >  memcg_high events            5,068       15,072        8,344
> > >  memcg_swap_high events           0            0            0
> > >  memcg_swap_fail events           0            0            0
> > >  ------------------------------------------------------------
> > >
> > >  ZSWAP                               zstd         zstd       zstd
> > >  ----------------------------------------------------------------
> > >  zswpout                        1,391,524    1,382,965  1,417,307
> > >  sys time (sec)                    474.68       568.24     489.80
> > >  Throughput KB/s                   26,099       23,404    111,115
> > >  memcg_high events                335,112      340,335    162,260
> > >  memcg_swap_high events                 0            0          0
> > >  memcg_swap_fail events         1,226,899    5,742,153
> > >   (mem_cgroup_try_charge_swap)
> > >  memcg_memory_stat_pgactivate   1,259,547
> > >   (shrink_folio_list)
> > >  ----------------------------------------------------------------
> > >
> > >  ZSWAP                      lzo-rle      lzo-rle     lzo-rle
> > >  -----------------------------------------------------------
> > >  zswpout                  1,493,917    1,363,040   1,428,133
> > >  sys time (sec)              635.75       498.63      484.65
> > >  Throughput KB/s             21,407       23,827      20,237
> > >  memcg_high events          375,116      352,814     373,667
> > >  memcg_swap_high events           0            0           0
> > >  memcg_swap_fail events     715,211
> > >  -----------------------------------------------------------
> > >
> > >  ZSWAP                         lz4         lz4        lz4          lz4
> > >  ---------------------------------------------------------------------
> > >  zswpout                 1,378,781   1,598,550   1,515,151   1,449,432
> > >  sys time (sec)             495.45      889.36      481.21      581.22
> > >  Throughput KB/s            26,248      35,176      14,765      20,253
> > >  memcg_high events         347,209     321,923     412,733     369,976
> > >  memcg_swap_high events          0           0           0           0
> > >  memcg_swap_fail events    580,103           0
> > >  ---------------------------------------------------------------------
> > >
> > >  ZSWAP                  deflate-iaa   deflate-iaa    deflate-iaa
> > >  ----------------------------------------------------------------
> > >  zswpout                    380,471     1,440,902      1,397,965
> > >  sys time (sec)              329.06        570.77         467.41
> > >  Throughput KB/s            283,867        28,403        190,600
> > >  memcg_high events            5,551       422,831         28,154
> > >  memcg_swap_high events           0             0              0
> > >  memcg_swap_fail events           0     2,686,758        438,562
> > >  ----------------------------------------------------------------
> >
> > Why are there 3 columns for each of the compressors? Is this different
> > runs of the same workload?
> >
> > And why do some columns have missing cells?
> 
> Yes, these are different runs of the same workload. Since there is some
> amount of variance seen in the data, I figured it is best to publish the
> metrics from the individual runs rather than averaging.
> 
> Some of these runs were gathered earlier with the same code base,
> however, I wasn't monitoring/logging the memcg_swap_high/memcg_swap_fail
> events at that time. For those runs, just these two counters have missing
> column entries; the rest of the data is still valid.
> 
> >
> > >
> > > There are no swap.high memcg events recorded in any of the SSD/zswap
> > >  experiments. However, I do see significant number of memcg_swap_fail
> > >  events in some of the zswap runs, for all 3 compressors. This is not
> > >  consistent, because there are some runs with 0 memcg_swap_fail for all
> > >  compressors.
> > >
> > >  There is a possible co-relation between memcg_swap_fail events
> > >  (/sys/fs/cgroup/test/memory.swap.events) and the high # of memcg_high
> > >  events. The root-cause appears to be that there are no available swap
> > >  slots, memcg_swap_fail is incremented, add_to_swap() fails in
> > >  shrink_folio_list(), followed by "activate_locked:" for the folio.
> > >  The folio re-activation is recorded in cgroup memory.stat pgactivate
> > >  events. The failure to swap out folios due to lack of swap slots could
> > >  contribute towards memory.high breaches.
> >
> > Yeah FWIW, that was gonna be my first suggestion. This swapfile size
> > is wayyyy too small...
> >
> > But that said, the link is not clear to me at all. The only thing I
> > can think of is lz4's performance sucks so bad that it's not saving
> > enough memory, leading to regression. And since it's still taking up
> > swap slot, we cannot use swap either?
> 
> The occurrence of memcg_swap_fail events establishes that swap slots
> are not available with 4G of swap space. This causes those 4K folios to
> remain in memory, which can worsen an existing problem with memory.high
> breaches.
> 
> However, it is worth noting that this is not the only contributor to
> memcg_high events that still occur without zswap charging. The data shows
> 321,923 occurrences of memcg_high in Col 2 of the lz4 table, that also has
> 0 occurrences of memcg_swap_fail reported in the cgroup stats.
> 
> >
> > >
> > >  However, this is probably not the only cause for either the high # of
> > >  memory.high breaches or the over-reclaim with zswap, as seen in the lz4
> > >  data where the memory.high is significant even in cases where there are no
> > >  memcg_swap_fails.
> > >
> > > Some observations/questions based on the above 4K folios swapout data:
> > >
> > > 1) There are more memcg_high events as the swapout latency reduces
> > >    (i.e. faster swap-write path). This is even without charging zswap
> > >    utilization to the cgroup.
> >
> > This is still inexplicable to me. If we are not charging zswap usage,
> > we shouldn't even be triggering the reclaim_high() path, no?
> >
> > I'm curious - can you use bpftrace to track where/when reclaim_high
> > is being called?

Hi Nhat,

Since reclaim_high() is called only in a handful of places, I figured I
would just use debugfs u64 counters to record where it gets called from.

These are the places where I increment the debugfs counters:

include/linux/resume_user_mode.h:
---------------------------------
diff --git a/include/linux/resume_user_mode.h b/include/linux/resume_user_mode.h
index e0135e0adae0..382f5469e9a2 100644
--- a/include/linux/resume_user_mode.h
+++ b/include/linux/resume_user_mode.h
@@ -24,6 +24,7 @@ static inline void set_notify_resume(struct task_struct *task)
 		kick_process(task);
 }
 
+extern u64 hoh_userland;
 
 /**
  * resume_user_mode_work - Perform work before returning to user mode
@@ -56,6 +57,7 @@ static inline void resume_user_mode_work(struct pt_regs *regs)
 	}
 #endif
 
+	++hoh_userland;
 	mem_cgroup_handle_over_high(GFP_KERNEL);
 	blkcg_maybe_throttle_current();
 

mm/memcontrol.c:
----------------
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f29157288b7d..6738bb670a78 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1910,9 +1910,12 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
 	return nr_reclaimed;
 }
 
+extern u64 rec_high_hwf;
+
 static void high_work_func(struct work_struct *work)
 {
 	struct mem_cgroup *memcg;
+	++rec_high_hwf;
 
 	memcg = container_of(work, struct mem_cgroup, high_work);
 	reclaim_high(memcg, MEMCG_CHARGE_BATCH, GFP_KERNEL);
@@ -2055,6 +2058,8 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
 	return penalty_jiffies * nr_pages / MEMCG_CHARGE_BATCH;
 }
 
+extern u64 rec_high_hoh;
+
 /*
  * Reclaims memory over the high limit. Called directly from
  * try_charge() (context permitting), as well as from the userland
@@ -2097,6 +2102,7 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	 * memory.high is currently batched, whereas memory.max and the page
 	 * allocator run every time an allocation is made.
 	 */
+	++rec_high_hoh;
 	nr_reclaimed = reclaim_high(memcg,
 				    in_retry ? SWAP_CLUSTER_MAX : nr_pages,
 				    gfp_mask);
@@ -2153,6 +2159,8 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	css_put(&memcg->css);
 }
 
+extern u64 hoh_trycharge;
+
 int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		     unsigned int nr_pages)
 {
@@ -2344,8 +2352,10 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 */
 	if (current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH &&
 	    !(current->flags & PF_MEMALLOC) &&
-	    gfpflags_allow_blocking(gfp_mask))
+	    gfpflags_allow_blocking(gfp_mask)) {
+		++hoh_trycharge;
 		mem_cgroup_handle_over_high(gfp_mask);
+	}
 	return 0;
 }
 

I reverted my debug changes for "zswap to not charge cgroup" before running
this next set of experiments, which record how many times and from where
reclaim_high() is called.

zstd is the compressor I have configured for both ZSWAP and ZRAM.

 6.11-rc3 mainline, 176Gi ZRAM backing for ZSWAP, zstd, 64K mTHP:
 ----------------------------------------------------------------
 /sys/fs/cgroup/iax/memory.events:
 high 112,910

 hoh_userland 128,835
 hoh_trycharge 0
 rec_high_hoh 113,079
 rec_high_hwf 0

 6.11-rc3 mainline, 4G SSD backing for ZSWAP, zstd, 64K mTHP:
 ------------------------------------------------------------
 /sys/fs/cgroup/iax/memory.events:
 high 4,693
 
 hoh_userland 14,069
 hoh_trycharge 0
 rec_high_hoh 4,694
 rec_high_hwf 0


 ZSWAP-mTHP, 176Gi ZRAM backing for ZSWAP, zstd, 64K mTHP:
 ---------------------------------------------------------
 /sys/fs/cgroup/iax/memory.events:
 high 139,495
 
 hoh_userland 156,628
 hoh_trycharge 0
 rec_high_hoh 140,039
 rec_high_hwf 0

 ZSWAP-mTHP, 4G SSD backing for ZSWAP, zstd, 64K mTHP:
 -----------------------------------------------------
 /sys/fs/cgroup/iax/memory.events:
 high 20,427
 
 /sys/fs/cgroup/iax/memory.swap.events:
 fail 20,856
 
 hoh_userland 31,346
 hoh_trycharge 0
 rec_high_hoh 20,513
 rec_high_hwf 0

This shows that in all cases, reclaim_high() is called only from the return
path to user mode after handling a page-fault.
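
For anyone who wants to re-check that reading of the counters, here is a small
sketch. The values are transcribed by hand from the four configurations above;
the dictionary layout and the helper name `only_from_user_return` are my own
for this example and are not part of the debug patch. The reasoning it encodes:
hoh_trycharge == 0 rules out the try_charge_memcg() call site, and
rec_high_hwf == 0 rules out high_work_func(), which leaves
resume_user_mode_work() as the only caller.

```python
# Counter values transcribed from the four configurations listed above.
configs = {
    "mainline, 176Gi ZRAM": {
        "high": 112_910, "hoh_userland": 128_835, "hoh_trycharge": 0,
        "rec_high_hoh": 113_079, "rec_high_hwf": 0,
    },
    "mainline, 4G SSD": {
        "high": 4_693, "hoh_userland": 14_069, "hoh_trycharge": 0,
        "rec_high_hoh": 4_694, "rec_high_hwf": 0,
    },
    "ZSWAP-mTHP, 176Gi ZRAM": {
        "high": 139_495, "hoh_userland": 156_628, "hoh_trycharge": 0,
        "rec_high_hoh": 140_039, "rec_high_hwf": 0,
    },
    "ZSWAP-mTHP, 4G SSD": {
        "high": 20_427, "hoh_userland": 31_346, "hoh_trycharge": 0,
        "rec_high_hoh": 20_513, "rec_high_hwf": 0,
    },
}


def only_from_user_return(c):
    """True if no reclaim_high() call came via try_charge or the workqueue."""
    return c["hoh_trycharge"] == 0 and c["rec_high_hwf"] == 0


for name, c in configs.items():
    # Also show how closely the reclaim_high() call count tracks the
    # memory.events "high" count for the same run.
    print(name, only_from_user_return(c), c["rec_high_hoh"] - c["high"])
```

Note that hoh_userland is larger than rec_high_hoh in every run. My assumption
is that this is because mem_cgroup_handle_over_high() returns early when the
task has no pages queued over the limit, so not every return to user mode
reaches reclaim_high().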

Thanks,
Kanchana

> 
> I had confirmed earlier with counters that all calls to reclaim_high()
> were from include/linux/resume_user_mode.h::resume_user_mode_work().
> I will confirm this with zstd and bpftrace and share.
> 
> Thanks,
> Kanchana
> 
> >
> > >
> > > 2) There appears to be a direct co-relation between higher # of
> > >    memcg_swap_fail events, and an increase in memcg_high breaches and
> > >    reduction in usemem throughput. This combined with the observation in
> > >    (1) suggests that with a faster compressor, we need more swap slots,
> > >    that increases the probability of running out of swap slots with the 4G
> > >    SSD backing device.
> > >
> > > 3) Could the data shared earlier on reduction in memcg_high breaches with
> > >    64K mTHP swapout provide some more clues, if we agree with (1) and (2):
> > >
> > >    "Interestingly, the # of memcg_high events reduces significantly with 64K
> > >    mTHP as compared to the above 4K memcg_high events data, when tested
> > >    with v4 and no zswap charge: 3,069 (SSD-mTHP) and 19,656 (ZSWAP-mTHP)."
> > >
> > > 4) In the case of each zswap compressor, there are some runs that go
> > >    through with 0 memcg_swap_fail events. These runs generally have
> > >    fewer memcg_high breaches and better sys time/throughput.
> > >
> > > 5) For a given swap setup, there is some amount of variance in
> > >    sys time for this workload.
> > >
> > > 6) All this suggests that the primary root cause is the concurrency setup,
> > >    where there could be randomness between runs as to the # of processes
> > >    that observe the memory.high breach due to other factors such as
> > >    availability of swap slots for alloc.
> > >
> > > To summarize, I believe the root-cause is the 4G SSD swapfile resulting in
> > > running out of swap slots, and anomalous behavior with over-reclaim when 70
> > > concurrent processes are working with the 60G memory limit while trying to
> > > allocate 1G each; with randomness in processes reacting to the breach.
> > >
> > > The cgroup zswap charging exacerbates this situation, but is not a problem
> > > in and of itself.
> > >
> > > Nhat, as you pointed out, this is somewhat of an unrealistic scenario that
> > > doesn't seem to indicate any specific problems to be solved, other than the
> > > temporary cgroup zswap double-charging.
> > >
> > > Would it be fair to evaluate this patch-series based on a more realistic
> > > swapfile configuration based on 176G ZRAM, for which I had shared the data
> > > in v2? There weren't any problems with swap slots availability or any
> > > anomalies that I can think of with this setup, other than the fact that the
> > > "Before" and "After" sys times could not be directly compared for 2 key
> > > reasons:
> > >
> > >  - ZRAM compressed data is not charged to the cgroup, similar to SSD.
> > >  - ZSWAP compressed data is charged to the cgroup.
> >
> > Yeah that's a bit unfair still. Wild idea, but what about we compare
> > SSD without zswap (or SSD with zswap, but without this patch series so
> > that mTHP are not zswapped) v.s zswap-on-zram (i.e with a backing
> > swapfile on zram block device).
> >
> > It is stupid, I know. But let's take advantage of the fact that zram
> > is not charged to cgroup, pretending that its memory foot print is
> > empty?
> >
> > I don't know how zram works though, so my apologies if it's a stupid
> > suggestion :)

Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios

Posted by Yosry Ahmed 1 year, 5 months ago

[..]
>
> This shows that in all cases, reclaim_high() is called only from the return
> path to user mode after handling a page-fault.

I am sorry I haven't been keeping up with this thread, I don't have a
lot of capacity right now.

If my understanding is correct, the summary of the problem we are
observing here is that with high concurrency (70 processes), we
observe worse system time, worse throughput, and higher memory_high
events with zswap than SSD swap. This is true (with varying degrees)
for 4K or mTHP, and with or without charging zswap compressed memory.

Did I get that right?

I saw you also mentioned that reclaim latency is directly correlated
to higher memory_high events.

Is it possible that with SSD swap, because we wait for IO during
reclaim, this gives a chance for other processes to allocate and free
the memory they need. While with zswap because everything is
synchronous, all processes are trying to allocate their memory at the
same time resulting in higher reclaim rates?

IOW, maybe with zswap all the processes try to allocate their memory
at the same time, so the total amount of memory needed at any given
instance is much higher than memory.high, so we keep producing
memory_high events and reclaiming. If 70 processes all require 1G at
the same time, then we need 70G of memory at once, we will keep
thrashing pages in/out of zswap.

While with SSD swap, due to the waits imposed by IO, the allocations
are more spread out and more serialized, and the amount of memory
needed at any given instance is lower; resulting in less reclaim
activity and ultimately faster overall execution?

Could you please describe what the processes are doing? Are they
allocating memory and holding on to it, or immediately freeing it?

Do you have visibility into when each process allocates and frees memory?

RE: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios

Posted by Sridhar, Kanchana P 1 year, 5 months ago

Hi Yosry,

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, August 28, 2024 12:44 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Nhat Pham <nphamcs@gmail.com>; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org; hannes@cmpxchg.org; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> 
> [..]
> >
> > This shows that in all cases, reclaim_high() is called only from the return
> > path to user mode after handling a page-fault.
> 
> I am sorry I haven't been keeping up with this thread, I don't have a
> lot of capacity right now.
> 
> If my understanding is correct, the summary of the problem we are
> observing here is that with high concurrency (70 processes), we
> observe worse system time, worse throughput, and higher memory_high
> events with zswap than SSD swap. This is true (with varying degrees)
> for 4K or mTHP, and with or without charging zswap compressed memory.
> 
> Did I get that right?

Thanks for your review and comments! Yes, this is correct.

> 
> I saw you also mentioned that reclaim latency is directly correlated
> to higher memory_high events.

That was my observation based on the swap-constrained experiments with 4G SSD.
With a faster compressor, we allow allocations to proceed quickly, and if the pages
are not being faulted in, we need more swap slots. This increases the probability of
running out of swap slots with the 4G SSD backing device, which, as the data in v4
shows, causes memcg_swap_fail events, that drive folios to be resident in memory
(triggering memcg_high breaches as allocations proceed even without zswap cgroup
charging).
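
To put rough numbers on the swap-slot pressure, here is a back-of-the-envelope
sketch. It rests on assumptions that are mine for illustration: "G" is treated
as GiB, all 70 processes are assumed to hold their full 1G at the same time,
and each swapped-out 4K folio is assumed to consume one swap slot on the
backing device even when zswap stores the data.

```python
GIB = 1 << 30
PAGE = 4096  # 4K folio size in bytes

swap_slots = 4 * GIB // PAGE          # slots available in the 4G swapfile
demand_pages = 70 * 1 * GIB // PAGE   # 70 usemem processes x 1G each
limit_pages = 60 * GIB // PAGE        # 60G memory limit on the cgroup
excess_pages = demand_pages - limit_pages

print("swap slots:  ", swap_slots)
print("excess pages:", excess_pages)
print("excess / slots:", excess_pages / swap_slots)
```

Under these assumptions the excess over the memory limit is 2.5 times the
number of swap slots, which is at least consistent with swap slot allocation
failing in some of the runs.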

Things change when the experiments are run in a situation where there is abundant
swap space and when the default behavior of zswap compressed data being charged
to the cgroup is enabled, as in the data with 176GiB ZRAM as ZSWAP's backing
swapfile posted in v5. Now, the critical path to workload performance changes to
concurrent reclaims in response to memcg_high events due to allocation and zswap
usage. We see a lesser increase in swapout activity (as compared to the swap-constrained
experiments in v4), and compress latency seems to become the bottleneck. Each
individual process's throughput/sys time degrades mainly as a function of compress
latency. Anyway, these were some of my learnings from these experiments. Please
do let me know if there are other insights/analysis I could be missing.
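
For anyone wanting to track the memory.high breaches discussed above, here is a
minimal sketch of pulling one counter out of the text of a cgroup v2
memory.events file. It assumes the breaches of interest are reported under the
"high" key of that file; the helper name event_count() is hypothetical and is
not part of any kernel or vm-scalability tooling. It parses a string buffer so
that it does not depend on any particular cgroup path.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: scan "key value" lines (the layout used by cgroup v2
 * memory.events) and return the value for `key`, or -1 if the key is absent.
 */
static long long event_count(const char *text, const char *key)
{
    const char *line = text;

    while (line && *line) {
        char name[64];
        unsigned long long val;

        /* Each line is expected to look like "high 123". */
        if (sscanf(line, "%63s %llu", name, &val) == 2 &&
            strcmp(name, key) == 0)
            return (long long)val;

        /* Advance to the start of the next line, if any. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}
```

Sampling this before and after a run and taking the difference would give the
number of breaches attributable to that run.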

> 
> Is it possible that with SSD swap, because we wait for IO during
> reclaim, this gives a chance for other processes to allocate and free
> the memory they need. While with zswap because everything is
> synchronous, all processes are trying to allocate their memory at the
> same time resulting in higher reclaim rates?
> 
> IOW, maybe with zswap all the processes try to allocate their memory
> at the same time, so the total amount of memory needed at any given
> instance is much higher than memory.high, so we keep producing
> memory_high events and reclaiming. If 70 processes all require 1G at
> the same time, then we need 70G of memory at once, we will keep
> thrashing pages in/out of zswap.
> 
> While with SSD swap, due to the waits imposed by IO, the allocations
> are more spread out and more serialized, and the amount of memory
> needed at any given instance is lower; resulting in less reclaim
> activity and ultimately faster overall execution?

This is a very interesting hypothesis. It is along the lines of the observation
I gathered from the 4G SSD experiments: a "slower compressor" essentially causes
allocation stalls (and buffers us from the swap slots unavailability effect).
I think this is a possibility.

> 
> Could you please describe what the processes are doing? Are they
> allocating memory and holding on to it, or immediately freeing it?

I have been using the vm-scalability usemem workload for these experiments.
Thanks Ying for suggesting I use this workload!

I am running usemem with these config options: usemem --init-time -w -O -n 70 1g.
This forks 70 processes, each of which does the following:

1) Allocates 1G mmap virtual memory with MAP_ANONYMOUS, read/write permissions.
2) Steps through and accesses each 8-byte chunk of memory in the mmap-ed region, and:
    2.a) Writes the index of that chunk to the (unsigned long *) memory at that index.
3) Generates statistics on throughput.

There is an "munmap()" after step (2.a) that I have commented out because I wanted to
see how much cold memory resides in the zswap zpool after the workload exits. Interestingly,
this was 0 for 64K mTHP, but on the order of several hundred MB for 2M THP.
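
The per-process steps described above can be sketched roughly as follows. This
is not the usemem source, only an illustration of the access pattern: the
function names fill_region() and run_workers() are hypothetical, MAP_PRIVATE is
an assumption (only MAP_ANONYMOUS and read/write permissions are stated above),
and the do_unmap flag stands in for the munmap() that was commented out. The
8-byte chunk size assumes an LP64 platform where unsigned long is 8 bytes.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Map `bytes` of anonymous memory and write each chunk's index into it
 * (steps 1 and 2 above). Returns the number of chunks written, 0 on failure.
 */
static size_t fill_region(size_t bytes, int do_unmap)
{
    size_t n = bytes / sizeof(unsigned long);
    unsigned long *p;

    if (n == 0)
        return 0;

    p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 0;

    for (size_t i = 0; i < n; i++)
        p[i] = i;               /* step 2.a: store the chunk index */

    if (do_unmap)
        munmap(p, bytes);       /* otherwise freed when the process exits */
    return n;
}

/*
 * Fork `nproc` children that each fill their own region and exit.
 * Returns the number of children that exited successfully.
 */
static int run_workers(int nproc, size_t bytes)
{
    int ok = 0;
    int status;

    for (int i = 0; i < nproc; i++) {
        pid_t pid = fork();

        if (pid == 0)
            _exit(fill_region(bytes, 0) ? 0 : 1);
        /* pid < 0 means fork failed; there is no child to wait for. */
    }

    while (wait(&status) > 0)
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            ok++;
    return ok;
}
```

Under these assumptions, run_workers(70, 1UL << 30) would approximate the
"-n 70 1g" shape of the run, without the throughput statistics of step 3.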

> 
> Do you have visibility into when each process allocates and frees memory?

Yes. Hopefully the above offers some clarifications.

Thanks,
Kanchana

Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios

Posted by Yosry Ahmed 1 year, 5 months ago

On Wed, Aug 28, 2024 at 11:50 AM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
> Hi Yosry,
>
> > -----Original Message-----
> > From: Yosry Ahmed <yosryahmed@google.com>
> > Sent: Wednesday, August 28, 2024 12:44 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: Nhat Pham <nphamcs@gmail.com>; linux-kernel@vger.kernel.org; linux-
> > mm@kvack.org; hannes@cmpxchg.org; ryan.roberts@arm.com; Huang, Ying
> > <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> > Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> >
> > [..]
> > >
> > > This shows that in all cases, reclaim_high() is called only from the return
> > > path to user mode after handling a page-fault.
> >
> > I am sorry I haven't been keeping up with this thread, I don't have a
> > lot of capacity right now.
> >
> > If my understanding is correct, the summary of the problem we are
> > observing here is that with high concurrency (70 processes), we
> > observe worse system time, worse throughput, and higher memory_high
> > events with zswap than SSD swap. This is true (with varying degrees)
> > for 4K or mTHP, and with or without charging zswap compressed memory.
> >
> > Did I get that right?
>
> Thanks for your review and comments! Yes, this is correct.
>
> >
> > I saw you also mentioned that reclaim latency is directly correlated
> > to higher memory_high events.
>
> That was my observation based on the swap-constrained experiments with 4G SSD.
> With a faster compressor, we allow allocations to proceed quickly, and if the pages
> are not being faulted in, we need more swap slots. This increases the probability of
> running out of swap slots with the 4G SSD backing device, which, as the data in v4
> shows, causes memcg_swap_fail events, that drive folios to be resident in memory
> (triggering memcg_high breaches as allocations proceed even without zswap cgroup
> charging).
>
> Things change when the experiments are run in a situation where there is abundant
> swap space and when the default behavior of zswap compressed data being charged
> to the cgroup is enabled, as in the data with 176GiB ZRAM as ZSWAP's backing
> swapfile posted in v5. Now, the critical path to workload performance changes to
> concurrent reclaims in response to memcg_high events due to allocation and zswap
> usage. We see a lesser increase in swapout activity (as compared to the swap-constrained
> experiments in v4), and compress latency seems to become the bottleneck. Each
> individual process's throughput/sys time degrades mainly as a function of compress
> latency. Anyway, these were some of my learnings from these experiments. Please
> do let me know if there are other insights/analysis I could be missing.
>
> >
> > Is it possible that with SSD swap, because we wait for IO during
> > reclaim, this gives a chance for other processes to allocate and free
> > the memory they need. While with zswap because everything is
> > synchronous, all processes are trying to allocate their memory at the
> > same time resulting in higher reclaim rates?
> >
> > IOW, maybe with zswap all the processes try to allocate their memory
> > at the same time, so the total amount of memory needed at any given
> > instance is much higher than memory.high, so we keep producing
> > memory_high events and reclaiming. If 70 processes all require 1G at
> > the same time, then we need 70G of memory at once, we will keep
> > thrashing pages in/out of zswap.
> >
> > While with SSD swap, due to the waits imposed by IO, the allocations
> > are more spread out and more serialized, and the amount of memory
> > needed at any given instance is lower; resulting in less reclaim
> > activity and ultimately faster overall execution?
>
> This is a very interesting hypothesis, that is along the lines of the
> "slower compressor" essentially causing allocation stalls (and buffering us from
> the swap slots unavailability effect) observation I gathered from the 4G SSD
> experiments. I think this is a possibility.
>
> >
> > Could you please describe what the processes are doing? Are they
> > allocating memory and holding on to it, or immediately freeing it?
>
> I have been using the vm-scalability usemem workload for these experiments.
> Thanks Ying for suggesting I use this workload!
>
> I am running usemem with these config options: usemem --init-time -w -O -n 70 1g.
> This forks 70 processes, each of which does the following:
>
> 1) Allocates 1G mmap virtual memory with MAP_ANONYMOUS, read/write permissions.
> 2) Steps through and accesses each 8 bytes chunk of memory in the mmap-ed region, and:
>     2.a) Writes the index of that chunk to the (unsigned long *) memory at that index.
> 3) Generates statistics on throughput.
>
> There is an "munmap()" after step (2.a) that I have commented out because I wanted to
> see how much cold memory resides in the zswap zpool after the workload exits. Interestingly,
> this was 0 for 64K mTHP, but of the order of several hundreds of MB for 2M THP.

Does the process exit immediately after step (3)? The memory will be
unmapped and freed once the process exits anyway, so removing an unmap
that immediately precedes the process exiting should have no effect.

I wonder how this changes if the processes sleep and keep the memory
mapped for a while, to force the situation where all the memory is
needed at the same time on SSD as well as zswap. This could make the
playing field more even and force the same thrashing to happen on SSD
for a more fair comparison.
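
A rough sketch of what such a "hold" variant could look like per process is
below. The helper name fill_and_hold() and its hold_seconds parameter are
hypothetical, and MAP_PRIVATE is an assumption; this does not reflect any
existing usemem option.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Fill an anonymous mapping, keep it mapped for `hold_seconds`, then read
 * it back. Returns 1 if the first and last chunks still hold their index,
 * 0 on failure.
 */
static int fill_and_hold(size_t bytes, unsigned int hold_seconds)
{
    size_t n = bytes / sizeof(unsigned long);
    unsigned long *p;
    int intact;

    if (n == 0)
        return 0;

    p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 0;

    for (size_t i = 0; i < n; i++)
        p[i] = i;

    /* Keep the region mapped so all processes need their memory at once. */
    sleep(hold_seconds);

    /* Touching the pages again faults them back in if they were swapped. */
    intact = (p[0] == 0 && p[n - 1] == n - 1);

    munmap(p, bytes);
    return intact;
}
```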

It's not a fix; if very fast reclaim with zswap ends up causing more
problems, perhaps we need to tweak the throttling of memory.high or
something.

RE: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios

Posted by Sridhar, Kanchana P 1 year, 5 months ago

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, August 28, 2024 3:34 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Nhat Pham <nphamcs@gmail.com>; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org; hannes@cmpxchg.org; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> 
> On Wed, Aug 28, 2024 at 11:50 AM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Hi Yosry,
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosryahmed@google.com>
> > > Sent: Wednesday, August 28, 2024 12:44 AM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > Cc: Nhat Pham <nphamcs@gmail.com>; linux-kernel@vger.kernel.org;
> linux-
> > > mm@kvack.org; hannes@cmpxchg.org; ryan.roberts@arm.com; Huang,
> Ying
> > > <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org;
> > > Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > > Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> > >
> > > [..]
> > > >
> > > > This shows that in all cases, reclaim_high() is called only from the return
> > > > path to user mode after handling a page-fault.
> > >
> > > I am sorry I haven't been keeping up with this thread, I don't have a
> > > lot of capacity right now.
> > >
> > > If my understanding is correct, the summary of the problem we are
> > > observing here is that with high concurrency (70 processes), we
> > > observe worse system time, worse throughput, and higher memory_high
> > > events with zswap than SSD swap. This is true (with varying degrees)
> > > for 4K or mTHP, and with or without charging zswap compressed memory.
> > >
> > > Did I get that right?
> >
> > Thanks for your review and comments! Yes, this is correct.
> >
> > >
> > > I saw you also mentioned that reclaim latency is directly correlated
> > > to higher memory_high events.
> >
> > That was my observation based on the swap-constrained experiments with
> 4G SSD.
> > With a faster compressor, we allow allocations to proceed quickly, and if the
> pages
> > are not being faulted in, we need more swap slots. This increases the
> probability of
> > running out of swap slots with the 4G SSD backing device, which, as the data
> in v4
> > shows, causes memcg_swap_fail events, that drive folios to be resident in
> memory
> > (triggering memcg_high breaches as allocations proceed even without
> zswap cgroup
> > charging).
> >
> > Things change when the experiments are run in a situation where there is
> abundant
> > swap space and when the default behavior of zswap compressed data being
> charged
> > to the cgroup is enabled, as in the data with 176GiB ZRAM as ZSWAP's
> backing
> > swapfile posted in v5. Now, the critical path to workload performance
> changes to
> > concurrent reclaims in response to memcg_high events due to allocation
> and zswap
> > usage. We see a lesser increase in swapout activity (as compared to the
> swap-constrained
> > experiments in v4), and compress latency seems to become the bottleneck.
> Each
> > individual process's throughput/sys time degrades mainly as a function of
> compress
> > latency. Anyway, these were some of my learnings from these experiments.
> Please
> > do let me know if there are other insights/analysis I could be missing.
> >
> > >
> > > Is it possible that with SSD swap, because we wait for IO during
> > > reclaim, this gives a chance for other processes to allocate and free
> > > the memory they need. While with zswap because everything is
> > > synchronous, all processes are trying to allocate their memory at the
> > > same time resulting in higher reclaim rates?
> > >
> > > IOW, maybe with zswap all the processes try to allocate their memory
> > > at the same time, so the total amount of memory needed at any given
> > > instance is much higher than memory.high, so we keep producing
> > > memory_high events and reclaiming. If 70 processes all require 1G at
> > > the same time, then we need 70G of memory at once, we will keep
> > > thrashing pages in/out of zswap.
> > >
> > > While with SSD swap, due to the waits imposed by IO, the allocations
> > > are more spread out and more serialized, and the amount of memory
> > > needed at any given instance is lower; resulting in less reclaim
> > > activity and ultimately faster overall execution?
> >
> > This is a very interesting hypothesis, that is along the lines of the
> > "slower compressor" essentially causing allocation stalls (and buffering us
> from
> > the swap slots unavailability effect) observation I gathered from the 4G SSD
> > experiments. I think this is a possibility.
> >
> > >
> > > Could you please describe what the processes are doing? Are they
> > > allocating memory and holding on to it, or immediately freeing it?
> >
> > I have been using the vm-scalability usemem workload for these
> experiments.
> > Thanks Ying for suggesting I use this workload!
> >
> > I am running usemem with these config options: usemem --init-time -w -O -
> n 70 1g.
> > This forks 70 processes, each of which does the following:
> >
> > 1) Allocates 1G mmap virtual memory with MAP_ANONYMOUS, read/write
> permissions.
> > 2) Steps through and accesses each 8 bytes chunk of memory in the mmap-
> ed region, and:
> >     2.a) Writes the index of that chunk to the (unsigned long *) memory at
> that index.
> > 3) Generates statistics on throughput.
> >
> > There is an "munmap()" after step (2.a) that I have commented out because
> I wanted to
> > see how much cold memory resides in the zswap zpool after the workload
> exits. Interestingly,
> > this was 0 for 64K mTHP, but of the order of several hundreds of MB for 2M
> THP.
> 
> Does the process exit immediately after step (3)? The memory will be
> unmapped and freed once the process exits anyway, so removing an unmap
> that immediately precedes the process exiting should have no effect.

Yes, you're right.

> 
> I wonder how this changes if the processes sleep and keep the memory
> mapped for a while, to force the situation where all the memory is
> needed at the same time on SSD as well as zswap. This could make the
> playing field more even and force the same thrashing to happen on SSD
> for a more fair comparison.

Good point. I believe I saw an option in usemem that could facilitate this.
I will investigate.

> 
> It's not a fix, if very fast reclaim with zswap ends up causing more
> problems perhaps we need to tweak the throttling of memory.high or
> something.

Sure, that is a possibility. That said, proactive reclaim might mitigate this,
in which case very fast reclaim with zswap might help.

Thanks,
Kanchana

Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios

Posted by Nhat Pham 1 year, 5 months ago

On Tue, Aug 27, 2024 at 8:23 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, Aug 26, 2024 at 11:08 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> Yeah that's a bit unfair still. Wild idea, but what about we compare
> SSD without zswap (or SSD with zswap, but without this patch series so
> that mTHP are not zswapped) v.s zswap-on-zram (i.e with a backing
> swapfile on zram block device).
>
> It is stupid, I know. But let's take advantage of the fact that zram
> is not charged to cgroup, pretending that its memory foot print is
> empty?
>
> I don't know how zram works though, so my apologies if it's a stupid
> suggestion :)

Oh nvm, looks like that's what you're already doing.

That said, the lz4 column is soooo bad still, whereas the deflate-iaa
clearly shows improvement! This means it could be
compressor-dependent.

Can you try it with zstd?

RE: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios

Posted by Sridhar, Kanchana P 1 year, 5 months ago

> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Tuesday, August 27, 2024 8:30 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> 
> On Tue, Aug 27, 2024 at 8:23 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Mon, Aug 26, 2024 at 11:08 PM Sridhar, Kanchana P
> > <kanchana.p.sridhar@intel.com> wrote:
> > Yeah that's a bit unfair still. Wild idea, but what about we compare
> > SSD without zswap (or SSD with zswap, but without this patch series so
> > that mTHP are not zswapped) v.s zswap-on-zram (i.e with a backing
> > swapfile on zram block device).
> >
> > It is stupid, I know. But let's take advantage of the fact that zram
> > is not charged to cgroup, pretending that its memory foot print is
> > empty?
> >
> > I don't know how zram works though, so my apologies if it's a stupid
> > suggestion :)
> 
> Oh nvm, looks like that's what you're already doing.
> 
> That said, the lz4 column is soooo bad still, whereas the deflate-iaa
> clearly shows improvement! This means it could be
> compressor-dependent.
> 
> Can you try it with zstd?

Sure, I will gather data with zstd.

Thanks,
Kanchana

RE: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios

Posted by Sridhar, Kanchana P 1 year, 5 months ago

> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Sent: Tuesday, August 27, 2024 11:43 AM
> To: Nhat Pham <nphamcs@gmail.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>;
> Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Subject: RE: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> 
> 
> > -----Original Message-----
> > From: Nhat Pham <nphamcs@gmail.com>
> > Sent: Tuesday, August 27, 2024 8:30 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com;
> > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> >
> > On Tue, Aug 27, 2024 at 8:23 AM Nhat Pham <nphamcs@gmail.com>
> wrote:
> > >
> > > On Mon, Aug 26, 2024 at 11:08 PM Sridhar, Kanchana P
> > > <kanchana.p.sridhar@intel.com> wrote:
> > > Yeah that's a bit unfair still. Wild idea, but what about we compare
> > > SSD without zswap (or SSD with zswap, but without this patch series so
> > > that mTHP are not zswapped) v.s zswap-on-zram (i.e with a backing
> > > swapfile on zram block device).
> > >
> > > It is stupid, I know. But let's take advantage of the fact that zram
> > > is not charged to cgroup, pretending that its memory foot print is
> > > empty?
> > >
> > > I don't know how zram works though, so my apologies if it's a stupid
> > > suggestion :)
> >
> > Oh nvm, looks like that's what you're already doing.
> >
> > That said, the lz4 column is soooo bad still, whereas the deflate-iaa
> > clearly shows improvement! This means it could be
> > compressor-dependent.
> >
> > Can you try it with zstd?
> 
> Sure, I will gather data with zstd.

I will be sending out a v5 shortly with data gathered with zstd.

Thanks,
Kanchana

> 
> Thanks,
> Kanchana