> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Sent: Tuesday, August 27, 2024 11:42 AM
> To: Nhat Pham <nphamcs@gmail.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>;
> Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Subject: RE: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
>
>
> > -----Original Message-----
> > From: Nhat Pham <nphamcs@gmail.com>
> > Sent: Tuesday, August 27, 2024 8:24 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com;
> > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> >
> > On Mon, Aug 26, 2024 at 11:08 PM Sridhar, Kanchana P
> > <kanchana.p.sridhar@intel.com> wrote:
> > >
> > > > Internally, we often see a 1:3 or 1:4 saving ratio (or even more).
> > >
> > > Agree with this as well. In our experiments with other workloads, we
> > > typically see much higher ratios.
> > >
> > > >
> > > > Probably does not explain everything, but worth double checking -
> > > > could you check with zstd to see if the ratio improves.
> > >
> > > Sure. I gathered ratio and compressed memory footprint data today with
> > > 64K mTHP, the 4G SSD swapfile and different zswap compressors.
> > >
> > > This patch-series and no zswap charging, 64K mTHP:
> > > ---------------------------------------------------------------------------
> > > Total Total Average Average Comp
> > > compressed compression compressed compression ratio
> > > length latency length latency
> > > bytes milliseconds bytes nanoseconds
> > > ---------------------------------------------------------------------------
> > > SSD (no zswap) 1,362,296,832 887,861
> > > lz4 2,610,657,430 55,984 2,055 44,065 1.99
> > > zstd 729,129,528 50,986 565 39,510 7.25
> > > deflate-iaa 1,286,533,438 44,785 1,415 49,252 2.89
> > > ---------------------------------------------------------------------------
> > >
> > > zstd does very well on ratio, as expected.
> >
> > Wait. So zstd is displaying a 7-to-1 compression ratio? And has *lower*
> > average latency?
> >
> > Why are we running benchmarks on lz4 again? Sure, there is no free lunch
> > and no compressor that works well on all kinds of data, but lz4's
> > performance here is so bad that it's borderline justifiable to
> > disable/bypass zswap with this kind of compression ratio...
> >
> > Can I ask you to run benchmarking on zstd from now on?
>
> Sure, will do.
>
> >
> > >
> > > >
> > > > >
> > > > >
> > > > > Experiment 3 - 4K folios swap characteristics SSD vs. ZSWAP:
> > > > > ------------------------------------------------------------
> > > > >
> > > > > I wanted to take a step back and understand how the mainline
> > > > > v6.11-rc3 handles 4K folios when swapped out to SSD (CONFIG_ZSWAP
> > > > > is off) and when swapped out to ZSWAP. Interestingly, higher
> > > > > swapout activity is observed with 4K folios and v6.11-rc3 (with
> > > > > the debug change to not charge zswap to cgroup).
> > > > >
> > > > > v6.11-rc3 with no zswap charge, only 4K folios, no (m)THP:
> > > > >
> > > > > -------------------------------------------------------------
> > > > > SSD (CONFIG_ZSWAP is OFF) ZSWAP lz4 lzo-rle
> > > > > -------------------------------------------------------------
> > > > > cgroup memory.events: cgroup memory.events:
> > > > >
> > > > > low 0 low 0 0
> > > > > high 5,068 high 321,923 375,116
> > > > > max 0 max 0 0
> > > > > oom 0 oom 0 0
> > > > > oom_kill 0 oom_kill 0 0
> > > > > oom_group_kill 0 oom_group_kill 0 0
> > > > > -------------------------------------------------------------
> > > > >
> > > > > SSD (CONFIG_ZSWAP is OFF):
> > > > > --------------------------
> > > > > pswpout 415,709
> > > > > sys time (sec) 301.02
> > > > > Throughput KB/s 155,970
> > > > > memcg_high events 5,068
> > > > > --------------------------
> > > > >
> > > > >
> > > > > ZSWAP lz4 lz4 lz4 lzo-rle
> > > > > --------------------------------------------------------------
> > > > > zswpout 1,598,550 1,515,151 1,449,432 1,493,917
> > > > > sys time (sec) 889.36 481.21 581.22 635.75
> > > > > Throughput KB/s 35,176 14,765 20,253 21,407
> > > > > memcg_high events 321,923 412,733 369,976 375,116
> > > > > --------------------------------------------------------------
> > > > >
> > > > > This shows that there is a performance regression of -60% to -195%
> > > > > with zswap as compared to SSD with 4K folios. The higher swapout
> > > > > activity with zswap is seen here too (i.e., this doesn't appear to
> > > > > be mTHP-specific).
> > > > >
> > > > > I verified this to be the case even with the v6.7 kernel, which also
> > > > > showed a 2.3X throughput improvement when we don't charge zswap:
> > > > >
> > > > > ZSWAP lz4 v6.7 v6.7 with no cgroup zswap charge
> > > > > --------------------------------------------------------------------
> > > > > zswpout 1,419,802 1,398,620
> > > > > sys time (sec) 535.4 613.41
> > > >
> > > > systime increases without zswap cgroup charging? That's strange...
> > >
> > > Additional data gathered with v6.11-rc3 (listed below) based on your
> > > suggestion to investigate potential swap.high breaches should
> > > hopefully provide some explanation.
> > >
> > > >
> > > > > Throughput KB/s 8,671 20,045
> > > > > memcg_high events 574,046 451,859
> > > >
> > > > So, on a 4K folio setup, even without cgroup charge, we are still seeing:
> > > >
> > > > 1. More zswpout (than observed on SSD)
> > > > 2. 40-50% worse latency - in fact it is worse without zswap cgroup
> > > > charging.
> > > > 3. 100 times the amount of memcg_high events? This is perhaps the
> > > > *strangest* to me. You're already removing zswap cgroup charging, then
> > > > where does this come from? How can we have a memory.high violation
> > > > when zswap does *not* contribute to memory usage?
> > > >
> > > > Is this due to swap limit charging? Do you have a cgroup swap limit?
> > > >
> > > >         mem_high = page_counter_read(&memcg->memory) >
> > > >                    READ_ONCE(memcg->memory.high);
> > > >         swap_high = page_counter_read(&memcg->swap) >
> > > >                    READ_ONCE(memcg->swap.high);
> > > > [...]
> > > >
> > > >         if (mem_high || swap_high) {
> > > >                 /*
> > > >                  * The allocating tasks in this cgroup will need to do
> > > >                  * reclaim or be throttled to prevent further growth
> > > >                  * of the memory or swap footprints.
> > > >                  *
> > > >                  * Target some best-effort fairness between the tasks,
> > > >                  * and distribute reclaim work and delay penalties
> > > >                  * based on how much each task is actually allocating.
> > > >                  */
> > > >                 current->memcg_nr_pages_over_high += batch;
> > > >                 set_notify_resume(current);
> > > >                 break;
> > > >         }
> > > >
> > >
> > > I don't have a swap.high limit set on the cgroup; it is set to "max".
> > >
> > > I ran experiments with v6.11-rc3, no zswap charging, 4K folios and
> > > different zswap compressors to verify if swap.high is breached with
> > > the 4G SSD swapfile.
> > > SSD (CONFIG_ZSWAP is OFF):
> > >
> > > SSD SSD SSD
> > > ------------------------------------------------------------
> > > pswpout 415,709 1,032,170 636,582
> > > sys time (sec) 301.02 328.15 306.98
> > > Throughput KB/s 155,970 89,621 122,219
> > > memcg_high events 5,068 15,072 8,344
> > > memcg_swap_high events 0 0 0
> > > memcg_swap_fail events 0 0 0
> > > ------------------------------------------------------------
> > >
> > > ZSWAP zstd zstd zstd
> > > ----------------------------------------------------------------
> > > zswpout 1,391,524 1,382,965 1,417,307
> > > sys time (sec) 474.68 568.24 489.80
> > > Throughput KB/s 26,099 23,404 111,115
> > > memcg_high events 335,112 340,335 162,260
> > > memcg_swap_high events 0 0 0
> > > memcg_swap_fail events 1,226,899 5,742,153
> > > (mem_cgroup_try_charge_swap)
> > > memcg_memory_stat_pgactivate 1,259,547
> > > (shrink_folio_list)
> > > ----------------------------------------------------------------
> > >
> > > ZSWAP lzo-rle lzo-rle lzo-rle
> > > -----------------------------------------------------------
> > > zswpout 1,493,917 1,363,040 1,428,133
> > > sys time (sec) 635.75 498.63 484.65
> > > Throughput KB/s 21,407 23,827 20,237
> > > memcg_high events 375,116 352,814 373,667
> > > memcg_swap_high events 0 0 0
> > > memcg_swap_fail events 715,211
> > > -----------------------------------------------------------
> > >
> > > ZSWAP lz4 lz4 lz4 lz4
> > > ---------------------------------------------------------------------
> > > zswpout 1,378,781 1,598,550 1,515,151 1,449,432
> > > sys time (sec) 495.45 889.36 481.21 581.22
> > > Throughput KB/s 26,248 35,176 14,765 20,253
> > > memcg_high events 347,209 321,923 412,733 369,976
> > > memcg_swap_high events 0 0 0 0
> > > memcg_swap_fail events 580,103 0
> > > ---------------------------------------------------------------------
> > >
> > > ZSWAP deflate-iaa deflate-iaa deflate-iaa
> > > ----------------------------------------------------------------
> > > zswpout 380,471 1,440,902 1,397,965
> > > sys time (sec) 329.06 570.77 467.41
> > > Throughput KB/s 283,867 28,403 190,600
> > > memcg_high events 5,551 422,831 28,154
> > > memcg_swap_high events 0 0 0
> > > memcg_swap_fail events 0 2,686,758 438,562
> > > ----------------------------------------------------------------
> >
> > Why are there 3 columns for each of the compressors? Is this different
> > runs of the same workload?
> >
> > And why do some columns have missing cells?
>
> Yes, these are different runs of the same workload. Since there is some
> amount of variance seen in the data, I figured it is best to publish the
> metrics from the individual runs rather than averaging.
>
> Some of these runs were gathered earlier with the same code base,
> however, I wasn't monitoring/logging the
> memcg_swap_high/memcg_swap_fail
> events at that time. For those runs, just these two counters have missing
> column entries; the rest of the data is still valid.
>
> >
> > >
> > > There are no swap.high memcg events recorded in any of the SSD/zswap
> > > experiments. However, I do see a significant number of memcg_swap_fail
> > > events in some of the zswap runs, for all 3 compressors. This is not
> > > consistent, because there are some runs with 0 memcg_swap_fail for all
> > > compressors.
> > >
> > > There is a possible correlation between memcg_swap_fail events
> > > (/sys/fs/cgroup/test/memory.swap.events) and the high # of memcg_high
> > > events. The root-cause appears to be that when there are no available
> > > swap slots, memcg_swap_fail is incremented, add_to_swap() fails in
> > > shrink_folio_list(), followed by "activate_locked:" for the folio.
> > > The folio re-activation is recorded in cgroup memory.stat pgactivate
> > > events. The failure to swap out folios due to lack of swap slots could
> > > contribute towards memory.high breaches.
> >
> > Yeah FWIW, that was gonna be my first suggestion. This swapfile size
> > is wayyyy too small...
> >
> > But that said, the link is not clear to me at all. The only thing I
> > can think of is that lz4's performance sucks so bad that it's not saving
> > enough memory, leading to regression. And since it's still taking up
> > swap slots, we cannot use swap either?
>
> The occurrence of memcg_swap_fail events establishes that swap slots
> are not available with 4G of swap space. This causes those 4K folios to
> remain in memory, which can worsen an existing problem with memory.high
> breaches.
>
> However, it is worth noting that this is not the only contributor to
> memcg_high events that still occur without zswap charging. The data shows
> 321,923 occurrences of memcg_high in Col 2 of the lz4 table, which also
> has 0 occurrences of memcg_swap_fail reported in the cgroup stats.
>
> >
> > >
> > > However, this is probably not the only cause of either the high # of
> > > memory.high breaches or the over-reclaim with zswap, as seen in the lz4
> > > data where the # of memory.high breaches is significant even in cases
> > > where there are no memcg_swap_fails.
> > >
> > > Some observations/questions based on the above 4K folios swapout data:
> > >
> > > 1) There are more memcg_high events as the swapout latency reduces
> > > (i.e. faster swap-write path). This is even without charging zswap
> > > utilization to the cgroup.
> >
> > This is still inexplicable to me. If we are not charging zswap usage,
> > we shouldn't even be triggering the reclaim_high() path, no?
> >
> > I'm curious - can you use bpftrace to track where/when reclaim_high
> > is being called?
Hi Nhat,
Since reclaim_high() is called only in a handful of places, I figured I
would just use debugfs u64 counters to record where it gets called from.
These are the places where I increment the debugfs counters:
include/linux/resume_user_mode.h:
---------------------------------
diff --git a/include/linux/resume_user_mode.h b/include/linux/resume_user_mode.h
index e0135e0adae0..382f5469e9a2 100644
--- a/include/linux/resume_user_mode.h
+++ b/include/linux/resume_user_mode.h
@@ -24,6 +24,7 @@ static inline void set_notify_resume(struct task_struct *task)
 		kick_process(task);
 }
 
+extern u64 hoh_userland;
 
 /**
  * resume_user_mode_work - Perform work before returning to user mode
@@ -56,6 +57,7 @@ static inline void resume_user_mode_work(struct pt_regs *regs)
 	}
 #endif
 
+	++hoh_userland;
 	mem_cgroup_handle_over_high(GFP_KERNEL);
 	blkcg_maybe_throttle_current();
mm/memcontrol.c:
----------------
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f29157288b7d..6738bb670a78 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1910,9 +1910,12 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
 	return nr_reclaimed;
 }
 
+extern u64 rec_high_hwf;
+
 static void high_work_func(struct work_struct *work)
 {
 	struct mem_cgroup *memcg;
 
+	++rec_high_hwf;
 	memcg = container_of(work, struct mem_cgroup, high_work);
 
 	reclaim_high(memcg, MEMCG_CHARGE_BATCH, GFP_KERNEL);
@@ -2055,6 +2058,8 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
 	return penalty_jiffies * nr_pages / MEMCG_CHARGE_BATCH;
 }
 
+extern u64 rec_high_hoh;
+
 /*
  * Reclaims memory over the high limit. Called directly from
  * try_charge() (context permitting), as well as from the userland
@@ -2097,6 +2102,7 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	 * memory.high is currently batched, whereas memory.max and the page
 	 * allocator run every time an allocation is made.
 	 */
+	++rec_high_hoh;
 	nr_reclaimed = reclaim_high(memcg,
 				    in_retry ? SWAP_CLUSTER_MAX : nr_pages,
 				    gfp_mask);
@@ -2153,6 +2159,8 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	css_put(&memcg->css);
 }
 
+extern u64 hoh_trycharge;
+
 int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		     unsigned int nr_pages)
 {
@@ -2344,8 +2352,10 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 */
 	if (current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH &&
 	    !(current->flags & PF_MEMALLOC) &&
-	    gfpflags_allow_blocking(gfp_mask))
+	    gfpflags_allow_blocking(gfp_mask)) {
+		++hoh_trycharge;
 		mem_cgroup_handle_over_high(gfp_mask);
+	}
 
 	return 0;
 }
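For reference, the counters declared extern in the diffs above might be defined and exposed along these lines (a sketch under assumptions, not from the actual debug patch: the debugfs directory name "reclaim_high" and the use of late_initcall() are illustrative; debugfs_create_u64() with mode 0444 creates a read-only u64 file):

```c
/* Sketch (assumed): definitions of the counters declared extern in the
 * diffs above, exposed read-only through debugfs, e.g. readable from
 * userspace at /sys/kernel/debug/reclaim_high/hoh_userland. This would
 * be built into the kernel (not a module) so core mm code can
 * increment the counters directly. */
#include <linux/debugfs.h>
#include <linux/init.h>
#include <linux/types.h>

u64 hoh_userland;
u64 hoh_trycharge;
u64 rec_high_hoh;
u64 rec_high_hwf;

static int __init reclaim_high_counters_init(void)
{
	struct dentry *dir = debugfs_create_dir("reclaim_high", NULL);

	debugfs_create_u64("hoh_userland", 0444, dir, &hoh_userland);
	debugfs_create_u64("hoh_trycharge", 0444, dir, &hoh_trycharge);
	debugfs_create_u64("rec_high_hoh", 0444, dir, &rec_high_hoh);
	debugfs_create_u64("rec_high_hwf", 0444, dir, &rec_high_hwf);
	return 0;
}
late_initcall(reclaim_high_counters_init);
```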
I reverted my debug changes for "zswap to not charge cgroup" when I ran
this next set of experiments, which records the # of times and the
locations from which reclaim_high() is called.
zstd is the compressor I have configured for both ZSWAP and ZRAM.
6.11-rc3 mainline, 176Gi ZRAM backing for ZSWAP, zstd, 64K mTHP:
----------------------------------------------------------------
/sys/fs/cgroup/iax/memory.events:
high 112,910
hoh_userland 128,835
hoh_trycharge 0
rec_high_hoh 113,079
rec_high_hwf 0
6.11-rc3 mainline, 4G SSD backing for ZSWAP, zstd, 64K mTHP:
------------------------------------------------------------
/sys/fs/cgroup/iax/memory.events:
high 4,693
hoh_userland 14,069
hoh_trycharge 0
rec_high_hoh 4,694
rec_high_hwf 0
ZSWAP-mTHP, 176Gi ZRAM backing for ZSWAP, zstd, 64K mTHP:
---------------------------------------------------------
/sys/fs/cgroup/iax/memory.events:
high 139,495
hoh_userland 156,628
hoh_trycharge 0
rec_high_hoh 140,039
rec_high_hwf 0
ZSWAP-mTHP, 4G SSD backing for ZSWAP, zstd, 64K mTHP:
-----------------------------------------------------
/sys/fs/cgroup/iax/memory.events:
high 20,427
/sys/fs/cgroup/iax/memory.swap.events:
fail 20,856
hoh_userland 31,346
hoh_trycharge 0
rec_high_hoh 20,513
rec_high_hwf 0
This shows that in all cases, reclaim_high() is called only from the return
path to user mode after handling a page-fault.
Thanks,
Kanchana
>
> I had confirmed earlier with counters that all calls to reclaim_high()
> were from include/linux/resume_user_mode.h::resume_user_mode_work().
> I will confirm this with zstd and bpftrace and share.
>
> Thanks,
> Kanchana
>
> >
> > >
> > > 2) There appears to be a direct correlation between a higher # of
> > > memcg_swap_fail events, an increase in memcg_high breaches, and a
> > > reduction in usemem throughput. This, combined with the observation in
> > > (1), suggests that with a faster compressor we need more swap slots,
> > > which increases the probability of running out of swap slots with the
> > > 4G SSD backing device.
> > >
> > > 3) Could the data shared earlier on the reduction in memcg_high breaches
> > > with 64K mTHP swapout provide some more clues, if we agree with (1)
> > > and (2):
> > >
> > > "Interestingly, the # of memcg_high events reduces significantly with
> > > 64K mTHP as compared to the above 4K memcg_high events data, when
> > > tested with v4 and no zswap charge: 3,069 (SSD-mTHP) and 19,656
> > > (ZSWAP-mTHP)."
> > >
> > > 4) In the case of each zswap compressor, there are some runs that go
> > > through with 0 memcg_swap_fail events. These runs generally have
> > > fewer memcg_high breaches and better sys time/throughput.
> > >
> > > 5) For a given swap setup, there is some amount of variance in
> > > sys time for this workload.
> > >
> > > 6) All this suggests that the primary root cause is the concurrency setup,
> > > where there could be randomness between runs as to the # of processes
> > > that observe the memory.high breach due to other factors such as
> > > availability of swap slots for alloc.
> > >
> > > To summarize, I believe the root-cause is the 4G SSD swapfile resulting
> > > in running out of swap slots, and anomalous behavior with over-reclaim
> > > when 70 concurrent processes are working with the 60G memory limit while
> > > trying to allocate 1G each; with randomness in how processes react to
> > > the breach.
> > >
> > > The cgroup zswap charging exacerbates this situation, but is not a problem
> > > in and of itself.
> > >
> > > Nhat, as you pointed out, this is somewhat of an unrealistic scenario
> > > that doesn't seem to indicate any specific problems to be solved, other
> > > than the temporary cgroup zswap double-charging.
> > >
> > > Would it be fair to evaluate this patch-series based on a more realistic
> > > swapfile configuration based on 176G ZRAM, for which I had shared the
> > > data in v2? There weren't any problems with swap slot availability or any
> > > anomalies that I can think of with this setup, other than the fact that
> > > the "Before" and "After" sys times could not be directly compared for 2
> > > key reasons:
> > >
> > > - ZRAM compressed data is not charged to the cgroup, similar to SSD.
> > > - ZSWAP compressed data is charged to the cgroup.
> >
> > Yeah, that's a bit unfair still. Wild idea, but what if we compare
> > SSD without zswap (or SSD with zswap, but without this patch series so
> > that mTHP are not zswapped) vs. zswap-on-zram (i.e., with a backing
> > swapfile on a zram block device)?
> >
> > It is stupid, I know. But let's take advantage of the fact that zram
> > is not charged to cgroup, pretending that its memory foot print is
> > empty?
> >
> > I don't know how zram works though, so my apologies if it's a stupid
> > suggestion :)
[..]
> > This shows that in all cases, reclaim_high() is called only from the return
> > path to user mode after handling a page-fault.

I am sorry I haven't been keeping up with this thread, I don't have a
lot of capacity right now.

If my understanding is correct, the summary of the problem we are
observing here is that with high concurrency (70 processes), we
observe worse system time, worse throughput, and higher memory_high
events with zswap than SSD swap. This is true (with varying degrees)
for 4K or mTHP, and with or without charging zswap compressed memory.

Did I get that right?

I saw you also mentioned that reclaim latency is directly correlated
to higher memory_high events.

Is it possible that with SSD swap, because we wait for IO during
reclaim, this gives a chance for other processes to allocate and free
the memory they need. While with zswap because everything is
synchronous, all processes are trying to allocate their memory at the
same time resulting in higher reclaim rates?

IOW, maybe with zswap all the processes try to allocate their memory
at the same time, so the total amount of memory needed at any given
instance is much higher than memory.high, so we keep producing
memory_high events and reclaiming. If 70 processes all require 1G at
the same time, then we need 70G of memory at once, we will keep
thrashing pages in/out of zswap.

While with SSD swap, due to the waits imposed by IO, the allocations
are more spread out and more serialized, and the amount of memory
needed at any given instance is lower; resulting in less reclaim
activity and ultimately faster overall execution?

Could you please describe what the processes are doing? Are they
allocating memory and holding on to it, or immediately freeing it?

Do you have visibility into when each process allocates and frees memory?
Hi Yosry,
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, August 28, 2024 12:44 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Nhat Pham <nphamcs@gmail.com>; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org; hannes@cmpxchg.org; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
>
> [..]
> >
> > This shows that in all cases, reclaim_high() is called only from the return
> > path to user mode after handling a page-fault.
>
> I am sorry I haven't been keeping up with this thread, I don't have a
> lot of capacity right now.
>
> If my understanding is correct, the summary of the problem we are
> observing here is that with high concurrency (70 processes), we
> observe worse system time, worse throughput, and higher memory_high
> events with zswap than SSD swap. This is true (with varying degrees)
> for 4K or mTHP, and with or without charging zswap compressed memory.
>
> Did I get that right?
Thanks for your review and comments! Yes, this is correct.
>
> I saw you also mentioned that reclaim latency is directly correlated
> to higher memory_high events.
That was my observation based on the swap-constrained experiments with 4G SSD.
With a faster compressor, we allow allocations to proceed quickly, and if the pages
are not being faulted in, we need more swap slots. This increases the probability of
running out of swap slots with the 4G SSD backing device, which, as the data in v4
shows, causes memcg_swap_fail events that drive folios to remain resident in memory
(triggering memcg_high breaches as allocations proceed, even without zswap cgroup
charging).
Things change when the experiments are run with abundant swap space and with the
default behavior of charging zswap compressed data to the cgroup, as in the data
with 176GiB ZRAM as ZSWAP's backing swapfile posted in v5. Now, the critical path
for workload performance shifts to concurrent reclaims in response to memcg_high
events driven by allocation and zswap usage. We see a smaller increase in swapout
activity (as compared to the swap-constrained experiments in v4), and compress
latency seems to become the bottleneck. Each individual process's throughput/sys
time degrades mainly as a function of compress latency. Anyway, these were some
of my learnings from these experiments. Please do let me know if there are other
insights/analyses I could be missing.
>
> Is it possible that with SSD swap, because we wait for IO during
> reclaim, this gives a chance for other processes to allocate and free
> the memory they need. While with zswap because everything is
> synchronous, all processes are trying to allocate their memory at the
> same time resulting in higher reclaim rates?
>
> IOW, maybe with zswap all the processes try to allocate their memory
> at the same time, so the total amount of memory needed at any given
> instance is much higher than memory.high, so we keep producing
> memory_high events and reclaiming. If 70 processes all require 1G at
> the same time, then we need 70G of memory at once, we will keep
> thrashing pages in/out of zswap.
>
> While with SSD swap, due to the waits imposed by IO, the allocations
> are more spread out and more serialized, and the amount of memory
> needed at any given instance is lower; resulting in less reclaim
> activity and ultimately faster overall execution?
This is a very interesting hypothesis, along the lines of my observation from the
4G SSD experiments that a "slower compressor" essentially causes allocation stalls
(and buffers us from the swap-slot unavailability effect). I think this is a
possibility.
>
> Could you please describe what the processes are doing? Are they
> allocating memory and holding on to it, or immediately freeing it?
I have been using the vm-scalability usemem workload for these experiments.
Thanks Ying for suggesting I use this workload!
I am running usemem with these config options: usemem --init-time -w -O -n 70 1g.
This forks 70 processes, each of which does the following:
1) Allocates 1G mmap virtual memory with MAP_ANONYMOUS, read/write permissions.
2) Steps through and accesses each 8 bytes chunk of memory in the mmap-ed region, and:
2.a) Writes the index of that chunk to the (unsigned long *) memory at that index.
3) Generates statistics on throughput.
There is an "munmap()" after step (2.a) that I have commented out because I wanted
to see how much cold memory resides in the zswap zpool after the workload exits.
Interestingly, this was 0 for 64K mTHP, but on the order of several hundred MB
for 2M THP.
>
> Do you have visibility into when each process allocates and frees memory?
Yes. Hopefully the above offers some clarifications.
Thanks,
Kanchana
On Wed, Aug 28, 2024 at 11:50 AM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
[..]
> I am running usemem with these config options: usemem --init-time -w -O -n 70 1g.
> This forks 70 processes, each of which does the following:
>
> 1) Allocates 1G mmap virtual memory with MAP_ANONYMOUS, read/write permissions.
> 2) Steps through and accesses each 8 bytes chunk of memory in the mmap-ed region, and:
> 2.a) Writes the index of that chunk to the (unsigned long *) memory at that index.
> 3) Generates statistics on throughput.
>
> There is an "munmap()" after step (2.a) that I have commented out because I wanted
> to see how much cold memory resides in the zswap zpool after the workload exits.
> Interestingly, this was 0 for 64K mTHP, but on the order of several hundred MB
> for 2M THP.

Does the process exit immediately after step (3)? The memory will be
unmapped and freed once the process exits anyway, so removing an unmap
that immediately precedes the process exiting should have no effect.

I wonder how this changes if the processes sleep and keep the memory
mapped for a while, to force the situation where all the memory is
needed at the same time on SSD as well as zswap. This could make the
playing field more even and force the same thrashing to happen on SSD
for a fairer comparison.

It's not a fix, but if very fast reclaim with zswap ends up causing
more problems, perhaps we need to tweak the throttling of memory.high
or something.
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, August 28, 2024 3:34 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Nhat Pham <nphamcs@gmail.com>; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; hannes@cmpxchg.org; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
>
> On Wed, Aug 28, 2024 at 11:50 AM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Hi Yosry,
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosryahmed@google.com>
> > > Sent: Wednesday, August 28, 2024 12:44 AM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > Cc: Nhat Pham <nphamcs@gmail.com>; linux-kernel@vger.kernel.org;
> > > linux-mm@kvack.org; hannes@cmpxchg.org; ryan.roberts@arm.com;
> > > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com;
> > > akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>;
> > > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > > Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> > >
> > > [..]
> > > >
> > > > This shows that in all cases, reclaim_high() is called only from the return
> > > > path to user mode after handling a page-fault.
> > >
> > > I am sorry I haven't been keeping up with this thread, I don't have a
> > > lot of capacity right now.
> > >
> > > If my understanding is correct, the summary of the problem we are
> > > observing here is that with high concurrency (70 processes), we
> > > observe worse system time, worse throughput, and higher memory_high
> > > events with zswap than SSD swap. This is true (with varying degrees)
> > > for 4K or mTHP, and with or without charging zswap compressed memory.
> > >
> > > Did I get that right?
> >
> > Thanks for your review and comments! Yes, this is correct.
> >
> > > I saw you also mentioned that reclaim latency is directly correlated
> > > to higher memory_high events.
> >
> > That was my observation based on the swap-constrained experiments with
> > 4G SSD. With a faster compressor, we allow allocations to proceed quickly,
> > and if the pages are not being faulted in, we need more swap slots. This
> > increases the probability of running out of swap slots with the 4G SSD
> > backing device, which, as the data in v4 shows, causes memcg_swap_fail
> > events, that drive folios to be resident in memory (triggering memcg_high
> > breaches as allocations proceed even without zswap cgroup charging).
> >
> > Things change when the experiments are run in a situation where there is
> > abundant swap space and when the default behavior of zswap compressed data
> > being charged to the cgroup is enabled, as in the data with 176GiB ZRAM as
> > ZSWAP's backing swapfile posted in v5. Now, the critical path to workload
> > performance changes to concurrent reclaims in response to memcg_high events
> > due to allocation and zswap usage. We see a lesser increase in swapout
> > activity (as compared to the swap-constrained experiments in v4), and
> > compress latency seems to become the bottleneck. Each individual process's
> > throughput/sys time degrades mainly as a function of compress latency.
> > Anyway, these were some of my learnings from these experiments. Please do
> > let me know if there are other insights/analysis I could be missing.
> >
> > > Is it possible that with SSD swap, because we wait for IO during
> > > reclaim, this gives a chance for other processes to allocate and free
> > > the memory they need. While with zswap because everything is
> > > synchronous, all processes are trying to allocate their memory at the
> > > same time resulting in higher reclaim rates?
> > >
> > > IOW, maybe with zswap all the processes try to allocate their memory
> > > at the same time, so the total amount of memory needed at any given
> > > instance is much higher than memory.high, so we keep producing
> > > memory_high events and reclaiming. If 70 processes all require 1G at
> > > the same time, then we need 70G of memory at once, we will keep
> > > thrashing pages in/out of zswap.
> > >
> > > While with SSD swap, due to the waits imposed by IO, the allocations
> > > are more spread out and more serialized, and the amount of memory
> > > needed at any given instance is lower; resulting in less reclaim
> > > activity and ultimately faster overall execution?
> >
> > This is a very interesting hypothesis, that is along the lines of the
> > "slower compressor" essentially causing allocation stalls (and buffering
> > us from the swap slots unavailability effect) observation I gathered from
> > the 4G SSD experiments. I think this is a possibility.
> >
> > > Could you please describe what the processes are doing? Are they
> > > allocating memory and holding on to it, or immediately freeing it?
> >
> > I have been using the vm-scalability usemem workload for these experiments.
> > Thanks Ying for suggesting I use this workload!
> >
> > I am running usemem with these config options: usemem --init-time -w -O -n 70 1g.
> > This forks 70 processes, each of which does the following:
> >
> > 1) Allocates 1G mmap virtual memory with MAP_ANONYMOUS, read/write permissions.
> > 2) Steps through and accesses each 8 bytes chunk of memory in the mmap-ed
> >    region, and:
> >    2.a) Writes the index of that chunk to the (unsigned long *) memory at
> >         that index.
> > 3) Generates statistics on throughput.
> >
> > There is an "munmap()" after step (2.a) that I have commented out because
> > I wanted to see how much cold memory resides in the zswap zpool after the
> > workload exits. Interestingly, this was 0 for 64K mTHP, but of the order
> > of several hundreds of MB for 2M THP.
>
> Does the process exit immediately after step (3)? The memory will be
> unmapped and freed once the process exits anyway, so removing an unmap
> that immediately precedes the process exiting should have no effect.

Yes, you're right.

> I wonder how this changes if the processes sleep and keep the memory
> mapped for a while, to force the situation where all the memory is
> needed at the same time on SSD as well as zswap. This could make the
> playing field more even and force the same thrashing to happen on SSD
> for a more fair comparison.

Good point. I believe I saw an option in usemem that could facilitate this.
I will investigate.

> It's not a fix, if very fast reclaim with zswap ends up causing more
> problems perhaps we need to tweak the throttling of memory.high or
> something.

Sure, that is a possibility. Although, proactive reclaim might mitigate this,
in which case very fast reclaim with zswap might help.

Thanks,
Kanchana