[PATCH v5 0/4] mm: support mTHP swap-in for zRAM-like swapfile
Posted by Barry Song 2 months, 3 weeks ago
From: Barry Song <v-songbaohua@oppo.com>

In an embedded system like Android, more than half of anonymous memory is
actually stored in swap devices such as zRAM. For instance, when an app
is switched to the background, most of its memory might be swapped out.

Currently, we have mTHP features, but unfortunately, without support
for large folio swap-in, once those large folios are swapped out,
we lose them immediately because mTHP swap-out is a one-way ticket.

This is unacceptable and reduces mTHP to merely a toy on systems
with significant swap utilization.

This patch introduces mTHP swap-in support. For now, we limit mTHP
swap-ins to contiguous swaps that were likely swapped out from mTHP as
a whole.
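
Concretely, an order-N mTHP swap-in is only attempted when the faulting
virtual page and the first swap offset share the same alignment within an
nr-page block (nr = 1 << N); this filters out most entries that were not
swapped out as a whole. A minimal sketch of that check, mirroring the
thp_swap_suitable_orders() helper added later in the series (the function
name below is only illustrative):

	static inline bool swap_in_block_aligned(unsigned long addr,
						 pgoff_t swp_offset, int order)
	{
		unsigned int nr = 1 << order;

		/* virtual page index and swap offset must be equally aligned */
		return (addr >> PAGE_SHIFT) % nr == swp_offset % nr;
	}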

Additionally, the current implementation only covers the SWAP_SYNCHRONOUS
case. This is the simplest and most common use case, benefiting millions
of Android phones and similar devices with minimal implementation
cost. In this straightforward scenario, large folios are always exclusive,
eliminating the need to handle complex rmap and swapcache issues.

It offers several benefits:
1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
   swap-out and swap-in.
2. Eliminates fragmentation in swap slots and thereby supports successful
   THP_SWPOUT. Based on the data observed [1] with Chris's and Ryan's
   THP swap allocation optimization, aligned swap-in plays a crucial role
   in the success of THP_SWPOUT.
3. Enables zRAM/zsmalloc to compress and decompress mTHP, reducing CPU usage
   and enhancing compression ratios significantly. We have another patchset
   to enable mTHP compression and decompression in zsmalloc/zRAM[2].

Using the readahead mechanism to decide whether to swap in mTHP doesn't seem
to be an optimal approach. There is a critical distinction between pagecache
and anonymous pages: pagecache can be evicted and later retrieved from disk,
potentially becoming an mTHP upon retrieval, whereas anonymous pages must
always reside in memory or in a swapfile. If we swap in small folios and only
later identify adjacent memory suitable for swapping in as mTHP, the pages
that have already been converted to small folios may never transition back
to mTHP; converting an mTHP into small folios is irreversible. This introduces
the risk of losing all mTHP through several swap-out and swap-in cycles, not
to mention losing the benefits of defragmentation, improved compression
ratios, and reduced CPU usage from mTHP compression/decompression.

On the other hand, having deployed this feature on millions of real-world
products through OPPO's out-of-tree code [3], we haven't observed any
significant increase in memory footprint for 64KiB mTHP based on CONT-PTE
on ARM64.

[1] https://lore.kernel.org/linux-mm/20240622071231.576056-1-21cnbao@gmail.com/
[2] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/
[3] OnePlusOSS / android_kernel_oneplus_sm8550
https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11

-v5:
 * Add swap-in control policy according to Ying's proposal. Right now only
   "always" and "never" are supported; later we can extend to "auto";
 * Fix the comment regarding zswap_never_enabled() according to Yosry;
 * Filter out unaligned swp entries earlier;
 * Add mem_cgroup_swapin_uncharge_swap_nr() helper

-v4:
 https://lore.kernel.org/linux-mm/20240629111010.230484-1-21cnbao@gmail.com/

 Many parts of v3 have been merged into the mm tree with review help from
 Ryan, David, Ying, Chris, and others. Thank you very much!
 This is the final part to allocate large folios and map them.

 * Use Yosry's zswap_never_enabled(); note there is a bug, and I have put the
   bug fix in this v4 RFC, though it should really be fixed in Yosry's patch
 * lots of code improvements (drop the large stack usage, hold the ptl, etc.)
   according to Yosry's and Ryan's feedback
 * rebased on top of the latest mm-unstable and utilized some new helpers
   introduced recently.

-v3:
 https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/
 * avoid overwriting err in __swap_duplicate_nr, pointed out by Yosry,
   thanks!
 * fix the issue where the folio is charged twice in do_swap_page by
   separating alloc_anon_folio and alloc_swap_folio, as they now differ
   in:
   * memcg charging
   * whether the allocated folio is cleared

-v2:
 https://lore.kernel.org/linux-mm/20240229003753.134193-1-21cnbao@gmail.com/
 * lots of code cleanup according to Chris's comments, thanks!
 * collect Chris's ack tags, thanks!
 * address David's comment on moving to use folio_add_new_anon_rmap
   for !folio_test_anon in do_swap_page, thanks!
 * remove the MADV_PAGEOUT patch from this series as Ryan will
   integrate it into his swap-out series
 * Apply Kairui's work of "mm/swap: fix race when skipping swapcache"
   to large folio swap-in as well
 * fixed corrupted data (zero-filled data) in two races by checking
   SWAP_HAS_CACHE while swapping in a large folio: zswap, and the case
   where part of the entries are in the swapcache while others are not

-v1:
 https://lore.kernel.org/all/20240118111036.72641-1-21cnbao@gmail.com/#t

Barry Song (3):
  mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for
    large folios swap-in
  mm: Introduce mem_cgroup_swapin_uncharge_swap_nr() helper for large
    folios swap-in
  mm: Introduce per-thpsize swapin control policy

Chuanhua Han (1):
  mm: support large folios swapin as a whole for zRAM-like swapfile

 Documentation/admin-guide/mm/transhuge.rst |   6 +
 include/linux/huge_mm.h                    |   1 +
 include/linux/memcontrol.h                 |  12 ++
 include/linux/swap.h                       |   9 +-
 mm/huge_memory.c                           |  44 +++++
 mm/memory.c                                | 212 ++++++++++++++++++---
 mm/swap.h                                  |  10 +-
 mm/swapfile.c                              | 102 ++++++----
 8 files changed, 329 insertions(+), 67 deletions(-)

-- 
2.34.1
[PATCH v6 0/2] mm: Ignite large folios swap-in support
Posted by Barry Song 2 months, 2 weeks ago
From: Barry Song <v-songbaohua@oppo.com>

Currently, we support mTHP swapout but not swapin. This means that once mTHP
is swapped out, it will come back as small folios when swapped in. This is
particularly detrimental for devices like Android, where more than half of
the memory is in swap.

The lack of mTHP swapin functionality is a showstopper for mTHP in scenarios
that heavily rely on swap. This patchset introduces mTHP swap-in support.
It starts with synchronous devices similar to zRAM, aiming to benefit as
many users as possible with minimal changes.

-v6:
 * remove the swapin control added in v5, per Willy, Christoph;
   The original reason for adding the swpin_enabled control was primarily
   to address concerns about slower devices. Currently, since we only support
   fast sync devices, swap-in size is less of a concern.
   We'll gain a clearer understanding of the next steps as more devices
   begin to support mTHP swap-in.
 * add nr argument in mem_cgroup_swapin_uncharge_swap() instead of adding
   a new API, per Willy;
 * large folio support for swapcache_prepare() and swapcache_clear() is
   also removed, as it has been split out per Baolin's request and is now
   in mm-unstable.
 * provide more data in changelog.

-v5:
 https://lore.kernel.org/linux-mm/20240726094618.401593-1-21cnbao@gmail.com/

 * Add swap-in control policy according to Ying's proposal. Right now only
   "always" and "never" are supported; later we can extend to "auto";
 * Fix the comment regarding zswap_never_enabled() according to Yosry;
 * Filter out unaligned swp entries earlier;
 * Add mem_cgroup_swapin_uncharge_swap_nr() helper

-v4:
 https://lore.kernel.org/linux-mm/20240629111010.230484-1-21cnbao@gmail.com/

 Many parts of v3 have been merged into the mm tree with review help from
 Ryan, David, Ying, Chris, and others. Thank you very much!
 This is the final part to allocate large folios and map them.

 * Use Yosry's zswap_never_enabled(); note there is a bug, and I have put the
   bug fix in this v4 RFC, though it should really be fixed in Yosry's patch
 * lots of code improvements (drop the large stack usage, hold the ptl, etc.)
   according to Yosry's and Ryan's feedback
 * rebased on top of the latest mm-unstable and utilized some new helpers
   introduced recently.

-v3:
 https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/
 * avoid overwriting err in __swap_duplicate_nr, pointed out by Yosry,
   thanks!
 * fix the issue where the folio is charged twice in do_swap_page by
   separating alloc_anon_folio and alloc_swap_folio, as they now differ
   in:
   * memcg charging
   * whether the allocated folio is cleared

-v2:
 https://lore.kernel.org/linux-mm/20240229003753.134193-1-21cnbao@gmail.com/
 * lots of code cleanup according to Chris's comments, thanks!
 * collect Chris's ack tags, thanks!
 * address David's comment on moving to use folio_add_new_anon_rmap
   for !folio_test_anon in do_swap_page, thanks!
 * remove the MADV_PAGEOUT patch from this series as Ryan will
   integrate it into his swap-out series
 * Apply Kairui's work of "mm/swap: fix race when skipping swapcache"
   to large folio swap-in as well
 * fixed corrupted data (zero-filled data) in two races by checking
   SWAP_HAS_CACHE while swapping in a large folio: zswap, and the case
   where part of the entries are in the swapcache while others are not

-v1:
 https://lore.kernel.org/all/20240118111036.72641-1-21cnbao@gmail.com/#t

Barry Song (1):
  mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to
    support large folios

Chuanhua Han (1):
  mm: support large folios swap-in for zRAM-like devices

 include/linux/memcontrol.h |   5 +-
 mm/memcontrol.c            |   7 +-
 mm/memory.c                | 211 +++++++++++++++++++++++++++++++++----
 mm/swap_state.c            |   2 +-
 4 files changed, 196 insertions(+), 29 deletions(-)

-- 
2.34.1

[PATCH v6 1/2] mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios
Posted by Barry Song 2 months, 2 weeks ago
From: Barry Song <v-songbaohua@oppo.com>

With large folio swap-in, we might need to uncharge multiple entries
at once, so add an nr argument to mem_cgroup_swapin_uncharge_swap().

For the existing two users, just pass nr=1.
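
For illustration, a large folio swap-in path (as in the next patch of this
series) would instead pass the folio's page count:

	/* e.g. when a large folio skips the swapcache */
	mem_cgroup_swapin_uncharge_swap(entry, folio_nr_pages(folio));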

Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 include/linux/memcontrol.h | 5 +++--
 mm/memcontrol.c            | 7 ++++---
 mm/memory.c                | 2 +-
 mm/swap_state.c            | 2 +-
 4 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1b79760af685..44f7fb7dc0c8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -682,7 +682,8 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
 
 int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
 				  gfp_t gfp, swp_entry_t entry);
-void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry);
+
+void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr_pages);
 
 void __mem_cgroup_uncharge(struct folio *folio);
 
@@ -1181,7 +1182,7 @@ static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
 	return 0;
 }
 
-static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry)
+static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr)
 {
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b889a7fbf382..5d763c234c44 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4572,14 +4572,15 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
 
 /*
  * mem_cgroup_swapin_uncharge_swap - uncharge swap slot
- * @entry: swap entry for which the page is charged
+ * @entry: the first swap entry for which the pages are charged
+ * @nr_pages: number of pages which will be uncharged
  *
  * Call this function after successfully adding the charged page to swapcache.
  *
  * Note: This function assumes the page for which swap slot is being uncharged
  * is order 0 page.
  */
-void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry)
+void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
 {
 	/*
 	 * Cgroup1's unified memory+swap counter has been charged with the
@@ -4599,7 +4600,7 @@ void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry)
 		 * let's not wait for it.  The page already received a
 		 * memory+swap charge, drop the swap entry duplicate.
 		 */
-		mem_cgroup_uncharge_swap(entry, 1);
+		mem_cgroup_uncharge_swap(entry, nr_pages);
 	}
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 4c8716cb306c..4cf4902db1ec 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4102,7 +4102,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 					ret = VM_FAULT_OOM;
 					goto out_page;
 				}
-				mem_cgroup_swapin_uncharge_swap(entry);
+				mem_cgroup_swapin_uncharge_swap(entry, 1);
 
 				shadow = get_shadow_from_swap_cache(entry);
 				if (shadow)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 293ff1afdca4..1159e3225754 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -522,7 +522,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	if (add_to_swap_cache(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow))
 		goto fail_unlock;
 
-	mem_cgroup_swapin_uncharge_swap(entry);
+	mem_cgroup_swapin_uncharge_swap(entry, 1);
 
 	if (shadow)
 		workingset_refault(new_folio, shadow);
-- 
2.34.1
Re: [PATCH v6 1/2] mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios
Posted by Chris Li 2 months, 2 weeks ago
Acked-by: Chris Li <chrisl@kernel.org>

Chris

[PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices
Posted by Barry Song 2 months, 2 weeks ago
From: Chuanhua Han <hanchuanhua@oppo.com>

Currently, we have mTHP features, but unfortunately, without support for large
folio swap-ins, once these large folios are swapped out, they are lost because
mTHP swap is a one-way process. The lack of mTHP swap-in functionality prevents
mTHP from being used on devices like Android that heavily rely on swap.

This patch introduces mTHP swap-in support. It starts from sync devices such
as zRAM. This is probably the simplest and most common use case, benefiting
billions of Android phones and similar devices with minimal implementation
cost. In this straightforward scenario, large folios are always exclusive,
eliminating the need to handle complex rmap and swapcache issues.

It offers several benefits:
1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
   swap-out and swap-in. Large folios in the buddy system are also
   preserved as much as possible, rather than being fragmented due
   to swap-in.

2. Eliminates fragmentation in swap slots and supports successful
   THP_SWPOUT.

   w/o this patch (Refer to the data from Chris's and Kairui's latest
   swap allocator optimization while running ./thp_swap_allocator_test
   w/o "-a" option [1]):

   ./thp_swap_allocator_test
   Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 2: swpout inc: 131, swpout fallback inc: 101, Fallback percentage: 43.53%
   Iteration 3: swpout inc: 71, swpout fallback inc: 155, Fallback percentage: 68.58%
   Iteration 4: swpout inc: 55, swpout fallback inc: 168, Fallback percentage: 75.34%
   Iteration 5: swpout inc: 35, swpout fallback inc: 191, Fallback percentage: 84.51%
   Iteration 6: swpout inc: 25, swpout fallback inc: 199, Fallback percentage: 88.84%
   Iteration 7: swpout inc: 23, swpout fallback inc: 205, Fallback percentage: 89.91%
   Iteration 8: swpout inc: 9, swpout fallback inc: 219, Fallback percentage: 96.05%
   Iteration 9: swpout inc: 13, swpout fallback inc: 213, Fallback percentage: 94.25%
   Iteration 10: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 94.74%
   Iteration 11: swpout inc: 16, swpout fallback inc: 213, Fallback percentage: 93.01%
   Iteration 12: swpout inc: 10, swpout fallback inc: 210, Fallback percentage: 95.45%
   Iteration 13: swpout inc: 16, swpout fallback inc: 212, Fallback percentage: 92.98%
   Iteration 14: swpout inc: 12, swpout fallback inc: 212, Fallback percentage: 94.64%
   Iteration 15: swpout inc: 15, swpout fallback inc: 211, Fallback percentage: 93.36%
   Iteration 16: swpout inc: 15, swpout fallback inc: 200, Fallback percentage: 93.02%
   Iteration 17: swpout inc: 9, swpout fallback inc: 220, Fallback percentage: 96.07%

   w/ this patch (always 0%):
   Iteration 1: swpout inc: 948, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 2: swpout inc: 953, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 3: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 4: swpout inc: 952, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 5: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 6: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 7: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 8: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 9: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 10: swpout inc: 945, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 11: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00%
   ...

3. With both mTHP swap-out and swap-in supported, we offer the option to enable
   zsmalloc compression/decompression with larger granularity[2]. The upcoming
   optimization in zsmalloc will significantly increase swap speed and improve
   compression efficiency. Tested by running 100 iterations of swapping 100MiB
   of anon memory, the swap speed improved dramatically:
                time consumption of swapin(ms)   time consumption of swapout(ms)
     lz4 4k                  45274                    90540
     lz4 64k                 22942                    55667
     zstdn 4k                85035                    186585
     zstdn 64k               46558                    118533

    The compression ratio also improved, as evaluated with 1 GiB of data:
     granularity   orig_data_size   compr_data_size
     4KiB-zstd      1048576000       246876055
     64KiB-zstd     1048576000       199763892
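
     That corresponds to roughly a 4.25:1 compression ratio at 4KiB
     granularity versus about 5.25:1 at 64KiB.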

   Without mTHP swap-in, the potential optimizations in zsmalloc cannot be
   realized.

4. Even mTHP swap-in itself can reduce swap-in page faults by a factor
   of nr_pages. Swapping in content filled with the same data 0x11, w/o
   and w/ the patch for five rounds (Since the content is the same,
   decompression will be very fast. This primarily assesses the impact of
   reduced page faults):

  swp in bandwidth(bytes/ms)    w/o              w/
   round1                     624152          1127501
   round2                     631672          1127501
   round3                     620459          1139756
   round4                     606113          1139756
   round5                     624152          1152281
   avg                        621310          1137359      +83%

[1] https://lore.kernel.org/all/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
[2] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/

Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
Co-developed-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/memory.c | 211 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 188 insertions(+), 23 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4cf4902db1ec..07029532469a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3986,6 +3986,152 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
 	return VM_FAULT_SIGBUS;
 }
 
+/*
+ * check a range of PTEs are completely swap entries with
+ * contiguous swap offsets and the same SWAP_HAS_CACHE.
+ * ptep must be first one in the range
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
+{
+	struct swap_info_struct *si;
+	unsigned long addr;
+	swp_entry_t entry;
+	pgoff_t offset;
+	char has_cache;
+	int idx, i;
+	pte_t pte;
+
+	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
+	idx = (vmf->address - addr) / PAGE_SIZE;
+	pte = ptep_get(ptep);
+
+	if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx)))
+		return false;
+	entry = pte_to_swp_entry(pte);
+	offset = swp_offset(entry);
+	if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages)
+		return false;
+
+	si = swp_swap_info(entry);
+	has_cache = si->swap_map[offset] & SWAP_HAS_CACHE;
+	for (i = 1; i < nr_pages; i++) {
+		/*
+		 * while allocating a large folio and doing swap_read_folio for the
+		 * SWP_SYNCHRONOUS_IO path, which is the case the being faulted pte
+		 * doesn't have swapcache. We need to ensure all PTEs have no cache
+		 * as well, otherwise, we might go to swap devices while the content
+		 * is in swapcache
+		 */
+		if ((si->swap_map[offset + i] & SWAP_HAS_CACHE) != has_cache)
+			return false;
+	}
+
+	return true;
+}
+
+static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
+		unsigned long addr, unsigned long orders)
+{
+	int order, nr;
+
+	order = highest_order(orders);
+
+	/*
+	 * To swap-in a THP with nr pages, we require its first swap_offset
+	 * is aligned with nr. This can filter out most invalid entries.
+	 */
+	while (orders) {
+		nr = 1 << order;
+		if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr)
+			break;
+		order = next_order(&orders, order);
+	}
+
+	return orders;
+}
+#else
+static inline bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
+{
+	return false;
+}
+#endif
+
+static struct folio *alloc_swap_folio(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	unsigned long orders;
+	struct folio *folio;
+	unsigned long addr;
+	swp_entry_t entry;
+	spinlock_t *ptl;
+	pte_t *pte;
+	gfp_t gfp;
+	int order;
+
+	/*
+	 * If uffd is active for the vma we need per-page fault fidelity to
+	 * maintain the uffd semantics.
+	 */
+	if (unlikely(userfaultfd_armed(vma)))
+		goto fallback;
+
+	/*
+	 * A large swapped out folio could be partially or fully in zswap. We
+	 * lack handling for such cases, so fallback to swapping in order-0
+	 * folio.
+	 */
+	if (!zswap_never_enabled())
+		goto fallback;
+
+	entry = pte_to_swp_entry(vmf->orig_pte);
+	/*
+	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
+	 * and suitable for swapping THP.
+	 */
+	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
+			TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
+	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
+	orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders);
+
+	if (!orders)
+		goto fallback;
+
+	pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address & PMD_MASK, &ptl);
+	if (unlikely(!pte))
+		goto fallback;
+
+	/*
+	 * For do_swap_page, find the highest order where the aligned range is
+	 * completely swap entries with contiguous swap offsets.
+	 */
+	order = highest_order(orders);
+	while (orders) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+		if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order))
+			break;
+		order = next_order(&orders, order);
+	}
+
+	pte_unmap_unlock(pte, ptl);
+
+	/* Try allocating the highest of the remaining orders. */
+	gfp = vma_thp_gfp_mask(vma);
+	while (orders) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+		folio = vma_alloc_folio(gfp, order, vma, addr, true);
+		if (folio)
+			return folio;
+		order = next_order(&orders, order);
+	}
+
+fallback:
+#endif
+	return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
+}
+
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -4074,35 +4220,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (!folio) {
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
 		    __swap_count(entry) == 1) {
-			/*
-			 * Prevent parallel swapin from proceeding with
-			 * the cache flag. Otherwise, another thread may
-			 * finish swapin first, free the entry, and swapout
-			 * reusing the same entry. It's undetectable as
-			 * pte_same() returns true due to entry reuse.
-			 */
-			if (swapcache_prepare(entry, 1)) {
-				/* Relax a bit to prevent rapid repeated page faults */
-				schedule_timeout_uninterruptible(1);
-				goto out;
-			}
-			need_clear_cache = true;
-
 			/* skip swapcache */
-			folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
-						vma, vmf->address, false);
+			folio = alloc_swap_folio(vmf);
 			page = &folio->page;
 			if (folio) {
 				__folio_set_locked(folio);
 				__folio_set_swapbacked(folio);
 
+				nr_pages = folio_nr_pages(folio);
+				if (folio_test_large(folio))
+					entry.val = ALIGN_DOWN(entry.val, nr_pages);
+				/*
+				 * Prevent parallel swapin from proceeding with
+				 * the cache flag. Otherwise, another thread may
+				 * finish swapin first, free the entry, and swapout
+				 * reusing the same entry. It's undetectable as
+				 * pte_same() returns true due to entry reuse.
+				 */
+				if (swapcache_prepare(entry, nr_pages)) {
+					/* Relax a bit to prevent rapid repeated page faults */
+					schedule_timeout_uninterruptible(1);
+					goto out_page;
+				}
+				need_clear_cache = true;
+
 				if (mem_cgroup_swapin_charge_folio(folio,
 							vma->vm_mm, GFP_KERNEL,
 							entry)) {
 					ret = VM_FAULT_OOM;
 					goto out_page;
 				}
-				mem_cgroup_swapin_uncharge_swap(entry, 1);
+				mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
 
 				shadow = get_shadow_from_swap_cache(entry);
 				if (shadow)
@@ -4209,6 +4357,22 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		goto out_nomap;
 	}
 
+	/* allocated large folios for SWP_SYNCHRONOUS_IO */
+	if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
+		unsigned long nr = folio_nr_pages(folio);
+		unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
+		unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
+		pte_t *folio_ptep = vmf->pte - idx;
+
+		if (!can_swapin_thp(vmf, folio_ptep, nr))
+			goto out_nomap;
+
+		page_idx = idx;
+		address = folio_start;
+		ptep = folio_ptep;
+		goto check_folio;
+	}
+
 	nr_pages = 1;
 	page_idx = 0;
 	address = vmf->address;
@@ -4340,11 +4504,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		folio_add_lru_vma(folio, vma);
 	} else if (!folio_test_anon(folio)) {
 		/*
-		 * We currently only expect small !anon folios, which are either
-		 * fully exclusive or fully shared. If we ever get large folios
-		 * here, we have to be careful.
+		 * We currently only expect small !anon folios which are either
+		 * fully exclusive or fully shared, or new allocated large folios
+		 * which are fully exclusive. If we ever get large folios within
+		 * swapcache here, we have to be careful.
 		 */
-		VM_WARN_ON_ONCE(folio_test_large(folio));
+		VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio));
 		VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
 		folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
 	} else {
@@ -4387,7 +4552,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 out:
 	/* Clear the swap cache pin for direct swapin after PTL unlock */
 	if (need_clear_cache)
-		swapcache_clear(si, entry, 1);
+		swapcache_clear(si, entry, nr_pages);
 	if (si)
 		put_swap_device(si);
 	return ret;
@@ -4403,7 +4568,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		folio_put(swapcache);
 	}
 	if (need_clear_cache)
-		swapcache_clear(si, entry, 1);
+		swapcache_clear(si, entry, nr_pages);
 	if (si)
 		put_swap_device(si);
 	return ret;
-- 
2.34.1
Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices
Posted by Kairui Song 2 months ago
On Fri, Aug 2, 2024 at 8:21 PM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Chuanhua Han <hanchuanhua@oppo.com>

Hi Chuanhua,

> diff --git a/mm/memory.c b/mm/memory.c
> index 4cf4902db1ec..07029532469a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3986,6 +3986,152 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
>         return VM_FAULT_SIGBUS;
>  }
>
> +/*
> + * check a range of PTEs are completely swap entries with
> + * contiguous swap offsets and the same SWAP_HAS_CACHE.
> + * ptep must be first one in the range
> + */
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
> +{
> +       struct swap_info_struct *si;
> +       unsigned long addr;
> +       swp_entry_t entry;
> +       pgoff_t offset;
> +       char has_cache;
> +       int idx, i;
> +       pte_t pte;
> +
> +       addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
> +       idx = (vmf->address - addr) / PAGE_SIZE;
> +       pte = ptep_get(ptep);
> +
> +       if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx)))
> +               return false;
> +       entry = pte_to_swp_entry(pte);
> +       offset = swp_offset(entry);
> +       if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages)
> +               return false;
> +
> +       si = swp_swap_info(entry);
> +       has_cache = si->swap_map[offset] & SWAP_HAS_CACHE;
> +       for (i = 1; i < nr_pages; i++) {
> +               /*
> +                * while allocating a large folio and doing swap_read_folio for the
> +                * SWP_SYNCHRONOUS_IO path, which is the case the being faulted pte
> +                * doesn't have swapcache. We need to ensure all PTEs have no cache
> +                * as well, otherwise, we might go to swap devices while the content
> +                * is in swapcache
> +                */
> +               if ((si->swap_map[offset + i] & SWAP_HAS_CACHE) != has_cache)
> +                       return false;
> +       }
> +
> +       return true;
> +}
> +
> +static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
> +               unsigned long addr, unsigned long orders)
> +{
> +       int order, nr;
> +
> +       order = highest_order(orders);
> +
> +       /*
> +        * To swap-in a THP with nr pages, we require its first swap_offset
> +        * is aligned with nr. This can filter out most invalid entries.
> +        */
> +       while (orders) {
> +               nr = 1 << order;
> +               if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr)
> +                       break;
> +               order = next_order(&orders, order);
> +       }
> +
> +       return orders;
> +}
> +#else
> +static inline bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
> +{
> +       return false;
> +}
> +#endif
> +
> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> +{
> +       struct vm_area_struct *vma = vmf->vma;
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +       unsigned long orders;
> +       struct folio *folio;
> +       unsigned long addr;
> +       swp_entry_t entry;
> +       spinlock_t *ptl;
> +       pte_t *pte;
> +       gfp_t gfp;
> +       int order;
> +
> +       /*
> +        * If uffd is active for the vma we need per-page fault fidelity to
> +        * maintain the uffd semantics.
> +        */
> +       if (unlikely(userfaultfd_armed(vma)))
> +               goto fallback;
> +
> +       /*
> +        * A large swapped out folio could be partially or fully in zswap. We
> +        * lack handling for such cases, so fallback to swapping in order-0
> +        * folio.
> +        */
> +       if (!zswap_never_enabled())
> +               goto fallback;
> +
> +       entry = pte_to_swp_entry(vmf->orig_pte);
> +       /*
> +        * Get a list of all the (large) orders below PMD_ORDER that are enabled
> +        * and suitable for swapping THP.
> +        */
> +       orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> +                       TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
> +       orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> +       orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders);
> +
> +       if (!orders)
> +               goto fallback;
> +
> +       pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address & PMD_MASK, &ptl);
> +       if (unlikely(!pte))
> +               goto fallback;
> +
> +       /*
> +        * For do_swap_page, find the highest order where the aligned range is
> +        * completely swap entries with contiguous swap offsets.
> +        */
> +       order = highest_order(orders);
> +       while (orders) {
> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> +               if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order))
> +                       break;
> +               order = next_order(&orders, order);
> +       }
> +
> +       pte_unmap_unlock(pte, ptl);
> +
> +       /* Try allocating the highest of the remaining orders. */
> +       gfp = vma_thp_gfp_mask(vma);
> +       while (orders) {
> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> +               folio = vma_alloc_folio(gfp, order, vma, addr, true);
> +               if (folio)
> +                       return folio;
> +               order = next_order(&orders, order);
> +       }
> +
> +fallback:
> +#endif
> +       return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
> +}
> +
> +
>  /*
>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>   * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -4074,35 +4220,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         if (!folio) {
>                 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>                     __swap_count(entry) == 1) {
> -                       /*
> -                        * Prevent parallel swapin from proceeding with
> -                        * the cache flag. Otherwise, another thread may
> -                        * finish swapin first, free the entry, and swapout
> -                        * reusing the same entry. It's undetectable as
> -                        * pte_same() returns true due to entry reuse.
> -                        */
> -                       if (swapcache_prepare(entry, 1)) {
> -                               /* Relax a bit to prevent rapid repeated page faults */
> -                               schedule_timeout_uninterruptible(1);
> -                               goto out;
> -                       }
> -                       need_clear_cache = true;
> -
>                         /* skip swapcache */
> -                       folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> -                                               vma, vmf->address, false);
> +                       folio = alloc_swap_folio(vmf);
>                         page = &folio->page;
>                         if (folio) {
>                                 __folio_set_locked(folio);
>                                 __folio_set_swapbacked(folio);
>
> +                               nr_pages = folio_nr_pages(folio);
> +                               if (folio_test_large(folio))
> +                                       entry.val = ALIGN_DOWN(entry.val, nr_pages);
> +                               /*
> +                                * Prevent parallel swapin from proceeding with
> +                                * the cache flag. Otherwise, another thread may
> +                                * finish swapin first, free the entry, and swapout
> +                                * reusing the same entry. It's undetectable as
> +                                * pte_same() returns true due to entry reuse.
> +                                */
> +                               if (swapcache_prepare(entry, nr_pages)) {
> +                                       /* Relax a bit to prevent rapid repeated page faults */
> +                                       schedule_timeout_uninterruptible(1);
> +                                       goto out_page;
> +                               }
> +                               need_clear_cache = true;
> +
>                                 if (mem_cgroup_swapin_charge_folio(folio,
>                                                         vma->vm_mm, GFP_KERNEL,
>                                                         entry)) {
>                                         ret = VM_FAULT_OOM;
>                                         goto out_page;
>                                 }

After your patch, with a kernel build test, I'm seeing the kernel log
spammed like this:
[  101.048594] pagefault_out_of_memory: 95 callbacks suppressed
[  101.048599] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[  101.059416] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[  101.118575] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[  101.125585] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[  101.182501] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[  101.215351] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[  101.272822] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[  101.403195] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
............

And heavy performance loss with memcg-limited workloads when mTHP is enabled.

After some debugging, the problematic part is the
mem_cgroup_swapin_charge_folio call above.
When under pressure, the cgroup charge fails easily for mTHP. One 64k
swap-in will require a much more aggressive reclaim to succeed.

If I change MAX_RECLAIM_RETRIES from 16 to 512, the log spamming is
gone and mTHP swap-in has a much higher success rate.
But this might not be the right way.

For this particular issue, maybe you can change the charge order: try
charging first; if it succeeds, use mTHP; if it fails, fall back to 4k?
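
Roughly, the suggested ordering might look like the sketch below (just an
illustration reusing the helpers from this patch; the order-0 fallback would
still need its own charge, and the large-folio entry alignment is omitted):

	folio = alloc_swap_folio(vmf);
	if (folio && folio_test_large(folio) &&
	    mem_cgroup_swapin_charge_folio(folio, vmf->vma->vm_mm,
					   GFP_KERNEL, entry)) {
		/* mTHP charge failed: drop it and retry with order 0 */
		folio_put(folio);
		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
					vmf->vma, vmf->address, false);
	}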

Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices
Posted by Kefeng Wang 2 months ago

On 2024/8/15 17:47, Kairui Song wrote:
> On Fri, Aug 2, 2024 at 8:21 PM Barry Song <21cnbao@gmail.com> wrote:
>>
>> From: Chuanhua Han <hanchuanhua@oppo.com>
> 
> Hi Chuanhua,
> 
>>
...

>> +
>> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>> +{
>> +       struct vm_area_struct *vma = vmf->vma;
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +       unsigned long orders;
>> +       struct folio *folio;
>> +       unsigned long addr;
>> +       swp_entry_t entry;
>> +       spinlock_t *ptl;
>> +       pte_t *pte;
>> +       gfp_t gfp;
>> +       int order;
>> +
>> +       /*
>> +        * If uffd is active for the vma we need per-page fault fidelity to
>> +        * maintain the uffd semantics.
>> +        */
>> +       if (unlikely(userfaultfd_armed(vma)))
>> +               goto fallback;
>> +
>> +       /*
>> +        * A large swapped out folio could be partially or fully in zswap. We
>> +        * lack handling for such cases, so fallback to swapping in order-0
>> +        * folio.
>> +        */
>> +       if (!zswap_never_enabled())
>> +               goto fallback;
>> +
>> +       entry = pte_to_swp_entry(vmf->orig_pte);
>> +       /*
>> +        * Get a list of all the (large) orders below PMD_ORDER that are enabled
>> +        * and suitable for swapping THP.
>> +        */
>> +       orders = thp_vma_allowable_orders(vma, vma->vm_flags,
>> +                       TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
>> +       orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>> +       orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders);
>> +
>> +       if (!orders)
>> +               goto fallback;
>> +
>> +       pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address & PMD_MASK, &ptl);
>> +       if (unlikely(!pte))
>> +               goto fallback;
>> +
>> +       /*
>> +        * For do_swap_page, find the highest order where the aligned range is
>> +        * completely swap entries with contiguous swap offsets.
>> +        */
>> +       order = highest_order(orders);
>> +       while (orders) {
>> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>> +               if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order))
>> +                       break;
>> +               order = next_order(&orders, order);
>> +       }
>> +
>> +       pte_unmap_unlock(pte, ptl);
>> +
>> +       /* Try allocating the highest of the remaining orders. */
>> +       gfp = vma_thp_gfp_mask(vma);
>> +       while (orders) {
>> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>> +               folio = vma_alloc_folio(gfp, order, vma, addr, true);
>> +               if (folio)
>> +                       return folio;
>> +               order = next_order(&orders, order);
>> +       }
>> +
>> +fallback:
>> +#endif
>> +       return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
>> +}
>> +
>> +
>>   /*
>>    * We enter with non-exclusive mmap_lock (to exclude vma changes,
>>    * but allow concurrent faults), and pte mapped but not yet locked.
>> @@ -4074,35 +4220,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>          if (!folio) {
>>                  if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>>                      __swap_count(entry) == 1) {
>> -                       /*
>> -                        * Prevent parallel swapin from proceeding with
>> -                        * the cache flag. Otherwise, another thread may
>> -                        * finish swapin first, free the entry, and swapout
>> -                        * reusing the same entry. It's undetectable as
>> -                        * pte_same() returns true due to entry reuse.
>> -                        */
>> -                       if (swapcache_prepare(entry, 1)) {
>> -                               /* Relax a bit to prevent rapid repeated page faults */
>> -                               schedule_timeout_uninterruptible(1);
>> -                               goto out;
>> -                       }
>> -                       need_clear_cache = true;
>> -
>>                          /* skip swapcache */
>> -                       folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
>> -                                               vma, vmf->address, false);
>> +                       folio = alloc_swap_folio(vmf);
>>                          page = &folio->page;
>>                          if (folio) {
>>                                  __folio_set_locked(folio);
>>                                  __folio_set_swapbacked(folio);
>>
>> +                               nr_pages = folio_nr_pages(folio);
>> +                               if (folio_test_large(folio))
>> +                                       entry.val = ALIGN_DOWN(entry.val, nr_pages);
>> +                               /*
>> +                                * Prevent parallel swapin from proceeding with
>> +                                * the cache flag. Otherwise, another thread may
>> +                                * finish swapin first, free the entry, and swapout
>> +                                * reusing the same entry. It's undetectable as
>> +                                * pte_same() returns true due to entry reuse.
>> +                                */
>> +                               if (swapcache_prepare(entry, nr_pages)) {
>> +                                       /* Relax a bit to prevent rapid repeated page faults */
>> +                                       schedule_timeout_uninterruptible(1);
>> +                                       goto out_page;
>> +                               }
>> +                               need_clear_cache = true;
>> +
>>                                  if (mem_cgroup_swapin_charge_folio(folio,
>>                                                          vma->vm_mm, GFP_KERNEL,
>>                                                          entry)) {
>>                                          ret = VM_FAULT_OOM;
>>                                          goto out_page;
>>                                  }
> 
> After your patch, with build kernel test, I'm seeing kernel log
> spamming like this:
> [  101.048594] pagefault_out_of_memory: 95 callbacks suppressed
> [  101.048599] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [  101.059416] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [  101.118575] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [  101.125585] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [  101.182501] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [  101.215351] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [  101.272822] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [  101.403195] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> ............
> 
> And heavy performance loss with workloads limited by memcg, mTHP enabled.
> 
> After some debugging, the problematic part is the
> mem_cgroup_swapin_charge_folio call above.
> When under pressure, cgroup charge fails easily for mTHP. One 64k
> swapin will require a much more aggressive reclaim to success.
> 
> If I change MAX_RECLAIM_RETRIES from 16 to 512, the spamming log is
> gone and mTHP swapin should have a much higher swapin success rate.
> But this might not be the right way.
> 
> For this particular issue, maybe you can change the charge order, try
> charging first, if successful, use mTHP. if failed, fallback to 4k?

This is what we did in alloc_anon_folio(), see 085ff35e7636
("mm: memory: move mem_cgroup_charge() into alloc_anon_folio()"):
1) fall back earlier
2) use the same GFP flags for allocation and charge

but it seems a little more complicated for the swapin charge
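
For reference, a minimal sketch of the pattern Kefeng points at, applied
to alloc_swap_folio(): charge inside the per-order allocation loop with
the same GFP flags used for the allocation, fall back to the next lower
order when the charge fails, and charge the order-0 fallback as well.
The exact placement, the elided #ifdef handling and the charged order-0
tail are assumptions layered on the patch above; Barry's follow-up below
takes essentially this shape.

	/* Sketch only: tail of alloc_swap_folio(); declarations as in the patch above */
	gfp = vma_thp_gfp_mask(vma);
	while (orders) {
		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
		folio = vma_alloc_folio(gfp, order, vma, addr, true);
		if (folio) {
			if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
							    gfp, entry))
				return folio;
			/* Charge failed: drop this folio and retry a smaller order */
			folio_put(folio);
		}
		order = next_order(&orders, order);
	}

fallback:
	/* Charging the order-0 fallback too lets do_swap_page() skip its own charge */
	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
	if (folio && mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, GFP_KERNEL,
						    pte_to_swp_entry(vmf->orig_pte))) {
		folio_put(folio);
		folio = NULL;
	}
	return folio;

With this shape, do_swap_page() no longer charges at all on the
SWP_SYNCHRONOUS_IO path, and the thread below concludes that the
folio_put() in its error path releases the charge automatically.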


> 
>> -                               mem_cgroup_swapin_uncharge_swap(entry, 1);
>> +                               mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
>>
>>                                  shadow = get_shadow_from_swap_cache(entry);
>>                                  if (shadow)
>> @@ -4209,6 +4357,22 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>                  goto out_nomap;
>>          }
>>
>> +       /* allocated large folios for SWP_SYNCHRONOUS_IO */
>> +       if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
>> +               unsigned long nr = folio_nr_pages(folio);
>> +               unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
>> +               unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
>> +               pte_t *folio_ptep = vmf->pte - idx;
>> +
>> +               if (!can_swapin_thp(vmf, folio_ptep, nr))
>> +                       goto out_nomap;
>> +
>> +               page_idx = idx;
>> +               address = folio_start;
>> +               ptep = folio_ptep;
>> +               goto check_folio;
>> +       }
>> +
>>          nr_pages = 1;
>>          page_idx = 0;
>>          address = vmf->address;
>> @@ -4340,11 +4504,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>                  folio_add_lru_vma(folio, vma);
>>          } else if (!folio_test_anon(folio)) {
>>                  /*
>> -                * We currently only expect small !anon folios, which are either
>> -                * fully exclusive or fully shared. If we ever get large folios
>> -                * here, we have to be careful.
>> +                * We currently only expect small !anon folios which are either
>> +                * fully exclusive or fully shared, or new allocated large folios
>> +                * which are fully exclusive. If we ever get large folios within
>> +                * swapcache here, we have to be careful.
>>                   */
>> -               VM_WARN_ON_ONCE(folio_test_large(folio));
>> +               VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio));
>>                  VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
>>                  folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
>>          } else {
>> @@ -4387,7 +4552,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>   out:
>>          /* Clear the swap cache pin for direct swapin after PTL unlock */
>>          if (need_clear_cache)
>> -               swapcache_clear(si, entry, 1);
>> +               swapcache_clear(si, entry, nr_pages);
>>          if (si)
>>                  put_swap_device(si);
>>          return ret;
>> @@ -4403,7 +4568,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>                  folio_put(swapcache);
>>          }
>>          if (need_clear_cache)
>> -               swapcache_clear(si, entry, 1);
>> +               swapcache_clear(si, entry, nr_pages);
>>          if (si)
>>                  put_swap_device(si);
>>          return ret;
>> --
>> 2.34.1
>>
>>
> 
Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices
Posted by Barry Song 2 months ago
On Fri, Aug 16, 2024 at 1:27 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
>
>
> On 2024/8/15 17:47, Kairui Song wrote:
> > On Fri, Aug 2, 2024 at 8:21 PM Barry Song <21cnbao@gmail.com> wrote:
> >>
> >> From: Chuanhua Han <hanchuanhua@oppo.com>
> >
> > Hi Chuanhua,
> >
> >>
> ...
>
> >> +
> >> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >> +{
> >> +       struct vm_area_struct *vma = vmf->vma;
> >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >> +       unsigned long orders;
> >> +       struct folio *folio;
> >> +       unsigned long addr;
> >> +       swp_entry_t entry;
> >> +       spinlock_t *ptl;
> >> +       pte_t *pte;
> >> +       gfp_t gfp;
> >> +       int order;
> >> +
> >> +       /*
> >> +        * If uffd is active for the vma we need per-page fault fidelity to
> >> +        * maintain the uffd semantics.
> >> +        */
> >> +       if (unlikely(userfaultfd_armed(vma)))
> >> +               goto fallback;
> >> +
> >> +       /*
> >> +        * A large swapped out folio could be partially or fully in zswap. We
> >> +        * lack handling for such cases, so fallback to swapping in order-0
> >> +        * folio.
> >> +        */
> >> +       if (!zswap_never_enabled())
> >> +               goto fallback;
> >> +
> >> +       entry = pte_to_swp_entry(vmf->orig_pte);
> >> +       /*
> >> +        * Get a list of all the (large) orders below PMD_ORDER that are enabled
> >> +        * and suitable for swapping THP.
> >> +        */
> >> +       orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> >> +                       TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
> >> +       orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> >> +       orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders);
> >> +
> >> +       if (!orders)
> >> +               goto fallback;
> >> +
> >> +       pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address & PMD_MASK, &ptl);
> >> +       if (unlikely(!pte))
> >> +               goto fallback;
> >> +
> >> +       /*
> >> +        * For do_swap_page, find the highest order where the aligned range is
> >> +        * completely swap entries with contiguous swap offsets.
> >> +        */
> >> +       order = highest_order(orders);
> >> +       while (orders) {
> >> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> >> +               if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order))
> >> +                       break;
> >> +               order = next_order(&orders, order);
> >> +       }
> >> +
> >> +       pte_unmap_unlock(pte, ptl);
> >> +
> >> +       /* Try allocating the highest of the remaining orders. */
> >> +       gfp = vma_thp_gfp_mask(vma);
> >> +       while (orders) {
> >> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> >> +               folio = vma_alloc_folio(gfp, order, vma, addr, true);
> >> +               if (folio)
> >> +                       return folio;
> >> +               order = next_order(&orders, order);
> >> +       }
> >> +
> >> +fallback:
> >> +#endif
> >> +       return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
> >> +}
> >> +
> >> +
> >>   /*
> >>    * We enter with non-exclusive mmap_lock (to exclude vma changes,
> >>    * but allow concurrent faults), and pte mapped but not yet locked.
> >> @@ -4074,35 +4220,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>          if (!folio) {
> >>                  if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> >>                      __swap_count(entry) == 1) {
> >> -                       /*
> >> -                        * Prevent parallel swapin from proceeding with
> >> -                        * the cache flag. Otherwise, another thread may
> >> -                        * finish swapin first, free the entry, and swapout
> >> -                        * reusing the same entry. It's undetectable as
> >> -                        * pte_same() returns true due to entry reuse.
> >> -                        */
> >> -                       if (swapcache_prepare(entry, 1)) {
> >> -                               /* Relax a bit to prevent rapid repeated page faults */
> >> -                               schedule_timeout_uninterruptible(1);
> >> -                               goto out;
> >> -                       }
> >> -                       need_clear_cache = true;
> >> -
> >>                          /* skip swapcache */
> >> -                       folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> >> -                                               vma, vmf->address, false);
> >> +                       folio = alloc_swap_folio(vmf);
> >>                          page = &folio->page;
> >>                          if (folio) {
> >>                                  __folio_set_locked(folio);
> >>                                  __folio_set_swapbacked(folio);
> >>
> >> +                               nr_pages = folio_nr_pages(folio);
> >> +                               if (folio_test_large(folio))
> >> +                                       entry.val = ALIGN_DOWN(entry.val, nr_pages);
> >> +                               /*
> >> +                                * Prevent parallel swapin from proceeding with
> >> +                                * the cache flag. Otherwise, another thread may
> >> +                                * finish swapin first, free the entry, and swapout
> >> +                                * reusing the same entry. It's undetectable as
> >> +                                * pte_same() returns true due to entry reuse.
> >> +                                */
> >> +                               if (swapcache_prepare(entry, nr_pages)) {
> >> +                                       /* Relax a bit to prevent rapid repeated page faults */
> >> +                                       schedule_timeout_uninterruptible(1);
> >> +                                       goto out_page;
> >> +                               }
> >> +                               need_clear_cache = true;
> >> +
> >>                                  if (mem_cgroup_swapin_charge_folio(folio,
> >>                                                          vma->vm_mm, GFP_KERNEL,
> >>                                                          entry)) {
> >>                                          ret = VM_FAULT_OOM;
> >>                                          goto out_page;
> >>                                  }
> >
> > After your patch, with build kernel test, I'm seeing kernel log
> > spamming like this:
> > [  101.048594] pagefault_out_of_memory: 95 callbacks suppressed
> > [  101.048599] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > [  101.059416] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > [  101.118575] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > [  101.125585] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > [  101.182501] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > [  101.215351] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > [  101.272822] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > [  101.403195] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > ............
> >
> > And heavy performance loss with workloads limited by memcg, mTHP enabled.
> >
> > After some debugging, the problematic part is the
> > mem_cgroup_swapin_charge_folio call above.
> > When under pressure, cgroup charge fails easily for mTHP. One 64k
> > swapin will require a much more aggressive reclaim to success.
> >
> > If I change MAX_RECLAIM_RETRIES from 16 to 512, the spamming log is
> > gone and mTHP swapin should have a much higher swapin success rate.
> > But this might not be the right way.
> >
> > For this particular issue, maybe you can change the charge order, try
> > charging first, if successful, use mTHP. if failed, fallback to 4k?
>
> This is what we did in alloc_anon_folio(), see 085ff35e7636
> ("mm: memory: move mem_cgroup_charge() into alloc_anon_folio()"),
> 1) fallback earlier
> 2) using same GFP flags for allocation and charge
>
> but it seems that there is a little complicated for swapin charge

Kefeng, thanks! I guess we can continue using the same approach and
it's not too complicated.

Kairui, sorry for the trouble and thanks for the report! Could you
check if the solution below resolves the issue? On phones, we don't
encounter the scenarios you’re facing.

From 2daaf91077705a8fa26a3a428117f158f05375b0 Mon Sep 17 00:00:00 2001
From: Barry Song <v-songbaohua@oppo.com>
Date: Fri, 16 Aug 2024 10:51:48 +1200
Subject: [PATCH] mm: fall back to next_order if charging mTHP fails

When memcg approaches its limit, charging mTHP becomes difficult.
At this point, when the charge fails, we fall back to the next order
to avoid repeatedly retrying larger orders.

Reported-by: Kairui Song <ryncsn@gmail.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/memory.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 0ed3603aaf31..6cba28ef91e7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4121,8 +4121,12 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 	while (orders) {
 		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
 		folio = vma_alloc_folio(gfp, order, vma, addr, true);
-		if (folio)
-			return folio;
+		if (folio) {
+			if (!mem_cgroup_swapin_charge_folio(folio,
+					vma->vm_mm, gfp, entry))
+				return folio;
+			folio_put(folio);
+		}
 		order = next_order(&orders, order);
 	}
 
@@ -4244,7 +4248,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 				}
 				need_clear_cache = true;
 
-				if (mem_cgroup_swapin_charge_folio(folio,
+				if (nr_pages == 1 && mem_cgroup_swapin_charge_folio(folio,
 							vma->vm_mm, GFP_KERNEL,
 							entry)) {
 					ret = VM_FAULT_OOM;
-- 
2.34.1


Thanks
Barry

Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices
Posted by Matthew Wilcox 2 months ago
On Fri, Aug 16, 2024 at 11:06:12AM +1200, Barry Song wrote:
> When memcg approaches its limit, charging mTHP becomes difficult.
> At this point, when the charge fails, we fallback to the next order
> to avoid repeatedly retrying larger orders.

Why do you always find the ugliest possible solution to a problem?

> @@ -4244,7 +4248,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  				}
>  				need_clear_cache = true;
>  
> -				if (mem_cgroup_swapin_charge_folio(folio,
> +				if (nr_pages == 1 && mem_cgroup_swapin_charge_folio(folio,
>  							vma->vm_mm, GFP_KERNEL,
>  							entry)) {
>  					ret = VM_FAULT_OOM;

Just make alloc_swap_folio() always charge the folio, even for order-0.

And you'll have to uncharge it in the swapcache_prepare() failure case.
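
A rough sketch of what this means for the SWP_SYNCHRONOUS_IO branch of
do_swap_page(), assuming alloc_swap_folio() now always returns an
already-charged folio of whatever order it managed to allocate. This is
illustrative only, not the final v7 diff; identifiers are the ones used
in the patch above.

			/* skip swapcache */
			folio = alloc_swap_folio(vmf);	/* already charged, any order */
			if (folio) {
				__folio_set_locked(folio);
				__folio_set_swapbacked(folio);

				nr_pages = folio_nr_pages(folio);
				if (folio_test_large(folio))
					entry.val = ALIGN_DOWN(entry.val, nr_pages);

				if (swapcache_prepare(entry, nr_pages)) {
					/*
					 * Raced with a parallel swapin; back off.
					 * out_page ends in folio_put(), and dropping
					 * the last reference also releases the memcg
					 * charge taken in alloc_swap_folio(), which is
					 * the uncharge asked for above.
					 */
					schedule_timeout_uninterruptible(1);
					goto out_page;
				}
				need_clear_cache = true;

				/* No mem_cgroup_swapin_charge_folio() here any more */
				mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
				...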
Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices
Posted by Barry Song 2 months ago
On Sat, Aug 17, 2024 at 9:17 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Aug 16, 2024 at 11:06:12AM +1200, Barry Song wrote:
> > When memcg approaches its limit, charging mTHP becomes difficult.
> > At this point, when the charge fails, we fallback to the next order
> > to avoid repeatedly retrying larger orders.
>
> Why do you always find the ugliest possible solution to a problem?
>

I had definitely thought about charging order-0 as well in
alloc_swap_folio(). This quick fix was sent mainly for quick
verification that it resolves the problem. v7 will definitely
charge order-0 in alloc_swap_folio().

> > @@ -4244,7 +4248,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >                               }
> >                               need_clear_cache = true;
> >
> > -                             if (mem_cgroup_swapin_charge_folio(folio,
> > +                             if (nr_pages == 1 && mem_cgroup_swapin_charge_folio(folio,
> >                                                       vma->vm_mm, GFP_KERNEL,
> >                                                       entry)) {
> >                                       ret = VM_FAULT_OOM;
>
> Just make alloc_swap_folio() always charge the folio, even for order-0.
>
> And you'll have to uncharge it in the swapcache_prepare() failure case.

I suppose this is done by folio_put() automatically.

Thanks
Barry
Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices
Posted by Kairui Song 2 months ago
On Fri, Aug 16, 2024 at 7:06 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Fri, Aug 16, 2024 at 1:27 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >
> >
> >
> > On 2024/8/15 17:47, Kairui Song wrote:
> > > On Fri, Aug 2, 2024 at 8:21 PM Barry Song <21cnbao@gmail.com> wrote:
> > >>
> > >> From: Chuanhua Han <hanchuanhua@oppo.com>
> > >
> > > Hi Chuanhua,
> > >
> > >>
> > ...
> >
> > >> +
> > >> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> > >> +{
> > >> +       struct vm_area_struct *vma = vmf->vma;
> > >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > >> +       unsigned long orders;
> > >> +       struct folio *folio;
> > >> +       unsigned long addr;
> > >> +       swp_entry_t entry;
> > >> +       spinlock_t *ptl;
> > >> +       pte_t *pte;
> > >> +       gfp_t gfp;
> > >> +       int order;
> > >> +
> > >> +       /*
> > >> +        * If uffd is active for the vma we need per-page fault fidelity to
> > >> +        * maintain the uffd semantics.
> > >> +        */
> > >> +       if (unlikely(userfaultfd_armed(vma)))
> > >> +               goto fallback;
> > >> +
> > >> +       /*
> > >> +        * A large swapped out folio could be partially or fully in zswap. We
> > >> +        * lack handling for such cases, so fallback to swapping in order-0
> > >> +        * folio.
> > >> +        */
> > >> +       if (!zswap_never_enabled())
> > >> +               goto fallback;
> > >> +
> > >> +       entry = pte_to_swp_entry(vmf->orig_pte);
> > >> +       /*
> > >> +        * Get a list of all the (large) orders below PMD_ORDER that are enabled
> > >> +        * and suitable for swapping THP.
> > >> +        */
> > >> +       orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> > >> +                       TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
> > >> +       orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> > >> +       orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders);
> > >> +
> > >> +       if (!orders)
> > >> +               goto fallback;
> > >> +
> > >> +       pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address & PMD_MASK, &ptl);
> > >> +       if (unlikely(!pte))
> > >> +               goto fallback;
> > >> +
> > >> +       /*
> > >> +        * For do_swap_page, find the highest order where the aligned range is
> > >> +        * completely swap entries with contiguous swap offsets.
> > >> +        */
> > >> +       order = highest_order(orders);
> > >> +       while (orders) {
> > >> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > >> +               if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order))
> > >> +                       break;
> > >> +               order = next_order(&orders, order);
> > >> +       }
> > >> +
> > >> +       pte_unmap_unlock(pte, ptl);
> > >> +
> > >> +       /* Try allocating the highest of the remaining orders. */
> > >> +       gfp = vma_thp_gfp_mask(vma);
> > >> +       while (orders) {
> > >> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > >> +               folio = vma_alloc_folio(gfp, order, vma, addr, true);
> > >> +               if (folio)
> > >> +                       return folio;
> > >> +               order = next_order(&orders, order);
> > >> +       }
> > >> +
> > >> +fallback:
> > >> +#endif
> > >> +       return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
> > >> +}
> > >> +
> > >> +
> > >>   /*
> > >>    * We enter with non-exclusive mmap_lock (to exclude vma changes,
> > >>    * but allow concurrent faults), and pte mapped but not yet locked.
> > >> @@ -4074,35 +4220,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >>          if (!folio) {
> > >>                  if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> > >>                      __swap_count(entry) == 1) {
> > >> -                       /*
> > >> -                        * Prevent parallel swapin from proceeding with
> > >> -                        * the cache flag. Otherwise, another thread may
> > >> -                        * finish swapin first, free the entry, and swapout
> > >> -                        * reusing the same entry. It's undetectable as
> > >> -                        * pte_same() returns true due to entry reuse.
> > >> -                        */
> > >> -                       if (swapcache_prepare(entry, 1)) {
> > >> -                               /* Relax a bit to prevent rapid repeated page faults */
> > >> -                               schedule_timeout_uninterruptible(1);
> > >> -                               goto out;
> > >> -                       }
> > >> -                       need_clear_cache = true;
> > >> -
> > >>                          /* skip swapcache */
> > >> -                       folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > >> -                                               vma, vmf->address, false);
> > >> +                       folio = alloc_swap_folio(vmf);
> > >>                          page = &folio->page;
> > >>                          if (folio) {
> > >>                                  __folio_set_locked(folio);
> > >>                                  __folio_set_swapbacked(folio);
> > >>
> > >> +                               nr_pages = folio_nr_pages(folio);
> > >> +                               if (folio_test_large(folio))
> > >> +                                       entry.val = ALIGN_DOWN(entry.val, nr_pages);
> > >> +                               /*
> > >> +                                * Prevent parallel swapin from proceeding with
> > >> +                                * the cache flag. Otherwise, another thread may
> > >> +                                * finish swapin first, free the entry, and swapout
> > >> +                                * reusing the same entry. It's undetectable as
> > >> +                                * pte_same() returns true due to entry reuse.
> > >> +                                */
> > >> +                               if (swapcache_prepare(entry, nr_pages)) {
> > >> +                                       /* Relax a bit to prevent rapid repeated page faults */
> > >> +                                       schedule_timeout_uninterruptible(1);
> > >> +                                       goto out_page;
> > >> +                               }
> > >> +                               need_clear_cache = true;
> > >> +
> > >>                                  if (mem_cgroup_swapin_charge_folio(folio,
> > >>                                                          vma->vm_mm, GFP_KERNEL,
> > >>                                                          entry)) {
> > >>                                          ret = VM_FAULT_OOM;
> > >>                                          goto out_page;
> > >>                                  }
> > >
> > > After your patch, with build kernel test, I'm seeing kernel log
> > > spamming like this:
> > > [  101.048594] pagefault_out_of_memory: 95 callbacks suppressed
> > > [  101.048599] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [  101.059416] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [  101.118575] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [  101.125585] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [  101.182501] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [  101.215351] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [  101.272822] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [  101.403195] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > ............
> > >
> > > And heavy performance loss with workloads limited by memcg, mTHP enabled.
> > >
> > > After some debugging, the problematic part is the
> > > mem_cgroup_swapin_charge_folio call above.
> > > When under pressure, cgroup charge fails easily for mTHP. One 64k
> > > swapin will require a much more aggressive reclaim to success.
> > >
> > > If I change MAX_RECLAIM_RETRIES from 16 to 512, the spamming log is
> > > gone and mTHP swapin should have a much higher swapin success rate.
> > > But this might not be the right way.
> > >
> > > For this particular issue, maybe you can change the charge order, try
> > > charging first, if successful, use mTHP. if failed, fallback to 4k?
> >
> > This is what we did in alloc_anon_folio(), see 085ff35e7636
> > ("mm: memory: move mem_cgroup_charge() into alloc_anon_folio()"),
> > 1) fallback earlier
> > 2) using same GFP flags for allocation and charge
> >
> > but it seems that there is a little complicated for swapin charge
>
> Kefeng, thanks! I guess we can continue using the same approach and
> it's not too complicated.
>
> Kairui, sorry for the trouble and thanks for the report! could you
> check if the solution below resolves the issue? On phones, we don't
> encounter the scenarios you’re facing.
>
> From 2daaf91077705a8fa26a3a428117f158f05375b0 Mon Sep 17 00:00:00 2001
> From: Barry Song <v-songbaohua@oppo.com>
> Date: Fri, 16 Aug 2024 10:51:48 +1200
> Subject: [PATCH] mm: fallback to next_order if charing mTHP fails
>
> When memcg approaches its limit, charging mTHP becomes difficult.
> At this point, when the charge fails, we fallback to the next order
> to avoid repeatedly retrying larger orders.
>
> Reported-by: Kairui Song <ryncsn@gmail.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  mm/memory.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 0ed3603aaf31..6cba28ef91e7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4121,8 +4121,12 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>         while (orders) {
>                 addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>                 folio = vma_alloc_folio(gfp, order, vma, addr, true);
> -               if (folio)
> -                       return folio;
> +               if (folio) {
> +                       if (!mem_cgroup_swapin_charge_folio(folio,
> +                                       vma->vm_mm, gfp, entry))
> +                               return folio;
> +                       folio_put(folio);
> +               }
>                 order = next_order(&orders, order);
>         }
>
> @@ -4244,7 +4248,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                                 }
>                                 need_clear_cache = true;
>
> -                               if (mem_cgroup_swapin_charge_folio(folio,
> +                               if (nr_pages == 1 && mem_cgroup_swapin_charge_folio(folio,
>                                                         vma->vm_mm, GFP_KERNEL,
>                                                         entry)) {
>                                         ret = VM_FAULT_OOM;
> --
> 2.34.1
>

Hi Barry

After the fix the spamming log is gone, thanks for the fix.

>
> Thanks
> Barry
>
Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices
Posted by Andrew Morton 2 months ago
On Sat, 17 Aug 2024 00:50:00 +0800 Kairui Song <ryncsn@gmail.com> wrote:

> > --
> > 2.34.1
> >
> 
> Hi Barry
> 
> After the fix the spamming log is gone, thanks for the fix.
> 

Thanks, I'll drop the v6 series.
Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices
Posted by Chuanhua Han 1 month, 3 weeks ago
Andrew Morton <akpm@linux-foundation.org> wrote on Sat, Aug 17, 2024 at 04:35:
>
> On Sat, 17 Aug 2024 00:50:00 +0800 Kairui Song <ryncsn@gmail.com> wrote:
>
> > > --
> > > 2.34.1
> > >
> >
> > Hi Barry
> >
> > After the fix the spamming log is gone, thanks for the fix.
> >
>
> Thanks, I'll drop the v6 series.
Hi, Andrew

Can you please queue v7 for testing:
https://lore.kernel.org/linux-mm/20240821074541.516249-1-hanchuanhua@oppo.com/

V7 has addressed all comments regarding the changelog, the subject and
order-0 charge from Christoph, Kairui and Willy.
>


-- 
Thanks,
Chuanhua
Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices
Posted by Christoph Hellwig 2 months ago
The subject feels wrong.  Nothing particular about zram, it is all about
SWP_SYNCHRONOUS_IO, so the Subject and commit log should state that.

On Sat, Aug 03, 2024 at 12:20:31AM +1200, Barry Song wrote:
> From: Chuanhua Han <hanchuanhua@oppo.com>
> 
> Currently, we have mTHP features, but unfortunately, without support for large
> folio swap-ins, once these large folios are swapped out, they are lost because
> mTHP swap is a one-way process. The lack of mTHP swap-in functionality prevents

Please wrap your commit logs after 73 characters to make them readable.

> +/*
> + * check a range of PTEs are completely swap entries with
> + * contiguous swap offsets and the same SWAP_HAS_CACHE.
> + * ptep must be first one in the range
> + */

Please capitalize the first character of block comments, make them full
sentences and use up all 80 characters.

> +	for (i = 1; i < nr_pages; i++) {
> +		/*
> +		 * while allocating a large folio and doing swap_read_folio for the

And also do not go over 80 characters for them, which renders them
really hard to read.

> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE

Please stub out the entire function.
Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices
Posted by Barry Song 2 months ago
On Mon, Aug 12, 2024 at 8:27 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> The subject feels wrong.  Nothing particular about zram, it is all about
> SWP_SYNCHRONOUS_IO, so the Subject and commit log should state that.

Right.

This is indeed about sync IO; zRAM is the most typical such device and
is widely used in Android and embedded systems. Others could be
nvdimm and brd.

>
> On Sat, Aug 03, 2024 at 12:20:31AM +1200, Barry Song wrote:
> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >
> > Currently, we have mTHP features, but unfortunately, without support for large
> > folio swap-ins, once these large folios are swapped out, they are lost because
> > mTHP swap is a one-way process. The lack of mTHP swap-in functionality prevents
>
> Please wrap your commit logs after 73 characters to make them readable.

ack.

>
> > +/*
> > + * check a range of PTEs are completely swap entries with
> > + * contiguous swap offsets and the same SWAP_HAS_CACHE.
> > + * ptep must be first one in the range
> > + */
>
> Please capitalize the first character of block comments, make them full
> sentences and use up all 80 characters.

ack.

>
> > +     for (i = 1; i < nr_pages; i++) {
> > +             /*
> > +              * while allocating a large folio and doing swap_read_folio for the
>
> And also do not go over 80 characters for them, which renders them
> really hard to read.
>
> > +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> > +{
> > +     struct vm_area_struct *vma = vmf->vma;
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>
> Please stub out the entire function.

I assume you mean the below?

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static struct folio *alloc_swap_folio(struct vm_fault *vmf)
{
}
#else
static struct folio *alloc_swap_folio(struct vm_fault *vmf)
{
}
#endif

If so, this is fine by me. The only reason I am using the current
pattern is that I am trying to follow the same pattern as

static struct folio *alloc_anon_folio(struct vm_fault *vmf)
{
        struct vm_area_struct *vma = vmf->vma;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#endif
        ...
}

Likely we also want to change that one?

Thanks
Barry
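
Concretely, the fully stubbed-out shape would look roughly like the
sketch below (the "..." stands for the mTHP-aware body from the patch).
As a side effect it also avoids the unused-'vma' warning Andrew reports
further down, because the !THP stub declares nothing it does not use.

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static struct folio *alloc_swap_folio(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	...
	/* mTHP-aware allocation, falling back to order 0 as a last resort */
}
#else
static struct folio *alloc_swap_folio(struct vm_fault *vmf)
{
	/* !THP builds only ever swap in order-0 folios */
	return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vmf->vma,
			       vmf->address, false);
}
#endif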
Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices
Posted by Christoph Hellwig 2 months ago
On Mon, Aug 12, 2024 at 08:53:06PM +1200, Barry Song wrote:
> On Mon, Aug 12, 2024 at 8:27 PM Christoph Hellwig <hch@infradead.org> wrote:
> I assume you mean the below?
> 
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> {
> }
> #else
> static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> {
> }
> #endif

Yes.

> If so, this is fine to me. the only reason I am using the current
> pattern is that i am trying to follow the same pattern with
> 
> static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> {
>         struct vm_area_struct *vma = vmf->vma;
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> #endif
>         ...
> }
> 
> Likely we also want to change that one?

It would be nice to fix that as well, probably not in this series,
though.

Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices
Posted by Andrew Morton 2 months, 2 weeks ago
On Sat,  3 Aug 2024 00:20:31 +1200 Barry Song <21cnbao@gmail.com> wrote:

> From: Chuanhua Han <hanchuanhua@oppo.com>
> 
> Currently, we have mTHP features, but unfortunately, without support for large
> folio swap-ins, once these large folios are swapped out, they are lost because
> mTHP swap is a one-way process. The lack of mTHP swap-in functionality prevents
> mTHP from being used on devices like Android that heavily rely on swap.
> 
> This patch introduces mTHP swap-in support. It starts from sync devices such
> as zRAM. This is probably the simplest and most common use case, benefiting
> billions of Android phones and similar devices with minimal implementation
> cost. In this straightforward scenario, large folios are always exclusive,
> eliminating the need to handle complex rmap and swapcache issues.
> 
> It offers several benefits:
> 1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
>    swap-out and swap-in. Large folios in the buddy system are also
>    preserved as much as possible, rather than being fragmented due
>    to swap-in.
> 
> 2. Eliminates fragmentation in swap slots and supports successful
>    THP_SWPOUT.
> 
>    w/o this patch (Refer to the data from Chris's and Kairui's latest
>    swap allocator optimization while running ./thp_swap_allocator_test
>    w/o "-a" option [1]):
> 
> ...
>
> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>
> ...
>
> +#endif
> +	return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
> +}

Generates an unused-variable warning with allnoconfig.  Because
vma_alloc_folio_noprof() was implemented as a macro instead of an
inlined C function.  Why do we keep doing this.

Please check:

From: Andrew Morton <akpm@linux-foundation.org>
Subject: mm-support-large-folios-swap-in-for-zram-like-devices-fix
Date: Sat Aug  3 11:59:00 AM PDT 2024

fix unused var warning

mm/memory.c: In function 'alloc_swap_folio':
mm/memory.c:4062:32: warning: unused variable 'vma' [-Wunused-variable]
 4062 |         struct vm_area_struct *vma = vmf->vma;
      |                                ^~~

Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Chuanhua Han <hanchuanhua@oppo.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gao Xiang <xiang@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--- a/mm/memory.c~mm-support-large-folios-swap-in-for-zram-like-devices-fix
+++ a/mm/memory.c
@@ -4059,8 +4059,8 @@ static inline bool can_swapin_thp(struct
 
 static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 {
-	struct vm_area_struct *vma = vmf->vma;
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	struct vm_area_struct *vma = vmf->vma;
 	unsigned long orders;
 	struct folio *folio;
 	unsigned long addr;
@@ -4128,7 +4128,8 @@ static struct folio *alloc_swap_folio(st
 
 fallback:
 #endif
-	return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
+	return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vmf->vma,
+			       vmf->address, false);
 }
 
 
_