[PATCH 1/4] mm/shmem, swap: improve cached mTHP handling and fix potential hang

Kairui Song posted 4 patches 3 months, 3 weeks ago
There is a newer version of this series
Posted by Kairui Song 3 months, 3 weeks ago
From: Kairui Song <kasong@tencent.com>

The current swap-in code assumes that, when a swap entry in the shmem
mapping is order 0, its cached folio (if present) must be order 0 too,
which turns out to not always be true.

The problem is that shmem_split_large_entry() is called before verifying
that the folio will eventually be swapped in. One possible race:

    CPU1                          CPU2
shmem_swapin_folio
/* swap in of order > 0 swap entry S1 */
  folio = swap_cache_get_folio
  /* folio = NULL */
  order = xa_get_order
  /* order > 0 */
  folio = shmem_swap_alloc_folio
  /* mTHP alloc failure, folio = NULL */
  <... Interrupted ...>
                                 shmem_swapin_folio
                                 /* S1 is swapped in */
                                 shmem_writeout
                                 /* S1 is swapped out, folio cached */
  shmem_split_large_entry(..., S1)
  /* S1 is split, but the folio covering it has order > 0 now */

Now any following swap-in of S1 will hang: `xa_get_order` returns 0,
but the folio lookup returns a folio with order > 0, so the
`xa_get_order(&mapping->i_pages, index) != folio_order(folio)` check is
always true. Swap-in keeps returning -EEXIST, the caller keeps
retrying, and it loops forever.
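
To make the livelock concrete, here is a minimal standalone userspace
model (not kernel code; the orders are made-up example values) of why
the old exact-order check can never succeed once the entry has been
split while a larger folio stays in the swap cache, and the caller
(shmem_get_folio_gfp()) just keeps retrying on -EEXIST:

  #include <stdio.h>

  int main(void)
  {
      int mapping_order = 0; /* entry after shmem_split_large_entry() */
      int folio_order = 2;   /* folio left in the swap cache is still large */
      int attempt;

      for (attempt = 1; attempt <= 3; attempt++) {
          if (mapping_order != folio_order) {
              /* old check: orders must match exactly, else -EEXIST */
              printf("attempt %d: -EEXIST, caller retries\n", attempt);
              continue;
          }
          printf("attempt %d: swap-in completes\n", attempt);
          return 0;
      }
      printf("the state never changes, so the real retry loop spins forever\n");
      return 0;
  }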

This is also fragile. So fix it by allowing a larger folio to be found
in the swap cache, and by checking that the whole shmem mapping range
covered by the swap-in holds the expected swap values when inserting
the folio. Also drop the now redundant tree walks before the insertion.
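
For illustration, a standalone userspace sketch of that insertion-time
range check (made-up swap values; the real code walks the conflicting
entries with xas_for_each_conflict() under the xa lock):

  #include <stdio.h>

  struct slot { unsigned long val; int order; };

  int main(void)
  {
      unsigned long swap = 0x100, nr = 4;  /* folio covers 4 slots */
      unsigned long iter = swap;
      /* e.g. the original order-2 entry was split into order-0 entries */
      struct slot entries[] = {
          { 0x100, 0 }, { 0x101, 0 }, { 0x102, 0 }, { 0x103, 0 },
      };
      int i, n = sizeof(entries) / sizeof(entries[0]);

      for (i = 0; i < n; i++) {
          if (entries[i].val != iter) {
              printf("unexpected swap value, bail out with -EEXIST\n");
              return 1;
          }
          iter += 1UL << entries[i].order;  /* skip the whole entry */
      }
      if (iter - nr != swap) {
          printf("range not fully covered, bail out with -EEXIST\n");
          return 1;
      }
      printf("range [0x%lx, 0x%lx) verified, insert the folio\n",
             swap, swap + nr);
      return 0;
  }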

This actually improves performance, as it avoids two redundant XArray
tree walks in the hot path. The only side effect is that, in the
failure path, shmem may redundantly reallocate a few folios, causing
slight, temporary memory pressure.

Also worth noting: it may seem that the order and value checks before
insertion would help reduce lock contention, but that is not true. The
swap cache layer ensures that a raced swap-in either sees a swap cache
folio or fails to swap in (the SWAP_HAS_CACHE bit is set even when the
swap cache is bypassed), so holding the folio lock and checking the
folio flag is already enough to avoid lock contention. The chance that
a folio passes the swap entry value check while the shmem mapping slot
has changed should be very low.
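
That swap entry value check relies on the value being aligned when the
cached folio is larger than the entry being swapped in. A tiny
userspace sketch of that alignment (made-up values, assuming the
entries backing a large folio are contiguous and naturally aligned):

  #include <stdio.h>

  int main(void)
  {
      unsigned long swap_val = 0x1236; /* order-0 entry being swapped in */
      unsigned int folio_order = 2;    /* cached folio covers 4 entries */
      /* same as round_down(swap_val, 1 << folio_order) */
      unsigned long aligned = swap_val & ~((1UL << folio_order) - 1);

      /* prints: entry 0x1236 belongs to the cached folio at 0x1234 */
      printf("entry 0x%lx belongs to the cached folio at 0x%lx\n",
             swap_val, aligned);
      return 0;
  }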

Cc: stable@vger.kernel.org
Fixes: 058313515d5a ("mm: shmem: fix potential data corruption during shmem swapin")
Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/shmem.c | 30 +++++++++++++++++++++---------
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index eda35be2a8d9..4e7ef343a29b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -884,7 +884,9 @@ static int shmem_add_to_page_cache(struct folio *folio,
 				   pgoff_t index, void *expected, gfp_t gfp)
 {
 	XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
-	long nr = folio_nr_pages(folio);
+	unsigned long nr = folio_nr_pages(folio);
+	swp_entry_t iter, swap;
+	void *entry;
 
 	VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
@@ -896,14 +898,24 @@ static int shmem_add_to_page_cache(struct folio *folio,
 
 	gfp &= GFP_RECLAIM_MASK;
 	folio_throttle_swaprate(folio, gfp);
+	swap = iter = radix_to_swp_entry(expected);
 
 	do {
 		xas_lock_irq(&xas);
-		if (expected != xas_find_conflict(&xas)) {
-			xas_set_err(&xas, -EEXIST);
-			goto unlock;
+		xas_for_each_conflict(&xas, entry) {
+			/*
+			 * The range must either be empty, or filled with
+			 * expected swap entries. Shmem swap entries are never
+			 * partially freed without split of both entry and
+			 * folio, so there shouldn't be any holes.
+			 */
+			if (!expected || entry != swp_to_radix_entry(iter)) {
+				xas_set_err(&xas, -EEXIST);
+				goto unlock;
+			}
+			iter.val += 1 << xas_get_order(&xas);
 		}
-		if (expected && xas_find_conflict(&xas)) {
+		if (expected && iter.val - nr != swap.val) {
 			xas_set_err(&xas, -EEXIST);
 			goto unlock;
 		}
@@ -2323,7 +2335,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 			error = -ENOMEM;
 			goto failed;
 		}
-	} else if (order != folio_order(folio)) {
+	} else if (order > folio_order(folio)) {
 		/*
 		 * Swap readahead may swap in order 0 folios into swapcache
 		 * asynchronously, while the shmem mapping can still stores
@@ -2348,15 +2360,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 
 			swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
 		}
+	} else if (order < folio_order(folio)) {
+		swap.val = round_down(swap.val, 1 << folio_order(folio));
 	}
 
 alloced:
 	/* We have to do this with folio locked to prevent races */
 	folio_lock(folio);
 	if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
-	    folio->swap.val != swap.val ||
-	    !shmem_confirm_swap(mapping, index, swap) ||
-	    xa_get_order(&mapping->i_pages, index) != folio_order(folio)) {
+	    folio->swap.val != swap.val) {
 		error = -EEXIST;
 		goto unlock;
 	}
-- 
2.50.0
Re: [PATCH 1/4] mm/shmem, swap: improve cached mTHP handling and fix potential hang
Posted by Kemeng Shi 3 months, 3 weeks ago

on 6/18/2025 2:35 AM, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> The current swap-in code assumes that, when a swap entry in the shmem
> mapping is order 0, its cached folio (if present) must be order 0 too,
> which turns out to not always be true.
> 
> The problem is that shmem_split_large_entry() is called before verifying
> that the folio will eventually be swapped in. One possible race:
> 
>     CPU1                          CPU2
> shmem_swapin_folio
> /* swap in of order > 0 swap entry S1 */
>   folio = swap_cache_get_folio
>   /* folio = NULL */
>   order = xa_get_order
>   /* order > 0 */
>   folio = shmem_swap_alloc_folio
>   /* mTHP alloc failure, folio = NULL */
>   <... Interrupted ...>
>                                  shmem_swapin_folio
>                                  /* S1 is swapped in */
>                                  shmem_writeout
>                                  /* S1 is swapped out, folio cached */
>   shmem_split_large_entry(..., S1)
>   /* S1 is split, but the folio covering it has order > 0 now */
> 
> Now any following swap-in of S1 will hang: `xa_get_order` returns 0,
> but the folio lookup returns a folio with order > 0, so the
> `xa_get_order(&mapping->i_pages, index) != folio_order(folio)` check is
> always true. Swap-in keeps returning -EEXIST, the caller keeps
> retrying, and it loops forever.
> 
> This is also fragile. So fix it by allowing a larger folio to be found
> in the swap cache, and by checking that the whole shmem mapping range
> covered by the swap-in holds the expected swap values when inserting
> the folio. Also drop the now redundant tree walks before the insertion.
> 
> This actually improves performance, as it avoids two redundant XArray
> tree walks in the hot path. The only side effect is that, in the
> failure path, shmem may redundantly reallocate a few folios, causing
> slight, temporary memory pressure.
> 
> Also worth noting: it may seem that the order and value checks before
> insertion would help reduce lock contention, but that is not true. The
> swap cache layer ensures that a raced swap-in either sees a swap cache
> folio or fails to swap in (the SWAP_HAS_CACHE bit is set even when the
> swap cache is bypassed), so holding the folio lock and checking the
> folio flag is already enough to avoid lock contention. The chance that
> a folio passes the swap entry value check while the shmem mapping slot
> has changed should be very low.
> 
> Cc: stable@vger.kernel.org
> Fixes: 058313515d5a ("mm: shmem: fix potential data corruption during shmem swapin")
> Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/shmem.c | 30 +++++++++++++++++++++---------
>  1 file changed, 21 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index eda35be2a8d9..4e7ef343a29b 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -884,7 +884,9 @@ static int shmem_add_to_page_cache(struct folio *folio,
>  				   pgoff_t index, void *expected, gfp_t gfp)
>  {
>  	XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
> -	long nr = folio_nr_pages(folio);
> +	unsigned long nr = folio_nr_pages(folio);
> +	swp_entry_t iter, swap;
> +	void *entry;
>  
>  	VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
>  	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> @@ -896,14 +898,24 @@ static int shmem_add_to_page_cache(struct folio *folio,
>  
>  	gfp &= GFP_RECLAIM_MASK;
>  	folio_throttle_swaprate(folio, gfp);
> +	swap = iter = radix_to_swp_entry(expected);
>  
>  	do {
>  		xas_lock_irq(&xas);
> -		if (expected != xas_find_conflict(&xas)) {
> -			xas_set_err(&xas, -EEXIST);
> -			goto unlock;
> +		xas_for_each_conflict(&xas, entry) {
> +			/*
> +			 * The range must either be empty, or filled with
> +			 * expected swap entries. Shmem swap entries are never
> +			 * partially freed without split of both entry and
> +			 * folio, so there shouldn't be any holes.
> +			 */
> +			if (!expected || entry != swp_to_radix_entry(iter)) {
> +				xas_set_err(&xas, -EEXIST);
> +				goto unlock;
> +			}
> +			iter.val += 1 << xas_get_order(&xas);
>  		}
> -		if (expected && xas_find_conflict(&xas)) {
> +		if (expected && iter.val - nr != swap.val) {
>  			xas_set_err(&xas, -EEXIST);
>  			goto unlock;
>  		}
> @@ -2323,7 +2335,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>  			error = -ENOMEM;
>  			goto failed;
>  		}
> -	} else if (order != folio_order(folio)) {
> +	} else if (order > folio_order(folio)) {
>  		/*
>  		 * Swap readahead may swap in order 0 folios into swapcache
>  		 * asynchronously, while the shmem mapping can still stores
> @@ -2348,15 +2360,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>  
>  			swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>  		}
> +	} else if (order < folio_order(folio)) {
> +		swap.val = round_down(swap.val, 1 << folio_order(folio));
>  	}
>  
>  alloced:
>  	/* We have to do this with folio locked to prevent races */
>  	folio_lock(folio);
>  	if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
> -	    folio->swap.val != swap.val ||
> -	    !shmem_confirm_swap(mapping, index, swap) ||
> -	    xa_get_order(&mapping->i_pages, index) != folio_order(folio)) {
> +	    folio->swap.val != swap.val) {
>  		error = -EEXIST;
>  		goto unlock;
>  	}
> 
Nice catch!
Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
Re: [PATCH 1/4] mm/shmem, swap: improve cached mTHP handling and fix potential hang
Posted by Andrew Morton 3 months, 3 weeks ago
On Wed, 18 Jun 2025 02:35:00 +0800 Kairui Song <ryncsn@gmail.com> wrote:

> From: Kairui Song <kasong@tencent.com>
> 
> The current swap-in code assumes that, when a swap entry in the shmem
> mapping is order 0, its cached folio (if present) must be order 0 too,
> which turns out to not always be true.
> 
> The problem is that shmem_split_large_entry() is called before verifying
> that the folio will eventually be swapped in. One possible race:
> 
>     CPU1                          CPU2
> shmem_swapin_folio
> /* swap in of order > 0 swap entry S1 */
>   folio = swap_cache_get_folio
>   /* folio = NULL */
>   order = xa_get_order
>   /* order > 0 */
>   folio = shmem_swap_alloc_folio
>   /* mTHP alloc failure, folio = NULL */
>   <... Interrupted ...>
>                                  shmem_swapin_folio
>                                  /* S1 is swapped in */
>                                  shmem_writeout
>                                  /* S1 is swapped out, folio cached */
>   shmem_split_large_entry(..., S1)
>   /* S1 is split, but the folio covering it has order > 0 now */
> 
> Now any following swap-in of S1 will hang: `xa_get_order` returns 0,
> but the folio lookup returns a folio with order > 0, so the
> `xa_get_order(&mapping->i_pages, index) != folio_order(folio)` check is
> always true. Swap-in keeps returning -EEXIST, the caller keeps
> retrying, and it loops forever.
> 
> This is also fragile. So fix it by allowing a larger folio to be found
> in the swap cache, and by checking that the whole shmem mapping range
> covered by the swap-in holds the expected swap values when inserting
> the folio. Also drop the now redundant tree walks before the insertion.
> 
> This actually improves performance, as it avoids two redundant XArray
> tree walks in the hot path. The only side effect is that, in the
> failure path, shmem may redundantly reallocate a few folios, causing
> slight, temporary memory pressure.
> 
> Also worth noting: it may seem that the order and value checks before
> insertion would help reduce lock contention, but that is not true. The
> swap cache layer ensures that a raced swap-in either sees a swap cache
> folio or fails to swap in (the SWAP_HAS_CACHE bit is set even when the
> swap cache is bypassed), so holding the folio lock and checking the
> folio flag is already enough to avoid lock contention. The chance that
> a folio passes the swap entry value check while the shmem mapping slot
> has changed should be very low.
> 
> Cc: stable@vger.kernel.org
> Fixes: 058313515d5a ("mm: shmem: fix potential data corruption during shmem swapin")
> Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")

The Fixes: tells -stable maintainers (and others) which kernel versions
need the fix.  So having two Fixes: against different kernel versions is
very confusing!  Are we recommending that kernels which contain
809bc86517cc but not 058313515d5a be patched?
Re: [PATCH 1/4] mm/shmem, swap: improve cached mTHP handling and fix potential hang
Posted by Kairui Song 3 months, 3 weeks ago
On Wed, Jun 18, 2025 at 6:58 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 18 Jun 2025 02:35:00 +0800 Kairui Song <ryncsn@gmail.com> wrote:
>
> > From: Kairui Song <kasong@tencent.com>
> >
> > The current swap-in code assumes that, when a swap entry in the shmem
> > mapping is order 0, its cached folio (if present) must be order 0 too,
> > which turns out to not always be true.
> >
> > The problem is that shmem_split_large_entry() is called before verifying
> > that the folio will eventually be swapped in. One possible race:
> >
> >     CPU1                          CPU2
> > shmem_swapin_folio
> > /* swap in of order > 0 swap entry S1 */
> >   folio = swap_cache_get_folio
> >   /* folio = NULL */
> >   order = xa_get_order
> >   /* order > 0 */
> >   folio = shmem_swap_alloc_folio
> >   /* mTHP alloc failure, folio = NULL */
> >   <... Interrupted ...>
> >                                  shmem_swapin_folio
> >                                  /* S1 is swapped in */
> >                                  shmem_writeout
> >                                  /* S1 is swapped out, folio cached */
> >   shmem_split_large_entry(..., S1)
> >   /* S1 is split, but the folio covering it has order > 0 now */
> >
> > Now any following swap-in of S1 will hang: `xa_get_order` returns 0,
> > but the folio lookup returns a folio with order > 0, so the
> > `xa_get_order(&mapping->i_pages, index) != folio_order(folio)` check is
> > always true. Swap-in keeps returning -EEXIST, the caller keeps
> > retrying, and it loops forever.
> >
> > This is also fragile. So fix it by allowing a larger folio to be found
> > in the swap cache, and by checking that the whole shmem mapping range
> > covered by the swap-in holds the expected swap values when inserting
> > the folio. Also drop the now redundant tree walks before the insertion.
> >
> > This actually improves performance, as it avoids two redundant XArray
> > tree walks in the hot path. The only side effect is that, in the
> > failure path, shmem may redundantly reallocate a few folios, causing
> > slight, temporary memory pressure.
> >
> > Also worth noting: it may seem that the order and value checks before
> > insertion would help reduce lock contention, but that is not true. The
> > swap cache layer ensures that a raced swap-in either sees a swap cache
> > folio or fails to swap in (the SWAP_HAS_CACHE bit is set even when the
> > swap cache is bypassed), so holding the folio lock and checking the
> > folio flag is already enough to avoid lock contention. The chance that
> > a folio passes the swap entry value check while the shmem mapping slot
> > has changed should be very low.
> >
> > Cc: stable@vger.kernel.org
> > Fixes: 058313515d5a ("mm: shmem: fix potential data corruption during shmem swapin")
> > Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
>
> The Fixes: tells -stable maintainers (and others) which kernel versions
> need the fix.  So having two Fixes: against different kernel versions is
> very confusing!  Are we recommending that kernels which contain
> 809bc86517cc but not 058313515d5a be patched?

809bc86517cc introduced mTHP support for shmem but was buggy, and
058313515d5a tried to fix that but is also buggy; I thought listing
both could help people backport this.

I think keeping either one is OK, so I'll keep 809bc86517cc: any branch
that has 809bc86517cc should already have 058313515d5a backported.