mm, swap: never bypass swap cache and cleanup flags (swap table phase II)

[PATCH 10/19] mm, swap: consolidate cluster reclaim and check logic

Posted by Kairui Song 3 months, 1 week ago

From: Kairui Song <kasong@tencent.com>

Swap cluster cache reclaim requires releasing the lock, so some extra
checks are needed after the reclaim. To prepare for checking swap cache
using the swap table directly, consolidate the swap cluster reclaim and
check logic.

Also, adjust it very slightly. By moving the cluster empty and usable
check into the reclaim helper, it will avoid a redundant scan of the
slots if the cluster is empty.

And always scan the whole region during reclaim, don't skip slots
covered by a reclaimed folio. Because the reclaim is lockless, it's
possible that new cache lands at any time. And for allocation, we want
all caches to be reclaimed to avoid fragmentation. And besides, if the
scan offset is not aligned with the size of the reclaimed folio, we are
skipping some existing caches.

There should be no observable behavior change, though this might slightly
improve fragmentation or performance.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swapfile.c | 47 +++++++++++++++++++++++------------------------
 1 file changed, 23 insertions(+), 24 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index d66141f1c452..e4c521528817 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -778,42 +778,50 @@ static int swap_cluster_setup_bad_slot(struct swap_cluster_info *cluster_info,
 	return 0;
 }
 
-static bool cluster_reclaim_range(struct swap_info_struct *si,
-				  struct swap_cluster_info *ci,
-				  unsigned long start, unsigned long end)
+static unsigned int cluster_reclaim_range(struct swap_info_struct *si,
+					  struct swap_cluster_info *ci,
+					  unsigned long start, unsigned int order)
 {
+	unsigned int nr_pages = 1 << order;
+	unsigned long offset = start, end = start + nr_pages;
 	unsigned char *map = si->swap_map;
-	unsigned long offset = start;
 	int nr_reclaim;
 
 	spin_unlock(&ci->lock);
 	do {
 		switch (READ_ONCE(map[offset])) {
 		case 0:
-			offset++;
 			break;
 		case SWAP_HAS_CACHE:
 			nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
-			if (nr_reclaim > 0)
-				offset += nr_reclaim;
-			else
+			if (nr_reclaim < 0)
 				goto out;
 			break;
 		default:
 			goto out;
 		}
-	} while (offset < end);
+	} while (++offset < end);
 out:
 	spin_lock(&ci->lock);
+
+	/*
+	 * We just dropped ci->lock so cluster could be used by another
+	 * order or got freed, check if it's still usable or empty.
+	 */
+	if (!cluster_is_usable(ci, order))
+		return SWAP_ENTRY_INVALID;
+	if (cluster_is_empty(ci))
+		return cluster_offset(si, ci);
+
 	/*
 	 * Recheck the range no matter reclaim succeeded or not, the slot
 	 * could have been be freed while we are not holding the lock.
 	 */
 	for (offset = start; offset < end; offset++)
 		if (READ_ONCE(map[offset]))
-			return false;
+			return SWAP_ENTRY_INVALID;
 
-	return true;
+	return start;
 }
 
 static bool cluster_scan_range(struct swap_info_struct *si,
@@ -901,7 +909,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 	unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
 	unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
 	unsigned int nr_pages = 1 << order;
-	bool need_reclaim, ret;
+	bool need_reclaim;
 
 	lockdep_assert_held(&ci->lock);
 
@@ -913,20 +921,11 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 		if (!cluster_scan_range(si, ci, offset, nr_pages, &need_reclaim))
 			continue;
 		if (need_reclaim) {
-			ret = cluster_reclaim_range(si, ci, offset, offset + nr_pages);
-			/*
-			 * Reclaim drops ci->lock and cluster could be used
-			 * by another order. Not checking flag as off-list
-			 * cluster has no flag set, and change of list
-			 * won't cause fragmentation.
-			 */
-			if (!cluster_is_usable(ci, order))
-				goto out;
-			if (cluster_is_empty(ci))
-				offset = start;
+			found = cluster_reclaim_range(si, ci, offset, order);
 			/* Reclaim failed but cluster is usable, try next */
-			if (!ret)
+			if (!found)
 				continue;
+			offset = found;
 		}
 		if (!cluster_alloc_range(si, ci, offset, usage, order))
 			break;

-- 
2.51.1
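
The second adjustment in the commit message (scan every slot instead of jumping ahead by the size of the reclaimed folio) can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel code: `struct model`, `model_reclaim()`, `model_scan()` and `MODEL_HAS_CACHE` are hypothetical stand-ins, and the model assumes that a successful reclaim frees the whole size-aligned folio covering the slot and returns its page count.

```c
#include <stdbool.h>
#include <string.h>

#define MODEL_SLOTS     8
#define MODEL_HAS_CACHE 0x40	/* hypothetical stand-in for SWAP_HAS_CACHE */

/* Hypothetical model: map[] mirrors si->swap_map, folio_pages[i] is the
 * size of the cached folio covering slot i (0 if there is none). */
struct model {
	unsigned char map[MODEL_SLOTS];
	unsigned int folio_pages[MODEL_SLOTS];
};

static void model_add_folio(struct model *m, unsigned int first,
			    unsigned int pages)
{
	for (unsigned int i = first; i < first + pages; i++) {
		m->map[i] = MODEL_HAS_CACHE;
		m->folio_pages[i] = pages;
	}
}

/* Assumption: reclaim frees the whole aligned folio and returns its size. */
static int model_reclaim(struct model *m, unsigned int offset)
{
	unsigned int pages = m->folio_pages[offset];
	unsigned int first;

	if (!pages)
		return -1;
	first = offset - offset % pages;
	for (unsigned int i = first; i < first + pages; i++) {
		m->map[i] = 0;
		m->folio_pages[i] = 0;
	}
	return (int)pages;
}

/* scan_every_slot == false models the old "offset += nr_reclaim" stepping,
 * true models the new "++offset" stepping. */
static void model_scan(struct model *m, unsigned int start, unsigned int end,
		       bool scan_every_slot)
{
	unsigned int offset = start;

	while (offset < end) {
		unsigned int step = 1;

		if (m->map[offset] == MODEL_HAS_CACHE) {
			int nr = model_reclaim(m, offset);

			if (nr < 0)
				return;
			if (!scan_every_slot)
				step = (unsigned int)nr;
		}
		offset += step;
	}
}

/* Folio A covers slots [0, 4), folio B covers slots [4, 6). The scan starts
 * at slot 2, which is not aligned with folio A. Returns the number of slots
 * still cached after the scan. */
unsigned int model_cached_after_scan(bool scan_every_slot)
{
	struct model m;
	unsigned int cached = 0;

	memset(&m, 0, sizeof(m));
	model_add_folio(&m, 0, 4);
	model_add_folio(&m, 4, 2);
	model_scan(&m, 2, MODEL_SLOTS, scan_every_slot);
	for (unsigned int i = 0; i < MODEL_SLOTS; i++)
		if (m.map[i])
			cached++;
	return cached;
}
```

In this model, the old stepping reclaims folio A at slot 2, jumps to slot 6 and leaves folio B cached in slots 4 and 5, while the per-slot stepping reclaims both folios.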

Re: [PATCH 10/19] mm, swap: consolidate cluster reclaim and check logic

Posted by YoungJun Park 3 months, 1 week ago

On Wed, Oct 29, 2025 at 11:58:36PM +0800, Kairui Song wrote:

> From: Kairui Song <kasong@tencent.com>
> 

Hello Kairui, great work on your patch series. :)

> Swap cluster cache reclaim requires releasing the lock, so some extra
> checks are needed after the reclaim. To prepare for checking swap cache
> using the swap table directly, consolidate the swap cluster reclaim and
> check logic.
> 
> Also, adjust it very slightly. By moving the cluster empty and usable
> check into the reclaim helper, it will avoid a redundant scan of the
> slots if the cluster is empty.

This is Change 1

> And always scan the whole region during reclaim, don't skip slots
> covered by a reclaimed folio. Because the reclaim is lockless, it's
> possible that new cache lands at any time. And for allocation, we want
> all caches to be reclaimed to avoid fragmentation. And besides, if the
> scan offset is not aligned with the size of the reclaimed folio, we are
> skipping some existing caches.

This is Change 2

> There should be no observable behavior change, though this might slightly
> improve fragmentation or performance.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swapfile.c | 47 +++++++++++++++++++++++------------------------
>  1 file changed, 23 insertions(+), 24 deletions(-)
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index d66141f1c452..e4c521528817 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -778,42 +778,50 @@ static int swap_cluster_setup_bad_slot(struct swap_cluster_info *cluster_info,
>  	return 0;
>  }
>  
> -static bool cluster_reclaim_range(struct swap_info_struct *si,
> -				  struct swap_cluster_info *ci,
> -				  unsigned long start, unsigned long end)
> +static unsigned int cluster_reclaim_range(struct swap_info_struct *si,
> +					  struct swap_cluster_info *ci,
> +					  unsigned long start, unsigned int order)
>  {
> +	unsigned int nr_pages = 1 << order;
> +	unsigned long offset = start, end = start + nr_pages;
>  	unsigned char *map = si->swap_map;
> -	unsigned long offset = start;
>  	int nr_reclaim;
>  
>  	spin_unlock(&ci->lock);
>  	do {
>  		switch (READ_ONCE(map[offset])) {
>  		case 0:
> -			offset++;
>  			break;
>  		case SWAP_HAS_CACHE:
>  			nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
> -			if (nr_reclaim > 0)
> -				offset += nr_reclaim;
> -			else
> +			if (nr_reclaim < 0)
>  				goto out;
>  			break;
>  		default:
>  			goto out;
>  		}
> -	} while (offset < end);
> +	} while (++offset < end);

Change 2

>  out:
>  	spin_lock(&ci->lock);
> +
> +	/*
> +	 * We just dropped ci->lock so cluster could be used by another
> +	 * order or got freed, check if it's still usable or empty.
> +	 */
> +	if (!cluster_is_usable(ci, order))
> +		return SWAP_ENTRY_INVALID;
> +	if (cluster_is_empty(ci))
> +		return cluster_offset(si, ci);
> +

Change 1

>  	/*
>  	 * Recheck the range no matter reclaim succeeded or not, the slot
>  	 * could have been be freed while we are not holding the lock.
>  	 */
>  	for (offset = start; offset < end; offset++)
>  		if (READ_ONCE(map[offset]))
> -			return false;
> +			return SWAP_ENTRY_INVALID;
>  
> -	return true;
> +	return start;
>  }
>  
>  static bool cluster_scan_range(struct swap_info_struct *si,
> @@ -901,7 +909,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
>  	unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
>  	unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
>  	unsigned int nr_pages = 1 << order;
> -	bool need_reclaim, ret;
> +	bool need_reclaim;
>  
>  	lockdep_assert_held(&ci->lock);
>  
> @@ -913,20 +921,11 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
>  		if (!cluster_scan_range(si, ci, offset, nr_pages, &need_reclaim))
>  			continue;
>  		if (need_reclaim) {
> -			ret = cluster_reclaim_range(si, ci, offset, offset + nr_pages);
> -			/*
> -			 * Reclaim drops ci->lock and cluster could be used
> -			 * by another order. Not checking flag as off-list
> -			 * cluster has no flag set, and change of list
> -			 * won't cause fragmentation.
> -			 */
> -			if (!cluster_is_usable(ci, order))
> -				goto out;
> -			if (cluster_is_empty(ci))
> -				offset = start;
> +			found = cluster_reclaim_range(si, ci, offset, order);
>  			/* Reclaim failed but cluster is usable, try next */
> -			if (!ret)

Part of Change 1 (applying the return value change)

As I understand it, Change 1 just removes a redundant check.
But I think something else changed as well.
(Maybe I don't fully understand the comment or something.)

cluster_reclaim_range can return SWAP_ENTRY_INVALID
if the cluster becomes unusable for the requested order
(the !cluster_is_usable branch returns SWAP_ENTRY_INVALID).
The loop then continues to the next offset and tries to reclaim again.
Is this the intended behavior?

If this is the intended behavior, the comment:
    /* Reclaim failed but cluster is usable, try next */
might be a bit misleading, as the cluster could be unusable in this
failure case. Perhaps it could be updated to reflect this?
Or does something else need to change?
(e.g. renaming the cluster_is_usable function)

Thanks.
Youngjun Park

Re: [PATCH 10/19] mm, swap: consolidate cluster reclaim and check logic

Posted by Kairui Song 3 months, 1 week ago

On Fri, Oct 31, 2025 at 1:25 PM YoungJun Park <youngjun.park@lge.com> wrote:
>
> On Wed, Oct 29, 2025 at 11:58:36PM +0800, Kairui Song wrote:
>
> > From: Kairui Song <kasong@tencent.com>
> >
>
> Hello Kairui, great work on your patch series. :)
> > Swap cluster cache reclaim requires releasing the lock, so some extra
> > checks are needed after the reclaim. To prepare for checking swap cache
> > using the swap table directly, consolidate the swap cluster reclaim and
> > check logic.
> >
> > Also, adjust it very slightly. By moving the cluster empty and usable
> > check into the reclaim helper, it will avoid a redundant scan of the
> > slots if the cluster is empty.
>
> This is Change 1
>
> > And always scan the whole region during reclaim, don't skip slots
> > covered by a reclaimed folio. Because the reclaim is lockless, it's
> > possible that new cache lands at any time. And for allocation, we want
> > all caches to be reclaimed to avoid fragmentation. And besides, if the
> > scan offset is not aligned with the size of the reclaimed folio, we are
> > skipping some existing caches.
>
> This is Change 2
>
> > There should be no observable behavior change, though this might slightly
> > improve fragmentation or performance.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/swapfile.c | 47 +++++++++++++++++++++++------------------------
> >  1 file changed, 23 insertions(+), 24 deletions(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index d66141f1c452..e4c521528817 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -778,42 +778,50 @@ static int swap_cluster_setup_bad_slot(struct swap_cluster_info *cluster_info,
> >       return 0;
> >  }
> >
> > -static bool cluster_reclaim_range(struct swap_info_struct *si,
> > -                               struct swap_cluster_info *ci,
> > -                               unsigned long start, unsigned long end)
> > +static unsigned int cluster_reclaim_range(struct swap_info_struct *si,
> > +                                       struct swap_cluster_info *ci,
> > +                                       unsigned long start, unsigned int order)
> >  {
> > +     unsigned int nr_pages = 1 << order;
> > +     unsigned long offset = start, end = start + nr_pages;
> >       unsigned char *map = si->swap_map;
> > -     unsigned long offset = start;
> >       int nr_reclaim;
> >
> >       spin_unlock(&ci->lock);
> >       do {
> >               switch (READ_ONCE(map[offset])) {
> >               case 0:
> > -                     offset++;
> >                       break;
> >               case SWAP_HAS_CACHE:
> >                       nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
> > -                     if (nr_reclaim > 0)
> > -                             offset += nr_reclaim;
> > -                     else
> > +                     if (nr_reclaim < 0)
> >                               goto out;
> >                       break;
> >               default:
> >                       goto out;
> >               }
> > -     } while (offset < end);
> > +     } while (++offset < end);
>
> Change 2
>
> >  out:
> >       spin_lock(&ci->lock);
> > +
> > +     /*
> > +      * We just dropped ci->lock so cluster could be used by another
> > +      * order or got freed, check if it's still usable or empty.
> > +      */
> > +     if (!cluster_is_usable(ci, order))
> > +             return SWAP_ENTRY_INVALID;
> > +     if (cluster_is_empty(ci))
> > +             return cluster_offset(si, ci);
> > +
>
> Change 1
>
> >       /*
> >        * Recheck the range no matter reclaim succeeded or not, the slot
> >        * could have been be freed while we are not holding the lock.
> >        */
> >       for (offset = start; offset < end; offset++)
> >               if (READ_ONCE(map[offset]))
> > -                     return false;
> > +                     return SWAP_ENTRY_INVALID;
> >
> > -     return true;
> > +     return start;
> >  }
> >
> >  static bool cluster_scan_range(struct swap_info_struct *si,
> > @@ -901,7 +909,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
> >       unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
> >       unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
> >       unsigned int nr_pages = 1 << order;
> > -     bool need_reclaim, ret;
> > +     bool need_reclaim;
> >
> >       lockdep_assert_held(&ci->lock);
> >
> > @@ -913,20 +921,11 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
> >               if (!cluster_scan_range(si, ci, offset, nr_pages, &need_reclaim))
> >                       continue;
> >               if (need_reclaim) {
> > -                     ret = cluster_reclaim_range(si, ci, offset, offset + nr_pages);
> > -                     /*
> > -                      * Reclaim drops ci->lock and cluster could be used
> > -                      * by another order. Not checking flag as off-list
> > -                      * cluster has no flag set, and change of list
> > -                      * won't cause fragmentation.
> > -                      */
> > -                     if (!cluster_is_usable(ci, order))
> > -                             goto out;
> > -                     if (cluster_is_empty(ci))
> > -                             offset = start;
> > +                     found = cluster_reclaim_range(si, ci, offset, order);
> >                       /* Reclaim failed but cluster is usable, try next */
> > -                     if (!ret)
>
> Part of Change 1 (applying the return value change)
>
> As I understand it, Change 1 just removes a redundant check.
> But I think something else changed as well.
> (Maybe I don't fully understand the comment or something.)
>
> cluster_reclaim_range can return SWAP_ENTRY_INVALID
> if the cluster becomes unusable for the requested order
> (the !cluster_is_usable branch returns SWAP_ENTRY_INVALID).
> The loop then continues to the next offset and tries to reclaim again.
> Is this the intended behavior?

Thanks for the very careful review! I should keep the
cluster_is_usable check, or abort in some other way, to avoid touching
an unusable cluster. Will fix it.
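
The concern discussed above can be sketched with a small userspace model. It is not the fix that will be posted, only one possible reading of "keep the cluster_is_usable check": all names here (`struct model_cluster`, `model_reclaim_range()`, `model_scan()`, `MODEL_ENTRY_INVALID`) are hypothetical stand-ins. The model assumes the cluster is taken by someone else while the lock is dropped, so every reclaim attempt fails with the same invalid value and the caller cannot tell "range still busy" from "cluster no longer usable" unless it rechecks.

```c
#include <stdbool.h>

#define MODEL_ENTRY_INVALID 0	/* hypothetical stand-in for SWAP_ENTRY_INVALID */

struct model_cluster {
	bool usable;			/* stand-in for cluster_is_usable(ci, order) */
	unsigned int attempts;		/* reclaim attempts made so far */
	unsigned int unusable_after;	/* attempt during which the cluster is lost */
};

/* Models a reclaim that drops the lock and always fails. From the given
 * attempt onwards the cluster has become unusable, but the return value
 * is the same as for an ordinary "range still busy" failure. */
static unsigned int model_reclaim_range(struct model_cluster *ci)
{
	ci->attempts++;
	if (ci->attempts >= ci->unusable_after)
		ci->usable = false;
	return MODEL_ENTRY_INVALID;
}

/* Models the caller's loop over candidate ranges in a cluster. With
 * recheck == false it keeps trying the next offset after every failure;
 * with recheck == true it aborts once the cluster is no longer usable. */
static unsigned int model_scan(struct model_cluster *ci, unsigned int start,
			       unsigned int end, unsigned int nr_pages,
			       bool recheck)
{
	for (unsigned int offset = start; offset + nr_pages <= end;
	     offset += nr_pages) {
		unsigned int found = model_reclaim_range(ci);

		if (found)
			return found;
		if (recheck && !ci->usable)
			break;
	}
	return MODEL_ENTRY_INVALID;
}

/* Returns how many reclaim attempts are made on an 8-slot cluster with
 * 2-slot ranges when the cluster is lost during the first attempt. */
unsigned int model_attempts(bool recheck)
{
	struct model_cluster ci = {
		.usable = true,
		.attempts = 0,
		.unusable_after = 1,
	};

	model_scan(&ci, 0, 8, 2, recheck);
	return ci.attempts;
}
```

In this model, the loop without the recheck touches the unusable cluster four times, while the loop with the recheck stops after the first failed attempt.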