[PATCH v4 10/19] mm, swap: consolidate cluster reclaim and usability check

Posted by Kairui Song 1 week, 4 days ago
From: Kairui Song <kasong@tencent.com>

Swap cluster cache reclaim requires releasing the lock, so the cluster
may become unusable after the reclaim. To prepare for checking swap
cache using the swap table directly, consolidate the swap cluster
reclaim and the check logic.

We will want to avoid touching the cluster's data completely with the
swap table, to avoid RCU overhead here. And by moving the cluster usable
check into the reclaim helper, it will also help avoid a redundant scan of
the slots if the cluster is no longer usable, and we will want to avoid
touching the cluster.

Also, adjust it very slightly while at it: always scan the whole region
during reclaim, don't skip slots covered by a reclaimed folio. Because
the reclaim is lockless, it's possible that new cache lands at any time.
And for allocation, we want all caches to be reclaimed to avoid
fragmentation. Besides, if the scan offset is not aligned with the size
of the reclaimed folio, we might skip some existing cache and fail the
reclaim unexpectedly.

There should be no observable behavior change. It might slightly improve
the fragmentation issue or performance.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swapfile.c | 45 +++++++++++++++++++++++++++++----------------
 1 file changed, 29 insertions(+), 16 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5a766d4fcaa5..2703dfafc632 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -777,33 +777,51 @@ static int swap_cluster_setup_bad_slot(struct swap_cluster_info *cluster_info,
 	return 0;
 }
 
+/*
+ * Reclaim drops the ci lock, so the cluster may become unusable (freed or
+ * stolen by a lower order). @usable will be set to false if that happens.
+ */
 static bool cluster_reclaim_range(struct swap_info_struct *si,
 				  struct swap_cluster_info *ci,
-				  unsigned long start, unsigned long end)
+				  unsigned long start, unsigned int order,
+				  bool *usable)
 {
+	unsigned int nr_pages = 1 << order;
+	unsigned long offset = start, end = start + nr_pages;
 	unsigned char *map = si->swap_map;
-	unsigned long offset = start;
 	int nr_reclaim;
 
 	spin_unlock(&ci->lock);
 	do {
 		switch (READ_ONCE(map[offset])) {
 		case 0:
-			offset++;
 			break;
 		case SWAP_HAS_CACHE:
 			nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
-			if (nr_reclaim > 0)
-				offset += nr_reclaim;
-			else
+			if (nr_reclaim < 0)
 				goto out;
 			break;
 		default:
 			goto out;
 		}
-	} while (offset < end);
+	} while (++offset < end);
 out:
 	spin_lock(&ci->lock);
+
+	/*
+	 * We just dropped ci->lock so cluster could be used by another
+	 * order or got freed, check if it's still usable or empty.
+	 */
+	if (!cluster_is_usable(ci, order)) {
+		*usable = false;
+		return false;
+	}
+	*usable = true;
+
+	/* Fast path, no need to scan if the whole cluster is empty */
+	if (cluster_is_empty(ci))
+		return true;
+
 	/*
 	 * Recheck the range no matter reclaim succeeded or not, the slot
 	 * could have been be freed while we are not holding the lock.
@@ -900,9 +918,10 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 	unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
 	unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
 	unsigned int nr_pages = 1 << order;
-	bool need_reclaim, ret;
+	bool need_reclaim, ret, usable;
 
 	lockdep_assert_held(&ci->lock);
+	VM_WARN_ON(!cluster_is_usable(ci, order));
 
 	if (end < nr_pages || ci->count + nr_pages > SWAPFILE_CLUSTER)
 		goto out;
@@ -912,14 +931,8 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 		if (!cluster_scan_range(si, ci, offset, nr_pages, &need_reclaim))
 			continue;
 		if (need_reclaim) {
-			ret = cluster_reclaim_range(si, ci, offset, offset + nr_pages);
-			/*
-			 * Reclaim drops ci->lock and cluster could be used
-			 * by another order. Not checking flag as off-list
-			 * cluster has no flag set, and change of list
-			 * won't cause fragmentation.
-			 */
-			if (!cluster_is_usable(ci, order))
+			ret = cluster_reclaim_range(si, ci, offset, order, &usable);
+			if (!usable)
 				goto out;
 			if (cluster_is_empty(ci))
 				offset = start;

-- 
2.52.0
Re: [PATCH v4 10/19] mm, swap: consolidate cluster reclaim and usability check
Posted by Baoquan He 1 day, 1 hour ago
On 12/05/25 at 03:29am, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> Swap cluster cache reclaim requires releasing the lock, so the cluster
> may become unusable after the reclaim. To prepare for checking swap
> cache using the swap table directly, consolidate the swap cluster
> reclaim and the check logic.
> 
> We will want to avoid touching the cluster's data completely with the
     ~~~~~~~~
    'want to' means 'will'?

> swap table, to avoid RCU overhead here. And by moving the cluster usable
> check into the reclaim helper, it will also help avoid a redundant scan of
> the slots if the cluster is no longer usable, and we will want to avoid
                                                       ~~~~~~~~~~~~
                                                        this place too.
> touching the cluster.
> 
> Also, adjust it very slightly while at it: always scan the whole region
> during reclaim, don't skip slots covered by a reclaimed folio. Because
> the reclaim is lockless, it's possible that new cache lands at any time.
> And for allocation, we want all caches to be reclaimed to avoid
> fragmentation. Besides, if the scan offset is not aligned with the size
> of the reclaimed folio, we might skip some existing cache and fail the
> reclaim unexpectedly.
> 
> There should be no observable behavior change. It might slightly improve
> the fragmentation issue or performance.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swapfile.c | 45 +++++++++++++++++++++++++++++----------------
>  1 file changed, 29 insertions(+), 16 deletions(-)
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 5a766d4fcaa5..2703dfafc632 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -777,33 +777,51 @@ static int swap_cluster_setup_bad_slot(struct swap_cluster_info *cluster_info,
>  	return 0;
>  }
>  
> +/*
> + * Reclaim drops the ci lock, so the cluster may become unusable (freed or
> + * stolen by a lower order). @usable will be set to false if that happens.
> + */
>  static bool cluster_reclaim_range(struct swap_info_struct *si,
>  				  struct swap_cluster_info *ci,
> -				  unsigned long start, unsigned long end)
> +				  unsigned long start, unsigned int order,
> +				  bool *usable)
>  {
> +	unsigned int nr_pages = 1 << order;
> +	unsigned long offset = start, end = start + nr_pages;
>  	unsigned char *map = si->swap_map;
> -	unsigned long offset = start;
>  	int nr_reclaim;
>  
>  	spin_unlock(&ci->lock);
>  	do {
>  		switch (READ_ONCE(map[offset])) {
>  		case 0:
> -			offset++;
>  			break;
>  		case SWAP_HAS_CACHE:
>  			nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
> -			if (nr_reclaim > 0)
> -				offset += nr_reclaim;
> -			else
> +			if (nr_reclaim < 0)
>  				goto out;
>  			break;
>  		default:
>  			goto out;
>  		}
> -	} while (offset < end);
> +	} while (++offset < end);
                 ~~~~~ '++offset' conflicts with the nr_reclaim
                 returned from __try_to_reclaim_swap(). Can you explain?
>  out:
>  	spin_lock(&ci->lock);
> +
> +	/*
> +	 * We just dropped ci->lock so cluster could be used by another
> +	 * order or got freed, check if it's still usable or empty.
> +	 */
> +	if (!cluster_is_usable(ci, order)) {
> +		*usable = false;
> +		return false;
> +	}
> +	*usable = true;
> +
> +	/* Fast path, no need to scan if the whole cluster is empty */
> +	if (cluster_is_empty(ci))
> +		return true;
> +
>  	/*
>  	 * Recheck the range no matter reclaim succeeded or not, the slot
>  	 * could have been be freed while we are not holding the lock.
> @@ -900,9 +918,10 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
>  	unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
>  	unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
>  	unsigned int nr_pages = 1 << order;
> -	bool need_reclaim, ret;
> +	bool need_reclaim, ret, usable;
>  
>  	lockdep_assert_held(&ci->lock);
> +	VM_WARN_ON(!cluster_is_usable(ci, order));
>  
>  	if (end < nr_pages || ci->count + nr_pages > SWAPFILE_CLUSTER)
>  		goto out;
> @@ -912,14 +931,8 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
>  		if (!cluster_scan_range(si, ci, offset, nr_pages, &need_reclaim))
>  			continue;
>  		if (need_reclaim) {
> -			ret = cluster_reclaim_range(si, ci, offset, offset + nr_pages);
> -			/*
> -			 * Reclaim drops ci->lock and cluster could be used
> -			 * by another order. Not checking flag as off-list
> -			 * cluster has no flag set, and change of list
> -			 * won't cause fragmentation.
> -			 */
> -			if (!cluster_is_usable(ci, order))
> +			ret = cluster_reclaim_range(si, ci, offset, order, &usable);
> +			if (!usable)
>  				goto out;
>  			if (cluster_is_empty(ci))
>  				offset = start;
> 
> -- 
> 2.52.0
>
Re: [PATCH v4 10/19] mm, swap: consolidate cluster reclaim and usability check
Posted by Kairui Song 1 day ago
On Mon, Dec 15, 2025 at 12:13 PM Baoquan He <bhe@redhat.com> wrote:
>
> On 12/05/25 at 03:29am, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Swap cluster cache reclaim requires releasing the lock, so the cluster
> > may become unusable after the reclaim. To prepare for checking swap
> > cache using the swap table directly, consolidate the swap cluster
> > reclaim and the check logic.
> >
> > We will want to avoid touching the cluster's data completely with the
>      ~~~~~~~~
>     'want to' means 'will'?

Sorry about my English. What I mean is that in the following commit we
need to avoid accessing the cluster's table (ci->table) when the
cluster is empty, so the reclaim helper needs to check the cluster
status before accessing it.
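
Roughly, the ordering after this patch looks like this (an annotated
sketch of the tail of cluster_reclaim_range(); the comments are mine,
not from the patch):

	spin_lock(&ci->lock);
	/* The lock was dropped: cluster may be freed or stolen by a lower order */
	if (!cluster_is_usable(ci, order)) {
		*usable = false;
		return false;
	}
	*usable = true;
	/* Empty cluster: succeed without touching the slots (later: ci->table) */
	if (cluster_is_empty(ci))
		return true;
	/* Only a usable, non-empty cluster gets its range rescanned */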

>
> > swap table, to avoid RCU overhead here. And by moving the cluster usable
> > check into the reclaim helper, it will also help avoid a redundant scan of
> > the slots if the cluster is no longer usable, and we will want to avoid
>                                                        ~~~~~~~~~~~~
>                                                         this place too.
> > touching the cluster.
> >
> > Also, adjust it very slightly while at it: always scan the whole region
> > during reclaim, don't skip slots covered by a reclaimed folio. Because
> > the reclaim is lockless, it's possible that new cache lands at any time.
> > And for allocation, we want all caches to be reclaimed to avoid
> > fragmentation. Besides, if the scan offset is not aligned with the size
> > of the reclaimed folio, we might skip some existing cache and fail the
> > reclaim unexpectedly.
> >
> > There should be no observable behavior change. It might slightly improve
> > the fragmentation issue or performance.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/swapfile.c | 45 +++++++++++++++++++++++++++++----------------
> >  1 file changed, 29 insertions(+), 16 deletions(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 5a766d4fcaa5..2703dfafc632 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -777,33 +777,51 @@ static int swap_cluster_setup_bad_slot(struct swap_cluster_info *cluster_info,
> >       return 0;
> >  }
> >
> > +/*
> > + * Reclaim drops the ci lock, so the cluster may become unusable (freed or
> > + * stolen by a lower order). @usable will be set to false if that happens.
> > + */
> >  static bool cluster_reclaim_range(struct swap_info_struct *si,
> >                                 struct swap_cluster_info *ci,
> > -                               unsigned long start, unsigned long end)
> > +                               unsigned long start, unsigned int order,
> > +                               bool *usable)
> >  {
> > +     unsigned int nr_pages = 1 << order;
> > +     unsigned long offset = start, end = start + nr_pages;
> >       unsigned char *map = si->swap_map;
> > -     unsigned long offset = start;
> >       int nr_reclaim;
> >
> >       spin_unlock(&ci->lock);
> >       do {
> >               switch (READ_ONCE(map[offset])) {
> >               case 0:
> > -                     offset++;
> >                       break;
> >               case SWAP_HAS_CACHE:
> >                       nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
> > -                     if (nr_reclaim > 0)
> > -                             offset += nr_reclaim;
> > -                     else
> > +                     if (nr_reclaim < 0)
> >                               goto out;
> >                       break;
> >               default:
> >                       goto out;
> >               }
> > -     } while (offset < end);
> > +     } while (++offset < end);
>                  ~~~~~ '++offset' conflicts with the nr_reclaim
>                  returned from __try_to_reclaim_swap(). Can you explain?

What do you mean by conflicting? If nr_reclaim < 0, reclaim failed and
this loop ends. If nr_reclaim == 0, the slot was likely freed
concurrently, so the loop should just continue to iterate & reclaim to
ensure all slots are freed. If nr_reclaim > 0, the reclaim just freed a
folio of nr_reclaim pages. We could advance by nr_reclaim to skip the
slots that were occupied by that folio, but note that we are not
holding the ci lock here, so new folios could land in that range. Just
keeping the iteration going still seems a good option: it makes the
code simpler, and in practice maybe faster, as fewer branches and
calculations are involved.

I mentioned `always scan the whole region during reclaim, don't skip
slots covered by a reclaimed folio` in the commit message; I can add a
few more comments too.
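
To spell that out against the new loop (code as in the patch, with my
per-case comments added):

	do {
		switch (READ_ONCE(map[offset])) {
		case 0:
			/* Free slot (possibly freed concurrently), move on */
			break;
		case SWAP_HAS_CACHE:
			nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
			/*
			 * nr_reclaim < 0: reclaim failed, give up on this range.
			 * nr_reclaim == 0: the slot was freed under us, keep going.
			 * nr_reclaim > 0: a folio of nr_reclaim pages was freed.
			 * We could advance by nr_reclaim here, but the reclaim is
			 * lockless and new cache may have landed in the skipped
			 * slots already, so just keep scanning slot by slot.
			 */
			if (nr_reclaim < 0)
				goto out;
			break;
		default:
			/* Slot has a real swap count, the range can't be reclaimed */
			goto out;
		}
	} while (++offset < end);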

> >  out:
> >       spin_lock(&ci->lock);
> > +
> > +     /*
> > +      * We just dropped ci->lock so cluster could be used by another
> > +      * order or got freed, check if it's still usable or empty.
> > +      */
> > +     if (!cluster_is_usable(ci, order)) {
> > +             *usable = false;
> > +             return false;
> > +     }
> > +     *usable = true;
> > +
> > +     /* Fast path, no need to scan if the whole cluster is empty */
> > +     if (cluster_is_empty(ci))
> > +             return true;
> > +
> >       /*
> >        * Recheck the range no matter reclaim succeeded or not, the slot
> >        * could have been be freed while we are not holding the lock.
> > @@ -900,9 +918,10 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
> >       unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
> >       unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
> >       unsigned int nr_pages = 1 << order;
> > -     bool need_reclaim, ret;
> > +     bool need_reclaim, ret, usable;
> >
> >       lockdep_assert_held(&ci->lock);
> > +     VM_WARN_ON(!cluster_is_usable(ci, order));
> >
> >       if (end < nr_pages || ci->count + nr_pages > SWAPFILE_CLUSTER)
> >               goto out;
> > @@ -912,14 +931,8 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
> >               if (!cluster_scan_range(si, ci, offset, nr_pages, &need_reclaim))
> >                       continue;
> >               if (need_reclaim) {
> > -                     ret = cluster_reclaim_range(si, ci, offset, offset + nr_pages);
> > -                     /*
> > -                      * Reclaim drops ci->lock and cluster could be used
> > -                      * by another order. Not checking flag as off-list
> > -                      * cluster has no flag set, and change of list
> > -                      * won't cause fragmentation.
> > -                      */
> > -                     if (!cluster_is_usable(ci, order))
> > +                     ret = cluster_reclaim_range(si, ci, offset, order, &usable);
> > +                     if (!usable)
> >                               goto out;
> >                       if (cluster_is_empty(ci))
> >                               offset = start;
> >
> > --
> > 2.52.0
> >
>