[PATCH 2/2] mm, swap: prefer nonfull over free clusters

Kairui Song posted 2 patches 2 months ago
Posted by Kairui Song 2 months ago
From: Kairui Song <kasong@tencent.com>

We prefer a free cluster over a nonfull cluster whenever a CPU local
cluster is drained, to respect the SSD discard behavior [1]. This is
not the best practice for non-discarding devices, and it causes a
higher fragmentation rate.

So for a non-discarding device, prefer nonfull over free clusters. This
reduces the fragmentation issue by a lot.

Testing with make -j96, defconfig, using 64k mTHP, 8G ZRAM:

Before: sys time: 6121.0s  64kB/swpout: 1638155  64kB/swpout_fallback: 189562
After:  sys time: 6145.3s  64kB/swpout: 1761110  64kB/swpout_fallback: 66071

Testing with make -j96, defconfig, using 64k mTHP, 10G ZRAM:

Before: sys time 5527.9s  64kB/swpout: 1789358  64kB/swpout_fallback: 17813
After:  sys time 5538.3s  64kB/swpout: 1813133  64kB/swpout_fallback: 0

Performance is basically unchanged, and the large allocation failure rate
is lower. Enabling all mTHP sizes showed a more significant result:

Using the same test setup with 10G ZRAM and enabling all mTHP sizes:

128kB swap failure rate:
Before: swpout:449548 swpout_fallback:55894
After:  swpout:497519 swpout_fallback:3204

256kB swap failure rate:
Before: swpout:63938  swpout_fallback:2154
After:  swpout:65698  swpout_fallback:324

512kB swap failure rate:
Before: swpout:11971  swpout_fallback:2218
After:  swpout:14606  swpout_fallback:4

2M swap failure rate:
Before: swpout:12     swpout_fallback:1578
After:  swpout:1253   swpout_fallback:15

The success rate of large allocations is much higher.
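
To make the "failure rate" comparison explicit, here is a minimal sketch
that turns the counters above into percentages. It assumes every large
swap-out attempt is counted exactly once, either in swpout or in
swpout_fallback; the function and struct names are made up for
illustration and are not part of the patch. Under that assumption the 2M
case goes from roughly 99% fallback to roughly 1%.

```c
#include <stdio.h>

/*
 * Fallback rate in percent: fallbacks / (successful swap-outs + fallbacks).
 * Assumption: each large swap-out attempt shows up in exactly one counter.
 */
double fallback_rate_pct(unsigned long swpout, unsigned long fallback)
{
	unsigned long attempts = swpout + fallback;

	if (attempts == 0)
		return 0.0;
	return 100.0 * (double)fallback / (double)attempts;
}

/* Counters copied from the 10G ZRAM run with all mTHP sizes enabled. */
struct mthp_sample {
	const char *size;
	unsigned long out_before, fb_before;
	unsigned long out_after, fb_after;
};

static const struct mthp_sample samples[] = {
	{ "128kB", 449548, 55894, 497519, 3204 },
	{ "256kB",  63938,  2154,  65698,  324 },
	{ "512kB",  11971,  2218,  14606,    4 },
	{ "2M",        12,  1578,   1253,   15 },
};

/* Print one line per mTHP size with the before/after fallback rate. */
void print_fallback_rates(void)
{
	size_t i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		const struct mthp_sample *s = &samples[i];

		printf("%-6s before: %6.2f%%  after: %6.2f%%\n", s->size,
		       fallback_rate_pct(s->out_before, s->fb_before),
		       fallback_rate_pct(s->out_after, s->fb_after));
	}
}
```

Calling print_fallback_rates() from a small main() prints the table.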

Link: https://lore.kernel.org/linux-mm/87v8242vng.fsf@yhuang6-desk2.ccr.corp.intel.com/ [1]
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swapfile.c | 38 ++++++++++++++++++++++++++++----------
 1 file changed, 28 insertions(+), 10 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5fdb3cb2b8b7..4a0cf4fb348d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -908,18 +908,20 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 	}
 
 new_cluster:
-	ci = isolate_lock_cluster(si, &si->free_clusters);
-	if (ci) {
-		found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
-						order, usage);
-		if (found)
-			goto done;
+	/*
+	 * If the device need discard, prefer new cluster over nonfull
+	 * to spread out the writes.
+	 */
+	if (si->flags & SWP_PAGE_DISCARD) {
+		ci = isolate_lock_cluster(si, &si->free_clusters);
+		if (ci) {
+			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
+							order, usage);
+			if (found)
+				goto done;
+		}
 	}
 
-	/* Try reclaim from full clusters if free clusters list is drained */
-	if (vm_swap_full())
-		swap_reclaim_full_clusters(si, false);
-
 	if (order < PMD_ORDER) {
 		while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) {
 			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
@@ -927,7 +929,23 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 			if (found)
 				goto done;
 		}
+	}
 
+	if (!(si->flags & SWP_PAGE_DISCARD)) {
+		ci = isolate_lock_cluster(si, &si->free_clusters);
+		if (ci) {
+			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
+							order, usage);
+			if (found)
+				goto done;
+		}
+	}
+
+	/* Try reclaim full clusters if free and nonfull lists are drained */
+	if (vm_swap_full())
+		swap_reclaim_full_clusters(si, false);
+
+	if (order < PMD_ORDER) {
 		/*
 		 * Scan only one fragment cluster is good enough. Order 0
 		 * allocation will surely success, and large allocation
-- 
2.50.1
Re: [PATCH 2/2] mm, swap: prefer nonfull over free clusters
Posted by Chris Li 2 months ago
Acked-by: Chris Li <chrisl@kernel.org>

On Mon, Aug 4, 2025 at 10:25 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> We prefer a free cluster over a nonfull cluster whenever a CPU local
> cluster is drained to respect the SSD discard behavior [1]. It's not
> a best practice for non-discarding devices. And this is causing a
> higher fragmentation rate.

It does not only cause a higher fragmentation rate. It also means that
a limited working set, over a long period of continuous swapping, can
end up writing to the whole swap partition. That is bad from the SSD
point of view if the swap page access pattern is random, because with
random access patterns very few clusters ever have all 512 entries
free, which is what it takes to reach the discard. The previous
prefer-new-cluster approach works best with batched, short to medium
running jobs, where at the end of a batch there is a time when most of
the swap working set is released. That can turn nonfull clusters back
into free clusters. For long running jobs with random access to swap
entries, there is a very low chance of freeing a cluster for discard.

With this patch, a limited working set only writes to a limited swap
area, which is a good thing from the SSD wear point of view.

Chris

> So for a non-discarding device, prefer nonfull over free clusters. This
> reduces the fragmentation issue by a lot.
>
> Testing with make -j96, defconfig, using 64k mTHP, 8G ZRAM:
>
> Before: sys time: 6121.0s  64kB/swpout: 1638155  64kB/swpout_fallback: 189562
> After:  sys time: 6145.3s  64kB/swpout: 1761110  64kB/swpout_fallback: 66071
>
> Testing with make -j96, defconfig, using 64k mTHP, 10G ZRAM:
>
> Before: sys time 5527.9s  64kB/swpout: 1789358  64kB/swpout_fallback: 17813
> After:  sys time 5538.3s  64kB/swpout: 1813133  64kB/swpout_fallback: 0
>
> Performance is basically unchanged, and the large allocation failure rate
> is lower. Enabling all mTHP sizes showed a more significant result:
>
> Using the same test setup with 10G ZRAM and enabling all mTHP sizes:
>
> 128kB swap failure rate:
> Before: swpout:449548 swpout_fallback:55894
> After:  swpout:497519 swpout_fallback:3204
>
> 256kB swap failure rate:
> Before: swpout:63938  swpout_fallback:2154
> After:  swpout:65698  swpout_fallback:324
>
> 512kB swap failure rate:
> Before: swpout:11971  swpout_fallback:2218
> After:  swpout:14606  swpout_fallback:4
>
> 2M swap failure rate:
> Before: swpout:12     swpout_fallback:1578
> After:  swpout:1253   swpout_fallback:15
>
> The success rate of large allocations is much higher.
>
> Link: https://lore.kernel.org/linux-mm/87v8242vng.fsf@yhuang6-desk2.ccr.corp.intel.com/ [1]
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swapfile.c | 38 ++++++++++++++++++++++++++++----------
>  1 file changed, 28 insertions(+), 10 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 5fdb3cb2b8b7..4a0cf4fb348d 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -908,18 +908,20 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>         }
>
>  new_cluster:
> -       ci = isolate_lock_cluster(si, &si->free_clusters);
> -       if (ci) {
> -               found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> -                                               order, usage);
> -               if (found)
> -                       goto done;
> +       /*
> +        * If the device need discard, prefer new cluster over nonfull
> +        * to spread out the writes.
> +        */
> +       if (si->flags & SWP_PAGE_DISCARD) {
> +               ci = isolate_lock_cluster(si, &si->free_clusters);
> +               if (ci) {
> +                       found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> +                                                       order, usage);
> +                       if (found)
> +                               goto done;
> +               }
>         }
>
> -       /* Try reclaim from full clusters if free clusters list is drained */
> -       if (vm_swap_full())
> -               swap_reclaim_full_clusters(si, false);
> -
>         if (order < PMD_ORDER) {
>                 while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) {
>                         found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> @@ -927,7 +929,23 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>                         if (found)
>                                 goto done;
>                 }
> +       }
>
> +       if (!(si->flags & SWP_PAGE_DISCARD)) {
> +               ci = isolate_lock_cluster(si, &si->free_clusters);
> +               if (ci) {
> +                       found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> +                                                       order, usage);
> +                       if (found)
> +                               goto done;
> +               }
> +       }
> +
> +       /* Try reclaim full clusters if free and nonfull lists are drained */
> +       if (vm_swap_full())
> +               swap_reclaim_full_clusters(si, false);
> +
> +       if (order < PMD_ORDER) {
>                 /*
>                  * Scan only one fragment cluster is good enough. Order 0
>                  * allocation will surely success, and large allocation
> --
> 2.50.1
>
>
Re: [PATCH 2/2] mm, swap: prefer nonfull over free clusters
Posted by Nhat Pham 2 months ago
On Mon, Aug 4, 2025 at 10:24 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> We prefer a free cluster over a nonfull cluster whenever a CPU local
> cluster is drained to respect the SSD discard behavior [1]. It's not
> a best practice for non-discarding devices. And this is causing a
> higher fragmentation rate.
>
> So for a non-discarding device, prefer nonfull over free clusters. This
> reduces the fragmentation issue by a lot.
>
> Testing with make -j96, defconfig, using 64k mTHP, 8G ZRAM:
>
> Before: sys time: 6121.0s  64kB/swpout: 1638155  64kB/swpout_fallback: 189562
> After:  sys time: 6145.3s  64kB/swpout: 1761110  64kB/swpout_fallback: 66071
>
> Testing with make -j96, defconfig, using 64k mTHP, 10G ZRAM:
>
> Before: sys time 5527.9s  64kB/swpout: 1789358  64kB/swpout_fallback: 17813
> After:  sys time 5538.3s  64kB/swpout: 1813133  64kB/swpout_fallback: 0
>
> Performance is basically unchanged, and the large allocation failure rate
> is lower. Enabling all mTHP sizes showed a more significant result:
>
> Using the same test setup with 10G ZRAM and enabling all mTHP sizes:
>
> 128kB swap failure rate:
> Before: swpout:449548 swpout_fallback:55894
> After:  swpout:497519 swpout_fallback:3204
>
> 256kB swap failure rate:
> Before: swpout:63938  swpout_fallback:2154
> After:  swpout:65698  swpout_fallback:324
>
> 512kB swap failure rate:
> Before: swpout:11971  swpout_fallback:2218
> After:  swpout:14606  swpout_fallback:4
>
> 2M swap failure rate:
> Before: swpout:12     swpout_fallback:1578
> After:  swpout:1253   swpout_fallback:15
>
> The success rate of large allocations is much higher.
>
> Link: https://lore.kernel.org/linux-mm/87v8242vng.fsf@yhuang6-desk2.ccr.corp.intel.com/ [1]
> Signed-off-by: Kairui Song <kasong@tencent.com>

Nice! I agree with Chris' analysis too. It's less of a problem for
vswap (because there's no physical/SSD implication over there), but
this patch makes sense in the context of swapfile allocator.

FWIW:
Reviewed-by: Nhat Pham <nphamcs@gmail.com>

> ---
>  mm/swapfile.c | 38 ++++++++++++++++++++++++++++----------
>  1 file changed, 28 insertions(+), 10 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 5fdb3cb2b8b7..4a0cf4fb348d 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -908,18 +908,20 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>         }
>
>  new_cluster:
> -       ci = isolate_lock_cluster(si, &si->free_clusters);
> -       if (ci) {
> -               found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> -                                               order, usage);
> -               if (found)
> -                       goto done;
> +       /*
> +        * If the device need discard, prefer new cluster over nonfull
> +        * to spread out the writes.
> +        */
> +       if (si->flags & SWP_PAGE_DISCARD) {
> +               ci = isolate_lock_cluster(si, &si->free_clusters);
> +               if (ci) {
> +                       found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> +                                                       order, usage);
> +                       if (found)
> +                               goto done;
> +               }
>         }
>
> -       /* Try reclaim from full clusters if free clusters list is drained */
> -       if (vm_swap_full())
> -               swap_reclaim_full_clusters(si, false);
> -
>         if (order < PMD_ORDER) {
>                 while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) {
>                         found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> @@ -927,7 +929,23 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>                         if (found)
>                                 goto done;
>                 }
> +       }
>
> +       if (!(si->flags & SWP_PAGE_DISCARD)) {
> +               ci = isolate_lock_cluster(si, &si->free_clusters);
> +               if (ci) {
> +                       found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> +                                                       order, usage);
> +                       if (found)
> +                               goto done;
> +               }
> +       }

Seems like this pattern is repeated in a couple of places:
isolate_lock_cluster from one of the lists and, if successful, try to
allocate (alloc_swap_scan_cluster) from it.

Might be refactorable in a future cleanup patch.
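
As a rough illustration of what such a cleanup could look like, here is
a user-space toy model, not kernel code. All toy_* names and types are
hypothetical stand-ins for isolate_lock_cluster(),
alloc_swap_scan_cluster() and the cluster lists; locking, orders and the
real list handling are left out. It only shows the isolate-then-scan
pattern folded into one helper, used in the list order this patch
introduces.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; not the real mm/swapfile.c types. */
struct toy_cluster {
	struct toy_cluster *next;
	unsigned int free_slots;
};

struct toy_list {
	struct toy_cluster *head;
};

struct toy_dev {
	struct toy_list free;     /* completely free clusters */
	struct toy_list nonfull;  /* partially used clusters */
	struct toy_list used;     /* where scanned clusters are put back */
	bool discard;             /* models SWP_PAGE_DISCARD being set */
};

/* Hypothetical analogue of isolate_lock_cluster(): detach the first cluster. */
static struct toy_cluster *toy_isolate(struct toy_list *list)
{
	struct toy_cluster *ci = list->head;

	if (ci) {
		list->head = ci->next;
		ci->next = NULL;
	}
	return ci;
}

/* Put an isolated cluster back on a list so it never dangles. */
static void toy_relocate(struct toy_list *list, struct toy_cluster *ci)
{
	ci->next = list->head;
	list->head = ci;
}

/* Hypothetical analogue of alloc_swap_scan_cluster(): take nr slots if they fit. */
static bool toy_scan(struct toy_cluster *ci, unsigned int nr)
{
	if (ci->free_slots < nr)
		return false;
	ci->free_slots -= nr;
	return true;
}

/*
 * The repeated pattern in one place: isolate from @from, try to allocate,
 * and always put the cluster back on @back (which must differ from @from).
 * With @scan_all, keep going until @from is empty.
 */
static bool toy_alloc_from_list(struct toy_list *from, struct toy_list *back,
				unsigned int nr, bool scan_all)
{
	struct toy_cluster *ci;

	while ((ci = toy_isolate(from))) {
		bool found = toy_scan(ci, nr);

		toy_relocate(back, ci);
		if (found)
			return true;
		if (!scan_all)
			break;
	}
	return false;
}

/* List order as in this patch: free first only when discard is enabled. */
bool toy_alloc(struct toy_dev *dev, unsigned int nr)
{
	if (dev->discard &&
	    toy_alloc_from_list(&dev->free, &dev->used, nr, false))
		return true;
	if (toy_alloc_from_list(&dev->nonfull, &dev->used, nr, true))
		return true;
	if (!dev->discard &&
	    toy_alloc_from_list(&dev->free, &dev->used, nr, false))
		return true;
	return false;
}
```

With one free and one nonfull cluster available, toy_alloc() takes slots
from the nonfull cluster on a non-discard device and from the free
cluster on a discard device. Keeping isolate and relocate inside the
same helper is a design choice of this sketch, so that no caller can
forget to put a cluster back.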
Re: [PATCH 2/2] mm, swap: prefer nonfull over free clusters
Posted by Chris Li 2 months ago
On Tue, Aug 5, 2025 at 5:03 PM Nhat Pham <nphamcs@gmail.com> wrote:
>

> >
> > +       if (!(si->flags & SWP_PAGE_DISCARD)) {
> > +               ci = isolate_lock_cluster(si, &si->free_clusters);
> > +               if (ci) {
> > +                       found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> > +                                                       order, usage);
> > +                       if (found)
> > +                               goto done;
> > +               }
> > +       }
>
> Seems like this pattern is repeated in a couple of places:
> isolate_lock_cluster from one of the lists and, if successful, try to
> allocate (alloc_swap_scan_cluster) from it.
>
> Might be refactorable in a future cleanup patch.
>
Yes, agreed. I noticed that as well. Incidentally, I was writing an RFC
patch to clean it up when I saw your email come in. Another reason to
clean it up is that isolate_lock_cluster() must be paired with
relocate_cluster(); otherwise we end up with a dangling cluster that is
not on any list. We'd better pair the isolate() and relocate() in the
same function for better visibility.

Chris
Re: [PATCH 2/2] mm, swap: prefer nonfull over free clusters
Posted by Kairui Song 2 months ago
On Wed, Aug 6, 2025 at 8:06 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, Aug 4, 2025 at 10:24 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > We prefer a free cluster over a nonfull cluster whenever a CPU local
> > cluster is drained to respect the SSD discard behavior [1]. It's not
> > a best practice for non-discarding devices. And this is causing a
> > higher fragmentation rate.
> >
> > So for a non-discarding device, prefer nonfull over free clusters. This
> > reduces the fragmentation issue by a lot.
> >
> > Testing with make -j96, defconfig, using 64k mTHP, 8G ZRAM:
> >
> > Before: sys time: 6121.0s  64kB/swpout: 1638155  64kB/swpout_fallback: 189562
> > After:  sys time: 6145.3s  64kB/swpout: 1761110  64kB/swpout_fallback: 66071
> >
> > Testing with make -j96, defconfig, using 64k mTHP, 10G ZRAM:
> >
> > Before: sys time 5527.9s  64kB/swpout: 1789358  64kB/swpout_fallback: 17813
> > After:  sys time 5538.3s  64kB/swpout: 1813133  64kB/swpout_fallback: 0
> >
> > Performance is basically unchanged, and the large allocation failure rate
> > is lower. Enabling all mTHP sizes showed a more significant result:
> >
> > Using the same test setup with 10G ZRAM and enabling all mTHP sizes:
> >
> > 128kB swap failure rate:
> > Before: swpout:449548 swpout_fallback:55894
> > After:  swpout:497519 swpout_fallback:3204
> >
> > 256kB swap failure rate:
> > Before: swpout:63938  swpout_fallback:2154
> > After:  swpout:65698  swpout_fallback:324
> >
> > 512kB swap failure rate:
> > Before: swpout:11971  swpout_fallback:2218
> > After:  swpout:14606  swpout_fallback:4
> >
> > 2M swap failure rate:
> > Before: swpout:12     swpout_fallback:1578
> > After:  swpout:1253   swpout_fallback:15
> >
> > The success rate of large allocations is much higher.
> >
> > Link: https://lore.kernel.org/linux-mm/87v8242vng.fsf@yhuang6-desk2.ccr.corp.intel.com/ [1]
> > Signed-off-by: Kairui Song <kasong@tencent.com>
>
> Nice! I agree with Chris' analysis too. It's less of a problem for
> vswap (because there's no physical/SSD implication over there), but
> this patch makes sense in the context of swapfile allocator.
>
> FWIW:
> Reviewed-by: Nhat Pham <nphamcs@gmail.com>

Thanks!

>
> > ---
> >  mm/swapfile.c | 38 ++++++++++++++++++++++++++++----------
> >  1 file changed, 28 insertions(+), 10 deletions(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 5fdb3cb2b8b7..4a0cf4fb348d 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -908,18 +908,20 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> >         }
> >
> >  new_cluster:
> > -       ci = isolate_lock_cluster(si, &si->free_clusters);
> > -       if (ci) {
> > -               found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> > -                                               order, usage);
> > -               if (found)
> > -                       goto done;
> > +       /*
> > +        * If the device need discard, prefer new cluster over nonfull
> > +        * to spread out the writes.
> > +        */
> > +       if (si->flags & SWP_PAGE_DISCARD) {
> > +               ci = isolate_lock_cluster(si, &si->free_clusters);
> > +               if (ci) {
> > +                       found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> > +                                                       order, usage);
> > +                       if (found)
> > +                               goto done;
> > +               }
> >         }
> >
> > -       /* Try reclaim from full clusters if free clusters list is drained */
> > -       if (vm_swap_full())
> > -               swap_reclaim_full_clusters(si, false);
> > -
> >         if (order < PMD_ORDER) {
> >                 while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) {
> >                         found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> > @@ -927,7 +929,23 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> >                         if (found)
> >                                 goto done;
> >                 }
> > +       }
> >
> > +       if (!(si->flags & SWP_PAGE_DISCARD)) {
> > +               ci = isolate_lock_cluster(si, &si->free_clusters);
> > +               if (ci) {
> > +                       found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> > +                                                       order, usage);
> > +                       if (found)
> > +                               goto done;
> > +               }
> > +       }
>
> Seems like this pattern is repeated in a couple of places:
> isolate_lock_cluster from one of the lists and, if successful, try to
> allocate (alloc_swap_scan_cluster) from it.

Indeed, I've been thinking about it but there are some other issues
that need to be cleaned up before this one.