From: Kairui Song <kasong@tencent.com>
We prefer a free cluster over a nonfull cluster whenever a CPU local
cluster is drained to respect the SSD discard behavior [1]. It's not
a best practice for non-discarding devices. And this is causing a
higher fragmentation rate.
So for a non-discarding device, prefer nonfull over free clusters. This
reduces the fragmentation issue by a lot.
Testing with make -j96, defconfig, using 64k mTHP, 8G ZRAM:
Before: sys time: 6121.0s 64kB/swpout: 1638155 64kB/swpout_fallback: 189562
After: sys time: 6145.3s 64kB/swpout: 1761110 64kB/swpout_fallback: 66071
Testing with make -j96, defconfig, using 64k mTHP, 10G ZRAM:
Before: sys time 5527.9s 64kB/swpout: 1789358 64kB/swpout_fallback: 17813
After: sys time 5538.3s 64kB/swpout: 1813133 64kB/swpout_fallback: 0
Performance is basically unchanged, and the large allocation failure rate
is lower. Enabling all mTHP sizes showed a more significant result:
Using the same test setup with 10G ZRAM and enabling all mTHP sizes:
128kB swap failure rate:
Before: swpout:449548 swpout_fallback:55894
After: swpout:497519 swpout_fallback:3204
256kB swap failure rate:
Before: swpout:63938 swpout_fallback:2154
After: swpout:65698 swpout_fallback:324
512kB swap failure rate:
Before: swpout:11971 swpout_fallback:2218
After: swpout:14606 swpout_fallback:4
2M swap failure rate:
Before: swpout:12 swpout_fallback:1578
After: swpout:1253 swpout_fallback:15
The success rate of large allocations is much higher.
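For readers comparing the counters above, here is a minimal sketch of how a fallback rate could be derived from each swpout/swpout_fallback pair. It assumes every swap-out attempt of a given size is counted in exactly one of the two counters; that accounting and the helper name fallback_rate() are assumptions for illustration, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Share of swap-out attempts of one mTHP size that fell back, in [0, 1].
 * Assumption: attempts = swpout + swpout_fallback.
 */
static double fallback_rate(uint64_t swpout, uint64_t swpout_fallback)
{
	uint64_t attempts = swpout + swpout_fallback;

	/* Guard against wrap-around of the unsigned sum. */
	assert(attempts >= swpout);

	if (attempts == 0)
		return 0.0;
	return (double)swpout_fallback / (double)attempts;
}
```

Under that assumption, the 2M case goes from roughly 99% fallback (1578 of 1590 attempts) to roughly 1% (15 of 1268 attempts).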
Link: https://lore.kernel.org/linux-mm/87v8242vng.fsf@yhuang6-desk2.ccr.corp.intel.com/ [1]
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swapfile.c | 38 ++++++++++++++++++++++++++++----------
1 file changed, 28 insertions(+), 10 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5fdb3cb2b8b7..4a0cf4fb348d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -908,18 +908,20 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 	}
 
 new_cluster:
-	ci = isolate_lock_cluster(si, &si->free_clusters);
-	if (ci) {
-		found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
-						order, usage);
-		if (found)
-			goto done;
+	/*
+	 * If the device need discard, prefer new cluster over nonfull
+	 * to spread out the writes.
+	 */
+	if (si->flags & SWP_PAGE_DISCARD) {
+		ci = isolate_lock_cluster(si, &si->free_clusters);
+		if (ci) {
+			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
+							order, usage);
+			if (found)
+				goto done;
+		}
 	}
 
-	/* Try reclaim from full clusters if free clusters list is drained */
-	if (vm_swap_full())
-		swap_reclaim_full_clusters(si, false);
-
 	if (order < PMD_ORDER) {
 		while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) {
 			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
@@ -927,7 +929,23 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 			if (found)
 				goto done;
 		}
+	}
 
+	if (!(si->flags & SWP_PAGE_DISCARD)) {
+		ci = isolate_lock_cluster(si, &si->free_clusters);
+		if (ci) {
+			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
+							order, usage);
+			if (found)
+				goto done;
+		}
+	}
+
+	/* Try reclaim full clusters if free and nonfull lists are drained */
+	if (vm_swap_full())
+		swap_reclaim_full_clusters(si, false);
+
+	if (order < PMD_ORDER) {
 		/*
 		 * Scan only one fragment cluster is good enough. Order 0
 		 * allocation will surely success, and large allocation
--
2.50.1
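The resulting order of attempts after the new_cluster label can be summarized with a small user-space model. This is only a sketch read off the two hunks above: the enum, the toy_ names, and the boolean parameters are hypothetical stand-ins (for the SWP_PAGE_DISCARD test, the order < PMD_ORDER test, and vm_swap_full()), and nothing here is kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the places the allocator looks; not kernel names. */
enum toy_source {
	TOY_FREE,	/* models si->free_clusters */
	TOY_NONFULL,	/* models si->nonfull_clusters[order] */
	TOY_RECLAIM,	/* models swap_reclaim_full_clusters() */
	TOY_FRAG,	/* models the fragment cluster scan that follows */
};

#define TOY_MAX_STEPS 4

/*
 * Write the sequence of attempts into @steps and return its length.
 * @discard:       device has SWP_PAGE_DISCARD set
 * @below_pmd:     order < PMD_ORDER
 * @swap_is_full:  vm_swap_full() would return true
 */
static size_t toy_alloc_order(bool discard, bool below_pmd, bool swap_is_full,
			      enum toy_source steps[TOY_MAX_STEPS])
{
	size_t n = 0;

	if (discard)		/* discarding device: free cluster first */
		steps[n++] = TOY_FREE;
	if (below_pmd)
		steps[n++] = TOY_NONFULL;
	if (!discard)		/* non-discarding device: free cluster second */
		steps[n++] = TOY_FREE;
	if (swap_is_full)	/* reclaim only after both lists were tried */
		steps[n++] = TOY_RECLAIM;
	if (below_pmd)
		steps[n++] = TOY_FRAG;

	assert(n <= TOY_MAX_STEPS);
	return n;
}
```

For a non-discarding device and an order below PMD_ORDER, the model yields nonfull, free, then the fragment scan, which is the preference change the commit message describes.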
Acked-by: Chris Li <chrisl@kernel.org>

On Mon, Aug 4, 2025 at 10:25 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> We prefer a free cluster over a nonfull cluster whenever a CPU local
> cluster is drained to respect the SSD discard behavior [1]. It's not
> a best practice for non-discarding devices. And this is causing a
> higher fragmentation rate.

It does more than cause a higher fragmentation rate. With a limited
working set, a long period of continued swapping can end up writing to
the whole swap partition. That is bad from the SSD point of view if the
swap page access pattern is random, because under random access very
few clusters ever have all 512 entries free, which is what it takes to
reach the discard.

The previous prefer-new-cluster approach works best with batched jobs
of short to medium run time, where at the end of a batch most of the
swap working set is released. That can turn nonfull clusters back into
free clusters. For long-running jobs with random access to swap
entries, there is very little chance of freeing a whole cluster for
discard.

This patch makes a limited working set write only to a limited swap
area, which is a good thing from the SSD wear point of view.

Chris
On Mon, Aug 4, 2025 at 10:24 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> We prefer a free cluster over a nonfull cluster whenever a CPU local
> cluster is drained to respect the SSD discard behavior [1]. It's not
> a best practice for non-discarding devices. And this is causing a
> higher fragmentation rate.
>
> So for a non-discarding device, prefer nonfull over free clusters. This
> reduces the fragmentation issue by a lot.

Nice! I agree with Chris' analysis too. It's less of a problem for
vswap (because there's no physical/SSD implication over there), but
this patch makes sense in the context of swapfile allocator.

FWIW:
Reviewed-by: Nhat Pham <nphamcs@gmail.com>

> +	if (!(si->flags & SWP_PAGE_DISCARD)) {
> +		ci = isolate_lock_cluster(si, &si->free_clusters);
> +		if (ci) {
> +			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> +							order, usage);
> +			if (found)
> +				goto done;
> +		}
> +	}

Seems like this pattern is repeated a couple of places -
isolate_lock_cluster from one of the lists, and if successful, then
try to allocate (alloc_swap_scan_cluster) from it.

Might be refactorable in a future clean up patch.
On Tue, Aug 5, 2025 at 5:03 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > +	if (!(si->flags & SWP_PAGE_DISCARD)) {
> > +		ci = isolate_lock_cluster(si, &si->free_clusters);
> > +		if (ci) {
> > +			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> > +							order, usage);
> > +			if (found)
> > +				goto done;
> > +		}
> > +	}
>
> Seems like this pattern is repeated a couple of places -
> isolate_lock_cluster from one of the lists, and if successful, then
> try to allocate (alloc_swap_scan_cluster) from it.
>
> Might be refactorable in a future clean up patch.

Yes, agreed. I noticed that as well. Incidentally, I was writing an RFC
patch to clean it up when I saw your email come in.

Another reason to clean it up is that isolate_lock_cluster() must be
paired with relocate_cluster(); otherwise we are left with a dangling
cluster that is not on any list. We'd better pair the isolate() and
relocate() in the same function for better visibility.

Chris
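The pairing idea can be illustrated with a user-space toy. This is only a sketch of the shape such a cleanup might take, not the RFC patch mentioned above and not kernel code: every toy_ type and function below is a hypothetical stand-in, and the real isolate_lock_cluster(), alloc_swap_scan_cluster() and relocate_cluster() have different signatures and locking rules.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins, all hypothetical; the kernel structures differ. */
struct toy_cluster {
	struct toy_cluster *next;
	unsigned int free_slots;
	int on_list;			/* 1 while linked on some list */
};

struct toy_list {
	struct toy_cluster *head;
};

/* Detach the first cluster from @list, or return NULL if it is empty. */
static struct toy_cluster *toy_isolate(struct toy_list *list)
{
	struct toy_cluster *ci = list->head;

	if (!ci)
		return NULL;
	list->head = ci->next;
	ci->next = NULL;
	ci->on_list = 0;
	return ci;
}

/* Link an isolated cluster back onto @list. */
static void toy_relocate(struct toy_list *list, struct toy_cluster *ci)
{
	assert(!ci->on_list);
	ci->next = list->head;
	list->head = ci;
	ci->on_list = 1;
}

/* Take @nr slots from @ci; return 1 on success, 0 if they do not fit. */
static int toy_scan(struct toy_cluster *ci, unsigned int nr)
{
	if (ci->free_slots < nr)
		return 0;
	ci->free_slots -= nr;
	return 1;
}

/*
 * Isolate, scan and relocate in one function, so that a cluster taken
 * off @src always ends up on @dst, whether or not the scan succeeded.
 */
static int toy_alloc_from_list(struct toy_list *src, struct toy_list *dst,
			       unsigned int nr)
{
	struct toy_cluster *ci = toy_isolate(src);
	int found;

	if (!ci)
		return 0;
	found = toy_scan(ci, nr);
	toy_relocate(dst, ci);		/* runs on success and on failure */
	return found;
}
```

Keeping the isolate and relocate calls in one function means no caller can take a cluster off a list and forget to put it back, which is the visibility argument made above.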
On Wed, Aug 6, 2025 at 8:06 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> Nice! I agree with Chris' analysis too. It's less of a problem for
> vswap (because there's no physical/SSD implication over there), but
> this patch makes sense in the context of swapfile allocator.
>
> FWIW:
> Reviewed-by: Nhat Pham <nphamcs@gmail.com>

Thanks!

> > +	if (!(si->flags & SWP_PAGE_DISCARD)) {
> > +		ci = isolate_lock_cluster(si, &si->free_clusters);
> > +		if (ci) {
> > +			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> > +							order, usage);
> > +			if (found)
> > +				goto done;
> > +		}
> > +	}
>
> Seems like this pattern is repeated a couple of places -
> isolate_lock_cluster from one of the lists, and if successful, then
> try to allocate (alloc_swap_scan_cluster) from it.

Indeed, I've been thinking about it but there are some other issues
that need to be cleaned up before this one.