[PATCH v3 2/2] zswap: track swapins from disk more accurately

Nhat Pham posted 2 patches 1 month, 2 weeks ago
Posted by Nhat Pham 1 month, 2 weeks ago
Currently, there are a couple of issues with our disk swapin tracking
for dynamic zswap shrinker heuristics:

1. We only increment the swapin counter on pivot pages. This means we
   are not taking into account pages that also need to be swapped in,
   but are already taken care of as part of the readahead window.

2. We are also incrementing when the pages are read from the zswap pool,
   which is inaccurate.

This patch rectifies these issues by incrementing the counter whenever
we need to perform a non-zswap read. Note that we are slightly
overcounting, as a page might be read into memory by the readahead
algorithm even though it will not be needed by users - however, this is
an acceptable inaccuracy, as the readahead logic itself will adapt to
these kinds of scenarios.
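
As an illustration only, the difference between the two accounting
schemes can be sketched in user space. This is a toy model, not kernel
code: struct toy_page and the toy_count_*() helpers are hypothetical
names, and it assumes every page in the readahead window is newly
allocated and therefore goes through the read path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a swapped-out page; not a kernel type. */
struct toy_page {
	bool in_zswap;	/* true: zswap_load() would succeed; false: disk read */
};

/*
 * Old scheme (simplified): one increment per fault, for the pivot page
 * only, regardless of whether it was served from zswap or from disk.
 */
static unsigned long toy_count_old(const struct toy_page *win, size_t n,
				   size_t pivot)
{
	(void)win;
	assert(pivot < n);
	return 1;
}

/*
 * New scheme (simplified): every page in the window goes through the
 * read path, and only reads that miss zswap are counted.
 */
static unsigned long toy_count_new(const struct toy_page *win, size_t n)
{
	unsigned long disk_swapins = 0;

	for (size_t i = 0; i < n; i++) {
		if (win[i].in_zswap)
			continue;	/* zswap hit: no slow read, no increment */
		disk_swapins++;		/* stands in for zswap_folio_swapin() */
	}
	return disk_swapins;
}
```

For a four-page window in which only the pivot sits in zswap, the old
scheme counts 1 (a zswap hit treated as a disk swapin), while the new
scheme counts the 3 pages that actually had to come from disk.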

To test this change, I built the kernel under a cgroup with its
memory.max set to 2 GB:

real: 236.66s
user: 4286.06s
sys: 652.86s
swapins: 81552

For comparison, with just the new second chance algorithm, the build
time is as follows:

real: 244.85s
user: 4327.22s
sys: 664.39s
swapins: 94663

With neither change:

real: 263.89s
user: 4318.11s
sys: 673.29s
swapins: 227300.5

(average over 5 runs)

With this change, kernel CPU time drops by a further 1.7% and real
time by another 3.3%, compared to the second chance algorithm alone.
The swapin count also drops by another 13.85%.

Combining the two changes, we reduce real time by 10.32%, kernel CPU
time by 3%, and the number of swapins by 64.12%.
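
For anyone who wants to re-derive the percentages quoted above, here is
a small sketch of the arithmetic. The struct and helper names are made
up for illustration; the figures are copied from the three result
blocks for the 2 GB memory.max build.

```c
#include <assert.h>

/* Averages over 5 runs, copied from the result blocks above. */
struct bench {
	double real_s;	/* wall-clock build time, seconds */
	double sys_s;	/* kernel CPU time, seconds */
	double swapins;	/* swapin count */
};

static const struct bench both        = { 236.66, 652.86, 81552.0 };
static const struct bench second_only = { 244.85, 664.39, 94663.0 };
static const struct bench neither     = { 263.89, 673.29, 227300.5 };

/* Percentage by which `after` is lower than `before`. */
static double pct_reduction(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

For example, pct_reduction(second_only.swapins, both.swapins) comes out
at roughly 13.85, and pct_reduction(neither.real_s, both.real_s) at
roughly 10.32.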

To gauge the new scheme's ability to offload cold data, I ran another
benchmark, in which the kernel was built under a cgroup with memory.max
set to 3 GB, but with 0.5 GB worth of cold data allocated before each
build (in a shmem file).

Under the old scheme:

real: 197.18s
user: 4365.08s
sys: 289.02s
zswpwb: 72115.2

Under the new scheme:

real: 195.8s
user: 4362.25s
sys: 290.14s
zswpwb: 87277.8

(average over 5 runs)

Notice that we actually observe a 21% increase in the number of written
back pages - so the new scheme is just as good, if not better, at
offloading pages from the zswap pool when they are cold. Build time
drops by around 0.7% as a result.

Fixes: b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
 mm/page_io.c    | 11 ++++++++++-
 mm/swap_state.c |  8 ++------
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index ff8c99ee3af7..0004c9fbf7e8 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -521,7 +521,15 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
 
 	if (zswap_load(folio)) {
 		folio_unlock(folio);
-	} else if (data_race(sis->flags & SWP_FS_OPS)) {
+		goto finish;
+	}
+
+	/*
+	 * We have to read the page from slower devices. Increase zswap protection.
+	 */
+	zswap_folio_swapin(folio);
+
+	if (data_race(sis->flags & SWP_FS_OPS)) {
 		swap_read_folio_fs(folio, plug);
 	} else if (synchronous) {
 		swap_read_folio_bdev_sync(folio, sis);
@@ -529,6 +537,7 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
 		swap_read_folio_bdev_async(folio, sis);
 	}
 
+finish:
 	if (workingset) {
 		delayacct_thrashing_end(&in_thrashing);
 		psi_memstall_leave(&pflags);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index a1726e49a5eb..3a0cf965f32b 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -698,10 +698,8 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	/* The page was likely read above, so no need for plugging here */
 	folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
 					&page_allocated, false);
-	if (unlikely(page_allocated)) {
-		zswap_folio_swapin(folio);
+	if (unlikely(page_allocated))
 		swap_read_folio(folio, NULL);
-	}
 	return folio;
 }
 
@@ -850,10 +848,8 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 	/* The folio was likely read above, so no need for plugging here */
 	folio = __read_swap_cache_async(targ_entry, gfp_mask, mpol, targ_ilx,
 					&page_allocated, false);
-	if (unlikely(page_allocated)) {
-		zswap_folio_swapin(folio);
+	if (unlikely(page_allocated))
 		swap_read_folio(folio, NULL);
-	}
 	return folio;
 }
 
-- 
2.43.0
Re: [PATCH v3 2/2] zswap: track swapins from disk more accurately
Posted by Johannes Weiner 1 month, 1 week ago
On Mon, Aug 05, 2024 at 04:22:43PM -0700, Nhat Pham wrote:
> Currently, there are a couple of issues with our disk swapin tracking
> for dynamic zswap shrinker heuristics:
> 
> 1. We only increment the swapin counter on pivot pages. This means we
>    are not taking into account pages that also need to be swapped in,
>    but are already taken care of as part of the readahead window.
> 
> 2. We are also incrementing when the pages are read from the zswap pool,
>    which is inaccurate.
> 
> This patch rectifies these issues by incrementing the counter whenever
> we need to perform a non-zswap read. Note that we are slightly
> overcounting, as a page might be read into memory by the readahead
> algorithm even though it will not be neeeded by users - however, this is
> an acceptable inaccuracy, as the readahead logic itself will adapt to
> these kind of scenarios.
> 
> To test this change, I built the kernel under a cgroup with its
> memory.max set to 2 GB:
> 
> real: 236.66s
> user: 4286.06s
> sys: 652.86s
> swapins: 81552
> 
> For comparison, with just the new second chance algorithm, the build
> time is as follows:
> 
> real: 244.85s
> user: 4327.22s
> sys: 664.39s
> swapins: 94663
> 
> Without neither:
> 
> real: 263.89s
> user: 4318.11s
> sys: 673.29s
> swapins: 227300.5
> 
> (average over 5 runs)
> 
> With this change, the kernel CPU time reduces by a further 1.7%, and
> the real time is reduced by another 3.3%, compared to just the second
> chance algorithm by itself. The swapins count also reduces by another
> 13.85%.
> 
> Combinng the two changes, we reduce the real time by 10.32%, kernel CPU
> time by 3%, and number of swapins by 64.12%.
> 
> To gauge the new scheme's ability to offload cold data, I ran another
> benchmark, in which the kernel was built under a cgroup with memory.max
> set to 3 GB, but with 0.5 GB worth of cold data allocated before each
> build (in a shmem file).
> 
> Under the old scheme:
> 
> real: 197.18s
> user: 4365.08s
> sys: 289.02s
> zswpwb: 72115.2
> 
> Under the new scheme:
> 
> real: 195.8s
> user: 4362.25s
> sys: 290.14s
> zswpwb: 87277.8
> 
> (average over 5 runs)
> 
> Notice that we actually observe a 21% increase in the number of written
> back pages - so the new scheme is just as good, if not better at
> offloading pages from the zswap pool when they are cold. Build time
> reduces by around 0.7% as a result.
> 
> Fixes: b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Re: [PATCH v3 2/2] zswap: track swapins from disk more accurately
Posted by Yosry Ahmed 1 month, 2 weeks ago
On Mon, Aug 5, 2024 at 4:22 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> Currently, there are a couple of issues with our disk swapin tracking
> for dynamic zswap shrinker heuristics:
>
> 1. We only increment the swapin counter on pivot pages. This means we
>    are not taking into account pages that also need to be swapped in,
>    but are already taken care of as part of the readahead window.
>
> 2. We are also incrementing when the pages are read from the zswap pool,
>    which is inaccurate.
>
> This patch rectifies these issues by incrementing the counter whenever
> we need to perform a non-zswap read. Note that we are slightly
> overcounting, as a page might be read into memory by the readahead
> algorithm even though it will not be neeeded by users - however, this is

needed*

> an acceptable inaccuracy, as the readahead logic itself will adapt to
> these kind of scenarios.
>
> To test this change, I built the kernel under a cgroup with its
> memory.max set to 2 GB:
>
> real: 236.66s
> user: 4286.06s
> sys: 652.86s
> swapins: 81552
>
> For comparison, with just the new second chance algorithm, the build
> time is as follows:
>
> real: 244.85s
> user: 4327.22s
> sys: 664.39s
> swapins: 94663
>
> Without neither:
>
> real: 263.89s
> user: 4318.11s
> sys: 673.29s
> swapins: 227300.5
>
> (average over 5 runs)
>
> With this change, the kernel CPU time reduces by a further 1.7%, and
> the real time is reduced by another 3.3%, compared to just the second
> chance algorithm by itself. The swapins count also reduces by another
> 13.85%.
>
> Combinng the two changes, we reduce the real time by 10.32%, kernel CPU

Combining*

> time by 3%, and number of swapins by 64.12%.
>
> To gauge the new scheme's ability to offload cold data, I ran another
> benchmark, in which the kernel was built under a cgroup with memory.max
> set to 3 GB, but with 0.5 GB worth of cold data allocated before each
> build (in a shmem file).
>
> Under the old scheme:
>
> real: 197.18s
> user: 4365.08s
> sys: 289.02s
> zswpwb: 72115.2
>
> Under the new scheme:
>
> real: 195.8s
> user: 4362.25s
> sys: 290.14s
> zswpwb: 87277.8
>
> (average over 5 runs)
>
> Notice that we actually observe a 21% increase in the number of written
> back pages - so the new scheme is just as good, if not better at
> offloading pages from the zswap pool when they are cold. Build time
> reduces by around 0.7% as a result.
>
> Fixes: b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> ---
>  mm/page_io.c    | 11 ++++++++++-
>  mm/swap_state.c |  8 ++------
>  2 files changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/mm/page_io.c b/mm/page_io.c
> index ff8c99ee3af7..0004c9fbf7e8 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -521,7 +521,15 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
>
>         if (zswap_load(folio)) {
>                 folio_unlock(folio);
> -       } else if (data_race(sis->flags & SWP_FS_OPS)) {
> +               goto finish;
> +       }
> +
> +       /*
> +        * We have to read the page from slower devices. Increase zswap protection.
> +        */

Can we fit this on a single line?

Anyway, LGTM:
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Re: [PATCH v3 2/2] zswap: track swapins from disk more accurately
Posted by Nhat Pham 1 month, 2 weeks ago
On Mon, Aug 5, 2024 at 5:15 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> > algorithm even though it will not be neeeded by users - however, this is
>
> needed*
>
[...]
> >
> > Combinng the two changes, we reduce the real time by 10.32%, kernel CPU
>
> Combining*
>

My bad - English is hard :) Thanks for picking these out, Yosry!

Andrew, would you mind fixing these two typos for me?

>
> Can we fit this on a single line?

Just sent a fixlet to squeeze it into a single line :)

>
> Anyway, LGTM:
> Acked-by: Yosry Ahmed <yosryahmed@google.com>

Thanks for the review and feedback on the patch series, Yosry!
[PATCH v3 2/2] zswap: track swapins from disk more accurately (fix)
Posted by Nhat Pham 1 month, 2 weeks ago
Squeeze a comment into a single line.

Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
 mm/page_io.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index 0004c9fbf7e8..aa190e3cb050 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -524,9 +524,7 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
 		goto finish;
 	}
 
-	/*
-	 * We have to read the page from slower devices. Increase zswap protection.
-	 */
+	/* We have to read from slower devices. Increase zswap protection. */
 	zswap_folio_swapin(folio);
 
 	if (data_race(sis->flags & SWP_FS_OPS)) {
-- 
2.43.0