This patch allows zswap to accept incompressible pages and store them
into zpool if possible.
This change is required to achieve zero rejection on zswap_store(). With
a proper amount of proactive shrinking, swapout can be buffered by zswap
without IO latency. Storing incompressible pages may seem costly, but it
can reduce latency. A rare incompressible page in a large batch of
compressible pages can delay the entire batch during swapout.
The memory overhead is negligible because the underlying zsmalloc
already accepts nearly incompressible pages: zsmalloc stores data close
to PAGE_SIZE in a dedicated page. Storing such pages as-is thus saves
decompression cycles with no extra allocation overhead. zswap itself has
never rejected pages in these cases.
To store a page as-is, reuse the compressed data size field `length` in
struct `zswap_entry`: length == PAGE_SIZE indicates incompressible
data.
If a zpool backend does not support allocating PAGE_SIZE objects (zbud),
the behavior remains unchanged: the allocation failure reported by the
zpool rejects the page as before.
Signed-off-by: Takero Funaki <flintglass@gmail.com>
---
mm/zswap.c | 36 +++++++++++++++++++++++++++++++++---
1 file changed, 33 insertions(+), 3 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index 76691ca7b6a7..def0f948a4ab 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -186,6 +186,8 @@ static struct shrinker *zswap_shrinker;
* length - the length in bytes of the compressed page data. Needed during
* decompression. For a same value filled page length is 0, and both
* pool and lru are invalid and must be ignored.
+ * If length is equal to PAGE_SIZE, the data stored in handle is
+ * not compressed. The data must be copied to page as-is.
* pool - the zswap_pool the entry's data is in
* handle - zpool allocation handle that stores the compressed page data
* value - value of the same-value filled pages which have same content
@@ -969,9 +971,23 @@ static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
*/
comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
dlen = acomp_ctx->req->dlen;
- if (comp_ret)
+
+ /* coa_compress returns -EINVAL for errors including insufficient dlen */
+ if (comp_ret && comp_ret != -EINVAL)
goto unlock;

+ /*
+ * If the data cannot be compressed well, store the data as-is.
+ * Switching by a threshold at
+ * PAGE_SIZE - (allocation granularity)
+ * zbud and z3fold use 64B granularity.
+ * zsmalloc stores >3632B in one page for 4K page arch.
+ */
+ if (comp_ret || dlen > PAGE_SIZE - 64) {
+ /* we do not use compressed result anymore */
+ comp_ret = 0;
+ dlen = PAGE_SIZE;
+ }
zpool = zswap_find_zpool(entry);
gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
if (zpool_malloc_support_movable(zpool))
@@ -981,14 +997,20 @@ static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
goto unlock;

buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
- memcpy(buf, dst, dlen);
+
+ /* PAGE_SIZE indicates not compressed. */
+ if (dlen == PAGE_SIZE)
+ memcpy_from_folio(buf, folio, 0, PAGE_SIZE);
+ else
+ memcpy(buf, dst, dlen);
+
zpool_unmap_handle(zpool, handle);

entry->handle = handle;
entry->length = dlen;

unlock:
- if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
+ if (alloc_ret == -ENOSPC)
zswap_reject_compress_poor++;
else if (comp_ret)
zswap_reject_compress_fail++;
@@ -1006,6 +1028,14 @@ static void zswap_decompress(struct zswap_entry *entry, struct page *page)
struct crypto_acomp_ctx *acomp_ctx;
u8 *src;

+ if (entry->length == PAGE_SIZE) {
+ /* the content is not compressed. copy back as-is. */
+ src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
+ memcpy_to_page(page, 0, src, entry->length);
+ zpool_unmap_handle(zpool, entry->handle);
+ return;
+ }
+
acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
mutex_lock(&acomp_ctx->mutex);
--
2.43.0
On 2024/7/6 10:25, Takero Funaki wrote:
> This patch allows zswap to accept incompressible pages and store them
> into zpool if possible.
>
> This change is required to achieve zero rejection on zswap_store(). With
> proper amount of proactive shrinking, swapout can be buffered by zswap
> without IO latency. Storing incompressible pages may seem costly, but it
> can reduce latency. A rare incompressible page in a large batch of
> compressive pages can delay the entire batch during swapping.
>
> The memory overhead is negligible because the underlying zsmalloc
> already accepts nearly incompressible pages. zsmalloc stores data close
> to PAGE_SIZE to a dedicated page. Thus storing as-is saves decompression
> cycles without allocation overhead. zswap itself has not rejected pages
> in these cases.
>
> To store the page as-is, use the compressed data size field `length` in
> struct `zswap_entry`. The length == PAGE_SIZE indicates
> incompressible data.
>
> If a zpool backend does not support allocating PAGE_SIZE (zbud), the
> behavior remains unchanged. The allocation failure reported by the zpool
> blocks accepting the page as before.
>
> Signed-off-by: Takero Funaki <flintglass@gmail.com>
> ---
> mm/zswap.c | 36 +++++++++++++++++++++++++++++++++---
> 1 file changed, 33 insertions(+), 3 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 76691ca7b6a7..def0f948a4ab 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -186,6 +186,8 @@ static struct shrinker *zswap_shrinker;
> * length - the length in bytes of the compressed page data. Needed during
> * decompression. For a same value filled page length is 0, and both
> * pool and lru are invalid and must be ignored.
> + * If length is equal to PAGE_SIZE, the data stored in handle is
> + * not compressed. The data must be copied to page as-is.
> * pool - the zswap_pool the entry's data is in
> * handle - zpool allocation handle that stores the compressed page data
> * value - value of the same-value filled pages which have same content
> @@ -969,9 +971,23 @@ static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
> */
> comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> dlen = acomp_ctx->req->dlen;
> - if (comp_ret)
> +
> + /* coa_compress returns -EINVAL for errors including insufficient dlen */
> + if (comp_ret && comp_ret != -EINVAL)
> goto unlock;
Seems we don't need to care about? "comp_ret" is useless anymore.
Just:
if (comp_ret || dlen > PAGE_SIZE - 64)
dlen = PAGE_SIZE;
And remove the checkings of comp_ret at the end.
>
> + /*
> + * If the data cannot be compressed well, store the data as-is.
> + * Switching by a threshold at
> + * PAGE_SIZE - (allocation granularity)
> + * zbud and z3fold use 64B granularity.
> + * zsmalloc stores >3632B in one page for 4K page arch.
> + */
> + if (comp_ret || dlen > PAGE_SIZE - 64) {
> + /* we do not use compressed result anymore */
> + comp_ret = 0;
> + dlen = PAGE_SIZE;
> + }
> zpool = zswap_find_zpool(entry);
> gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
> if (zpool_malloc_support_movable(zpool))
> @@ -981,14 +997,20 @@ static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
> goto unlock;
>
> buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
> - memcpy(buf, dst, dlen);
> +
> + /* PAGE_SIZE indicates not compressed. */
> + if (dlen == PAGE_SIZE)
> + memcpy_from_folio(buf, folio, 0, PAGE_SIZE);
We actually don't need to hold mutex if we are just copying folio.
Thanks.
> + else
> + memcpy(buf, dst, dlen);
> +
> zpool_unmap_handle(zpool, handle);
>
> entry->handle = handle;
> entry->length = dlen;
>
> unlock:
> - if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
> + if (alloc_ret == -ENOSPC)
> zswap_reject_compress_poor++;
> else if (comp_ret)
> zswap_reject_compress_fail++;
> @@ -1006,6 +1028,14 @@ static void zswap_decompress(struct zswap_entry *entry, struct page *page)
> struct crypto_acomp_ctx *acomp_ctx;
> u8 *src;
>
> + if (entry->length == PAGE_SIZE) {
> + /* the content is not compressed. copy back as-is. */
> + src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
> + memcpy_to_page(page, 0, src, entry->length);
> + zpool_unmap_handle(zpool, entry->handle);
> + return;
> + }
> +
> acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> mutex_lock(&acomp_ctx->mutex);
>
On Mon, Jul 8, 2024 at 12:56, Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> > dlen = acomp_ctx->req->dlen;
> > - if (comp_ret)
> > +
> > + /* coa_compress returns -EINVAL for errors including insufficient dlen */
> > + if (comp_ret && comp_ret != -EINVAL)
> > goto unlock;
>
> Seems we don't need to care about? "comp_ret" is useless anymore.
>
> Just:
>
> if (comp_ret || dlen > PAGE_SIZE - 64)
> dlen = PAGE_SIZE;
>
> And remove the checkings of comp_ret at the end.
>
>
> We actually don't need to hold mutex if we are just copying folio.
>
> Thanks.
>
Thanks for reviewing.
For comp_ret, can we consolidate all possible error codes as
incompressible data?
if we do not need to distinguish -EINVAL and the others, diff v2..v3
can be like:
@@ -62,8 +62,6 @@ static u64 zswap_pool_limit_hit;
static u64 zswap_written_back_pages;
/* Store failed due to a reclaim failure after pool limit was reached */
static u64 zswap_reject_reclaim_fail;
-/* Store failed due to compression algorithm failure */
-static u64 zswap_reject_compress_fail;
/* Compressed page was too big for the allocator to (optimally) store */
static u64 zswap_reject_compress_poor;
/* Store failed because underlying allocator could not get memory */
@@ -1043,10 +1041,6 @@ static bool zswap_compress(struct folio *folio,
struct zswap_entry *entry)
comp_ret =
crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
&acomp_ctx->wait);
dlen = acomp_ctx->req->dlen;
- /* coa_compress returns -EINVAL for errors including
insufficient dlen */
- if (comp_ret && comp_ret != -EINVAL)
- goto unlock;
-
/*
* If the data cannot be compressed well, store the data as-is.
* Switching by a threshold at
@@ -1056,7 +1050,8 @@ static bool zswap_compress(struct folio *folio,
struct zswap_entry *entry)
*/
if (comp_ret || dlen > PAGE_SIZE - 64) {
/* we do not use compressed result anymore */
- comp_ret = 0;
+ mutex_unlock(&acomp_ctx->mutex);
+ acomp_ctx = NULL;
dlen = PAGE_SIZE;
}
zpool = zswap_find_zpool(entry);
@@ -1083,12 +1078,11 @@ static bool zswap_compress(struct folio
*folio, struct zswap_entry *entry)
unlock:
if (alloc_ret == -ENOSPC)
zswap_reject_compress_poor++;
- else if (comp_ret)
- zswap_reject_compress_fail++;
else if (alloc_ret)
zswap_reject_alloc_fail++;
- mutex_unlock(&acomp_ctx->mutex);
+ if (acomp_ctx)
+ mutex_unlock(&acomp_ctx->mutex);
return comp_ret == 0 && alloc_ret == 0;
}
@@ -1886,8 +1880,6 @@ static int zswap_debugfs_init(void)
zswap_debugfs_root, &zswap_reject_alloc_fail);
debugfs_create_u64("reject_kmemcache_fail", 0444,
zswap_debugfs_root, &zswap_reject_kmemcache_fail);
- debugfs_create_u64("reject_compress_fail", 0444,
- zswap_debugfs_root, &zswap_reject_compress_fail);
debugfs_create_u64("reject_compress_poor", 0444,
zswap_debugfs_root, &zswap_reject_compress_poor);
debugfs_create_u64("written_back_pages", 0444,
On 2024/7/8 21:44, Takero Funaki wrote:
> On Mon, Jul 8, 2024 at 12:56, Chengming Zhou <chengming.zhou@linux.dev> wrote:
>
>>> comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
>>> dlen = acomp_ctx->req->dlen;
>>> - if (comp_ret)
>>> +
>>> + /* coa_compress returns -EINVAL for errors including insufficient dlen */
>>> + if (comp_ret && comp_ret != -EINVAL)
>>> goto unlock;
>>
>> Seems we don't need to care about? "comp_ret" is useless anymore.
>>
>> Just:
>>
>> if (comp_ret || dlen > PAGE_SIZE - 64)
>> dlen = PAGE_SIZE;
>>
>> And remove the checkings of comp_ret at the end.
>>
>
>>
>> We actually don't need to hold mutex if we are just copying folio.
>>
>> Thanks.
>>
>
> Thanks for reviewing.
>
> For comp_ret, can we consolidate all possible error codes as
> incompressible data?
Maybe we still want these debug counters? I'm not sure.
With your proposal, I think we don't care about compression failures
anymore, in all cases it's just ok to fallback to just copy the folio.
> if we do not need to distinguish -EINVAL and the others, diff v2..v3
> can be like:
>
> @@ -62,8 +62,6 @@ static u64 zswap_pool_limit_hit;
> static u64 zswap_written_back_pages;
> /* Store failed due to a reclaim failure after pool limit was reached */
> static u64 zswap_reject_reclaim_fail;
> -/* Store failed due to compression algorithm failure */
> -static u64 zswap_reject_compress_fail;
> /* Compressed page was too big for the allocator to (optimally) store */
> static u64 zswap_reject_compress_poor;
> /* Store failed because underlying allocator could not get memory */
> @@ -1043,10 +1041,6 @@ static bool zswap_compress(struct folio *folio,
> struct zswap_entry *entry)
> comp_ret =
> crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
> &acomp_ctx->wait);
> dlen = acomp_ctx->req->dlen;
>
> - /* coa_compress returns -EINVAL for errors including
> insufficient dlen */
> - if (comp_ret && comp_ret != -EINVAL)
> - goto unlock;
> -
> /*
> * If the data cannot be compressed well, store the data as-is.
> * Switching by a threshold at
> @@ -1056,7 +1050,8 @@ static bool zswap_compress(struct folio *folio,
> struct zswap_entry *entry)
> */
> if (comp_ret || dlen > PAGE_SIZE - 64) {
> /* we do not use compressed result anymore */
> - comp_ret = 0;
> + mutex_unlock(&acomp_ctx->mutex);
> + acomp_ctx = NULL;
> dlen = PAGE_SIZE;
> }
> zpool = zswap_find_zpool(entry);
> @@ -1083,12 +1078,11 @@ static bool zswap_compress(struct folio
> *folio, struct zswap_entry *entry)
> unlock:
> if (alloc_ret == -ENOSPC)
> zswap_reject_compress_poor++;
> - else if (comp_ret)
> - zswap_reject_compress_fail++;
If you want to keep these debug counters, you can move these forward.
> else if (alloc_ret)
> zswap_reject_alloc_fail++;
>
> - mutex_unlock(&acomp_ctx->mutex);
> + if (acomp_ctx)
> + mutex_unlock(&acomp_ctx->mutex);
> return comp_ret == 0 && alloc_ret == 0;
And here we don't care about comp_ret anymore.
Thanks.
> }
>
> @@ -1886,8 +1880,6 @@ static int zswap_debugfs_init(void)
> zswap_debugfs_root, &zswap_reject_alloc_fail);
> debugfs_create_u64("reject_kmemcache_fail", 0444,
> zswap_debugfs_root, &zswap_reject_kmemcache_fail);
> - debugfs_create_u64("reject_compress_fail", 0444,
> - zswap_debugfs_root, &zswap_reject_compress_fail);
> debugfs_create_u64("reject_compress_poor", 0444,
> zswap_debugfs_root, &zswap_reject_compress_poor);
> debugfs_create_u64("written_back_pages", 0444,
On Tue, Jul 9, 2024 at 6:26 AM Chengming Zhou <chengming.zhou@linux.dev> wrote:
>
> On 2024/7/8 21:44, Takero Funaki wrote:
> > On Mon, Jul 8, 2024 at 12:56, Chengming Zhou <chengming.zhou@linux.dev> wrote:
> >
> >>> comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> >>> dlen = acomp_ctx->req->dlen;
> >>> - if (comp_ret)
> >>> +
> >>> + /* coa_compress returns -EINVAL for errors including insufficient dlen */
> >>> + if (comp_ret && comp_ret != -EINVAL)
> >>> goto unlock;
> >>
> >> Seems we don't need to care about? "comp_ret" is useless anymore.
> >>
> >> Just:
> >>
> >> if (comp_ret || dlen > PAGE_SIZE - 64)
> >> dlen = PAGE_SIZE;
> >>
> >> And remove the checkings of comp_ret at the end.
> >>
> >
> >>
> >> We actually don't need to hold mutex if we are just copying folio.
> >>
> >> Thanks.
> >>
> >
> > Thanks for reviewing.
> >
> > For comp_ret, can we consolidate all possible error codes as
> > incompressible data?
>
> Maybe we still want these debug counters? I'm not sure.

I'm a bit torn, but ATM I have no strong opinions on these two error
codes. If you do decide to consolidate these two, may I ask you to
separate it into its own patch so that we can review + discuss it
separately?
On 2024/7/13 06:47, Nhat Pham wrote:
> On Tue, Jul 9, 2024 at 6:26 AM Chengming Zhou <chengming.zhou@linux.dev> wrote:
>>
>> On 2024/7/8 21:44, Takero Funaki wrote:
>>> On Mon, Jul 8, 2024 at 12:56, Chengming Zhou <chengming.zhou@linux.dev> wrote:
>>>
>>>>> comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
>>>>> dlen = acomp_ctx->req->dlen;
>>>>> - if (comp_ret)
>>>>> +
>>>>> + /* coa_compress returns -EINVAL for errors including insufficient dlen */
>>>>> + if (comp_ret && comp_ret != -EINVAL)
>>>>> goto unlock;
>>>>
>>>> Seems we don't need to care about? "comp_ret" is useless anymore.
>>>>
>>>> Just:
>>>>
>>>> if (comp_ret || dlen > PAGE_SIZE - 64)
>>>> dlen = PAGE_SIZE;
>>>>
>>>> And remove the checkings of comp_ret at the end.
>>>>
>>>
>>>>
>>>> We actually don't need to hold mutex if we are just copying folio.
>>>>
>>>> Thanks.
>>>>
>>>
>>> Thanks for reviewing.
>>>
>>> For comp_ret, can we consolidate all possible error codes as
>>> incompressible data?
>>
>> Maybe we still want these debug counters? I'm not sure.
>
> I'm a bit torn, but ATM I have no strong opinions on these two error
> codes. If you do decide to consolidate these two, may I ask you to
> separate it into its own patch so that we can review + discuss it
> separately?

Yeah, I also have no strong opinions on these two error codes, it's just
ok to keep them.
On Fri, Jul 5, 2024 at 7:25 PM Takero Funaki <flintglass@gmail.com> wrote:
>
> This patch allows zswap to accept incompressible pages and store them
> into zpool if possible.
>
> This change is required to achieve zero rejection on zswap_store(). With
> proper amount of proactive shrinking, swapout can be buffered by zswap
> without IO latency. Storing incompressible pages may seem costly, but it
> can reduce latency. A rare incompressible page in a large batch of
> compressive pages can delay the entire batch during swapping.
>
> The memory overhead is negligible because the underlying zsmalloc
> already accepts nearly incompressible pages. zsmalloc stores data close
> to PAGE_SIZE to a dedicated page. Thus storing as-is saves decompression
> cycles without allocation overhead. zswap itself has not rejected pages
> in these cases.
>
> To store the page as-is, use the compressed data size field `length` in
> struct `zswap_entry`. The length == PAGE_SIZE indicates
> incompressible data.
>
> If a zpool backend does not support allocating PAGE_SIZE (zbud), the
> behavior remains unchanged. The allocation failure reported by the zpool
> blocks accepting the page as before.
>
> Signed-off-by: Takero Funaki <flintglass@gmail.com>

I tried to propose something similar in the past. Please read the
following discussion:

https://lore.kernel.org/all/CAJD7tka6XRyzYndRNEFZmi0Zj4DD2KnVzt=vMGhfF4iN2B4VKw@mail.gmail.com/

But, the TLDR is Yosry was (rightly) concerned that with this
approach, memory reclaiming could end up increasing memory usage
rather than reducing (since we do not free up the page that fail to
zswap-out, and we need extra memory for the zswap metadata of that
page).

So my vote on this patch would be NACK, until we get around this issue
somehow :)
On Sun, Jul 7, 2024 at 8:53, Nhat Pham <nphamcs@gmail.com> wrote:
>
> I tried to propose something similar in the past. Please read the
> following discussion:
>
> https://lore.kernel.org/all/CAJD7tka6XRyzYndRNEFZmi0Zj4DD2KnVzt=vMGhfF4iN2B4VKw@mail.gmail.com/
>
> But, the TLDR is Yosry was (rightly) concerned that with this
> approach, memory reclaiming could end up increasing memory usage
> rather than reducing (since we do not free up the page that fail to
> zswap-out, and we need extra memory for the zswap metadata of that
> page).
>
> So my vote on this patch would be NACK, until we get around this issue
> somehow :)

It seems the discussion on the thread mixed up memory allocation
failure (system runs out of memory reserve) and incompressible pages
(compression algorithm successfully compressed but the result is equal
to or larger than PAGE_SIZE).

zswap has been storing pages into dedicated pages 1:1 when compressed
to near PAGE_SIZE. Using zsmalloc, current zswap stores pages
compressed to between 3633 bytes (=hugeclass+1) to 4095 bytes
(=PAGE_SIZE-1) into 1 page. This patch changes the range to 3633 to
4096 by treating PAGE_SIZE as a special case. I could not find a
reason to reject only PAGE_SIZE while accepting PAGE_SIZE-1.

zswap wastes memory for metadata for all accepted pages but reduces IO
amount and latency by compressed buffer memory. For pages between 3633
to 4096 bytes, zswap reduces the latency only. This is still
beneficial because the rare incompressible pages trigger urgent
pageout IO and incur a head-of-line blocking on the subsequent pages.
It also keeps LRU priority for pagein latency.

In the worst case or with a malicious dataset, zswap will waste a
significant amount of memory, but this patch does not affect nor
resolve the scenario. For example, if a user allocates pages
compressed to 3633 bytes, current zswap using zsmalloc cannot gain
memory as the compression ratio, including zsmalloc overhead, becomes
1:1. This also applies to zbud. The compression ratio will be 1:1 as
zbud cannot find buddies smaller than 463 bytes. zswap will be less
efficient but still work in this situation since the max pool percent
and background writeback ensure the pool size does not overwhelm
usable memory.

I suppose the current zswap has accepted the possible waste of memory,
at least since the current zswap_compress() logic was implemented. If
zswap had to ensure the compression ratio is better than 1:1, and only
prefers reducing IO amount (not latency), there would have been a
compression ratio threshold to reject pages not compressible to under
2048 bytes. I think accepting nearly incompressible pages is
beneficial and changing the range to 4096 does not negatively affect
the current behavior.
On Sun, Jul 7, 2024 at 2:38 AM Takero Funaki <flintglass@gmail.com> wrote:
>
> On Sun, Jul 7, 2024 at 8:53, Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > I tried to propose something similar in the past. Please read the
> > following discussion:
> >
> > https://lore.kernel.org/all/CAJD7tka6XRyzYndRNEFZmi0Zj4DD2KnVzt=vMGhfF4iN2B4VKw@mail.gmail.com/
> >
> > But, the TLDR is Yosry was (rightly) concerned that with this
> > approach, memory reclaiming could end up increasing memory usage
> > rather than reducing (since we do not free up the page that fail to
> > zswap-out, and we need extra memory for the zswap metadata of that
> > page).
> >
> > So my vote on this patch would be NACK, until we get around this issue
> > somehow :)
>
> It seems the discussion on the thread mixed up memory allocation
> failure (system runs out of memory reserve) and incompressible pages
> (compression algorithm successfully compressed but the result is equal
> to or larger than PAGE_SIZE).
>
> zswap has been storing pages into dedicated pages 1:1 when compressed
> to near PAGE_SIZE. Using zsmalloc, current zswap stores pages
> compressed to between 3633 bytes (=hugeclass+1) to 4095 bytes
> (=PAGE_SIZE-1) into 1 page. This patch changes the range to 3633 to
> 4096 by treating PAGE_SIZE as a special case. I could not find a
> reason to reject only PAGE_SIZE while accepting PAGE_SIZE-1.

I'm not actually sure if this is true in practice. While yes, zsmalloc
has the capability to store near-PAGE_SIZE objects, this also depends
on the compression algorithm.

At Meta, we use zstd. What I have found is that a lot of the time, it
just flat out rejects the page if it's too poorly compressed. Without
this change, we will not have to suffer the memory overhead of the
zswap_entry structures for these rejected pages, whereas we will with
this change.

We might need to run some tracing to get a histogram of the
distribution of post-compression sizes.

> zswap wastes memory for metadata for all accepted pages but reduces IO

Key word: accepted. The compression algorithm might already have some
built in logic to reject poorly compressed pages, preventing the cases
where the overhead might be too high for the saving.

> amount and latency by compressed buffer memory. For pages between 3633
> to 4096 bytes, zswap reduces the latency only. This is still
> beneficial because the rare incompressible pages trigger urgent
> pageout IO and incur a head-of-line blocking on the subsequent pages.
> It also keeps LRU priority for pagein latency.
>
> In the worst case or with a malicious dataset, zswap will waste a
> significant amount of memory, but this patch does not affect nor
> resolve the scenario. For example, if a user allocates pages
> compressed to 3633 bytes, current zswap using zsmalloc cannot gain
> memory as the compression ratio, including zsmalloc overhead, becomes
> 1:1. This also applies to zbud. The compression ratio will be 1:1 as
> zbud cannot find buddies smaller than 463 bytes. zswap will be less
> efficient but still work in this situation since the max pool percent
> and background writeback ensure the pool size does not overwhelm
> usable memory.
>
> I suppose the current zswap has accepted the possible waste of memory,
> at least since the current zswap_compress() logic was implemented. If
> zswap had to ensure the compression ratio is better than 1:1, and only
> prefers reducing IO amount (not latency), there would have been a
> compression ratio threshold to reject pages not compressible to under
> 2048 bytes. I think accepting nearly incompressible pages is
> beneficial and changing the range to 4096 does not negatively affect
> the current behavior.

FWIW, I do agree with your approach (storing incompressible pages in
the zswap pool to maintain LRU ordering) - this is *essentially* what I
was trying to do too with the attempt I mentioned above.

I'll let Johannes and Yosry chime in as well, since they were the
original folks who raised these concerns :) If they're happy then I'll
revoke my NACK.