[v4] mm/zswap: dstmem reuse optimizations and cleanups

[PATCH v4 2/6] mm/zswap: reuse dstmem when decompress

Posted by Chengming Zhou 2 years, 1 month ago

In the !zpool_can_sleep_mapped() case such as zsmalloc, we need to first
copy the entry->handle memory to a temporary memory, which is allocated
using kmalloc.

Obviously we can reuse the per-compressor dstmem to avoid allocating
every time, since it's percpu-compressor and protected in percpu mutex.

Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Chris Li <chrisl@kernel.org> (Google)
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/zswap.c | 44 ++++++++++++--------------------------------
 1 file changed, 12 insertions(+), 32 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 976f278aa507..6b872744e962 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1417,19 +1417,13 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	struct crypto_acomp_ctx *acomp_ctx;
 	struct zpool *pool = zswap_find_zpool(entry);
 	bool page_was_allocated;
-	u8 *src, *tmp = NULL;
+	u8 *src;
 	unsigned int dlen;
 	int ret;
 	struct writeback_control wbc = {
 		.sync_mode = WB_SYNC_NONE,
 	};
 
-	if (!zpool_can_sleep_mapped(pool)) {
-		tmp = kmalloc(PAGE_SIZE, GFP_KERNEL);
-		if (!tmp)
-			return -ENOMEM;
-	}
-
 	/* try to allocate swap cache page */
 	mpol = get_task_policy(current);
 	page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
@@ -1465,15 +1459,15 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	/* decompress */
 	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
 	dlen = PAGE_SIZE;
+	mutex_lock(acomp_ctx->mutex);
 
 	src = zpool_map_handle(pool, entry->handle, ZPOOL_MM_RO);
 	if (!zpool_can_sleep_mapped(pool)) {
-		memcpy(tmp, src, entry->length);
-		src = tmp;
+		memcpy(acomp_ctx->dstmem, src, entry->length);
+		src = acomp_ctx->dstmem;
 		zpool_unmap_handle(pool, entry->handle);
 	}
 
-	mutex_lock(acomp_ctx->mutex);
 	sg_init_one(&input, src, entry->length);
 	sg_init_table(&output, 1);
 	sg_set_page(&output, page, PAGE_SIZE, 0);
@@ -1482,9 +1476,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	dlen = acomp_ctx->req->dlen;
 	mutex_unlock(acomp_ctx->mutex);
 
-	if (!zpool_can_sleep_mapped(pool))
-		kfree(tmp);
-	else
+	if (zpool_can_sleep_mapped(pool))
 		zpool_unmap_handle(pool, entry->handle);
 
 	BUG_ON(ret);
@@ -1508,9 +1500,6 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	return ret;
 
 fail:
-	if (!zpool_can_sleep_mapped(pool))
-		kfree(tmp);
-
 	/*
 	 * If we get here because the page is already in swapcache, a
 	 * load may be happening concurrently. It is safe and okay to
@@ -1771,7 +1760,7 @@ bool zswap_load(struct folio *folio)
 	struct zswap_entry *entry;
 	struct scatterlist input, output;
 	struct crypto_acomp_ctx *acomp_ctx;
-	u8 *src, *dst, *tmp;
+	u8 *src, *dst;
 	struct zpool *zpool;
 	unsigned int dlen;
 	bool ret;
@@ -1796,26 +1785,19 @@ bool zswap_load(struct folio *folio)
 	}
 
 	zpool = zswap_find_zpool(entry);
-	if (!zpool_can_sleep_mapped(zpool)) {
-		tmp = kmalloc(entry->length, GFP_KERNEL);
-		if (!tmp) {
-			ret = false;
-			goto freeentry;
-		}
-	}
 
 	/* decompress */
 	dlen = PAGE_SIZE;
-	src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
+	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
+	mutex_lock(acomp_ctx->mutex);
 
+	src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
 	if (!zpool_can_sleep_mapped(zpool)) {
-		memcpy(tmp, src, entry->length);
-		src = tmp;
+		memcpy(acomp_ctx->dstmem, src, entry->length);
+		src = acomp_ctx->dstmem;
 		zpool_unmap_handle(zpool, entry->handle);
 	}
 
-	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
-	mutex_lock(acomp_ctx->mutex);
 	sg_init_one(&input, src, entry->length);
 	sg_init_table(&output, 1);
 	sg_set_page(&output, page, PAGE_SIZE, 0);
@@ -1826,15 +1808,13 @@ bool zswap_load(struct folio *folio)
 
 	if (zpool_can_sleep_mapped(zpool))
 		zpool_unmap_handle(zpool, entry->handle);
-	else
-		kfree(tmp);
 
 	ret = true;
 stats:
 	count_vm_event(ZSWPIN);
 	if (entry->objcg)
 		count_objcg_event(entry->objcg, ZSWPIN);
-freeentry:
+
 	spin_lock(&tree->lock);
 	if (ret && zswap_exclusive_loads_enabled) {
 		zswap_invalidate_entry(tree, entry);

-- 
b4 0.10.1

Re: [PATCH v4 2/6] mm/zswap: reuse dstmem when decompress

Posted by Barry Song 2 years, 1 month ago

On Wed, Dec 27, 2023 at 4:56 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> In the !zpool_can_sleep_mapped() case such as zsmalloc, we need to first
> copy the entry->handle memory to a temporary memory, which is allocated
> using kmalloc.
>
> Obviously we can reuse the per-compressor dstmem to avoid allocating
> every time, since it's percpu-compressor and protected in percpu mutex.

what is the benefit of this since we are actually increasing lock contention
by reusing this buffer between multiple compression and decompression
threads?

this mainly affects zsmalloc which can't sleep? do we have performance
data?

and it seems this patch is also negatively affecting z3fold and zbud.c
which actually don't need to allocate a tmp buffer.

>
> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
> Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
> Acked-by: Chris Li <chrisl@kernel.org> (Google)
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> ---
>  mm/zswap.c | 44 ++++++++++++--------------------------------
>  1 file changed, 12 insertions(+), 32 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 976f278aa507..6b872744e962 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1417,19 +1417,13 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>         struct crypto_acomp_ctx *acomp_ctx;
>         struct zpool *pool = zswap_find_zpool(entry);
>         bool page_was_allocated;
> -       u8 *src, *tmp = NULL;
> +       u8 *src;
>         unsigned int dlen;
>         int ret;
>         struct writeback_control wbc = {
>                 .sync_mode = WB_SYNC_NONE,
>         };
>
> -       if (!zpool_can_sleep_mapped(pool)) {
> -               tmp = kmalloc(PAGE_SIZE, GFP_KERNEL);
> -               if (!tmp)
> -                       return -ENOMEM;
> -       }
> -
>         /* try to allocate swap cache page */
>         mpol = get_task_policy(current);
>         page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
> @@ -1465,15 +1459,15 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>         /* decompress */
>         acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
>         dlen = PAGE_SIZE;
> +       mutex_lock(acomp_ctx->mutex);
>
>         src = zpool_map_handle(pool, entry->handle, ZPOOL_MM_RO);
>         if (!zpool_can_sleep_mapped(pool)) {
> -               memcpy(tmp, src, entry->length);
> -               src = tmp;
> +               memcpy(acomp_ctx->dstmem, src, entry->length);
> +               src = acomp_ctx->dstmem;
>                 zpool_unmap_handle(pool, entry->handle);
>         }
>
> -       mutex_lock(acomp_ctx->mutex);
>         sg_init_one(&input, src, entry->length);
>         sg_init_table(&output, 1);
>         sg_set_page(&output, page, PAGE_SIZE, 0);
> @@ -1482,9 +1476,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>         dlen = acomp_ctx->req->dlen;
>         mutex_unlock(acomp_ctx->mutex);
>
> -       if (!zpool_can_sleep_mapped(pool))
> -               kfree(tmp);
> -       else
> +       if (zpool_can_sleep_mapped(pool))
>                 zpool_unmap_handle(pool, entry->handle);
>
>         BUG_ON(ret);
> @@ -1508,9 +1500,6 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>         return ret;
>
>  fail:
> -       if (!zpool_can_sleep_mapped(pool))
> -               kfree(tmp);
> -
>         /*
>          * If we get here because the page is already in swapcache, a
>          * load may be happening concurrently. It is safe and okay to
> @@ -1771,7 +1760,7 @@ bool zswap_load(struct folio *folio)
>         struct zswap_entry *entry;
>         struct scatterlist input, output;
>         struct crypto_acomp_ctx *acomp_ctx;
> -       u8 *src, *dst, *tmp;
> +       u8 *src, *dst;
>         struct zpool *zpool;
>         unsigned int dlen;
>         bool ret;
> @@ -1796,26 +1785,19 @@ bool zswap_load(struct folio *folio)
>         }
>
>         zpool = zswap_find_zpool(entry);
> -       if (!zpool_can_sleep_mapped(zpool)) {
> -               tmp = kmalloc(entry->length, GFP_KERNEL);
> -               if (!tmp) {
> -                       ret = false;
> -                       goto freeentry;
> -               }
> -       }
>
>         /* decompress */
>         dlen = PAGE_SIZE;
> -       src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
> +       acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> +       mutex_lock(acomp_ctx->mutex);
>
> +       src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
>         if (!zpool_can_sleep_mapped(zpool)) {
> -               memcpy(tmp, src, entry->length);
> -               src = tmp;
> +               memcpy(acomp_ctx->dstmem, src, entry->length);
> +               src = acomp_ctx->dstmem;
>                 zpool_unmap_handle(zpool, entry->handle);
>         }
>
> -       acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> -       mutex_lock(acomp_ctx->mutex);
>         sg_init_one(&input, src, entry->length);
>         sg_init_table(&output, 1);
>         sg_set_page(&output, page, PAGE_SIZE, 0);
> @@ -1826,15 +1808,13 @@ bool zswap_load(struct folio *folio)
>
>         if (zpool_can_sleep_mapped(zpool))
>                 zpool_unmap_handle(zpool, entry->handle);
> -       else
> -               kfree(tmp);
>
>         ret = true;
>  stats:
>         count_vm_event(ZSWPIN);
>         if (entry->objcg)
>                 count_objcg_event(entry->objcg, ZSWPIN);
> -freeentry:
> +
>         spin_lock(&tree->lock);
>         if (ret && zswap_exclusive_loads_enabled) {
>                 zswap_invalidate_entry(tree, entry);
>
> --
> b4 0.10.1
>

Re: [PATCH v4 2/6] mm/zswap: reuse dstmem when decompress

Posted by Chengming Zhou 2 years, 1 month ago

On 2023/12/27 09:24, Barry Song wrote:
> On Wed, Dec 27, 2023 at 4:56 AM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
>>
>> In the !zpool_can_sleep_mapped() case such as zsmalloc, we need to first
>> copy the entry->handle memory to a temporary memory, which is allocated
>> using kmalloc.
>>
>> Obviously we can reuse the per-compressor dstmem to avoid allocating
>> every time, since it's percpu-compressor and protected in percpu mutex.
> 
> what is the benefit of this since we are actually increasing lock contention
> by reusing this buffer between multiple compression and decompression
> threads?

This mutex is already reused in all compress/decompress paths even before
the reuse optimization. I think the best way maybe to use separate crypto_acomp
for compression and decompression.

Do you think the lock contention will be increased because we now put zpool_map_handle()
and memcpy() in the lock section? Actually, we can move zpool_map_handle() before
the lock section if needed, but that memcpy() should be protected in lock section.

> 
> this mainly affects zsmalloc which can't sleep? do we have performance
> data?

Right, last time when test I remembered there is very minor performance difference.
The main benefit here is to simply the code much and delete one failure case.

> 
> and it seems this patch is also negatively affecting z3fold and zbud.c
> which actually don't need to allocate a tmp buffer.
> 

As for z3fold and zbud, the influence should be much less since the only difference
here is zpool_map_handle() moved in lock section, which could be moved out if needed
as noted above. And also no evident performance regression in the testing.

Thanks.

>>
>> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
>> Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
>> Acked-by: Chris Li <chrisl@kernel.org> (Google)
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>> ---
>>  mm/zswap.c | 44 ++++++++++++--------------------------------
>>  1 file changed, 12 insertions(+), 32 deletions(-)
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index 976f278aa507..6b872744e962 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -1417,19 +1417,13 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>>         struct crypto_acomp_ctx *acomp_ctx;
>>         struct zpool *pool = zswap_find_zpool(entry);
>>         bool page_was_allocated;
>> -       u8 *src, *tmp = NULL;
>> +       u8 *src;
>>         unsigned int dlen;
>>         int ret;
>>         struct writeback_control wbc = {
>>                 .sync_mode = WB_SYNC_NONE,
>>         };
>>
>> -       if (!zpool_can_sleep_mapped(pool)) {
>> -               tmp = kmalloc(PAGE_SIZE, GFP_KERNEL);
>> -               if (!tmp)
>> -                       return -ENOMEM;
>> -       }
>> -
>>         /* try to allocate swap cache page */
>>         mpol = get_task_policy(current);
>>         page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
>> @@ -1465,15 +1459,15 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>>         /* decompress */
>>         acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
>>         dlen = PAGE_SIZE;
>> +       mutex_lock(acomp_ctx->mutex);
>>
>>         src = zpool_map_handle(pool, entry->handle, ZPOOL_MM_RO);
>>         if (!zpool_can_sleep_mapped(pool)) {
>> -               memcpy(tmp, src, entry->length);
>> -               src = tmp;
>> +               memcpy(acomp_ctx->dstmem, src, entry->length);
>> +               src = acomp_ctx->dstmem;
>>                 zpool_unmap_handle(pool, entry->handle);
>>         }
>>
>> -       mutex_lock(acomp_ctx->mutex);
>>         sg_init_one(&input, src, entry->length);
>>         sg_init_table(&output, 1);
>>         sg_set_page(&output, page, PAGE_SIZE, 0);
>> @@ -1482,9 +1476,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>>         dlen = acomp_ctx->req->dlen;
>>         mutex_unlock(acomp_ctx->mutex);
>>
>> -       if (!zpool_can_sleep_mapped(pool))
>> -               kfree(tmp);
>> -       else
>> +       if (zpool_can_sleep_mapped(pool))
>>                 zpool_unmap_handle(pool, entry->handle);
>>
>>         BUG_ON(ret);
>> @@ -1508,9 +1500,6 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>>         return ret;
>>
>>  fail:
>> -       if (!zpool_can_sleep_mapped(pool))
>> -               kfree(tmp);
>> -
>>         /*
>>          * If we get here because the page is already in swapcache, a
>>          * load may be happening concurrently. It is safe and okay to
>> @@ -1771,7 +1760,7 @@ bool zswap_load(struct folio *folio)
>>         struct zswap_entry *entry;
>>         struct scatterlist input, output;
>>         struct crypto_acomp_ctx *acomp_ctx;
>> -       u8 *src, *dst, *tmp;
>> +       u8 *src, *dst;
>>         struct zpool *zpool;
>>         unsigned int dlen;
>>         bool ret;
>> @@ -1796,26 +1785,19 @@ bool zswap_load(struct folio *folio)
>>         }
>>
>>         zpool = zswap_find_zpool(entry);
>> -       if (!zpool_can_sleep_mapped(zpool)) {
>> -               tmp = kmalloc(entry->length, GFP_KERNEL);
>> -               if (!tmp) {
>> -                       ret = false;
>> -                       goto freeentry;
>> -               }
>> -       }
>>
>>         /* decompress */
>>         dlen = PAGE_SIZE;
>> -       src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
>> +       acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
>> +       mutex_lock(acomp_ctx->mutex);
>>
>> +       src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
>>         if (!zpool_can_sleep_mapped(zpool)) {
>> -               memcpy(tmp, src, entry->length);
>> -               src = tmp;
>> +               memcpy(acomp_ctx->dstmem, src, entry->length);
>> +               src = acomp_ctx->dstmem;
>>                 zpool_unmap_handle(zpool, entry->handle);
>>         }
>>
>> -       acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
>> -       mutex_lock(acomp_ctx->mutex);
>>         sg_init_one(&input, src, entry->length);
>>         sg_init_table(&output, 1);
>>         sg_set_page(&output, page, PAGE_SIZE, 0);
>> @@ -1826,15 +1808,13 @@ bool zswap_load(struct folio *folio)
>>
>>         if (zpool_can_sleep_mapped(zpool))
>>                 zpool_unmap_handle(zpool, entry->handle);
>> -       else
>> -               kfree(tmp);
>>
>>         ret = true;
>>  stats:
>>         count_vm_event(ZSWPIN);
>>         if (entry->objcg)
>>                 count_objcg_event(entry->objcg, ZSWPIN);
>> -freeentry:
>> +
>>         spin_lock(&tree->lock);
>>         if (ret && zswap_exclusive_loads_enabled) {
>>                 zswap_invalidate_entry(tree, entry);
>>
>> --
>> b4 0.10.1
>>

Re: [PATCH v4 2/6] mm/zswap: reuse dstmem when decompress

Posted by Barry Song 2 years, 1 month ago

On Wed, Dec 27, 2023 at 7:32 PM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> On 2023/12/27 09:24, Barry Song wrote:
> > On Wed, Dec 27, 2023 at 4:56 AM Chengming Zhou
> > <zhouchengming@bytedance.com> wrote:
> >>
> >> In the !zpool_can_sleep_mapped() case such as zsmalloc, we need to first
> >> copy the entry->handle memory to a temporary memory, which is allocated
> >> using kmalloc.
> >>
> >> Obviously we can reuse the per-compressor dstmem to avoid allocating
> >> every time, since it's percpu-compressor and protected in percpu mutex.
> >
> > what is the benefit of this since we are actually increasing lock contention
> > by reusing this buffer between multiple compression and decompression
> > threads?
>
> This mutex is already reused in all compress/decompress paths even before
> the reuse optimization. I think the best way maybe to use separate crypto_acomp
> for compression and decompression.
>
> Do you think the lock contention will be increased because we now put zpool_map_handle()
> and memcpy() in the lock section? Actually, we can move zpool_map_handle() before
> the lock section if needed, but that memcpy() should be protected in lock section.
>
> >
> > this mainly affects zsmalloc which can't sleep? do we have performance
> > data?
>
> Right, last time when test I remembered there is very minor performance difference.
> The main benefit here is to simply the code much and delete one failure case.

ok.

For the majority of hardware, people are using CPU-based
compression/decompression,
there is no chance they will sleep. Thus, all
compression/decompression can be done
in a zpool_map section, there is *NO* need to copy at all! Only for
those hardware which
can provide a HW-accelerator to offload CPU, crypto will actually wait
for completion by

static inline int crypto_wait_req(int err, struct crypto_wait *wait)
{
        switch (err) {
        case -EINPROGRESS:
        case -EBUSY:
                wait_for_completion(&wait->completion);
                reinit_completion(&wait->completion);
                err = wait->err;
                break;
        }

        return err;
}

for CPU-based alg, we have completed the compr/decompr within
crypto_acomp_decompress()
synchronously. they won't return EINPROGRESS, EBUSY.

The problem is that crypto_acomp won't expose this information to its
users. if it does,
we can use this info, we will totally avoid the code of copying
zsmalloc's data to a tmp
buffer for the most majority users of zswap.

But I am not sure if we can find a way to convince Herbert(+To)  :-)

>
> >
> > and it seems this patch is also negatively affecting z3fold and zbud.c
> > which actually don't need to allocate a tmp buffer.
> >
>
> As for z3fold and zbud, the influence should be much less since the only difference
> here is zpool_map_handle() moved in lock section, which could be moved out if needed
> as noted above. And also no evident performance regression in the testing.
>
> Thanks.
>
> >>
> >> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
> >> Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
> >> Acked-by: Chris Li <chrisl@kernel.org> (Google)
> >> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> >> ---
> >>  mm/zswap.c | 44 ++++++++++++--------------------------------
> >>  1 file changed, 12 insertions(+), 32 deletions(-)
> >>
> >> diff --git a/mm/zswap.c b/mm/zswap.c
> >> index 976f278aa507..6b872744e962 100644
> >> --- a/mm/zswap.c
> >> +++ b/mm/zswap.c
> >> @@ -1417,19 +1417,13 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >>         struct crypto_acomp_ctx *acomp_ctx;
> >>         struct zpool *pool = zswap_find_zpool(entry);
> >>         bool page_was_allocated;
> >> -       u8 *src, *tmp = NULL;
> >> +       u8 *src;
> >>         unsigned int dlen;
> >>         int ret;
> >>         struct writeback_control wbc = {
> >>                 .sync_mode = WB_SYNC_NONE,
> >>         };
> >>
> >> -       if (!zpool_can_sleep_mapped(pool)) {
> >> -               tmp = kmalloc(PAGE_SIZE, GFP_KERNEL);
> >> -               if (!tmp)
> >> -                       return -ENOMEM;
> >> -       }
> >> -
> >>         /* try to allocate swap cache page */
> >>         mpol = get_task_policy(current);
> >>         page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
> >> @@ -1465,15 +1459,15 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >>         /* decompress */
> >>         acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> >>         dlen = PAGE_SIZE;
> >> +       mutex_lock(acomp_ctx->mutex);
> >>
> >>         src = zpool_map_handle(pool, entry->handle, ZPOOL_MM_RO);
> >>         if (!zpool_can_sleep_mapped(pool)) {
> >> -               memcpy(tmp, src, entry->length);
> >> -               src = tmp;
> >> +               memcpy(acomp_ctx->dstmem, src, entry->length);
> >> +               src = acomp_ctx->dstmem;
> >>                 zpool_unmap_handle(pool, entry->handle);
> >>         }
> >>
> >> -       mutex_lock(acomp_ctx->mutex);
> >>         sg_init_one(&input, src, entry->length);
> >>         sg_init_table(&output, 1);
> >>         sg_set_page(&output, page, PAGE_SIZE, 0);
> >> @@ -1482,9 +1476,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >>         dlen = acomp_ctx->req->dlen;
> >>         mutex_unlock(acomp_ctx->mutex);
> >>
> >> -       if (!zpool_can_sleep_mapped(pool))
> >> -               kfree(tmp);
> >> -       else
> >> +       if (zpool_can_sleep_mapped(pool))
> >>                 zpool_unmap_handle(pool, entry->handle);
> >>
> >>         BUG_ON(ret);
> >> @@ -1508,9 +1500,6 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >>         return ret;
> >>
> >>  fail:
> >> -       if (!zpool_can_sleep_mapped(pool))
> >> -               kfree(tmp);
> >> -
> >>         /*
> >>          * If we get here because the page is already in swapcache, a
> >>          * load may be happening concurrently. It is safe and okay to
> >> @@ -1771,7 +1760,7 @@ bool zswap_load(struct folio *folio)
> >>         struct zswap_entry *entry;
> >>         struct scatterlist input, output;
> >>         struct crypto_acomp_ctx *acomp_ctx;
> >> -       u8 *src, *dst, *tmp;
> >> +       u8 *src, *dst;
> >>         struct zpool *zpool;
> >>         unsigned int dlen;
> >>         bool ret;
> >> @@ -1796,26 +1785,19 @@ bool zswap_load(struct folio *folio)
> >>         }
> >>
> >>         zpool = zswap_find_zpool(entry);
> >> -       if (!zpool_can_sleep_mapped(zpool)) {
> >> -               tmp = kmalloc(entry->length, GFP_KERNEL);
> >> -               if (!tmp) {
> >> -                       ret = false;
> >> -                       goto freeentry;
> >> -               }
> >> -       }
> >>
> >>         /* decompress */
> >>         dlen = PAGE_SIZE;
> >> -       src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
> >> +       acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> >> +       mutex_lock(acomp_ctx->mutex);
> >>
> >> +       src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
> >>         if (!zpool_can_sleep_mapped(zpool)) {
> >> -               memcpy(tmp, src, entry->length);
> >> -               src = tmp;
> >> +               memcpy(acomp_ctx->dstmem, src, entry->length);
> >> +               src = acomp_ctx->dstmem;
> >>                 zpool_unmap_handle(zpool, entry->handle);
> >>         }
> >>
> >> -       acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> >> -       mutex_lock(acomp_ctx->mutex);
> >>         sg_init_one(&input, src, entry->length);
> >>         sg_init_table(&output, 1);
> >>         sg_set_page(&output, page, PAGE_SIZE, 0);
> >> @@ -1826,15 +1808,13 @@ bool zswap_load(struct folio *folio)
> >>
> >>         if (zpool_can_sleep_mapped(zpool))
> >>                 zpool_unmap_handle(zpool, entry->handle);
> >> -       else
> >> -               kfree(tmp);
> >>
> >>         ret = true;
> >>  stats:
> >>         count_vm_event(ZSWPIN);
> >>         if (entry->objcg)
> >>                 count_objcg_event(entry->objcg, ZSWPIN);
> >> -freeentry:
> >> +
> >>         spin_lock(&tree->lock);
> >>         if (ret && zswap_exclusive_loads_enabled) {
> >>                 zswap_invalidate_entry(tree, entry);
> >>
> >> --
> >> b4 0.10.1
> >>

Thanks
Barry

Re: [PATCH v4 2/6] mm/zswap: reuse dstmem when decompress

Posted by Herbert Xu 2 years, 1 month ago

On Thu, Dec 28, 2023 at 09:03:32PM +1300, Barry Song wrote:
>
> for CPU-based alg, we have completed the compr/decompr within
> crypto_acomp_decompress()
> synchronously. they won't return EINPROGRESS, EBUSY.
> 
> The problem is that crypto_acomp won't expose this information to its
> users. if it does,
> we can use this info, we will totally avoid the code of copying
> zsmalloc's data to a tmp
> buffer for the most majority users of zswap.
> 
> But I am not sure if we can find a way to convince Herbert(+To)  :-)

What would you like to expose? The async status of the underlying
algorithm?

We could certainly do that.  But I wonder if it might actually be
better for you to allocate a second sync-only algorithm for such
cases.  I'd like to see some real numbers.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

[PATCH RFC 1/2] crypto: introduce acomp_is_async to expose if a acomp has a scomp backend

Posted by Barry Song 2 years, 1 month ago

From: Barry Song <v-songbaohua@oppo.com>

Almost all CPU-based compressors/decompressors are actually synchronous
though they support acomp APIs. While some chips have hardware-based
accelerators to offload CPU's work such as hisilicon and intel/qat/,
their drivers are working in async mode.
Letting acomp's users know exactly if the acomp is really async will
help users know if the compression and decompression procedure can
sleep and make their decisions accordingly.

Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 crypto/acompress.c         | 8 ++++++++
 include/crypto/acompress.h | 9 +++++++++
 2 files changed, 17 insertions(+)

diff --git a/crypto/acompress.c b/crypto/acompress.c
index 1c682810a484..99118e879a4a 100644
--- a/crypto/acompress.c
+++ b/crypto/acompress.c
@@ -152,6 +152,14 @@ struct crypto_acomp *crypto_alloc_acomp_node(const char *alg_name, u32 type,
 }
 EXPORT_SYMBOL_GPL(crypto_alloc_acomp_node);
 
+bool acomp_is_async(struct crypto_acomp *acomp)
+{
+	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp);
+
+	return tfm->__crt_alg->cra_type == &crypto_acomp_type;
+}
+EXPORT_SYMBOL_GPL(acomp_is_async);
+
 struct acomp_req *acomp_request_alloc(struct crypto_acomp *acomp)
 {
 	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp);
diff --git a/include/crypto/acompress.h b/include/crypto/acompress.h
index 574cffc90730..5831080479e9 100644
--- a/include/crypto/acompress.h
+++ b/include/crypto/acompress.h
@@ -195,6 +195,15 @@ static inline int crypto_has_acomp(const char *alg_name, u32 type, u32 mask)
  */
 struct acomp_req *acomp_request_alloc(struct crypto_acomp *tfm);
 
+/**
+ * acomp_is_async() -- check if an acomp is asynchronous(can sleep)
+ *
+ * @tfm:	ACOMPRESS tfm handle allocated with crypto_alloc_acomp()
+ *
+ * Return:	true if the acomp is asynchronous, otherwise, false
+ */
+bool acomp_is_async(struct crypto_acomp *tfm);
+
 /**
  * acomp_request_free() -- zeroize and free asynchronous (de)compression
  *			   request as well as the output buffer if allocated
-- 
2.34.1

Re: [PATCH v4 2/6] mm/zswap: reuse dstmem when decompress

Posted by Barry Song 2 years, 1 month ago

>>
>> for CPU-based alg, we have completed the compr/decompr within
>> crypto_acomp_decompress()
>> synchronously. they won't return EINPROGRESS, EBUSY.
>>
>> The problem is that crypto_acomp won't expose this information to its
>> users. if it does,
>> we can use this info, we will totally avoid the code of copying
>> zsmalloc's data to a tmp
>> buffer for the most majority users of zswap.
>>
>> But I am not sure if we can find a way to convince Herbert(+To)  :-)

> What would you like to expose? The async status of the underlying
> algorithm?

Right. followed by a rfc patchset, please help take a look.

> 
> We could certainly do that.  But I wonder if it might actually be
> better for you to allocate a second sync-only algorithm for such
> cases.  I'd like to see some real numbers.

some hardware might want to use an accelerator to help offload CPU's
work. their drivers are working in async mode, for example, hisilicon
and intel.

I don't have the exact number we can save by removing the redundant
memcpy, nor do i have a proper hardware to test and get the number.
As Chengming is actually working in zswap, i wonder if you can test
my patches and post some data?

> 
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>

Thanks
Barry

Re: [PATCH v4 2/6] mm/zswap: reuse dstmem when decompress

Posted by Herbert Xu 2 years ago

On Wed, Jan 03, 2024 at 03:57:59PM +1300, Barry Song wrote:
>
> > We could certainly do that.  But I wonder if it might actually be
> > better for you to allocate a second sync-only algorithm for such
> > cases.  I'd like to see some real numbers.
> 
> some hardware might want to use an accelerator to help offload CPU's
> work. their drivers are working in async mode, for example, hisilicon
> and intel.
> 
> I don't have the exact number we can save by removing the redundant
> memcpy, nor do i have a proper hardware to test and get the number.
> As Chengming is actually working in zswap, i wonder if you can test
> my patches and post some data?

I don't have the hardware to test this.  Since you're proposing
the change, please test it to ensure that we're not adding cruft
to the API that's actually detrimental to performance.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH v4 2/6] mm/zswap: reuse dstmem when decompress

Posted by Barry Song 2 years ago

On Thu, Jan 25, 2024 at 5:41 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Wed, Jan 03, 2024 at 03:57:59PM +1300, Barry Song wrote:
> >
> > > We could certainly do that.  But I wonder if it might actually be
> > > better for you to allocate a second sync-only algorithm for such
> > > cases.  I'd like to see some real numbers.
> >
> > some hardware might want to use an accelerator to help offload CPU's
> > work. their drivers are working in async mode, for example, hisilicon
> > and intel.
> >
> > I don't have the exact number we can save by removing the redundant
> > memcpy, nor do i have a proper hardware to test and get the number.
> > As Chengming is actually working in zswap, i wonder if you can test
> > my patches and post some data?
>
> I don't have the hardware to test this.  Since you're proposing
> the change, please test it to ensure that we're not adding cruft
> to the API that's actually detrimental to performance.

Understood.
sorry for the grammatical errors, i was actually asking for
chengming's help for testing as
he had hardware and was actively working on optimizing  zswap.
and he has already tested and sent the performance data which I quoted
in the formal
patchset.

>
> Thanks,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Thanks
Barry

[PATCH RFC 2/2] mm/zswap: remove the memcpy if acomp is not asynchronous

Posted by Barry Song 2 years, 1 month ago

From: Barry Song <v-songbaohua@oppo.com>

Most compressors are actually CPU-based, they won't sleep during
decompression. we should be able to remove the redundant memcpy
for them.

Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/zswap.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index ca25b676048e..36898614ebcc 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -168,6 +168,7 @@ struct crypto_acomp_ctx {
 	struct crypto_wait wait;
 	u8 *buffer;
 	struct mutex mutex;
+	bool is_async; /* if acomp can sleep */
 };
 
 /*
@@ -716,6 +717,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 		goto acomp_fail;
 	}
 	acomp_ctx->acomp = acomp;
+	acomp_ctx->is_async = acomp_is_async(acomp);
 
 	req = acomp_request_alloc(acomp_ctx->acomp);
 	if (!req) {
@@ -1370,7 +1372,7 @@ static void __zswap_load(struct zswap_entry *entry, struct page *page)
 	mutex_lock(&acomp_ctx->mutex);
 
 	src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
-	if (!zpool_can_sleep_mapped(zpool)) {
+	if (acomp_ctx->is_async && !zpool_can_sleep_mapped(zpool)) {
 		memcpy(acomp_ctx->buffer, src, entry->length);
 		src = acomp_ctx->buffer;
 		zpool_unmap_handle(zpool, entry->handle);
@@ -1384,7 +1386,7 @@ static void __zswap_load(struct zswap_entry *entry, struct page *page)
 	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
 	mutex_unlock(&acomp_ctx->mutex);
 
-	if (zpool_can_sleep_mapped(zpool))
+	if (!acomp_ctx->is_async || zpool_can_sleep_mapped(zpool))
 		zpool_unmap_handle(zpool, entry->handle);
 }
 
-- 
2.34.1

Re: [PATCH v4 2/6] mm/zswap: reuse dstmem when decompress

Posted by Chengming Zhou 2 years, 1 month ago

On 2023/12/28 16:03, Barry Song wrote:
> On Wed, Dec 27, 2023 at 7:32 PM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
>>
>> On 2023/12/27 09:24, Barry Song wrote:
>>> On Wed, Dec 27, 2023 at 4:56 AM Chengming Zhou
>>> <zhouchengming@bytedance.com> wrote:
>>>>
>>>> In the !zpool_can_sleep_mapped() case such as zsmalloc, we need to first
>>>> copy the entry->handle memory to a temporary memory, which is allocated
>>>> using kmalloc.
>>>>
>>>> Obviously we can reuse the per-compressor dstmem to avoid allocating
>>>> every time, since it's percpu-compressor and protected in percpu mutex.
>>>
>>> what is the benefit of this since we are actually increasing lock contention
>>> by reusing this buffer between multiple compression and decompression
>>> threads?
>>
>> This mutex is already reused in all compress/decompress paths even before
>> the reuse optimization. I think the best way maybe to use separate crypto_acomp
>> for compression and decompression.
>>
>> Do you think the lock contention will be increased because we now put zpool_map_handle()
>> and memcpy() in the lock section? Actually, we can move zpool_map_handle() before
>> the lock section if needed, but that memcpy() should be protected in lock section.
>>
>>>
>>> this mainly affects zsmalloc which can't sleep? do we have performance
>>> data?
>>
>> Right, last time when test I remembered there is very minor performance difference.
>> The main benefit here is to simply the code much and delete one failure case.
> 
> ok.
> 
> For the majority of hardware, people are using CPU-based
> compression/decompression,
> there is no chance they will sleep. Thus, all
> compression/decompression can be done
> in a zpool_map section, there is *NO* need to copy at all! Only for

Yes, very good for zsmalloc.

> those hardware which
> can provide a HW-accelerator to offload CPU, crypto will actually wait
> for completion by
> 
> static inline int crypto_wait_req(int err, struct crypto_wait *wait)
> {
>         switch (err) {
>         case -EINPROGRESS:
>         case -EBUSY:
>                 wait_for_completion(&wait->completion);
>                 reinit_completion(&wait->completion);
>                 err = wait->err;
>                 break;
>         }
> 
>         return err;
> }
> 
> for CPU-based alg, we have completed the compr/decompr within
> crypto_acomp_decompress()
> synchronously. they won't return EINPROGRESS, EBUSY.

Ok, this is useful to know.

> 
> The problem is that crypto_acomp won't expose this information to its
> users. if it does,
> we can use this info, we will totally avoid the code of copying
> zsmalloc's data to a tmp
> buffer for the most majority users of zswap.

Agree, I think it's worthwhile to export, so zsmalloc users don't need to
prepare the temporary buffer and copy in the majority case.

Thanks!

> 
> But I am not sure if we can find a way to convince Herbert(+To)  :-)
>