This patch simplifies the zswap_pool's per-CPU acomp_ctx resource
management. Similar to the per-CPU acomp_ctx itself, the per-CPU
acomp_ctx's resources' (acomp, req, buffer) lifetime will also be from
pool creation to pool deletion. These resources will persist through CPU
hotplug operations instead of being destroyed/recreated. The
zswap_cpu_comp_dead() teardown callback has been deleted from the call
to cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE). As a result, CPU
offline hotplug operations will be no-ops as far as the acomp_ctx
resources are concerned.
This commit refactors the code from zswap_cpu_comp_dead() into a
new function acomp_ctx_dealloc() that is called to clean up acomp_ctx
resources from:
1) zswap_cpu_comp_prepare() when an error is encountered,
2) zswap_pool_create() when an error is encountered, and
3) from zswap_pool_destroy().
The main benefit of using the CPU hotplug multi state instance startup
callback to allocate the acomp_ctx resources is that it prevents the
cores from being offlined until the multi state instance addition call
returns.
From Documentation/core-api/cpu_hotplug.rst:
"The node list add/remove operations and the callback invocations are
serialized against CPU hotplug operations."
Furthermore, zswap_[de]compress() cannot contend with
zswap_cpu_comp_prepare() because:
- During pool creation/deletion, the pool is not in the zswap_pools
list.
- During CPU hot[un]plug, the CPU is not yet online, as Yosry pointed
out. zswap_cpu_comp_prepare() will be run on a control CPU,
since CPUHP_MM_ZSWP_POOL_PREPARE is in the PREPARE section of "enum
cpuhp_state". Thanks Yosry for sharing this observation!
In both these cases, any recursions into zswap reclaim from
zswap_cpu_comp_prepare() will be handled by the old pool.
The above two observations enable the following simplifications:
1) zswap_cpu_comp_prepare(): CPU cannot be offlined. Reclaim cannot use
the pool. Considerations for mutex init/locking and handling
subsequent CPU hotplug online-offline-online:
Should we lock the mutex of current CPU's acomp_ctx from start to
end? It doesn't seem like this is required. The CPU hotplug
operations acquire a "cpuhp_state_mutex" before proceeding, hence
they are serialized against CPU hotplug operations.
If the process gets migrated while zswap_cpu_comp_prepare() is
running, it will complete on the new CPU. In case of failures, we
pass the acomp_ctx pointer obtained at the start of
zswap_cpu_comp_prepare() to acomp_ctx_dealloc(), which again, can
only undergo migration. There appear to be no contention scenarios
that might cause inconsistent values of acomp_ctx's members. Hence,
it seems there is no need for mutex_lock(&acomp_ctx->mutex) in
zswap_cpu_comp_prepare().
Since the pool is not yet on zswap_pools list, we don't need to
initialize the per-CPU acomp_ctx mutex in zswap_pool_create(). This
has been restored to occur in zswap_cpu_comp_prepare().
zswap_cpu_comp_prepare() checks upfront if acomp_ctx->acomp is
valid. If so, it returns success. This should handle any CPU
hotplug online-offline transitions after pool creation is done.
2) CPU offline vis-a-vis zswap ops: Let's suppose the process is
migrated to another CPU before the current CPU is dysfunctional. If
zswap_[de]compress() holds the acomp_ctx->mutex lock of the offlined
CPU, that mutex will be released once it completes on the new
CPU. Since there is no teardown callback, there is no possibility of
UAF.
3) Pool creation/deletion and process migration to another CPU:
- During pool creation/deletion, the pool is not in the zswap_pools
list. Hence it cannot contend with zswap ops on that CPU. However,
the process can get migrated.
Pool creation --> zswap_cpu_comp_prepare()
--> process migrated:
* CPU offline: no-op.
* zswap_cpu_comp_prepare() continues
to run on the new CPU to finish
allocating acomp_ctx resources for
the offlined CPU.
Pool deletion --> acomp_ctx_dealloc()
--> process migrated:
* CPU offline: no-op.
* acomp_ctx_dealloc() continues
to run on the new CPU to finish
de-allocating acomp_ctx resources
for the offlined CPU.
4) Pool deletion vis-a-vis CPU onlining:
The call to cpuhp_state_remove_instance() cannot race with
zswap_cpu_comp_prepare() because of hotplug synchronization.
This patch deletes acomp_ctx_get_cpu_lock()/acomp_ctx_put_unlock().
Instead, zswap_[de]compress() directly call
mutex_[un]lock(&acomp_ctx->mutex).
The per-CPU memory cost of not deleting the acomp_ctx resources upon CPU
offlining, and only deleting them when the pool is destroyed, is as
follows, on x86_64:
IAA with 8 dst buffers for batching: 64.34 KB
Software compressors with 1 dst buffer: 8.28 KB
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
mm/zswap.c | 164 +++++++++++++++++++++--------------------------------
1 file changed, 64 insertions(+), 100 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index 4897ed689b9f..87d50786f61f 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -242,6 +242,20 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
**********************************/
static void __zswap_pool_empty(struct percpu_ref *ref);
+static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
+{
+ if (IS_ERR_OR_NULL(acomp_ctx))
+ return;
+
+ if (!IS_ERR_OR_NULL(acomp_ctx->req))
+ acomp_request_free(acomp_ctx->req);
+
+ if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
+ crypto_free_acomp(acomp_ctx->acomp);
+
+ kfree(acomp_ctx->buffer);
+}
+
static struct zswap_pool *zswap_pool_create(char *compressor)
{
struct zswap_pool *pool;
@@ -263,19 +277,26 @@ static struct zswap_pool *zswap_pool_create(char *compressor)
strscpy(pool->tfm_name, compressor, sizeof(pool->tfm_name));
- pool->acomp_ctx = alloc_percpu(*pool->acomp_ctx);
+ /* Many things rely on the zero-initialization. */
+ pool->acomp_ctx = alloc_percpu_gfp(*pool->acomp_ctx,
+ GFP_KERNEL | __GFP_ZERO);
if (!pool->acomp_ctx) {
pr_err("percpu alloc failed\n");
goto error;
}
- for_each_possible_cpu(cpu)
- mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex);
-
+ /*
+ * This is serialized against CPU hotplug operations. Hence, cores
+ * cannot be offlined until this finishes.
+ * In case of errors, we need to goto "ref_fail" instead of "error"
+ * because there is no teardown callback registered anymore, for
+ * cpuhp_state_add_instance() to de-allocate resources as it rolls back
+ * state on cores before the CPU on which error was encountered.
+ */
ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
&pool->node);
if (ret)
- goto error;
+ goto ref_fail;
/* being the current pool takes 1 ref; this func expects the
* caller to always add the new pool as the current pool
@@ -292,6 +313,9 @@ static struct zswap_pool *zswap_pool_create(char *compressor)
ref_fail:
cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
+
+ for_each_possible_cpu(cpu)
+ acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
error:
if (pool->acomp_ctx)
free_percpu(pool->acomp_ctx);
@@ -322,9 +346,15 @@ static struct zswap_pool *__zswap_pool_create_fallback(void)
static void zswap_pool_destroy(struct zswap_pool *pool)
{
+ int cpu;
+
zswap_pool_debug("destroying", pool);
cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
+
+ for_each_possible_cpu(cpu)
+ acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
+
free_percpu(pool->acomp_ctx);
zs_destroy_pool(pool->zs_pool);
@@ -736,39 +766,35 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
{
struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
- struct crypto_acomp *acomp = NULL;
- struct acomp_req *req = NULL;
- u8 *buffer = NULL;
- int ret;
+ int ret = -ENOMEM;
- buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
- if (!buffer) {
- ret = -ENOMEM;
- goto fail;
- }
+ /*
+ * To handle cases where the CPU goes through online-offline-online
+ * transitions, we return if the acomp_ctx has already been initialized.
+ */
+ if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
+ return 0;
- acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
- if (IS_ERR(acomp)) {
+ acomp_ctx->buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
+ if (!acomp_ctx->buffer)
+ return ret;
+
+ acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
+ if (IS_ERR(acomp_ctx->acomp)) {
pr_err("could not alloc crypto acomp %s : %ld\n",
- pool->tfm_name, PTR_ERR(acomp));
- ret = PTR_ERR(acomp);
+ pool->tfm_name, PTR_ERR(acomp_ctx->acomp));
+ ret = PTR_ERR(acomp_ctx->acomp);
goto fail;
}
+ acomp_ctx->is_sleepable = acomp_is_async(acomp_ctx->acomp);
- req = acomp_request_alloc(acomp);
- if (!req) {
+ acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
+ if (!acomp_ctx->req) {
pr_err("could not alloc crypto acomp_request %s\n",
pool->tfm_name);
- ret = -ENOMEM;
goto fail;
}
- /*
- * Only hold the mutex after completing allocations, otherwise we may
- * recurse into zswap through reclaim and attempt to hold the mutex
- * again resulting in a deadlock.
- */
- mutex_lock(&acomp_ctx->mutex);
crypto_init_wait(&acomp_ctx->wait);
/*
@@ -776,84 +802,19 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
* crypto_wait_req(); if the backend of acomp is scomp, the callback
* won't be called, crypto_wait_req() will return without blocking.
*/
- acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+ acomp_request_set_callback(acomp_ctx->req, CRYPTO_TFM_REQ_MAY_BACKLOG,
crypto_req_done, &acomp_ctx->wait);
- acomp_ctx->buffer = buffer;
- acomp_ctx->acomp = acomp;
- acomp_ctx->is_sleepable = acomp_is_async(acomp);
- acomp_ctx->req = req;
-
acomp_request_set_unit_size(acomp_ctx->req, PAGE_SIZE);
- mutex_unlock(&acomp_ctx->mutex);
+ mutex_init(&acomp_ctx->mutex);
return 0;
fail:
- if (acomp)
- crypto_free_acomp(acomp);
- kfree(buffer);
+ acomp_ctx_dealloc(acomp_ctx);
return ret;
}
-static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
-{
- struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
- struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
- struct acomp_req *req;
- struct crypto_acomp *acomp;
- u8 *buffer;
-
- if (IS_ERR_OR_NULL(acomp_ctx))
- return 0;
-
- mutex_lock(&acomp_ctx->mutex);
- req = acomp_ctx->req;
- acomp = acomp_ctx->acomp;
- buffer = acomp_ctx->buffer;
- acomp_ctx->req = NULL;
- acomp_ctx->acomp = NULL;
- acomp_ctx->buffer = NULL;
- mutex_unlock(&acomp_ctx->mutex);
-
- /*
- * Do the actual freeing after releasing the mutex to avoid subtle
- * locking dependencies causing deadlocks.
- */
- if (!IS_ERR_OR_NULL(req))
- acomp_request_free(req);
- if (!IS_ERR_OR_NULL(acomp))
- crypto_free_acomp(acomp);
- kfree(buffer);
-
- return 0;
-}
-
-static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
-{
- struct crypto_acomp_ctx *acomp_ctx;
-
- for (;;) {
- acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
- mutex_lock(&acomp_ctx->mutex);
- if (likely(acomp_ctx->req))
- return acomp_ctx;
- /*
- * It is possible that we were migrated to a different CPU after
- * getting the per-CPU ctx but before the mutex was acquired. If
- * the old CPU got offlined, zswap_cpu_comp_dead() could have
- * already freed ctx->req (among other things) and set it to
- * NULL. Just try again on the new CPU that we ended up on.
- */
- mutex_unlock(&acomp_ctx->mutex);
- }
-}
-
-static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *acomp_ctx)
-{
- mutex_unlock(&acomp_ctx->mutex);
-}
-
static bool zswap_compress(struct page *page, struct zswap_entry *entry,
struct zswap_pool *pool)
{
@@ -866,7 +827,9 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
u8 *dst;
bool mapped = false;
- acomp_ctx = acomp_ctx_get_cpu_lock(pool);
+ acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
+ mutex_lock(&acomp_ctx->mutex);
+
dst = acomp_ctx->buffer;
sg_init_table(&input, 1);
sg_set_page(&input, page, PAGE_SIZE, 0);
@@ -929,7 +892,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
else if (alloc_ret)
zswap_reject_alloc_fail++;
- acomp_ctx_put_unlock(acomp_ctx);
+ mutex_unlock(&acomp_ctx->mutex);
return comp_ret == 0 && alloc_ret == 0;
}
@@ -941,7 +904,8 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
int decomp_ret = 0, dlen = PAGE_SIZE;
u8 *src, *obj;
- acomp_ctx = acomp_ctx_get_cpu_lock(pool);
+ acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
+ mutex_lock(&acomp_ctx->mutex);
obj = zs_obj_read_begin(pool->zs_pool, entry->handle, acomp_ctx->buffer);
/* zswap entries of length PAGE_SIZE are not compressed. */
@@ -972,7 +936,7 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
read_done:
zs_obj_read_end(pool->zs_pool, entry->handle, obj);
- acomp_ctx_put_unlock(acomp_ctx);
+ mutex_unlock(&acomp_ctx->mutex);
if (!decomp_ret && dlen == PAGE_SIZE)
return true;
@@ -1798,7 +1762,7 @@ static int zswap_setup(void)
ret = cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE,
"mm/zswap_pool:prepare",
zswap_cpu_comp_prepare,
- zswap_cpu_comp_dead);
+ NULL);
if (ret)
goto hp_fail;
--
2.27.0
On Tue, Nov 04, 2025 at 01:12:32AM -0800, Kanchana P Sridhar wrote:
The subject can be shortened to:
"mm: zswap: Tie per-CPU acomp_ctx lifetime to the pool"
> This patch simplifies the zswap_pool's per-CPU acomp_ctx resource
> management. Similar to the per-CPU acomp_ctx itself, the per-CPU
> acomp_ctx's resources' (acomp, req, buffer) lifetime will also be from
> pool creation to pool deletion. These resources will persist through CPU
> hotplug operations instead of being destroyed/recreated. The
> zswap_cpu_comp_dead() teardown callback has been deleted from the call
> to cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE). As a result, CPU
> offline hotplug operations will be no-ops as far as the acomp_ctx
> resources are concerned.
Currently, per-CPU acomp_ctx are allocated on pool creation and/or CPU
hotplug, and destroyed on pool destruction or CPU hotunplug. This
complicates the lifetime management to save memory while a CPU is
offlined, which is not very common.
Simplify lifetime management by allocating per-CPU acomp_ctx once on
pool creation (or CPU hotplug for CPUs onlined later), and keeping them
allocated until the pool is destroyed.
>
> This commit refactors the code from zswap_cpu_comp_dead() into a
> new function acomp_ctx_dealloc() that is called to clean up acomp_ctx
> resources from:
>
> 1) zswap_cpu_comp_prepare() when an error is encountered,
> 2) zswap_pool_create() when an error is encountered, and
> 3) from zswap_pool_destroy().
Refactor cleanup code from zswap_cpu_comp_dead() into
acomp_ctx_dealloc() to be used elsewhere.
>
> The main benefit of using the CPU hotplug multi state instance startup
> callback to allocate the acomp_ctx resources is that it prevents the
> cores from being offlined until the multi state instance addition call
> returns.
>
> From Documentation/core-api/cpu_hotplug.rst:
>
> "The node list add/remove operations and the callback invocations are
> serialized against CPU hotplug operations."
>
> Furthermore, zswap_[de]compress() cannot contend with
> zswap_cpu_comp_prepare() because:
>
> - During pool creation/deletion, the pool is not in the zswap_pools
> list.
>
> - During CPU hot[un]plug, the CPU is not yet online, as Yosry pointed
> out. zswap_cpu_comp_prepare() will be run on a control CPU,
> since CPUHP_MM_ZSWP_POOL_PREPARE is in the PREPARE section of "enum
> cpuhp_state". Thanks Yosry for sharing this observation!
>
> In both these cases, any recursions into zswap reclaim from
> zswap_cpu_comp_prepare() will be handled by the old pool.
>
> The above two observations enable the following simplifications:
>
> 1) zswap_cpu_comp_prepare(): CPU cannot be offlined. Reclaim cannot use
> the pool. Considerations for mutex init/locking and handling
> subsequent CPU hotplug online-offline-online:
>
> Should we lock the mutex of current CPU's acomp_ctx from start to
> end? It doesn't seem like this is required. The CPU hotplug
> operations acquire a "cpuhp_state_mutex" before proceeding, hence
> they are serialized against CPU hotplug operations.
>
> If the process gets migrated while zswap_cpu_comp_prepare() is
> running, it will complete on the new CPU. In case of failures, we
> pass the acomp_ctx pointer obtained at the start of
> zswap_cpu_comp_prepare() to acomp_ctx_dealloc(), which again, can
> only undergo migration. There appear to be no contention scenarios
> that might cause inconsistent values of acomp_ctx's members. Hence,
> it seems there is no need for mutex_lock(&acomp_ctx->mutex) in
> zswap_cpu_comp_prepare().
>
> Since the pool is not yet on zswap_pools list, we don't need to
> initialize the per-CPU acomp_ctx mutex in zswap_pool_create(). This
> has been restored to occur in zswap_cpu_comp_prepare().
>
> zswap_cpu_comp_prepare() checks upfront if acomp_ctx->acomp is
> valid. If so, it returns success. This should handle any CPU
> hotplug online-offline transitions after pool creation is done.
>
> 2) CPU offline vis-a-vis zswap ops: Let's suppose the process is
> migrated to another CPU before the current CPU is dysfunctional. If
> zswap_[de]compress() holds the acomp_ctx->mutex lock of the offlined
> CPU, that mutex will be released once it completes on the new
> CPU. Since there is no teardown callback, there is no possibility of
> UAF.
>
> 3) Pool creation/deletion and process migration to another CPU:
>
> - During pool creation/deletion, the pool is not in the zswap_pools
> list. Hence it cannot contend with zswap ops on that CPU. However,
> the process can get migrated.
>
> Pool creation --> zswap_cpu_comp_prepare()
> --> process migrated:
> * CPU offline: no-op.
> * zswap_cpu_comp_prepare() continues
> to run on the new CPU to finish
> allocating acomp_ctx resources for
> the offlined CPU.
>
> Pool deletion --> acomp_ctx_dealloc()
> --> process migrated:
> * CPU offline: no-op.
> * acomp_ctx_dealloc() continues
> to run on the new CPU to finish
> de-allocating acomp_ctx resources
> for the offlined CPU.
>
> 4) Pool deletion vis-a-vis CPU onlining:
> The call to cpuhp_state_remove_instance() cannot race with
> zswap_cpu_comp_prepare() because of hotplug synchronization.
>
> This patch deletes acomp_ctx_get_cpu_lock()/acomp_ctx_put_unlock().
> Instead, zswap_[de]compress() directly call
> mutex_[un]lock(&acomp_ctx->mutex).
I am not sure why all of this is needed. We should just describe why
it's safe to drop holding the mutex while initializing per-CPU
acomp_ctx:
It is no longer possible for CPU hotplug to race against allocation or
usage of per-CPU acomp_ctx, as they are only allocated once before the
pool can be used, and remain allocated as long as the pool is used.
Hence, stop holding the lock during acomp_ctx initialization, and drop
acomp_ctx_get_cpu_lock()//acomp_ctx_put_unlock().
>
> The per-CPU memory cost of not deleting the acomp_ctx resources upon CPU
> offlining, and only deleting them when the pool is destroyed, is as
> follows, on x86_64:
>
> IAA with 8 dst buffers for batching: 64.34 KB
> Software compressors with 1 dst buffer: 8.28 KB
This cost is only paid when a CPU is offlined, until it is onlined
again.
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
> mm/zswap.c | 164 +++++++++++++++++++++--------------------------------
> 1 file changed, 64 insertions(+), 100 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 4897ed689b9f..87d50786f61f 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -242,6 +242,20 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
> **********************************/
> static void __zswap_pool_empty(struct percpu_ref *ref);
>
> +static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
> +{
> + if (IS_ERR_OR_NULL(acomp_ctx))
> + return;
> +
> + if (!IS_ERR_OR_NULL(acomp_ctx->req))
> + acomp_request_free(acomp_ctx->req);
> +
> + if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> + crypto_free_acomp(acomp_ctx->acomp);
> +
> + kfree(acomp_ctx->buffer);
> +}
> +
> static struct zswap_pool *zswap_pool_create(char *compressor)
> {
> struct zswap_pool *pool;
> @@ -263,19 +277,26 @@ static struct zswap_pool *zswap_pool_create(char *compressor)
>
> strscpy(pool->tfm_name, compressor, sizeof(pool->tfm_name));
>
> - pool->acomp_ctx = alloc_percpu(*pool->acomp_ctx);
> + /* Many things rely on the zero-initialization. */
> + pool->acomp_ctx = alloc_percpu_gfp(*pool->acomp_ctx,
> + GFP_KERNEL | __GFP_ZERO);
> if (!pool->acomp_ctx) {
> pr_err("percpu alloc failed\n");
> goto error;
> }
>
> - for_each_possible_cpu(cpu)
> - mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex);
> -
> + /*
> + * This is serialized against CPU hotplug operations. Hence, cores
> + * cannot be offlined until this finishes.
> + * In case of errors, we need to goto "ref_fail" instead of "error"
> + * because there is no teardown callback registered anymore, for
> + * cpuhp_state_add_instance() to de-allocate resources as it rolls back
> + * state on cores before the CPU on which error was encountered.
> + */
Do we need to manually call acomp_ctx_dealloc() on each CPU on failure
because cpuhp_state_add_instance() relies on the hotunplug callback for
cleanup, and we don't have any?
If that's the case:
/*
* cpuhp_state_add_instance() will not cleanup on failure since
* we don't register a hotunplug callback.
*/
Describing what the code does is not helpful, and things like "anymore"
do not make sense once the code is merged.
> ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
> &pool->node);
> if (ret)
> - goto error;
> + goto ref_fail;
IIUC we shouldn't call cpuhp_state_remove_instance() on failure, we
probably should add a new label.
>
> /* being the current pool takes 1 ref; this func expects the
> * caller to always add the new pool as the current pool
> @@ -292,6 +313,9 @@ static struct zswap_pool *zswap_pool_create(char *compressor)
>
> ref_fail:
> cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
> +
> + for_each_possible_cpu(cpu)
> + acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
> error:
> if (pool->acomp_ctx)
> free_percpu(pool->acomp_ctx);
> @@ -322,9 +346,15 @@ static struct zswap_pool *__zswap_pool_create_fallback(void)
>
> static void zswap_pool_destroy(struct zswap_pool *pool)
> {
> + int cpu;
> +
> zswap_pool_debug("destroying", pool);
>
> cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
> +
> + for_each_possible_cpu(cpu)
> + acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
> +
> free_percpu(pool->acomp_ctx);
>
> zs_destroy_pool(pool->zs_pool);
> @@ -736,39 +766,35 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
> {
> struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
> struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> - struct crypto_acomp *acomp = NULL;
> - struct acomp_req *req = NULL;
> - u8 *buffer = NULL;
> - int ret;
> + int ret = -ENOMEM;
>
> - buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
> - if (!buffer) {
> - ret = -ENOMEM;
> - goto fail;
> - }
> + /*
> + * To handle cases where the CPU goes through online-offline-online
> + * transitions, we return if the acomp_ctx has already been initialized.
> + */
> + if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> + return 0;
Is it possible for acomp_ctx->acomp to be an ERR value here? If it is,
then zswap initialization should have failed. Maybe WARN_ON_ONCE() for
that case?
>
> - acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
> - if (IS_ERR(acomp)) {
> + acomp_ctx->buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
> + if (!acomp_ctx->buffer)
> + return ret;
> +
> + acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
> + if (IS_ERR(acomp_ctx->acomp)) {
> pr_err("could not alloc crypto acomp %s : %ld\n",
> - pool->tfm_name, PTR_ERR(acomp));
> - ret = PTR_ERR(acomp);
> + pool->tfm_name, PTR_ERR(acomp_ctx->acomp));
> + ret = PTR_ERR(acomp_ctx->acomp);
> goto fail;
> }
> + acomp_ctx->is_sleepable = acomp_is_async(acomp_ctx->acomp);
>
> - req = acomp_request_alloc(acomp);
> - if (!req) {
> + acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
> + if (!acomp_ctx->req) {
> pr_err("could not alloc crypto acomp_request %s\n",
> pool->tfm_name);
> - ret = -ENOMEM;
> goto fail;
> }
>
> - /*
> - * Only hold the mutex after completing allocations, otherwise we may
> - * recurse into zswap through reclaim and attempt to hold the mutex
> - * again resulting in a deadlock.
> - */
> - mutex_lock(&acomp_ctx->mutex);
> crypto_init_wait(&acomp_ctx->wait);
>
> /*
[..]
> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Thursday, November 13, 2025 12:24 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v13 19/22] mm: zswap: Per-CPU acomp_ctx resources
> exist from pool creation to deletion.
>
> On Tue, Nov 04, 2025 at 01:12:32AM -0800, Kanchana P Sridhar wrote:
>
[...]
> > mm/zswap.c | 164 +++++++++++++++++++++--------------------------------
> > 1 file changed, 64 insertions(+), 100 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 4897ed689b9f..87d50786f61f 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -242,6 +242,20 @@ static inline struct xarray
> *swap_zswap_tree(swp_entry_t swp)
> > **********************************/
> > static void __zswap_pool_empty(struct percpu_ref *ref);
> >
> > +static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
> > +{
> > + if (IS_ERR_OR_NULL(acomp_ctx))
> > + return;
> > +
> > + if (!IS_ERR_OR_NULL(acomp_ctx->req))
> > + acomp_request_free(acomp_ctx->req);
> > +
> > + if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> > + crypto_free_acomp(acomp_ctx->acomp);
> > +
> > + kfree(acomp_ctx->buffer);
> > +}
> > +
> > static struct zswap_pool *zswap_pool_create(char *compressor)
> > {
> > struct zswap_pool *pool;
> > @@ -263,19 +277,26 @@ static struct zswap_pool
> *zswap_pool_create(char *compressor)
> >
> > strscpy(pool->tfm_name, compressor, sizeof(pool->tfm_name));
> >
> > - pool->acomp_ctx = alloc_percpu(*pool->acomp_ctx);
> > + /* Many things rely on the zero-initialization. */
> > + pool->acomp_ctx = alloc_percpu_gfp(*pool->acomp_ctx,
> > + GFP_KERNEL | __GFP_ZERO);
> > if (!pool->acomp_ctx) {
> > pr_err("percpu alloc failed\n");
> > goto error;
> > }
> >
> > - for_each_possible_cpu(cpu)
> > - mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex);
> > -
> > + /*
> > + * This is serialized against CPU hotplug operations. Hence, cores
> > + * cannot be offlined until this finishes.
> > + * In case of errors, we need to goto "ref_fail" instead of "error"
> > + * because there is no teardown callback registered anymore, for
> > + * cpuhp_state_add_instance() to de-allocate resources as it rolls
> back
> > + * state on cores before the CPU on which error was encountered.
> > + */
>
> Do we need to manually call acomp_ctx_dealloc() on each CPU on failure
> because cpuhp_state_add_instance() relies on the hotunplug callback for
> cleanup, and we don't have any?
That's correct.
>
> If that's the case:
>
> /*
> * cpuhp_state_add_instance() will not cleanup on failure since
> * we don't register a hotunplug callback.
> */
>
> Describing what the code does is not helpful, and things like "anymore"
> do not make sense once the code is merged.
Ok.
>
> > ret =
> cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
> > &pool->node);
> > if (ret)
> > - goto error;
> > + goto ref_fail;
>
> IIUC we shouldn't call cpuhp_state_remove_instance() on failure, we
> probably should add a new label.
In this case we should because it is part of the pool creation failure
handling flow, at the end of which, the pool will be deleted.
>
> >
> > /* being the current pool takes 1 ref; this func expects the
> > * caller to always add the new pool as the current pool
> > @@ -292,6 +313,9 @@ static struct zswap_pool *zswap_pool_create(char
> *compressor)
> >
> > ref_fail:
> > cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
> &pool->node);
> > +
> > + for_each_possible_cpu(cpu)
> > + acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
> > error:
> > if (pool->acomp_ctx)
> > free_percpu(pool->acomp_ctx);
> > @@ -322,9 +346,15 @@ static struct zswap_pool
> *__zswap_pool_create_fallback(void)
> >
> > static void zswap_pool_destroy(struct zswap_pool *pool)
> > {
> > + int cpu;
> > +
> > zswap_pool_debug("destroying", pool);
> >
> > cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
> &pool->node);
> > +
> > + for_each_possible_cpu(cpu)
> > + acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
> > +
> > free_percpu(pool->acomp_ctx);
> >
> > zs_destroy_pool(pool->zs_pool);
> > @@ -736,39 +766,35 @@ static int zswap_cpu_comp_prepare(unsigned int
> cpu, struct hlist_node *node)
> > {
> > struct zswap_pool *pool = hlist_entry(node, struct zswap_pool,
> node);
> > struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool-
> >acomp_ctx, cpu);
> > - struct crypto_acomp *acomp = NULL;
> > - struct acomp_req *req = NULL;
> > - u8 *buffer = NULL;
> > - int ret;
> > + int ret = -ENOMEM;
> >
> > - buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
> > - if (!buffer) {
> > - ret = -ENOMEM;
> > - goto fail;
> > - }
> > + /*
> > + * To handle cases where the CPU goes through online-offline-online
> > + * transitions, we return if the acomp_ctx has already been initialized.
> > + */
> > + if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> > + return 0;
>
> Is it possible for acomp_ctx->acomp to be an ERR value here? If it is,
> then zswap initialization should have failed. Maybe WARN_ON_ONCE() for
> that case?
This is checking for a valid acomp_ctx->acomp using the same criteria
being uniformly used in acomp_ctx_dealloc(). This check is necessary to
handle the case where the CPU goes through online-offline-online state
transitions.
Thanks,
Kanchana
>
> >
> > - acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0,
> cpu_to_node(cpu));
> > - if (IS_ERR(acomp)) {
> > + acomp_ctx->buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL,
> cpu_to_node(cpu));
> > + if (!acomp_ctx->buffer)
> > + return ret;
> > +
> > + acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0,
> 0, cpu_to_node(cpu));
> > + if (IS_ERR(acomp_ctx->acomp)) {
> > pr_err("could not alloc crypto acomp %s : %ld\n",
> > - pool->tfm_name, PTR_ERR(acomp));
> > - ret = PTR_ERR(acomp);
> > + pool->tfm_name, PTR_ERR(acomp_ctx-
> >acomp));
> > + ret = PTR_ERR(acomp_ctx->acomp);
> > goto fail;
> > }
> > + acomp_ctx->is_sleepable = acomp_is_async(acomp_ctx->acomp);
> >
> > - req = acomp_request_alloc(acomp);
> > - if (!req) {
> > + acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
> > + if (!acomp_ctx->req) {
> > pr_err("could not alloc crypto acomp_request %s\n",
> > pool->tfm_name);
> > - ret = -ENOMEM;
> > goto fail;
> > }
> >
> > - /*
> > - * Only hold the mutex after completing allocations, otherwise we
> may
> > - * recurse into zswap through reclaim and attempt to hold the mutex
> > - * again resulting in a deadlock.
> > - */
> > - mutex_lock(&acomp_ctx->mutex);
> > crypto_init_wait(&acomp_ctx->wait);
> >
> > /*
> [..]
On Fri, Dec 12, 2025 at 06:17:07PM +0000, Sridhar, Kanchana P wrote:
> >
> > > ret =
> > cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
> > > &pool->node);
> > > if (ret)
> > > - goto error;
> > > + goto ref_fail;
> >
> > IIUC we shouldn't call cpuhp_state_remove_instance() on failure, we
> > probably should add a new label.
>
> In this case we should because it is part of the pool creation failure
> handling flow, at the end of which, the pool will be deleted.
What I mean is, when cpuhp_state_add_instance() fails we goto ref_fail
which will call cpuhp_state_remove_instance(). But the current code does
not call cpuhp_state_remove_instance() if cpuhp_state_add_instance()
fails.
>
> >
> > >
> > > /* being the current pool takes 1 ref; this func expects the
> > > * caller to always add the new pool as the current pool
> > > @@ -292,6 +313,9 @@ static struct zswap_pool *zswap_pool_create(char
> > *compressor)
> > >
> > > ref_fail:
> > > cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
> > &pool->node);
> > > +
> > > + for_each_possible_cpu(cpu)
> > > + acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
> > > error:
> > > if (pool->acomp_ctx)
> > > free_percpu(pool->acomp_ctx);
> > > @@ -322,9 +346,15 @@ static struct zswap_pool
> > *__zswap_pool_create_fallback(void)
> > >
> > > static void zswap_pool_destroy(struct zswap_pool *pool)
> > > {
> > > + int cpu;
> > > +
> > > zswap_pool_debug("destroying", pool);
> > >
> > > cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
> > &pool->node);
> > > +
> > > + for_each_possible_cpu(cpu)
> > > + acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
> > > +
> > > free_percpu(pool->acomp_ctx);
> > >
> > > zs_destroy_pool(pool->zs_pool);
> > > @@ -736,39 +766,35 @@ static int zswap_cpu_comp_prepare(unsigned int
> > cpu, struct hlist_node *node)
> > > {
> > > struct zswap_pool *pool = hlist_entry(node, struct zswap_pool,
> > node);
> > > struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool-
> > >acomp_ctx, cpu);
> > > - struct crypto_acomp *acomp = NULL;
> > > - struct acomp_req *req = NULL;
> > > - u8 *buffer = NULL;
> > > - int ret;
> > > + int ret = -ENOMEM;
> > >
> > > - buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
> > > - if (!buffer) {
> > > - ret = -ENOMEM;
> > > - goto fail;
> > > - }
> > > + /*
> > > + * To handle cases where the CPU goes through online-offline-online
> > > + * transitions, we return if the acomp_ctx has already been initialized.
> > > + */
> > > + if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> > > + return 0;
> >
> > Is it possible for acomp_ctx->acomp to be an ERR value here? If it is,
> > then zswap initialization should have failed. Maybe WARN_ON_ONCE() for
> > that case?
>
> This is checking for a valid acomp_ctx->acomp using the same criteria
> being uniformly used in acomp_ctx_dealloc(). This check is necessary to
> handle the case where the CPU goes through online-offline-online state
> transitions.
I think I am confused. I thought now we don't free this on CPU offline,
so either it's NULL because this is the first time we initialize it on
this CPU, or it is allocated. If it is an ERR value, then the pool
creation should have failed and we wouldn't be calling this again on CPU
online.
In other words, what scenario do we expect to legitimately see an ERR
value here?
> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Friday, December 12, 2025 10:44 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v13 19/22] mm: zswap: Per-CPU acomp_ctx resources
> exist from pool creation to deletion.
>
> On Fri, Dec 12, 2025 at 06:17:07PM +0000, Sridhar, Kanchana P wrote:
> > >
> > > > ret =
> > > cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
> > > > &pool->node);
> > > > if (ret)
> > > > - goto error;
> > > > + goto ref_fail;
> > >
> > > IIUC we shouldn't call cpuhp_state_remove_instance() on failure, we
> > > probably should add a new label.
> >
> > In this case we should because it is part of the pool creation failure
> > handling flow, at the end of which, the pool will be deleted.
>
> What I mean is, when cpuhp_state_add_instance() fails we goto ref_fail
> which will call cpuhp_state_remove_instance(). But the current code does
> not call cpuhp_state_remove_instance() if cpuhp_state_add_instance()
> fails.
I see what you mean. The current mainline code does not call
cpuhp_state_remove_instance() if cpuhp_state_add_instance() fails, because
the cpuhotplug code will call the teardown callback in this case.
In this patch, I do need to call cpuhp_state_remove_instance() and
acomp_ctx_dealloc() in this case because there is no teardown callback
being registered.
>
> >
> > >
> > > >
> > > > /* being the current pool takes 1 ref; this func expects the
> > > > * caller to always add the new pool as the current pool
> > > > @@ -292,6 +313,9 @@ static struct zswap_pool
> *zswap_pool_create(char
> > > *compressor)
> > > >
> > > > ref_fail:
> > > > cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
> > > &pool->node);
> > > > +
> > > > + for_each_possible_cpu(cpu)
> > > > + acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
> > > > error:
> > > > if (pool->acomp_ctx)
> > > > free_percpu(pool->acomp_ctx);
> > > > @@ -322,9 +346,15 @@ static struct zswap_pool
> > > *__zswap_pool_create_fallback(void)
> > > >
> > > > static void zswap_pool_destroy(struct zswap_pool *pool)
> > > > {
> > > > + int cpu;
> > > > +
> > > > zswap_pool_debug("destroying", pool);
> > > >
> > > > cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
> > > &pool->node);
> > > > +
> > > > + for_each_possible_cpu(cpu)
> > > > + acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
> > > > +
> > > > free_percpu(pool->acomp_ctx);
> > > >
> > > > zs_destroy_pool(pool->zs_pool);
> > > > @@ -736,39 +766,35 @@ static int
> zswap_cpu_comp_prepare(unsigned int
> > > cpu, struct hlist_node *node)
> > > > {
> > > > struct zswap_pool *pool = hlist_entry(node, struct zswap_pool,
> > > node);
> > > > struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool-
> > > >acomp_ctx, cpu);
> > > > - struct crypto_acomp *acomp = NULL;
> > > > - struct acomp_req *req = NULL;
> > > > - u8 *buffer = NULL;
> > > > - int ret;
> > > > + int ret = -ENOMEM;
> > > >
> > > > - buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
> > > > - if (!buffer) {
> > > > - ret = -ENOMEM;
> > > > - goto fail;
> > > > - }
> > > > + /*
> > > > + * To handle cases where the CPU goes through online-offline-online
> > > > + * transitions, we return if the acomp_ctx has already been initialized.
> > > > + */
> > > > + if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> > > > + return 0;
> > >
> > > Is it possible for acomp_ctx->acomp to be an ERR value here? If it is,
> > > then zswap initialization should have failed. Maybe WARN_ON_ONCE() for
> > > that case?
> >
> > This is checking for a valid acomp_ctx->acomp using the same criteria
> > being uniformly used in acomp_ctx_dealloc(). This check is necessary to
> > handle the case where the CPU goes through online-offline-online state
> > transitions.
>
> I think I am confused. I thought now we don't free this on CPU offline,
> so either it's NULL because this is the first time we initialize it on
> this CPU, or it is allocated.
Yes, this is correct.
> If it is an ERR value, then the pool
> creation should have failed and we wouldn't be calling this again on CPU
> online.
>
> In other words, what scenario do we expect to legitimately see an ERR
> value here?
I am using "(!IS_ERR_OR_NULL(acomp_ctx->acomp)" as a check for the
acomp being allocated already. I could instead have used "if (acomp_ctx->acomp)",
but use the former to be consistent with patch 20/22.
I cannot think of a scenario where we can expect an ERR value here.
Thanks,
Kanchana
On Fri, Dec 12, 2025 at 08:53:13PM +0000, Sridhar, Kanchana P wrote:
[..]
> > On Fri, Dec 12, 2025 at 06:17:07PM +0000, Sridhar, Kanchana P wrote:
> > > >
> > > > > ret =
> > > > cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
> > > > > &pool->node);
> > > > > if (ret)
> > > > > - goto error;
> > > > > + goto ref_fail;
> > > >
> > > > IIUC we shouldn't call cpuhp_state_remove_instance() on failure, we
> > > > probably should add a new label.
> > >
> > > In this case we should because it is part of the pool creation failure
> > > handling flow, at the end of which, the pool will be deleted.
> >
> > What I mean is, when cpuhp_state_add_instance() fails we goto ref_fail
> > which will call cpuhp_state_remove_instance(). But the current code does
> > not call cpuhp_state_remove_instance() if cpuhp_state_add_instance()
> > fails.
>
> I see what you mean. The current mainline code does not call
> cpuhp_state_remove_instance() if cpuhp_state_add_instance() fails, because
> the cpuhotplug code will call the teardown callback in this case.
>
> In this patch, I do need to call cpuhp_state_remove_instance() and
> acomp_ctx_dealloc() in this case because there is no teardown callback
> being registered.
Hmm looking at cpuhp_state_add_instance(), it seems like it doesn't add
the node to the list on failure. cpuhp_state_remove_instance() only
removes the node from the list when there's no teardown cb, so it will
be a nop in this case.
What we need to do is manual cleanup, since there is no teardown cb,
which is already being done by acomp_ctx_dealloc() IIUC.
So I think calling cpuhp_state_remove_instance() when
cpuhp_state_add_instance() fails is not needed, and I don't see other
callers doing it.
[..]
> > > > > @@ -322,9 +346,15 @@ static struct zswap_pool
> > > > *__zswap_pool_create_fallback(void)
> > > > >
> > > > > static void zswap_pool_destroy(struct zswap_pool *pool)
> > > > > {
> > > > > + int cpu;
> > > > > +
> > > > > zswap_pool_debug("destroying", pool);
> > > > >
> > > > > cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
> > > > &pool->node);
> > > > > +
> > > > > + for_each_possible_cpu(cpu)
> > > > > + acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
> > > > > +
> > > > > free_percpu(pool->acomp_ctx);
> > > > >
> > > > > zs_destroy_pool(pool->zs_pool);
> > > > > @@ -736,39 +766,35 @@ static int
> > zswap_cpu_comp_prepare(unsigned int
> > > > cpu, struct hlist_node *node)
> > > > > {
> > > > > struct zswap_pool *pool = hlist_entry(node, struct zswap_pool,
> > > > node);
> > > > > struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool-
> > > > >acomp_ctx, cpu);
> > > > > - struct crypto_acomp *acomp = NULL;
> > > > > - struct acomp_req *req = NULL;
> > > > > - u8 *buffer = NULL;
> > > > > - int ret;
> > > > > + int ret = -ENOMEM;
> > > > >
> > > > > - buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
> > > > > - if (!buffer) {
> > > > > - ret = -ENOMEM;
> > > > > - goto fail;
> > > > > - }
> > > > > + /*
> > > > > + * To handle cases where the CPU goes through online-offline-online
> > > > > + * transitions, we return if the acomp_ctx has already been initialized.
> > > > > + */
> > > > > + if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> > > > > + return 0;
> > > >
> > > > Is it possible for acomp_ctx->acomp to be an ERR value here? If it is,
> > > > then zswap initialization should have failed. Maybe WARN_ON_ONCE() for
> > > > that case?
> > >
> > > This is checking for a valid acomp_ctx->acomp using the same criteria
> > > being uniformly used in acomp_ctx_dealloc(). This check is necessary to
> > > handle the case where the CPU goes through online-offline-online state
> > > transitions.
> >
> > I think I am confused. I thought now we don't free this on CPU offline,
> > so either it's NULL because this is the first time we initialize it on
> > this CPU, or it is allocated.
>
> Yes, this is correct.
>
> > If it is an ERR value, then the pool
> > creation should have failed and we wouldn't be calling this again on CPU
> > online.
> >
> > In other words, what scenario do we expect to legitimately see an ERR
> > value here?
>
> I am using "(!IS_ERR_OR_NULL(acomp_ctx->acomp)" as a check for the
> acomp being allocated already. I could instead have used "if (acomp_ctx->acomp)",
> but use the former to be consistent with patch 20/22.
>
> I cannot think of a scenario where we can expect an ERR value here.
Yeah maybe do if (acomp_ctx->acomp) and
WARN_ON_ONCE(IS_ERR(acomp_ctx->acomp))?
> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Friday, December 12, 2025 2:25 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v13 19/22] mm: zswap: Per-CPU acomp_ctx resources
> exist from pool creation to deletion.
>
> On Fri, Dec 12, 2025 at 08:53:13PM +0000, Sridhar, Kanchana P wrote:
> [..]
> > > On Fri, Dec 12, 2025 at 06:17:07PM +0000, Sridhar, Kanchana P wrote:
> > > > >
> > > > > > ret =
> > > > > cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
> > > > > > &pool->node);
> > > > > > if (ret)
> > > > > > - goto error;
> > > > > > + goto ref_fail;
> > > > >
> > > > > IIUC we shouldn't call cpuhp_state_remove_instance() on failure, we
> > > > > probably should add a new label.
> > > >
> > > > In this case we should because it is part of the pool creation failure
> > > > handling flow, at the end of which, the pool will be deleted.
> > >
> > > What I mean is, when cpuhp_state_add_instance() fails we goto ref_fail
> > > which will call cpuhp_state_remove_instance(). But the current code does
> > > not call cpuhp_state_remove_instance() if cpuhp_state_add_instance()
> > > fails.
> >
> > I see what you mean. The current mainline code does not call
> > cpuhp_state_remove_instance() if cpuhp_state_add_instance() fails,
> because
> > the cpuhotplug code will call the teardown callback in this case.
> >
> > In this patch, I do need to call cpuhp_state_remove_instance() and
> > acomp_ctx_dealloc() in this case because there is no teardown callback
> > being registered.
>
> Hmm looking at cpuhp_state_add_instance(), it seems like it doesn't add
> the node to the list on failure. cpuhp_state_remove_instance() only
> removes the node from the list when there's no teardown cb, so it will
> be a nop in this case.
>
> What we need to do is manual cleanup, since there is no teardown cb,
> which is already being done by acomp_ctx_dealloc() IIUC.
>
> So I think calling cpuhp_state_remove_instance() when
> cpuhp_state_add_instance() fails is not needed, and I don't see other
> callers doing it.
You are right. I too have verified this. I will create a label for the call
to acomp_ctx_dealloc() and fix this.
>
> [..]
> > > > > > @@ -322,9 +346,15 @@ static struct zswap_pool
> > > > > *__zswap_pool_create_fallback(void)
> > > > > >
> > > > > > static void zswap_pool_destroy(struct zswap_pool *pool)
> > > > > > {
> > > > > > + int cpu;
> > > > > > +
> > > > > > zswap_pool_debug("destroying", pool);
> > > > > >
> > > > > >
> cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
> > > > > &pool->node);
> > > > > > +
> > > > > > + for_each_possible_cpu(cpu)
> > > > > > + acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx,
> cpu));
> > > > > > +
> > > > > > free_percpu(pool->acomp_ctx);
> > > > > >
> > > > > > zs_destroy_pool(pool->zs_pool);
> > > > > > @@ -736,39 +766,35 @@ static int
> > > zswap_cpu_comp_prepare(unsigned int
> > > > > cpu, struct hlist_node *node)
> > > > > > {
> > > > > > struct zswap_pool *pool = hlist_entry(node, struct
> zswap_pool,
> > > > > node);
> > > > > > struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool-
> > > > > >acomp_ctx, cpu);
> > > > > > - struct crypto_acomp *acomp = NULL;
> > > > > > - struct acomp_req *req = NULL;
> > > > > > - u8 *buffer = NULL;
> > > > > > - int ret;
> > > > > > + int ret = -ENOMEM;
> > > > > >
> > > > > > - buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL,
> cpu_to_node(cpu));
> > > > > > - if (!buffer) {
> > > > > > - ret = -ENOMEM;
> > > > > > - goto fail;
> > > > > > - }
> > > > > > + /*
> > > > > > + * To handle cases where the CPU goes through online-
> offline-online
> > > > > > + * transitions, we return if the acomp_ctx has already been
> initialized.
> > > > > > + */
> > > > > > + if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> > > > > > + return 0;
> > > > >
> > > > > Is it possible for acomp_ctx->acomp to be an ERR value here? If it is,
> > > > > then zswap initialization should have failed. Maybe
> WARN_ON_ONCE() for
> > > > > that case?
> > > >
> > > > This is checking for a valid acomp_ctx->acomp using the same criteria
> > > > being uniformly used in acomp_ctx_dealloc(). This check is necessary to
> > > > handle the case where the CPU goes through online-offline-online state
> > > > transitions.
> > >
> > > I think I am confused. I thought now we don't free this on CPU offline,
> > > so either it's NULL because this is the first time we initialize it on
> > > this CPU, or it is allocated.
> >
> > Yes, this is correct.
> >
> > > If it is an ERR value, then the pool
> > > creation should have failed and we wouldn't be calling this again on CPU
> > > online.
> > >
> > > In other words, what scenario do we expect to legitimately see an ERR
> > > value here?
> >
> > I am using "(!IS_ERR_OR_NULL(acomp_ctx->acomp)" as a check for the
> > acomp being allocated already. I could instead have used "if (acomp_ctx-
> >acomp)",
> > but use the former to be consistent with patch 20/22.
> >
> > I cannot think of a scenario where we can expect an ERR value here.
>
> Yeah maybe do if (acomp_ctx->acomp) and
> WARN_ON_ONCE(IS_ERR(acomp_ctx->acomp))?
Sure.
Thanks,
Kanchana
> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Thursday, November 13, 2025 12:24 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v13 19/22] mm: zswap: Per-CPU acomp_ctx resources
> exist from pool creation to deletion.
>
> On Tue, Nov 04, 2025 at 01:12:32AM -0800, Kanchana P Sridhar wrote:
>
> The subject can be shortened to:
>
> "mm: zswap: Tie per-CPU acomp_ctx lifetime to the pool"
>
> > This patch simplifies the zswap_pool's per-CPU acomp_ctx resource
> > management. Similar to the per-CPU acomp_ctx itself, the per-CPU
> > acomp_ctx's resources' (acomp, req, buffer) lifetime will also be from
> > pool creation to pool deletion. These resources will persist through CPU
> > hotplug operations instead of being destroyed/recreated. The
> > zswap_cpu_comp_dead() teardown callback has been deleted from the call
> > to cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE). As a
> result, CPU
> > offline hotplug operations will be no-ops as far as the acomp_ctx
> > resources are concerned.
>
> Currently, per-CPU acomp_ctx are allocated on pool creation and/or CPU
> hotplug, and destroyed on pool destruction or CPU hotunplug. This
> complicates the lifetime management to save memory while a CPU is
> offlined, which is not very common.
>
> Simplify lifetime management by allocating per-CPU acomp_ctx once on
> pool creation (or CPU hotplug for CPUs onlined later), and keeping them
> allocated until the pool is destroyed.
>
> >
> > This commit refactors the code from zswap_cpu_comp_dead() into a
> > new function acomp_ctx_dealloc() that is called to clean up acomp_ctx
> > resources from:
> >
> > 1) zswap_cpu_comp_prepare() when an error is encountered,
> > 2) zswap_pool_create() when an error is encountered, and
> > 3) from zswap_pool_destroy().
>
>
> Refactor cleanup code from zswap_cpu_comp_dead() into
> acomp_ctx_dealloc() to be used elsewhere.
>
> >
> > The main benefit of using the CPU hotplug multi state instance startup
> > callback to allocate the acomp_ctx resources is that it prevents the
> > cores from being offlined until the multi state instance addition call
> > returns.
> >
> > From Documentation/core-api/cpu_hotplug.rst:
> >
> > "The node list add/remove operations and the callback invocations are
> > serialized against CPU hotplug operations."
> >
> > Furthermore, zswap_[de]compress() cannot contend with
> > zswap_cpu_comp_prepare() because:
> >
> > - During pool creation/deletion, the pool is not in the zswap_pools
> > list.
> >
> > - During CPU hot[un]plug, the CPU is not yet online, as Yosry pointed
> > out. zswap_cpu_comp_prepare() will be run on a control CPU,
> > since CPUHP_MM_ZSWP_POOL_PREPARE is in the PREPARE section of
> "enum
> > cpuhp_state". Thanks Yosry for sharing this observation!
> >
> > In both these cases, any recursions into zswap reclaim from
> > zswap_cpu_comp_prepare() will be handled by the old pool.
> >
> > The above two observations enable the following simplifications:
> >
> > 1) zswap_cpu_comp_prepare(): CPU cannot be offlined. Reclaim cannot
> use
> > the pool. Considerations for mutex init/locking and handling
> > subsequent CPU hotplug online-offline-online:
> >
> > Should we lock the mutex of current CPU's acomp_ctx from start to
> > end? It doesn't seem like this is required. The CPU hotplug
> > operations acquire a "cpuhp_state_mutex" before proceeding, hence
> > they are serialized against CPU hotplug operations.
> >
> > If the process gets migrated while zswap_cpu_comp_prepare() is
> > running, it will complete on the new CPU. In case of failures, we
> > pass the acomp_ctx pointer obtained at the start of
> > zswap_cpu_comp_prepare() to acomp_ctx_dealloc(), which again, can
> > only undergo migration. There appear to be no contention scenarios
> > that might cause inconsistent values of acomp_ctx's members. Hence,
> > it seems there is no need for mutex_lock(&acomp_ctx->mutex) in
> > zswap_cpu_comp_prepare().
> >
> > Since the pool is not yet on zswap_pools list, we don't need to
> > initialize the per-CPU acomp_ctx mutex in zswap_pool_create(). This
> > has been restored to occur in zswap_cpu_comp_prepare().
> >
> > zswap_cpu_comp_prepare() checks upfront if acomp_ctx->acomp is
> > valid. If so, it returns success. This should handle any CPU
> > hotplug online-offline transitions after pool creation is done.
> >
> > 2) CPU offline vis-a-vis zswap ops: Let's suppose the process is
> > migrated to another CPU before the current CPU is dysfunctional. If
> > zswap_[de]compress() holds the acomp_ctx->mutex lock of the offlined
> > CPU, that mutex will be released once it completes on the new
> > CPU. Since there is no teardown callback, there is no possibility of
> > UAF.
> >
> > 3) Pool creation/deletion and process migration to another CPU:
> >
> > - During pool creation/deletion, the pool is not in the zswap_pools
> > list. Hence it cannot contend with zswap ops on that CPU. However,
> > the process can get migrated.
> >
> > Pool creation --> zswap_cpu_comp_prepare()
> > --> process migrated:
> > * CPU offline: no-op.
> > * zswap_cpu_comp_prepare() continues
> > to run on the new CPU to finish
> > allocating acomp_ctx resources for
> > the offlined CPU.
> >
> > Pool deletion --> acomp_ctx_dealloc()
> > --> process migrated:
> > * CPU offline: no-op.
> > * acomp_ctx_dealloc() continues
> > to run on the new CPU to finish
> > de-allocating acomp_ctx resources
> > for the offlined CPU.
> >
> > 4) Pool deletion vis-a-vis CPU onlining:
> > The call to cpuhp_state_remove_instance() cannot race with
> > zswap_cpu_comp_prepare() because of hotplug synchronization.
> >
> > This patch deletes acomp_ctx_get_cpu_lock()/acomp_ctx_put_unlock().
> > Instead, zswap_[de]compress() directly call
> > mutex_[un]lock(&acomp_ctx->mutex).
>
> I am not sure why all of this is needed. We should just describe why
> it's safe to drop holding the mutex while initializing per-CPU
> acomp_ctx:
>
> It is no longer possible for CPU hotplug to race against allocation or
> usage of per-CPU acomp_ctx, as they are only allocated once before the
> pool can be used, and remain allocated as long as the pool is used.
> Hence, stop holding the lock during acomp_ctx initialization, and drop
> acomp_ctx_get_cpu_lock()//acomp_ctx_put_unlock().
Hi Yosry,
Thanks for these comments. IIRC, there was quite a bit of technical
discussion analyzing various what-ifs, that we were able to answer
adequately. The above is a nice summary of the outcome, however,
I think it would help the next time this topic is re-visited to have a log
of the "why" and how races/UAF scenarios are being considered and
addressed by the solution. Does this sound Ok?
Thanks,
Kanchana
>
> >
> > The per-CPU memory cost of not deleting the acomp_ctx resources upon
> CPU
> > offlining, and only deleting them when the pool is destroyed, is as
> > follows, on x86_64:
> >
> > IAA with 8 dst buffers for batching: 64.34 KB
> > Software compressors with 1 dst buffer: 8.28 KB
>
> This cost is only paid when a CPU is offlined, until it is onlined
> again.
>
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> > mm/zswap.c | 164 +++++++++++++++++++++--------------------------------
> > 1 file changed, 64 insertions(+), 100 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 4897ed689b9f..87d50786f61f 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -242,6 +242,20 @@ static inline struct xarray
> *swap_zswap_tree(swp_entry_t swp)
> > **********************************/
> > static void __zswap_pool_empty(struct percpu_ref *ref);
> >
> > +static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
> > +{
> > + if (IS_ERR_OR_NULL(acomp_ctx))
> > + return;
> > +
> > + if (!IS_ERR_OR_NULL(acomp_ctx->req))
> > + acomp_request_free(acomp_ctx->req);
> > +
> > + if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> > + crypto_free_acomp(acomp_ctx->acomp);
> > +
> > + kfree(acomp_ctx->buffer);
> > +}
> > +
> > static struct zswap_pool *zswap_pool_create(char *compressor)
> > {
> > struct zswap_pool *pool;
> > @@ -263,19 +277,26 @@ static struct zswap_pool
> *zswap_pool_create(char *compressor)
> >
> > strscpy(pool->tfm_name, compressor, sizeof(pool->tfm_name));
> >
> > - pool->acomp_ctx = alloc_percpu(*pool->acomp_ctx);
> > + /* Many things rely on the zero-initialization. */
> > + pool->acomp_ctx = alloc_percpu_gfp(*pool->acomp_ctx,
> > + GFP_KERNEL | __GFP_ZERO);
> > if (!pool->acomp_ctx) {
> > pr_err("percpu alloc failed\n");
> > goto error;
> > }
> >
> > - for_each_possible_cpu(cpu)
> > - mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex);
> > -
> > + /*
> > + * This is serialized against CPU hotplug operations. Hence, cores
> > + * cannot be offlined until this finishes.
> > + * In case of errors, we need to goto "ref_fail" instead of "error"
> > + * because there is no teardown callback registered anymore, for
> > + * cpuhp_state_add_instance() to de-allocate resources as it rolls
> back
> > + * state on cores before the CPU on which error was encountered.
> > + */
>
> Do we need to manually call acomp_ctx_dealloc() on each CPU on failure
> because cpuhp_state_add_instance() relies on the hotunplug callback for
> cleanup, and we don't have any?
>
> If that's the case:
>
> /*
> * cpuhp_state_add_instance() will not cleanup on failure since
> * we don't register a hotunplug callback.
> */
>
> Describing what the code does is not helpful, and things like "anymore"
> do not make sense once the code is merged.
>
> > ret =
> cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
> > &pool->node);
> > if (ret)
> > - goto error;
> > + goto ref_fail;
>
> IIUC we shouldn't call cpuhp_state_remove_instance() on failure, we
> probably should add a new label.
>
> >
> > /* being the current pool takes 1 ref; this func expects the
> > * caller to always add the new pool as the current pool
> > @@ -292,6 +313,9 @@ static struct zswap_pool *zswap_pool_create(char
> *compressor)
> >
> > ref_fail:
> > cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
> &pool->node);
> > +
> > + for_each_possible_cpu(cpu)
> > + acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
> > error:
> > if (pool->acomp_ctx)
> > free_percpu(pool->acomp_ctx);
> > @@ -322,9 +346,15 @@ static struct zswap_pool
> *__zswap_pool_create_fallback(void)
> >
> > static void zswap_pool_destroy(struct zswap_pool *pool)
> > {
> > + int cpu;
> > +
> > zswap_pool_debug("destroying", pool);
> >
> > cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
> &pool->node);
> > +
> > + for_each_possible_cpu(cpu)
> > + acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
> > +
> > free_percpu(pool->acomp_ctx);
> >
> > zs_destroy_pool(pool->zs_pool);
> > @@ -736,39 +766,35 @@ static int zswap_cpu_comp_prepare(unsigned int
> cpu, struct hlist_node *node)
> > {
> > struct zswap_pool *pool = hlist_entry(node, struct zswap_pool,
> node);
> > struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool-
> >acomp_ctx, cpu);
> > - struct crypto_acomp *acomp = NULL;
> > - struct acomp_req *req = NULL;
> > - u8 *buffer = NULL;
> > - int ret;
> > + int ret = -ENOMEM;
> >
> > - buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
> > - if (!buffer) {
> > - ret = -ENOMEM;
> > - goto fail;
> > - }
> > + /*
> > + * To handle cases where the CPU goes through online-offline-online
> > + * transitions, we return if the acomp_ctx has already been initialized.
> > + */
> > + if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> > + return 0;
>
> Is it possible for acomp_ctx->acomp to be an ERR value here? If it is,
> then zswap initialization should have failed. Maybe WARN_ON_ONCE() for
> that case?
>
> >
> > - acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0,
> cpu_to_node(cpu));
> > - if (IS_ERR(acomp)) {
> > + acomp_ctx->buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL,
> cpu_to_node(cpu));
> > + if (!acomp_ctx->buffer)
> > + return ret;
> > +
> > + acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0,
> 0, cpu_to_node(cpu));
> > + if (IS_ERR(acomp_ctx->acomp)) {
> > pr_err("could not alloc crypto acomp %s : %ld\n",
> > - pool->tfm_name, PTR_ERR(acomp));
> > - ret = PTR_ERR(acomp);
> > + pool->tfm_name, PTR_ERR(acomp_ctx-
> >acomp));
> > + ret = PTR_ERR(acomp_ctx->acomp);
> > goto fail;
> > }
> > + acomp_ctx->is_sleepable = acomp_is_async(acomp_ctx->acomp);
> >
> > - req = acomp_request_alloc(acomp);
> > - if (!req) {
> > + acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
> > + if (!acomp_ctx->req) {
> > pr_err("could not alloc crypto acomp_request %s\n",
> > pool->tfm_name);
> > - ret = -ENOMEM;
> > goto fail;
> > }
> >
> > - /*
> > - * Only hold the mutex after completing allocations, otherwise we
> may
> > - * recurse into zswap through reclaim and attempt to hold the mutex
> > - * again resulting in a deadlock.
> > - */
> > - mutex_lock(&acomp_ctx->mutex);
> > crypto_init_wait(&acomp_ctx->wait);
> >
> > /*
> [..]
On Fri, Dec 12, 2025 at 12:55:10AM +0000, Sridhar, Kanchana P wrote: > > > -----Original Message----- > > From: Yosry Ahmed <yosry.ahmed@linux.dev> > > Sent: Thursday, November 13, 2025 12:24 PM > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev; > > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com; > > ying.huang@linux.alibaba.com; akpm@linux-foundation.org; > > senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux- > > crypto@vger.kernel.org; herbert@gondor.apana.org.au; > > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org; > > ebiggers@google.com; surenb@google.com; Accardi, Kristen C > > <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>; > > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh > > <vinodh.gopal@intel.com> > > Subject: Re: [PATCH v13 19/22] mm: zswap: Per-CPU acomp_ctx resources > > exist from pool creation to deletion. > > > > On Tue, Nov 04, 2025 at 01:12:32AM -0800, Kanchana P Sridhar wrote: > > > > The subject can be shortened to: > > > > "mm: zswap: Tie per-CPU acomp_ctx lifetime to the pool" > > > > > This patch simplifies the zswap_pool's per-CPU acomp_ctx resource > > > management. Similar to the per-CPU acomp_ctx itself, the per-CPU > > > acomp_ctx's resources' (acomp, req, buffer) lifetime will also be from > > > pool creation to pool deletion. These resources will persist through CPU > > > hotplug operations instead of being destroyed/recreated. The > > > zswap_cpu_comp_dead() teardown callback has been deleted from the call > > > to cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE). As a > > result, CPU > > > offline hotplug operations will be no-ops as far as the acomp_ctx > > > resources are concerned. > > > > Currently, per-CPU acomp_ctx are allocated on pool creation and/or CPU > > hotplug, and destroyed on pool destruction or CPU hotunplug. This > > complicates the lifetime management to save memory while a CPU is > > offlined, which is not very common. > > > > Simplify lifetime management by allocating per-CPU acomp_ctx once on > > pool creation (or CPU hotplug for CPUs onlined later), and keeping them > > allocated until the pool is destroyed. > > > > > > > > This commit refactors the code from zswap_cpu_comp_dead() into a > > > new function acomp_ctx_dealloc() that is called to clean up acomp_ctx > > > resources from: > > > > > > 1) zswap_cpu_comp_prepare() when an error is encountered, > > > 2) zswap_pool_create() when an error is encountered, and > > > 3) from zswap_pool_destroy(). > > > > > > Refactor cleanup code from zswap_cpu_comp_dead() into > > acomp_ctx_dealloc() to be used elsewhere. > > > > > > > > The main benefit of using the CPU hotplug multi state instance startup > > > callback to allocate the acomp_ctx resources is that it prevents the > > > cores from being offlined until the multi state instance addition call > > > returns. > > > > > > From Documentation/core-api/cpu_hotplug.rst: > > > > > > "The node list add/remove operations and the callback invocations are > > > serialized against CPU hotplug operations." > > > > > > Furthermore, zswap_[de]compress() cannot contend with > > > zswap_cpu_comp_prepare() because: > > > > > > - During pool creation/deletion, the pool is not in the zswap_pools > > > list. > > > > > > - During CPU hot[un]plug, the CPU is not yet online, as Yosry pointed > > > out. zswap_cpu_comp_prepare() will be run on a control CPU, > > > since CPUHP_MM_ZSWP_POOL_PREPARE is in the PREPARE section of > > "enum > > > cpuhp_state". Thanks Yosry for sharing this observation! > > > > > > In both these cases, any recursions into zswap reclaim from > > > zswap_cpu_comp_prepare() will be handled by the old pool. > > > > > > The above two observations enable the following simplifications: > > > > > > 1) zswap_cpu_comp_prepare(): CPU cannot be offlined. Reclaim cannot > > use > > > the pool. Considerations for mutex init/locking and handling > > > subsequent CPU hotplug online-offline-online: > > > > > > Should we lock the mutex of current CPU's acomp_ctx from start to > > > end? It doesn't seem like this is required. The CPU hotplug > > > operations acquire a "cpuhp_state_mutex" before proceeding, hence > > > they are serialized against CPU hotplug operations. > > > > > > If the process gets migrated while zswap_cpu_comp_prepare() is > > > running, it will complete on the new CPU. In case of failures, we > > > pass the acomp_ctx pointer obtained at the start of > > > zswap_cpu_comp_prepare() to acomp_ctx_dealloc(), which again, can > > > only undergo migration. There appear to be no contention scenarios > > > that might cause inconsistent values of acomp_ctx's members. Hence, > > > it seems there is no need for mutex_lock(&acomp_ctx->mutex) in > > > zswap_cpu_comp_prepare(). > > > > > > Since the pool is not yet on zswap_pools list, we don't need to > > > initialize the per-CPU acomp_ctx mutex in zswap_pool_create(). This > > > has been restored to occur in zswap_cpu_comp_prepare(). > > > > > > zswap_cpu_comp_prepare() checks upfront if acomp_ctx->acomp is > > > valid. If so, it returns success. This should handle any CPU > > > hotplug online-offline transitions after pool creation is done. > > > > > > 2) CPU offline vis-a-vis zswap ops: Let's suppose the process is > > > migrated to another CPU before the current CPU is dysfunctional. If > > > zswap_[de]compress() holds the acomp_ctx->mutex lock of the offlined > > > CPU, that mutex will be released once it completes on the new > > > CPU. Since there is no teardown callback, there is no possibility of > > > UAF. > > > > > > 3) Pool creation/deletion and process migration to another CPU: > > > > > > - During pool creation/deletion, the pool is not in the zswap_pools > > > list. Hence it cannot contend with zswap ops on that CPU. However, > > > the process can get migrated. > > > > > > Pool creation --> zswap_cpu_comp_prepare() > > > --> process migrated: > > > * CPU offline: no-op. > > > * zswap_cpu_comp_prepare() continues > > > to run on the new CPU to finish > > > allocating acomp_ctx resources for > > > the offlined CPU. > > > > > > Pool deletion --> acomp_ctx_dealloc() > > > --> process migrated: > > > * CPU offline: no-op. > > > * acomp_ctx_dealloc() continues > > > to run on the new CPU to finish > > > de-allocating acomp_ctx resources > > > for the offlined CPU. > > > > > > 4) Pool deletion vis-a-vis CPU onlining: > > > The call to cpuhp_state_remove_instance() cannot race with > > > zswap_cpu_comp_prepare() because of hotplug synchronization. > > > > > > This patch deletes acomp_ctx_get_cpu_lock()/acomp_ctx_put_unlock(). > > > Instead, zswap_[de]compress() directly call > > > mutex_[un]lock(&acomp_ctx->mutex). > > > > I am not sure why all of this is needed. We should just describe why > > it's safe to drop holding the mutex while initializing per-CPU > > acomp_ctx: > > > > It is no longer possible for CPU hotplug to race against allocation or > > usage of per-CPU acomp_ctx, as they are only allocated once before the > > pool can be used, and remain allocated as long as the pool is used. > > Hence, stop holding the lock during acomp_ctx initialization, and drop > > acomp_ctx_get_cpu_lock()//acomp_ctx_put_unlock(). > > Hi Yosry, > > Thanks for these comments. IIRC, there was quite a bit of technical > discussion analyzing various what-ifs, that we were able to answer > adequately. The above is a nice summary of the outcome, however, > I think it would help the next time this topic is re-visited to have a log > of the "why" and how races/UAF scenarios are being considered and > addressed by the solution. Does this sound Ok? How about using the summarized version in the commit log and linking to the thread with the discussion?
> -----Original Message----- > From: Yosry Ahmed <yosry.ahmed@linux.dev> > Sent: Thursday, December 11, 2025 5:06 PM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev; > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com; > ying.huang@linux.alibaba.com; akpm@linux-foundation.org; > senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux- > crypto@vger.kernel.org; herbert@gondor.apana.org.au; > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org; > ebiggers@google.com; surenb@google.com; Accardi, Kristen C > <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>; > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh > <vinodh.gopal@intel.com> > Subject: Re: [PATCH v13 19/22] mm: zswap: Per-CPU acomp_ctx resources > exist from pool creation to deletion. > > On Fri, Dec 12, 2025 at 12:55:10AM +0000, Sridhar, Kanchana P wrote: > > > > > -----Original Message----- > > > From: Yosry Ahmed <yosry.ahmed@linux.dev> > > > Sent: Thursday, November 13, 2025 12:24 PM > > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > > > hannes@cmpxchg.org; nphamcs@gmail.com; > chengming.zhou@linux.dev; > > > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com; > > > ying.huang@linux.alibaba.com; akpm@linux-foundation.org; > > > senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux- > > > crypto@vger.kernel.org; herbert@gondor.apana.org.au; > > > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org; > > > ebiggers@google.com; surenb@google.com; Accardi, Kristen C > > > <kristen.c.accardi@intel.com>; Gomes, Vinicius > <vinicius.gomes@intel.com>; > > > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh > > > <vinodh.gopal@intel.com> > > > Subject: Re: [PATCH v13 19/22] mm: zswap: Per-CPU acomp_ctx > resources > > > exist from pool creation to deletion. > > > > > > On Tue, Nov 04, 2025 at 01:12:32AM -0800, Kanchana P Sridhar wrote: > > > > > > The subject can be shortened to: > > > > > > "mm: zswap: Tie per-CPU acomp_ctx lifetime to the pool" > > > > > > > This patch simplifies the zswap_pool's per-CPU acomp_ctx resource > > > > management. Similar to the per-CPU acomp_ctx itself, the per-CPU > > > > acomp_ctx's resources' (acomp, req, buffer) lifetime will also be from > > > > pool creation to pool deletion. These resources will persist through CPU > > > > hotplug operations instead of being destroyed/recreated. The > > > > zswap_cpu_comp_dead() teardown callback has been deleted from the > call > > > > to cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE). As a > > > result, CPU > > > > offline hotplug operations will be no-ops as far as the acomp_ctx > > > > resources are concerned. > > > > > > Currently, per-CPU acomp_ctx are allocated on pool creation and/or CPU > > > hotplug, and destroyed on pool destruction or CPU hotunplug. This > > > complicates the lifetime management to save memory while a CPU is > > > offlined, which is not very common. > > > > > > Simplify lifetime management by allocating per-CPU acomp_ctx once on > > > pool creation (or CPU hotplug for CPUs onlined later), and keeping them > > > allocated until the pool is destroyed. > > > > > > > > > > > This commit refactors the code from zswap_cpu_comp_dead() into a > > > > new function acomp_ctx_dealloc() that is called to clean up acomp_ctx > > > > resources from: > > > > > > > > 1) zswap_cpu_comp_prepare() when an error is encountered, > > > > 2) zswap_pool_create() when an error is encountered, and > > > > 3) from zswap_pool_destroy(). > > > > > > > > > Refactor cleanup code from zswap_cpu_comp_dead() into > > > acomp_ctx_dealloc() to be used elsewhere. > > > > > > > > > > > The main benefit of using the CPU hotplug multi state instance startup > > > > callback to allocate the acomp_ctx resources is that it prevents the > > > > cores from being offlined until the multi state instance addition call > > > > returns. > > > > > > > > From Documentation/core-api/cpu_hotplug.rst: > > > > > > > > "The node list add/remove operations and the callback invocations are > > > > serialized against CPU hotplug operations." > > > > > > > > Furthermore, zswap_[de]compress() cannot contend with > > > > zswap_cpu_comp_prepare() because: > > > > > > > > - During pool creation/deletion, the pool is not in the zswap_pools > > > > list. > > > > > > > > - During CPU hot[un]plug, the CPU is not yet online, as Yosry pointed > > > > out. zswap_cpu_comp_prepare() will be run on a control CPU, > > > > since CPUHP_MM_ZSWP_POOL_PREPARE is in the PREPARE section > of > > > "enum > > > > cpuhp_state". Thanks Yosry for sharing this observation! > > > > > > > > In both these cases, any recursions into zswap reclaim from > > > > zswap_cpu_comp_prepare() will be handled by the old pool. > > > > > > > > The above two observations enable the following simplifications: > > > > > > > > 1) zswap_cpu_comp_prepare(): CPU cannot be offlined. Reclaim cannot > > > use > > > > the pool. Considerations for mutex init/locking and handling > > > > subsequent CPU hotplug online-offline-online: > > > > > > > > Should we lock the mutex of current CPU's acomp_ctx from start to > > > > end? It doesn't seem like this is required. The CPU hotplug > > > > operations acquire a "cpuhp_state_mutex" before proceeding, hence > > > > they are serialized against CPU hotplug operations. > > > > > > > > If the process gets migrated while zswap_cpu_comp_prepare() is > > > > running, it will complete on the new CPU. In case of failures, we > > > > pass the acomp_ctx pointer obtained at the start of > > > > zswap_cpu_comp_prepare() to acomp_ctx_dealloc(), which again, can > > > > only undergo migration. There appear to be no contention scenarios > > > > that might cause inconsistent values of acomp_ctx's members. Hence, > > > > it seems there is no need for mutex_lock(&acomp_ctx->mutex) in > > > > zswap_cpu_comp_prepare(). > > > > > > > > Since the pool is not yet on zswap_pools list, we don't need to > > > > initialize the per-CPU acomp_ctx mutex in zswap_pool_create(). This > > > > has been restored to occur in zswap_cpu_comp_prepare(). > > > > > > > > zswap_cpu_comp_prepare() checks upfront if acomp_ctx->acomp is > > > > valid. If so, it returns success. This should handle any CPU > > > > hotplug online-offline transitions after pool creation is done. > > > > > > > > 2) CPU offline vis-a-vis zswap ops: Let's suppose the process is > > > > migrated to another CPU before the current CPU is dysfunctional. If > > > > zswap_[de]compress() holds the acomp_ctx->mutex lock of the > offlined > > > > CPU, that mutex will be released once it completes on the new > > > > CPU. Since there is no teardown callback, there is no possibility of > > > > UAF. > > > > > > > > 3) Pool creation/deletion and process migration to another CPU: > > > > > > > > - During pool creation/deletion, the pool is not in the zswap_pools > > > > list. Hence it cannot contend with zswap ops on that CPU. However, > > > > the process can get migrated. > > > > > > > > Pool creation --> zswap_cpu_comp_prepare() > > > > --> process migrated: > > > > * CPU offline: no-op. > > > > * zswap_cpu_comp_prepare() continues > > > > to run on the new CPU to finish > > > > allocating acomp_ctx resources for > > > > the offlined CPU. > > > > > > > > Pool deletion --> acomp_ctx_dealloc() > > > > --> process migrated: > > > > * CPU offline: no-op. > > > > * acomp_ctx_dealloc() continues > > > > to run on the new CPU to finish > > > > de-allocating acomp_ctx resources > > > > for the offlined CPU. > > > > > > > > 4) Pool deletion vis-a-vis CPU onlining: > > > > The call to cpuhp_state_remove_instance() cannot race with > > > > zswap_cpu_comp_prepare() because of hotplug synchronization. > > > > > > > > This patch deletes acomp_ctx_get_cpu_lock()/acomp_ctx_put_unlock(). > > > > Instead, zswap_[de]compress() directly call > > > > mutex_[un]lock(&acomp_ctx->mutex). > > > > > > I am not sure why all of this is needed. We should just describe why > > > it's safe to drop holding the mutex while initializing per-CPU > > > acomp_ctx: > > > > > > It is no longer possible for CPU hotplug to race against allocation or > > > usage of per-CPU acomp_ctx, as they are only allocated once before the > > > pool can be used, and remain allocated as long as the pool is used. > > > Hence, stop holding the lock during acomp_ctx initialization, and drop > > > acomp_ctx_get_cpu_lock()//acomp_ctx_put_unlock(). > > > > Hi Yosry, > > > > Thanks for these comments. IIRC, there was quite a bit of technical > > discussion analyzing various what-ifs, that we were able to answer > > adequately. The above is a nice summary of the outcome, however, > > I think it would help the next time this topic is re-visited to have a log > > of the "why" and how races/UAF scenarios are being considered and > > addressed by the solution. Does this sound Ok? > > How about using the summarized version in the commit log and linking to > the thread with the discussion? Seems like capturing just enough detail of the threads involving the discussions, in this commit log would be valuable. As against reading long email threads with indentations, as the sole resource to provide context?
December 11, 2025 at 5:58 PM, "Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com mailto:kanchana.p.sridhar@intel.com?to=%22Sridhar%2C%20Kanchana%20P%22%20%3Ckanchana.p.sridhar%40intel.com%3E > wrote: > > > > > -----Original Message----- > > From: Yosry Ahmed <> > > Sent: Thursday, December 11, 2025 5:06 PM > > To: Sridhar, Kanchana P <> > > Cc:;; > > hannes@cmpxchg.org; mailto:hannes@cmpxchg.org; ;; > > usamaarif642@gmail.com; mailto:usamaarif642@gmail.com; ;; > > ying.huang@linux.alibaba.com; mailto:ying.huang@linux.alibaba.com; ; > > senozhatsky@chromium.org; mailto:senozhatsky@chromium.org; ;; linux- > > crypto@vger.kernel.org; mailto:crypto@vger.kernel.org; ; > > davem@davemloft.net; mailto:davem@davemloft.net; ;; > > ebiggers@google.com; mailto:ebiggers@google.com; ; Accardi, Kristen C > > <>; Gomes, Vinicius <>; > > Feghali, Wajdi K <>; Gopal, Vinodh > > <> > > Subject: Re: [PATCH v13 19/22] mm: zswap: Per-CPU acomp_ctx resources > > exist from pool creation to deletion. > > > > On Fri, Dec 12, 2025 at 12:55:10AM +0000, Sridhar, Kanchana P wrote: > > > > > -----Original Message----- > > > From: Yosry Ahmed <> > > > Sent: Thursday, November 13, 2025 12:24 PM > > > To: Sridhar, Kanchana P <> > > > Cc:;; > > >;; > > chengming.zhou@linux.dev; mailto:chengming.zhou@linux.dev; > > >;;; > > >;; > > >;;; linux- > > >;; > > >;;; > > >;; Accardi, Kristen C > > > <>; Gomes, Vinicius > > <>; > > > Feghali, Wajdi K <>; Gopal, Vinodh > > > <> > > > Subject: Re: [PATCH v13 19/22] mm: zswap: Per-CPU acomp_ctx > > resources > > > exist from pool creation to deletion. > > > > > > On Tue, Nov 04, 2025 at 01:12:32AM -0800, Kanchana P Sridhar wrote: > > > > > > The subject can be shortened to: > > > > > > "mm: zswap: Tie per-CPU acomp_ctx lifetime to the pool" > > > > > > > This patch simplifies the zswap_pool's per-CPU acomp_ctx resource > > > > management. Similar to the per-CPU acomp_ctx itself, the per-CPU > > > > acomp_ctx's resources' (acomp, req, buffer) lifetime will also be from > > > > pool creation to pool deletion. These resources will persist through CPU > > > > hotplug operations instead of being destroyed/recreated. The > > > > zswap_cpu_comp_dead() teardown callback has been deleted from the > > call > > > > to cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE). As a > > > result, CPU > > > > offline hotplug operations will be no-ops as far as the acomp_ctx > > > > resources are concerned. > > > > > > Currently, per-CPU acomp_ctx are allocated on pool creation and/or CPU > > > hotplug, and destroyed on pool destruction or CPU hotunplug. This > > > complicates the lifetime management to save memory while a CPU is > > > offlined, which is not very common. > > > > > > Simplify lifetime management by allocating per-CPU acomp_ctx once on > > > pool creation (or CPU hotplug for CPUs onlined later), and keeping them > > > allocated until the pool is destroyed. > > > > > > > > > > > This commit refactors the code from zswap_cpu_comp_dead() into a > > > > new function acomp_ctx_dealloc() that is called to clean up acomp_ctx > > > > resources from: > > > > > > > > 1) zswap_cpu_comp_prepare() when an error is encountered, > > > > 2) zswap_pool_create() when an error is encountered, and > > > > 3) from zswap_pool_destroy(). > > > > > > > > > Refactor cleanup code from zswap_cpu_comp_dead() into > > > acomp_ctx_dealloc() to be used elsewhere. > > > > > > > > > > > The main benefit of using the CPU hotplug multi state instance startup > > > > callback to allocate the acomp_ctx resources is that it prevents the > > > > cores from being offlined until the multi state instance addition call > > > > returns. > > > > > > > > From Documentation/core-api/cpu_hotplug.rst: > > > > > > > > "The node list add/remove operations and the callback invocations are > > > > serialized against CPU hotplug operations." > > > > > > > > Furthermore, zswap_[de]compress() cannot contend with > > > > zswap_cpu_comp_prepare() because: > > > > > > > > - During pool creation/deletion, the pool is not in the zswap_pools > > > > list. > > > > > > > > - During CPU hot[un]plug, the CPU is not yet online, as Yosry pointed > > > > out. zswap_cpu_comp_prepare() will be run on a control CPU, > > > > since CPUHP_MM_ZSWP_POOL_PREPARE is in the PREPARE section > > of > > > "enum > > > > cpuhp_state". Thanks Yosry for sharing this observation! > > > > > > > > In both these cases, any recursions into zswap reclaim from > > > > zswap_cpu_comp_prepare() will be handled by the old pool. > > > > > > > > The above two observations enable the following simplifications: > > > > > > > > 1) zswap_cpu_comp_prepare(): CPU cannot be offlined. Reclaim cannot > > > use > > > > the pool. Considerations for mutex init/locking and handling > > > > subsequent CPU hotplug online-offline-online: > > > > > > > > Should we lock the mutex of current CPU's acomp_ctx from start to > > > > end? It doesn't seem like this is required. The CPU hotplug > > > > operations acquire a "cpuhp_state_mutex" before proceeding, hence > > > > they are serialized against CPU hotplug operations. > > > > > > > > If the process gets migrated while zswap_cpu_comp_prepare() is > > > > running, it will complete on the new CPU. In case of failures, we > > > > pass the acomp_ctx pointer obtained at the start of > > > > zswap_cpu_comp_prepare() to acomp_ctx_dealloc(), which again, can > > > > only undergo migration. There appear to be no contention scenarios > > > > that might cause inconsistent values of acomp_ctx's members. Hence, > > > > it seems there is no need for mutex_lock(&acomp_ctx->mutex) in > > > > zswap_cpu_comp_prepare(). > > > > > > > > Since the pool is not yet on zswap_pools list, we don't need to > > > > initialize the per-CPU acomp_ctx mutex in zswap_pool_create(). This > > > > has been restored to occur in zswap_cpu_comp_prepare(). > > > > > > > > zswap_cpu_comp_prepare() checks upfront if acomp_ctx->acomp is > > > > valid. If so, it returns success. This should handle any CPU > > > > hotplug online-offline transitions after pool creation is done. > > > > > > > > 2) CPU offline vis-a-vis zswap ops: Let's suppose the process is > > > > migrated to another CPU before the current CPU is dysfunctional. If > > > > zswap_[de]compress() holds the acomp_ctx->mutex lock of the > > offlined > > > > CPU, that mutex will be released once it completes on the new > > > > CPU. Since there is no teardown callback, there is no possibility of > > > > UAF. > > > > > > > > 3) Pool creation/deletion and process migration to another CPU: > > > > > > > > - During pool creation/deletion, the pool is not in the zswap_pools > > > > list. Hence it cannot contend with zswap ops on that CPU. However, > > > > the process can get migrated. > > > > > > > > Pool creation --> zswap_cpu_comp_prepare() > > > > --> process migrated: > > > > * CPU offline: no-op. > > > > * zswap_cpu_comp_prepare() continues > > > > to run on the new CPU to finish > > > > allocating acomp_ctx resources for > > > > the offlined CPU. > > > > > > > > Pool deletion --> acomp_ctx_dealloc() > > > > --> process migrated: > > > > * CPU offline: no-op. > > > > * acomp_ctx_dealloc() continues > > > > to run on the new CPU to finish > > > > de-allocating acomp_ctx resources > > > > for the offlined CPU. > > > > > > > > 4) Pool deletion vis-a-vis CPU onlining: > > > > The call to cpuhp_state_remove_instance() cannot race with > > > > zswap_cpu_comp_prepare() because of hotplug synchronization. > > > > > > > > This patch deletes acomp_ctx_get_cpu_lock()/acomp_ctx_put_unlock(). > > > > Instead, zswap_[de]compress() directly call > > > > mutex_[un]lock(&acomp_ctx->mutex). > > > > > > I am not sure why all of this is needed. We should just describe why > > > it's safe to drop holding the mutex while initializing per-CPU > > > acomp_ctx: > > > > > > It is no longer possible for CPU hotplug to race against allocation or > > > usage of per-CPU acomp_ctx, as they are only allocated once before the > > > pool can be used, and remain allocated as long as the pool is used. > > > Hence, stop holding the lock during acomp_ctx initialization, and drop > > > acomp_ctx_get_cpu_lock()//acomp_ctx_put_unlock(). > > > > Hi Yosry, > > > > Thanks for these comments. IIRC, there was quite a bit of technical > > discussion analyzing various what-ifs, that we were able to answer > > adequately. The above is a nice summary of the outcome, however, > > I think it would help the next time this topic is re-visited to have a log > > of the "why" and how races/UAF scenarios are being considered and > > addressed by the solution. Does this sound Ok? > > > > How about using the summarized version in the commit log and linking to > > the thread with the discussion? > > > Seems like capturing just enough detail of the threads involving the > discussions, in this commit log would be valuable. As against reading long > email threads with indentations, as the sole resource to provide context? > If you feel strongly about it then sure, but try to keep it as concise as possible, thanks.
> -----Original Message----- > From: Yosry Ahmed <yosry.ahmed@linux.dev> > Sent: Thursday, December 11, 2025 6:47 PM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev; > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com; > ying.huang@linux.alibaba.com; akpm@linux-foundation.org; > senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux- > crypto@vger.kernel.org; herbert@gondor.apana.org.au; > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org; > ebiggers@google.com; surenb@google.com; Accardi, Kristen C > <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>; > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh > <vinodh.gopal@intel.com>; Sridhar, Kanchana P > <kanchana.p.sridhar@intel.com> > Subject: Re: [PATCH v13 19/22] mm: zswap: Per-CPU acomp_ctx resources > exist from pool creation to deletion. > > December 11, 2025 at 5:58 PM, "Sridhar, Kanchana P" > <kanchana.p.sridhar@intel.com > mailto:kanchana.p.sridhar@intel.com?to=%22Sridhar%2C%20Kanchana%20 > P%22%20%3Ckanchana.p.sridhar%40intel.com%3E > wrote: > > > > > > > > > > -----Original Message----- > > > From: Yosry Ahmed <> > > > Sent: Thursday, December 11, 2025 5:06 PM > > > To: Sridhar, Kanchana P <> > > > Cc:;; > > > hannes@cmpxchg.org; mailto:hannes@cmpxchg.org; ;; > > > usamaarif642@gmail.com; mailto:usamaarif642@gmail.com; ;; > > > ying.huang@linux.alibaba.com; mailto:ying.huang@linux.alibaba.com; ; > > > senozhatsky@chromium.org; mailto:senozhatsky@chromium.org; ;; > linux- > > > crypto@vger.kernel.org; mailto:crypto@vger.kernel.org; ; > > > davem@davemloft.net; mailto:davem@davemloft.net; ;; > > > ebiggers@google.com; mailto:ebiggers@google.com; ; Accardi, Kristen C > > > <>; Gomes, Vinicius <>; > > > Feghali, Wajdi K <>; Gopal, Vinodh > > > <> > > > Subject: Re: [PATCH v13 19/22] mm: zswap: Per-CPU acomp_ctx > resources > > > exist from pool creation to deletion. > > > > > > On Fri, Dec 12, 2025 at 12:55:10AM +0000, Sridhar, Kanchana P wrote: > > > > > > > -----Original Message----- > > > > From: Yosry Ahmed <> > > > > Sent: Thursday, November 13, 2025 12:24 PM > > > > To: Sridhar, Kanchana P <> > > > > Cc:;; > > > >;; > > > chengming.zhou@linux.dev; mailto:chengming.zhou@linux.dev; > > > >;;; > > > >;; > > > >;;; linux- > > > >;; > > > >;;; > > > >;; Accardi, Kristen C > > > > <>; Gomes, Vinicius > > > <>; > > > > Feghali, Wajdi K <>; Gopal, Vinodh > > > > <> > > > > Subject: Re: [PATCH v13 19/22] mm: zswap: Per-CPU acomp_ctx > > > resources > > > > exist from pool creation to deletion. > > > > > > > > On Tue, Nov 04, 2025 at 01:12:32AM -0800, Kanchana P Sridhar wrote: > > > > > > > > The subject can be shortened to: > > > > > > > > "mm: zswap: Tie per-CPU acomp_ctx lifetime to the pool" > > > > > > > > > This patch simplifies the zswap_pool's per-CPU acomp_ctx resource > > > > > management. Similar to the per-CPU acomp_ctx itself, the per-CPU > > > > > acomp_ctx's resources' (acomp, req, buffer) lifetime will also be from > > > > > pool creation to pool deletion. These resources will persist through > CPU > > > > > hotplug operations instead of being destroyed/recreated. The > > > > > zswap_cpu_comp_dead() teardown callback has been deleted from > the > > > call > > > > > to cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE). As > a > > > > result, CPU > > > > > offline hotplug operations will be no-ops as far as the acomp_ctx > > > > > resources are concerned. > > > > > > > > Currently, per-CPU acomp_ctx are allocated on pool creation and/or > CPU > > > > hotplug, and destroyed on pool destruction or CPU hotunplug. This > > > > complicates the lifetime management to save memory while a CPU is > > > > offlined, which is not very common. > > > > > > > > Simplify lifetime management by allocating per-CPU acomp_ctx once > on > > > > pool creation (or CPU hotplug for CPUs onlined later), and keeping them > > > > allocated until the pool is destroyed. > > > > > > > > > > > > > > This commit refactors the code from zswap_cpu_comp_dead() into a > > > > > new function acomp_ctx_dealloc() that is called to clean up > acomp_ctx > > > > > resources from: > > > > > > > > > > 1) zswap_cpu_comp_prepare() when an error is encountered, > > > > > 2) zswap_pool_create() when an error is encountered, and > > > > > 3) from zswap_pool_destroy(). > > > > > > > > > > > > Refactor cleanup code from zswap_cpu_comp_dead() into > > > > acomp_ctx_dealloc() to be used elsewhere. > > > > > > > > > > > > > > The main benefit of using the CPU hotplug multi state instance > startup > > > > > callback to allocate the acomp_ctx resources is that it prevents the > > > > > cores from being offlined until the multi state instance addition call > > > > > returns. > > > > > > > > > > From Documentation/core-api/cpu_hotplug.rst: > > > > > > > > > > "The node list add/remove operations and the callback invocations > are > > > > > serialized against CPU hotplug operations." > > > > > > > > > > Furthermore, zswap_[de]compress() cannot contend with > > > > > zswap_cpu_comp_prepare() because: > > > > > > > > > > - During pool creation/deletion, the pool is not in the zswap_pools > > > > > list. > > > > > > > > > > - During CPU hot[un]plug, the CPU is not yet online, as Yosry pointed > > > > > out. zswap_cpu_comp_prepare() will be run on a control CPU, > > > > > since CPUHP_MM_ZSWP_POOL_PREPARE is in the PREPARE section > > > of > > > > "enum > > > > > cpuhp_state". Thanks Yosry for sharing this observation! > > > > > > > > > > In both these cases, any recursions into zswap reclaim from > > > > > zswap_cpu_comp_prepare() will be handled by the old pool. > > > > > > > > > > The above two observations enable the following simplifications: > > > > > > > > > > 1) zswap_cpu_comp_prepare(): CPU cannot be offlined. Reclaim > cannot > > > > use > > > > > the pool. Considerations for mutex init/locking and handling > > > > > subsequent CPU hotplug online-offline-online: > > > > > > > > > > Should we lock the mutex of current CPU's acomp_ctx from start to > > > > > end? It doesn't seem like this is required. The CPU hotplug > > > > > operations acquire a "cpuhp_state_mutex" before proceeding, hence > > > > > they are serialized against CPU hotplug operations. > > > > > > > > > > If the process gets migrated while zswap_cpu_comp_prepare() is > > > > > running, it will complete on the new CPU. In case of failures, we > > > > > pass the acomp_ctx pointer obtained at the start of > > > > > zswap_cpu_comp_prepare() to acomp_ctx_dealloc(), which again, can > > > > > only undergo migration. There appear to be no contention scenarios > > > > > that might cause inconsistent values of acomp_ctx's members. Hence, > > > > > it seems there is no need for mutex_lock(&acomp_ctx->mutex) in > > > > > zswap_cpu_comp_prepare(). > > > > > > > > > > Since the pool is not yet on zswap_pools list, we don't need to > > > > > initialize the per-CPU acomp_ctx mutex in zswap_pool_create(). This > > > > > has been restored to occur in zswap_cpu_comp_prepare(). > > > > > > > > > > zswap_cpu_comp_prepare() checks upfront if acomp_ctx->acomp is > > > > > valid. If so, it returns success. This should handle any CPU > > > > > hotplug online-offline transitions after pool creation is done. > > > > > > > > > > 2) CPU offline vis-a-vis zswap ops: Let's suppose the process is > > > > > migrated to another CPU before the current CPU is dysfunctional. If > > > > > zswap_[de]compress() holds the acomp_ctx->mutex lock of the > > > offlined > > > > > CPU, that mutex will be released once it completes on the new > > > > > CPU. Since there is no teardown callback, there is no possibility of > > > > > UAF. > > > > > > > > > > 3) Pool creation/deletion and process migration to another CPU: > > > > > > > > > > - During pool creation/deletion, the pool is not in the zswap_pools > > > > > list. Hence it cannot contend with zswap ops on that CPU. However, > > > > > the process can get migrated. > > > > > > > > > > Pool creation --> zswap_cpu_comp_prepare() > > > > > --> process migrated: > > > > > * CPU offline: no-op. > > > > > * zswap_cpu_comp_prepare() continues > > > > > to run on the new CPU to finish > > > > > allocating acomp_ctx resources for > > > > > the offlined CPU. > > > > > > > > > > Pool deletion --> acomp_ctx_dealloc() > > > > > --> process migrated: > > > > > * CPU offline: no-op. > > > > > * acomp_ctx_dealloc() continues > > > > > to run on the new CPU to finish > > > > > de-allocating acomp_ctx resources > > > > > for the offlined CPU. > > > > > > > > > > 4) Pool deletion vis-a-vis CPU onlining: > > > > > The call to cpuhp_state_remove_instance() cannot race with > > > > > zswap_cpu_comp_prepare() because of hotplug synchronization. > > > > > > > > > > This patch deletes > acomp_ctx_get_cpu_lock()/acomp_ctx_put_unlock(). > > > > > Instead, zswap_[de]compress() directly call > > > > > mutex_[un]lock(&acomp_ctx->mutex). > > > > > > > > I am not sure why all of this is needed. We should just describe why > > > > it's safe to drop holding the mutex while initializing per-CPU > > > > acomp_ctx: > > > > > > > > It is no longer possible for CPU hotplug to race against allocation or > > > > usage of per-CPU acomp_ctx, as they are only allocated once before the > > > > pool can be used, and remain allocated as long as the pool is used. > > > > Hence, stop holding the lock during acomp_ctx initialization, and drop > > > > acomp_ctx_get_cpu_lock()//acomp_ctx_put_unlock(). > > > > > > Hi Yosry, > > > > > > Thanks for these comments. IIRC, there was quite a bit of technical > > > discussion analyzing various what-ifs, that we were able to answer > > > adequately. The above is a nice summary of the outcome, however, > > > I think it would help the next time this topic is re-visited to have a log > > > of the "why" and how races/UAF scenarios are being considered and > > > addressed by the solution. Does this sound Ok? > > > > > > How about using the summarized version in the commit log and linking to > > > the thread with the discussion? > > > > > Seems like capturing just enough detail of the threads involving the > > discussions, in this commit log would be valuable. As against reading long > > email threads with indentations, as the sole resource to provide context? > > > > If you feel strongly about it then sure, but try to keep it as concise as possible, > thanks. Sure, will do, thanks!
© 2016 - 2025 Red Hat, Inc.