This patch simplifies the zswap_pool's per-CPU acomp_ctx resource
management. Like the per-CPU acomp_ctx itself, the lifetime of the
acomp_ctx's resources (acomp, req, buffer) now spans pool creation to
pool deletion, and these resources persist through CPU hotplug
operations. The zswap_cpu_comp_dead() teardown callback has been
removed from the cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE)
call. As a result, CPU offline hotplug operations are no-ops as far as
the acomp_ctx resources are concerned.
This commit refactors the code from zswap_cpu_comp_dead() into a new
function, acomp_ctx_dealloc(), which preserves the IS_ERR_OR_NULL()
checks on acomp_ctx, req and acomp from the existing mainline
implementation of zswap_cpu_comp_dead(). acomp_ctx_dealloc() is called
to clean up the acomp_ctx resources from all of these paths:

 1) zswap_cpu_comp_prepare() when an error is encountered,
 2) zswap_pool_create() when an error is encountered, and
 3) zswap_pool_destroy().
The main benefit of using the CPU hotplug multi state instance startup
callback to allocate the acomp_ctx resources is that it prevents the
cores from being offlined until the multi state instance addition call
returns.

From Documentation/core-api/cpu_hotplug.rst:

  "The node list add/remove operations and the callback invocations are
  serialized against CPU hotplug operations."
Furthermore, zswap_[de]compress() cannot contend with
zswap_cpu_comp_prepare() because:

 - During pool creation/deletion, the pool is not in the zswap_pools
   list.

 - During CPU hot[un]plug, the CPU is not yet online, as Yosry pointed
   out. zswap_cpu_comp_prepare() will be executed on a control CPU,
   since CPUHP_MM_ZSWP_POOL_PREPARE is in the PREPARE section of "enum
   cpuhp_state" (see the abridged sketch after this list). Thanks Yosry
   for sharing this observation!

In both these cases, any recursions into zswap reclaim from
zswap_cpu_comp_prepare() will be handled by the old pool.
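For reference, an abridged sketch of "enum cpuhp_state"; the exact
neighboring states are elided and positions are approximate (see
include/linux/cpuhotplug.h for the authoritative ordering), but states
before CPUHP_BRINGUP_CPU are invoked on a control CPU:

  enum cpuhp_state {
  	CPUHP_OFFLINE = 0,
  	/* ... other PREPARE-section states ... */
  	CPUHP_MM_ZSWP_POOL_PREPARE,	/* callbacks run on a control CPU */
  	/* ... */
  	CPUHP_BRINGUP_CPU,		/* later states run on the hotplugged CPU */
  	/* ... AP section ... */
  };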
The above two observations enable the following simplifications:

1) zswap_cpu_comp_prepare(): The CPU cannot be offlined, and reclaim
   cannot use the pool. Considerations for mutex init/locking and for
   handling subsequent CPU hotplug online-offline transitions:

   Should we lock the current CPU's acomp_ctx mutex from start to end?
   This does not appear to be required. The multi state instance
   add/remove operations acquire "cpuhp_state_mutex" before proceeding,
   hence zswap_cpu_comp_prepare() is serialized against CPU hotplug
   operations.

   If the process gets migrated while zswap_cpu_comp_prepare() is
   running, it will complete on the new CPU. In case of failure, we
   pass the acomp_ctx pointer obtained at the start of
   zswap_cpu_comp_prepare() to acomp_ctx_dealloc(), which, again, can
   only be affected by process migration. There appear to be no
   contention scenarios that could leave acomp_ctx's members in an
   inconsistent state. Hence, there is no need for
   mutex_lock(&acomp_ctx->mutex) in zswap_cpu_comp_prepare().

   Since the pool is not yet on the zswap_pools list, we don't need to
   initialize the per-CPU acomp_ctx mutex in zswap_pool_create(); this
   has been moved back into zswap_cpu_comp_prepare().

   zswap_cpu_comp_prepare() checks upfront whether acomp_ctx->acomp is
   valid and, if so, returns success. This handles any CPU hotplug
   online-offline transitions after pool creation is done.
2) CPU offline vis-a-vis zswap ops: Suppose the process is migrated to
   another CPU before the current CPU goes offline. If
   zswap_[de]compress() holds the offlined CPU's acomp_ctx->mutex, that
   mutex will be released once the operation completes on the new CPU.
   Since there is no teardown callback, there is no possibility of UAF.
3) Pool creation/deletion and process migration to another CPU:

   - During pool creation/deletion, the pool is not in the zswap_pools
     list. Hence it cannot contend with zswap ops on that CPU. However,
     the process can get migrated.

     Pool creation --> zswap_cpu_comp_prepare()
                   --> process migrated:
                       * CPU offline: no-op.
                       * zswap_cpu_comp_prepare() continues
                         to run on the new CPU to finish
                         allocating acomp_ctx resources for
                         the offlined CPU.

     Pool deletion --> acomp_ctx_dealloc()
                   --> process migrated:
                       * CPU offline: no-op.
                       * acomp_ctx_dealloc() continues
                         to run on the new CPU to finish
                         de-allocating acomp_ctx resources
                         for the offlined CPU.
4) Pool deletion vis-a-vis CPU onlining: To prevent race conditions
   between acomp_ctx_dealloc() freeing the acomp_ctx resources and the
   initial check for a valid acomp_ctx->acomp in
   zswap_cpu_comp_prepare(), we need to delete the multi state instance
   right after it is added, in zswap_pool_create(). The interleaving
   this rules out is sketched below.
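A hypothetical timeline of that race (illustrative only):

   CPU onlining (startup callback)      Pool deletion
   -------------------------------      -------------
   zswap_cpu_comp_prepare():
     sees a valid acomp_ctx->acomp,
     returns 0 (ctx deemed usable)
                                        acomp_ctx_dealloc():
                                          frees req, acomp, buffer
   the onlined CPU later performs
   zswap ops on freed resources --> UAF

Removing the instance right after cpuhp_state_add_instance() returns
means a subsequent CPU online no longer invokes zswap_cpu_comp_prepare()
for this pool, closing this window.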
Summary of changes based on the above:
--------------------------------------

1) Zero-initialize pool->acomp_ctx in zswap_pool_create() to simplify
   and share common error-handling/cleanup code related to the
   acomp_ctx.

2) Remove the node list instance right after the node list add call in
   zswap_pool_create(). This prevents race conditions between CPU
   onlining after initial pool creation and acomp_ctx_dealloc() freeing
   the acomp_ctx resources.

3) zswap_pool_destroy() calls acomp_ctx_dealloc() to de-allocate the
   per-CPU acomp_ctx resources.

4) Changes to zswap_cpu_comp_prepare():

   a) Check upfront if acomp_ctx->acomp is valid and, if so, return,
      because the acomp_ctx is already initialized.
   b) Move the mutex_init into this procedure, before it returns.
   c) Handle all error conditions by calling acomp_ctx_dealloc().

5) New procedure acomp_ctx_dealloc() for common error/cleanup code.

6) No more multi state instance teardown callback. CPU offlining is a
   no-op as far as acomp_ctx resources are concerned.

7) Delete acomp_ctx_get_cpu_lock()/acomp_ctx_put_unlock(). Directly
   call mutex_lock(&acomp_ctx->mutex)/mutex_unlock(&acomp_ctx->mutex)
   in zswap_[de]compress().
The per-CPU memory cost of not deleting the acomp_ctx resources upon
CPU offlining, and only deleting them when the pool is destroyed, is as
follows on x86_64:

  IAA with 8 dst buffers for batching:     64.34 KB
  Software compressors with 1 dst buffer:   8.28 KB
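As a rough consistency check derived purely from these two numbers
(assuming the non-buffer overheads are comparable across the two
configurations):

  (64.34 KB - 8.28 KB) / (8 - 1) extra buffers ~= 8.01 KB per dst buffer

i.e., each additional batching dst buffer costs on the order of two
4 KB pages per CPU.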
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
mm/zswap.c | 194 +++++++++++++++++++++++++----------------------------
1 file changed, 93 insertions(+), 101 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index c1af782e54ec..27665eaa3f89 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -242,6 +242,30 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
**********************************/
static void __zswap_pool_empty(struct percpu_ref *ref);
+/*
+ * The per-cpu pool->acomp_ctx is zero-initialized on allocation. This makes
+ * it easy for different error conditions/cleanup related to the acomp_ctx
+ * to be handled by acomp_ctx_dealloc():
+ * - Errors during zswap_cpu_comp_prepare().
+ * - Partial success/error of cpuhp_state_add_instance() call in
+ * zswap_pool_create(). Only some cores could have executed
+ * zswap_cpu_comp_prepare(), not others.
+ * - Cleanup acomp_ctx resources on all cores in zswap_pool_destroy().
+ */
+static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
+{
+ if (IS_ERR_OR_NULL(acomp_ctx))
+ return;
+
+ if (!IS_ERR_OR_NULL(acomp_ctx->req))
+ acomp_request_free(acomp_ctx->req);
+
+ if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
+ crypto_free_acomp(acomp_ctx->acomp);
+
+ kfree(acomp_ctx->buffer);
+}
+
static struct zswap_pool *zswap_pool_create(char *compressor)
{
struct zswap_pool *pool;
@@ -263,19 +287,43 @@ static struct zswap_pool *zswap_pool_create(char *compressor)
strscpy(pool->tfm_name, compressor, sizeof(pool->tfm_name));
- pool->acomp_ctx = alloc_percpu(*pool->acomp_ctx);
+ /* Many things rely on the zero-initialization. */
+ pool->acomp_ctx = alloc_percpu_gfp(*pool->acomp_ctx,
+ GFP_KERNEL | __GFP_ZERO);
if (!pool->acomp_ctx) {
pr_err("percpu alloc failed\n");
goto error;
}
- for_each_possible_cpu(cpu)
- mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex);
-
+ /*
+ * This is serialized against CPU hotplug operations. Hence, cores
+ * cannot be offlined until this finishes.
+ * In case of errors, we need to goto "ref_fail" instead of "error"
+ * because there is no teardown callback registered anymore, for
+ * cpuhp_state_add_instance() to de-allocate resources as it rolls back
+ * state on cores before the CPU on which error was encountered.
+ */
ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
&pool->node);
+
+ /*
+ * We only needed the multi state instance add operation to invoke the
+ * startup callback for all cores without cores getting offlined. Since
+ * the acomp_ctx resources will now only be de-allocated when the pool
+ * is destroyed, we can safely remove the multi state instance. This
+ * minimizes (but does not eliminate) the possibility of
+ * zswap_cpu_comp_prepare() being invoked again due to a CPU
+ * offline-online transition. Removing the instance also prevents race
+ * conditions between CPU onlining after initial pool creation, and
+ * acomp_ctx_dealloc() freeing the acomp_ctx resources.
+ * Note that we delete the instance before checking the error status of
+ * the node list add operation because we want the instance removal even
+ * in case of errors in the former.
+ */
+ cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
+
if (ret)
- goto error;
+ goto ref_fail;
/* being the current pool takes 1 ref; this func expects the
* caller to always add the new pool as the current pool
@@ -291,7 +339,8 @@ static struct zswap_pool *zswap_pool_create(char *compressor)
return pool;
ref_fail:
- cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
+ for_each_possible_cpu(cpu)
+ acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
error:
if (pool->acomp_ctx)
free_percpu(pool->acomp_ctx);
@@ -322,9 +371,13 @@ static struct zswap_pool *__zswap_pool_create_fallback(void)
static void zswap_pool_destroy(struct zswap_pool *pool)
{
+ int cpu;
+
zswap_pool_debug("destroying", pool);
- cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
+ for_each_possible_cpu(cpu)
+ acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
+
free_percpu(pool->acomp_ctx);
zs_destroy_pool(pool->zs_pool);
@@ -736,39 +789,39 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
{
struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
- struct crypto_acomp *acomp = NULL;
- struct acomp_req *req = NULL;
- u8 *buffer = NULL;
- int ret;
+ int ret = -ENOMEM;
- buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
- if (!buffer) {
- ret = -ENOMEM;
- goto fail;
- }
+ /*
+ * The per-CPU pool->acomp_ctx is zero-initialized on allocation.
+ * Even though we delete the multi state instance right after successful
+ * addition of the instance in zswap_pool_create(), we cannot eliminate
+ * the possibility of the CPU going through offline-online transitions.
+ * If this does happen, we check if the acomp_ctx has already been
+ * initialized, and return.
+ */
+ if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
+ return 0;
- acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
- if (IS_ERR(acomp)) {
+ acomp_ctx->buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
+ if (!acomp_ctx->buffer)
+ return ret;
+
+ acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
+ if (IS_ERR(acomp_ctx->acomp)) {
pr_err("could not alloc crypto acomp %s : %ld\n",
- pool->tfm_name, PTR_ERR(acomp));
- ret = PTR_ERR(acomp);
+ pool->tfm_name, PTR_ERR(acomp_ctx->acomp));
+ ret = PTR_ERR(acomp_ctx->acomp);
goto fail;
}
+ acomp_ctx->is_sleepable = acomp_is_async(acomp_ctx->acomp);
- req = acomp_request_alloc(acomp);
- if (!req) {
+ acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
+ if (!acomp_ctx->req) {
pr_err("could not alloc crypto acomp_request %s\n",
pool->tfm_name);
- ret = -ENOMEM;
goto fail;
}
- /*
- * Only hold the mutex after completing allocations, otherwise we may
- * recurse into zswap through reclaim and attempt to hold the mutex
- * again resulting in a deadlock.
- */
- mutex_lock(&acomp_ctx->mutex);
crypto_init_wait(&acomp_ctx->wait);
/*
@@ -776,81 +829,17 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
* crypto_wait_req(); if the backend of acomp is scomp, the callback
* won't be called, crypto_wait_req() will return without blocking.
*/
- acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+ acomp_request_set_callback(acomp_ctx->req, CRYPTO_TFM_REQ_MAY_BACKLOG,
crypto_req_done, &acomp_ctx->wait);
- acomp_ctx->buffer = buffer;
- acomp_ctx->acomp = acomp;
- acomp_ctx->is_sleepable = acomp_is_async(acomp);
- acomp_ctx->req = req;
- mutex_unlock(&acomp_ctx->mutex);
+ mutex_init(&acomp_ctx->mutex);
return 0;
fail:
- if (acomp)
- crypto_free_acomp(acomp);
- kfree(buffer);
+ acomp_ctx_dealloc(acomp_ctx);
return ret;
}
-static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
-{
- struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
- struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
- struct acomp_req *req;
- struct crypto_acomp *acomp;
- u8 *buffer;
-
- if (IS_ERR_OR_NULL(acomp_ctx))
- return 0;
-
- mutex_lock(&acomp_ctx->mutex);
- req = acomp_ctx->req;
- acomp = acomp_ctx->acomp;
- buffer = acomp_ctx->buffer;
- acomp_ctx->req = NULL;
- acomp_ctx->acomp = NULL;
- acomp_ctx->buffer = NULL;
- mutex_unlock(&acomp_ctx->mutex);
-
- /*
- * Do the actual freeing after releasing the mutex to avoid subtle
- * locking dependencies causing deadlocks.
- */
- if (!IS_ERR_OR_NULL(req))
- acomp_request_free(req);
- if (!IS_ERR_OR_NULL(acomp))
- crypto_free_acomp(acomp);
- kfree(buffer);
-
- return 0;
-}
-
-static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
-{
- struct crypto_acomp_ctx *acomp_ctx;
-
- for (;;) {
- acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
- mutex_lock(&acomp_ctx->mutex);
- if (likely(acomp_ctx->req))
- return acomp_ctx;
- /*
- * It is possible that we were migrated to a different CPU after
- * getting the per-CPU ctx but before the mutex was acquired. If
- * the old CPU got offlined, zswap_cpu_comp_dead() could have
- * already freed ctx->req (among other things) and set it to
- * NULL. Just try again on the new CPU that we ended up on.
- */
- mutex_unlock(&acomp_ctx->mutex);
- }
-}
-
-static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *acomp_ctx)
-{
- mutex_unlock(&acomp_ctx->mutex);
-}
-
static bool zswap_compress(struct page *page, struct zswap_entry *entry,
struct zswap_pool *pool)
{
@@ -863,7 +852,9 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
u8 *dst;
bool mapped = false;
- acomp_ctx = acomp_ctx_get_cpu_lock(pool);
+ acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
+ mutex_lock(&acomp_ctx->mutex);
+
dst = acomp_ctx->buffer;
sg_init_table(&input, 1);
sg_set_page(&input, page, PAGE_SIZE, 0);
@@ -927,7 +918,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
else if (alloc_ret)
zswap_reject_alloc_fail++;
- acomp_ctx_put_unlock(acomp_ctx);
+ mutex_unlock(&acomp_ctx->mutex);
return comp_ret == 0 && alloc_ret == 0;
}
@@ -939,7 +930,8 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
int decomp_ret = 0, dlen = PAGE_SIZE;
u8 *src, *obj;
- acomp_ctx = acomp_ctx_get_cpu_lock(pool);
+ acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
+ mutex_lock(&acomp_ctx->mutex);
obj = zs_obj_read_begin(pool->zs_pool, entry->handle, acomp_ctx->buffer);
/* zswap entries of length PAGE_SIZE are not compressed. */
@@ -970,7 +962,7 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
read_done:
zs_obj_read_end(pool->zs_pool, entry->handle, obj);
- acomp_ctx_put_unlock(acomp_ctx);
+ mutex_unlock(&acomp_ctx->mutex);
if (!decomp_ret && dlen == PAGE_SIZE)
return true;
@@ -1796,7 +1788,7 @@ static int zswap_setup(void)
ret = cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE,
"mm/zswap_pool:prepare",
zswap_cpu_comp_prepare,
- zswap_cpu_comp_dead);
+ NULL);
if (ret)
goto hp_fail;
--
2.27.0
On Thu, Sep 25, 2025 at 08:34:59PM -0700, Kanchana P Sridhar wrote:
> This patch simplifies the zswap_pool's per-CPU acomp_ctx resource
> management. [...]
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>

Please try to make the commit logs a bit more summarized. Details are
helpful, but it's easy to lose track of things when it gets too long.

> +/*
> + * The per-cpu pool->acomp_ctx is zero-initialized on allocation. This makes
> + * it easy for different error conditions/cleanup related to the acomp_ctx
> + * to be handled by acomp_ctx_dealloc():
> + * [...]
> + */

Comments describing specific code paths go out of date really fast. The
comment is probably unnecessary; it's easy to check the allocation path
to figure out that these are zero-initialized.

Also in general, please keep the comments as summarized as possible, and
add them only when the logic is not clear from the code.

> +	/*
> +	 * [...]
> +	 * Note that we delete the instance before checking the error status
> +	 * of the node list add operation because we want the instance removal
> +	 * even in case of errors in the former.
> +	 */
> +	cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
> +

I don't understand what's wrong with the current flow? We call
cpuhp_state_remove_instance() in pool deletion before freeing up the
per-CPU resources. Why is this not enough?
> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Tuesday, September 30, 2025 8:49 AM
> Subject: Re: [PATCH v12 20/23] mm: zswap: Per-CPU acomp_ctx resources
> exist from pool creation to deletion.
>
> Please try to make the commit logs a bit more summarized. Details are
> helpful, but it's easy to lose track of things when it gets too long.

Thanks Yosry, for the feedback.

> Comments describing specific code paths go out of date really fast. The
> comment is probably unnecessary; it's easy to check the allocation path
> to figure out that these are zero-initialized.
>
> Also in general, please keep the comments as summarized as possible, and
> add them only when the logic is not clear from the code.

Sure. I have tried to explain the rationale for significant changes, but I
can look for opportunities to summarize. I was sort of hoping that v12
would be it, but I can work on making the comments concise if this is
crucial.

> I don't understand what's wrong with the current flow? We call
> cpuhp_state_remove_instance() in pool deletion before freeing up the
> per-CPU resources. Why is this not enough?

This is because, with the changes proposed in this commit, the multi state
instance add is used during pool creation as a way to create the acomp_ctx
resources correctly with just the offline/online state transitions
guaranteed by CPU hotplug, without the additional mutex locking used in
mainline. In other words, the consistency wrt safely creating/deleting
acomp_ctx resources is accomplished by the hotplug state transition
guarantees. Stated differently, the hotplug framework enforces the new
design during pool creation without relying on the mutex, which enables
the simplifications in zswap_[de]compress() proposed in this commit.

Once this is done, deleting the multi state instance seems cleaner, and
reflects the change in policy of the resources' lifetime. It also prevents
race conditions between zswap_cpu_comp_prepare() and acomp_ctx_dealloc()
called from zswap_pool_destroy().

The only cleaner design I can think of is to not use CPU hotplug callbacks
at all, and instead use a for_each_possible_cpu() loop to allocate the
acomp_ctx resources, as sketched below. The one benefit of the current
design is that it saves memory if a considerable number of CPUs are
offlined to begin with, for some reason.

Thanks,
Kanchana
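For reference, a minimal sketch of that alternative (hypothetical, not
part of this patch; the function name is made up, error handling is
simplified, and the request callback/wait setup done in
zswap_cpu_comp_prepare() is omitted):

  static int zswap_alloc_acomp_ctxs(struct zswap_pool *pool)
  {
  	int cpu;

  	/* pool->acomp_ctx is assumed zero-initialized, as in this patch. */
  	for_each_possible_cpu(cpu) {
  		struct crypto_acomp_ctx *ctx = per_cpu_ptr(pool->acomp_ctx, cpu);

  		ctx->buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
  		ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0,
  						     cpu_to_node(cpu));
  		if (!ctx->buffer || IS_ERR_OR_NULL(ctx->acomp))
  			goto fail;

  		ctx->req = acomp_request_alloc(ctx->acomp);
  		if (!ctx->req)
  			goto fail;

  		mutex_init(&ctx->mutex);
  	}
  	return 0;

  fail:
  	/* acomp_ctx_dealloc() tolerates partially initialized entries. */
  	for_each_possible_cpu(cpu)
  		acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
  	return -ENOMEM;
  }

This trades the hotplug framework's serialization guarantees for
simplicity, at the memory cost of always allocating for all possible CPUs.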
On Tue, Sep 30, 2025 at 06:20:13PM +0000, Sridhar, Kanchana P wrote:
> > I don't understand what's wrong with the current flow? We call
> > cpuhp_state_remove_instance() in pool deletion before freeing up the
> > per-CPU resources. Why is this not enough?
>
> [...]
>
> Once this is done, deleting the multi state instance seems cleaner, and
> reflects the change in policy of the resources' lifetime. It also prevents
> race conditions between zswap_cpu_comp_prepare() and acomp_ctx_dealloc()
> called from zswap_pool_destroy().

How is a race with zswap_cpu_comp_prepare() possible if we call
cpuhp_state_remove_instance() before acomp_ctx_dealloc() in the pool
deletion path?
> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Tuesday, September 30, 2025 11:30 AM
> Subject: Re: [PATCH v12 20/23] mm: zswap: Per-CPU acomp_ctx resources
> exist from pool creation to deletion.
>
> On Tue, Sep 30, 2025 at 06:20:13PM +0000, Sridhar, Kanchana P wrote:
> >
> > > On Thu, Sep 25, 2025 at 08:34:59PM -0700, Kanchana P Sridhar wrote:
> > > > [...]
> > > > 3) Pool creation/deletion and process migration to another CPU:
> > > >
> > > >    - During pool creation/deletion, the pool is not in the zswap_pools
> > > >      list. Hence it cannot contend with zswap ops on that CPU. However,
> > > >      the process can get migrated.
> > > >
> > > >      Pool creation --> zswap_cpu_comp_prepare()
> > > >                        --> process migrated:
> > > >                            * CPU offline: no-op.
> > > >                            * zswap_cpu_comp_prepare() continues
> > > >                              to run on the new CPU to finish
> > > >                              allocating acomp_ctx resources for
> > > >                              the offlined CPU.
> > > >
> > > >      Pool deletion --> acomp_ctx_dealloc()
> > > >                        --> process migrated:
> > > >                            * CPU offline: no-op.
> > > >                            * acomp_ctx_dealloc() continues
> > > >                              to run on the new CPU to finish
> > > >                              de-allocating acomp_ctx resources
> > > >                              for the offlined CPU.
> > > >
> > > > 4) Pool deletion vis-a-vis CPU onlining:
> > > >    To prevent possibility of race conditions between
> > > >    acomp_ctx_dealloc() freeing the acomp_ctx resources and the initial
> > > >    check for a valid acomp_ctx->acomp in zswap_cpu_comp_prepare(), we
> > > >    need to delete the multi state instance right after it is added, in
> > > >    zswap_pool_create().
> > > >
> > > > Summary of changes based on the above:
> > > > --------------------------------------
> > > > 1) Zero-initialization of pool->acomp_ctx in zswap_pool_create() to
> > > >    simplify and share common code for different error handling/cleanup
> > > >    related to the acomp_ctx.
> > > >
> > > > 2) Remove the node list instance right after node list add function
> > > >    call in zswap_pool_create(). This prevents race conditions between
> > > >    CPU onlining after initial pool creation, and acomp_ctx_dealloc()
> > > >    freeing the acomp_ctx resources.
> > > >
> > > > 3) zswap_pool_destroy() will call acomp_ctx_dealloc() to de-allocate
> > > >    the per-CPU acomp_ctx resources.
> > > >
> > > > 4) Changes to zswap_cpu_comp_prepare():
> > > >
> > > >    a) Check if acomp_ctx->acomp is valid at the beginning and return,
> > > >       because the acomp_ctx is already initialized.
> > > >    b) Move the mutex_init to happen in this procedure, before it
> > > >       returns.
> > > >    c) All error conditions handled by calling acomp_ctx_dealloc().
> > > >
> > > > 5) New procedure acomp_ctx_dealloc() for common error/cleanup code.
> > > >
> > > > 6) No more multi state instance teardown callback. CPU offlining is a
> > > >    no-op as far as acomp_ctx resources are concerned.
> > > >
> > > > 7) Delete acomp_ctx_get_cpu_lock()/acomp_ctx_put_unlock(). Directly
> > > >    call mutex_lock(&acomp_ctx->mutex)/mutex_unlock(&acomp_ctx->mutex)
> > > >    in zswap_[de]compress().
> > > >
> > > > The per-CPU memory cost of not deleting the acomp_ctx resources upon
> > > > CPU offlining, and only deleting them when the pool is destroyed, is
> > > > as follows, on x86_64:
> > > >
> > > > IAA with 8 dst buffers for batching: 64.34 KB
> > > > Software compressors with 1 dst buffer: 8.28 KB
> > > >
> > > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > >
> > > Please try to make the commit logs a bit more summarized. Details are
> > > helpful, but it's easy to lose track of things when it gets too long.
> >
> > Thanks Yosry, for the feedback.
> >
> > > > ---
> > > >  mm/zswap.c | 194 +++++++++++++++++++++++++----------------------------
> > > >  1 file changed, 93 insertions(+), 101 deletions(-)
> > > >
> > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > index c1af782e54ec..27665eaa3f89 100644
> > > > --- a/mm/zswap.c
> > > > +++ b/mm/zswap.c
> > > > @@ -242,6 +242,30 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
> > > >   **********************************/
> > > >  static void __zswap_pool_empty(struct percpu_ref *ref);
> > > >
> > > > +/*
> > > > + * The per-cpu pool->acomp_ctx is zero-initialized on allocation. This makes
> > > > + * it easy for different error conditions/cleanup related to the acomp_ctx
> > > > + * to be handled by acomp_ctx_dealloc():
> > > > + * - Errors during zswap_cpu_comp_prepare().
> > > > + * - Partial success/error of cpuhp_state_add_instance() call in
> > > > + *   zswap_pool_create(). Only some cores could have executed
> > > > + *   zswap_cpu_comp_prepare(), not others.
> > > > + * - Cleanup acomp_ctx resources on all cores in zswap_pool_destroy().
> > > > + */
> > >
> > > Comments describing specific code paths go out of date really fast. The
> > > comment is probably unnecessary, it's easy to check the allocation path
> > > to figure out that these are zero-initialized.
> > >
> > > Also in general, please keep the comments as summarized as possible, and
> > > only when the logic is not clear from the code.
> >
> > Sure. I have tried to explain the rationale for significant changes, but
> > I can look for opportunities to summarize. I was sort of hoping that v12
> > would be it, but I can work on the comments being concise if this is
> > crucial.
> >
> > > > +static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
> > > > +{
> > > > +	if (IS_ERR_OR_NULL(acomp_ctx))
> > > > +		return;
> > > > +
> > > > +	if (!IS_ERR_OR_NULL(acomp_ctx->req))
> > > > +		acomp_request_free(acomp_ctx->req);
> > > > +
> > > > +	if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> > > > +		crypto_free_acomp(acomp_ctx->acomp);
> > > > +
> > > > +	kfree(acomp_ctx->buffer);
> > > > +}
> > > > +
> > > >  static struct zswap_pool *zswap_pool_create(char *compressor)
> > > >  {
> > > >  	struct zswap_pool *pool;
> > > > @@ -263,19 +287,43 @@ static struct zswap_pool *zswap_pool_create(char *compressor)
> > > >
> > > >  	strscpy(pool->tfm_name, compressor, sizeof(pool->tfm_name));
> > > >
> > > > -	pool->acomp_ctx = alloc_percpu(*pool->acomp_ctx);
> > > > +	/* Many things rely on the zero-initialization. */
> > > > +	pool->acomp_ctx = alloc_percpu_gfp(*pool->acomp_ctx,
> > > > +					   GFP_KERNEL | __GFP_ZERO);
> > > >  	if (!pool->acomp_ctx) {
> > > >  		pr_err("percpu alloc failed\n");
> > > >  		goto error;
> > > >  	}
> > > >
> > > > -	for_each_possible_cpu(cpu)
> > > > -		mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex);
> > > > -
> > > > +	/*
> > > > +	 * This is serialized against CPU hotplug operations. Hence, cores
> > > > +	 * cannot be offlined until this finishes.
> > > > +	 * In case of errors, we need to goto "ref_fail" instead of "error"
> > > > +	 * because there is no teardown callback registered anymore, for
> > > > +	 * cpuhp_state_add_instance() to de-allocate resources as it rolls back
> > > > +	 * state on cores before the CPU on which error was encountered.
> > > > +	 */
> > > >  	ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
> > > >  				       &pool->node);
> > > > +
> > > > +	/*
> > > > +	 * We only needed the multi state instance add operation to invoke the
> > > > +	 * startup callback for all cores without cores getting offlined. Since
> > > > +	 * the acomp_ctx resources will now only be de-allocated when the pool
> > > > +	 * is destroyed, we can safely remove the multi state instance. This
> > > > +	 * minimizes (but does not eliminate) the possibility of
> > > > +	 * zswap_cpu_comp_prepare() being invoked again due to a CPU
> > > > +	 * offline-online transition. Removing the instance also prevents race
> > > > +	 * conditions between CPU onlining after initial pool creation, and
> > > > +	 * acomp_ctx_dealloc() freeing the acomp_ctx resources.
> > > > +	 * Note that we delete the instance before checking the error status of
> > > > +	 * the node list add operation because we want the instance removal even
> > > > +	 * in case of errors in the former.
> > > > +	 */
> > > > +	cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
> > > > +
> > >
> > > I don't understand what's wrong with the current flow? We call
> > > cpuhp_state_remove_instance() in pool deletion before freeing up the
> > > per-CPU resources. Why is this not enough?
> >
> > This is because with the changes proposed in this commit, the multi state
> > add instance is used during pool creation as a way to create acomp_ctx
> > resources correctly with just the offline/online state transitions
> > guaranteed by CPU hotplug, without needing additional mutex locking as in
> > the mainline. In other words, the consistency wrt safely creating/deleting
> > acomp_ctx resources with the changes being proposed is accomplished by the
> > hotplug state transitions guarantee. Stated differently, the hotplug
> > framework helps enforce the new design during pool creation without
> > relying on the mutex and subsequent simplifications during
> > zswap_[de]compress() proposed in this commit.
> >
> > Once this is done, deleting the CPU hotplug state seems cleaner, and
> > reflects the change in policy of the resources' lifetime. It also
> > prevents race conditions between zswap_cpu_comp_prepare() and
> > acomp_ctx_dealloc() called from zswap_pool_destroy().
>
> How is a race with zswap_cpu_comp_prepare() possible if we call
> cpuhp_state_remove_instance() before acomp_ctx_dealloc() in the pool
> deletion path?

Good point. I agree, calling cpuhp_state_remove_instance() before
acomp_ctx_dealloc() will not cause a race. However, if we consider the
time from pool creation to deletion: if there is an online-offline-online
transition, can zswap_cpu_comp_prepare() race with the call to
cpuhp_state_remove_instance()? If so, wouldn't this cause unpredictable
behavior?

I agree, this can occur even with the code in this commit, but there is
less risk of things going wrong because we remove the CPU hotplug
instance before the pool is added to zswap_pools.

Further, removing the CPU hotplug instance directly codifies the
intent of this commit, i.e., to use this as a facilitator and manage memory
allotted to acomp_ctx, but not to manage those resources' lifetime
thereafter.

Do you see any advantage in having the call to cpuhp_state_remove_instance()
occur before acomp_ctx_dealloc() in zswap_pool_destroy()? Please let me know
if I am missing something.

Thanks,
Kanchana

> > The only cleaner design I can think of is to not use CPU hotplug
> > callbacks at all, instead use a for_each_possible_cpu() to allocate
> > acomp_ctx resources. The one benefit of the current design is that it
> > saves memory if a considerable number of CPUs are offlined to begin
> > with, for some reason.
> >
> > Thanks,
> > Kanchana
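As an aside, the no-hotplug alternative Kanchana floats can be made
concrete. The sketch below is hypothetical: it reuses
zswap_cpu_comp_prepare() and acomp_ctx_dealloc() from the quoted diff,
the helper name is invented, and the unwind loop leans on the patch's
zero-initialized per-CPU acomp_ctx so that CPUs that were never prepared
are still safe to pass to acomp_ctx_dealloc().

/* Hypothetical sketch: allocate acomp_ctx resources for all possible
 * CPUs at pool creation, with no CPU hotplug callbacks at all.
 */
static int zswap_pool_prepare_acomp_ctxs(struct zswap_pool *pool)
{
	int cpu, err_cpu;

	for_each_possible_cpu(cpu) {
		int ret = zswap_cpu_comp_prepare(cpu, &pool->node);

		if (ret) {
			/* Unwind: safe even for never-prepared CPUs,
			 * because their acomp_ctx fields are zeroed.
			 */
			for_each_possible_cpu(err_cpu)
				acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx,
							      err_cpu));
			return ret;
		}
	}

	return 0;
}

As the thread notes, this trades memory for simplicity: every possible
CPU pays for resources up front, whereas the hotplug-driven design only
allocates for CPUs that actually come online.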
> > > > > [...]
> > > >
> > > > I don't understand what's wrong with the current flow? We call
> > > > cpuhp_state_remove_instance() in pool deletion before freeing up the
> > > > per-CPU resources. Why is this not enough?
> > >
> > > [...]
> > > Once this is done, deleting the CPU hotplug state seems cleaner, and
> > > reflects the change in policy of the resources' lifetime. It also
> > > prevents race conditions between zswap_cpu_comp_prepare() and
> > > acomp_ctx_dealloc() called from zswap_pool_destroy().
> >
> > How is a race with zswap_cpu_comp_prepare() possible if we call
> > cpuhp_state_remove_instance() before acomp_ctx_dealloc() in the pool
> > deletion path?
>
> Good point. I agree, calling cpuhp_state_remove_instance() before
> acomp_ctx_dealloc() will not cause a race. However, if we consider the
> time from pool creation to deletion: if there is an online-offline-online
> transition, can zswap_cpu_comp_prepare() race with the call to
> cpuhp_state_remove_instance()? If so, wouldn't this cause unpredictable
> behavior?

How will this race happen?

cpuhp_state_remove_instance() is called while a pool is being destroyed,
whereas zswap_cpu_comp_prepare() runs while the pool is being created or
during CPU onlining.

The former cannot race, and the latter should be synchronized by hotplug
code.

> I agree, this can occur even with the code in this commit, but there is
> less risk of things going wrong because we remove the CPU hotplug
> instance before the pool is added to zswap_pools.
>
> Further, removing the CPU hotplug instance directly codifies the
> intent of this commit, i.e., to use this as a facilitator and manage memory
> allotted to acomp_ctx, but not to manage those resources' lifetime
> thereafter.
>
> Do you see any advantage in having the call to cpuhp_state_remove_instance()
> occur before acomp_ctx_dealloc() in zswap_pool_destroy()? Please let me know
> if I am missing something.

What about more CPUs going online? Without the hotplug instance we don't
get per-CPU resources for those. We are not using the hotplug mechanism
just to facilitate per-CPU resource allocation, we use it to
automatically allocate resources for newly onlined CPUs without having
to preallocate for all possible CPUs.

Also, this makes the code more difficult to reason about, and is an
unnecessary change from the current behavior.

The only change needed is to drop the teardown callback and do the
freeing in the pool destruction path instead.
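For concreteness, the change Yosry describes reduces, on the
registration side, to passing a NULL teardown callback. A minimal
sketch against the mainline registration in zswap_setup(), with the
surrounding setup and error handling elided:

/* Keep the startup callback so newly onlined CPUs get acomp_ctx
 * resources, but register no teardown callback: offlining a CPU now
 * leaves its resources in place until pool destruction.
 */
ret = cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE,
			      "mm/zswap_pool:prepare",
			      zswap_cpu_comp_prepare,
			      NULL);	/* was: zswap_cpu_comp_dead */
if (ret)
	goto hp_fail;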
> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Tuesday, September 30, 2025 2:20 PM
> Subject: Re: [PATCH v12 20/23] mm: zswap: Per-CPU acomp_ctx resources
> exist from pool creation to deletion.
>
> [...]
>
> What about more CPUs going online? Without the hotplug instance we don't
> get per-CPU resources for those. We are not using the hotplug mechanism
> just to facilitate per-CPU resource allocation, we use it to
> automatically allocate resources for newly onlined CPUs without having
> to preallocate for all possible CPUs.

This is an excellent point! It makes sense; I will move the call to
cpuhp_state_remove_instance() to be before the call to
acomp_ctx_dealloc() in zswap_pool_destroy(). Thanks for catching this.

> Also, this makes the code more difficult to reason about, and is an
> unnecessary change from the current behavior.

Ok.

> The only change needed is to drop the teardown callback and do the
> freeing in the pool destruction path instead.

Just to summarize: besides moving the call to cpuhp_state_remove_instance()
to zswap_pool_destroy() and more concise comments/commit logs, are there
other changes to be made in patch 20?

Thanks,
Kanchana
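Putting the agreed pieces together, the destruction path would then
remove the hotplug instance first, so that no startup callback can run
for this pool while its per-CPU resources are freed. A rough sketch
using the helpers from the quoted diff; unrelated zswap_pool_destroy()
details are elided and the surrounding structure is assumed, not taken
from the patch:

static void zswap_pool_destroy(struct zswap_pool *pool)
{
	int cpu;

	/* After this, CPU onlining can no longer invoke the startup
	 * callback for this pool's node.
	 */
	cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);

	/* Offlined CPUs kept their resources, so walk all possible CPUs. */
	for_each_possible_cpu(cpu)
		acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));

	free_percpu(pool->acomp_ctx);
	/* ... remaining pool teardown elided ... */
}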
On Tue, Sep 30, 2025 at 09:56:33PM +0000, Sridhar, Kanchana P wrote:
> > [...]
> >
> > What about more CPUs going online? Without the hotplug instance we don't
> > get per-CPU resources for those. We are not using the hotplug mechanism
> > just to facilitate per-CPU resource allocation, we use it to
> > automatically allocate resources for newly onlined CPUs without having
> > to preallocate for all possible CPUs.
>
> This is an excellent point! It makes sense; I will move the call to
> cpuhp_state_remove_instance() to be before the call to
> acomp_ctx_dealloc() in zswap_pool_destroy(). Thanks for catching this.
>
> > Also, this makes the code more difficult to reason about, and is an
> > unnecessary change from the current behavior.
>
> Ok.
>
> > The only change needed is to drop the teardown callback and do the
> > freeing in the pool destruction path instead.
>
> Just to summarize: besides moving the call to cpuhp_state_remove_instance()
> to zswap_pool_destroy() and more concise comments/commit logs, are there
> other changes to be made in patch 20?

I don't believe so. Thanks!
> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Wednesday, October 1, 2025 8:33 AM
> Subject: Re: [PATCH v12 20/23] mm: zswap: Per-CPU acomp_ctx resources
> exist from pool creation to deletion.
>
> On Tue, Sep 30, 2025 at 09:56:33PM +0000, Sridhar, Kanchana P wrote:
> > [...]
> >
> > Just to summarize: besides moving the call to cpuhp_state_remove_instance()
> > to zswap_pool_destroy() and more concise comments/commit logs, are there
> > other changes to be made in patch 20?
>
> I don't believe so. Thanks!

Sounds good, thanks!

Thanks,
Kanchana