[PATCH v2] zsmalloc: prefer the original page's node for compressed data

Nhat Pham posted 1 patch 21 hours ago
drivers/block/zram/zram_drv.c | 11 ++++++++---
include/linux/zpool.h         |  4 ++--
include/linux/zsmalloc.h      |  3 ++-
mm/zpool.c                    |  8 +++++---
mm/zsmalloc.c                 | 18 ++++++++++--------
mm/zswap.c                    |  2 +-
6 files changed, 28 insertions(+), 18 deletions(-)
[PATCH v2] zsmalloc: prefer the original page's node for compressed data
Posted by Nhat Pham 21 hours ago
Currently, zsmalloc, the backend memory allocator for zswap and zram,
does not enforce any policy for the allocation of memory for the
compressed data. It simply adopts the memory policy of the task
entering reclaim, or the default policy (prefer the local node) if no
such policy is specified. This can lead to several pathological
behaviors in multi-node NUMA systems:

1. Systems with CXL-based memory tiering can encounter the following
   inversion with zswap/zram: the coldest pages demoted to the CXL tier
   can return to the high tier when they are reclaimed to compressed
   swap, creating memory pressure on the high tier.

2. Consider a direct reclaimer scanning nodes in order of allocation
   preference. If it ventures into remote nodes, the memory it
   compresses there should stay there. Trying to shift those contents
   over to the reclaiming thread's preferred node further *increases*
   its local pressure, provoking more spills. The remote node is
   also the most likely to refault this data again. This undesirable
   behavior was pointed out by Johannes Weiner in [1].

3. For zswap writeback, the zswap entries are organized in
   node-specific LRUs, based on the node placement of the original
   pages, allowing for targeted zswap writeback for specific nodes.

   However, the compressed data of a zswap entry can be placed on a
   different node from the one whose LRU holds the entry. This means
   that reclaim targeted at one node might not free up the memory
   backing that node's zswap entries, but instead free memory on a
   different node.

All of these issues will be resolved if the compressed data goes to the
same node as the original page. This patch encourages this behavior by
having zswap and zram pass the node of the original page to zsmalloc,
and having zsmalloc prefer the specified node when new (zs)pages need
to be allocated for the compressed data.

Note that we are not strictly binding the allocation to the preferred
node. We still allow the allocation to fall back to other nodes when
the preferred node is full, or if we have zspages with slots available
on a different node. This is OK, and still a strict improvement over
the status quo:

1. On a system with demotion enabled, we will generally prefer
   demotions over compressed swapping, and only swap when pages have
   already gone to the lowest tier. This patch should achieve the
   desired effect for the most part.

2. If the preferred node is out of memory, letting the compressed data
   go to other nodes can be better than the alternative (OOMs,
   keeping cold memory unreclaimed, disk swapping, etc.).

3. If the allocation goes to a different node because we have a zspage
   with slots available there, at least we're not creating extra
   immediate memory pressure (since the space is already allocated).

4. While there can be some mixing, we generally reclaim pages in
   same-node batches, which encourages zspage grouping that is more
   likely to land on the right node.

5. A strict binding would require partitioning zsmalloc by node, which
   is more complicated, and more prone to regression, since it reduces
   the storage density of zsmalloc. We need to evaluate the tradeoff
   and benchmark carefully before adopting such an involved solution.

[1]: https://lore.kernel.org/linux-mm/20250331165306.GC2110528@cmpxchg.org/

Suggested-by: Gregory Price <gourry@gourry.net>
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
 drivers/block/zram/zram_drv.c | 11 ++++++++---
 include/linux/zpool.h         |  4 ++--
 include/linux/zsmalloc.h      |  3 ++-
 mm/zpool.c                    |  8 +++++---
 mm/zsmalloc.c                 | 18 ++++++++++--------
 mm/zswap.c                    |  2 +-
 6 files changed, 28 insertions(+), 18 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index fda7d8624889..0ba18277ed7b 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1694,7 +1694,7 @@ static int write_incompressible_page(struct zram *zram, struct page *page,
 	 */
 	handle = zs_malloc(zram->mem_pool, PAGE_SIZE,
 			   GFP_NOIO | __GFP_NOWARN |
-			   __GFP_HIGHMEM | __GFP_MOVABLE);
+			   __GFP_HIGHMEM | __GFP_MOVABLE, page_to_nid(page));
 	if (IS_ERR_VALUE(handle))
 		return PTR_ERR((void *)handle);
 
@@ -1761,7 +1761,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 
 	handle = zs_malloc(zram->mem_pool, comp_len,
 			   GFP_NOIO | __GFP_NOWARN |
-			   __GFP_HIGHMEM | __GFP_MOVABLE);
+			   __GFP_HIGHMEM | __GFP_MOVABLE, page_to_nid(page));
 	if (IS_ERR_VALUE(handle)) {
 		zcomp_stream_put(zstrm);
 		return PTR_ERR((void *)handle);
@@ -1981,10 +1981,15 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
 	 * We are holding per-CPU stream mutex and entry lock so better
 	 * avoid direct reclaim.  Allocation error is not fatal since
 	 * we still have the old object in the mem_pool.
+	 *
+	 * XXX: technically, the node we really want here is the node that holds
+	 * the original compressed data. But that would require us to modify
+	 * zsmalloc API to return this information. For now, we will make do with
+	 * the node of the page allocated for recompression.
 	 */
 	handle_new = zs_malloc(zram->mem_pool, comp_len_new,
 			       GFP_NOIO | __GFP_NOWARN |
-			       __GFP_HIGHMEM | __GFP_MOVABLE);
+			       __GFP_HIGHMEM | __GFP_MOVABLE, page_to_nid(page));
 	if (IS_ERR_VALUE(handle_new)) {
 		zcomp_stream_put(zstrm);
 		return PTR_ERR((void *)handle_new);
diff --git a/include/linux/zpool.h b/include/linux/zpool.h
index 52f30e526607..697525cb00bd 100644
--- a/include/linux/zpool.h
+++ b/include/linux/zpool.h
@@ -22,7 +22,7 @@ const char *zpool_get_type(struct zpool *pool);
 void zpool_destroy_pool(struct zpool *pool);
 
 int zpool_malloc(struct zpool *pool, size_t size, gfp_t gfp,
-			unsigned long *handle);
+			unsigned long *handle, const int nid);
 
 void zpool_free(struct zpool *pool, unsigned long handle);
 
@@ -64,7 +64,7 @@ struct zpool_driver {
 	void (*destroy)(void *pool);
 
 	int (*malloc)(void *pool, size_t size, gfp_t gfp,
-				unsigned long *handle);
+				unsigned long *handle, const int nid);
 	void (*free)(void *pool, unsigned long handle);
 
 	void *(*obj_read_begin)(void *pool, unsigned long handle,
diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index c26baf9fb331..13e9cc5490f7 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -26,7 +26,8 @@ struct zs_pool;
 struct zs_pool *zs_create_pool(const char *name);
 void zs_destroy_pool(struct zs_pool *pool);
 
-unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t flags);
+unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t flags,
+			const int nid);
 void zs_free(struct zs_pool *pool, unsigned long obj);
 
 size_t zs_huge_class_size(struct zs_pool *pool);
diff --git a/mm/zpool.c b/mm/zpool.c
index 6d6d88930932..b99a7c03e735 100644
--- a/mm/zpool.c
+++ b/mm/zpool.c
@@ -226,20 +226,22 @@ const char *zpool_get_type(struct zpool *zpool)
  * @size:	The amount of memory to allocate.
  * @gfp:	The GFP flags to use when allocating memory.
  * @handle:	Pointer to the handle to set
+ * @nid:	The preferred node id.
  *
  * This allocates the requested amount of memory from the pool.
  * The gfp flags will be used when allocating memory, if the
  * implementation supports it.  The provided @handle will be
- * set to the allocated object handle.
+ * set to the allocated object handle. The allocation will
+ * prefer the NUMA node specified by @nid.
  *
  * Implementations must guarantee this to be thread-safe.
  *
  * Returns: 0 on success, negative value on error.
  */
 int zpool_malloc(struct zpool *zpool, size_t size, gfp_t gfp,
-			unsigned long *handle)
+			unsigned long *handle, const int nid)
 {
-	return zpool->driver->malloc(zpool->pool, size, gfp, handle);
+	return zpool->driver->malloc(zpool->pool, size, gfp, handle, nid);
 }
 
 /**
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 961b270f023c..8ba6cdf222ac 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -243,9 +243,9 @@ static inline void zpdesc_dec_zone_page_state(struct zpdesc *zpdesc)
 	dec_zone_page_state(zpdesc_page(zpdesc), NR_ZSPAGES);
 }
 
-static inline struct zpdesc *alloc_zpdesc(gfp_t gfp)
+static inline struct zpdesc *alloc_zpdesc(gfp_t gfp, const int nid)
 {
-	struct page *page = alloc_page(gfp);
+	struct page *page = alloc_pages_node(nid, gfp, 0);
 
 	return page_zpdesc(page);
 }
@@ -462,9 +462,9 @@ static void zs_zpool_destroy(void *pool)
 }
 
 static int zs_zpool_malloc(void *pool, size_t size, gfp_t gfp,
-			unsigned long *handle)
+			unsigned long *handle, const int nid)
 {
-	*handle = zs_malloc(pool, size, gfp);
+	*handle = zs_malloc(pool, size, gfp, nid);
 
 	if (IS_ERR_VALUE(*handle))
 		return PTR_ERR((void *)*handle);
@@ -1044,7 +1044,7 @@ static void create_page_chain(struct size_class *class, struct zspage *zspage,
  */
 static struct zspage *alloc_zspage(struct zs_pool *pool,
 					struct size_class *class,
-					gfp_t gfp)
+					gfp_t gfp, const int nid)
 {
 	int i;
 	struct zpdesc *zpdescs[ZS_MAX_PAGES_PER_ZSPAGE];
@@ -1061,7 +1061,7 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
 	for (i = 0; i < class->pages_per_zspage; i++) {
 		struct zpdesc *zpdesc;
 
-		zpdesc = alloc_zpdesc(gfp);
+		zpdesc = alloc_zpdesc(gfp, nid);
 		if (!zpdesc) {
 			while (--i >= 0) {
 				zpdesc_dec_zone_page_state(zpdescs[i]);
@@ -1336,12 +1336,14 @@ static unsigned long obj_malloc(struct zs_pool *pool,
  * @pool: pool to allocate from
  * @size: size of block to allocate
  * @gfp: gfp flags when allocating object
+ * @nid: The preferred node id to allocate new zspage (if needed)
  *
  * On success, handle to the allocated object is returned,
  * otherwise an ERR_PTR().
  * Allocation requests with size > ZS_MAX_ALLOC_SIZE will fail.
  */
-unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
+unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp,
+				const int nid)
 {
 	unsigned long handle;
 	struct size_class *class;
@@ -1376,7 +1378,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
 
 	spin_unlock(&class->lock);
 
-	zspage = alloc_zspage(pool, class, gfp);
+	zspage = alloc_zspage(pool, class, gfp, nid);
 	if (!zspage) {
 		cache_free_handle(pool, handle);
 		return (unsigned long)ERR_PTR(-ENOMEM);
diff --git a/mm/zswap.c b/mm/zswap.c
index 204fb59da33c..455e9425c5f5 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -981,7 +981,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 
 	zpool = pool->zpool;
 	gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
-	alloc_ret = zpool_malloc(zpool, dlen, gfp, &handle);
+	alloc_ret = zpool_malloc(zpool, dlen, gfp, &handle, page_to_nid(page));
 	if (alloc_ret)
 		goto unlock;
 

base-commit: 8c65b3b82efb3b2f0d1b6e3b3e73c6f0fd367fb5
-- 
2.47.1
Re: [PATCH v2] zsmalloc: prefer the original page's node for compressed data
Posted by Johannes Weiner an hour ago
On Wed, Apr 02, 2025 at 01:44:16PM -0700, Nhat Pham wrote:
> Currently, zsmalloc, zswap's and zram's backend memory allocator, does
> not enforce any policy for the allocation of memory for the compressed
> data, instead just adopting the memory policy of the task entering
> reclaim, or the default policy (prefer local node) if no such policy is
> specified. This can lead to several pathological behaviors in
> multi-node NUMA systems:
> 
> 1. Systems with CXL-based memory tiering can encounter the following
>    inversion with zswap/zram: the coldest pages demoted to the CXL tier
>    can return to the high tier when they are reclaimed to compressed
>    swap, creating memory pressure on the high tier.
> 
> 2. Consider a direct reclaimer scanning nodes in order of allocation
>    preference. If it ventures into remote nodes, the memory it
>    compresses there should stay there. Trying to shift those contents
>    over to the reclaiming thread's preferred node further *increases*
>    its local pressure, and provoking more spills. The remote node is
>    also the most likely to refault this data again. This undesirable
>    behavior was pointed out by Johannes Weiner in [1].
> 
> 3. For zswap writeback, the zswap entries are organized in
>    node-specific LRUs, based on the node placement of the original
>    pages, allowing for targeted zswap writeback for specific nodes.
> 
>    However, the compressed data of a zswap entry can be placed on a
>    different node from the LRU it is placed on. This means that reclaim
>    targeted at one node might not free up memory used for zswap entries
>    in that node, but instead reclaiming memory in a different node.
> 
> All of these issues will be resolved if the compressed data go to the
> same node as the original page. This patch encourages this behavior by
> having zswap and zram pass the node of the original page to zsmalloc,
> and have zsmalloc prefer the specified node if we need to allocate new
> (zs)pages for the compressed data.
> 
> Note that we are not strictly binding the allocation to the preferred
> node. We still allow the allocation to fall back to other nodes when
> the preferred node is full, or if we have zspages with slots available
> on a different node. This is OK, and still a strict improvement over
> the status quo:
> 
> 1. On a system with demotion enabled, we will generally prefer
>    demotions over compressed swapping, and only swap when pages have
>    already gone to the lowest tier. This patch should achieve the
>    desired effect for the most part.
> 
> 2. If the preferred node is out of memory, letting the compressed data
>    going to other nodes can be better than the alternative (OOMs,
>    keeping cold memory unreclaimed, disk swapping, etc.).
> 
> 3. If the allocation go to a separate node because we have a zspage
>    with slots available, at least we're not creating extra immediate
>    memory pressure (since the space is already allocated).
> 
> 3. While there can be mixings, we generally reclaim pages in
>    same-node batches, which encourage zspage grouping that is more
>    likely to go to the right node.
> 
> 4. A strict binding would require partitioning zsmalloc by node, which
>    is more complicated, and more prone to regression, since it reduces
>    the storage density of zsmalloc. We need to evaluate the tradeoff
>    and benchmark carefully before adopting such an involved solution.
> 
> [1]: https://lore.kernel.org/linux-mm/20250331165306.GC2110528@cmpxchg.org/
> 
> Suggested-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Re: [PATCH v2] zsmalloc: prefer the original page's node for compressed data
Posted by Jonathan Cameron 2 hours ago
On Wed,  2 Apr 2025 13:44:16 -0700
Nhat Pham <nphamcs@gmail.com> wrote:

> Currently, zsmalloc, zswap's and zram's backend memory allocator, does
> not enforce any policy for the allocation of memory for the compressed
> data, instead just adopting the memory policy of the task entering
> reclaim, or the default policy (prefer local node) if no such policy is
> specified. This can lead to several pathological behaviors in
> multi-node NUMA systems:
> 
> 1. Systems with CXL-based memory tiering can encounter the following
>    inversion with zswap/zram: the coldest pages demoted to the CXL tier
>    can return to the high tier when they are reclaimed to compressed
>    swap, creating memory pressure on the high tier.
> 
> 2. Consider a direct reclaimer scanning nodes in order of allocation
>    preference. If it ventures into remote nodes, the memory it
>    compresses there should stay there. Trying to shift those contents
>    over to the reclaiming thread's preferred node further *increases*
>    its local pressure, and provoking more spills. The remote node is
>    also the most likely to refault this data again. This undesirable
>    behavior was pointed out by Johannes Weiner in [1].
> 
> 3. For zswap writeback, the zswap entries are organized in
>    node-specific LRUs, based on the node placement of the original
>    pages, allowing for targeted zswap writeback for specific nodes.
> 
>    However, the compressed data of a zswap entry can be placed on a
>    different node from the LRU it is placed on. This means that reclaim
>    targeted at one node might not free up memory used for zswap entries
>    in that node, but instead reclaiming memory in a different node.
> 
> All of these issues will be resolved if the compressed data go to the
> same node as the original page. This patch encourages this behavior by
> having zswap and zram pass the node of the original page to zsmalloc,
> and have zsmalloc prefer the specified node if we need to allocate new
> (zs)pages for the compressed data.
> 
> Note that we are not strictly binding the allocation to the preferred
> node. We still allow the allocation to fall back to other nodes when
> the preferred node is full, or if we have zspages with slots available
> on a different node. This is OK, and still a strict improvement over
> the status quo:
> 
> 1. On a system with demotion enabled, we will generally prefer
>    demotions over compressed swapping, and only swap when pages have
>    already gone to the lowest tier. This patch should achieve the
>    desired effect for the most part.
> 
> 2. If the preferred node is out of memory, letting the compressed data
>    going to other nodes can be better than the alternative (OOMs,
>    keeping cold memory unreclaimed, disk swapping, etc.).
> 
> 3. If the allocation go to a separate node because we have a zspage
>    with slots available, at least we're not creating extra immediate
>    memory pressure (since the space is already allocated).
> 
> 3. While there can be mixings, we generally reclaim pages in
>    same-node batches, which encourage zspage grouping that is more
>    likely to go to the right node.
> 
> 4. A strict binding would require partitioning zsmalloc by node, which
>    is more complicated, and more prone to regression, since it reduces
>    the storage density of zsmalloc. We need to evaluate the tradeoff
>    and benchmark carefully before adopting such an involved solution.
> 
> [1]: https://lore.kernel.org/linux-mm/20250331165306.GC2110528@cmpxchg.org/
> 
> Suggested-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Makes sense. The formatting suggestions in the other review would be nice to have, though.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Re: [PATCH v2] zsmalloc: prefer the original page's node for compressed data
Posted by Sergey Senozhatsky 14 hours ago
On (25/04/02 13:44), Nhat Pham wrote:
[..]
> @@ -1981,10 +1981,15 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
>  	 * We are holding per-CPU stream mutex and entry lock so better
>  	 * avoid direct reclaim.  Allocation error is not fatal since
>  	 * we still have the old object in the mem_pool.
> +	 *
> +	 * XXX: technically, the node we really want here is the node that holds
> +	 * the original compressed data. But that would require us to modify
> +	 * zsmalloc API to return this information. For now, we will make do with
> +	 * the node of the page allocated for recompression.
>  	 */
>  	handle_new = zs_malloc(zram->mem_pool, comp_len_new,
>  			       GFP_NOIO | __GFP_NOWARN |
> -			       __GFP_HIGHMEM | __GFP_MOVABLE);
> +			       __GFP_HIGHMEM | __GFP_MOVABLE, page_to_nid(page));

This works for me.  I saw in other threads (and sorry, I'm on a sick
leave now, can't reply fast) that "zram people need to fix it"/etc.
I guess I'm one of those zram people and I don't think I run any
multi-node NUMA systems, so that problem probably has never occurred
to me.  We can add a simple API extension to lookup node-id by
zsmalloc handle, if anyone really needs it, but I'm okay with the
status quo (and XXX) for the time being.

[..]
> -unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
> +unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp,
> +				const int nid)
>  {

I'm not the zspool maintainer, but if I may ask, for new zsmalloc code my
preference is to follow the kernel styles (yes, the old code doesn't follow
any at all, whenever I touch the old code I fix it).

Do you mind if we do something like below? (at least for zsmalloc)

With this
Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org> #zram and zsmalloc


---

diff --git a/include/linux/zpool.h b/include/linux/zpool.h
index 697525cb00bd..369ef068fad8 100644
--- a/include/linux/zpool.h
+++ b/include/linux/zpool.h
@@ -22,7 +22,7 @@ const char *zpool_get_type(struct zpool *pool);
 void zpool_destroy_pool(struct zpool *pool);
 
 int zpool_malloc(struct zpool *pool, size_t size, gfp_t gfp,
-			unsigned long *handle, const int nid);
+		 unsigned long *handle, const int nid);
 
 void zpool_free(struct zpool *pool, unsigned long handle);
 
@@ -64,7 +64,7 @@ struct zpool_driver {
 	void (*destroy)(void *pool);
 
 	int (*malloc)(void *pool, size_t size, gfp_t gfp,
-				unsigned long *handle, const int nid);
+		      unsigned long *handle, const int nid);
 	void (*free)(void *pool, unsigned long handle);
 
 	void *(*obj_read_begin)(void *pool, unsigned long handle,
diff --git a/mm/zpool.c b/mm/zpool.c
index b99a7c03e735..0a71d03369f1 100644
--- a/mm/zpool.c
+++ b/mm/zpool.c
@@ -239,7 +239,7 @@ const char *zpool_get_type(struct zpool *zpool)
  * Returns: 0 on success, negative value on error.
  */
 int zpool_malloc(struct zpool *zpool, size_t size, gfp_t gfp,
-			unsigned long *handle, const int nid)
+		 unsigned long *handle, const int nid)
 {
 	return zpool->driver->malloc(zpool->pool, size, gfp, handle, nid);
 }
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index daa16ce4cf07..70406ac94bbd 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -452,7 +452,7 @@ static void zs_zpool_destroy(void *pool)
 }
 
 static int zs_zpool_malloc(void *pool, size_t size, gfp_t gfp,
-			unsigned long *handle, const int nid)
+			   unsigned long *handle, const int nid)
 {
 	*handle = zs_malloc(pool, size, gfp, nid);
 
@@ -1033,8 +1033,8 @@ static void create_page_chain(struct size_class *class, struct zspage *zspage,
  * Allocate a zspage for the given size class
  */
 static struct zspage *alloc_zspage(struct zs_pool *pool,
-					struct size_class *class,
-					gfp_t gfp, const int nid)
+				   struct size_class *class,
+				   gfp_t gfp, const int nid)
 {
 	int i;
 	struct zpdesc *zpdescs[ZS_MAX_PAGES_PER_ZSPAGE];
@@ -1333,7 +1333,7 @@ static unsigned long obj_malloc(struct zs_pool *pool,
  * Allocation requests with size > ZS_MAX_ALLOC_SIZE will fail.
  */
 unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp,
-				const int nid)
+			const int nid)
 {
 	unsigned long handle;
 	struct size_class *class;
Re: [PATCH v2] zsmalloc: prefer the original page's node for compressed data
Posted by Nhat Pham 3 hours ago
On Wed, Apr 2, 2025 at 8:55 PM Sergey Senozhatsky
<senozhatsky@chromium.org> wrote:
>
> On (25/04/02 13:44), Nhat Pham wrote:
> [..]
> > @@ -1981,10 +1981,15 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
> >        * We are holding per-CPU stream mutex and entry lock so better
> >        * avoid direct reclaim.  Allocation error is not fatal since
> >        * we still have the old object in the mem_pool.
> > +      *
> > +      * XXX: technically, the node we really want here is the node that holds
> > +      * the original compressed data. But that would require us to modify
> > +      * zsmalloc API to return this information. For now, we will make do with
> > +      * the node of the page allocated for recompression.
> >        */
> >       handle_new = zs_malloc(zram->mem_pool, comp_len_new,
> >                              GFP_NOIO | __GFP_NOWARN |
> > -                            __GFP_HIGHMEM | __GFP_MOVABLE);
> > +                            __GFP_HIGHMEM | __GFP_MOVABLE, page_to_nid(page));
>
> This works for me.  I saw in other threads (and sorry, I'm on a sick
> leave now, can't reply fast) that "zram people need to fix it"/etc.
> I guess I'm one of those zram people and I don't think I run any
> multi-node NUMA systems, so that problem probably has never occurred
> to me.  We can add a simple API extension to lookup node-id by
> zsmalloc handle, if anyone really needs it, but I'm okay with the
> status quo (and XXX) for the time being.

Yeah tbh this path is not that much different from the status quo
(page is also allocated via the task's policy here), and all the other
paths are fixed, so this seems to be an overall improvement still :)

If someone uses recompress + multi node, we can fix it then. It's not
impossible, just more code :)
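
Just to make that concrete, here is a rough sketch of the "more code"
(purely hypothetical - zs_get_node() does not exist today; the
zs_malloc() call is the one from this patch, and zram_get_handle() is
the existing accessor for the old handle):

	/* hypothetical zsmalloc helper: report the node backing a handle */
	int zs_get_node(struct zs_pool *pool, unsigned long handle);

	/* recompress_slot() could then keep the data on its current node */
	handle_new = zs_malloc(zram->mem_pool, comp_len_new,
			       GFP_NOIO | __GFP_NOWARN |
			       __GFP_HIGHMEM | __GFP_MOVABLE,
			       zs_get_node(zram->mem_pool,
					   zram_get_handle(zram, index)));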

>
> [..]
> > -unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
> > +unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp,
> > +                             const int nid)
> >  {
>
> I'm not the zspool maintainer, but if I may ask, for new zsmalloc code my
> preference is to follow the kernel styles (yes, the old code doesn't follow
> any at all, whenever I touch the old code I fix it).
>
> Do you mind if we do something like below? (at least for zsmalloc)

I don't mind at all! Thanks for fixing it, Sergey.

I saw that Andrew already added this as a fixlet. Thanks, Andrew!

>
> With this
> Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org> #zram and zsmalloc
>
>
> ---
>
> diff --git a/include/linux/zpool.h b/include/linux/zpool.h
> index 697525cb00bd..369ef068fad8 100644
> --- a/include/linux/zpool.h
> +++ b/include/linux/zpool.h
> @@ -22,7 +22,7 @@ const char *zpool_get_type(struct zpool *pool);
>  void zpool_destroy_pool(struct zpool *pool);
>
>  int zpool_malloc(struct zpool *pool, size_t size, gfp_t gfp,
> -                       unsigned long *handle, const int nid);
> +                unsigned long *handle, const int nid);
>
>  void zpool_free(struct zpool *pool, unsigned long handle);
>
> @@ -64,7 +64,7 @@ struct zpool_driver {
>         void (*destroy)(void *pool);
>
>         int (*malloc)(void *pool, size_t size, gfp_t gfp,
> -                               unsigned long *handle, const int nid);
> +                     unsigned long *handle, const int nid);
>         void (*free)(void *pool, unsigned long handle);
>
>         void *(*obj_read_begin)(void *pool, unsigned long handle,
> diff --git a/mm/zpool.c b/mm/zpool.c
> index b99a7c03e735..0a71d03369f1 100644
> --- a/mm/zpool.c
> +++ b/mm/zpool.c
> @@ -239,7 +239,7 @@ const char *zpool_get_type(struct zpool *zpool)
>   * Returns: 0 on success, negative value on error.
>   */
>  int zpool_malloc(struct zpool *zpool, size_t size, gfp_t gfp,
> -                       unsigned long *handle, const int nid)
> +                unsigned long *handle, const int nid)
>  {
>         return zpool->driver->malloc(zpool->pool, size, gfp, handle, nid);
>  }
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index daa16ce4cf07..70406ac94bbd 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -452,7 +452,7 @@ static void zs_zpool_destroy(void *pool)
>  }
>
>  static int zs_zpool_malloc(void *pool, size_t size, gfp_t gfp,
> -                       unsigned long *handle, const int nid)
> +                          unsigned long *handle, const int nid)
>  {
>         *handle = zs_malloc(pool, size, gfp, nid);
>
> @@ -1033,8 +1033,8 @@ static void create_page_chain(struct size_class *class, struct zspage *zspage,
>   * Allocate a zspage for the given size class
>   */
>  static struct zspage *alloc_zspage(struct zs_pool *pool,
> -                                       struct size_class *class,
> -                                       gfp_t gfp, const int nid)
> +                                  struct size_class *class,
> +                                  gfp_t gfp, const int nid)
>  {
>         int i;
>         struct zpdesc *zpdescs[ZS_MAX_PAGES_PER_ZSPAGE];
> @@ -1333,7 +1333,7 @@ static unsigned long obj_malloc(struct zs_pool *pool,
>   * Allocation requests with size > ZS_MAX_ALLOC_SIZE will fail.
>   */
>  unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp,
> -                               const int nid)
> +                       const int nid)
>  {
>         unsigned long handle;
>         struct size_class *class;
Re: [PATCH v2] zsmalloc: prefer the original page's node for compressed data
Posted by Chengming Zhou 15 hours ago
On 2025/4/3 04:44, Nhat Pham wrote:
> Currently, zsmalloc, zswap's and zram's backend memory allocator, does
> not enforce any policy for the allocation of memory for the compressed
> data, instead just adopting the memory policy of the task entering
> reclaim, or the default policy (prefer local node) if no such policy is
> specified. This can lead to several pathological behaviors in
> multi-node NUMA systems:
> 
> 1. Systems with CXL-based memory tiering can encounter the following
>     inversion with zswap/zram: the coldest pages demoted to the CXL tier
>     can return to the high tier when they are reclaimed to compressed
>     swap, creating memory pressure on the high tier.
> 
> 2. Consider a direct reclaimer scanning nodes in order of allocation
>     preference. If it ventures into remote nodes, the memory it
>     compresses there should stay there. Trying to shift those contents
>     over to the reclaiming thread's preferred node further *increases*
>     its local pressure, and provoking more spills. The remote node is
>     also the most likely to refault this data again. This undesirable
>     behavior was pointed out by Johannes Weiner in [1].
> 
> 3. For zswap writeback, the zswap entries are organized in
>     node-specific LRUs, based on the node placement of the original
>     pages, allowing for targeted zswap writeback for specific nodes.
> 
>     However, the compressed data of a zswap entry can be placed on a
>     different node from the LRU it is placed on. This means that reclaim
>     targeted at one node might not free up memory used for zswap entries
>     in that node, but instead reclaiming memory in a different node.
> 
> All of these issues will be resolved if the compressed data go to the
> same node as the original page. This patch encourages this behavior by
> having zswap and zram pass the node of the original page to zsmalloc,
> and have zsmalloc prefer the specified node if we need to allocate new
> (zs)pages for the compressed data.
> 
> Note that we are not strictly binding the allocation to the preferred
> node. We still allow the allocation to fall back to other nodes when
> the preferred node is full, or if we have zspages with slots available
> on a different node. This is OK, and still a strict improvement over
> the status quo:
> 
> 1. On a system with demotion enabled, we will generally prefer
>     demotions over compressed swapping, and only swap when pages have
>     already gone to the lowest tier. This patch should achieve the
>     desired effect for the most part.
> 
> 2. If the preferred node is out of memory, letting the compressed data
>     going to other nodes can be better than the alternative (OOMs,
>     keeping cold memory unreclaimed, disk swapping, etc.).
> 
> 3. If the allocation go to a separate node because we have a zspage
>     with slots available, at least we're not creating extra immediate
>     memory pressure (since the space is already allocated).
> 
> 3. While there can be mixings, we generally reclaim pages in
>     same-node batches, which encourage zspage grouping that is more
>     likely to go to the right node.
> 
> 4. A strict binding would require partitioning zsmalloc by node, which
>     is more complicated, and more prone to regression, since it reduces
>     the storage density of zsmalloc. We need to evaluate the tradeoff
>     and benchmark carefully before adopting such an involved solution.
> 
> [1]: https://lore.kernel.org/linux-mm/20250331165306.GC2110528@cmpxchg.org/
> 
> Suggested-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>

Looks good to me:

Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>

Thanks!

> ---
>   drivers/block/zram/zram_drv.c | 11 ++++++++---
>   include/linux/zpool.h         |  4 ++--
>   include/linux/zsmalloc.h      |  3 ++-
>   mm/zpool.c                    |  8 +++++---
>   mm/zsmalloc.c                 | 18 ++++++++++--------
>   mm/zswap.c                    |  2 +-
>   6 files changed, 28 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index fda7d8624889..0ba18277ed7b 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -1694,7 +1694,7 @@ static int write_incompressible_page(struct zram *zram, struct page *page,
>   	 */
>   	handle = zs_malloc(zram->mem_pool, PAGE_SIZE,
>   			   GFP_NOIO | __GFP_NOWARN |
> -			   __GFP_HIGHMEM | __GFP_MOVABLE);
> +			   __GFP_HIGHMEM | __GFP_MOVABLE, page_to_nid(page));
>   	if (IS_ERR_VALUE(handle))
>   		return PTR_ERR((void *)handle);
>   
> @@ -1761,7 +1761,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
>   
>   	handle = zs_malloc(zram->mem_pool, comp_len,
>   			   GFP_NOIO | __GFP_NOWARN |
> -			   __GFP_HIGHMEM | __GFP_MOVABLE);
> +			   __GFP_HIGHMEM | __GFP_MOVABLE, page_to_nid(page));
>   	if (IS_ERR_VALUE(handle)) {
>   		zcomp_stream_put(zstrm);
>   		return PTR_ERR((void *)handle);
> @@ -1981,10 +1981,15 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
>   	 * We are holding per-CPU stream mutex and entry lock so better
>   	 * avoid direct reclaim.  Allocation error is not fatal since
>   	 * we still have the old object in the mem_pool.
> +	 *
> +	 * XXX: technically, the node we really want here is the node that holds
> +	 * the original compressed data. But that would require us to modify
> +	 * zsmalloc API to return this information. For now, we will make do with
> +	 * the node of the page allocated for recompression.
>   	 */
>   	handle_new = zs_malloc(zram->mem_pool, comp_len_new,
>   			       GFP_NOIO | __GFP_NOWARN |
> -			       __GFP_HIGHMEM | __GFP_MOVABLE);
> +			       __GFP_HIGHMEM | __GFP_MOVABLE, page_to_nid(page));
>   	if (IS_ERR_VALUE(handle_new)) {
>   		zcomp_stream_put(zstrm);
>   		return PTR_ERR((void *)handle_new);
> diff --git a/include/linux/zpool.h b/include/linux/zpool.h
> index 52f30e526607..697525cb00bd 100644
> --- a/include/linux/zpool.h
> +++ b/include/linux/zpool.h
> @@ -22,7 +22,7 @@ const char *zpool_get_type(struct zpool *pool);
>   void zpool_destroy_pool(struct zpool *pool);
>   
>   int zpool_malloc(struct zpool *pool, size_t size, gfp_t gfp,
> -			unsigned long *handle);
> +			unsigned long *handle, const int nid);
>   
>   void zpool_free(struct zpool *pool, unsigned long handle);
>   
> @@ -64,7 +64,7 @@ struct zpool_driver {
>   	void (*destroy)(void *pool);
>   
>   	int (*malloc)(void *pool, size_t size, gfp_t gfp,
> -				unsigned long *handle);
> +				unsigned long *handle, const int nid);
>   	void (*free)(void *pool, unsigned long handle);
>   
>   	void *(*obj_read_begin)(void *pool, unsigned long handle,
> diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
> index c26baf9fb331..13e9cc5490f7 100644
> --- a/include/linux/zsmalloc.h
> +++ b/include/linux/zsmalloc.h
> @@ -26,7 +26,8 @@ struct zs_pool;
>   struct zs_pool *zs_create_pool(const char *name);
>   void zs_destroy_pool(struct zs_pool *pool);
>   
> -unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t flags);
> +unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t flags,
> +			const int nid);
>   void zs_free(struct zs_pool *pool, unsigned long obj);
>   
>   size_t zs_huge_class_size(struct zs_pool *pool);
> diff --git a/mm/zpool.c b/mm/zpool.c
> index 6d6d88930932..b99a7c03e735 100644
> --- a/mm/zpool.c
> +++ b/mm/zpool.c
> @@ -226,20 +226,22 @@ const char *zpool_get_type(struct zpool *zpool)
>    * @size:	The amount of memory to allocate.
>    * @gfp:	The GFP flags to use when allocating memory.
>    * @handle:	Pointer to the handle to set
> + * @nid:	The preferred node id.
>    *
>    * This allocates the requested amount of memory from the pool.
>    * The gfp flags will be used when allocating memory, if the
>    * implementation supports it.  The provided @handle will be
> - * set to the allocated object handle.
> + * set to the allocated object handle. The allocation will
> + * prefer the NUMA node specified by @nid.
>    *
>    * Implementations must guarantee this to be thread-safe.
>    *
>    * Returns: 0 on success, negative value on error.
>    */
>   int zpool_malloc(struct zpool *zpool, size_t size, gfp_t gfp,
> -			unsigned long *handle)
> +			unsigned long *handle, const int nid)
>   {
> -	return zpool->driver->malloc(zpool->pool, size, gfp, handle);
> +	return zpool->driver->malloc(zpool->pool, size, gfp, handle, nid);
>   }
>   
>   /**
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 961b270f023c..8ba6cdf222ac 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -243,9 +243,9 @@ static inline void zpdesc_dec_zone_page_state(struct zpdesc *zpdesc)
>   	dec_zone_page_state(zpdesc_page(zpdesc), NR_ZSPAGES);
>   }
>   
> -static inline struct zpdesc *alloc_zpdesc(gfp_t gfp)
> +static inline struct zpdesc *alloc_zpdesc(gfp_t gfp, const int nid)
>   {
> -	struct page *page = alloc_page(gfp);
> +	struct page *page = alloc_pages_node(nid, gfp, 0);
>   
>   	return page_zpdesc(page);
>   }
> @@ -462,9 +462,9 @@ static void zs_zpool_destroy(void *pool)
>   }
>   
>   static int zs_zpool_malloc(void *pool, size_t size, gfp_t gfp,
> -			unsigned long *handle)
> +			unsigned long *handle, const int nid)
>   {
> -	*handle = zs_malloc(pool, size, gfp);
> +	*handle = zs_malloc(pool, size, gfp, nid);
>   
>   	if (IS_ERR_VALUE(*handle))
>   		return PTR_ERR((void *)*handle);
> @@ -1044,7 +1044,7 @@ static void create_page_chain(struct size_class *class, struct zspage *zspage,
>    */
>   static struct zspage *alloc_zspage(struct zs_pool *pool,
>   					struct size_class *class,
> -					gfp_t gfp)
> +					gfp_t gfp, const int nid)
>   {
>   	int i;
>   	struct zpdesc *zpdescs[ZS_MAX_PAGES_PER_ZSPAGE];
> @@ -1061,7 +1061,7 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
>   	for (i = 0; i < class->pages_per_zspage; i++) {
>   		struct zpdesc *zpdesc;
>   
> -		zpdesc = alloc_zpdesc(gfp);
> +		zpdesc = alloc_zpdesc(gfp, nid);
>   		if (!zpdesc) {
>   			while (--i >= 0) {
>   				zpdesc_dec_zone_page_state(zpdescs[i]);
> @@ -1336,12 +1336,14 @@ static unsigned long obj_malloc(struct zs_pool *pool,
>    * @pool: pool to allocate from
>    * @size: size of block to allocate
>    * @gfp: gfp flags when allocating object
> + * @nid: The preferred node id to allocate new zspage (if needed)
>    *
>    * On success, handle to the allocated object is returned,
>    * otherwise an ERR_PTR().
>    * Allocation requests with size > ZS_MAX_ALLOC_SIZE will fail.
>    */
> -unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
> +unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp,
> +				const int nid)
>   {
>   	unsigned long handle;
>   	struct size_class *class;
> @@ -1376,7 +1378,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
>   
>   	spin_unlock(&class->lock);
>   
> -	zspage = alloc_zspage(pool, class, gfp);
> +	zspage = alloc_zspage(pool, class, gfp, nid);
>   	if (!zspage) {
>   		cache_free_handle(pool, handle);
>   		return (unsigned long)ERR_PTR(-ENOMEM);
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 204fb59da33c..455e9425c5f5 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -981,7 +981,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>   
>   	zpool = pool->zpool;
>   	gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
> -	alloc_ret = zpool_malloc(zpool, dlen, gfp, &handle);
> +	alloc_ret = zpool_malloc(zpool, dlen, gfp, &handle, page_to_nid(page));
>   	if (alloc_ret)
>   		goto unlock;
>   
> 
> base-commit: 8c65b3b82efb3b2f0d1b6e3b3e73c6f0fd367fb5
Re: [PATCH v2] zsmalloc: prefer the original page's node for compressed data
Posted by Dan Williams 20 hours ago
Nhat Pham wrote:
> Currently, zsmalloc, zswap's and zram's backend memory allocator, does
> not enforce any policy for the allocation of memory for the compressed
> data, instead just adopting the memory policy of the task entering
> reclaim, or the default policy (prefer local node) if no such policy is
> specified. This can lead to several pathological behaviors in
> multi-node NUMA systems:
> 
> 1. Systems with CXL-based memory tiering can encounter the following
>    inversion with zswap/zram: the coldest pages demoted to the CXL tier
>    can return to the high tier when they are reclaimed to compressed
>    swap, creating memory pressure on the high tier.
> 
> 2. Consider a direct reclaimer scanning nodes in order of allocation
>    preference. If it ventures into remote nodes, the memory it
>    compresses there should stay there. Trying to shift those contents
>    over to the reclaiming thread's preferred node further *increases*
>    its local pressure, and provoking more spills. The remote node is
>    also the most likely to refault this data again. This undesirable
>    behavior was pointed out by Johannes Weiner in [1].
> 
> 3. For zswap writeback, the zswap entries are organized in
>    node-specific LRUs, based on the node placement of the original
>    pages, allowing for targeted zswap writeback for specific nodes.
> 
>    However, the compressed data of a zswap entry can be placed on a
>    different node from the LRU it is placed on. This means that reclaim
>    targeted at one node might not free up memory used for zswap entries
>    in that node, but instead reclaiming memory in a different node.
> 
> All of these issues will be resolved if the compressed data go to the
> same node as the original page. This patch encourages this behavior by
> having zswap and zram pass the node of the original page to zsmalloc,
> and have zsmalloc prefer the specified node if we need to allocate new
> (zs)pages for the compressed data.
> 
> Note that we are not strictly binding the allocation to the preferred
> node. We still allow the allocation to fall back to other nodes when
> the preferred node is full, or if we have zspages with slots available
> on a different node. This is OK, and still a strict improvement over
> the status quo:
> 
> 1. On a system with demotion enabled, we will generally prefer
>    demotions over compressed swapping, and only swap when pages have
>    already gone to the lowest tier. This patch should achieve the
>    desired effect for the most part.
> 
> 2. If the preferred node is out of memory, letting the compressed data
>    going to other nodes can be better than the alternative (OOMs,
>    keeping cold memory unreclaimed, disk swapping, etc.).
> 
> 3. If the allocation go to a separate node because we have a zspage
>    with slots available, at least we're not creating extra immediate
>    memory pressure (since the space is already allocated).
> 
> 3. While there can be mixings, we generally reclaim pages in
>    same-node batches, which encourage zspage grouping that is more
>    likely to go to the right node.
> 
> 4. A strict binding would require partitioning zsmalloc by node, which
>    is more complicated, and more prone to regression, since it reduces
>    the storage density of zsmalloc. We need to evaluate the tradeoff
>    and benchmark carefully before adopting such an involved solution.
> 
> [1]: https://lore.kernel.org/linux-mm/20250331165306.GC2110528@cmpxchg.org/
> 
> Suggested-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>
[..]
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 961b270f023c..8ba6cdf222ac 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -243,9 +243,9 @@ static inline void zpdesc_dec_zone_page_state(struct zpdesc *zpdesc)
>  	dec_zone_page_state(zpdesc_page(zpdesc), NR_ZSPAGES);
>  }
>  
> -static inline struct zpdesc *alloc_zpdesc(gfp_t gfp)
> +static inline struct zpdesc *alloc_zpdesc(gfp_t gfp, const int nid)
>  {
> -	struct page *page = alloc_page(gfp);
> +	struct page *page = alloc_pages_node(nid, gfp, 0);

Why do the work to pass in @nid only to hardcode 0 to
alloc_pages_node()?
Re: [PATCH v2] zsmalloc: prefer the original page's node for compressed data
Posted by Nhat Pham 20 hours ago
On Wed, Apr 2, 2025 at 2:41 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> Why do the work to pass in @nid only to hardcode 0 to
> alloc_pages_node()?

That 0 is the order, i.e we want a single page here :)

The node id is the first argument of alloc_pages_node. I made the same
mistake in one of the earlier versions of this patch series (which,
fortunately I did not send out) - hopefully this time I'm correct :)
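
For reference, the two helpers differ in where the gfp flags sit -
roughly (simplified, not the exact prototypes from include/linux/gfp.h):

	/* order-0 page, placed according to the current memory policy */
	struct page *alloc_page(gfp_t gfp);

	/* explicit preferred node id, plus an explicit order */
	struct page *alloc_pages_node(int nid, gfp_t gfp, unsigned int order);

so alloc_zpdesc() now asks for an order-0 page on @nid:

	struct page *page = alloc_pages_node(nid, gfp, 0);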
Re: [PATCH v2] zsmalloc: prefer the original page's node for compressed data
Posted by Dan Williams 19 hours ago
Nhat Pham wrote:
> On Wed, Apr 2, 2025 at 2:41 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > Why do the work to pass in @nid only to hardcode 0 to
> > alloc_pages_node()?
> 
> That 0 is the order, i.e we want a single page here :)
> 
> The node id is the first argument of alloc_pages_node. I made the same
> mistake in one of the earlier versions of this patch series (which,
> fortunately I did not send out) - hopefully this time I'm correct :)

Oops, sorry for the noise. My eyes were so trained on @nid as the final
argument in all the other conversions that I overlooked that.

You can add:

Acked-by: Dan Williams <dan.j.williams@intel.com>