INTRODUCTION
============
The current design for zswap and zsmalloc leaves a clean divide between
layers of the memory stack. At the higher level, we have zswap, which
interacts directly with memory consumers, compression algorithms, and
handles memory usage accounting via memcg limits. At the lower level,
we have zsmalloc, which handles the page allocation and migration of
physical pages.
While this logical separation simplifies the codebase, it creates
problems for accounting that requires both memory-cgroup awareness and
knowledge of physical memory location. To name a few:
- On tiered systems, it is impossible to understand how much toptier
memory a cgroup is using, since zswap has no understanding of where
the compressed memory is physically stored.
+ With SeongJae Park's work to store incompressible pages as-is in
zswap [1], the size of compressed memory can become non-trivial,
and easily consume a meaningful portion of memory.
- cgroups that restrict memory nodes have no control over which nodes
their zswapped objects live on. This can lead to unexpectedly high
fault times for workloads, which must absorb the remote access latency
cost of retrieving the compressed object from a remote node.
+ Nhat Pham addressed this issue via a best-effort attempt to place
compressed objects on the same node as the original page, but this
cannot guarantee complete isolation [2].
- On the flip side, zsmalloc's ignorance of cgroups also makes its
shrinker memcg-unaware, which can lead to ineffective reclaim when
pressure is localized to a single cgroup.
Until recently, zpool acted as another layer of indirection between
zswap and zsmalloc, which made bridging memcg and physical location
difficult. Now that zsmalloc is the only allocator backend for zswap and
zram [3], it is possible to move memory-cgroup accounting to the
zsmalloc layer.
Introduce a new per-zpdesc array of objcg pointers to track
per-memcg-lruvec memory usage by zswap, while leaving zram users
unaffected.
This creates one source of truth for NR_ZSWAP, and more accurate
accounting for NR_ZSWAPPED.
This brings sizeof(struct zpdesc) from 56 bytes to 64 bytes, but this
increase in size is unseen by the rest of the system because zpdesc
overlays struct page. Implementation details and care taken to handle
the page->memcg_data field can be found in patch 3.
In addition, move the accounting of memcg charges to the zsmalloc layer,
whose only user is zswap at the moment.
PATCH OUTLINE
=============
Patches 1 and 2 are small cleanups that make the codebase consistent and
easier to digest.
Patches 3, 4, and 5 allocate and populate the new zpdesc->objcgs field
with compressed objects' obj_cgroups. zswap_entry->objcg is removed, and
its users are redirected to look at the zspage for memcg information.
Patch 6 moves the charging and lifetime management of obj_cgroups to
the zsmalloc layer, which leaves zswap only as a plumbing layer to hand
cgroup information to zsmalloc.
Patches 7 and 8 introduce node counters and memcg-lruvec counters for
zswap. Special care is taken for compressed objects that span multiple
nodes.
[1] https://lore.kernel.org/linux-mm/20250822190817.49287-1-sj@kernel.org/
[2] https://lore.kernel.org/linux-mm/20250402204416.3435994-1-nphamcs@gmail.com/#t3
[3] https://lore.kernel.org/linux-mm/20250829162212.208258-1-hannes@cmpxchg.org/
[4] https://lore.kernel.org/linux-mm/c8bc2dce-d4ec-c16e-8df4-2624c48cfc06@google.com/
Joshua Hahn (8):
mm/zsmalloc: Rename zs_object_copy to zs_obj_copy
mm/zsmalloc: Make all obj_idx unsigned ints
mm/zsmalloc: Introduce objcgs pointer in struct zpdesc
mm/zsmalloc: Store obj_cgroup pointer in zpdesc
  mm/zsmalloc,zswap: Redirect zswap_entry->objcg to zpdesc
mm/zsmalloc, zswap: Handle objcg charging and lifetime in zsmalloc
mm/memcontrol: Track MEMCG_ZSWAPPED in bytes
mm/vmstat, memcontrol: Track ZSWAP_B, ZSWAPPED_B per-memcg-lruvec
drivers/block/zram/zram_drv.c | 17 +-
include/linux/memcontrol.h | 15 +-
include/linux/mmzone.h | 2 +
include/linux/zsmalloc.h | 6 +-
mm/memcontrol.c | 68 ++------
mm/vmstat.c | 2 +
mm/zpdesc.h | 25 ++-
mm/zsmalloc.c | 282 ++++++++++++++++++++++++++++++++--
mm/zswap.c | 67 ++++----
9 files changed, 345 insertions(+), 139 deletions(-)
--
2.47.3
On Thu, Feb 26, 2026 at 11:29 AM Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
>
> INTRODUCTION
> ============

[...snip...]

I might have missed it, and this might be in one of the latter patches,
but could you also add some quick and dirty benchmarks for zswap, to
ensure there are no (or only minimal) performance implications? IIUC
there is a small amount of extra overhead in certain steps, because we
have to go through zsmalloc to query the objcg. Usemem or a kernel build
should suffice IMHO.

To be clear, I don't anticipate any observable performance change, but
it's a good sanity check :) Besides, can't be too careful with stress
testing stuff :P
On Mon, 2 Mar 2026 13:31:32 -0800 Nhat Pham <nphamcs@gmail.com> wrote:

> On Thu, Feb 26, 2026 at 11:29 AM Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

[...snip...]

> > Introduce a new per-zpdesc array of objcg pointers to track
> > per-memcg-lruvec memory usage by zswap, while leaving zram users
> > unaffected.

[...snip...]

Hi Nhat! I hope you are doing well :-) Thank you for taking a look!

> I might have missed it and this might be in one of the latter patches,
> but could also add some quick and dirty benchmark for zswap to ensure
> there's no or minimal performance implications? IIUC there is a small
> amount of extra overhead in certain steps, because we have to go
> through zsmalloc to query objcg. Usemem or kernel build should suffice
> IMHO.

Yup, this was one of my concerns too. I tried to do a somewhat
comprehensive analysis below; hopefully it gives a good picture of what
is happening. Spoilers: there do not seem to be any significant
regressions (< 1%), and any regressions are within a small fraction of
the standard deviation.

One thing I have noticed is that there is a tangible reduction in
standard deviation for some of these benchmarks. I can't exactly
pinpoint why this is happening, but I'll take it as a win :p

> To be clear, I don't anticipate any observable performance change, but
> it's a good sanity check :) Besides, can't be too careful with stress
> testing stuff :P

For sure. I should have done these and included them in the original
RFC, but I think I might have been too eager to get the RFC out :-)
I will include them in the second version of the series!

All the experiments below were done on a 2-node NUMA system. The data is
quite compressible, which I think makes sense for measuring the overhead
of accounting.

Benchmark 1
Allocating 2G of memory on one node with a 1G memory.high.
Average across 10 trials.
+-------------------------+---------+----------+
|                         | average | stddev   |
+-------------------------+---------+----------+
| Baseline (11439c4635ed) | 8887.82 | 362.40   |
| Baseline + Series       | 8944.16 | 356.45   |
+-------------------------+---------+----------+
| Delta                   | +0.634% | -1.642%  |
+-------------------------+---------+----------+

Benchmark 2
Allocating 2G of memory on one node with a 1G memory.high, churning 5x
through the memory. Average across 5 trials.
+-------------------------+----------+----------+
|                         | average  | stddev   |
+-------------------------+----------+----------+
| Baseline (11439c4635ed) | 31152.96 | 166.23   |
| Baseline + Series       | 31355.28 | 64.86    |
+-------------------------+----------+----------+
| Delta                   | +0.649%  | -60.981% |
+-------------------------+----------+----------+

Benchmark 3
Allocating 2G of memory with a 1G memory.high, split across 2 nodes.
Average across 5 trials.
+-------------------------+---------+----------+
|                         | average | stddev   |
+-------------------------+---------+----------+
| Baseline (11439c4635ed) | 16101.6 | 174.18   |
| Baseline + Series       | 16022.4 | 117.17   |
+-------------------------+---------+----------+
| Delta                   | -0.492% | -32.731% |
+-------------------------+---------+----------+

Benchmark 4
Reading stat files 10000 times under memory pressure.

memory.stat
+-------------------------+---------+----------+
|                         | average | stddev   |
+-------------------------+---------+----------+
| Baseline (11439c4635ed) | 24524.4 | 501.7    |
| Baseline + Series       | 24807.2 | 444.53   |
+-------------------------+---------+----------+
| Delta                   | +1.153% | -11.395% |
+-------------------------+---------+----------+

memory.numa_stat
+-------------------------+---------+----------+
|                         | average | stddev   |
+-------------------------+---------+----------+
| Baseline (11439c4635ed) | 24807.2 | 444.53   |
| Baseline + Series       | 23837.6 | 521.68   |
+-------------------------+---------+----------+
| Delta                   | -3.905% | +17.355% |
+-------------------------+---------+----------+

/proc/vmstat
+-------------------------+---------+----------+
|                         | average | stddev   |
+-------------------------+---------+----------+
| Baseline (11439c4635ed) | 24793.6 | 285.26   |
| Baseline + Series       | 23815.6 | 553.44   |
+-------------------------+---------+----------+
| Delta                   | -3.945% | +94.012% |
+-------------------------+---------+----------+

^^^ A big increase in standard deviation here, although there is some
decrease in the average time. Probably the most notable change I have
seen from this series.

node0/vmstat
+-------------------------+---------+----------+
|                         | average | stddev   |
+-------------------------+---------+----------+
| Baseline (11439c4635ed) | 24541.4 | 281.41   |
| Baseline + Series       | 24479   | 241.29   |
+-------------------------+---------+----------+
| Delta                   | -0.254% | -14.257% |
+-------------------------+---------+----------+

Lots of testing results. I think they are mostly negligible in terms of
averages, but there are some non-negligible changes in standard
deviation, going in both directions. I don't see anything too concerning
off the top of my head, but for the next version I'll try to do some
more testing across different machines as well (I don't have any
machines with > 2 nodes, but maybe I can do some tests on QEMU just to
sanity check).

Thanks again, Nhat. Have a great day!
Joshua
On Tue, Mar 3, 2026 at 9:51 AM Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

[...snip...]

Sounds like any meagre performance difference is smaller than noise :P

If it's this negligible on these microbenchmarks, I think it will be
infinitesimal in production workloads, where these operations are a very
small part. That kind of makes sense, because objcg access only happens
in a small subset of operations: zswap entry store and zswap entry free,
each of which can only happen once per zswap entry.

I think we're fine, but I'll leave it to other reviewers to comment as
well.