Swap freeing can be expensive when unmapping a VMA containing
many swap entries. This has been reported to significantly
delay memory reclamation during Android's low-memory killing,
especially when multiple processes are terminated to free
memory, with slot_free() accounting for more than 80% of
the total cost of freeing swap entries.

Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
to asynchronously collect and free swap entries [1][2], but the
design itself is fairly complex.

When anon folios and swap entries are mixed within a process,
reclaiming anon folios from killed processes helps return memory
to the system as quickly as possible, so that newly launched
applications can satisfy their memory demands. It is not ideal for
swap freeing to block anon folio freeing. On the other hand, swap
freeing can still return memory to the system, although at a slower
rate due to memory compression.

Therefore, we introduce a GC worker to allow anon folio freeing and
slot_free to run in parallel, since slot_free is performed
asynchronously, maximizing the rate at which memory is returned to
the system.

This series takes two complementary approaches to reduce zs_free()
latency:

- Shrink zs_free() class->lock critical section by moving zspage
  freeing outside the lock.
- Defer zs_free() to a workqueue via zs_free_deferred(), benefiting
  both zram and zswap.

The deferred free approach builds on Barry Song's earlier RFC [3] with
changes based on community feedback: the optimization moved to the
zsmalloc layer instead of zram; a fixed array storing handles (not
indices) with O(1) enqueue, to avoid memory allocation on the exit path
and data consistency issues on slot reuse; and size-based capacity
scaling with PAGE_SIZE.

Xueyuan's test on RK3588 with Barry's RFC v1 [3] shows that unmapping a
256MB swap-filled VMA becomes 3.4x faster when pinning tasks to CPU2,
reducing the execution time from 63,102,982 ns to 18,570,726 ns.

A positive side effect is that async GC also slightly improves
do_swap_page() performance, as it no longer has to wait for
slot_free() to complete. Xueyuan's test with Barry's RFC v1 [3] shows
that swapping in 256MB of data (each page filled with repeating
patterns such as "1024 one", "1024 two", "1024 three", and "1024 four")
reduces execution time from 1,358,133,886 ns to 1,104,315,986 ns,
achieving a 1.22x speedup.

[1] https://lore.kernel.org/all/20240805153639.1057-1-justinjiang@vivo.com/
[2] https://lore.kernel.org/all/20250909065349.574894-1-liulei.rjpt@vivo.com/
[3] https://lore.kernel.org/linux-mm/20260412060450.15813-1-baohua@kernel.org/

Xueyuan Chen (1):
  mm:zsmalloc: drop class lock before freeing zspage

Barry Song (Xiaomi) (1):
  zram: defer zs_free() in swap slot free notification path

Wenchao Hao (2):
  mm/zsmalloc: introduce zs_free_deferred() for async handle freeing
  mm/zswap: defer zs_free() in zswap_invalidate() path

 drivers/block/zram/zram_drv.c |  37 ++++++---
 include/linux/zsmalloc.h      |   2 +
 mm/zsmalloc.c                 | 141 ++++++++++++++++++++++++++++++++--
 mm/zswap.c                    |  16 ++++-
 4 files changed, 177 insertions(+), 19 deletions(-)

--
2.34.1
On Tue, Apr 21, 2026 at 8:16 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
>
> Swap freeing can be expensive when unmapping a VMA containing
> many swap entries. This has been reported to significantly
> delay memory reclamation during Android's low-memory killing,
> especially when multiple processes are terminated to free
> memory, with slot_free() accounting for more than 80% of
> the total cost of freeing swap entries.
>
> Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
> to asynchronously collect and free swap entries [1][2], but the
> design itself is fairly complex.
>
Hi Nhat, Kairui, Barry, Xueyuan,
Thanks for the review. I agree with the direction and have some ideas for
an alternative approach.
My approach: first eliminate pool->lock from zs_free() itself, then defer
free to per-cpu buffers with a lockless handoff, and finally reduce
class->lock overhead during drain by exploiting natural class locality.
Achieving both per-cpu and per-class is difficult, so the class->lock
optimization is a compromise — but one that works well in practice.
1. Encode class_idx in obj to eliminate pool->lock
OBJ_INDEX_BITS is over-provisioned on 64-bit. For example on arm64
(chain_size=8): OBJ_INDEX_BITS=24 but only 10 bits are actually needed
for obj_idx, leaving 14 spare bits.
We can split OBJ_INDEX into class_idx + obj_idx:
obj: [PFN | class_idx (OBJ_CLASS_BITS) | obj_idx (OBJ_IDX_BITS)]
OBJ_CLASS_BITS is computed dynamically as `ilog2(ZS_SIZE_CLASSES - 1) + 1`
(8 bits for 4K pages, 9 for 64K).
Since class_idx is invariant across migration (only PFN changes), zs_free()
can extract class_idx locklessly, then acquire class->lock and re-read obj for a
stable PFN. No pool->lock needed.
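To make the layout concrete, here is a minimal user-space sketch of the
split described above. It is not the kernel code: the field widths
(10 bits for obj_idx, 8 for class_idx), the helper names, and the class
counts used to exercise class_bits() (255 and 257, picked only because
they reproduce the 8-bit and 9-bit figures quoted above) are all
assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field widths, chosen for illustration only. */
#define OBJ_IDX_BITS   10u
#define OBJ_CLASS_BITS 8u
#define OBJ_LOW_BITS   (OBJ_IDX_BITS + OBJ_CLASS_BITS)

/* floor(log2(n)) for n > 0, the same quantity the kernel's ilog2() yields. */
static inline unsigned int ilog2_u32(uint32_t n)
{
	unsigned int r = 0;

	assert(n > 0);
	while (n >>= 1)
		r++;
	return r;
}

/* Bits needed to store class indices 0..nr_classes-1. */
static inline unsigned int class_bits(uint32_t nr_classes)
{
	assert(nr_classes > 1);
	return ilog2_u32(nr_classes - 1) + 1;
}

/* obj: [PFN | class_idx | obj_idx] */
static inline uint64_t pack_obj(uint64_t pfn, uint32_t class_idx,
				uint32_t obj_idx)
{
	assert(class_idx < (1u << OBJ_CLASS_BITS));
	assert(obj_idx < (1u << OBJ_IDX_BITS));
	return (pfn << OBJ_LOW_BITS) |
	       ((uint64_t)class_idx << OBJ_IDX_BITS) | obj_idx;
}

static inline uint32_t obj_to_class_idx(uint64_t obj)
{
	return (uint32_t)(obj >> OBJ_IDX_BITS) & ((1u << OBJ_CLASS_BITS) - 1);
}

static inline uint32_t obj_to_obj_idx(uint64_t obj)
{
	return (uint32_t)obj & ((1u << OBJ_IDX_BITS) - 1);
}

static inline uint64_t obj_to_pfn(uint64_t obj)
{
	return obj >> OBJ_LOW_BITS;
}

/* Migration rewrites only the PFN field; the low bits are preserved. */
static inline uint64_t migrate_obj(uint64_t obj, uint64_t new_pfn)
{
	uint64_t low_mask = (1ull << OBJ_LOW_BITS) - 1;

	return (new_pfn << OBJ_LOW_BITS) | (obj & low_mask);
}
```

The point of migrate_obj() is only to show why a lockless read of
class_idx is safe in this layout: whatever PFN a racing reader sees,
the class field it extracts is the same.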
2. Per-cpu deferred free with lockless buffer swap
Defer zs_free() to per-cpu dynamically-allocated buffers (~2048 entries).
Enqueue: one array write + WRITE_ONCE under preempt_disable, with no lock
and no atomic. When a buffer fills up, the drain worker is scheduled; further
frees on that CPU fall back to sync zs_free() until the buffer is drained.
Drain: allocate a fresh buffer, swap it in, reset count. Since
the producer stops writing at count==SIZE, the handoff is
race-free without any lock.
Pseudo-code:

    /* enqueue - hot path */
    def = get_cpu_ptr(pool->deferred);
    if (def->count < SIZE) {
        def->handles[def->count] = handle;
        WRITE_ONCE(def->count, def->count + 1);
        if (def->count == SIZE)
            schedule_work(&pool->drain_work);
    } else {
        zs_free(pool, handle);  /* fallback */
    }
    put_cpu_ptr(pool->deferred);

    /* drain - worker */
    for_each_possible_cpu(cpu) {
        def = per_cpu_ptr(pool->deferred, cpu);
        if (def->count < SIZE)
            continue;
        /* kvmalloc_array() takes gfp flags as its third argument */
        new_buf = kvmalloc_array(SIZE, sizeof(long), GFP_KERNEL);
        old_buf = def->handles;
        old_count = def->count;
        def->handles = new_buf;
        WRITE_ONCE(def->count, 0);
        /* now drain old_buf[0..old_count-1] */
        ...
        kvfree(old_buf);
    }
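The enqueue/drain handoff above can be modelled in user space as
follows. This is a single-threaded sketch only: it leaves out the
per-cpu placement, preempt_disable, and the WRITE_ONCE ordering that the
real scheme relies on, and the struct, function names, and counters are
hypothetical stand-ins (the counters replace zs_free() and
schedule_work() so the behaviour can be observed).

```c
#include <assert.h>
#include <stdlib.h>

#define DEFER_SIZE 4	/* tiny for illustration; the text suggests ~2048 */

struct deferred {
	unsigned long *handles;
	unsigned int count;
};

/* Hypothetical instrumentation standing in for the real work. */
static unsigned int sync_frees;		/* fallback zs_free() calls */
static unsigned int drained_frees;	/* handles freed by the worker */
static unsigned int drain_requests;	/* schedule_work() calls */

static inline int deferred_init(struct deferred *def)
{
	def->handles = calloc(DEFER_SIZE, sizeof(*def->handles));
	def->count = 0;
	return def->handles ? 0 : -1;
}

/* Producer: O(1), no allocation; falls back to a sync free when full. */
static inline void deferred_enqueue(struct deferred *def, unsigned long handle)
{
	if (def->count < DEFER_SIZE) {
		def->handles[def->count] = handle;
		def->count++;
		if (def->count == DEFER_SIZE)
			drain_requests++;	/* would schedule the worker */
	} else {
		(void)handle;
		sync_frees++;			/* would call zs_free() */
	}
}

/* Consumer: swap in a fresh buffer, then free the old entries. */
static inline int deferred_drain(struct deferred *def)
{
	unsigned long *new_buf, *old_buf;
	unsigned int i, old_count;

	if (def->count < DEFER_SIZE)
		return 0;	/* only full buffers are handed off */

	new_buf = calloc(DEFER_SIZE, sizeof(*new_buf));
	if (!new_buf)
		return -1;	/* buffer stays full; producer keeps falling back */

	old_buf = def->handles;
	old_count = def->count;
	def->handles = new_buf;
	def->count = 0;

	for (i = 0; i < old_count; i++)
		drained_frees++;	/* would call zs_free(pool, old_buf[i]) */
	free(old_buf);
	return 0;
}
```

With DEFER_SIZE of 4, enqueueing six handles stores four, requests one
drain, and frees the last two synchronously; after the drain the
producer can store handles again.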
3. Consecutive-class batching during drain
The drain worker extracts class_idx from each handle locklessly, and holds
class->lock across consecutive same-class handles.
On the exit path, compressed sizes tend to cluster, so consecutive handles
naturally share the same class — giving batch-like lock
amortization without sorting.
Pseudo-code:
    cur_cls = -1;
    for (i = 0; i < count; i++) {
        obj = handle_to_obj(handles[i]);
        cls = obj_to_class_idx(obj);
        if (cls != cur_cls) {
            if (cur_cls >= 0)
                spin_unlock(&pool->size_class[cur_cls]->lock);
            spin_lock(&pool->size_class[cls]->lock);
            cur_cls = cls;
        }
        __zs_free(pool, handles[i]);  /* free under lock */
    }
    if (cur_cls >= 0)
        spin_unlock(&pool->size_class[cur_cls]->lock);
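How much this saves depends entirely on how the handles are ordered. The
small user-space helper below (a hypothetical name, not part of the
series) counts the lock acquisitions the loop above would perform for a
given sequence of class indices, so the clustered and the worst case can
be compared against the one-lock-per-handle baseline.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of class->lock acquisitions made when the lock is held across
 * consecutive handles of the same class. Locking per handle would take
 * exactly n acquisitions.
 */
static inline size_t batched_lock_acquisitions(const int *cls, size_t n)
{
	size_t locks = 0;
	int cur = -1;

	for (size_t i = 0; i < n; i++) {
		if (cls[i] != cur) {
			locks++;	/* drop the old lock (if any), take the new one */
			cur = cls[i];
		}
	}
	return locks;
}
```

For a clustered sequence such as 3,3,3,3,7,7,7,3 this gives 3
acquisitions instead of 8, while a strictly alternating sequence such as
3,7,3,7 gives 4, that is, no gain over per-handle locking.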
---
Benefits over current mainline:
- Removes pool->lock from zs_free() entirely
- Deferred free path is nearly zero-cost
- class->lock is amortized across batches instead of acquired per-handle
- Producer-consumer handoff is fully lockless
I've prototyped this on 64-bit and it works. Still need to sort out
32-bit compatibility and Kconfig gating. Does this direction look reasonable?
Thanks,
Wenchao
On Sat, Apr 25, 2026 at 9:13 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
>
> On Tue, Apr 21, 2026 at 8:16 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
> >
> > Swap freeing can be expensive when unmapping a VMA containing
> > many swap entries. This has been reported to significantly
> > delay memory reclamation during Android's low-memory killing,
> > especially when multiple processes are terminated to free
> > memory, with slot_free() accounting for more than 80% of
> > the total cost of freeing swap entries.
> >
> > Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
> > to asynchronously collect and free swap entries [1][2], but the
> > design itself is fairly complex.
> >
> Hi Nhat, Kairui, Barry, Xueyuan,
>
> Thanks for the review. I agree with the direction and have some ideas for
> an alternative approach.
>
> My approach: first eliminate pool->lock from zs_free() itself, then defer
> free to per-cpu buffers with a lockless handoff, and finally reduce
> class->lock overhead during drain by exploiting natural class locality.
> Achieving both per-cpu and per-class is difficult, so the class->lock
> optimization is a compromise - but one that works well in practice.
>
> 1. Encode class_idx in obj to eliminate pool->lock
>
> OBJ_INDEX_BITS is over-provisioned on 64-bit. For example on arm64
> (chain_size=8): OBJ_INDEX_BITS=24 but only 10 bits are actually needed
> for obj_idx, leaving 14 spare bits.
> We can split OBJ_INDEX into class_idx + obj_idx:
>
> obj: [PFN | class_idx (OBJ_CLASS_BITS) | obj_idx (OBJ_IDX_BITS)]
>
> OBJ_CLASS_BITS is computed dynamically as `ilog2(ZS_SIZE_CLASSES - 1) + 1`
> (8 bits for 4K pages, 9 for 64K).
> Since class_idx is invariant across migration (only PFN changes), zs_free()
> can extract class_idx locklessly, then acquire class->lock and re-read obj for a
> stable PFN. No pool->lock needed.

How much of the benefit do we get with just these locking improvements
without having to defer any of the freeing work?

As others have pointed out, I don't want to just defer expensive work
without understanding why it's expensive and running into limitations
about why it cannot be improved without deferring.
On Tue, Apr 28, 2026 at 2:17 AM Yosry Ahmed <yosry@kernel.org> wrote:
> > Since class_idx is invariant across migration (only PFN changes), zs_free()
> > can extract class_idx locklessly, then acquire class->lock and re-read obj for a
> > stable PFN. No pool->lock needed.
>
> How much of the benefit do we get with just these locking improvements
> without having to defer any of the freeing work?
>

Hi Yosry,

Thanks for the review. Great question - we tested exactly this.

With only the class_idx-in-obj encoding (eliminating pool->lock from
zs_free, no deferred freeing), we measured on two platforms.

Test: each process independently mmap 256MB, write data, madvise
MADV_PAGEOUT to swap out via zram (lzo-rle), then concurrent munmap.

Raspberry Pi 4B (4-core ARM64 Cortex-A72):

  mode       Base      ClassIdx-only   Speedup
  single     59.0ms    56.0ms          1.05x
  multi 2p   94.6ms    66.7ms          1.42x
  multi 4p   202.9ms   110.6ms         1.83x

x86 physical machine (4-core Intel i7-12700, 2 rounds averaged):

  mode       Base      ClassIdx-only   Speedup
  single     11.7ms    9.8ms           1.19x
  multi 2p   24.1ms    17.2ms          1.40x
  multi 4p   63.0ms    45.3ms          1.39x

Single-process shows modest improvement. With multiple processes,
each read_lock/read_unlock atomically modifies the shared rwlock
reader count, and the cost of these atomic operations increases
with more CPUs accessing the same cacheline concurrently.
Eliminating pool->lock removes this overhead entirely.

This only works on 64-bit systems where OBJ_INDEX_BITS has enough
spare bits to fit class_idx. 32-bit systems don't have the room.
I'm still working on the compile-time gating to properly enable
this based on architecture and page size configuration.

> As others have pointed out, I don't want to just defer expensive work
> without understanding why it's expensive and running into limitations
> about why it cannot be improved without deferring.

For the deferred freeing part: the class_idx-in-obj optimization
addresses the multi-process scenario where concurrent atomic
operations on pool->lock become expensive, but does not help
single-process munmap. Deferred freeing moves the entire zs_free
cost (including class->lock and zspage freeing) off the munmap
hot path, which benefits even single-process workloads. The two
optimizations are complementary.

Thanks,
Wenchao
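For anyone wanting to reproduce the shape of this measurement, the
single-process case could look roughly like the sketch below. It is not
the actual test program used for the numbers above: the function name
and the fill pattern are assumptions, the MADV_PAGEOUT fallback
definition is only there for older libc headers, and the multi-process
variants (several processes doing the munmap concurrently) are left
out. Pages are only swapped out if the system has swap (for example
zram) configured.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <time.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21	/* Linux value, for libc headers that lack it */
#endif

/* Returns the wall time of munmap() in ns, or -1 on failure. */
static inline int64_t time_munmap_after_pageout(size_t len)
{
	struct timespec t0, t1;
	unsigned char *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* Populate every page with non-uniform data. */
	for (size_t i = 0; i < len; i++)
		p[i] = (unsigned char)(i * 31u + (i >> 12));

	/* Best effort: without swap, or on old kernels, nothing is paged out. */
	(void)madvise(p, len, MADV_PAGEOUT);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (munmap(p, len) != 0)
		return -1;
	clock_gettime(CLOCK_MONOTONIC, &t1);

	return (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL +
	       (t1.tv_nsec - t0.tv_nsec);
}
```

Calling time_munmap_after_pageout(256UL << 20) corresponds to the 256MB
case; the multi-process rows would need the same call issued from
several processes at once.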
On Tue, Apr 28, 2026 at 2:51 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
> > > Since class_idx is invariant across migration (only PFN changes), zs_free()
> > > can extract class_idx locklessly, then acquire class->lock and re-read obj for a
> > > stable PFN. No pool->lock needed.
> >
> > How much of the benefit do we get with just these locking improvements
> > without having to defer any of the freeing work?
> >
>
> Hi Yosry,
>
> Thanks for the review. Great question - we tested exactly this.
>
> With only the class_idx-in-obj encoding (eliminating pool->lock from
> zs_free, no deferred freeing), we measured on two platforms.
>
> Test: each process independently mmap 256MB, write data, madvise
> MADV_PAGEOUT to swap out via zram (lzo-rle), then concurrent munmap.
>
> Raspberry Pi 4B (4-core ARM64 Cortex-A72):
>
>   mode       Base      ClassIdx-only   Speedup
>   single     59.0ms    56.0ms          1.05x
>   multi 2p   94.6ms    66.7ms          1.42x
>   multi 4p   202.9ms   110.6ms         1.83x
>
> x86 physical machine (4-core Intel i7-12700, 2 rounds averaged):
>
>   mode       Base      ClassIdx-only   Speedup
>   single     11.7ms    9.8ms           1.19x
>   multi 2p   24.1ms    17.2ms          1.40x
>   multi 4p   63.0ms    45.3ms          1.39x

Oh man, you are eliminating pool lock here right? This would help my
other patch series a lot too :)

https://lore.kernel.org/all/CAKEwX=M5YpR0cQrryX_y4pm_BuxyUWZ_8MbhWodwbf1Fe=gzew@mail.gmail.com/
https://lore.kernel.org/all/CAKEwX=PkFiP+u+ThrzjTKBi+usQf2uuhTZcfB2BNNA8RboOFDQ@mail.gmail.com/

Well, the deferred freeing would completely move that contention out
of the way lol. But this would benefit all users, regardless of
whether we're deferring the free step or not (for instance, this will
reduce contention between page fault and compaction, IIUC?) I feel
like you'll get some good numbers testing in a system with compaction
and THP enabled, with lots of swap activities. Which is... a lot of
server setup :)

If the deferred freeing is too controversial, this smells like
something that should be upstreamed independently.

>
> Single-process shows modest improvement. With multiple processes,
> each read_lock/read_unlock atomically modifies the shared rwlock
> reader count, and the cost of these atomic operations increases
> with more CPUs accessing the same cacheline concurrently.
> Eliminating pool->lock removes this overhead entirely.
>
> This only works on 64-bit systems where OBJ_INDEX_BITS has enough
> spare bits to fit class_idx. 32-bit systems don't have the room.
> I'm still working on the compile-time gating to properly enable
> this based on architecture and page size configuration.

        /*
         * The pool->lock protects the race with zpage's migration
         * so it's safe to get the page from handle.
         */
        read_lock(&pool->lock);
        obj = handle_to_obj(handle);
        obj_to_zpdesc(obj, &f_zpdesc);
        zspage = get_zspage(f_zpdesc);
        class = zspage_class(pool, zspage);
        spin_lock(&class->lock);
        read_unlock(&pool->lock);

It's basically just this blob right?

>
> > As others have pointed out, I don't want to just defer expensive work
> > without understanding why it's expensive and running into limitations
> > about why it cannot be improved without deferring.
>
> For the deferred freeing part: the class_idx-in-obj optimization
> addresses the multi-process scenario where concurrent atomic
> operations on pool->lock become expensive, but does not help
> single-process munmap. Deferred freeing moves the entire zs_free
> cost (including class->lock and zspage freeing) off the munmap
> hot path, which benefits even single-process workloads. The two
> optimizations are complementary.

+1 :)
On Sat, May 2, 2026 at 3:21 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> > With only the class_idx-in-obj encoding (eliminating pool->lock from
> > zs_free, no deferred freeing), we measured on two platforms.
> >
> > Test: each process independently mmap 256MB, write data, madvise
> > MADV_PAGEOUT to swap out via zram (lzo-rle), then concurrent munmap.
> >
> > Raspberry Pi 4B (4-core ARM64 Cortex-A72):
> >
> >   mode       Base      ClassIdx-only   Speedup
> >   single     59.0ms    56.0ms          1.05x
> >   multi 2p   94.6ms    66.7ms          1.42x
> >   multi 4p   202.9ms   110.6ms         1.83x
> >
> > x86 physical machine (4-core Intel i7-12700, 2 rounds averaged):
> >
> >   mode       Base      ClassIdx-only   Speedup
> >   single     11.7ms    9.8ms           1.19x
> >   multi 2p   24.1ms    17.2ms          1.40x
> >   multi 4p   63.0ms    45.3ms          1.39x
>
> Oh man, you are eliminating pool lock here right? This would help my
> other patch series a lot too :)
>
> https://lore.kernel.org/all/CAKEwX=M5YpR0cQrryX_y4pm_BuxyUWZ_8MbhWodwbf1Fe=gzew@mail.gmail.com/
> https://lore.kernel.org/all/CAKEwX=PkFiP+u+ThrzjTKBi+usQf2uuhTZcfB2BNNA8RboOFDQ@mail.gmail.com/
>
Yes, exactly. With class_idx encoded in the obj value,
zs_free() can determine the correct size_class without
any pool-level lock. The lockless read gives a valid
class_idx because it's invariant across migration (only
PFN changes), and we re-read obj under class->lock to
get a stable PFN.
> Well, the deferred freeing would completely move that contention out
> of the way lol. But this would benefit all users, regardless of
> whether we're deferring the free step or not (for instance, this will
> reduce contention between page fault and compaction, IIUC?) I feel
> like you'll get some good numbers testing in a system with compaction
> and THP enabled, with lots of swap activities. Which is... a lot of
> server setup :)
>
> If the deferred freeing is too controversial, this smells like
> something that should be upstreamed independently.
>
Agreed. We're planning to split the series so that the
class_idx encoding + pool->lock elimination can be
reviewed and merged independently of the deferred free
framework. It's a pure win with no behavioral change
— just less lock contention.
> >
> > Single-process shows modest improvement. With multiple processes,
> > each read_lock/read_unlock atomically modifies the shared rwlock
> > reader count, and the cost of these atomic operations increases
> > with more CPUs accessing the same cacheline concurrently.
> > Eliminating pool->lock removes this overhead entirely.
> >
> > This only works on 64-bit systems where OBJ_INDEX_BITS has enough
> > spare bits to fit class_idx. 32-bit systems don't have the room.
> > I'm still working on the compile-time gating to properly enable
> > this based on architecture and page size configuration.
>
> /*
> * The pool->lock protects the race with zpage's migration
> * so it's safe to get the page from handle.
> */
> read_lock(&pool->lock);
> obj = handle_to_obj(handle);
> obj_to_zpdesc(obj, &f_zpdesc);
> zspage = get_zspage(f_zpdesc);
> class = zspage_class(pool, zspage);
> spin_lock(&class->lock);
> read_unlock(&pool->lock);
>
> It's basically just this blob right?
>
Yes, that's the blob being replaced. On the
ZS_OBJ_CLASS_IDX path (64-bit systems), it becomes:
obj = handle_to_obj(handle);
class = pool->size_class[obj_to_class_idx(obj)];
spin_lock(&class->lock);
obj = handle_to_obj(handle); /* re-read for stable PFN */
No pool->lock at all. We've also added compile-time
gating (#if BITS_PER_LONG >= 64) since 32-bit systems
lack the spare bits in OBJ_INDEX to fit class_idx. On
32-bit, it falls back to the original pool->lock path.
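The lookup, lock, re-read sequence can be illustrated with a small
user-space model. Everything here is a stand-in: the field widths, the
spinlock built on an atomic flag (in place of class->lock), the frees
counter, and the function names are hypothetical, and the model assumes,
as described above, that migration rewrites only the PFN bits and does
so while holding the class lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical field widths, for illustration only. */
#define IDX_BITS   10u
#define CLASS_BITS 8u
#define NR_CLASSES (1u << CLASS_BITS)

struct size_class {
	atomic_bool locked;	/* stand-in for class->lock */
	unsigned int frees;	/* objects freed, protected by the lock */
};

static struct size_class classes[NR_CLASSES];

static inline unsigned int obj_class(uint64_t obj)
{
	return (unsigned int)(obj >> IDX_BITS) & (NR_CLASSES - 1);
}

static inline uint64_t obj_pfn(uint64_t obj)
{
	return obj >> (IDX_BITS + CLASS_BITS);
}

static inline void class_lock(struct size_class *c)
{
	while (atomic_exchange_explicit(&c->locked, true,
					memory_order_acquire))
		;	/* spin */
}

static inline void class_unlock(struct size_class *c)
{
	atomic_store_explicit(&c->locked, false, memory_order_release);
}

/* Returns the PFN that was stable at the time of the free. */
static inline uint64_t free_obj(_Atomic uint64_t *handle)
{
	/* Lockless peek: only the class bits are trusted here. */
	uint64_t obj = atomic_load_explicit(handle, memory_order_relaxed);
	struct size_class *c = &classes[obj_class(obj)];
	uint64_t pfn;

	class_lock(c);
	/* Re-read under the lock: the PFN cannot change any more. */
	obj = atomic_load_explicit(handle, memory_order_relaxed);
	pfn = obj_pfn(obj);
	c->frees++;
	class_unlock(c);
	return pfn;
}
```

The first load may observe a stale PFN if a migration is in flight, but
the class index it yields is still correct, which is all that is needed
to pick the lock.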
> >
> > > As others have pointed out, I don't want to just defer expensive work
> > > without understanding why it's expensive and running into limitations
> > > about why it cannot be improved without deferring.
> >
> > For the deferred freeing part: the class_idx-in-obj optimization
> > addresses the multi-process scenario where concurrent atomic
> > operations on pool->lock become expensive, but does not help
> > single-process munmap. Deferred freeing moves the entire zs_free
> > cost (including class->lock and zspage freeing) off the munmap
> > hot path, which benefits even single-process workloads. The two
> > optimizations are complementary.
>
> +1 :)
On Wed, May 6, 2026 at 6:55 AM Wenchao Hao <haowenchao22@gmail.com> wrote:
>
> Yes, that's the blob being replaced. On the
> ZS_OBJ_CLASS_IDX path (64-bit systems), it becomes:
>
> obj = handle_to_obj(handle);
> class = pool->size_class[obj_to_class_idx(obj)];
> spin_lock(&class->lock);
> obj = handle_to_obj(handle); /* re-read for stable PFN */
>
> No pool->lock at all. We've also added compile-time
> gating (#if BITS_PER_LONG >= 64) since 32-bit systems
> lack the spare bits in OBJ_INDEX to fit class_idx. On
> 32-bit, it falls back to the original pool->lock path.
>

BTW, I've tested your idea with a hacky prototype, when I was playing
with my vswap series. It absolutely improves free time in the usemem
benchmark :) Idea is very promising - I won't scoop your work of
course, just letting you know that at least in my use case, it works
:) Look forward to seeing it submitted soon!!!
On Sat, May 9, 2026 at 7:33 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Wed, May 6, 2026 at 6:55 AM Wenchao Hao <haowenchao22@gmail.com> wrote:
> >
> > On Sat, May 2, 2026 at 3:21 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > >
> > > Oh man, you are eliminating pool lock here right? This would help my
> > > other patch series a lot too :)
> > >
> > > https://lore.kernel.org/all/CAKEwX=M5YpR0cQrryX_y4pm_BuxyUWZ_8MbhWodwbf1Fe=gzew@mail.gmail.com/
> > > https://lore.kernel.org/all/CAKEwX=PkFiP+u+ThrzjTKBi+usQf2uuhTZcfB2BNNA8RboOFDQ@mail.gmail.com/
> > >
> >
> > Yes, exactly. With class_idx encoded in the obj value,
> > zs_free() can determine the correct size_class without
> > any pool-level lock. The lockless read gives a valid
> > class_idx because it's invariant across migration (only
> > PFN changes), and we re-read obj under class->lock to
> > get a stable PFN.
> >
> > >
> > > /*
> > > * The pool->lock protects the race with zpage's migration
> > > * so it's safe to get the page from handle.
> > > */
> > > read_lock(&pool->lock);
> > > obj = handle_to_obj(handle);
> > > obj_to_zpdesc(obj, &f_zpdesc);
> > > zspage = get_zspage(f_zpdesc);
> > > class = zspage_class(pool, zspage);
> > > spin_lock(&class->lock);
> > > read_unlock(&pool->lock);
> > >
> > > It's basically just this blob right?
> > >
> >
> > Yes, that's the blob being replaced. On the
> > ZS_OBJ_CLASS_IDX path (64-bit systems), it becomes:
> >
> > obj = handle_to_obj(handle);
> > class = pool->size_class[obj_to_class_idx(obj)];
> > spin_lock(&class->lock);
> > obj = handle_to_obj(handle); /* re-read for stable PFN */
> >
> > No pool->lock at all. We've also added compile-time
> > gating (#if BITS_PER_LONG >= 64) since 32-bit systems
> > lack the spare bits in OBJ_INDEX to fit class_idx. On
> > 32-bit, it falls back to the original pool->lock path.
> >
>
> BTW, I've tested your idea with a hacky prototype, when I was playing
> with my vswap series. It absolutely improves free time in the usemem
> benchmark :) Idea is very promising - I won't scoop your work of
> course, just letting you know that at least in my use case, it works
> :) Look forward to seeing it submitted soon!!!

Thanks, Nhat, that's great to hear.

I've split this part out and posted it as its own series:

https://lore.kernel.org/linux-mm/20260508061910.3882831-1-haowenchao@xiaomi.com

Review there would be very welcome.

Also, could you share the details of your usemem setup? I'd like
to reproduce it locally on the same baseline.

Thanks,
Wenchao
On Sat, May 9, 2026 at 2:08 AM Wenchao Hao <haowenchao22@gmail.com> wrote:
>
> On Sat, May 9, 2026 at 7:33 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Wed, May 6, 2026 at 6:55 AM Wenchao Hao <haowenchao22@gmail.com> wrote:
> > >
> > > On Sat, May 2, 2026 at 3:21 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > > >
> > > >
> > > > Oh man, you are eliminating pool lock here right? This would help my
> > > > other patch series a lot too :)
> > > >
> > > > https://lore.kernel.org/all/CAKEwX=M5YpR0cQrryX_y4pm_BuxyUWZ_8MbhWodwbf1Fe=gzew@mail.gmail.com/
> > > > https://lore.kernel.org/all/CAKEwX=PkFiP+u+ThrzjTKBi+usQf2uuhTZcfB2BNNA8RboOFDQ@mail.gmail.com/
> > > >
> > >
> > > Yes, exactly. With class_idx encoded in the obj value,
> > > zs_free() can determine the correct size_class without
> > > any pool-level lock. The lockless read gives a valid
> > > class_idx because it's invariant across migration (only
> > > PFN changes), and we re-read obj under class->lock to
> > > get a stable PFN.
> > >
> > > >
> > > > /*
> > > > * The pool->lock protects the race with zpage's migration
> > > > * so it's safe to get the page from handle.
> > > > */
> > > > read_lock(&pool->lock);
> > > > obj = handle_to_obj(handle);
> > > > obj_to_zpdesc(obj, &f_zpdesc);
> > > > zspage = get_zspage(f_zpdesc);
> > > > class = zspage_class(pool, zspage);
> > > > spin_lock(&class->lock);
> > > > read_unlock(&pool->lock);
> > > >
> > > > It's basically just this blob right?
> > > >
> > >
> > > Yes, that's the blob being replaced. On the
> > > ZS_OBJ_CLASS_IDX path (64-bit systems), it becomes:
> > >
> > > obj = handle_to_obj(handle);
> > > class = pool->size_class[obj_to_class_idx(obj)];
> > > spin_lock(&class->lock);
> > > obj = handle_to_obj(handle); /* re-read for stable PFN */
> > >
> > > No pool->lock at all. We've also added compile-time
> > > gating (#if BITS_PER_LONG >= 64) since 32-bit systems
> > > lack the spare bits in OBJ_INDEX to fit class_idx. On
> > > 32-bit, it falls back to the original pool->lock path.
> > >
> >
> > BTW, I've tested your idea with a hacky prototype, when I was playing
> > with my vswap series. It absolutely improves free time in the usemem
> > benchmark :) Idea is very promising - I won't scoop your work of
> > course, just letting you know that at least in my use case, it works
> > :) Look forward to seeing it submitted soon!!!
>
> Thanks, Nhat, that's great to hear.
>
> I've split this part out and posted it as its own series:
>
> https://lore.kernel.org/linux-mm/20260508061910.3882831-1-haowenchao@xiaomi.com
>
> Review there would be very welcome.

Huh I think I might have been unsubscribed from linux-mm again -.-
Weird - I wonder if this is because of Gmail shenanigans.

Can you cc me the thread next time just in case?

>
> Also, could you share the details of your usemem setup? I'd like
> to reproduce it locally on the same baseline.

Sure! I left some notes here:

https://lore.kernel.org/all/20260505153854.1612033-1-nphamcs@gmail.com/

But for your convenience, this is the benchmark I ran:

2. Usemem single-threaded: anonymous memory allocation (56GB) on a
host with 32GB RAM, 16 rounds. I don't put a limit on the cgroup,
relying on global pressure (per Kairui's instructions).

I'm not on my work server right now so I don't have the exact command,
but hopefully that should be enough to show the wins with your patch
series! I wanted to run it for your patch series myself but I do not
have the cycles right now, unfortunately :(

>
> Thanks,
> Wenchao
> > How much of the benefit do we get with just these locking improvements
> > without having to defer any of the freeing work?
> >
>
> Hi Yosry,
>
> Thanks for the review. Great question - we tested exactly this.
>
> With only the class_idx-in-obj encoding (eliminating pool->lock from
> zs_free, no deferred freeing), we measured on two platforms.
>
> Test: each process independently mmap 256MB, write data, madvise
> MADV_PAGEOUT to swap out via zram (lzo-rle), then concurrent munmap.
>
> Raspberry Pi 4B (4-core ARM64 Cortex-A72):
>
>   mode       Base      ClassIdx-only   Speedup
>   single     59.0ms    56.0ms          1.05x
>   multi 2p   94.6ms    66.7ms          1.42x
>   multi 4p   202.9ms   110.6ms         1.83x
>
> x86 physical machine (4-core Intel i7-12700, 2 rounds averaged):
>
>   mode       Base      ClassIdx-only   Speedup
>   single     11.7ms    9.8ms           1.19x
>   multi 2p   24.1ms    17.2ms          1.40x
>   multi 4p   63.0ms    45.3ms          1.39x
>
> Single-process shows modest improvement. With multiple processes,
> each read_lock/read_unlock atomically modifies the shared rwlock
> reader count, and the cost of these atomic operations increases
> with more CPUs accessing the same cacheline concurrently.
> Eliminating pool->lock removes this overhead entirely.
>
> This only works on 64-bit systems where OBJ_INDEX_BITS has enough
> spare bits to fit class_idx. 32-bit systems don't have the room.
> I'm still working on the compile-time gating to properly enable
> this based on architecture and page size configuration.
>
> > As others have pointed out, I don't want to just defer expensive work
> > without understanding why it's expensive and running into limitations
> > about why it cannot be improved without deferring.
>
> For the deferred freeing part: the class_idx-in-obj optimization
> addresses the multi-process scenario where concurrent atomic
> operations on pool->lock become expensive, but does not help
> single-process munmap. Deferred freeing moves the entire zs_free
> cost (including class->lock and zspage freeing) off the munmap
> hot path, which benefits even single-process workloads. The two
> optimizations are complementary.

What is the extra speedup added by the deferred freeing on top of the
locking improvements?

I couldn't immediately tell by looking at this vs. the cover letter. I
wonder what portion of the improvement comes from the deferred freeing?
On Thu, Apr 30, 2026 at 6:44 AM Yosry Ahmed <yosry@kernel.org> wrote:
>
> > > How much of the benefit do we get with just these locking improvements
> > > without having to defer any of the freeing work?
> > >
> >
> > Hi Yosry,
> >
> > Thanks for the review. Great question - we tested exactly this.
> >
> > With only the class_idx-in-obj encoding (eliminating pool->lock from
> > zs_free, no deferred freeing), we measured on two platforms.
> >
> > Test: each process independently mmap 256MB, write data, madvise
> > MADV_PAGEOUT to swap out via zram (lzo-rle), then concurrent munmap.
> >
> > Raspberry Pi 4B (4-core ARM64 Cortex-A72):
> >
> >   mode       Base      ClassIdx-only   Speedup
> >   single     59.0ms    56.0ms          1.05x
> >   multi 2p   94.6ms    66.7ms          1.42x
> >   multi 4p   202.9ms   110.6ms         1.83x
> >
> > x86 physical machine (4-core Intel i7-12700, 2 rounds averaged):
> >
> >   mode       Base      ClassIdx-only   Speedup
> >   single     11.7ms    9.8ms           1.19x
> >   multi 2p   24.1ms    17.2ms          1.40x
> >   multi 4p   63.0ms    45.3ms          1.39x
> >
> > Single-process shows modest improvement. With multiple processes,
> > each read_lock/read_unlock atomically modifies the shared rwlock
> > reader count, and the cost of these atomic operations increases
> > with more CPUs accessing the same cacheline concurrently.
> > Eliminating pool->lock removes this overhead entirely.
> >
> > This only works on 64-bit systems where OBJ_INDEX_BITS has enough
> > spare bits to fit class_idx. 32-bit systems don't have the room.
> > I'm still working on the compile-time gating to properly enable
> > this based on architecture and page size configuration.
> >
> > > As others have pointed out, I don't want to just defer expensive work
> > > without understanding why it's expensive and running into limitations
> > > about why it cannot be improved without deferring.
> >
> > For the deferred freeing part: the class_idx-in-obj optimization
> > addresses the multi-process scenario where concurrent atomic
> > operations on pool->lock become expensive, but does not help
> > single-process munmap. Deferred freeing moves the entire zs_free
> > cost (including class->lock and zspage freeing) off the munmap
> > hot path, which benefits even single-process workloads. The two
> > optimizations are complementary.
>
> What is the extra speedup added by the deferred freeing
> on top of the locking improvements?

The data I shared earlier was class_idx-in-obj only - no
deferred freeing at all.

> I couldn't immediately tell by looking at this vs. the cover letter. I wonder
> what portion of the improvement comes from the deferred freeing?

On top of that, we added deferred freeing in the zsmalloc
layer (per-cpu page-pool based buffer swap + WQ_UNBOUND
drain worker). With both class_idx + deferred:

Test 1: concurrent munmap (256MB/process, RPi 4B):

  mode       Base      Deferred   Speedup
  single     56.2ms    17.2ms     3.27x
  multi 3p   153.2ms   51.5ms     2.97x

Test 2: single process munmap (various sizes):

  size    Base      Deferred   Speedup
  64MB    15.0ms    4.3ms      3.47x
  128MB   28.7ms    8.5ms      3.37x
  192MB   43.2ms    13.0ms     3.32x
  256MB   57.0ms    17.3ms     3.30x
  512MB   114.4ms   38.5ms     2.97x

However, this is not the ceiling. Profiling with perf shows
that after deferred zs_free, zram_slot_free_notify still
accounts for ~65% of munmap time - mostly slot_trylock/unlock
and slot metadata operations.

To understand the theoretical limit, I tested an extreme
version that removes slot_trylock from the hot path entirely
(not safe for production, just benchmarking):

  size    Base      Deferred   No-lock   Speedup
  64MB    15.0ms    4.3ms      2.3ms     6.50x
  128MB   28.7ms    8.5ms      4.7ms     6.14x
  192MB   43.2ms    13.0ms     6.8ms     6.31x
  256MB   57.0ms    17.3ms     9.0ms     6.30x
  512MB   114.4ms   38.5ms     33.0ms    3.46x

I'm exploring ways to further reduce or eliminate the lock
from this path, any suggestions on how to approach this
would be appreciated.

Unless otherwise noted, all data is from Raspberry Pi 4B
(4-core ARM64 Cortex-A72, 8GB RAM, zram 2GB, lzo-rle).
Test: mmap + fill + madvise(MADV_PAGEOUT) to swap out
via zram, then measure munmap time.

Thanks,
Wenchao
On Thu, Apr 30, 2026 at 3:43 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
> The data I shared earlier was class_idx-in-obj only - no
> deferred freeing at all.
>
> > I couldn't immediately tell by looking at this vs. the cover letter. I wonder
> > what portion of the improvement comes from the deferred freeing?
>
> On top of that, we added deferred freeing in the zsmalloc
> layer (per-cpu page-pool based buffer swap + WQ_UNBOUND
> drain worker). With both class_idx + deferred:
>
> Test 1: concurrent munmap (256MB/process, RPi 4B):
>
>   mode       Base      Deferred   Speedup
>   single     56.2ms    17.2ms     3.27x
>   multi 3p   153.2ms   51.5ms     2.97x
>
> Test 2: single process munmap (various sizes):
>
>   size    Base      Deferred   Speedup
>   64MB    15.0ms    4.3ms      3.47x
>   128MB   28.7ms    8.5ms      3.37x
>   192MB   43.2ms    13.0ms     3.32x
>   256MB   57.0ms    17.3ms     3.30x
>   512MB   114.4ms   38.5ms     2.97x

Hi Wenchao,

One concern here is that the total amount of work is unchanged. I mean
you observe speed up because you offloaded the work to an async worker.
But when under pressure these workers could be a larger burden.

Is it possible for you to measure that part too?
On Thu, Apr 30, 2026 at 4:00 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Thu, Apr 30, 2026 at 3:43 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
> > The data I shared earlier was class_idx-in-obj only - no
> > deferred freeing at all.
> >
> > > I couldn't immediately tell by looking at this vs. the cover letter. I wonder
> > > what portion of the improvement comes from the deferred freeing?
> >
> > On top of that, we added deferred freeing in the zsmalloc
> > layer (per-cpu page-pool based buffer swap + WQ_UNBOUND
> > drain worker). With both class_idx + deferred:
> >
> > Test 1: concurrent munmap (256MB/process, RPi 4B):
> >
> >   mode       Base      Deferred   Speedup
> >   single     56.2ms    17.2ms     3.27x
> >   multi 3p   153.2ms   51.5ms     2.97x
> >
> > Test 2: single process munmap (various sizes):
> >
> >   size    Base      Deferred   Speedup
> >   64MB    15.0ms    4.3ms      3.47x
> >   128MB   28.7ms    8.5ms      3.37x
> >   192MB   43.2ms    13.0ms     3.32x
> >   256MB   57.0ms    17.3ms     3.30x
> >   512MB   114.4ms   38.5ms     2.97x
>

Hi Kairui,

> One concern here is that the total amount of work is
> unchanged. But when under pressure these workers could
> be a larger burden.

The total CPU work is actually slightly reduced - the batch
drain eliminates pool->lock entirely, and holds class->lock
across consecutive same-class handles rather than
acquiring/releasing per handle. So the deferred path does
less lock work than synchronous per-handle zs_free.

I'm also exploring further reductions, such as merging zram
flags operations in the notify path (as you suggested earlier)
and reducing lock overhead. Suggestions are welcome.

The key win is not reducing work but unblocking anon folio
freeing. Each folio free returns a full page immediately,
whereas zs_free may need many handle frees before a zspage
becomes empty (multiple compressed objects share the same
zspage). By not blocking folio freeing with expensive zs_free,
we improve the rate at which usable memory returns to the
system.

With parallelism (munmap + worker on different CPUs), the
process exits faster and memory is returned sooner. For
example, what used to take ~1s on one CPU can now complete
in ~400ms across two CPUs. Under memory pressure, spending
a bit more CPU to release memory faster is a reasonable
tradeoff.

> Is it possible for you to measure that part too?

Sure. Could you describe the specific scenario you're
concerned about - CPU contention, memory pressure, or
scheduling latency? I'm happy to design and run a test
around it.

Thanks,
Wenchao
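The lock-work argument above (holding class->lock across consecutive same-class handles during the batch drain) can be illustrated with a small user-space sketch. This is not the patch's code: `drain_stats`, `drain_per_handle()` and `drain_batched()` are hypothetical names, and the "lock" is only a counter, so the example shows how acquisitions scale with the number of same-class runs rather than with the number of handles.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping: how often a class lock would be taken. */
struct drain_stats {
    size_t lock_acquisitions;
    size_t freed;
};

/* Synchronous style: one lock/unlock pair per handle. */
static struct drain_stats drain_per_handle(const unsigned *class_of, size_t n)
{
    struct drain_stats s = { 0, 0 };

    for (size_t i = 0; i < n; i++) {
        (void)class_of[i];      /* class lookup for this handle */
        s.lock_acquisitions++;  /* stands in for spin_lock(&class->lock) */
        s.freed++;              /* stands in for freeing the object */
    }
    return s;
}

/* Batched style: keep the lock while the next handle has the same class. */
static struct drain_stats drain_batched(const unsigned *class_of, size_t n)
{
    struct drain_stats s = { 0, 0 };
    size_t i = 0;

    while (i < n) {
        unsigned cls = class_of[i];

        s.lock_acquisitions++;  /* lock once for the whole run */
        while (i < n && class_of[i] == cls) {
            s.freed++;
            i++;
        }
        /* the unlock would go here, before the next run starts */
    }
    return s;
}
```

For a buffer whose class indices are 3,3,3,7,7,3,9,9,9,9, the per-handle path takes the lock ten times and the batched path four times. The saving depends entirely on how much class locality the buffered handles have.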
On Tue, Apr 28, 2026 at 9:51 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
>
> On Tue, Apr 28, 2026 at 2:17 AM Yosry Ahmed <yosry@kernel.org> wrote:
> >
> > On Sat, Apr 25, 2026 at 9:13 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
> > >
> > > On Tue, Apr 21, 2026 at 8:16 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
> > > >
> > > > Swap freeing can be expensive when unmapping a VMA containing
> > > > many swap entries. This has been reported to significantly
> > > > delay memory reclamation during Android's low-memory killing,
> > > > especially when multiple processes are terminated to free
> > > > memory, with slot_free() accounting for more than 80% of
> > > > the total cost of freeing swap entries.
> > > >
> > > > Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
> > > > to asynchronously collect and free swap entries [1][2], but the
> > > > design itself is fairly complex.
> > > >
> > > Hi Nhat, Kairui, Barry, Xueyuan,
> > >
> > > Thanks for the review. I agree with the direction and have some ideas for
> > > an alternative approach.
> > >
> > > My approach: first eliminate pool->lock from zs_free() itself, then defer
> > > free to per-cpu buffers with a lockless handoff, and finally reduce
> > > class->lock overhead during drain by exploiting natural class locality.
> > > Achieving both per-cpu and per-class is difficult, so the class->lock
> > > optimization is a compromise - but one that works well in practice.
> > >
> > > 1. Encode class_idx in obj to eliminate pool->lock
> > >
> > > OBJ_INDEX_BITS is over-provisioned on 64-bit. For example on arm64
> > > (chain_size=8): OBJ_INDEX_BITS=24 but only 10 bits are actually needed
> > > for obj_idx, leaving 14 spare bits.
> > > We can split OBJ_INDEX into class_idx + obj_idx:
> > >
> > >   obj: [PFN | class_idx (OBJ_CLASS_BITS) | obj_idx (OBJ_IDX_BITS)]
> > >
> > > OBJ_CLASS_BITS is computed dynamically as `ilog2(ZS_SIZE_CLASSES - 1) + 1`
> > > (8 bits for 4K pages, 9 for 64K).
> > > Since class_idx is invariant across migration (only PFN changes), zs_free()
> > > can extract class_idx locklessly, then acquire class->lock and re-read obj for a
> > > stable PFN. No pool->lock needed.
> >
> > How much of the benefit do we get with just these locking improvements
> > without having to defer any of the freeing work?
> >
>
> Hi Yosry,
>
> Thanks for the review. Great question - we tested exactly this.
>
> With only the class_idx-in-obj encoding (eliminating pool->lock from
> zs_free, no deferred freeing), we measured on two platforms.
>
> Test: each process independently mmap 256MB, write data, madvise
> MADV_PAGEOUT to swap out via zram (lzo-rle), then concurrent munmap.
>
> Raspberry Pi 4B (4-core ARM64 Cortex-A72):
>
>   mode       Base      ClassIdx-only   Speedup
>   single     59.0ms    56.0ms          1.05x
>   multi 2p   94.6ms    66.7ms          1.42x
>   multi 4p   202.9ms   110.6ms         1.83x
>
> x86 physical machine (4-core Intel i7-12700, 2 rounds averaged):
>
>   mode       Base      ClassIdx-only   Speedup
>   single     11.7ms    9.8ms           1.19x
>   multi 2p   24.1ms    17.2ms          1.40x
>   multi 4p   63.0ms    45.3ms          1.39x
>

Correction on the x86 test description: the machine is a 20-core
Intel i7-12700, not 4-core. The test only ran 4 concurrent
processes. The multi 4p result (1.39x) is with 4 out of 20 cores
active - pool->lock contention would be higher with more
concurrent processes on this machine.
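The `[PFN | class_idx | obj_idx]` layout quoted above can be sketched as plain bit packing. This is a user-space illustration under stated assumptions, not the patch itself: the field widths are fixed at the example values mentioned in the thread (8 class bits, 10 obj_idx bits), and `obj_pack()`, `obj_to_pfn()`, `obj_to_obj_idx()` and `obj_set_pfn()` are hypothetical helpers. Only the name `obj_to_class_idx()` appears in the discussion.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative widths only; the real values depend on the kernel config. */
#define OBJ_CLASS_BITS 8u   /* "8 bits for 4K pages" in the thread */
#define OBJ_IDX_BITS   10u  /* "only 10 bits are actually needed" (arm64 example) */

#define OBJ_IDX_MASK   ((UINT64_C(1) << OBJ_IDX_BITS) - 1)
#define OBJ_CLASS_MASK ((UINT64_C(1) << OBJ_CLASS_BITS) - 1)
#define OBJ_PFN_SHIFT  (OBJ_CLASS_BITS + OBJ_IDX_BITS)

/* Pack the three fields into one 64-bit obj value (hypothetical helper). */
static uint64_t obj_pack(uint64_t pfn, unsigned class_idx, unsigned obj_idx)
{
    return (pfn << OBJ_PFN_SHIFT) |
           (((uint64_t)class_idx & OBJ_CLASS_MASK) << OBJ_IDX_BITS) |
           ((uint64_t)obj_idx & OBJ_IDX_MASK);
}

/* Needs no PFN, so it stays valid even if the PFN changes concurrently. */
static unsigned obj_to_class_idx(uint64_t obj)
{
    return (unsigned)((obj >> OBJ_IDX_BITS) & OBJ_CLASS_MASK);
}

static unsigned obj_to_obj_idx(uint64_t obj)
{
    return (unsigned)(obj & OBJ_IDX_MASK);
}

static uint64_t obj_to_pfn(uint64_t obj)
{
    return obj >> OBJ_PFN_SHIFT;
}

/* Models migration: only the PFN field is rewritten (hypothetical helper). */
static uint64_t obj_set_pfn(uint64_t obj, uint64_t new_pfn)
{
    uint64_t low = obj & ((UINT64_C(1) << OBJ_PFN_SHIFT) - 1);

    return (new_pfn << OBJ_PFN_SHIFT) | low;
}
```

Because `obj_set_pfn()` leaves the low bits untouched, a reader that extracts class_idx before or after a migration gets the same answer. That property is what lets the class be chosen before any lock is held, with the PFN re-read afterwards under class->lock.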
On Sun, Apr 26, 2026 at 12:13:02PM +0800, Wenchao Hao wrote:
[...]
>2. Per-cpu deferred free with lockless buffer swap
>
>Defer zs_free() to per-cpu dynamically-allocated buffers (~2048 entries).
>Enqueue: one array write + WRITE_ONCE under preempt_disable — no lock,
>no atomic. When buffers full, schedule a drain worker; overflow falls back
>to sync zs_free().
>
>Drain: allocate a fresh buffer, swap it in, reset count. Since
>the producer stops writing at count==SIZE, the handoff is
>race-free without any lock.
>
>Pseudo-code:
>
>    /* enqueue - hot path */
>    def = get_cpu_ptr(pool->deferred);
>    if (def->count < SIZE) {
>        def->handles[def->count] = handle;
>        WRITE_ONCE(def->count, def->count + 1);
>        if (def->count == SIZE)
>            schedule_work(&pool->drain_work);
>    } else {
>        zs_free(pool, handle); /* fallback */
>    }
>    put_cpu_ptr(pool->deferred);
>
>    /* drain - worker */
>    for_each_possible_cpu(cpu) {
>        def = per_cpu_ptr(pool->deferred, cpu);
>        if (def->count < SIZE)
>            continue;
>        new_buf = kvmalloc_array(SIZE, sizeof(long));
>        old_buf = def->handles;
>        old_count = def->count;
>        def->handles = new_buf;
>        WRITE_ONCE(def->count, 0);
>        /* now drain old_buf[0..old_count-1] */
>        ...
>        kvfree(old_buf);
>    }
>
Hi Wenchao,
I suspect there is a memory ordering issue here:
    def->handles = new_buf;
    WRITE_ONCE(def->count, 0);
Since there are no explicit memory barriers, we cannot guarantee the
order of these stores. If def->count is cleared to 0 first, an enqueue
might end up operating on the old_buf.
This race condition is more likely to be triggered when the size is
smaller. Perhaps we should consider using smp_store_release() to enforce
the ordering?
Thanks
Xueyuan
On Sun, Apr 26, 2026 at 4:50 PM Xueyuan Chen <xueyuan.chen21@gmail.com> wrote:
>
>
> On Sun, Apr 26, 2026 at 12:13:02PM +0800, Wenchao Hao wrote:
>
> [...]
>
> >2. Per-cpu deferred free with lockless buffer swap
> >
> >Defer zs_free() to per-cpu dynamically-allocated buffers (~2048 entries).
> >Enqueue: one array write + WRITE_ONCE under preempt_disable — no lock,
> >no atomic. When buffers full, schedule a drain worker; overflow falls back
> >to sync zs_free().
> >
> >Drain: allocate a fresh buffer, swap it in, reset count. Since
> >the producer stops writing at count==SIZE, the handoff is
> >race-free without any lock.
> >
> >Pseudo-code:
> >
> >    /* enqueue - hot path */
> >    def = get_cpu_ptr(pool->deferred);
> >    if (def->count < SIZE) {
> >        def->handles[def->count] = handle;
> >        WRITE_ONCE(def->count, def->count + 1);
> >        if (def->count == SIZE)
> >            schedule_work(&pool->drain_work);
> >    } else {
> >        zs_free(pool, handle); /* fallback */
> >    }
> >    put_cpu_ptr(pool->deferred);
> >
> >    /* drain - worker */
> >    for_each_possible_cpu(cpu) {
> >        def = per_cpu_ptr(pool->deferred, cpu);
> >        if (def->count < SIZE)
> >            continue;
> >        new_buf = kvmalloc_array(SIZE, sizeof(long));
> >        old_buf = def->handles;
> >        old_count = def->count;
> >        def->handles = new_buf;
> >        WRITE_ONCE(def->count, 0);
> >        /* now drain old_buf[0..old_count-1] */
> >        ...
> >        kvfree(old_buf);
> >    }
> >
>
> Hi Wenchao,
>
> I suspect there is a memory ordering issue here:
>
> def->handles = new_buf;
> WRITE_ONCE(def->count, 0);
>
> Since there are no explicit memory barriers, we cannot guarantee the
> order of these stores. If def->count is cleared to 0 first, an enqueue
> might end up operating on the old_buf.
>
> This race condition is more likely to be triggered when the size is
> smaller. Perhaps we should consider using smp_store_release() to enforce
> the ordering?
>
Hi Xueyuan,
Good catch! You are right — there is a memory ordering issue between
the handles pointer swap and the count reset.
I'll fix this in the next version by using smp_store_release() /
smp_load_acquire() pairs:
    /* drain - worker */
    def->handles = new_buf;
    smp_store_release(&def->count, 0);

    /* enqueue - producer */
    count = smp_load_acquire(&def->count);
    if (count < SIZE) {
        def->handles[count] = handle;
        smp_store_release(&def->count, count + 1);
    }
This ensures that a producer which sees count reset to 0 also observes
the new handles pointer, so it can never write into old_buf. Will
include this fix when posting the formal patch series.
Thanks,
Wenchao
> Thanks
> Xueyuan
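The release/acquire pairing discussed in this exchange can be tried out in user space with C11 atomics. This is a minimal sketch, not the kernel code: `atomic_store_explicit(..., memory_order_release)` and `atomic_load_explicit(..., memory_order_acquire)` stand in for smp_store_release() and smp_load_acquire(), and `struct deferred`, `DEF_SIZE`, `deferred_init()`, `deferred_enqueue()` and `deferred_swap()` are hypothetical names. It assumes one producer per buffer (the per-cpu case) and one drain worker.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define DEF_SIZE 4  /* tiny on purpose; the thread mentions ~2048 entries */

struct deferred {
    unsigned long *handles; /* published by the release store to count */
    atomic_size_t count;
};

static bool deferred_init(struct deferred *def)
{
    def->handles = calloc(DEF_SIZE, sizeof(*def->handles));
    atomic_init(&def->count, 0);
    return def->handles != NULL;
}

/* Producer: returns false when full, so the caller frees synchronously. */
static bool deferred_enqueue(struct deferred *def, unsigned long handle)
{
    /* Acquire: seeing count == 0 implies seeing the swapped-in buffer. */
    size_t count = atomic_load_explicit(&def->count, memory_order_acquire);

    if (count >= DEF_SIZE)
        return false;
    def->handles[count] = handle;
    /* Release: the slot write is visible before the new count is. */
    atomic_store_explicit(&def->count, count + 1, memory_order_release);
    return true;
}

/* Drain: returns the full old buffer (caller frees it), or NULL. */
static unsigned long *deferred_swap(struct deferred *def)
{
    unsigned long *new_buf, *old_buf;

    if (atomic_load_explicit(&def->count, memory_order_acquire) < DEF_SIZE)
        return NULL;    /* producer may still be writing */
    new_buf = calloc(DEF_SIZE, sizeof(*new_buf));
    if (!new_buf)
        return NULL;
    old_buf = def->handles;
    def->handles = new_buf;
    /* Release: the pointer swap is ordered before the count reset. */
    atomic_store_explicit(&def->count, 0, memory_order_release);
    return old_buf;
}
```

The drain side only touches the buffer once it has observed count == DEF_SIZE, and the producer never writes at that count, so the two sides do not access `handles` concurrently. The release store of 0 is what hands the fresh buffer back to the producer.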
On Tue, Apr 21, 2026 at 5:16 AM Wenchao Hao <haowenchao22@gmail.com> wrote:
>
> Swap freeing can be expensive when unmapping a VMA containing
> many swap entries. This has been reported to significantly
> delay memory reclamation during Android's low-memory killing,
> especially when multiple processes are terminated to free
> memory, with slot_free() accounting for more than 80% of
> the total cost of freeing swap entries.
>
> Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
> to asynchronously collect and free swap entries [1][2], but the
> design itself is fairly complex.
>
> When anon folios and swap entries are mixed within a
> process, reclaiming anon folios from killed processes
> helps return memory to the system as quickly as possible,
> so that newly launched applications can satisfy their
> memory demands. It is not ideal for swap freeing to block
> anon folio freeing. On the other hand, swap freeing can
> still return memory to the system, although at a slower
> rate due to memory compression.

Is this correct? I don't think we do decompression in
zswap_invalidate() path. We do decompression in zswap_load(), but as a
separate step from zswap_invalidate().

zswap/zsmalloc entry freeing is decoupled from decompression. For
example, on process teardown, we free the zsmalloc memory but never
decompress (if we do then it's a bug to be fixed lol, but I doubt it).

Zsmalloc freeing might not be worth as much bang-for-your-buck wise
compared to anon folio freeing, but if it's "expensive", then I think
that points to a different root-cause: zsmalloc's poor scalability in
the free path.

I've stared at this code path for a bit, because my other patch series
(vswap - see [1]) was reported to display regression on the free path
on the usemem benchmark. And one of the issues was the contention
between zs_free and compaction (both systemwide compaction, i.e.
zs_page_migrate, and zsmalloc's internal compaction, but mostly the
former):

* zs_free read-acquires pool->lock, and compaction write-acquires the
same lock. So the compaction thread will make all zs free-ers wait for
it. I saw this read lock delay when I perfed the free step of usemem.

* If this lock has fair queue-ing semantics (I have not checked), then
if a compaction is behind a bunch of zs_free in the queue, then
all the subsequent zs_free callers are blocked :)

* I'm also curious about cache-friendliness of this rwlock, bouncing
across CPUs, if you have multiple processes being torn down
concurrently.

Have you perf-ed process teardown yet? Can I ask you for a perf trace
on this part?

I'm not against async zs-freeing (might still be required after all),
but if it's something fixable on the zsmalloc side, we should probably
prioritize that :) Otherwise these swap freeing workers will exhibit
the same poor scalability behavior - we might be better off because we
manage to get rid of bigger chunks of uncompressed memory first, but
we will still be slowed in releasing the system's and cgroup's (in
zswap's case) compressed memory.

I'd love to hear more about thoughts from Yosry, Johannes, Sergey and
Minchan too.
On Tue, Apr 21, 2026 at 11:55 PM Nhat Pham <nphamcs@gmail.com> wrote:
>

Thanks for adding me to the Cc list :), Barry started this idea with
ZRAM, which looks very interesting to me.

> On Tue, Apr 21, 2026 at 5:16 AM Wenchao Hao <haowenchao22@gmail.com> wrote:
> >
> > Swap freeing can be expensive when unmapping a VMA containing
> > many swap entries. This has been reported to significantly
> > delay memory reclamation during Android's low-memory killing,
> > especially when multiple processes are terminated to free
> > memory, with slot_free() accounting for more than 80% of
> > the total cost of freeing swap entries.
> >
> > Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
> > to asynchronously collect and free swap entries [1][2], but the
> > design itself is fairly complex.
> >
> > When anon folios and swap entries are mixed within a
> > process, reclaiming anon folios from killed processes
> > helps return memory to the system as quickly as possible,
> > so that newly launched applications can satisfy their
> > memory demands. It is not ideal for swap freeing to block
> > anon folio freeing. On the other hand, swap freeing can
> > still return memory to the system, although at a slower
> > rate due to memory compression.
>
> Is this correct? I don't think we do decompression in
> zswap_invalidate() path. We do decompression in zswap_load(), but as a
> separate step from zswap_invalidate().

It's not about decompression. I think what Wenchao means here is that
freeing the swap entry also releases the backing compressed data, but
compared to freeing an actual folio (which brings back a free folio to
reduce memory pressure), you may need to free a lot of swap entries to
free one whole folio, because the compressed data could be much
smaller than a folio and is subject to fragmentation. And swap entry
freeing is still not fast enough to be ignored.

>
> zswap/zsmalloc entry freeing is decoupled from decompression. For
> example, on process teardown, we free the zsmalloc memory but never
> decompress (if we do then it's a bug to be fixed lol, but I doubt it).
>
> Zsmalloc freeing might not be worth as much bang-for-your-buck wise
> compared to anon folio freeing, but if it's "expensive", then I think
> that points to a different root-cause: zsmalloc's poor scalability in
> the free path.

That's a very nice insight. I had an idea previously: can we have
something like a zs free bulk? Freeing handles one by one does seem
expensive.
https://lore.kernel.org/linux-mm/adt3Q_SRToF6fb3W@KASONG-MC4/

It might be tricky to do so though.

It would be best if we can speed up everything; doing things async
doesn't reduce the total amount of work, and might cause more trouble,
like worker overhead or delayed freeing causing more memory pressure
if the workqueue doesn't run in time. Or maybe a process is almost
completely swapped out, then this won't help at all.

I'm not against the async idea; the two might combine well.

>
> I've stared at this code path for a bit, because my other patch series
> (vswap - see [1]) was reported to display regression on the free path
> on the usemem benchmark. And one of the issues was the contention
> between zs_free and compaction (both systemwide compaction, i.e.
> zs_page_migrate, and zsmalloc's internal compaction, but mostly the
> former):
>
> * zs_free read-acquires pool->lock, and compaction write-acquires the
> same lock. So the compaction thread will make all zs free-ers wait for
> it. I saw this read lock delay when I perfed the free step of usemem.
>
> * If this lock has fair queue-ing semantics (I have not checked), then
> if a compaction is behind a bunch of zs_free in the queue, then
> all the subsequent zs_free callers are blocked :)
>
> * I'm also curious about cache-friendliness of this rwlock, bouncing
> across CPUs, if you have multiple processes being torn down
> concurrently.

That's interesting. When I mentioned zs free bulk I was thinking that,
if we have a percpu queue, at least we may try to take the read lock
on every enqueue, free the whole queue if successful, then release the
lock. I'm sure there are more ways to optimize that, just a random
idea :)
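The "try the read lock on every enqueue, free the whole queue if successful" idea can be sketched in user space as follows. Everything here is hypothetical and only illustrates the control flow: `toy_rwlock` is a stand-in built on C11 atomics (it is not the kernel rwlock and has no fairness or blocking path), `QUEUE_CAP` is arbitrary, and the `freed` counter replaces the real freeing work.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 8

/* Toy lock: -1 means a writer (e.g. compaction) holds it, n >= 0 means n readers. */
struct toy_rwlock {
    atomic_int state;
};

static bool toy_read_trylock(struct toy_rwlock *l)
{
    int s = atomic_load(&l->state);

    while (s >= 0) {
        /* On failure, s is reloaded and the writer check is repeated. */
        if (atomic_compare_exchange_weak(&l->state, &s, s + 1))
            return true;
    }
    return false;   /* a writer holds the lock; do not wait */
}

static void toy_read_unlock(struct toy_rwlock *l)
{
    atomic_fetch_sub(&l->state, 1);
}

struct free_queue {
    unsigned long handles[QUEUE_CAP];
    size_t len;
    size_t freed;   /* stand-in for the actual freeing work */
};

/*
 * Queue the handle, then opportunistically flush the whole queue.
 * Returns false when the queue is full; a real caller would then
 * have to fall back to a blocking free.
 */
static bool queue_free(struct toy_rwlock *l, struct free_queue *q,
                       unsigned long handle)
{
    if (q->len == QUEUE_CAP)
        return false;
    q->handles[q->len++] = handle;

    if (toy_read_trylock(l)) {
        q->freed += q->len; /* free everything under one lock hold */
        q->len = 0;
        toy_read_unlock(l);
    }
    return true;
}
```

While a writer holds the lock, handles simply pile up in the queue. The first enqueue after the writer leaves flushes all of them under a single read-lock acquisition, which is the batching effect the idea is after.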
On Tue, Apr 21, 2026 at 10:18 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Tue, Apr 21, 2026 at 11:55 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
>
> Thanks for adding me to the Cc list :), Barry started this idea with
> ZRAM, which looks very interesting to me.
>
> > On Tue, Apr 21, 2026 at 5:16 AM Wenchao Hao <haowenchao22@gmail.com> wrote:
> > >
> > > Swap freeing can be expensive when unmapping a VMA containing
> > > many swap entries. This has been reported to significantly
> > > delay memory reclamation during Android's low-memory killing,
> > > especially when multiple processes are terminated to free
> > > memory, with slot_free() accounting for more than 80% of
> > > the total cost of freeing swap entries.
> > >
> > > Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
> > > to asynchronously collect and free swap entries [1][2], but the
> > > design itself is fairly complex.
> > >
> > > When anon folios and swap entries are mixed within a
> > > process, reclaiming anon folios from killed processes
> > > helps return memory to the system as quickly as possible,
> > > so that newly launched applications can satisfy their
> > > memory demands. It is not ideal for swap freeing to block
> > > anon folio freeing. On the other hand, swap freeing can
> > > still return memory to the system, although at a slower
> > > rate due to memory compression.
> >
> > Is this correct? I don't think we do decompression in
> > zswap_invalidate() path. We do decompression in zswap_load(), but as a
> > separate step from zswap_invalidate().
>
> It's not about decompression. I think what Wenchao means here is that
> freeing the swap entry also releases the backing compressed data, but
> compared to freeing an actual folio (which brings back a free folio to
> reduce memory pressure), you may need to free a lot of swap entries to
> free one whole folio, because the compressed data could be much
> smaller than a folio and is subject to fragmentation. And swap entry
> freeing is still not fast enough to be ignored.

Ah I see yeah. That's the "not as much bang-for-your-buck as folio
freeing" category. I agree on this point.

>
> >
> > zswap/zsmalloc entry freeing is decoupled from decompression. For
> > example, on process teardown, we free the zsmalloc memory but never
> > decompress (if we do then it's a bug to be fixed lol, but I doubt it).
> >
> > Zsmalloc freeing might not be worth as much bang-for-your-buck wise
> > compared to anon folio freeing, but if it's "expensive", then I think
> > that points to a different root-cause: zsmalloc's poor scalability in
> > the free path.
>
> That's a very nice insight. I had an idea previously: can we have
> something like a zs free bulk? Freeing handles one by one does seem
> expensive.
> https://lore.kernel.org/linux-mm/adt3Q_SRToF6fb3W@KASONG-MC4/
>
> It might be tricky to do so though.
>
> It would be best if we can speed up everything; doing things async
> doesn't reduce the total amount of work, and might cause more trouble,
> like worker overhead or delayed freeing causing more memory pressure
> if the workqueue doesn't run in time. Or maybe a process is almost
> completely swapped out, then this won't help at all.
>
> I'm not against the async idea; the two might combine well.

Completely agree! I was thinking about batching the free operations
for zsmalloc. Right now it seems like even if we have a contiguous
range of swap slots to be freed, we call one
zram_slot_free_notify/zswap_invalidate at a time, which then calls
zs_free one at a time? I wonder if there's any batching opportunity
here. Might be complicated with the pool lock and class lock dance in
zs_free() though :)

And yeah the async stuff is orthogonal too.

>
> >
> > I've stared at this code path for a bit, because my other patch series
> > (vswap - see [1]) was reported to display regression on the free path
> > on the usemem benchmark. And one of the issues was the contention
> > between zs_free and compaction (both systemwide compaction, i.e.
> > zs_page_migrate, and zsmalloc's internal compaction, but mostly the
> > former):
> >
> > * zs_free read-acquires pool->lock, and compaction write-acquires the
> > same lock. So the compaction thread will make all zs free-ers wait for
> > it. I saw this read lock delay when I perfed the free step of usemem.
> >
> > * If this lock has fair queue-ing semantics (I have not checked), then
> > if a compaction is behind a bunch of zs_free in the queue, then
> > all the subsequent zs_free callers are blocked :)
> >
> > * I'm also curious about cache-friendliness of this rwlock, bouncing
> > across CPUs, if you have multiple processes being torn down
> > concurrently.
>
> That's interesting. When I mentioned zs free bulk I was thinking that,
> if we have a percpu queue, at least we may try to take the read lock
> on every enqueue, free the whole queue if successful, then release the
> lock. I'm sure there are more ways to optimize that, just a random
> idea :)

Yep! Would be nice to have some perf trace to pinpoint where the
overhead is.

On my end, I perfed the free phase of usemem. It varies a bit based on
exact build config, kernel version, or even between runs, but the
cheapest I've seen for the pool lock contention overhead is about 3%
of the free phase (this is on baseline, not vswap kernel). That's
pretty big (bigger than vswap overhead even on the kernels with vswap,
which is kinda silly). Obviously the host was very overcommitted, so
compaction was running in the background at the same time, but
still...
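On Tue, Apr 21, 2026 at 11:07 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Apr 21, 2026 at 10:18 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Tue, Apr 21, 2026 at 11:55 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> >
> > Thanks for adding me to the Cc list :), Barry started this idea with
> > ZRAM, which looks very interesting to me.
> >
> > > On Tue, Apr 21, 2026 at 5:16 AM Wenchao Hao <haowenchao22@gmail.com> wrote:
> > > >
> > > > Swap freeing can be expensive when unmapping a VMA containing
> > > > many swap entries. This has been reported to significantly
> > > > delay memory reclamation during Android's low-memory killing,
> > > > especially when multiple processes are terminated to free
> > > > memory, with slot_free() accounting for more than 80% of
> > > > the total cost of freeing swap entries.
> > > >
> > > > Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
> > > > to asynchronously collect and free swap entries [1][2], but the
> > > > design itself is fairly complex.
> > > >
> > > > When anon folios and swap entries are mixed within a
> > > > process, reclaiming anon folios from killed processes
> > > > helps return memory to the system as quickly as possible,
> > > > so that newly launched applications can satisfy their
> > > > memory demands. It is not ideal for swap freeing to block
> > > > anon folio freeing. On the other hand, swap freeing can
> > > > still return memory to the system, although at a slower
> > > > rate due to memory compression.
> > >
> > > Is this correct? I don't think we do decompression in
> > > zswap_invalidate() path. We do decompression in zswap_load(), but as a
> > > separate step from zswap_invalidate().
> >
> > It's not about decompression. I think what Wenchao means here is that
> > freeing the swap entry also releases the backing compressed data, but
> > compared to freeing an actual folio (which brings back a free folio to
> > reduce memory pressure), you may need to free a lot of swap entries to
> > free one whole folio, because the compressed data could be much
> > smaller than a folio and is subject to fragmentation. And swap entry
> > freeing is still not fast enough to be ignored.
>
> Ah I see yeah. That's the "not as much bang-for-your-buck as folio
> freeing" category. I agree on this point.
>
> >
> > >
> > > zswap/zsmalloc entry freeing is decoupled from decompression. For
> > > example, on process teardown, we free the zsmalloc memory but never
> > > decompress (if we do then it's a bug to be fixed lol, but I doubt it).
> > >
> > > Zsmalloc freeing might not be worth as much bang-for-your-buck wise
> > > compared to anon folio freeing, but if it's "expensive", then I think
> > > that points to a different root-cause: zsmalloc's poor scalability in
> > > the free path.
> >
> > That's a very nice insight. I had an idea previously: can we have
> > something like a zs free bulk? Freeing handles one by one does seem
> > expensive.
> > https://lore.kernel.org/linux-mm/adt3Q_SRToF6fb3W@KASONG-MC4/
> >
> > It might be tricky to do so though.
> >
> > It would be best if we can speed up everything; doing things async
> > doesn't reduce the total amount of work, and might cause more trouble,
> > like worker overhead or delayed freeing causing more memory pressure
> > if the workqueue doesn't run in time. Or maybe a process is almost
> > completely swapped out, then this won't help at all.
> >
> > I'm not against the async idea; the two might combine well.
>
> Completely agree! I was thinking about batching the free operations
> for zsmalloc. Right now it seems like even if we have a contiguous
> range of swap slots to be freed, we call one
> zram_slot_free_notify/zswap_invalidate at a time, which then calls
> zs_free one at a time? I wonder if there's any batching opportunity
> here. Might be complicated with the pool lock and class lock dance in
> zs_free() though :)
>
> And yeah the async stuff is orthogonal too.
>
> >
> > >
> > > I've stared at this code path for a bit, because my other patch series
> > > (vswap - see [1]) was reported to display regression on the free path
> > > on the usemem benchmark. And one of the issues was the contention
> > > between zs_free and compaction (both systemwide compaction, i.e.
> > > zs_page_migrate, and zsmalloc's internal compaction, but mostly the
> > > former):
> > >
> > > * zs_free read-acquires pool->lock, and compaction write-acquires the
> > > same lock. So the compaction thread will make all zs free-ers wait for
> > > it. I saw this read lock delay when I perfed the free step of usemem.
> > >
> > > * If this lock has fair queue-ing semantics (I have not checked), then
> > > if a compaction is behind a bunch of zs_free in the queue, then
> > > all the subsequent zs_free callers are blocked :)
> > >
> > > * I'm also curious about cache-friendliness of this rwlock, bouncing
> > > across CPUs, if you have multiple processes being torn down
> > > concurrently.
> >
> > That's interesting. When I mentioned zs free bulk I was thinking that,
> > if we have a percpu queue, at least we may try to take the read lock
> > on every enqueue, free the whole queue if successful, then release the
> > lock. I'm sure there are more ways to optimize that, just a random
> > idea :)
>
> Yep! Would be nice to have some perf trace to pinpoint where the
> overhead is.
>

Ah OK - I found this thread now:

https://lore.kernel.org/linux-mm/20260414054930.225853-1-xueyuan.chen21@gmail.com/

Hmm, free_zspage() and kmem_cache_free().

* kmem_cache_free() is just handle freeing. Bulk-freeing?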
I think what Wenchao means here is that: > > freeing the swap entry also releases the backing compression data, but > > compared to freeing an actual folio (which bring back a free folio to > > reduce memory pressure), you may need to free a lot of swap entries to > > free one whole folio, because the compressed data could be much > > smaller than folio and with fragmentation. And swap entry freeing is > > still not fast enough to be ignored. > > Ah I see yeah. That's the not "as much bang-for-your-buck" as folio > freeing category. I agree on this point. > > > > > > > > > zswap/zsmalloc entry freeing is decoupled from decompression. For > > > example, on process teardown, we free the zsmalloc memory but never > > > decompress (if we do then it's a bug to be fixed lol, but I doubt it). > > > > > > Zsmalloc freeing might not be worth as much bang-for-your-buck wise > > > compared to anon folio freeing, but if it's "expensive", then I think > > > that points to a different root-cause: zsmalloc's poor scalability in > > > the free path. > > > > That's a very nice insight. I had an idea previously that can we have > > something like a zs free bulk? Freeing handles one by one does seem > > expensive. > > https://lore.kernel.org/linux-mm/adt3Q_SRToF6fb3W@KASONG-MC4/ > > > > It might be tricky to do so though. > > > > It will be best if we can speed up everything, doing things async > > doesn't reduce the total amount of work, and might cause more trouble > > like worker overhead or delayed freeing causing more memory pressure, > > if the workqueue didn't run in time. Or maybe a process is almost > > completely swapped out, then this won't help at all. > > > > I'm not against the async idea, they might combine well. > > Completely agree! I was thinking about batching the free operations > for zsmalloc. 
Right now seems like even if we have a contiguous range > of swap slots to be freed, we call one > zram_slot_free_notify/zswap_invalidate at a time, which then call > zs_free one at a time? I wonder if there's any batching opportunity > here. Might be complicated with the pool lock and class lock dance in > zs_free() though :) > > And yeah the async stuff is orthogonal too. > > > > > > > > > I've stared at this code path for a bit, because my other patch series > > > (vswap - see [1]) was reported to display regression on the free path > > > on the usemem benchmark. And one of the issues was the contention > > > between compaction (both systemwide compaction, i.e zs_page_migrate, > > > and zsmalloc's internal compaction, but mostly the former).: > > > > > > * zs_free read-acquires pool->lock, and compaction write-acquires the > > > same lock. So the compaction thread will make all zs free-ers wait for > > > it. I saw this read lock delay when I perfed the free step of usemem. > > > > > > * If this lock has fair queue-ing semantics (I have not checked), then > > > if there a compaction is behind a bunch of zs_free in the queue, then > > > all the subsequent zs_free's ers are blocked :) > > > > > > * I'm also curious about cache-friendliness of this rwlock, bouncing > > > across CPUs, if you have multiple processes being torn down > > > concurrently. > > > > That's interesting, when I mentioned zs free bulk I was thinking that, > > if we have a percpu queue, at least we may try read lock that on every > > enqueue, free the whole queue if successful, then release the lock. > > I'm sure there are more ways to optimize that, just a random idea :) > > Yep! Would be nice to have some perf trace to pinpoint where the overhead is. > Ah OK - I found this thread now: https://lore.kernel.org/linux-mm/20260414054930.225853-1-xueyuan.chen21@gmail.com/ Hmm, free_zspage() and kmem_cache_free(). * kmem_cache_free() is just handle freeing. Bulk-freeing? 
* free_zspage() looks like just ordinary teardown work :( Seems like we're not spinning any lock here - we just try lock the backing pages, and the rest is normal work. Not sure how to optimize this - perhaps deferring is the only way.
On Tue, Apr 21, 2026 at 11:25:17AM -0700, Nhat Pham wrote:
[...]
>Hmm, free_zspage() and kmem_cache_free().
>
>* kmem_cache_free() is just handle freeing. Bulk-freeing?
>
>* free_zspage() looks like just ordinary teardown work :( Seems like
>we're not spinning any lock here - we just try lock the backing pages,
>and the rest is normal work. Not sure how to optimize this - perhaps
>deferring is the only way.
>
>
Hi Nhat,
Currently, free_zspage() is called while holding the class->lock.
However, free_zspage() eventually invokes folio_put(), which may acquire
the zone->lock.
This creates a nested lock dependency. If multiple CPUs contend for the
same class->lock and the current holder is stalled waiting for the
zone->lock, it significantly extends the hold time of the class->lock.
This causes other CPUs to wait much longer.
Here is the ftrace data showing the severe contention on class->lock.
Under contention, the time spent in queued_spin_lock_slowpath() jumps
from ~1.3us to over 30us, raising the total latency of zs_free() from
4.61us to 33.85us.
7)               |  zs_free() {
7)   0.220 us    |    _raw_read_lock();
7)               |    _raw_spin_lock() {
7)   1.320 us    |      queued_spin_lock_slowpath();
7)   1.820 us    |    }
7)   0.170 us    |    _raw_read_unlock();
7)   0.170 us    |    obj_free();
7)   0.190 us    |    fix_fullness_group();
7)   0.150 us    |    _raw_spin_unlock();
7)   0.170 us    |    kmem_cache_free();
7)   4.610 us    |  }
---------------------------------------------------------
7)               |  zs_free() {
7)   0.230 us    |    _raw_read_lock();
7)               |    _raw_spin_lock() {
7) + 30.100 us   |      queued_spin_lock_slowpath();
7) + 30.600 us   |    }
7)   0.200 us    |    _raw_read_unlock();
7)   0.170 us    |    obj_free();
7)   0.170 us    |    fix_fullness_group();
7)   0.170 us    |    _raw_spin_unlock();
7)   0.210 us    |    kmem_cache_free();
7) + 33.850 us   |  }
Best regards,
Xueyuan
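
The nested-lock problem described in this message, and the cover letter's plan
to move zspage freeing outside class->lock, can be illustrated with a small
userspace sketch. It rests on assumptions and is not the patch itself:
class_lock is a pthread mutex standing in for class->lock, free() stands in
for the free_zspage()/folio_put() path that may take zone->lock, and
struct page_group, group_alloc() and object_free() are hypothetical names.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a zspage: counts the objects still in use. */
struct page_group {
	int inuse;
};

/* Stand-in for class->lock. */
static pthread_mutex_t class_lock = PTHREAD_MUTEX_INITIALIZER;

/* Allocate a group that starts with `objs` live objects. */
static struct page_group *group_alloc(int objs)
{
	struct page_group *g = malloc(sizeof(*g));

	if (g)
		g->inuse = objs;
	return g;
}

/*
 * Free one object. Only the bookkeeping runs under class_lock; an empty
 * group is released after the lock is dropped, so a slow allocator path
 * no longer extends the lock hold time seen by other CPUs.
 * Returns true if this call released the whole group.
 */
static bool object_free(struct page_group *g)
{
	bool empty;

	pthread_mutex_lock(&class_lock);
	empty = (--g->inuse == 0);
	/* Real code would also unlink the empty group from its lists here. */
	pthread_mutex_unlock(&class_lock);

	if (empty)
		free(g);	/* potentially slow; class_lock is not held */
	return empty;
}
```

In this model the group can be released without the lock because no other
thread can reach it once its last object is gone. The real code would need to
guarantee the same thing before it drops class->lock.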