[RFC 0/3] mm: process_mrelease: expedited reclaim and auto-kill support

Minchan Kim posted 3 patches 2 months, 1 week ago
Posted by Minchan Kim 2 months, 1 week ago
This patch series introduces optimizations to expedite memory reclamation
in process_mrelease() and provides a secure, race-free "auto-kill"
mechanism for efficient container shutdown and OOM handling.

Currently, process_mrelease() unmaps pages but leaves clean file folios
on the LRU list, relying on standard memory reclaim to eventually free
them. Furthermore, requiring userspace to send a SIGKILL prior to
invoking process_mrelease() introduces scheduling race conditions where
the victim task may enter the exit path prematurely, bypassing expedited
reclamation hooks.

This series addresses these limitations in three logical steps.

Patch #1: mm: process_mrelease: expedite clean file folio reclaim via mmu_gather
Integrates clean file folio eviction directly into the low-level TLB
batching (mmu_gather) infrastructure. Symmetrically truncates clean file
folios alongside anonymous pages during the unmap loop.

Patch #2: mm: process_mrelease: skip LRU movement for exclusive file folios
Skips costly LRU marking (folio_mark_accessed) for exclusive file-backed
folios undergoing process_mrelease reclaim. Perf profiling reveals that
LRU movement accounts for ~55% of overhead during unmap.

Patch #3: mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
Adds an auto-kill flag supporting atomic teardown. Utilizes a dedicated
signal code (KILL_MRELEASE) to guarantee MMF_UNSTABLE is marked in the
signal delivery path, preventing scheduling races.

Minchan Kim (3):
  mm: process_mrelease: expedite clean file folio reclaim via mmu_gather
  mm: process_mrelease: skip LRU movement for exclusive file folios
  mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag

 arch/s390/include/asm/tlb.h        |  2 +-
 include/linux/swap.h               |  9 ++++++---
 include/uapi/asm-generic/siginfo.h |  6 ++++++
 include/uapi/linux/mman.h          |  4 ++++
 kernel/signal.c                    |  4 ++++
 mm/memory.c                        | 13 ++++++++++++-
 mm/mmu_gather.c                    |  8 +++++---
 mm/oom_kill.c                      | 20 +++++++++++++++++++-
 mm/swap_state.c                    | 19 +++++++++++++++++--
 9 files changed, 74 insertions(+), 11 deletions(-)

-- 
2.54.0.rc0.605.g598a273b03-goog
Re: [RFC 0/3] mm: process_mrelease: expedited reclaim and auto-kill support
Posted by Michal Hocko 2 months, 1 week ago
On Mon 13-04-26 15:39:45, Minchan Kim wrote:
> This patch series introduces optimizations to expedite memory reclamation
> in process_mrelease() and provides a secure, race-free "auto-kill"
> mechanism for efficient container shutdown and OOM handling.
> 
> Currently, process_mrelease() unmaps pages but leaves clean file folios
> on the LRU list, relying on standard memory reclaim to eventually free
> them. Furthermore, requiring userspace to send a SIGKILL prior to
> invoking process_mrelease() introduces scheduling race conditions where
> the victim task may enter the exit path prematurely, bypassing expedited
> reclamation hooks.
> 
> This series addresses these limitations in three logical steps.
> 
> Patch #1: mm: process_mrelease: expedite clean file folio reclaim via mmu_gather
> Integrates clean file folio eviction directly into the low-level TLB
> batching (mmu_gather) infrastructure. Symmetrically truncates clean file
> folios alongside anonymous pages during the unmap loop.

Why do we need to care about clean page cache? Is this a form of
drop_caches?

> Patch #2: mm: process_mrelease: skip LRU movement for exclusive file folios
> Skips costly LRU marking (folio_mark_accessed) for exclusive file-backed
> folios undergoing process_mrelease reclaim. Perf profiling reveals that
> LRU movement accounts for ~55% of overhead during unmap.

OK, but why is this not desirable behavior for mrelease?

> Patch #3: mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
> Adds an auto-kill flag supporting atomic teardown. Utilizes a dedicated
> signal code (KILL_MRELEASE) to guarantee MMF_UNSTABLE is marked in the
> signal delivery path, preventing scheduling races.

Could you explain why those races are a real problem?
-- 
Michal Hocko
SUSE Labs
Re: [RFC 0/3] mm: process_mrelease: expedited reclaim and auto-kill support
Posted by Minchan Kim 2 months ago
On Tue, Apr 14, 2026 at 08:57:57AM +0200, Michal Hocko wrote:
> On Mon 13-04-26 15:39:45, Minchan Kim wrote:
> > This patch series introduces optimizations to expedite memory reclamation
> > in process_mrelease() and provides a secure, race-free "auto-kill"
> > mechanism for efficient container shutdown and OOM handling.
> > 
> > Currently, process_mrelease() unmaps pages but leaves clean file folios
> > on the LRU list, relying on standard memory reclaim to eventually free
> > them. Furthermore, requiring userspace to send a SIGKILL prior to
> > invoking process_mrelease() introduces scheduling race conditions where
> > the victim task may enter the exit path prematurely, bypassing expedited
> > reclamation hooks.
> > 
> > This series addresses these limitations in three logical steps.
> > 
> > Patch #1: mm: process_mrelease: expedite clean file folio reclaim via mmu_gather
> > Integrates clean file folio eviction directly into the low-level TLB
> > batching (mmu_gather) infrastructure. Symmetrically truncates clean file
> > folios alongside anonymous pages during the unmap loop.
> 
> Why do we need to care about clean page cache? Is this a form of
> drop_caches?

The goal is to ensure the memory is actually freed by the time
process_mrelease returns. Currently, process_mrelease unmaps pages, but
page caches remain on the LRU, leaving them to be reclaimed later
by kswapd or direct reclaim. This delay defeats the purpose of
"expedited" release. It’s not a global drop_caches, but rather a
targeted eviction for the victim process to make its memory immediately
available for other urgent allocations.

> 
> > Patch #2: mm: process_mrelease: skip LRU movement for exclusive file folios
> > Skips costly LRU marking (folio_mark_accessed) for exclusive file-backed
> > folios undergoing process_mrelease reclaim. Perf profiling reveals that
> > LRU movement accounts for ~55% of overhead during unmap.
> 
> OK, but why is this not desirable behavior for mrelease?

In Android, lmkd kills background apps under memory pressure and then calls
process_mrelease. If the memory release is slow due to LRU overhead (~55% as noted),
it cannot keep up with the allocation speed of the foreground app.
This delay often leads to "over-killing" - killing more background apps
than necessary because the system hasn't yet "seen" the memory freed
from the first kill.

> 
> > Patch #3: mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
> > Adds an auto-kill flag supporting atomic teardown. Utilizes a dedicated
> > signal code (KILL_MRELEASE) to guarantee MMF_UNSTABLE is marked in the
> > signal delivery path, preventing scheduling races.
> 
> Could you explain why those races are a real problem?

The race occurs when the victim process starts its own exit path (after
SIGKILL) before the caller can invoke process_mrelease. If the victim
reaches the exit path first, the caller might lose the window to apply
these expedited reclamation optimizations. By combining the kill and the
release into an atomic operation with a dedicated signal code, we
guarantee that the process is reaped efficiently without competing with
the process's own teardown logic.
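
For reference, the two-step sequence that exists today can be sketched from
userspace as below. This is a minimal sketch, not code from the series: the
helper name kill_then_reap() is hypothetical, the fallback syscall numbers are
the asm-generic values for older libc headers, and flags is passed as 0
because the proposed PROCESS_MRELEASE_REAP_KILL flag is not in mainline.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallback syscall numbers (asm-generic values) for older libc headers. */
#ifndef SYS_pidfd_send_signal
#define SYS_pidfd_send_signal 424
#endif
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_process_mrelease
#define SYS_process_mrelease 448
#endif

/*
 * Hypothetical helper: kill `pid`, then ask the kernel to reap its memory.
 * Returns 0 on success or a negative errno value.
 */
static int kill_then_reap(pid_t pid)
{
	int ret = 0;
	int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);

	if (pidfd < 0)
		return -errno;

	/* Step 1: the victim may start exiting as soon as this returns. */
	if (syscall(SYS_pidfd_send_signal, pidfd, SIGKILL, NULL, 0) < 0)
		ret = -errno;
	/* Step 2: the window described above sits between the two calls. */
	else if (syscall(SYS_process_mrelease, pidfd, 0) < 0)
		ret = -errno;

	close(pidfd);
	return ret;
}
```

A caller would invoke kill_then_reap(victim_pid) and treat a negative return
from the second step as "the victim got there first", which is the case the
auto-kill flag is meant to avoid.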

> -- 
> Michal Hocko
> SUSE Labs
Re: [RFC 0/3] mm: process_mrelease: expedited reclaim and auto-kill support
Posted by Michal Hocko 2 months ago
On Tue 14-04-26 13:00:16, Minchan Kim wrote:
> On Tue, Apr 14, 2026 at 08:57:57AM +0200, Michal Hocko wrote:
> > On Mon 13-04-26 15:39:45, Minchan Kim wrote:
> > > This patch series introduces optimizations to expedite memory reclamation
> > > in process_mrelease() and provides a secure, race-free "auto-kill"
> > > mechanism for efficient container shutdown and OOM handling.
> > > 
> > > Currently, process_mrelease() unmaps pages but leaves clean file folios
> > > on the LRU list, relying on standard memory reclaim to eventually free
> > > them. Furthermore, requiring userspace to send a SIGKILL prior to
> > > invoking process_mrelease() introduces scheduling race conditions where
> > > the victim task may enter the exit path prematurely, bypassing expedited
> > > reclamation hooks.
> > > 
> > > This series addresses these limitations in three logical steps.
> > > 
> > > Patch #1: mm: process_mrelease: expedite clean file folio reclaim via mmu_gather
> > > Integrates clean file folio eviction directly into the low-level TLB
> > > batching (mmu_gather) infrastructure. Symmetrically truncates clean file
> > > folios alongside anonymous pages during the unmap loop.
> > 
> > Why do we need to care about clean page cache? Is this a form of
> > drop_caches?
> 
> The goal is to ensure the memory is actually freed by the time
> process_mrelease returns. Currently, process_mrelease unmaps pages, but
> page caches remain on the LRU, leaving them to be reclaimed later
> by kswapd or direct reclaim.

Correct. This was the initial design decision because there is not much
you can assume about page cache pages which are very often shared. Even
if they are not mapped by all users.

> This delay defeats the purpose of
> "expedited" release. It’s not a global drop_caches, but rather a
> targeted eviction for the victim process to make its memory immediately
> available for other urgent allocations.

Clean page cache reclaim should be quite effective. Why doesn't kswapd
keep up in that regard? Or is this more a per-memcg problem where there
is no background reclaim and you are hitting direct reclaim to clean up
those pages?

> > > Patch #2: mm: process_mrelease: skip LRU movement for exclusive file folios
> > > Skips costly LRU marking (folio_mark_accessed) for exclusive file-backed
> > > folios undergoing process_mrelease reclaim. Perf profiling reveals that
> > > LRU movement accounts for ~55% of overhead during unmap.
> > 
> > OK, but why is this not desirable behavior for mrelease?
> 
> In Android, lmkd kills background apps under memory pressure and then calls
> process_mrelease. If the memory release is slow due to LRU overhead (~55% as noted),
> it cannot keep up with the allocation speed of the foreground app.
> This delay often leads to "over-killing" - killing more background apps
> than necessary because the system hasn't yet "seen" the memory freed
> from the first kill.

OK, I see. More on that below.
 
> > > Patch #3: mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
> > > Adds an auto-kill flag supporting atomic teardown. Utilizes a dedicated
> > > signal code (KILL_MRELEASE) to guarantee MMF_UNSTABLE is marked in the
> > > signal delivery path, preventing scheduling races.
> > 
> > Could you explain why those races are a real problem?
> 
> The race occurs when the victim process starts its own exit path (after
> SIGKILL) before the caller can invoke process_mrelease. If the victim
> reaches the exit path first, the caller might lose the window to apply
> these expedited reclamation optimizations.

Isn't this the problem you are trying to solve then? You are special
casing process_mrelease while you really want to expedite the process
memory clean up. 

The same situation happens with the global OOM and your approach doesn't
really close the race anyway. You send SIGKILL first and the victim can
hit the exit path right after that before you start processing the rest.
That is not fundamentally different from doing that in two syscalls,
race window is just smaller.

All that being said, I do not think those special hacks for
process_mrelease are the right approach. I very much agree that the
address space tear down for a dying process could be improved and we
should be focusing on that part.
-- 
Michal Hocko
SUSE Labs
Re: [RFC 0/3] mm: process_mrelease: expedited reclaim and auto-kill support
Posted by Minchan Kim 2 months ago
On Wed, Apr 15, 2026 at 09:38:05AM +0200, Michal Hocko wrote:
> On Tue 14-04-26 13:00:16, Minchan Kim wrote:
> > On Tue, Apr 14, 2026 at 08:57:57AM +0200, Michal Hocko wrote:
> > > On Mon 13-04-26 15:39:45, Minchan Kim wrote:
> > > > This patch series introduces optimizations to expedite memory reclamation
> > > > in process_mrelease() and provides a secure, race-free "auto-kill"
> > > > mechanism for efficient container shutdown and OOM handling.
> > > > 
> > > > Currently, process_mrelease() unmaps pages but leaves clean file folios
> > > > on the LRU list, relying on standard memory reclaim to eventually free
> > > > them. Furthermore, requiring userspace to send a SIGKILL prior to
> > > > invoking process_mrelease() introduces scheduling race conditions where
> > > > the victim task may enter the exit path prematurely, bypassing expedited
> > > > reclamation hooks.
> > > > 
> > > > This series addresses these limitations in three logical steps.
> > > > 
> > > > Patch #1: mm: process_mrelease: expedite clean file folio reclaim via mmu_gather
> > > > Integrates clean file folio eviction directly into the low-level TLB
> > > > batching (mmu_gather) infrastructure. Symmetrically truncates clean file
> > > > folios alongside anonymous pages during the unmap loop.
> > > 
> > > Why do we need to care about clean page cache? Is this a form of
> > > drop_caches?
> > 
> > The goal is to ensure the memory is actually freed by the time
> > process_mrelease returns. Currently, process_mrelease unmaps pages, but
> > page caches remain on the LRU, leaving them to be reclaimed later
> > by kswapd or direct reclaim.
> 
> Correct. This was the initial design decision because there is not much
> you can assume about page cache pages which are very often shared. Even
> if they are not mapped by all users.

Fair point. However, that's the trade-off:

Leaving unmapped caches to be reclaimed asynchronously keeps system memory
pressure high for too long. In Android, this delay forces the LMKD to
unnecessarily kill additional innocent background apps before the memory
from the original victim is recovered.

Furthermore, this approach maintains safety: mapping_evict_folio() will
naturally fail if the folio is still actively shared by other processes.
That is not perfect, but it is a minimal safety net.

I believe preventing redundant background app kills is worth the cost of some
extra I/O, since the user can keep listening to their music.

> 
> > This delay defeats the purpose of
> > "expedited" release. It’s not a global drop_caches, but rather a
> > targeted eviction for the victim process to make its memory immediately
> > available for other urgent allocations.
> 
> Clean page cache reclaim should be quite effective. Why doesn't kswapd
> keep up in that regard? Or is this more a per-memcg problem where there
> is no background reclaim and you are hitting direct reclaim to clean up
> those pages?

There are many reasons. I cannot list every way kswapd falls behind memory
allocation, but for example:

1. kswapd is CPU hungry.
2. There is no control over which core (big, middle, or little) kswapd runs on.
3. Allocation is much faster than memory reclaim.
4. kswapd can be stuck on some lock.
5. kswapd may be busy swapping out, since anon and file reclaim are serialized.

> 
> > > > Patch #2: mm: process_mrelease: skip LRU movement for exclusive file folios
> > > > Skips costly LRU marking (folio_mark_accessed) for exclusive file-backed
> > > > folios undergoing process_mrelease reclaim. Perf profiling reveals that
> > > > LRU movement accounts for ~55% of overhead during unmap.
> > > 
> > > OK, but why is this not desirable behavior for mrelease?
> > 
> > In Android, lmkd kills background apps under memory pressure and then calls
> > process_mrelease. If the memory release is slow due to LRU overhead (~55% as noted),
> > it cannot keep up with the allocation speed of the foreground app.
> > This delay often leads to "over-killing" - killing more background apps
> > than necessary because the system hasn't yet "seen" the memory freed
> > from the first kill.
> 
> OK, I see. More on that below.
>  
> > > > Patch #3: mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
> > > > Adds an auto-kill flag supporting atomic teardown. Utilizes a dedicated
> > > > signal code (KILL_MRELEASE) to guarantee MMF_UNSTABLE is marked in the
> > > > signal delivery path, preventing scheduling races.
> > > 
> > > Could you explain why those races are a real problem?
> > 
> > The race occurs when the victim process starts its own exit path (after
> > SIGKILL) before the caller can invoke process_mrelease. If the victim
> > reaches the exit path first, the caller might lose the window to apply
> > these expedited reclamation optimizations.
> 
> Isn't this the problem you are trying to solve then? You are special
> casing process_mrelease while you really want to expedite the process
> memory clean up. 
> 
> The same situation happens with the global OOM and your approach doesn't
> really close the race anyway. You send SIGKILL first and the victim can
> hit the exit path right after that before you start processing the rest.
> That is not fundamentally different from doing that in two syscalls,
> race window is just smaller.

No, this approach completely closes the race.

When it invokes do_send_sig_info(SIGKILL) with the KILL_MRELEASE code,
the kernel sets the MMF_UNSTABLE flag on the victim's mm_struct in the signal
delivery path (kernel/signal.c) *before* the task begins processing the signal.

When the victim gets scheduled and wakes up to process the fatal signal,
the MMF_UNSTABLE flag is already set.

This guarantees that the victim's own exit path (do_exit -> exit_mmap) will
utilize the expedited reclamation optimizations automatically, regardless of
whether the reaper or the victim gets scheduled first.
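
The ordering argument can be illustrated with a small userspace toy model.
This is a sketch of the claim only, not the patch: the toy_* names are
hypothetical, the flag merely stands in for MMF_UNSTABLE, and real signal
delivery is far more involved.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-ins for mm_struct and task_struct; all names are hypothetical. */
struct toy_mm {
	atomic_bool unstable;		/* models MMF_UNSTABLE */
};

struct toy_task {
	struct toy_mm *mm;
	atomic_bool sigkill_pending;	/* models a queued fatal signal */
};

/* Plain SIGKILL: the victim can observe the signal with the flag unset. */
static void toy_send_kill_plain(struct toy_task *t)
{
	atomic_store(&t->sigkill_pending, true);
}

/* Modelled KILL_MRELEASE: mark the mm before the signal becomes visible. */
static void toy_send_kill_mrelease(struct toy_task *t)
{
	atomic_store(&t->mm->unstable, true);
	atomic_store(&t->sigkill_pending, true);
}

/* Victim side: would its exit path take the expedited route? */
static bool toy_exit_sees_flag(struct toy_task *t)
{
	if (!atomic_load(&t->sigkill_pending))
		return false;		/* no fatal signal yet, not exiting */
	return atomic_load(&t->mm->unstable);
}
```

In this model, a victim that runs immediately after toy_send_kill_plain()
exits without the flag, while after toy_send_kill_mrelease() it always sees
the flag, whichever side is scheduled first.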

For the OOM, we can use the same idea.

> 
> All that being said, I do not think those special hacks for
> process_mrelease are the right approach. I very much agree that the
> address space tear down for a dying process could be improved and we
> should be focusing on that part.

I think process_mrelease is crucial here because relying on the exit path is
non-deterministic.

The mm_struct can have its reference count increased by other tasks
or kernel subsystems. When a process exits, the actual address space teardown
(exit_mmap) is deferred until the mm's reference count reaches zero.
If any component holds a reference, the tearing down is delayed indefinitely.
We have observed the problem in the field quite a lot.

This is precisely why process_mrelease was introduced in the first place.
It allows an external reaper to bypass the reference count delays and
synchronously reap the memory of a dying process on demand.

Improving the regular exit path alone cannot substitute for this,
as it would still be stalled by outstanding reference counts.
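
The reference-count behaviour described here can be modelled in a few lines.
This is a simplified userspace sketch with hypothetical toy_* names: `users`
stands in for mm_users, and setting `torn_down` marks the point where the
kernel would run exit_mmap().

```c
#include <stdbool.h>

/* Toy model of an mm with a user reference count; names are hypothetical. */
struct toy_mm {
	int users;		/* stands in for mm_users */
	bool torn_down;		/* set once the address space is released */
};

/* Another task or subsystem pins the address space. */
static void toy_mmget(struct toy_mm *mm)
{
	mm->users++;
}

/* Drop a reference; teardown happens only on the final put. */
static void toy_mmput(struct toy_mm *mm)
{
	if (--mm->users == 0)
		mm->torn_down = true;	/* exit_mmap() would run here */
}
```

If a second holder calls toy_mmget() before the owner exits, the owner's own
toy_mmput() leaves torn_down false, and the memory stays pinned until the
other holder lets go. That is the delay an external reaper sidesteps.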

I don't know what alternative ideas you are thinking about, but please share if
you have a better approach.
Re: [RFC 0/3] mm: process_mrelease: expedited reclaim and auto-kill support
Posted by Michal Hocko 2 months ago
On Wed 15-04-26 16:26:34, Minchan Kim wrote:
> On Wed, Apr 15, 2026 at 09:38:05AM +0200, Michal Hocko wrote:
> > On Tue 14-04-26 13:00:16, Minchan Kim wrote:
> > > On Tue, Apr 14, 2026 at 08:57:57AM +0200, Michal Hocko wrote:
> > > > On Mon 13-04-26 15:39:45, Minchan Kim wrote:
> > > > > This patch series introduces optimizations to expedite memory reclamation
> > > > > in process_mrelease() and provides a secure, race-free "auto-kill"
> > > > > mechanism for efficient container shutdown and OOM handling.
> > > > > 
> > > > > Currently, process_mrelease() unmaps pages but leaves clean file folios
> > > > > on the LRU list, relying on standard memory reclaim to eventually free
> > > > > them. Furthermore, requiring userspace to send a SIGKILL prior to
> > > > > invoking process_mrelease() introduces scheduling race conditions where
> > > > > the victim task may enter the exit path prematurely, bypassing expedited
> > > > > reclamation hooks.
> > > > > 
> > > > > This series addresses these limitations in three logical steps.
> > > > > 
> > > > > Patch #1: mm: process_mrelease: expedite clean file folio reclaim via mmu_gather
> > > > > Integrates clean file folio eviction directly into the low-level TLB
> > > > > batching (mmu_gather) infrastructure. Symmetrically truncates clean file
> > > > > folios alongside anonymous pages during the unmap loop.
> > > > 
> > > > Why do we need to care about clean page cache? Is this a form of
> > > > drop_caches?
> > > 
> > > The goal is to ensure the memory is actually freed by the time
> > > process_mrelease returns. Currently, process_mrelease unmaps pages, but
> > > page caches remain on the LRU, leaving them to be reclaimed later
> > > by kswapd or direct reclaim.
> > 
> > Correct. This was the initial design decision because there is not much
> > you can assume about page cache pages which are very often shared. Even
> > if they are not mapped by all users.
> 
> Fair point. However, that's the trade-off:
> 
> Leaving unmapped caches to be reclaimed asynchronously keeps system memory
> pressure high for too long. In Android, this delay forces the LMKD to
> unnecessarily kill additional innocent background apps before the memory
> from the original victim is recovered.

OK, this is really not clear to me. How come you end up triggering LMKD
(or any OOM handling) when there is a considerable amount of clean page
cache?

[...]

> > > The race occurs when the victim process starts its own exit path (after
> > > SIGKILL) before the caller can invoke process_mrelease. If the victim
> > > reaches the exit path first, the caller might lose the window to apply
> > > these expedited reclamation optimizations.
> > 
> > Isn't this the problem you are trying to solve then? You are special
> > casing process_mrelease while you really want to expedite the process
> > memory clean up. 
> > 
> > The same situation happens with the global OOM and your approach doesn't
> > really close the race anyway. You send SIGKILL first and the victim can
> > hit the exit path right after that before you start processing the rest.
> > That is not fundamentally different from doing that in two syscalls,
> > race window is just smaller.
> 
> No, this approach completely closes the race.
> 
> When it invokes do_send_sig_info(SIGKILL) with the KILL_MRELEASE code,
> the kernel sets the MMF_UNSTABLE flag on the victim's mm_struct in the signal
> delivery path (kernel/signal.c) *before* the task begins processing the signal.

OK, I have missed this part. I haven't really looked into specific
patches at this stage. I am still trying to understand the motivation
and your reasoning. So effectively you want to get SIGOOMKILL more or
less.

> When the victim gets scheduled and wakes up to process the fatal signal,
> the MMF_UNSTABLE flag is already set.
> 
> This guarantees that the victim's own exit path (do_exit -> exit_mmap) will
> utilize the expedited reclamation optimizations automatically, regardless of
> whether the reaper or the victim gets scheduled first.
> 
> For the OOM, we can use the same idea.
> 
> > 
> > All that being said, I do not think those special hacks for
> > process_mrelease are the right approach. I very much agree that the
> > address space tear down for a dying process could be improved and we
> > should be focusing on that part.
> 
> I think process_mrelease is crucial here because relying on the exit path is
> non-deterministic.

I suspect you are missing my point. I am arguing that those special
hacks in the address space release path shouldn't be process_mrelease
specific. I do recognize the value of the sync tear down need. I am also
in favor of something like SIGOOMKILL. process_mrelease might even be
the right syscall for that purpose.
-- 
Michal Hocko
SUSE Labs
Re: [RFC 0/3] mm: process_mrelease: expedited reclaim and auto-kill support
Posted by Minchan Kim 2 months ago
On Thu, Apr 16, 2026 at 08:54:53AM +0200, Michal Hocko wrote:
> On Wed 15-04-26 16:26:34, Minchan Kim wrote:
> > On Wed, Apr 15, 2026 at 09:38:05AM +0200, Michal Hocko wrote:
> > > On Tue 14-04-26 13:00:16, Minchan Kim wrote:
> > > > On Tue, Apr 14, 2026 at 08:57:57AM +0200, Michal Hocko wrote:
> > > > > On Mon 13-04-26 15:39:45, Minchan Kim wrote:
> > > > > > This patch series introduces optimizations to expedite memory reclamation
> > > > > > in process_mrelease() and provides a secure, race-free "auto-kill"
> > > > > > mechanism for efficient container shutdown and OOM handling.
> > > > > > 
> > > > > > Currently, process_mrelease() unmaps pages but leaves clean file folios
> > > > > > on the LRU list, relying on standard memory reclaim to eventually free
> > > > > > them. Furthermore, requiring userspace to send a SIGKILL prior to
> > > > > > invoking process_mrelease() introduces scheduling race conditions where
> > > > > > the victim task may enter the exit path prematurely, bypassing expedited
> > > > > > reclamation hooks.
> > > > > > 
> > > > > > This series addresses these limitations in three logical steps.
> > > > > > 
> > > > > > Patch #1: mm: process_mrelease: expedite clean file folio reclaim via mmu_gather
> > > > > > Integrates clean file folio eviction directly into the low-level TLB
> > > > > > batching (mmu_gather) infrastructure. Symmetrically truncates clean file
> > > > > > folios alongside anonymous pages during the unmap loop.
> > > > > 
> > > > > Why do we need to care about clean page cache? Is this a form of
> > > > > drop_caches?
> > > > 
> > > > The goal is to ensure the memory is actually freed by the time
> > > > process_mrelease returns. Currently, process_mrelease unmaps pages, but
> > > > page caches remain on the LRU, leaving them to be reclaimed later
> > > > by kswapd or direct reclaim.
> > > 
> > > Correct. This was the initial design decision because there is not much
> > > you can assume about page cache pages which are very often shared. Even
> > > if they are not mapped by all users.
> > 
> > Fair point. However, that's the trade-off:
> > 
> > Leaving unmapped caches to be reclaimed asynchronously keeps system memory
> > pressure high for too long. In Android, this delay forces the LMKD to
> > unnecessarily kill additional innocent background apps before the memory
> > from the original victim is recovered.
> 
> OK, this is really not clear to me. How come you end up triggering LMKD
> (or any OOM handling) when there is a considerable amount of clean page
> cache?

It's not simple to explain all the heuristics, but basically, LMKD is triggered
by PSI pressure (usually contributed by kswapd rather than other components
like refault, kcompactd, or workingset operations).

It then checks the current free memory against system watermarks. Depending
on the free memory size, file cache, and free swap, it decides to start
killing background apps.
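
As a rough illustration of the PSI input, the sketch below parses one line in
the format exposed by /proc/pressure/memory and compares it with a threshold.
The struct, the helper names, and the threshold check are assumptions made for
this example; they are not LMKD's actual heuristics, which also weigh
watermarks, file cache, and free swap as described above.

```c
#include <stdio.h>

/* One "some" or "full" line from /proc/pressure/memory. */
struct psi_line {
	char kind[8];			/* "some" or "full" */
	double avg10, avg60, avg300;	/* stall percentages */
	unsigned long long total_us;	/* cumulative stall time, microseconds */
};

/* Parse a PSI line. Returns 0 on success, -1 on malformed input. */
static int psi_parse_line(const char *line, struct psi_line *out)
{
	int n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		       out->kind, &out->avg10, &out->avg60, &out->avg300,
		       &out->total_us);

	return n == 5 ? 0 : -1;
}

/* Hypothetical trigger: the threshold is illustrative, not LMKD's. */
static int psi_over_threshold(const struct psi_line *p, double avg10_pct)
{
	return p->avg10 >= avg10_pct;
}
```

A monitor built this way would read the file periodically, call
psi_parse_line() on the "some" line, and only then consult free memory and
swap before deciding whether to kill anything.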

In other words, LMKD acts as a "userspace kswapd" to assist kernel kswapd's
reclamation speed. It is smarter than kswapd because it has high-level knowledge
of which processes are okay to be killed rather than forcing slow, unnecessary
paging out.

Whenever LMKD is running, kswapd is usually running alongside it. You might
wonder why LMKD kills background apps even when there are plenty of clean file
pages. That's because the system cannot predict current memory allocation rates.
If the allocation is bursty, kswapd can never catch up with the allocation speed.
This forces the foreground apps into direct reclaim, resulting in visible
UI jank. Android prioritizes UI smoothness and chooses to kill background apps.

Furthermore, when LMKD kills a background app, it expects immediate memory relief.
If the clean file pages of the killed process are left on the LRU to be reclaimed
asynchronously later, the system's memory pressure (PSI) remains high.
This forces LMKD to unnecessarily kill *additional* background apps before
the memory from the first victim is fully recovered.

Again, this is why I want process_mrelease to expedite clean file reclamation
synchronously.

> 
> [...]
> 
> > > > The race occurs when the victim process starts its own exit path (after
> > > > SIGKILL) before the caller can invoke process_mrelease. If the victim
> > > > reaches the exit path first, the caller might lose the window to apply
> > > > these expedited reclamation optimizations.
> > > 
> > > Isn't this the problem you are trying to solve then? You are special
> > > casing process_mrelease while you really want to expedite the process
> > > memory clean up. 
> > > 
> > > The same situation happens with the global OOM and your approach doesn't
> > > really close the race anyway. You send SIGKILL first and the victim can
> > > hit the exit path right after that before you start processing the rest.
> > > That is not fundamentally different from doing that in two syscalls,
> > > race window is just smaller.
> > 
> > No, this approach completely closes the race.
> > 
> > When it invokes do_send_sig_info(SIGKILL) with the KILL_MRELEASE code,
> > the kernel sets the MMF_UNSTABLE flag on the victim's mm_struct in the signal
> > delivery path (kernel/signal.c) *before* the task begins processing the signal.
> 
> OK, I have missed this part. I haven't really looked into specific
> patches at this stage. I am still trying to understand the motivation
> and your reasoning. So effectively you want to get SIGOOMKILL more or
> less.

The fundamental problem with signals is their unpredictability. The time
between sending a signal and when the victim task actually handles it is
highly non-deterministic. Furthermore, as mentioned earlier, outstanding
reference counts on the mm_struct can delay the teardown indefinitely.

> 
> > When the victim gets scheduled and wakes up to process the fatal signal,
> > the MMF_UNSTABLE flag is already set.
> > 
> > This guarantees that the victim's own exit path (do_exit -> exit_mmap) will
> > utilize the expedited reclamation optimizations automatically, regardless of
> > whether the reaper or the victim gets scheduled first.
> > 
> > For the OOM, we can use the same idea.
> > 
> > > 
> > > All that being said, I do not think those special hacks for
> > > process_mrelease are the right approach. I very much agree that the
> > > address space tear down for a dying process could be improved and we
> > > should be focusing on that part.
> > 
> > I think process_mrelease is crucial here because relying on the exit path is
> > non-deterministic.
> 
> I suspect you are missing my point. I am arguing that those special
> hacks in the address space release path shouldn't be process_mrelease

I am a bit confused now. Do you mean you want to apply these expedited
reclamation optimizations to ALL dying processes in the common exit path,
rather than making them specific to process_mrelease?

I don't think so, since you mentioned below "process_mrelease might ..
right syscall", but I wanted to clarify what you mean.

> specific. I do recognize the value of the sync tear down need. I am also
> in favor of something like SIGOOMKILL. process_mrelease might even be
> the right syscall for that purpose.

I am glad we are aligned on the necessity of this feature.
Re: [RFC 0/3] mm: process_mrelease: expedited reclaim and auto-kill support
Posted by Michal Hocko 2 months ago
On Thu 16-04-26 23:20:30, Minchan Kim wrote:
> On Thu, Apr 16, 2026 at 08:54:53AM +0200, Michal Hocko wrote:
> > On Wed 15-04-26 16:26:34, Minchan Kim wrote:
> > > On Wed, Apr 15, 2026 at 09:38:05AM +0200, Michal Hocko wrote:
> > > > On Tue 14-04-26 13:00:16, Minchan Kim wrote:
> > > > > On Tue, Apr 14, 2026 at 08:57:57AM +0200, Michal Hocko wrote:
> > > > > > On Mon 13-04-26 15:39:45, Minchan Kim wrote:
> > > > > > > This patch series introduces optimizations to expedite memory reclamation
> > > > > > > in process_mrelease() and provides a secure, race-free "auto-kill"
> > > > > > > mechanism for efficient container shutdown and OOM handling.
> > > > > > > 
> > > > > > > Currently, process_mrelease() unmaps pages but leaves clean file folios
> > > > > > > on the LRU list, relying on standard memory reclaim to eventually free
> > > > > > > them. Furthermore, requiring userspace to send a SIGKILL prior to
> > > > > > > invoking process_mrelease() introduces scheduling race conditions where
> > > > > > > the victim task may enter the exit path prematurely, bypassing expedited
> > > > > > > reclamation hooks.
> > > > > > > 
> > > > > > > This series addresses these limitations in three logical steps.
> > > > > > > 
> > > > > > > Patch #1: mm: process_mrelease: expedite clean file folio reclaim via mmu_gather
> > > > > > > Integrates clean file folio eviction directly into the low-level TLB
> > > > > > > batching (mmu_gather) infrastructure. Symmetrically truncates clean file
> > > > > > > folios alongside anonymous pages during the unmap loop.
> > > > > > 
> > > > > > Why do we need to care about clean page cache? Is this a form of
> > > > > > drop_caches?
> > > > > 
> > > > > The goal is to ensure the memory is actually freed by the time
> > > > > process_mrelease returns. Currently, process_mrelease unmaps pages, but
> > > > > page caches remain on the LRU, leaving them to be reclaimed later
> > > > > by kswapd or direct reclaim.
> > > > 
> > > > Correct. This was the initial design decision because there is not much
> > > > you can assume about page cache pages which are very often shared. Even
> > > > if they are not mapped by all users.
> > > 
> > > Fair point. However, that's the trade-off:
> > > 
> > > Leaving unmapped caches to be reclaimed asynchronously keeps system memory
> > > pressure high for too long. In Android, this delay forces the LMKD to
> > > unnecessarily kill additional innocent background apps before the memory
> > > from the original victim is recovered.
> > 
> > OK, this is really not clear to me. How come you end up triggering LMKD
> > (or any OOM handling) when there is a considerable amount of clean page
> > cache?
> 
> It's not simple to explain all the heuristics, but basically, LMKD is triggered
> by PSI pressure (usually contributed by kswapd rather than other components
> like refault, kcompactd, or workingset operations).
> 
> It then checks the current free memory against system watermarks. Depending
> on the free memory size, file cache, and free swap, it decides to start
> killing background apps.
> 
> In other words, LMKD acts as a "userspace kswapd" to assist kernel kswapd's
> reclamation speed. It is smarter than kswapd because it has high-level knowledge
> of which processes are okay to be killed rather than forcing slow, unnecessary
> paging out.
> 
> Whenever LMKD is running, kswapd is usually running alongside it. You might
> wonder why LMKD kills background apps even when there are plenty of clean file
> pages. That's because the system cannot predict current memory allocation rates.
> If the allocation is bursty, kswapd can never catch up with the allocation speed.
> This forces the foreground apps into direct reclaim, resulting in visible
> UI jank. Android prioritizes UI smoothness and chooses to kill background apps.
>
> Furthermore, when LMKD kills a background app, it expects immediate memory relief.
> If the clean file pages of the killed process are left on the LRU to be reclaimed
> asynchronously later, the system's memory pressure (PSI) remains high.
> This forces LMKD to unnecessarily kill *additional* background apps before
> the memory from the first victim is fully recovered.
> 
> Again, this is why I want process_mrelease to expedite clean file reclamation
> synchronously.

How much of a clean page cache do you usually drop this way?
 
[...]
> > I suspect you are missing my point. I am arguing that those special
> > hacks in the address space release path shouldn't be process_mrelease
> 
> I am a bit confused now. Do you mean you want to apply these expedited
> reclamation optimizations to ALL dying processes in the common exit path,
> rather than making them specific to process_mrelease?

Yes. All which make sense, really. I am still not convinced about the
clean page cache because that just seems like a hack to workaround wrong
userspace oom heuristics.
-- 
Michal Hocko
SUSE Labs
Re: [RFC 0/3] mm: process_mrelease: expedited reclaim and auto-kill support
Posted by Minchan Kim 2 months ago
On Fri, Apr 17, 2026 at 09:11:21AM +0200, Michal Hocko wrote:
> On Thu 16-04-26 23:20:30, Minchan Kim wrote:
> > On Thu, Apr 16, 2026 at 08:54:53AM +0200, Michal Hocko wrote:
> > > On Wed 15-04-26 16:26:34, Minchan Kim wrote:
> > > > On Wed, Apr 15, 2026 at 09:38:05AM +0200, Michal Hocko wrote:
> > > > > On Tue 14-04-26 13:00:16, Minchan Kim wrote:
> > > > > > On Tue, Apr 14, 2026 at 08:57:57AM +0200, Michal Hocko wrote:
> > > > > > > On Mon 13-04-26 15:39:45, Minchan Kim wrote:
> > > > > > > > This patch series introduces optimizations to expedite memory reclamation
> > > > > > > > in process_mrelease() and provides a secure, race-free "auto-kill"
> > > > > > > > mechanism for efficient container shutdown and OOM handling.
> > > > > > > > 
> > > > > > > > Currently, process_mrelease() unmaps pages but leaves clean file folios
> > > > > > > > on the LRU list, relying on standard memory reclaim to eventually free
> > > > > > > > them. Furthermore, requiring userspace to send a SIGKILL prior to
> > > > > > > > invoking process_mrelease() introduces scheduling race conditions where
> > > > > > > > the victim task may enter the exit path prematurely, bypassing expedited
> > > > > > > > reclamation hooks.
> > > > > > > > 
> > > > > > > > This series addresses these limitations in three logical steps.
> > > > > > > > 
> > > > > > > > Patch #1: mm: process_mrelease: expedite clean file folio reclaim via mmu_gather
> > > > > > > > Integrates clean file folio eviction directly into the low-level TLB
> > > > > > > > batching (mmu_gather) infrastructure. Symmetrically truncates clean file
> > > > > > > > folios alongside anonymous pages during the unmap loop.
> > > > > > > 
> > > > > > > Why do we need to care about clean page cache? Is this a form of
> > > > > > > drop_caches?
> > > > > > 
> > > > > > The goal is to ensure the memory is actually freed by the time
> > > > > > process_mrelease returns. Currently, process_mrelease unmaps pages, but
> > > > > > page caches remain on the LRU, leaving them to be reclaimed later
> > > > > > by kswapd or direct reclaim.
> > > > > 
> > > > > Correct. This was the initial design decision because there is not much
> > > > > you can assume about page cache pages which are very often shared. Even
> > > > > if they are not mapped by all users.
> > > > 
> > > > Fair point. However, that's the trade-off:
> > > > 
> > > > Leaving unmapped caches to be reclaimed asynchronously keeps system memory
> > > > pressure high for too long. In Android, this delay forces the LMKD to
> > > > unnecessarily kill additional innocent background apps before the memory
> > > > from the original victim is recovered.
> > > 
> > > OK, this is really not clear to me. How come you end up triggering LMKD
> > > (or any OOM handling) when there is a considerable amount of clean page
> > > cache?
> > 
> > It's not simple to explain all the heuristics, but basically, LMKD is triggered
> > by PSI pressure (usually contributed by kswapd rather than other components
> > like refault, kcompactd, or workingset operations).
> > 
> > It then checks the current free memory against system watermarks. Depending
> > on the free memory size, file cache, and free swap, it decides to start
> > killing background apps.
> > 
> > In other words, LMKD acts as a "userspace kswapd" to assist kernel kswapd's
> > reclamation speed. It is smarter than kswapd because it has high-level knowledge
> > of which processes are okay to be killed rather than forcing slow, unnecessary
> > paging out.
> > 
> > Whenever LMKD is running, kswapd is usually running alongside it. You might
> > wonder why LMKD kills background apps even when there are plenty of clean file
> > pages. That's because the system cannot predict current memory allocation rates.
> > If the allocation is bursty, kswapd can never catch up with the allocation speed.
> > This forces the foreground apps into direct reclaim, resulting in visible
> > UI jank. Android prioritizes UI smoothness and chooses to kill background apps.
> >
> > Furthermore, when LMKD kills a background app, it expects immediate memory relief.
> > If the clean file pages of the killed process are left on the LRU to be reclaimed
> > asynchronously later, the system's memory pressure (PSI) remains high.
> > This forces LMKD to unnecessarily kill *additional* background apps before
> > the memory from the first victim is fully recovered.
> > 
> > Again, this is why I want process_mrelease to expedite clean file reclamation
> > synchronously.
> 
> How much of a clean page cache do you usually drop this way?

Based on some measurements on a typical device, the numbers are actually
quite significant. The total amount of exclusive clean page cache across
all killable apps was around 800 to 900 MB.

While even typical background apps often hold tens to hundreds of
megabytes, heavier applications (e.g., modern on-device AI workloads) can
easily hold several gigabytes of clean file cache all by themselves.

So by using this expedited reclaim, we can grab anywhere from hundreds of
megabytes to several gigabytes instantly when a process is killed.
That's definitely enough to relieve the pressure right away and stop unnecessary
redundant/perceptible kills.

>  
> [...]
> > > I suspect you are missing my point. I am arguing that those special
> > > hacks in the address space release path shouldn't be process_mrelease
> > 
> > I am a bit confused now. Do you mean you want to apply these expedited
> > reclamation optimizations to ALL dying processes in the common exit path,
> > rather than making them specific to process_mrelease?
> 
> Yes. All which make sense, really. I am still not convinced about the
> clean page cache because that just seems like a hack to workaround wrong
> userspace oom heuristics.

I see it a bit differently. When the platform decides to kill a process
to free up memory, it wants that memory back right away.

So it doesn't make much sense for the kernel to ignore that and leave the clean
file pages to be picked up slowly by kswapd later.

In some aspects, you can think of LMKD as a more specialized, userspace version
of kswapd. It has high-level knowledge of process priorities and knows exactly
which process is safe to kill to get memory instantly. The kernel's kswapd,
however, operates globally without this specific process-level awareness, which
makes it less suited for this kind of targeted reclamation.

If we force LMKD to rely on the slower global kswapd to actually free the clean
pages, it defeats the whole purpose of targeting a specific process.

So letting process_mrelease speed this up isn't a hack at all. It's just helping
the kernel do what the admin wanted in the first place: fast, targeted memory.
Re: [RFC 0/3] mm: process_mrelease: expedited reclaim and auto-kill support
Posted by Michal Hocko 1 month, 4 weeks ago
On Mon 20-04-26 14:53:23, Minchan Kim wrote:
> On Fri, Apr 17, 2026 at 09:11:21AM +0200, Michal Hocko wrote:
[...]
> > Yes. All which make sense, really. I am still not convinced about the
> > clean page cache because that just seems like a hack to workaround wrong
> > userspace oom heuristics.
> 
> I see it a bit differently. When the platform decides to kill a process
> to free up memory, it wants that memory back right away.
> 
> So it doesn't make much sense for the kernel to ignore that and leave the clean
> file pages to be picked up slowly by kswapd later.
> 
> In some aspects, you can think of LMKD as a more specialized, userspace version
> of kswapd. It has high-level knowledge of process priorities and knows exactly
> which process is safe to kill to get memory instantly. The kernel's kswapd,
> however, operates globally without this specific process-level awareness, which
> makes it less suited for this kind of targeted reclamation.
> 
> If we force LMKD to rely on the slower global kswapd to actually free the clean
> pages, it defeats the whole purpose of targeting a specific process.
> 
> So letting process_mrelease speed this up isn't a hack at all. It's just helping
> the kernel do what the admin wanted in the first place: fast, targeted memory.

This is a very creative/disruptive way to do a memory reclaim. From a
user POV I would much rather see clean page cache reclaimed before my
apps start to disappear. But this is obviously your call and your users
that will care.

Anyway, I still maintain my position. I do not think it is a good
idea to drop clean page cache as you do not know whether there are other
users. I am NOT NAKing this patch though but please make sure you have a
wider support for this idea before this gets merged. Also make sure that
all the above reasoning is part of the changelog if you want to get this
merged.

-- 
Michal Hocko
SUSE Labs
Re: [RFC 0/3] mm: process_mrelease: expedited reclaim and auto-kill support
Posted by Minchan Kim 1 month, 3 weeks ago
On Thu, Apr 23, 2026 at 09:50:47AM +0200, Michal Hocko wrote:
> On Mon 20-04-26 14:53:23, Minchan Kim wrote:
> > On Fri, Apr 17, 2026 at 09:11:21AM +0200, Michal Hocko wrote:
> [...]
> > > Yes. All which make sense, really. I am still not convinced about the
> > > clean page cache because that just seems like a hack to workaround wrong
> > > userspace oom heuristics.
> > 
> > I see it a bit differently. When the platform decides to kill a process
> > to free up memory, it wants that memory back right away.
> > 
> > So it doesn't make much sense for the kernel to ignore that and leave the clean
> > file pages to be picked up slowly by kswapd later.
> > 
> > In some aspects, you can think of LMKD as a more specialized, userspace version
> > of kswapd. It has high-level knowledge of process priorities and knows exactly
> > which process is safe to kill to get memory instantly. The kernel's kswapd,
> > however, operates globally without this specific process-level awareness, which
> > makes it less suited for this kind of targeted reclamation.
> > 
> > If we force LMKD to rely on the slower global kswapd to actually free the clean
> > pages, it defeats the whole purpose of targeting a specific process.
> > 
> > So letting process_mrelease speed this up isn't a hack at all. It's just helping
> > the kernel do what the admin wanted in the first place: fast, targeted memory.
> 
> This is a very creative/disruptive way to do a memory reclaim. From a
> user POV I would much rather see clean page cache reclaimed before my
> apps start to disappear. But this is obviously your call and your users
> that will care.

The problem we see in practice is that kswapd is too slow to react
to sudden memory spikes from foreground apps. By the time LMKD decides
to kill a background process to make room, the foreground app is already
stuck in direct reclaim stalls, which show up as sluggishness and jank.
Avoiding that is a higher priority than keeping frozen background apps
that the user has not used recently.
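
As mentioned earlier in the thread, LMKD is triggered by PSI pressure. For
context, here is a minimal sketch of how a userspace daemon can subscribe to
memory pressure events through the kernel's PSI trigger interface. This is not
LMKD's actual code: the helper names are hypothetical, and any thresholds
passed in (for example 150 ms of stall within a 1 s window) are arbitrary
illustrative values.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Register a PSI trigger on `path` (normally /proc/pressure/memory).
 * `kind` is "some" or "full"; the trigger fires when tasks were stalled
 * for at least `stall_us` within a `window_us` window.
 * Returns an fd to poll, or a negative errno value.
 */
static int psi_register_trigger(const char *path, const char *kind,
				unsigned long stall_us, unsigned long window_us)
{
	char buf[64];
	int len, fd;

	len = snprintf(buf, sizeof(buf), "%s %lu %lu", kind, stall_us, window_us);
	if (len < 0 || (size_t)len >= sizeof(buf))
		return -EINVAL;

	fd = open(path, O_RDWR | O_NONBLOCK);
	if (fd < 0)
		return -errno;

	/* The PSI documentation example writes the trailing NUL as well. */
	if (write(fd, buf, (size_t)len + 1) < 0) {
		int err = errno;

		close(fd);
		return -err;
	}
	return fd;
}

/*
 * Wait for the trigger to fire. Returns 1 on a pressure event,
 * 0 on timeout, or a negative errno value.
 */
static int psi_wait_event(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLPRI };
	int n = poll(&pfd, 1, timeout_ms);

	if (n < 0)
		return -errno;
	if (n == 0)
		return 0;
	if (pfd.revents & POLLERR)
		return -ENODEV;	/* event source is gone */
	return (pfd.revents & POLLPRI) ? 1 : 0;
}
```

A daemon would call psi_register_trigger("/proc/pressure/memory", "some",
150000, 1000000) once and then loop on psi_wait_event(). Whatever it does after
an event (checking free memory, picking a victim) is policy and is left out
here.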

> 
> Anyway, I still maintain my position. I do not think it is a good
> idea to drop clean page cache as you do not know whether there are other
> users. I am NOT NAKing this patch though but please make sure you have a
> wider support for this idea before this gets merged. Also make sure that
> all the above reasoning is part of the changelog if you want to get this
> merged.

Sure, I will make sure to include all this reasoning in the changelog of
the next version.

Thanks for the review.
Re: [RFC 0/3] mm: process_mrelease: expedited reclaim and auto-kill support
Posted by David Hildenbrand (Arm) 1 month, 4 weeks ago
On 4/23/26 09:50, Michal Hocko wrote:
> On Mon 20-04-26 14:53:23, Minchan Kim wrote:
>> On Fri, Apr 17, 2026 at 09:11:21AM +0200, Michal Hocko wrote:
> [...]
>>> Yes. All which make sense, really. I am still not convinced about the
>>> clean page cache because that just seems like a hack to workaround wrong
>>> userspace oom heuristics.
>>
>> I see it a bit differently. When the platform decides to kill a process
>> to free up memory, it wants that memory back right away.
>>
>> So it doesn't make much sense for the kernel to ignore that and leave the clean
>> file pages to be picked up slowly by kswapd later.
>>
>> In some aspects, you can think of LMKD as a more specialized, userspace version
>> of kswapd. It has high-level knowledge of process priorities and knows exactly
>> which process is safe to kill to get memory instantly. The kernel's kswapd,
>> however, operates globally without this specific process-level awareness, which
>> makes it less suited for this kind of targeted reclamation.
>>
>> If we force LMKD to rely on the slower global kswapd to actually free the clean
>> pages, it defeats the whole purpose of targeting a specific process.
>>
>> So letting process_mrelease speed this up isn't a hack at all. It's just helping
>> the kernel do what the admin wanted in the first place: fast, targeted memory.
> 
> This is a very creative/disruptive way to do a memory reclaim. From a
> user POV I would much rather see clean page cache reclaimed before my
> apps start to disappear. But this is obviously your call and your users
> that will care.
> 
> Anyway, I still maintain my position. I do not think it is a good
> idea to drop clean page cache as you do not know whether there are other
> users.

IIRC, Johannes raised in the past that we cannot predict the future.

For example, if an app gets OOM-killed, wouldn't we usually try restarting it,
re-consuming the clean pagecache pages we would be evicting here?

Just a thought.

-- 
Cheers,

David
Re: [RFC 0/3] mm: process_mrelease: expedited reclaim and auto-kill support
Posted by Suren Baghdasaryan 1 month, 3 weeks ago
On Thu, Apr 23, 2026 at 2:50 AM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> On 4/23/26 09:50, Michal Hocko wrote:
> > On Mon 20-04-26 14:53:23, Minchan Kim wrote:
> >> On Fri, Apr 17, 2026 at 09:11:21AM +0200, Michal Hocko wrote:
> > [...]
> >>> Yes. All which make sense, really. I am still not convinced about the
> >>> clean page cache because that just seems like a hack to workaround wrong
> >>> userspace oom heuristics.
> >>
> >> I see it a bit differently. When the platform decides to kill a process
> >> to free up memory, it wants that memory back right away.
> >>
> >> So it doesn't make much sense for the kernel to ignore that and leave the clean
> >> file pages to be picked up slowly by kswapd later.
> >>
> >> In some aspects, you can think of LMKD as a more specialized, userspace version
> >> of kswapd. It has high-level knowledge of process priorities and knows exactly
> >> which process is safe to kill to get memory instantly. The kernel's kswapd,
> >> however, operates globally without this specific process-level awareness, which
> >> makes it less suited for this kind of targeted reclamation.
> >>
> >> If we force LMKD to rely on the slower global kswapd to actually free the clean
> >> pages, it defeats the whole purpose of targeting a specific process.
> >>
> >> So letting process_mrelease speed this up isn't a hack at all. It's just helping
> >> the kernel do what the admin wanted in the first place: fast, targeted memory.
> >
> > This is a very creative/disruptive way to do a memory reclaim. From a
> > user POV I would much rather see clean page cache reclaimed before my
> > apps start to disappear. But this is obviously your call and your users
> > that will care.
> >
> > Anyway, I still maintain my position. I do not think it is a good
> > idea to drop clean page cache as you do not know whether there are other
> > users.

I'm very much familiar with these issues in Android and really want to
find a good solution for them. IIUC, this RFC tries to address 2
things at once:
1. handling clean private page cache when reaping memory of a kill victim;
2. addressing a race between kill() and process_mrelease():
process_mrelease() can't happen before the kill(), but if it happens too
late, after the victim has passed its exit_mm(), then process_mrelease()
fails to find the mm to reap. This defeats the purpose of the
process_mrelease() call, because the actual memory (released by
exit_mmap()) might not yet be free and a successful process_mrelease()
would be very beneficial.

I see these two as separate issues and I'm not sure combining them
into a single discussion is a good idea.
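
For readers less familiar with the second issue, the current userspace pattern
looks roughly like the sketch below. It is only an illustration, not LMKD's
actual code: kill_and_reap() is a hypothetical helper, and the fallback syscall
numbers assume the generic syscall numbering shared by x86-64 and arm64. The
comment marks the window in which the victim can pass exit_mm() on its own,
after which the reap call no longer finds an mm.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallbacks for toolchains whose headers predate these syscalls. */
#ifndef __NR_pidfd_send_signal
#define __NR_pidfd_send_signal 424
#endif
#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434
#endif
#ifndef __NR_process_mrelease
#define __NR_process_mrelease 448
#endif

/*
 * Kill the process `pid`, then ask the kernel to reap its memory.
 * Returns 0 on success or a negative errno value.
 */
static int kill_and_reap(pid_t pid)
{
	int ret = 0;
	int pidfd = (int)syscall(__NR_pidfd_open, pid, 0);

	if (pidfd < 0)
		return -errno;

	/* Step 1: the kill has to come first; a live task cannot be reaped. */
	if (syscall(__NR_pidfd_send_signal, pidfd, SIGKILL, NULL, 0) < 0) {
		ret = -errno;
		goto out;
	}

	/*
	 * Race window: if the victim gets scheduled here and passes
	 * exit_mm(), the call below finds no mm to reap and fails, even
	 * though exit_mmap() may still be busy freeing the memory.
	 */
	if (syscall(__NR_process_mrelease, pidfd, 0) < 0)
		ret = -errno;
out:
	close(pidfd);
	return ret;
}
```

A caller can only treat a failure of the last step as "the victim got there
first". It has no way to wait for the memory to actually be free, which is the
gap described in item 2 above.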

>
> IIRC, Johannes raised in the past that we cannot predict the future.
>
> For example, if an app gets OOM-killed, wouldn't we usually try restarting it,
> re-consuming the clean pagecache pages we would be evicting here?

Sure, we can't predict which app the user will use next, so when
killing we usually kill the least recently used one. That's a
reasonable strategy in most cases.
In general, if speeding up the victim's reclaim negatively affects the
overall user workflow then this would mean we are selecting wrong kill
targets. In that case, we would need to adjust the target selection
strategy.

Thanks for tackling this Minchan! I'll try to review the patches this
weekend and provide my feedback.
Thanks,
Suren.

>
> Just a thought.
>
> --
> Cheers,
>
> David
Re: [RFC 0/3] mm: process_mrelease: expedited reclaim and auto-kill support
Posted by David Hildenbrand (Arm) 1 month, 3 weeks ago
On 4/24/26 00:36, Suren Baghdasaryan wrote:
> On Thu, Apr 23, 2026 at 2:50 AM David Hildenbrand (Arm)
> <david@kernel.org> wrote:
>>
>> On 4/23/26 09:50, Michal Hocko wrote:
>>> [...]
>>>
>>> This is a very creative/disruptive way to do a memory reclaim. From a
>>> user POV I would much rather see clean page cache reclaimed before my
>>> apps start to disappear. But this is obviously your call and your users
>>> that will care.
>>>
>>> Anyway, I still maintain my position. I do not think it is a good
>>> idea to drop clean page cache as you do not know whether there are other
>>> users.
> 
> I'm very much familiar with these issues in Android and really want to
> find a good solution for them. IIUC, this RFC tries to address 2
> things at once:
> 1. handling clean private page cache when reaping memory of a kill victim;
> 2. addressing a race between kill() and process_mrelease():
> process_mrelease() can't happen before the kill(), but if it happens too
> late, after the victim has passed its exit_mm(), then process_mrelease()
> fails to find the mm to reap. This defeats the purpose of the
> process_mrelease() call, because the actual memory (released by
> exit_mmap()) might not yet be free and a successful process_mrelease()
> would be very beneficial.
> 
> I see these two as separate issues and I'm not sure combining them
> into a single discussion is a good idea.
> 
>>
>> IIRC, Johannes raised in the past that we cannot predict the future.
>>
>> For example, if an app gets OOM-killed, wouldn't we usually try restarting it,
>> re-consuming the clean pagecache pages we would be evicting here?
> 
> Sure, we can't predict which app the user will use next, so when
> killing we usually kill the least recently used one. That's a
> reasonable strategy in most cases.

That makes sense. As long as other apps you open next won't need the same
libraries etc that you just evicted.

-- 
Cheers,

David
Re: [RFC 0/3] mm: process_mrelease: expedited reclaim and auto-kill support
Posted by Suren Baghdasaryan 1 month, 3 weeks ago
On Fri, Apr 24, 2026 at 12:41 AM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> On 4/24/26 00:36, Suren Baghdasaryan wrote:
> > On Thu, Apr 23, 2026 at 2:50 AM David Hildenbrand (Arm)
> > <david@kernel.org> wrote:
> >>
> >> On 4/23/26 09:50, Michal Hocko wrote:
> >>> [...]
> >>>
> >>> This is a very creative/disruptive way to do a memory reclaim. From a
> >>> user POV I would much rather see clean page cache reclaimed before my
> >>> apps start to disappear. But this is obviously your call and your users
> >>> that will care.
> >>>
> >>> Anyway, I still maintain my position. I do not think it is a good
> >>> idea to drop clean page cache as you do not know whether there are other
> >>> users.
> >
> > I'm very much familiar with these issues in Android and really want to
> > find a good solution for them. IIUC, this RFC tries to address 2
> > things at once:
> > 1. handling clean private page cache when reaping memory of a kill victim;
> > 2. addressing a race between kill() and process_mrelease():
> > process_mrelease() can't happen before the kill(), but if it happens too
> > late, after the victim has passed its exit_mm(), then process_mrelease()
> > fails to find the mm to reap. This defeats the purpose of the
> > process_mrelease() call, because the actual memory (released by
> > exit_mmap()) might not yet be free and a successful process_mrelease()
> > would be very beneficial.
> >
> > I see these two as separate issues and I'm not sure combining them
> > into a single discussion is a good idea.
> >
> >>
> >> IIRC, Johannes raised in the past that we cannot predict the future.
> >>
> >> For example, if an app gets OOM-killed, wouldn't we usually try restarting it,
> >> re-consuming the clean pagecache pages we would be evicting here?
> >
> > Sure, we can't predict which app the user will use next, so when
> > killing we usually kill the least recently used one. That's a
> > reasonable strategy in most cases.
>
> That makes sense. As long as other apps you open next won't need the same
> libraries etc that you just evicted.

True. However, shared libraries have a high chance of being used by
more than one process when we kill a background app. So, while it's
possible that we might evict a page from a library that will be used
shortly after, it's more likely that pages from such libraries would
not be private to the victim. Again, all this is probabilistic and no
ideal strategy exists without knowing the future. All we can do is
optimize for the most likely scenario.

>
> --
> Cheers,
>
> David
Re: [RFC 0/3] mm: process_mrelease: expedited reclaim and auto-kill support
Posted by Michal Hocko 1 month, 3 weeks ago
On Thu 23-04-26 15:36:57, Suren Baghdasaryan wrote:
> On Thu, Apr 23, 2026 at 2:50 AM David Hildenbrand (Arm)
> <david@kernel.org> wrote:
> >
> > On 4/23/26 09:50, Michal Hocko wrote:
> > > On Mon 20-04-26 14:53:23, Minchan Kim wrote:
> > >> On Fri, Apr 17, 2026 at 09:11:21AM +0200, Michal Hocko wrote:
> > > [...]
> > >>> Yes. All which make sense, really. I am still not convinced about the
> > >>> clean page cache because that just seems like a hack to workaround wrong
> > >>> userspace oom heuristics.
> > >>
> > >> I see it a bit differently. When the platform decides to kill a process
> > >> to free up memory, it wants that memory back right away.
> > >>
> > >> So it doesn't make much sense for the kernel to ignore that and leave the clean
> > >> file pages to be picked up slowly by kswapd later.
> > >>
> > >> In some aspects, you can think of LMKD as a more specialized, userspace version
> > >> of kswapd. It has high-level knowledge of process priorities and knows exactly
> > >> which process is safe to kill to get memory instantly. The kernel's kswapd,
> > >> however, operates globally without this specific process-level awareness, which
> > >> makes it less suited for this kind of targeted reclamation.
> > >>
> > >> If we force LMKD to rely on the slower global kswapd to actually free the clean
> > >> pages, it defeats the whole purpose of targeting a specific process.
> > >>
> > >> So letting process_mrelease speed this up isn't a hack at all. It's just helping
> > >> the kernel do what the admin wanted in the first place: fast, targeted memory.
> > >
> > > This is a very creative/disruptive way to do a memory reclaim. From a
> > > user POV I would much rather see clean page cache reclaimed before my
> > > apps start to disappear. But this is obviously your call and your users
> > > that will care.
> > >
> > > Anyway, I still maintain my position. I do not think it is a good
> > > idea to drop clean page cache as you do not know whether there are other
> > > users.
> 
> I'm very much familiar with these issues in Android and really want to
> find a good solution for them. IIUC, this RFC tries to address 2
> things at once:
> 1. handling clean private page cache when reaping memory of a kill victim;
> 2. addressing a race between kill() and process_mrelease():
> process_mrelease() can't happen before the kill(), but if it happens too
> late, after the victim has passed its exit_mm(), then process_mrelease()
> fails to find the mm to reap. This defeats the purpose of the
> process_mrelease() call, because the actual memory (released by
> exit_mmap()) might not yet be free and a successful process_mrelease()
> would be very beneficial.
> 
> I see these two as separate issues and I'm not sure combining them
> into a single discussion is a good idea.

Fully agreed!

-- 
Michal Hocko
SUSE Labs
Re: [RFC 0/3] mm: process_mrelease: expedited reclaim and auto-kill support
Posted by Minchan Kim 1 month, 3 weeks ago
On Thu, Apr 23, 2026 at 03:36:57PM -0700, Suren Baghdasaryan wrote:
> On Thu, Apr 23, 2026 at 2:50 AM David Hildenbrand (Arm)
> <david@kernel.org> wrote:
> >
> > On 4/23/26 09:50, Michal Hocko wrote:
> > > On Mon 20-04-26 14:53:23, Minchan Kim wrote:
> > >> On Fri, Apr 17, 2026 at 09:11:21AM +0200, Michal Hocko wrote:
> > > [...]
> > >>> Yes. All which make sense, really. I am still not convinced about the
> > >>> clean page cache because that just seems like a hack to workaround wrong
> > >>> userspace oom heuristics.
> > >>
> > >> I see it a bit differently. When the platform decides to kill a process
> > >> to free up memory, it wants that memory back right away.
> > >>
> > >> So it doesn't make much sense for the kernel to ignore that and leave the clean
> > >> file pages to be picked up slowly by kswapd later.
> > >>
> > >> In some aspects, you can think of LMKD as a more specialized, userspace version
> > >> of kswapd. It has high-level knowledge of process priorities and knows exactly
> > >> which process is safe to kill to get memory instantly. The kernel's kswapd,
> > >> however, operates globally without this specific process-level awareness, which
> > >> makes it less suited for this kind of targeted reclamation.
> > >>
> > >> If we force LMKD to rely on the slower global kswapd to actually free the clean
> > >> pages, it defeats the whole purpose of targeting a specific process.
> > >>
> > >> So letting process_mrelease speed this up isn't a hack at all. It's just helping
> > >> the kernel do what the admin wanted in the first place: fast, targeted memory.
> > >
> > > This is a very creative/disruptive way to do a memory reclaim. From a
> > > user POV I would much rather see clean page cache reclaimed before my
> > > apps start to disappear. But this is obviously your call and your users
> > > that will care.
> > >
> > > Anyway, I still maintain my position. I do not think it is a good
> > > idea to drop clean page cache as you do not know whether there are other
> > > users.
> 
> I'm very much familiar with these issues in Android and really want to
> find a good solution for them. IIUC, this RFC tries to address 2
> things at once:
> 1. handling clean private page cache when reaping memory of a kill victim;
> 2. addressing a race between kill() and process_mrelease():
> process_mrelease() can't happen before the kill(), but if it happens too
> late, after the victim has passed its exit_mm(), then process_mrelease()
> fails to find the mm to reap. This defeats the purpose of the
> process_mrelease() call, because the actual memory (released by
> exit_mmap()) might not yet be free and a successful process_mrelease()
> would be very beneficial.
> 
> I see these two as separate issues and I'm not sure combining them
> into a single discussion is a good idea.

Yeah, they are two different issues, so I tried to describe both problems
in the cover letter and address the issues one by one in the individual
patches.

I can easily drop either of them if it's not well received.
I am also fine with sending them separately if that's confusing. No problem.

> 
> >
> > IIRC, Johannes raised in the past that we cannot predict the future.
> >
> > For example, if an app gets OOM-killed, wouldn't we usually try restarting it,
> > re-consuming the clean pagecache pages we would be evicting here?
> 
> Sure, we can't predict which app the user will use next, so when
> killing we usually kill the least recently used one. That's a
> reasonable strategy in most cases.
> In general, if speeding up the victim's reclaim negatively affects the
> overall user workflow then this would mean we are selecting wrong kill
> targets. In that case, we would need to adjust the target selection
> strategy.
> 
> Thanks for tackling this Minchan! I'll try to review the patches this
> weekend and provide my feedback.

Please go with the second patchset.
https://lore.kernel.org/linux-mm/20260421230239.172582-1-minchan@kernel.org/

Thanks!