[PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
Posted by Lei Liu 3 weeks, 2 days ago
1. Problem Scenario
On systems with ZRAM and swap enabled, simultaneous process exits create
contention. The primary bottleneck occurs during swap entry release
operations, causing exiting processes to monopolize CPU resources. This
leads to scheduling delays for high-priority processes.

2. Android Use Case
During camera launch, LMKD terminates background processes to free memory.
Exiting processes compete for CPU cycles, delaying the camera preview
thread and causing visible stuttering - directly impacting user
experience.

3. Root Cause Analysis
When background applications heavily utilize swap space, process exit
profiling reveals 55% of time spent in free_swap_and_cache_nr():

Function                 Duration (ms)   Percentage
do_signal                      791.813         100%
do_group_exit                  791.813         100%
do_exit                        791.813         100%
exit_mm                        577.859          73%
exit_mmap                      577.497          73%
zap_pte_range                  558.645          71%
free_swap_and_cache_nr         433.381          55%
free_swap_slot                 403.568          51%
swap_entry_free                393.863          50%
swap_range_free                372.602          47%

4. Optimization Approach
a) For processes exceeding a swap entry threshold: aggregate and isolate
their swap entries to enable a fast exit
b) Asynchronously release the batched entries once the number of isolated
entries reaches a configured threshold

5. Performance Gains (User Scenario: Camera Cold Launch)
a) 74% reduction in process exit latency (>500ms cases)
b) ~4% lower peak CPU load during concurrent process exits
c) ~70MB additional free memory during camera preview initialization
d) 40% reduction in camera preview stuttering probability

6. Prior Art & Improvements
Reference: Zhiguo Jiang's patch
(https://lore.kernel.org/all/20240805153639.1057-1-justinjiang@vivo.com/)

Key enhancements:
a) Reimplemented logic moved from mmu_gather.c to swapfile.c for clarity
b) Async release delegated to workqueue kworkers with configurable
max_active for NUMA-optimized concurrency

Lei Liu (2):
  mm: swap: Gather swap entries and batch async release core
  mm: swap: Forced swap entries release under memory pressure

 include/linux/oom.h           |  23 ++++++
 include/linux/swapfile.h      |   2 +
 include/linux/vm_event_item.h |   1 +
 kernel/exit.c                 |   2 +
 mm/memcontrol.c               |   6 --
 mm/memory.c                   |   4 +-
 mm/page_alloc.c               |   4 +
 mm/swapfile.c                 | 134 ++++++++++++++++++++++++++++++++++
 mm/vmstat.c                   |   1 +
 9 files changed, 170 insertions(+), 7 deletions(-)

-- 
2.34.1
Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
Posted by Shakeel Butt 3 weeks, 2 days ago
On Tue, Sep 09, 2025 at 02:53:39PM +0800, Lei Liu wrote:
> 1. Problem Scenario
> On systems with ZRAM and swap enabled, simultaneous process exits create
> contention. The primary bottleneck occurs during swap entry release
> operations, causing exiting processes to monopolize CPU resources. This
> leads to scheduling delays for high-priority processes.
> 
> 2. Android Use Case
> During camera launch, LMKD terminates background processes to free memory.

How does LMKD trigger the kills? SIGKILL or cgroup.kill?

> Exiting processes compete for CPU cycles, delaying the camera preview
> thread and causing visible stuttering - directly impacting user
> experience.

Since the exit/kill is due to a low memory situation, punting the memory
freeing to a low priority async mechanism is unlikely to improve user
experience. Most probably the application (camera preview here) will get
into global reclaim and will compete for CPU with the async memory
freeing.

What we really need is faster memory freeing and we should explore all
possible ways. As others suggested fix/improve the bottleneck in the
memory freeing path. In addition I think we should explore parallelizing
this as well.

On Android, I suppose most of the memory is associated with single or
small set of processes and parallelizing memory freeing would be
challenging. BTW is LMKD using process_mrelease() to release the killed
process memory?
Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
Posted by Suren Baghdasaryan 3 weeks, 2 days ago
On Tue, Sep 9, 2025 at 12:21 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Tue, Sep 09, 2025 at 02:53:39PM +0800, Lei Liu wrote:
> > 1. Problem Scenario
> > On systems with ZRAM and swap enabled, simultaneous process exits create
> > contention. The primary bottleneck occurs during swap entry release
> > operations, causing exiting processes to monopolize CPU resources. This
> > leads to scheduling delays for high-priority processes.
> >
> > 2. Android Use Case
> > During camera launch, LMKD terminates background processes to free memory.
>
> How does LMKD trigger the kills? SIGKILL or cgroup.kill?

SIGKILL

>
> > Exiting processes compete for CPU cycles, delaying the camera preview
> > thread and causing visible stuttering - directly impacting user
> > experience.
>
> Since the exit/kill is due to low memory situation, punting the memory
> freeing to a low priority async mechanism will help in improving user
> experience. Most probably the application (camera preview here) will get
> into global reclaim and will compete for CPU with the async memory
> freeing.
>
> What we really need is faster memory freeing and we should explore all
> possible ways. As others suggested fix/improve the bottleneck in the
> memory freeing path. In addition I think we should explore parallelizing
> this as well.
>
> On Android, I suppose most of the memory is associated with single or
> small set of processes and parallelizing memory freeing would be
> challenging. BTW is LMKD using process_mrelease() to release the killed
> process memory?

Yes, LMKD has a reaper thread which wakes up and calls
process_mrelease() after the main LMKD thread issued SIGKILL.

>
Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
Posted by Shakeel Butt 3 weeks, 1 day ago
On Tue, Sep 09, 2025 at 12:48:02PM -0700, Suren Baghdasaryan wrote:
> On Tue, Sep 9, 2025 at 12:21 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Tue, Sep 09, 2025 at 02:53:39PM +0800, Lei Liu wrote:
> > > 1. Problem Scenario
> > > On systems with ZRAM and swap enabled, simultaneous process exits create
> > > contention. The primary bottleneck occurs during swap entry release
> > > operations, causing exiting processes to monopolize CPU resources. This
> > > leads to scheduling delays for high-priority processes.
> > >
> > > 2. Android Use Case
> > > During camera launch, LMKD terminates background processes to free memory.
> >
> > How does LMKD trigger the kills? SIGKILL or cgroup.kill?
> 
> SIGKILL
> 
> >
> > > Exiting processes compete for CPU cycles, delaying the camera preview
> > > thread and causing visible stuttering - directly impacting user
> > > experience.
> >
> > Since the exit/kill is due to low memory situation, punting the memory
> > freeing to a low priority async mechanism will help in improving user
> > experience. Most probably the application (camera preview here) will get
> > into global reclaim and will compete for CPU with the async memory
> > freeing.
> >
> > What we really need is faster memory freeing and we should explore all
> > possible ways. As others suggested fix/improve the bottleneck in the
> > memory freeing path. In addition I think we should explore parallelizing
> > this as well.
> >
> > On Android, I suppose most of the memory is associated with single or
> > small set of processes and parallelizing memory freeing would be
> > challenging. BTW is LMKD using process_mrelease() to release the killed
> > process memory?
> 
> Yes, LMKD has a reaper thread which wakes up and calls
> process_mrelease() after the main LMKD thread issued SIGKILL.
> 

Thanks Suren. I remember Android is planning to put apps in cgroups. Is
that still the plan? I am actually looking into making cgroup.kill, besides
sending SIGKILL, also put the processes of the target cgroup on the oom
reaper list. In addition, I am making the oom reaper able to reap processes
in parallel. I am hoping that functionality will be useful to Android as
well.
> >
Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
Posted by Suren Baghdasaryan 3 weeks, 1 day ago
On Wed, Sep 10, 2025 at 1:10 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Tue, Sep 09, 2025 at 12:48:02PM -0700, Suren Baghdasaryan wrote:
> > On Tue, Sep 9, 2025 at 12:21 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Tue, Sep 09, 2025 at 02:53:39PM +0800, Lei Liu wrote:
> > > > 1. Problem Scenario
> > > > On systems with ZRAM and swap enabled, simultaneous process exits create
> > > > contention. The primary bottleneck occurs during swap entry release
> > > > operations, causing exiting processes to monopolize CPU resources. This
> > > > leads to scheduling delays for high-priority processes.
> > > >
> > > > 2. Android Use Case
> > > > During camera launch, LMKD terminates background processes to free memory.
> > >
> > > How does LMKD trigger the kills? SIGKILL or cgroup.kill?
> >
> > SIGKILL
> >
> > >
> > > > Exiting processes compete for CPU cycles, delaying the camera preview
> > > > thread and causing visible stuttering - directly impacting user
> > > > experience.
> > >
> > > Since the exit/kill is due to low memory situation, punting the memory
> > > freeing to a low priority async mechanism will help in improving user
> > > experience. Most probably the application (camera preview here) will get
> > > into global reclaim and will compete for CPU with the async memory
> > > freeing.
> > >
> > > What we really need is faster memory freeing and we should explore all
> > > possible ways. As others suggested fix/improve the bottleneck in the
> > > memory freeing path. In addition I think we should explore parallelizing
> > > this as well.
> > >
> > > On Android, I suppose most of the memory is associated with single or
> > > small set of processes and parallelizing memory freeing would be
> > > challenging. BTW is LMKD using process_mrelease() to release the killed
> > > process memory?
> >
> > Yes, LMKD has a reaper thread which wakes up and calls
> > process_mrelease() after the main LMKD thread issued SIGKILL.
> >
>
> Thanks Suren. I remember Android is planning to use Apps in cgroup. Is
> that still the plan? I am actually looking into cgroup.kill, beside
> sending SIGKILL, putting the processes of the target cgroup in the oom
> reaper list. In addition, making oom reaper able to reap processes in
> parallel. I am hoping that functionality to be useful to Android as
> well.

Yes, cgroups v2 with per-app hierarchy is already enabled on Android
as of about a year or so ago. The first usecase was the freezer. TJ
(CC'ing him here) also changed how ActivityManager Service (AMS) kills
process groups to use cgroup.kill (think when you force-stop an app
that's what will happen). LMKD has not been changed to use cgroup.kill
but that might be worth doing now. TJ, WDYT?


> > >
Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
Posted by T.J. Mercier 3 weeks, 1 day ago
On Wed, Sep 10, 2025 at 1:41 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Sep 10, 2025 at 1:10 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Tue, Sep 09, 2025 at 12:48:02PM -0700, Suren Baghdasaryan wrote:
> > > On Tue, Sep 9, 2025 at 12:21 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > >
> > > > On Tue, Sep 09, 2025 at 02:53:39PM +0800, Lei Liu wrote:
> > > > > 1. Problem Scenario
> > > > > On systems with ZRAM and swap enabled, simultaneous process exits create
> > > > > contention. The primary bottleneck occurs during swap entry release
> > > > > operations, causing exiting processes to monopolize CPU resources. This
> > > > > leads to scheduling delays for high-priority processes.
> > > > >
> > > > > 2. Android Use Case
> > > > > During camera launch, LMKD terminates background processes to free memory.
> > > >
> > > > How does LMKD trigger the kills? SIGKILL or cgroup.kill?
> > >
> > > SIGKILL
> > >
> > > >
> > > > > Exiting processes compete for CPU cycles, delaying the camera preview
> > > > > thread and causing visible stuttering - directly impacting user
> > > > > experience.
> > > >
> > > > Since the exit/kill is due to low memory situation, punting the memory
> > > > freeing to a low priority async mechanism will help in improving user
> > > > experience. Most probably the application (camera preview here) will get
> > > > into global reclaim and will compete for CPU with the async memory
> > > > freeing.
> > > >
> > > > What we really need is faster memory freeing and we should explore all
> > > > possible ways. As others suggested fix/improve the bottleneck in the
> > > > memory freeing path. In addition I think we should explore parallelizing
> > > > this as well.
> > > >
> > > > On Android, I suppose most of the memory is associated with single or
> > > > small set of processes and parallelizing memory freeing would be
> > > > challenging. BTW is LMKD using process_mrelease() to release the killed
> > > > process memory?
> > >
> > > Yes, LMKD has a reaper thread which wakes up and calls
> > > process_mrelease() after the main LMKD thread issued SIGKILL.
> > >
> >
> > Thanks Suren. I remember Android is planning to use Apps in cgroup. Is
> > that still the plan? I am actually looking into cgroup.kill, beside
> > sending SIGKILL, putting the processes of the target cgroup in the oom
> > reaper list. In addition, making oom reaper able to reap processes in
> > parallel. I am hoping that functionality to be useful to Android as
> > well.
>
> Yes, cgroups v2 with per-app hierarchy is already enabled on Android
> as of about a year or so ago. The first usecase was the freezer. TJ
> (CC'ing him here) also changed how ActivityManager Service (AMS) kills
> process groups to use cgroup.kill (think when you force-stop an app
> that's what will happen). LMKD has not been changed to use cgroup.kill
> but that might be worth doing now. TJ, WDYT?

Sounds like it's worth trying here [1].

One potential downside of cgroup.kill is that it requires taking the
cgroup_mutex, which is one of our most heavily contended locks.

We already have logic that waits for exits in libprocessgroup's
KillProcessGroup [2], but I don't think LMKD needs or wants that from
its main thread. I think we'll still want process_mrelease [3] from
LMKD's reaper thread.

[1] https://cs.android.com/android/platform/superproject/main/+/main:system/memory/lmkd/reaper.cpp;drc=88ca1a4963004011669da415bc421b846936071f;l=233
[2] https://cs.android.com/android/platform/superproject/main/+/main:system/core/libprocessgroup/processgroup.cpp;drc=61197364367c9e404c7da6900658f1b16c42d0da;l=537
[3] https://cs.android.com/android/platform/superproject/main/+/main:system/memory/lmkd/reaper.cpp;drc=88ca1a4963004011669da415bc421b846936071f;l=123

Shakeel could we not also invoke the oom reaper's help for regular
kill(SIGKILL)s?
Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
Posted by Shakeel Butt 3 weeks, 1 day ago
On Wed, Sep 10, 2025 at 03:10:29PM -0700, T.J. Mercier wrote:
> On Wed, Sep 10, 2025 at 1:41 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Wed, Sep 10, 2025 at 1:10 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Tue, Sep 09, 2025 at 12:48:02PM -0700, Suren Baghdasaryan wrote:
> > > > On Tue, Sep 9, 2025 at 12:21 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > > >
> > > > > On Tue, Sep 09, 2025 at 02:53:39PM +0800, Lei Liu wrote:
> > > > > > 1. Problem Scenario
> > > > > > On systems with ZRAM and swap enabled, simultaneous process exits create
> > > > > > contention. The primary bottleneck occurs during swap entry release
> > > > > > operations, causing exiting processes to monopolize CPU resources. This
> > > > > > leads to scheduling delays for high-priority processes.
> > > > > >
> > > > > > 2. Android Use Case
> > > > > > During camera launch, LMKD terminates background processes to free memory.
> > > > >
> > > > > How does LMKD trigger the kills? SIGKILL or cgroup.kill?
> > > >
> > > > SIGKILL
> > > >
> > > > >
> > > > > > Exiting processes compete for CPU cycles, delaying the camera preview
> > > > > > thread and causing visible stuttering - directly impacting user
> > > > > > experience.
> > > > >
> > > > > Since the exit/kill is due to low memory situation, punting the memory
> > > > > freeing to a low priority async mechanism will help in improving user
> > > > > experience. Most probably the application (camera preview here) will get
> > > > > into global reclaim and will compete for CPU with the async memory
> > > > > freeing.
> > > > >
> > > > > What we really need is faster memory freeing and we should explore all
> > > > > possible ways. As others suggested fix/improve the bottleneck in the
> > > > > memory freeing path. In addition I think we should explore parallelizing
> > > > > this as well.
> > > > >
> > > > > On Android, I suppose most of the memory is associated with single or
> > > > > small set of processes and parallelizing memory freeing would be
> > > > > challenging. BTW is LMKD using process_mrelease() to release the killed
> > > > > process memory?
> > > >
> > > > Yes, LMKD has a reaper thread which wakes up and calls
> > > > process_mrelease() after the main LMKD thread issued SIGKILL.
> > > >
> > >
> > > Thanks Suren. I remember Android is planning to use Apps in cgroup. Is
> > > that still the plan? I am actually looking into cgroup.kill, beside
> > > sending SIGKILL, putting the processes of the target cgroup in the oom
> > > reaper list. In addition, making oom reaper able to reap processes in
> > > parallel. I am hoping that functionality to be useful to Android as
> > > well.
> >
> > Yes, cgroups v2 with per-app hierarchy is already enabled on Android
> > as of about a year or so ago. The first usecase was the freezer. TJ
> > (CC'ing him here) also changed how ActivityManager Service (AMS) kills
> > process groups to use cgroup.kill (think when you force-stop an app
> > that's what will happen). LMKD has not been changed to use cgroup.kill
> > but that might be worth doing now. TJ, WDYT?
> 
> Sounds like it's worth trying here [1].
> 
> One potential downside of cgroup.kill is that it requires taking the
> cgroup_mutex, which is one of our most heavily contended locks.

Oh let me look into that and see if we can remove cgroup_mutex from that
interface.

> 
> We already have logic that waits for exits in libprocessgroup's
> KillProcessGroup [2], but I don't think LMKD needs or wants that from
> its main thread. I think we'll still want process_mrelease [3] from
> LMKD's reaper thread.

I imagine once the kernel oom reaper can work on killed processes
transparently, it would be much easier to let it do the job instead of
manually calling process_mrelease() on all the processes in a cgroup.

> 
> [1] https://cs.android.com/android/platform/superproject/main/+/main:system/memory/lmkd/reaper.cpp;drc=88ca1a4963004011669da415bc421b846936071f;l=233
> [2] https://cs.android.com/android/platform/superproject/main/+/main:system/core/libprocessgroup/processgroup.cpp;drc=61197364367c9e404c7da6900658f1b16c42d0da;l=537
> [3] https://cs.android.com/android/platform/superproject/main/+/main:system/memory/lmkd/reaper.cpp;drc=88ca1a4963004011669da415bc421b846936071f;l=123
> 
> Shakeel could we not also invoke the oom reaper's help for regular
> kill(SIGKILL)s?

I don't see why this can not be done. I will take a look.
Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
Posted by Chris Li 3 weeks, 1 day ago
On Tue, Sep 9, 2025 at 12:48 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > Exiting processes compete for CPU cycles, delaying the camera preview
> > > thread and causing visible stuttering - directly impacting user
> > > experience.
> >
> > Since the exit/kill is due to low memory situation, punting the memory
> > freeing to a low priority async mechanism will help in improving user
> > experience. Most probably the application (camera preview here) will get
> > into global reclaim and will compete for CPU with the async memory
> > freeing.
> >
> > What we really need is faster memory freeing and we should explore all
> > possible ways. As others suggested fix/improve the bottleneck in the
> > memory freeing path. In addition I think we should explore parallelizing
> > this as well.
> >
> > On Android, I suppose most of the memory is associated with single or
> > small set of processes and parallelizing memory freeing would be
> > challenging. BTW is LMKD using process_mrelease() to release the killed
> > process memory?
>
> Yes, LMKD has a reaper thread which wakes up and calls
> process_mrelease() after the main LMKD thread issued SIGKILL.

I feel this is a better solution for addressing a process exit that is too
slow.

We are basically optimizing the exit() system call, and I feel there
should be something we can do in userspace before exit() to help us,
without putting too much complexity into the kernel's exit() path.
process_mrelease() sounds like it fits the bill pretty well.

Chris
Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
Posted by Lei Liu 3 weeks, 1 day ago
On 2025/9/10 3:48, Suren Baghdasaryan wrote:
> On Tue, Sep 9, 2025 at 12:21 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>> On Tue, Sep 09, 2025 at 02:53:39PM +0800, Lei Liu wrote:
>>> 1. Problem Scenario
>>> On systems with ZRAM and swap enabled, simultaneous process exits create
>>> contention. The primary bottleneck occurs during swap entry release
>>> operations, causing exiting processes to monopolize CPU resources. This
>>> leads to scheduling delays for high-priority processes.
>>>
>>> 2. Android Use Case
>>> During camera launch, LMKD terminates background processes to free memory.
>> How does LMKD trigger the kills? SIGKILL or cgroup.kill?
> SIGKILL
>
>>> Exiting processes compete for CPU cycles, delaying the camera preview
>>> thread and causing visible stuttering - directly impacting user
>>> experience.
>> Since the exit/kill is due to low memory situation, punting the memory
>> freeing to a low priority async mechanism will help in improving user
>> experience. Most probably the application (camera preview here) will get
>> into global reclaim and will compete for CPU with the async memory
>> freeing.
>>
>> What we really need is faster memory freeing and we should explore all
>> possible ways. As others suggested fix/improve the bottleneck in the
>> memory freeing path. In addition I think we should explore parallelizing
>> this as well.
>>
>> On Android, I suppose most of the memory is associated with single or
>> small set of processes and parallelizing memory freeing would be
>> challenging. BTW is LMKD using process_mrelease() to release the killed
>> process memory?
> Yes, LMKD has a reaper thread which wakes up and calls
> process_mrelease() after the main LMKD thread issued SIGKILL.

Hi Suren

Our current issue is that after lmkd kills a process, exit_mm takes
considerable time. The interface you provided might help quickly free
memory, potentially allowing us to release some memory from processes
before lmkd kills them. This could be a good idea.

We will take your suggestion into consideration.


Thank you




>
Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
Posted by Shakeel Butt 3 weeks, 1 day ago
On Wed, Sep 10, 2025 at 10:14:04PM +0800, Lei Liu wrote:
> 
> On 2025/9/10 3:48, Suren Baghdasaryan wrote:
> > On Tue, Sep 9, 2025 at 12:21 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > On Tue, Sep 09, 2025 at 02:53:39PM +0800, Lei Liu wrote:
> > > > 1. Problem Scenario
> > > > On systems with ZRAM and swap enabled, simultaneous process exits create
> > > > contention. The primary bottleneck occurs during swap entry release
> > > > operations, causing exiting processes to monopolize CPU resources. This
> > > > leads to scheduling delays for high-priority processes.
> > > > 
> > > > 2. Android Use Case
> > > > During camera launch, LMKD terminates background processes to free memory.
> > > How does LMKD trigger the kills? SIGKILL or cgroup.kill?
> > SIGKILL
> > 
> > > > Exiting processes compete for CPU cycles, delaying the camera preview
> > > > thread and causing visible stuttering - directly impacting user
> > > > experience.
> > > Since the exit/kill is due to low memory situation, punting the memory
> > > freeing to a low priority async mechanism will help in improving user
> > > experience. Most probably the application (camera preview here) will get
> > > into global reclaim and will compete for CPU with the async memory
> > > freeing.
> > > 
> > > What we really need is faster memory freeing and we should explore all
> > > possible ways. As others suggested fix/improve the bottleneck in the
> > > memory freeing path. In addition I think we should explore parallelizing
> > > this as well.
> > > 
> > > On Android, I suppose most of the memory is associated with single or
> > > small set of processes and parallelizing memory freeing would be
> > > challenging. BTW is LMKD using process_mrelease() to release the killed
> > > process memory?
> > Yes, LMKD has a reaper thread which wakes up and calls
> > process_mrelease() after the main LMKD thread issued SIGKILL.
> 
> Hi Suren
> 
> Our current issue is that after lmkd kills a process, exit_mm takes
> considerable time. The interface you provided might help quickly free
> memory, potentially allowing us to release some memory from processes before
> lmkd kills them. This could be a good idea.
> 
> We will take your suggestion into consideration.

But LMKD already does the process_mrelease(). Is that not happening on
your setup?
Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
Posted by Lei Liu 3 weeks ago
On 2025/9/11 4:12, Shakeel Butt wrote:
> On Wed, Sep 10, 2025 at 10:14:04PM +0800, Lei Liu wrote:
>> On 2025/9/10 3:48, Suren Baghdasaryan wrote:
>>> On Tue, Sep 9, 2025 at 12:21 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>>> On Tue, Sep 09, 2025 at 02:53:39PM +0800, Lei Liu wrote:
>>>>> 1. Problem Scenario
>>>>> On systems with ZRAM and swap enabled, simultaneous process exits create
>>>>> contention. The primary bottleneck occurs during swap entry release
>>>>> operations, causing exiting processes to monopolize CPU resources. This
>>>>> leads to scheduling delays for high-priority processes.
>>>>>
>>>>> 2. Android Use Case
>>>>> During camera launch, LMKD terminates background processes to free memory.
>>>> How does LMKD trigger the kills? SIGKILL or cgroup.kill?
>>> SIGKILL
>>>
>>>>> Exiting processes compete for CPU cycles, delaying the camera preview
>>>>> thread and causing visible stuttering - directly impacting user
>>>>> experience.
>>>> Since the exit/kill is due to low memory situation, punting the memory
>>>> freeing to a low priority async mechanism will help in improving user
>>>> experience. Most probably the application (camera preview here) will get
>>>> into global reclaim and will compete for CPU with the async memory
>>>> freeing.
>>>>
>>>> What we really need is faster memory freeing and we should explore all
>>>> possible ways. As others suggested fix/improve the bottleneck in the
>>>> memory freeing path. In addition I think we should explore parallelizing
>>>> this as well.
>>>>
>>>> On Android, I suppose most of the memory is associated with single or
>>>> small set of processes and parallelizing memory freeing would be
>>>> challenging. BTW is LMKD using process_mrelease() to release the killed
>>>> process memory?
>>> Yes, LMKD has a reaper thread which wakes up and calls
>>> process_mrelease() after the main LMKD thread issued SIGKILL.
>> Hi Suren
>>
>> Our current issue is that after lmkd kills a process, exit_mm takes
>> considerable time. The interface you provided might help quickly free
>> memory, potentially allowing us to release some memory from processes before
>> lmkd kills them. This could be a good idea.
>>
>> We will take your suggestion into consideration.
> But LMKD already does the process_mrelease(). Is that not happening on
> your setup?

Hi Shakeel

Thank you for your consideration.

In our product, we have observed that in scenarios where multiple
processes are being killed, the load on the lmkd_reaper thread can
become very heavy, leading to issues with power consumption and lag.
This problem also occurs in the current camera launch scenario.

Best regards,
Lei


Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
Posted by Chris Li 3 weeks, 1 day ago
On Wed, Sep 10, 2025 at 7:14 AM Lei Liu <liulei.rjpt@vivo.com> wrote:
> >> On Android, I suppose most of the memory is associated with single or
> >> small set of processes and parallelizing memory freeing would be
> >> challenging. BTW is LMKD using process_mrelease() to release the killed
> >> process memory?
> > Yes, LMKD has a reaper thread which wakes up and calls
> > process_mrelease() after the main LMKD thread issued SIGKILL.
>
> Hi Suren
>
> Our current issue is that after lmkd kills a process, exit_mm takes
> considerable time. The interface you provided might help quickly free
> memory, potentially allowing us to release some memory from processes
> before lmkd kills them. This could be a good idea.
>
> We will take your suggestion into consideration.

Hi Lei,

I do want to help with your use case. From my previous analysis of the
swap fault time breakdown, the amount of time spent on batch-freeing
swap entries is not that much. Yes, it has a long tail, but that hits
only a very small percentage of page faults. It shouldn't have such a
huge impact on the global average time.

https://services.google.com/fh/files/misc/zswap-breakdown.png
https://services.google.com/fh/files/misc/zswap-breakdown-detail.png

That is what I am trying to get at: the batch free of swap entries is
just the surface level. By itself it does not contribute much; your
exit latency is largely a different issue.

However, the approach you take (I briefly went over your patch) is to
add another batching layer for swap entry freeing, which impacts not
only the exit() path but other non-exit() freeing of swap entries as
well. The swap entry is a resource best managed by the swap allocator:
the allocator knows best when to cache an entry versus freeing it under
pressure. The extra batch of swap entries awaiting free (before the
threshold triggers) is just swap entries sitting in the batch queue; the
allocator has no knowledge of this batching, and it interferes with the
allocator's global view of swap entries. You need to address this before
your patch can be reconsidered.

It feels like a CFO needs to do a company-wide budget and revenue
projection, while the sales department keeps a side-pocket account to
defer revenue and sandbag the sales numbers, which can jeopardize the
CFO's ability to budget and project. BTW, what I describe is probably
illegal for public companies. Kids, don't try this at home.

I think you can do some of the following:
1) Redo the test with the latest kernel, which no longer has the swap
slot cache batching, and report back what you get.
2) Try out process_mrelease().

Please share your findings, I am happy to work with you to address the
problem you encounter.

Chris
Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
Posted by Suren Baghdasaryan 3 weeks, 1 day ago
On Wed, Sep 10, 2025 at 7:14 AM Lei Liu <liulei.rjpt@vivo.com> wrote:
>
>
> On 2025/9/10 3:48, Suren Baghdasaryan wrote:
> > On Tue, Sep 9, 2025 at 12:21 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >> On Tue, Sep 09, 2025 at 02:53:39PM +0800, Lei Liu wrote:
> >>> 1. Problem Scenario
> >>> On systems with ZRAM and swap enabled, simultaneous process exits create
> >>> contention. The primary bottleneck occurs during swap entry release
> >>> operations, causing exiting processes to monopolize CPU resources. This
> >>> leads to scheduling delays for high-priority processes.
> >>>
> >>> 2. Android Use Case
> >>> During camera launch, LMKD terminates background processes to free memory.
> >> How does LMKD trigger the kills? SIGKILL or cgroup.kill?
> > SIGKILL
> >
> >>> Exiting processes compete for CPU cycles, delaying the camera preview
> >>> thread and causing visible stuttering - directly impacting user
> >>> experience.
> >> Since the exit/kill is due to low memory situation, punting the memory
> >> freeing to a low priority async mechanism will help in improving user
> >> experience. Most probably the application (camera preview here) will get
> >> into global reclaim and will compete for CPU with the async memory
> >> freeing.
> >>
> >> What we really need is faster memory freeing and we should explore all
> >> possible ways. As others suggested fix/improve the bottleneck in the
> >> memory freeing path. In addition I think we should explore parallelizing
> >> this as well.
> >>
> >> On Android, I suppose most of the memory is associated with single or
> >> small set of processes and parallelizing memory freeing would be
> >> challenging. BTW is LMKD using process_mrelease() to release the killed
> >> process memory?
> > Yes, LMKD has a reaper thread which wakes up and calls
> > process_mrelease() after the main LMKD thread issued SIGKILL.
>
> Hi Suren
>
> our current issue is that after lmkd kills a process, exit_mm takes
> considerable time. The interface you provided might help quickly free
> memory, potentially allowing us to release some memory from processes
> before lmkd kills them. This could be a good idea.
>
> We will take your suggestion into consideration.

I wasn't really suggesting anything, just explaining how LMKD works today.

>
>
> Thank you
Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
Posted by Kairui Song 3 weeks, 2 days ago
On Tue, Sep 9, 2025 at 3:04 PM Lei Liu <liulei.rjpt@vivo.com> wrote:
>

Hi Lei,

> 1. Problem Scenario
> On systems with ZRAM and swap enabled, simultaneous process exits create
> contention. The primary bottleneck occurs during swap entry release
> operations, causing exiting processes to monopolize CPU resources. This
> leads to scheduling delays for high-priority processes.
>
> 2. Android Use Case
> During camera launch, LMKD terminates background processes to free memory.
> Exiting processes compete for CPU cycles, delaying the camera preview
> thread and causing visible stuttering - directly impacting user
> experience.
>
> 3. Root Cause Analysis
> When background applications heavily utilize swap space, process exit
> profiling reveals 55% of time spent in free_swap_and_cache_nr():
>
> Function              Duration (ms)   Percentage
> do_signal               791.813     **********100%
> do_group_exit           791.813     **********100%
> do_exit                 791.813     **********100%
> exit_mm                 577.859        *******73%
> exit_mmap               577.497        *******73%
> zap_pte_range           558.645        *******71%
> free_swap_and_cache_nr  433.381          *****55%
> free_swap_slot          403.568          *****51%

Thanks for sharing this case.

One problem is that the free_swap_slot function no longer exists
after commit 0ff67f990bd4. Have you tested the latest kernel? What is
the actual overhead there?

Some batch freeing optimizations have been introduced, and we have
reworked the whole locking mechanism for swap, so even on a system
with 96 threads the contention is barely observable with common
workloads.

And another series is further reducing the contention and the overall
overhead (24% faster freeing for phase 1):
https://lore.kernel.org/linux-mm/20250905191357.78298-1-ryncsn@gmail.com/

Will these be helpful for you? I think optimizing the root problem is
better than just deferring the overhead with async workers, which may
increase the overall overhead and complexity.


> swap_entry_free         393.863          *****50%
> swap_range_free         372.602           ****47%
>
> 4. Optimization Approach
> a) For processes exceeding swap entry threshold: aggregate and isolate
> swap entries to enable fast exit
> b) Asynchronously release batched entries when isolation reaches
> configured threshold
>
> 5. Performance Gains (User Scenario: Camera Cold Launch)
> a) 74% reduction in process exit latency (>500ms cases)
> b) ~4% lower peak CPU load during concurrent process exits
> c) ~70MB additional free memory during camera preview initialization
> d) 40% reduction in camera preview stuttering probability
>
> 6. Prior Art & Improvements
> Reference: Zhiguo Jiang's patch
> (https://lore.kernel.org/all/20240805153639.1057-1-justinjiang@vivo.com/)
>
> Key enhancements:
> a) Reimplemented logic moved from mmu_gather.c to swapfile.c for clarity
> b) Async release delegated to workqueue kworkers with configurable
> max_active for NUMA-optimized concurrency
>
> Lei Liu (2):
>   mm: swap: Gather swap entries and batch async release core
>   mm: swap: Forced swap entries release under memory pressure
>
>  include/linux/oom.h           |  23 ++++++
>  include/linux/swapfile.h      |   2 +
>  include/linux/vm_event_item.h |   1 +
>  kernel/exit.c                 |   2 +
>  mm/memcontrol.c               |   6 --
>  mm/memory.c                   |   4 +-
>  mm/page_alloc.c               |   4 +
>  mm/swapfile.c                 | 134 ++++++++++++++++++++++++++++++++++
>  mm/vmstat.c                   |   1 +
>  9 files changed, 170 insertions(+), 7 deletions(-)
>
> --
> 2.34.1
>
>
Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
Posted by Lei Liu 3 weeks, 1 day ago
On 2025/9/9 15:30, Kairui Song wrote:
>
> On Tue, Sep 9, 2025 at 3:04 PM Lei Liu <liulei.rjpt@vivo.com> wrote:
> Hi Lei,
>
>> 1. Problem Scenario
>> On systems with ZRAM and swap enabled, simultaneous process exits create
>> contention. The primary bottleneck occurs during swap entry release
>> operations, causing exiting processes to monopolize CPU resources. This
>> leads to scheduling delays for high-priority processes.
>>
>> 2. Android Use Case
>> During camera launch, LMKD terminates background processes to free memory.
>> Exiting processes compete for CPU cycles, delaying the camera preview
>> thread and causing visible stuttering - directly impacting user
>> experience.
>>
>> 3. Root Cause Analysis
>> When background applications heavily utilize swap space, process exit
>> profiling reveals 55% of time spent in free_swap_and_cache_nr():
>>
>> Function              Duration (ms)   Percentage
>> do_signal               791.813     **********100%
>> do_group_exit           791.813     **********100%
>> do_exit                 791.813     **********100%
>> exit_mm                 577.859        *******73%
>> exit_mmap               577.497        *******73%
>> zap_pte_range           558.645        *******71%
>> free_swap_and_cache_nr  433.381          *****55%
>> free_swap_slot          403.568          *****51%
> Thanks for sharing this case.
>
> One problem is that now the free_swap_slot function no longer exists
> after 0ff67f990bd4. Have you tested the latest kernel? Or what is the
> actual overhead here?
>
> Some batch freeing optimizations are introduced. And we have reworked
> the whole locking mechanism for swap, so even on a system with 96t the
> contention seems barely observable with common workloads.
>
> And another series is further reducing the contention and the overall
> overhead (24% faster freeing for phase 1):
> https://lore.kernel.org/linux-mm/20250905191357.78298-1-ryncsn@gmail.com/
>
> Will these be helpful for you? I think optimizing the root problem is
> better than just deferring the overhead with async workers, which may
> increase the overall overhead and complexity.

Hi Kairui

Thank you for your optimization suggestions. We believe your patch may
help our scenario. We'll try integrating it to evaluate the benefits.
However, it may not fully solve our issue. Below is our problem
description:

Flame graph of time distribution for TikTok process exit (~400MB swapped):
do_notify_resume         3.89%
get_signal               3.89%
do_signal_exit           3.88%
do_exit                  3.88%
mmput                    3.22%
exit_mmap                3.22%
unmap_vmas               3.08%
unmap_page_range         3.07%
free_swap_and_cache_nr   1.31%****
swap_entry_range_free    1.17%****
zram_slot_free_notify    1.11%****
zram_free_hw_entry_dc    0.43%
free_zspage[zsmalloc]    0.09%

CPU: 8-core ARM64 (14.21GHz+33.5GHz+4*2.7GHz), 12GB RAM

Process with ~400MB swap exit situation:
Exit takes 200-300ms, ~4% CPU load
With more zram compression/swap, exit time increases to 400-500ms
free_swap_and_cache_nr avg: 0.5ms, max: ~1.5ms (running time)
free_swap_and_cache_nr dominates exit time (33%, up to 50% in worst
cases). Most of the time is zram resource freeing (0.25ms per
operation). With dozens of simultaneous exits, the cumulative time
becomes significant.

Optimization approach:
The focus isn't on optimizing the hot functions (limited improvement
potential). The high load comes from too many simultaneous exits.
We'll make the time-consuming interfaces in do_exit asynchronous to
accelerate exit completion, while allowing non-swap pages
(file/anonymous) to be freed by other processes.

Camera startup scenario:
20-30 background apps, anonymous pages compressed to zram (200-500MB).
Camera launch triggers lmkd to kill 10+ apps - their exits consume 25%+
CPU. System services/third-party processes use 60%+ CPU, leaving camera
startup process CPU-starved and delayed.


Sincere wishes,
Lei


>
>
>> swap_entry_free         393.863          *****50%
>> swap_range_free         372.602           ****47%
>>
>> 4. Optimization Approach
>> a) For processes exceeding swap entry threshold: aggregate and isolate
>> swap entries to enable fast exit
>> b) Asynchronously release batched entries when isolation reaches
>> configured threshold
>>
>> 5. Performance Gains (User Scenario: Camera Cold Launch)
>> a) 74% reduction in process exit latency (>500ms cases)
>> b) ~4% lower peak CPU load during concurrent process exits
>> c) ~70MB additional free memory during camera preview initialization
>> d) 40% reduction in camera preview stuttering probability
>>
>> 6. Prior Art & Improvements
>> Reference: Zhiguo Jiang's patch
>> (https://lore.kernel.org/all/20240805153639.1057-1-justinjiang@vivo.com/)
>>
>> Key enhancements:
>> a) Reimplemented logic moved from mmu_gather.c to swapfile.c for clarity
>> b) Async release delegated to workqueue kworkers with configurable
>> max_active for NUMA-optimized concurrency
>>
>> Lei Liu (2):
>>    mm: swap: Gather swap entries and batch async release core
>>    mm: swap: Forced swap entries release under memory pressure
>>
>>   include/linux/oom.h           |  23 ++++++
>>   include/linux/swapfile.h      |   2 +
>>   include/linux/vm_event_item.h |   1 +
>>   kernel/exit.c                 |   2 +
>>   mm/memcontrol.c               |   6 --
>>   mm/memory.c                   |   4 +-
>>   mm/page_alloc.c               |   4 +
>>   mm/swapfile.c                 | 134 ++++++++++++++++++++++++++++++++++
>>   mm/vmstat.c                   |   1 +
>>   9 files changed, 170 insertions(+), 7 deletions(-)
>>
>> --
>> 2.34.1
>>
>>
Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
Posted by Chris Li 3 weeks, 2 days ago
On Tue, Sep 9, 2025 at 12:31 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Tue, Sep 9, 2025 at 3:04 PM Lei Liu <liulei.rjpt@vivo.com> wrote:
> >
>
> Hi Lei,
>
> > 1. Problem Scenario
> > On systems with ZRAM and swap enabled, simultaneous process exits create
> > contention. The primary bottleneck occurs during swap entry release
> > operations, causing exiting processes to monopolize CPU resources. This
> > leads to scheduling delays for high-priority processes.
> >
> > 2. Android Use Case
> > During camera launch, LMKD terminates background processes to free memory.
> > Exiting processes compete for CPU cycles, delaying the camera preview
> > thread and causing visible stuttering - directly impacting user
> > experience.
> >
> > 3. Root Cause Analysis
> > When background applications heavily utilize swap space, process exit
> > profiling reveals 55% of time spent in free_swap_and_cache_nr():
> >
> > Function              Duration (ms)   Percentage
> > do_signal               791.813     **********100%
> > do_group_exit           791.813     **********100%
> > do_exit                 791.813     **********100%
> > exit_mm                 577.859        *******73%
> > exit_mmap               577.497        *******73%
> > zap_pte_range           558.645        *******71%
> > free_swap_and_cache_nr  433.381          *****55%
> > free_swap_slot          403.568          *****51%
>
> Thanks for sharing this case.
>
> One problem is that now the free_swap_slot function no longer exists
> after 0ff67f990bd4. Have you tested the latest kernel? Or what is the
> actual overhead here?
>
> Some batch freeing optimizations are introduced. And we have reworked
> the whole locking mechanism for swap, so even on a system with 96t the
> contention seems barely observable with common workloads.
>
> And another series is further reducing the contention and the overall
> overhead (24% faster freeing for phase 1):
> https://lore.kernel.org/linux-mm/20250905191357.78298-1-ryncsn@gmail.com/
>
> Will these be helpful for you? I think optimizing the root problem is
> better than just deferring the overhead with async workers, which may
> increase the overall overhead and complexity.

+100.

Hi Lei,

This CC list is very long :-)

Is it similar to this one a while back?

https://lore.kernel.org/linux-mm/20240213-async-free-v3-1-b89c3cc48384@kernel.org/

I ultimately abandoned this approach and consider it harmful. Yes, I
can be as harsh as I like about my own previous bad ideas. The better
solution is what Kairui did: just remove the swap slot caching
completely. It is the harder path to take, but it gets better results.
I recall having a discussion with Kairui on this, and we are aligned
on removing the swap slot caching eventually. Thanks Kairui for the
heavy lifting of actually removing the swap slot cache. I am just
cheerleading on the side :-)

So no, we are not getting the async free of swap slot caching again.
We shouldn't need to.

Chris




>
>
> > swap_entry_free         393.863          *****50%
> > swap_range_free         372.602           ****47%
> >
> > 4. Optimization Approach
> > a) For processes exceeding swap entry threshold: aggregate and isolate
> > swap entries to enable fast exit
> > b) Asynchronously release batched entries when isolation reaches
> > configured threshold
> >
> > 5. Performance Gains (User Scenario: Camera Cold Launch)
> > a) 74% reduction in process exit latency (>500ms cases)
> > b) ~4% lower peak CPU load during concurrent process exits
> > c) ~70MB additional free memory during camera preview initialization
> > d) 40% reduction in camera preview stuttering probability
> >
> > 6. Prior Art & Improvements
> > Reference: Zhiguo Jiang's patch
> > (https://lore.kernel.org/all/20240805153639.1057-1-justinjiang@vivo.com/)
> >
> > Key enhancements:
> > a) Reimplemented logic moved from mmu_gather.c to swapfile.c for clarity
> > b) Async release delegated to workqueue kworkers with configurable
> > max_active for NUMA-optimized concurrency
> >
> > Lei Liu (2):
> >   mm: swap: Gather swap entries and batch async release core
> >   mm: swap: Forced swap entries release under memory pressure
> >
> >  include/linux/oom.h           |  23 ++++++
> >  include/linux/swapfile.h      |   2 +
> >  include/linux/vm_event_item.h |   1 +
> >  kernel/exit.c                 |   2 +
> >  mm/memcontrol.c               |   6 --
> >  mm/memory.c                   |   4 +-
> >  mm/page_alloc.c               |   4 +
> >  mm/swapfile.c                 | 134 ++++++++++++++++++++++++++++++++++
> >  mm/vmstat.c                   |   1 +
> >  9 files changed, 170 insertions(+), 7 deletions(-)
> >
> > --
> > 2.34.1
> >
> >
>
Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
Posted by Barry Song 3 weeks, 2 days ago
On Tue, Sep 9, 2025 at 3:30 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Tue, Sep 9, 2025 at 3:04 PM Lei Liu <liulei.rjpt@vivo.com> wrote:
> >
>
> Hi Lei,
>
> > 1. Problem Scenario
> > On systems with ZRAM and swap enabled, simultaneous process exits create
> > contention. The primary bottleneck occurs during swap entry release
> > operations, causing exiting processes to monopolize CPU resources. This
> > leads to scheduling delays for high-priority processes.
> >
> > 2. Android Use Case
> > During camera launch, LMKD terminates background processes to free memory.
> > Exiting processes compete for CPU cycles, delaying the camera preview
> > thread and causing visible stuttering - directly impacting user
> > experience.
> >
> > 3. Root Cause Analysis
> > When background applications heavily utilize swap space, process exit
> > profiling reveals 55% of time spent in free_swap_and_cache_nr():
> >
> > Function              Duration (ms)   Percentage
> > do_signal               791.813     **********100%
> > do_group_exit           791.813     **********100%
> > do_exit                 791.813     **********100%
> > exit_mm                 577.859        *******73%
> > exit_mmap               577.497        *******73%
> > zap_pte_range           558.645        *******71%
> > free_swap_and_cache_nr  433.381          *****55%
> > free_swap_slot          403.568          *****51%
>
> Thanks for sharing this case.
>
> One problem is that now the free_swap_slot function no longer exists
> after 0ff67f990bd4. Have you tested the latest kernel? Or what is the
> actual overhead here?
>
> Some batch freeing optimizations are introduced. And we have reworked
> the whole locking mechanism for swap, so even on a system with 96t the
> contention seems barely observable with common workloads.
>
> And another series is further reducing the contention and the overall
> overhead (24% faster freeing for phase 1):
> https://lore.kernel.org/linux-mm/20250905191357.78298-1-ryncsn@gmail.com/
>
> Will these be helpful for you? I think optimizing the root problem is
> better than just deferring the overhead with async workers, which may
> increase the overall overhead and complexity.
>

I feel the cover letter does not clearly describe where the bottleneck
occurs or where the performance gains originate. To be honest, even
the versions submitted last year did not present the bottleneck clearly.

For example, is this due to lock contention (in which case we would
need performance data to see how much CPU time is spent waiting for
locks), or simply because we can simultaneously zap present and
non-present PTEs?

Thanks
Barry
Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
Posted by Lei Liu 3 weeks, 1 day ago
On 2025/9/9 17:24, Barry Song wrote:
>
> On Tue, Sep 9, 2025 at 3:30 PM Kairui Song <ryncsn@gmail.com> wrote:
>> On Tue, Sep 9, 2025 at 3:04 PM Lei Liu <liulei.rjpt@vivo.com> wrote:
>> Hi Lei,
>>
>>> 1. Problem Scenario
>>> On systems with ZRAM and swap enabled, simultaneous process exits create
>>> contention. The primary bottleneck occurs during swap entry release
>>> operations, causing exiting processes to monopolize CPU resources. This
>>> leads to scheduling delays for high-priority processes.
>>>
>>> 2. Android Use Case
>>> During camera launch, LMKD terminates background processes to free memory.
>>> Exiting processes compete for CPU cycles, delaying the camera preview
>>> thread and causing visible stuttering - directly impacting user
>>> experience.
>>>
>>> 3. Root Cause Analysis
>>> When background applications heavily utilize swap space, process exit
>>> profiling reveals 55% of time spent in free_swap_and_cache_nr():
>>>
>>> Function              Duration (ms)   Percentage
>>> do_signal               791.813     **********100%
>>> do_group_exit           791.813     **********100%
>>> do_exit                 791.813     **********100%
>>> exit_mm                 577.859        *******73%
>>> exit_mmap               577.497        *******73%
>>> zap_pte_range           558.645        *******71%
>>> free_swap_and_cache_nr  433.381          *****55%
>>> free_swap_slot          403.568          *****51%
>> Thanks for sharing this case.
>>
>> One problem is that now the free_swap_slot function no longer exists
>> after 0ff67f990bd4. Have you tested the latest kernel? Or what is the
>> actual overhead here?
>>
>> Some batch freeing optimizations are introduced. And we have reworked
>> the whole locking mechanism for swap, so even on a system with 96t the
>> contention seems barely observable with common workloads.
>>
>> And another series is further reducing the contention and the overall
>> overhead (24% faster freeing for phase 1):
>> https://lore.kernel.org/linux-mm/20250905191357.78298-1-ryncsn@gmail.com/
>>
>> Will these be helpful for you? I think optimizing the root problem is
>> better than just deferring the overhead with async workers, which may
>> increase the overall overhead and complexity.
>>
> I feel the cover letter does not clearly describe where the bottleneck
> occurs or where the performance gains originate. To be honest, even
> the versions submitted last year did not present the bottleneck clearly.
>
> For example, is this due to lock contention (in which case we would
> need performance data to see how much CPU time is spent waiting for
> locks), or simply because we can simultaneously zap present and
> non-present PTEs?
>
> Thanks
> Barry

Hi Barry

Thank you for your question. Here is the issue we are encountering:

Flame graph of time distribution for douyin process exit (~400MB swapped):
do_notify_resume         3.89%
get_signal               3.89%
do_signal_exit           3.88%
do_exit                  3.88%
mmput                    3.22%
exit_mmap                3.22%
unmap_vmas               3.08%
unmap_page_range         3.07%
free_swap_and_cache_nr   1.31%****
swap_entry_range_free    1.17%****
zram_slot_free_notify    1.11%****
zram_free_hw_entry_dc    0.43%
free_zspage[zsmalloc]    0.09%

CPU: 8-core ARM64 (14.21GHz+33.5GHz+4*2.7GHz), 12GB RAM

Process with ~400MB swap exit situation:
Exit takes 200-300ms, ~4% CPU load
With more zram compression/swap, exit time increases to 400-500ms
free_swap_and_cache_nr avg: 0.5ms, max: ~1.5ms (running time)
free_swap_and_cache_nr dominates exit time (33%, up to 50% in worst
cases). Most of the time is zram resource freeing (0.25ms per
operation). With dozens of simultaneous exits, the cumulative time
becomes significant.

Optimization approach:
The focus isn't on optimizing the hot functions (limited improvement
potential). The high load comes from too many simultaneous exits.
We'll make the time-consuming interfaces in do_exit asynchronous to
accelerate exit completion, while allowing non-swap pages
(file/anonymous) to be freed by other processes.

Camera startup scenario:
20-30 background apps, anonymous pages compressed to zram (200-500MB).
Camera launch triggers lmkd to kill 10+ apps - their exits consume 25%+
CPU. System services/third-party processes use 60%+ CPU, leaving camera
startup process CPU-starved and delayed.

Sincere wishes,
Lei


Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
Posted by Chris Li 3 weeks, 2 days ago
On Tue, Sep 9, 2025 at 2:24 AM Barry Song <21cnbao@gmail.com> wrote:
> I feel the cover letter does not clearly describe where the bottleneck
> occurs or where the performance gains originate. To be honest, even
> the versions submitted last year did not present the bottleneck clearly.
>
> For example, is this due to lock contention (in which case we would
> need performance data to see how much CPU time is spent waiting for
> locks), or simply because we can simultaneously zap present and
> non-present PTEs?

I did some long-tail analysis of the zswap page fault a while back,
before zswap converted to xarray. For the zswap page fault, a good
chunk of the long tail is the batch freeing of swap slots. The
breakdown inside that shows a huge chunk is clear_shadow() followed by
memsw_uncharge(). I will post the link to the breakdown image once it
is available.

Chris
Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release
Posted by Chris Li 3 weeks, 2 days ago
On Tue, Sep 9, 2025 at 9:15 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Tue, Sep 9, 2025 at 2:24 AM Barry Song <21cnbao@gmail.com> wrote:
> > I feel the cover letter does not clearly describe where the bottleneck
> > occurs or where the performance gains originate. To be honest, even
> > the versions submitted last year did not present the bottleneck clearly.
> >
> > For example, is this due to lock contention (in which case we would
> > need performance data to see how much CPU time is spent waiting for
> > locks), or simply because we can simultaneously zap present and
> > non-present PTEs?
>
> I did some long-tail analysis of the zswap page fault a while back,
> before zswap converted to xarray. For the zswap page fault, a good
> chunk of the long tail is the batch freeing of swap slots. The
> breakdown inside that shows a huge chunk is clear_shadow() followed by
> memsw_uncharge(). I will post the link to the breakdown image once it
> is available.

Here is a graph; the high-level breakdown shows that batch freeing of
swap slots contributes to the long tail:
https://services.google.com/fh/files/misc/zswap-breakdown.png

The detailed breakdown inside the batch freeing of swap slots:
https://services.google.com/fh/files/misc/zswap-breakdown-detail.png

That data is pretty old, from before zswap used the xarray.

Now the batch freeing of swap entries is gone. I am wondering whether
the new kernel shows any bottleneck for Lei's zram test case.

Hi Lei, please report back with your new findings. With the removal of
the swap slot cache, the performance profile will likely be very
different. Let me know if you have difficulties running the latest
kernel on your test bench.

Chris