When a process is OOM killed without reaper delay, the OOM reaper and the
exit_mmap() thread are likely to run concurrently. They traverse the VMA
maple tree along the same path and can easily try to unmap the same vma,
competing for the pte spinlock.

When a process exits, exit_mmap() traverses the maple tree from low to
high addresses. To reduce the chance of both tasks unmapping the same vma
at the same time, make the OOM reaper traverse the tree from high to low
addresses.

Signed-off-by: zhongjinji <zhongjinji@honor.com>
---
mm/oom_kill.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4b4d73b1e00d..a0650da9ec9c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -516,7 +516,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
 	bool ret = true;
-	VMA_ITERATOR(vmi, mm, 0);
+	MA_STATE(mas, &mm->mm_mt, ULONG_MAX, 0);
 
 	/*
 	 * Tell all users of get_user/copy_from_user etc... that the content
@@ -526,7 +526,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
 	 */
 	set_bit(MMF_UNSTABLE, &mm->flags);
 
-	for_each_vma(vmi, vma) {
+	/*
+	 * When two tasks unmap the same vma at the same time, they may contend for
+	 * the pte spinlock. To reduce the probability of them unmapping the same
+	 * vma, the oom reaper traverses the vma maple tree in reverse order.
+	 */
+	while ((vma = mas_find_rev(&mas, 0)) != NULL) {
 		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
 			continue;

--
2.17.1
On Mon, Aug 25, 2025 at 09:38:55PM +0800, zhongjinji wrote:
> When a process is OOM killed without reaper delay, the OOM reaper and the
> exit_mmap() thread are likely to run concurrently. They traverse the VMA
> maple tree along the same path and can easily try to unmap the same vma,
> competing for the pte spinlock.
>
> When a process exits, exit_mmap() traverses the maple tree from low to
> high addresses. To reduce the chance of both tasks unmapping the same vma
> at the same time, make the OOM reaper traverse the tree from high to low
> addresses.
>
> Signed-off-by: zhongjinji <zhongjinji@honor.com>

I will leave it to Liam to confirm the maple tree bit is ok, but I guess
I'm softening to the idea of doing this - because it should have no impact
on most users, so even if it's some rare edge case that triggers the
situation, then it's worth doing it in reverse just to help you guys out :)

Liam - please confirm this is good from your side, and then I can add a tag!

Cheers, Lorenzo

[...]
> +	/*
> +	 * When two tasks unmap the same vma at the same time, they may contend for
> +	 * the pte spinlock. To reduce the probability of them unmapping the same
> +	 * vma, the oom reaper traverses the vma maple tree in reverse order.
> +	 */
> +	while ((vma = mas_find_rev(&mas, 0)) != NULL) {

It's a pity there isn't a nicer formulation of this but this is probably
the least worst way of doing it.

>  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
>  			continue;
>
> --
> 2.17.1
On Tue 26-08-25 13:53:43, Lorenzo Stoakes wrote:
> On Mon, Aug 25, 2025 at 09:38:55PM +0800, zhongjinji wrote:
[...]
> I will leave it to Liam to confirm the maple tree bit is ok, but I guess
> I'm softening to the idea of doing this - because it should have no impact
> on most users, so even if it's some rare edge case that triggers the
> situation, then it's worth doing it in reverse just to help you guys out :)

I tend to agree on this. I would expect that oom_reaper would race with
exit_mmap only for seriously stalled processes or huge memory consumers
where exit_mmap takes a while. oom_reaper would be quick to catch up with
exit_mmap as it would have less work to do on already released memory.
Going from the other end of the address space might reduce this chance
while not really introducing any actual issues.
--
Michal Hocko
SUSE Labs
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250826 08:53]:
> On Mon, Aug 25, 2025 at 09:38:55PM +0800, zhongjinji wrote:
> > When a process is OOM killed without reaper delay, the OOM reaper and the
> > exit_mmap() thread are likely to run concurrently. They traverse the VMA
> > maple tree along the same path and can easily try to unmap the same vma,
> > competing for the pte spinlock.
> >
> > When a process exits, exit_mmap() traverses the maple tree from low to
> > high addresses. To reduce the chance of both tasks unmapping the same vma
> > at the same time, make the OOM reaper traverse the tree from high to low
> > addresses.
> >
> > Signed-off-by: zhongjinji <zhongjinji@honor.com>
>
> I will leave it to Liam to confirm the maple tree bit is ok, but I guess
> I'm softening to the idea of doing this - because it should have no impact
> on most users, so even if it's some rare edge case that triggers the
> situation, then it's worth doing it in reverse just to help you guys out :)

I really don't think this is worth doing.  We're avoiding a race between
oom and a task unmap - the MMF bits should be used to avoid this race -
or at least mitigate it.

They are probably both under the read lock, but considering how rare it
would be, would a racy flag check be enough - it is hardly critical to
get right.  Either would reduce the probability.

> Liam - please confirm this is good from your side, and then I can add a tag!
>
> Cheers, Lorenzo
>
[...]
> > -	VMA_ITERATOR(vmi, mm, 0);
> > +	MA_STATE(mas, &mm->mm_mt, ULONG_MAX, 0);
                                  ^^^^^^^^^  ^^

You have set the index larger than the last.  It (probably?) works, but
isn't correct and may stop working, so let's fix it.

MA_STATE(mas, &mm->mm_mt, ULONG_MAX, ULONG_MAX);

[...]
> > +	while ((vma = mas_find_rev(&mas, 0)) != NULL) {
>
> It's a pity there isn't a nicer formulation of this but this is probably
> the least worst way of doing it.

mas_for_each_rev() exists for this use case.  You will find that the
implementation is very close to what you see here. :)

> >  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
> >  			continue;
> >
> > --
> > 2.17.1
> >
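
[Editor's note: pulling Liam's two suggestions together - initialising
index and last both to ULONG_MAX, and using the existing helper - the
reworked loop would presumably look something like this untested sketch;
it is not code posted in this thread:

	MA_STATE(mas, &mm->mm_mt, ULONG_MAX, ULONG_MAX);
	...
	/* walk the maple tree high-to-low, down to address 0 */
	mas_for_each_rev(&mas, vma, 0) {
		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
			continue;
		/* reap the vma as the existing loop body does */
	}
]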
On Tue, Aug 26, 2025 at 09:37:22AM -0400, Liam R. Howlett wrote:
> I really don't think this is worth doing.  We're avoiding a race between
> oom and a task unmap - the MMF bits should be used to avoid this race -
> or at least mitigate it.

Yes for sure, as explored at length in previous discussions this feels like
we're papering over cracks here.

_However_, I'm sort of ok with a minimalistic fix that solves the proximate
issue even if it is that, as long as it doesn't cause issues in doing so.

So this is my take on the below and why I'm open to it!

> They are probably both under the read lock, but considering how rare it
> would be, would a racy flag check be enough - it is hardly critical to
> get right.  Either would reduce the probability.

zhongjinji - I'm still not sure that you've really indicated _why_ you're
seeing such a tight and unusual race. Presumably some truly massive number
of tasks being OOM'd and unmapping but... yeah that seems odd anyway.

But again, if we can safely fix this in a way that doesn't hurt stuff too
much I'm ok with it (of course, these are famous last words in the kernel
often...!)

Liam - are you open to a solution on the basis above, or do you feel we
ought simply to fix the underlying issue here?

To me we're at a simple enough implementation of this (esp. utilising the
helper you mention) that probably this is fine (like the meme, or...
hopefully not :)

I will go with your judgment here!

Cheers, Lorenzo
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250826 09:50]:
> On Tue, Aug 26, 2025 at 09:37:22AM -0400, Liam R. Howlett wrote:
> > I really don't think this is worth doing.  We're avoiding a race between
> > oom and a task unmap - the MMF bits should be used to avoid this race -
> > or at least mitigate it.
>
> Yes for sure, as explored at length in previous discussions this feels like
> we're papering over cracks here.
>
> _However_, I'm sort of ok with a minimalistic fix that solves the proximate
> issue even if it is that, as long as it doesn't cause issues in doing so.
>
> So this is my take on the below and why I'm open to it!
>
> > They are probably both under the read lock, but considering how rare it
> > would be, would a racy flag check be enough - it is hardly critical to
> > get right.  Either would reduce the probability.
>
> zhongjinji - I'm still not sure that you've really indicated _why_ you're
> seeing such a tight and unusual race. Presumably some truly massive number
> of tasks being OOM'd and unmapping but... yeah that seems odd anyway.
>
> But again, if we can safely fix this in a way that doesn't hurt stuff too
> much I'm ok with it (of course, these are famous last words in the kernel
> often...!)
>
> Liam - are you open to a solution on the basis above, or do you feel we
> ought simply to fix the underlying issue here?

At least this is a benign race.  I'd think using MMF_ to reduce the race
would achieve the same goal with less risk - which is why I bring it up.

Really, both methods should be low risk, so I'm fine with either way.

But I am interested in hearing how this race is happening enough to
necessitate a fix.  Reversing the iterator is a one-spot fix - if this
happens elsewhere then we're out of options.  Using the MMF_ flags is
more of a scalable fix, if it achieves the same results.

> To me we're at a simple enough implementation of this (esp. utilising the
> helper you mention) that probably this is fine (like the meme, or...
> hopefully not :)
>
> I will go with your judgment here!
>
> Cheers, Lorenzo
On Tue, Aug 26, 2025 at 11:21:13AM -0400, Liam R. Howlett wrote:
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250826 09:50]:
...
>
> At least this is a benign race.

Is this really a race or rather contention? IIUC exit_mmap and the oom
reaper are trying to unmap the address space of the oom-killed process
and can compete on page table locks. If both are running concurrently on
two cpus then the contention can continue for the whole address space and
can slow down the actual memory freeing. Making the oom reaper traverse in
the opposite direction can drastically reduce the contention and speed up
memory freeing.

> I'd think using MMF_ to reduce the race
> would achieve the same goal with less risk - which is why I bring it up.

With the MMF_ flag, are you suggesting the oom reaper skip unmapping the
oom-killed process?

> Really, both methods should be low risk, so I'm fine with either way.
>
> But I am interested in hearing how this race is happening enough to
> necessitate a fix.  Reversing the iterator is a one-spot fix - if this
> happens elsewhere then we're out of options.  Using the MMF_ flags is
> more of a scalable fix, if it achieves the same results.

On the question of whether this is a rare situation worth the patch: I
would say this scenario is not that rare, particularly on low memory
devices and on highly utilized, overcommitted systems. Memory pressure
and oom-kills are the norm on such systems. The point of the oom reaper
is to bring the system out of the oom situation quickly, and having two
cpus unmapping the oom-killed process can potentially bring the system
out of the oom situation faster.

I think the patch (with your suggestions) is simple enough and I don't
see any risk in including it.
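
[Editor's note: a toy user-space model of the behaviour Shakeel and Liam
are debating may help readers outside mm. Two workers sweep one array of
locks standing in for an address space's split pte locks; run them in the
same direction and they can collide on lock after lock, run them in
opposite directions and they contend only near the crossing point. All
names and sizes here are invented, and real pte locks differ in detail:

	/* contention-model.c - illustrative only, NOT kernel code.
	 * Build: gcc -O2 -pthread contention-model.c -o contention-model
	 */
	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdio.h>

	#define NLOCKS 65536

	static pthread_mutex_t locks[NLOCKS];
	static atomic_long contended;	/* sweeps that found a lock busy */

	static void sweep(int forward)
	{
		for (int n = 0; n < NLOCKS; n++) {
			int i = forward ? n : NLOCKS - 1 - n;

			/* trylock first so contention can be counted */
			if (pthread_mutex_trylock(&locks[i]) != 0) {
				atomic_fetch_add(&contended, 1);
				pthread_mutex_lock(&locks[i]);
			}
			/* real code would unmap pages here */
			pthread_mutex_unlock(&locks[i]);
		}
	}

	static void *fwd(void *arg) { (void)arg; sweep(1); return NULL; }
	static void *rev(void *arg) { (void)arg; sweep(0); return NULL; }

	int main(void)
	{
		pthread_t exiter, reaper;

		for (int i = 0; i < NLOCKS; i++)
			pthread_mutex_init(&locks[i], NULL);

		pthread_create(&exiter, NULL, fwd, NULL);
		/* swap rev for fwd below to model the same-direction case */
		pthread_create(&reaper, NULL, rev, NULL);
		pthread_join(exiter, NULL);
		pthread_join(reaper, NULL);

		printf("contended acquisitions: %ld\n", atomic_load(&contended));
		return 0;
	}
]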
+ Cc Suren since he has worked on the exit_mmap() path a lot.

* Shakeel Butt <shakeel.butt@linux.dev> [250826 18:26]:
> On Tue, Aug 26, 2025 at 11:21:13AM -0400, Liam R. Howlett wrote:
...
> > At least this is a benign race.
>
> Is this really a race or rather contention? IIUC exit_mmap and the oom
> reaper are trying to unmap the address space of the oom-killed process
> and can compete on page table locks. If both are running concurrently on
> two cpus then the contention can continue for the whole address space and
> can slow down the actual memory freeing. Making the oom reaper traverse in
> the opposite direction can drastically reduce the contention and speed up
> memory freeing.

It is two readers of the vma tree racing to lock the page tables for
each vma, so I guess you can see it as contention as well.. but since
the pte is a split lock, I see it as racing through vmas to see who hits
which lock first.  The smart money is on the oom killer as it skips some
vmas :)

If it were just contention, then the loop direction wouldn't matter..
but I do see your point.

> > I'd think using MMF_ to reduce the race
> > would achieve the same goal with less risk - which is why I bring it up.
>
> With the MMF_ flag, are you suggesting the oom reaper skip unmapping the
> oom-killed process?

Yes, specifically move the MMF_OOM_SKIP flag to earlier in the exit
path to reduce the possibility of the race/contention.

> > Really, both methods should be low risk, so I'm fine with either way.
> >
> > But I am interested in hearing how this race is happening enough to
> > necessitate a fix.  Reversing the iterator is a one-spot fix - if this
> > happens elsewhere then we're out of options.  Using the MMF_ flags is
> > more of a scalable fix, if it achieves the same results.
>
> On the question of whether this is a rare situation worth the patch: I
> would say this scenario is not that rare, particularly on low memory
> devices and on highly utilized, overcommitted systems. Memory pressure
> and oom-kills are the norm on such systems. The point of the oom reaper
> is to bring the system out of the oom situation quickly, and having two
> cpus unmapping the oom-killed process can potentially bring the system
> out of the oom situation faster.

The exit_mmap() path used to run the oom reaper if it was an oom victim,
until recently [1].  The part that makes me nervous is the exit_mmap()
call to mmu_notifier_release(mm), while the oom reaper uses an
mmu_notifier.  I am not sure if there is an issue in ordering on any of
the platforms of such things.  Or the associated cost of the calls.

I mean, it's already pretty crazy that we have this in the exit:
	mmap_read_lock()
	tlb_gather_mmu_fullmm()
	unmap vmas..
	mmap_read_unlock()
	mmap_write_lock()
	tlb_finish_mmu()..

So not only do we now have two tasks iterating over the vmas, but we
also have mmu notifiers and tlb calls happening across the ranges..  At
least doing all the work on a single cpu means that the hardware view is
consistent.  But I don't see this being worse than a forward race?

As it is written here, we'll have one CPU working in one direction while
the other works in the other, until both hit the end of the VMAs.  Only
when both tasks stop iterating the vmas can the exit continue since it
requires the write lock.

So the tlb_finish_mmu() in exit_mmap() will always be called after
tlb_finish_mmu() on each individual vma has run in the
__oom_reap_task_mm() context (when the race happens).

There is also a window here, between exit_mmap() dropping the read
lock, setting MMF_OOM_SKIP, and taking the write lock - where the oom
reaper will iterate through a list of vmas with zero memory to free and
delay the task exiting.  That is, wasting cpu and stopping the memory
associated with the mm_struct (vmas and such) from being freed.

I'm also not sure about the cpu cache effects of what we are doing and
how much that would play into the speedup.  My guess is that it's
insignificant compared to the time we spend under the pte, but we have
no numbers to go on.

So I'd like to know how likely the simultaneous runs are and if there is
a measurable gain?

I agree that, at face value, two cpus should be able to split the work..
but I don't know about the notifier or the holding up of the mm_struct
associated memory.  And it could slow things down by holding up an
exiting task.

> I think the patch (with your suggestions) is simple enough and I don't
> see any risk in including it.

Actually, the more I look at this, the worse I feel about it.. Am I
overreacting?

Thanks,
Liam

[1] https://elixir.bootlin.com/linux/v6.0.19/source/mm/mmap.c#L3085
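
[Editor's note: a simplified sketch of the window Liam describes,
reconstructed from his sequence above rather than from the actual
exit_mmap() source:

	mmap_read_lock(mm);
	tlb_gather_mmu_fullmm(&tlb, mm);
	unmap_vmas(...);		/* reaper may sweep concurrently */
	mmap_read_unlock(mm);

	/*
	 * Window: the reaper can still take the read lock here and walk
	 * vmas whose memory is already gone, delaying the exit below.
	 */
	set_bit(MMF_OOM_SKIP, &mm->flags);

	mmap_write_lock(mm);		/* waits out the reaper's read lock */
	tlb_finish_mmu(&tlb);
]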
On Wed 27-08-25 00:12:34, Liam R. Howlett wrote:
[...]
> > > I'd think using MMF_ to reduce the race
> > > would achieve the same goal with less risk - which is why I bring it up.
> >
> > With the MMF_ flag, are you suggesting the oom reaper skip unmapping
> > the oom-killed process?
>
> Yes, specifically move the MMF_OOM_SKIP flag to earlier in the exit
> path to reduce the possibility of the race/contention.

MMF_OOM_SKIP is a "hide from oom killer" flag, and that means that as
soon as you set it, the next oom killer invocation would kill another
task while the current one hasn't freed up the memory yet, and therefore
you have a potentially pointless kill.
--
Michal Hocko
SUSE Labs
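
[Editor's note: the select-another-victim behaviour Michal describes
comes from the victim scan in mm/oom_kill.c; roughly, and paraphrased
rather than quoted verbatim:

	/* while evaluating tasks as oom-kill candidates */
	if (tsk_is_oom_victim(task)) {
		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
			goto next;	/* victim hidden: pick a new victim */
		goto abort;		/* victim not reaped yet: no new kill */
	}

Setting MMF_OOM_SKIP earlier therefore widens the window in which a
second, possibly pointless, kill can happen.]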
> + Cc Suren since he has worked on the exit_mmap() path a lot.

Thank you for your assistance. I realize now that I should have
Cc'd Suren earlier.

> * Shakeel Butt <shakeel.butt@linux.dev> [250826 18:26]:
...
>
> So I'd like to know how likely the simultaneous runs are and if there is
> a measurable gain?

Since process killing events are very frequent on Android, the likelihood
of exit_mmap and reaper work (not only OOM, but also some proactive
reaping actions such as process_mrelease) occurring at the same time is
much higher. When lmkd kills a process, it actively reaps the process
using process_mrelease, similar to the way the OOM reaper works. Surenb
may be able to clarify this point, as he is the author of lmkd.

I referenced this data in link [1], and I should have included it in the
cover letter. The attached test data was collected on Android. Before
testing, in order to simulate the OOM kill process, I intercepted the
kill signal and added the killed process to the OOM reaper queue.

The reproduction steps are as follows:
1. Start a process.
2. Kill the process.
3. Capture a perfetto trace.

Below are the load benefit data, measured by process running time:

Note: #RxComputationT, vdp:vidtask:m, and tp-background are threads of
the same process, and they are the last threads to exit.

Thread           TID    State    Wall duration (ms)  Total running (ms)
# with oom reaper, traversing in reverse
#RxComputationT  13708  Running  60.690572
oom_reaper       81     Running  46.492032           107.182604

# with oom reaper
vdp:vidtask:m    14040  Running  81.848297
oom_reaper       81     Running  69.32               151.168297

# without oom reaper
tp-background    12424  Running  106.021874          106.021874

Compared to reaping when a process is killed (forward traversal), this
provides approximately a 30% load benefit (151.17 ms -> 107.18 ms total
running time). Compared to not reaping at all when a process is killed,
it releases memory earlier, with roughly 40% faster memory release
(~60.7 ms vs ~106.0 ms before the last thread finishes).

[1] https://lore.kernel.org/all/20250815163207.7078-1-zhongjinji@honor.com/

> I agree that, at face value, two cpus should be able to split the work..
> but I don't know about the notifier or the holding up of the mm_struct
> associated memory.  And it could slow things down by holding up an
> exiting task.
>
> > I think the patch (with your suggestions) is simple enough and I don't
> > see any risk in including it.
>
> Actually, the more I look at this, the worse I feel about it.. Am I
> overreacting?
>
> Thanks,
> Liam
>
> [1] https://elixir.bootlin.com/linux/v6.0.19/source/mm/mmap.c#L3085
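
[Editor's note: for readers unfamiliar with the lmkd flow described
above, a minimal user-space sketch of "kill, then proactively reap" via
process_mrelease(2), available since Linux 5.15. Error handling is
trimmed and the helper name is invented:

	#include <sys/syscall.h>
	#include <sys/types.h>
	#include <signal.h>
	#include <stddef.h>
	#include <unistd.h>

	/* roughly what lmkd does after choosing a victim */
	static void kill_and_reap(pid_t pid)
	{
		int pidfd = syscall(SYS_pidfd_open, pid, 0);

		if (pidfd < 0)
			return;
		syscall(SYS_pidfd_send_signal, pidfd, SIGKILL, NULL, 0);
		syscall(SYS_process_mrelease, pidfd, 0);  /* flags must be 0 */
		close(pidfd);
	}
]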
On Wed, Aug 27, 2025 at 2:55 AM zhongjinji <zhongjinji@honor.com> wrote:
>
> > + Cc Suren since he has worked on the exit_mmap() path a lot.
>
> Thank you for your assistance. I realize now that I should have
> Cc'd Suren earlier.

Thanks for adding me!

...

> > The exit_mmap() path used to run the oom reaper if it was an oom
> > victim, until recently [1].  The part that makes me nervous is the
> > exit_mmap() call to mmu_notifier_release(mm), while the oom reaper
> > uses an mmu_notifier.  I am not sure if there is an issue in ordering
> > on any of the platforms of such things.  Or the associated cost of
> > the calls.
> >
> > I mean, it's already pretty crazy that we have this in the exit:
> >	mmap_read_lock()
> >	tlb_gather_mmu_fullmm()
> >	unmap vmas..
> >	mmap_read_unlock()
> >	mmap_write_lock()
> >	tlb_finish_mmu()..
> >
> > So not only do we now have two tasks iterating over the vmas, but we
> > also have mmu notifiers and tlb calls happening across the ranges..
> > At least doing all the work on a single cpu means that the hardware
> > view is consistent.  But I don't see this being worse than a forward
> > race?

This part seems to have changed quite a bit since I last looked into
it closely and it's worth re-checking, however that seems orthogonal
to what this patch is trying to do.

> > As it is written here, we'll have one CPU working in one direction
> > while the other works in the other, until both hit the end of the
> > VMAs.  Only when both tasks stop iterating the vmas can the exit
> > continue since it requires the write lock.
> >
> > So the tlb_finish_mmu() in exit_mmap() will always be called after
> > tlb_finish_mmu() on each individual vma has run in the
> > __oom_reap_task_mm() context (when the race happens).

Correct.

> > There is also a window here, between exit_mmap() dropping the read
> > lock, setting MMF_OOM_SKIP, and taking the write lock - where the oom
> > reaper will iterate through a list of vmas with zero memory to free
> > and delay the task exiting.  That is, wasting cpu and stopping the
> > memory associated with the mm_struct (vmas and such) from being freed.

Might be an opportunity to optimize but again, this is happening with
or without this patch, no?

> > I'm also not sure about the cpu cache effects of what we are doing
> > and how much that would play into the speedup.  My guess is that it's
> > insignificant compared to the time we spend under the pte, but we
> > have no numbers to go on.
> >
> > So I'd like to know how likely the simultaneous runs are and if there
> > is a measurable gain?
>
> Since process killing events are very frequent on Android, the likelihood
> of exit_mmap and reaper work (not only OOM, but also some proactive
> reaping actions such as process_mrelease) occurring at the same time is
> much higher. When lmkd kills a process, it actively reaps the process
> using process_mrelease, similar to the way the OOM reaper works. Surenb
> may be able to clarify this point, as he is the author of lmkd.

Yes, on Android process_mrelease() is used after lmkd kills a process
to expedite memory release in case the victim process is blocked for
some reason. This makes the race between __oom_reap_task_mm() and
exit_mmap() much more frequent. That is probably the disconnect in
thinking that this race is rare. I don't see any harm in
__oom_reap_task_mm() walking the tree backwards to minimize the
contention with exit_mmap(). Liam, is there a performance difference
between mas_for_each_rev() and mas_for_each()?

> I referenced this data in link [1], and I should have included it in the
> cover letter. The attached test data was collected on Android.
>
...
>
> Compared to reaping when a process is killed (forward traversal), this
> provides approximately a 30% load benefit (151.17 ms -> 107.18 ms total
> running time). Compared to not reaping at all when a process is killed,
> it releases memory earlier, with roughly 40% faster memory release
> (~60.7 ms vs ~106.0 ms before the last thread finishes).

That looks like a nice performance improvement for reaping the memory
with low churn and risk.

> [1] https://lore.kernel.org/all/20250815163207.7078-1-zhongjinji@honor.com/
* Suren Baghdasaryan <surenb@google.com> [250827 11:57]:

...

> > The exit_mmap() path used to run the oom reaper if it was an oom
> > victim, until recently [1]. ...
>
> This part seems to have changed quite a bit since I last looked into
> it closely and it's worth re-checking, however that seems orthogonal
> to what this patch is trying to do.

I was concerned about how a reverse iterator may affect what is
considered accurate for the mmu/tlb so I thought it worth pointing out.

...

> > There is also a window here, between exit_mmap() dropping the read
> > lock, setting MMF_OOM_SKIP, and taking the write lock - where the oom
> > reaper will iterate through a list of vmas with zero memory to free
> > and delay the task exiting. ...
>
> Might be an opportunity to optimize but again, this is happening with
> or without this patch, no?

Correct, but with the numbers it looks to be better to go with two loops.

...

> Yes, on Android process_mrelease() is used after lmkd kills a process
> to expedite memory release in case the victim process is blocked for
> some reason. This makes the race between __oom_reap_task_mm() and
> exit_mmap() much more frequent. That is probably the disconnect in
> thinking that this race is rare. I don't see any harm in
> __oom_reap_task_mm() walking the tree backwards to minimize the
> contention with exit_mmap(). Liam, is there a performance difference
> between mas_for_each_rev() and mas_for_each()?

There should be no performance difference.

> > I referenced this data in link [1], and I should have included it in
> > the cover letter. The attached test data was collected on Android.

Sorry I missed your response in v4 on this.

...

> > Compared to reaping when a process is killed (forward traversal), this
> > provides approximately a 30% load benefit (151.17 ms -> 107.18 ms total
> > running time). Compared to not reaping at all when a process is killed,
> > it releases memory earlier, with roughly 40% faster memory release
> > (~60.7 ms vs ~106.0 ms before the last thread finishes).
>
> That looks like a nice performance improvement for reaping the memory
> with low churn and risk.

Agreed.  Please include the numbers in the change log in the next
revision so it is recorded in git.

I think all my questions are resolved, thanks.  I look forward to v6.

Regards,
Liam
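
[Editor's note: the symmetry behind "no performance difference" is
visible in the iterator helpers themselves; paraphrased from
include/linux/maple_tree.h, both are the same thin wrapper over the
corresponding find call:

	#define mas_for_each(__mas, __entry, __max) \
		while (((__entry) = mas_find((__mas), (__max))) != NULL)

	#define mas_for_each_rev(__mas, __entry, __min) \
		while (((__entry) = mas_find_rev((__mas), (__min))) != NULL)
]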
..actually add Suren this time.

* Liam R. Howlett <Liam.Howlett@oracle.com> [250827 00:12]:
> + Cc Suren since he has worked on the exit_mmap() path a lot.
>
> [rest of the message quoted unchanged; see the same mail earlier in
> the thread]