From: zhongjinji <zhongjinji@honor.com>
When a process is OOM killed, the OOM reaper and the thread running
exit_mmap() may execute at the same time, and both traverse the vma's
maple tree along the same path. They can easily unmap the same vma and
contend for the pte spinlock, adding unnecessary load and increasing
the execution time of both the OOM reaper and the thread running
exit_mmap().

When a process exits, exit_mmap() traverses the vma's maple tree from
low to high address. To reduce the chance of unmapping the same vma at
the same time, the OOM reaper should traverse the tree from high to
low address. This reduces lock contention when they unmap the same vma.
Signed-off-by: zhongjinji <zhongjinji@honor.com>
---
 include/linux/mm.h | 3 +++
 mm/oom_kill.c      | 9 +++++++--
2 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0c44bb8ce544..b665ea3c30eb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -923,6 +923,9 @@ static inline void vma_iter_set(struct vma_iterator *vmi, unsigned long addr)
 #define for_each_vma_range(__vmi, __vma, __end)			\
 	while (((__vma) = vma_find(&(__vmi), (__end))) != NULL)
 
+#define for_each_vma_reverse(__vmi, __vma) \
+	while (((__vma) = vma_prev(&(__vmi))) != NULL)
+
 #ifdef CONFIG_SHMEM
 /*
  * The vma_is_shmem is not inline because it is used only by slow
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7ae4001e47c1..602d6836098a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -517,7 +517,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
 	bool ret = true;
-	VMA_ITERATOR(vmi, mm, 0);
+	VMA_ITERATOR(vmi, mm, ULONG_MAX);
 
 	/*
 	 * Tell all users of get_user/copy_from_user etc... that the content
@@ -527,7 +527,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
 	 */
 	set_bit(MMF_UNSTABLE, &mm->flags);
 
-	for_each_vma(vmi, vma) {
+	/*
+	 * When two tasks unmap the same vma at the same time, they may contend
+	 * for the pte spinlock. To avoid traversing the same vma as exit_mmap()
+	 * does, traverse the vma maple tree in reverse order.
+	 */
+	for_each_vma_reverse(vmi, vma) {
 		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
 			continue;
 
--
2.17.1
* zhongjinji@honor.com <zhongjinji@honor.com> [250814 09:56]:
> From: zhongjinji <zhongjinji@honor.com>
>
> When a process is OOM killed, the OOM reaper and the thread running
> exit_mmap() may execute at the same time, and both traverse the vma's
> maple tree along the same path. They can easily unmap the same vma and
> contend for the pte spinlock, adding unnecessary load and increasing
> the execution time of both the OOM reaper and the thread running
> exit_mmap().
>
> When a process exits, exit_mmap() traverses the vma's maple tree from
> low to high address. To reduce the chance of unmapping the same vma at
> the same time, the OOM reaper should traverse the tree from high to
> low address. This reduces lock contention when they unmap the same vma.
>
> Signed-off-by: zhongjinji <zhongjinji@honor.com>
> ---
>  include/linux/mm.h | 3 +++
>  mm/oom_kill.c      | 9 +++++++--
>  2 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0c44bb8ce544..b665ea3c30eb 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -923,6 +923,9 @@ static inline void vma_iter_set(struct vma_iterator *vmi, unsigned long addr)
>  #define for_each_vma_range(__vmi, __vma, __end) \
>  	while (((__vma) = vma_find(&(__vmi), (__end))) != NULL)
>
> +#define for_each_vma_reverse(__vmi, __vma) \
> +	while (((__vma) = vma_prev(&(__vmi))) != NULL)
> +

This does not do what you think it does, nor does it do what others
will think it will do. It's not the opposite of the
for_each_vma_range() above.

vma_find() calls mas_find(), which has a different meaning than
mas_next(). mas_find()'s behaviour is a hold-over from the vma_find()
of yesteryears: it will find the first entry at the address (if it's
the first time called) or the entry after it.

mas_prev() is trying to replace the linked list behaviour of "go to
the previous one", so it'll walk to the index you specified and go to
the previous one. It will skip the index you passed in regardless of
its existence or not.

So what you have here is a broken interface, you just don't see it
with your code because you don't happen to have a mapping at
ULONG_MAX.

This should not be merged as-is.

Also, there was zero mention of the new interface in the subject, so I
almost missed this being added.

>  #ifdef CONFIG_SHMEM
>  /*
>   * The vma_is_shmem is not inline because it is used only by slow
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 7ae4001e47c1..602d6836098a 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -517,7 +517,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
>  {
>  	struct vm_area_struct *vma;
>  	bool ret = true;
> -	VMA_ITERATOR(vmi, mm, 0);
> +	VMA_ITERATOR(vmi, mm, ULONG_MAX);
>
>  	/*
>  	 * Tell all users of get_user/copy_from_user etc... that the content
> @@ -527,7 +527,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
>  	 */
>  	set_bit(MMF_UNSTABLE, &mm->flags);
>
> -	for_each_vma(vmi, vma) {
> +	/*
> +	 * When two tasks unmap the same vma at the same time, they may contend
> +	 * for the pte spinlock. To avoid traversing the same vma as exit_mmap()
> +	 * does, traverse the vma maple tree in reverse order.
> +	 */
> +	for_each_vma_reverse(vmi, vma) {

How is this possible? Both need the same lock..? What part of
exit_mmap() will race here?

Why aren't we using the MMF_UNSTABLE flag set above to avoid it? Or
the MMF_OOM_SKIP?

>  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
>  			continue;
>
> --
> 2.17.1
>
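
To make the asymmetry described above concrete, here is a minimal
sketch of the two walks (illustrative only, not from the thread;
visit() is a hypothetical placeholder, and the mas_find()/mas_prev()
semantics are as Liam describes):

	struct vm_area_struct *vma;

	/* Forward: vma_find() -> mas_find() returns the entry *at* the
	 * start index on the first call, so a VMA spanning address 0
	 * is visited. */
	VMA_ITERATOR(vmi_fwd, mm, 0);
	for_each_vma(vmi_fwd, vma)
		visit(vma);		/* first VMA included */

	/* Reverse via vma_prev() -> mas_prev(): walks to the start
	 * index and returns the entry *before* it, so a VMA covering
	 * ULONG_MAX is silently skipped on the first call. */
	VMA_ITERATOR(vmi_rev, mm, ULONG_MAX);
	while ((vma = vma_prev(&vmi_rev)) != NULL)
		visit(vma);		/* a VMA at ULONG_MAX is missed */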
* Liam R. Howlett <Liam.Howlett@oracle.com> [250815 10:41]:
> * zhongjinji@honor.com <zhongjinji@honor.com> [250814 09:56]:
> > From: zhongjinji <zhongjinji@honor.com>

...

> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 7ae4001e47c1..602d6836098a 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -517,7 +517,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> >  {
> >  	struct vm_area_struct *vma;
> >  	bool ret = true;
> > -	VMA_ITERATOR(vmi, mm, 0);
> > +	VMA_ITERATOR(vmi, mm, ULONG_MAX);
> >
> >  	/*
> >  	 * Tell all users of get_user/copy_from_user etc... that the content
> > @@ -527,7 +527,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> >  	 */
> >  	set_bit(MMF_UNSTABLE, &mm->flags);
> >
> > -	for_each_vma(vmi, vma) {
> > +	/*
> > +	 * When two tasks unmap the same vma at the same time, they may contend
> > +	 * for the pte spinlock. To avoid traversing the same vma as exit_mmap()
> > +	 * does, traverse the vma maple tree in reverse order.
> > +	 */
> > +	for_each_vma_reverse(vmi, vma) {
>
> How is this possible? Both need the same lock..? What part of
> exit_mmap() will race here?

I see, exit_mmap() and the oom both use unmap_page_range() under the
mmap read lock, so they can race but since they'll contend on the pte
lock it doesn't really matter.

This is so rare, I don't think this is worth fixing.

Thanks,
Liam
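
For context, the two racing paths have roughly this shape (a
simplified sketch of the calls named above; the real
exit_mmap()/__oom_reap_task_mm() code differs in detail and this
fragment is schematic, not compilable as-is):

	/* Exiting task, exit_mmap(), simplified: */
	mmap_read_lock(mm);
	unmap_vmas(&tlb, ...);		/* walks VMAs low -> high and calls
					 * unmap_page_range() on each */
	mmap_read_unlock(mm);

	/* oom_reaper, __oom_reap_task_mm(), also under the read lock: */
	for_each_vma(vmi, vma)
		unmap_page_range(&tlb, vma, ...);	/* same PTEs, hence
							 * pte lock contention */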
On Thu, Aug 14, 2025 at 09:55:55PM +0800, zhongjinji@honor.com wrote:
> From: zhongjinji <zhongjinji@honor.com>
>
> When a process is OOM killed, the OOM reaper and the thread running
> exit_mmap() may execute at the same time, and both traverse the vma's
> maple tree along the same path. They can easily unmap the same vma and
> contend for the pte spinlock, adding unnecessary load and increasing
> the execution time of both the OOM reaper and the thread running
> exit_mmap().

You're not giving any numbers, and this seems pretty niche, you really
exiting that many processes with the reaper running at the exact same
time that this is an issue? Waiting on a spinlock also?

This commit message is very unconvincing.

>
> When a process exits, exit_mmap() traverses the vma's maple tree from
> low to high address. To reduce the chance of unmapping the same vma at
> the same time, the OOM reaper should traverse the tree from high to
> low address. This reduces lock contention when they unmap the same vma.

Are they going to run through and do their work in exactly the same
time, or might one 'run past' the other and you still have an issue?

Seems very vague and timing dependent and again, not convincing.

>
> Signed-off-by: zhongjinji <zhongjinji@honor.com>
> ---
>  include/linux/mm.h | 3 +++
>  mm/oom_kill.c      | 9 +++++++--
>  2 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0c44bb8ce544..b665ea3c30eb 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -923,6 +923,9 @@ static inline void vma_iter_set(struct vma_iterator *vmi, unsigned long addr)
>  #define for_each_vma_range(__vmi, __vma, __end) \
>  	while (((__vma) = vma_find(&(__vmi), (__end))) != NULL)
>
> +#define for_each_vma_reverse(__vmi, __vma) \
> +	while (((__vma) = vma_prev(&(__vmi))) != NULL)

Please don't casually add an undocumented public VMA iterator hidden in
a patch doing something else :)

Won't this skip the first VMA? Not sure this is really worth having as
a general thing anyway, it's not many people who want to do this in
reverse.

> +
>  #ifdef CONFIG_SHMEM
>  /*
>   * The vma_is_shmem is not inline because it is used only by slow
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 7ae4001e47c1..602d6836098a 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -517,7 +517,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
>  {
>  	struct vm_area_struct *vma;
>  	bool ret = true;
> -	VMA_ITERATOR(vmi, mm, 0);
> +	VMA_ITERATOR(vmi, mm, ULONG_MAX);
>
>  	/*
>  	 * Tell all users of get_user/copy_from_user etc... that the content
> @@ -527,7 +527,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
>  	 */
>  	set_bit(MMF_UNSTABLE, &mm->flags);
>
> -	for_each_vma(vmi, vma) {
> +	/*
> +	 * When two tasks unmap the same vma at the same time, they may contend
> +	 * for the pte spinlock. To avoid traversing the same vma as exit_mmap()
> +	 * does, traverse the vma maple tree in reverse order.
> +	 */

Except you won't necessarily avoid anything, as if one walker is faster
than the other they'll run ahead, plus of course they'll have a
cross-over where they share the same PTE anyway.

I feel like maybe you've got a fairly specific situation that indicates
an issue elsewhere and you're maybe solving the wrong problem here?

> +	for_each_vma_reverse(vmi, vma) {
>  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
>  			continue;
>
> --
> 2.17.1
>
> On Thu, Aug 14, 2025 at 09:55:55PM +0800, zhongjinji@honor.com wrote:
> > From: zhongjinji <zhongjinji@honor.com>
> >
> > When a process is OOM killed, the OOM reaper and the thread running
> > exit_mmap() may execute at the same time, and both traverse the vma's
> > maple tree along the same path. They can easily unmap the same vma and
> > contend for the pte spinlock, adding unnecessary load and increasing
> > the execution time of both the OOM reaper and the thread running
> > exit_mmap().
>
> You're not giving any numbers, and this seems pretty niche, you really
> exiting that many processes with the reaper running at the exact same
> time that this is an issue? Waiting on a spinlock also?
>
> This commit message is very unconvincing.

This is the perf data: the first one is without this patch applied, and
the second one is with this patch applied. It is obvious that without
this patch, the lock contention on the pte spinlock will be very
intense.

|--99.74%-- oom_reaper
|    |--76.67%-- unmap_page_range
|    |    |--33.70%-- __pte_offset_map_lock
|    |    |    |--98.46%-- _raw_spin_lock
|    |    |--27.61%-- free_swap_and_cache_nr
|    |    |--16.40%-- folio_remove_rmap_ptes
|    |    |--12.25%-- tlb_flush_mmu
|    |--12.61%-- tlb_finish_mmu

|--98.84%-- oom_reaper
|    |--53.45%-- unmap_page_range
|    |    |--24.29%-- [hit in function]
|    |    |--48.06%-- folio_remove_rmap_ptes
|    |    |--17.99%-- tlb_flush_mmu
|    |    |--1.72%-- __pte_offset_map_lock
|    |
|    |--30.43%-- tlb_finish_mmu

> >
> > When a process exits, exit_mmap() traverses the vma's maple tree from
> > low to high address. To reduce the chance of unmapping the same vma at
> > the same time, the OOM reaper should traverse the tree from high to
> > low address. This reduces lock contention when they unmap the same vma.
>
> Are they going to run through and do their work in exactly the same
> time, or might one 'run past' the other and you still have an issue?
>
> Seems very vague and timing dependent and again, not convincing.
>
> >
> > Signed-off-by: zhongjinji <zhongjinji@honor.com>
> > ---
> >  include/linux/mm.h | 3 +++
> >  mm/oom_kill.c      | 9 +++++++--
> >  2 files changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 0c44bb8ce544..b665ea3c30eb 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -923,6 +923,9 @@ static inline void vma_iter_set(struct vma_iterator *vmi, unsigned long addr)
> >  #define for_each_vma_range(__vmi, __vma, __end) \
> >  	while (((__vma) = vma_find(&(__vmi), (__end))) != NULL)
> >
> > +#define for_each_vma_reverse(__vmi, __vma) \
> > +	while (((__vma) = vma_prev(&(__vmi))) != NULL)
>
> Please don't casually add an undocumented public VMA iterator hidden
> in a patch doing something else :)
>
> Won't this skip the first VMA? Not sure this is really worth having as
> a general thing anyway, it's not many people who want to do this in
> reverse.

I got it. mas_find_rev() should be used instead of vma_prev().

> > +
> >  #ifdef CONFIG_SHMEM
> >  /*
> >   * The vma_is_shmem is not inline because it is used only by slow
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 7ae4001e47c1..602d6836098a 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -517,7 +517,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> >  {
> >  	struct vm_area_struct *vma;
> >  	bool ret = true;
> > -	VMA_ITERATOR(vmi, mm, 0);
> > +	VMA_ITERATOR(vmi, mm, ULONG_MAX);
> >
> >  	/*
> >  	 * Tell all users of get_user/copy_from_user etc... that the content
> > @@ -527,7 +527,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> >  	 */
> >  	set_bit(MMF_UNSTABLE, &mm->flags);
> >
> > -	for_each_vma(vmi, vma) {
> > +	/*
> > +	 * When two tasks unmap the same vma at the same time, they may contend
> > +	 * for the pte spinlock. To avoid traversing the same vma as exit_mmap()
> > +	 * does, traverse the vma maple tree in reverse order.
> > +	 */
>
> Except you won't necessarily avoid anything, as if one walker is faster
> than the other they'll run ahead, plus of course they'll have a
> cross-over where they share the same PTE anyway.
>
> I feel like maybe you've got a fairly specific situation that indicates
> an issue elsewhere and you're maybe solving the wrong problem here?
>
> > +	for_each_vma_reverse(vmi, vma) {
> >  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
> >  			continue;
> >
> > --
> > 2.17.1
> >
> >
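
A corrected reverse iterator along the lines suggested above might
look as follows (a hypothetical sketch, not a posted patch; it assumes
the vma_iterator's embedded ma_state is the mas member, as in the
other wrappers in include/linux/mm.h):

	/* Walk the VMA tree from high to low address; unlike vma_prev(),
	 * mas_find_rev() returns the entry at or below the start index
	 * on the first call instead of skipping it. */
	#define for_each_vma_reverse(__vmi, __vma)			\
		while (((__vma) = mas_find_rev(&(__vmi).mas, 0)) != NULL)

	/* Usage, mirroring the patch: */
	VMA_ITERATOR(vmi, mm, ULONG_MAX);
	for_each_vma_reverse(vmi, vma)
		/* ... reap the vma ... */;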
On Tue, Aug 19, 2025 at 11:18:34PM +0800, zhongjinji wrote:
> > On Thu, Aug 14, 2025 at 09:55:55PM +0800, zhongjinji@honor.com wrote:
> > > From: zhongjinji <zhongjinji@honor.com>
> > >
> > > When a process is OOM killed, the OOM reaper and the thread running
> > > exit_mmap() may execute at the same time, and both traverse the vma's
> > > maple tree along the same path. They can easily unmap the same vma and
> > > contend for the pte spinlock, adding unnecessary load and increasing
> > > the execution time of both the OOM reaper and the thread running
> > > exit_mmap().
> >
> > You're not giving any numbers, and this seems pretty niche, you really
> > exiting that many processes with the reaper running at the exact same
> > time that this is an issue? Waiting on a spinlock also?
> >
> > This commit message is very unconvincing.
>
> This is the perf data: the first one is without this patch applied, and
> the second one is with this patch applied. It is obvious that without
> this patch, the lock contention on the pte spinlock will be very
> intense.
>
> |--99.74%-- oom_reaper
> |    |--76.67%-- unmap_page_range
> |    |    |--33.70%-- __pte_offset_map_lock
> |    |    |    |--98.46%-- _raw_spin_lock
> |    |    |--27.61%-- free_swap_and_cache_nr
> |    |    |--16.40%-- folio_remove_rmap_ptes
> |    |    |--12.25%-- tlb_flush_mmu
> |    |--12.61%-- tlb_finish_mmu
>
> |--98.84%-- oom_reaper
> |    |--53.45%-- unmap_page_range
> |    |    |--24.29%-- [hit in function]
> |    |    |--48.06%-- folio_remove_rmap_ptes
> |    |    |--17.99%-- tlb_flush_mmu
> |    |    |--1.72%-- __pte_offset_map_lock
> |    |
> |    |--30.43%-- tlb_finish_mmu

Right yes thanks for providing this.

I'm still not convinced by this approach however, it feels like you're
papering over a crack for a problematic hack that needs to be solved at
a different level.

It feels like the whole waiting around thing is a hack to paper over
something and then we're introducing another hack to make that work in
a specific scenario.

I also am not clear (perhaps you answered it elsewhere) how you're
encountering this at a scale for it to be a meaningful issue?

Also not sure we should be changing core mm to support perf issues with
using an effectively-deprecated interface (cgroup v1)?

> > >
> > > When a process exits, exit_mmap() traverses the vma's maple tree from
> > > low to high address. To reduce the chance of unmapping the same vma at
> > > the same time, the OOM reaper should traverse the tree from high to
> > > low address. This reduces lock contention when they unmap the same vma.
> >
> > Are they going to run through and do their work in exactly the same
> > time, or might one 'run past' the other and you still have an issue?
> >
> > Seems very vague and timing dependent and again, not convincing.
> >
> > >
> > > Signed-off-by: zhongjinji <zhongjinji@honor.com>
> > > ---
> > >  include/linux/mm.h | 3 +++
> > >  mm/oom_kill.c      | 9 +++++++--
> > >  2 files changed, 10 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 0c44bb8ce544..b665ea3c30eb 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -923,6 +923,9 @@ static inline void vma_iter_set(struct vma_iterator *vmi, unsigned long addr)
> > >  #define for_each_vma_range(__vmi, __vma, __end) \
> > >  	while (((__vma) = vma_find(&(__vmi), (__end))) != NULL)
> > >
> > > +#define for_each_vma_reverse(__vmi, __vma) \
> > > +	while (((__vma) = vma_prev(&(__vmi))) != NULL)
> >
> > Please don't casually add an undocumented public VMA iterator hidden
> > in a patch doing something else :)
> >
> > Won't this skip the first VMA? Not sure this is really worth having as
> > a general thing anyway, it's not many people who want to do this in
> > reverse.
>
> I got it. mas_find_rev() should be used instead of vma_prev().
>
> > > +
> > >  #ifdef CONFIG_SHMEM
> > >  /*
> > >   * The vma_is_shmem is not inline because it is used only by slow
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index 7ae4001e47c1..602d6836098a 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -517,7 +517,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> > >  {
> > >  	struct vm_area_struct *vma;
> > >  	bool ret = true;
> > > -	VMA_ITERATOR(vmi, mm, 0);
> > > +	VMA_ITERATOR(vmi, mm, ULONG_MAX);
> > >
> > >  	/*
> > >  	 * Tell all users of get_user/copy_from_user etc... that the content
> > > @@ -527,7 +527,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> > >  	 */
> > >  	set_bit(MMF_UNSTABLE, &mm->flags);
> > >
> > > -	for_each_vma(vmi, vma) {
> > > +	/*
> > > +	 * When two tasks unmap the same vma at the same time, they may contend
> > > +	 * for the pte spinlock. To avoid traversing the same vma as exit_mmap()
> > > +	 * does, traverse the vma maple tree in reverse order.
> > > +	 */
> >
> > Except you won't necessarily avoid anything, as if one walker is faster
> > than the other they'll run ahead, plus of course they'll have a
> > cross-over where they share the same PTE anyway.
> >
> > I feel like maybe you've got a fairly specific situation that indicates
> > an issue elsewhere and you're maybe solving the wrong problem here?
> >
> > > +	for_each_vma_reverse(vmi, vma) {
> > >  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
> > >  			continue;
> > >
> > > --
> > > 2.17.1
> > >
> > >
> >
> > |--99.74%-- oom_reaper
> > |    |--76.67%-- unmap_page_range
> > |    |    |--33.70%-- __pte_offset_map_lock
> > |    |    |    |--98.46%-- _raw_spin_lock
> > |    |    |--27.61%-- free_swap_and_cache_nr
> > |    |    |--16.40%-- folio_remove_rmap_ptes
> > |    |    |--12.25%-- tlb_flush_mmu
> > |    |--12.61%-- tlb_finish_mmu
> >
> > |--98.84%-- oom_reaper
> > |    |--53.45%-- unmap_page_range
> > |    |    |--24.29%-- [hit in function]
> > |    |    |--48.06%-- folio_remove_rmap_ptes
> > |    |    |--17.99%-- tlb_flush_mmu
> > |    |    |--1.72%-- __pte_offset_map_lock
> > |    |
> > |    |--30.43%-- tlb_finish_mmu
>
> Right yes thanks for providing this.
>
> I'm still not convinced by this approach however, it feels like you're
> papering over a crack for a problematic hack that needs to be solved at
> a different level.
>
> It feels like the whole waiting around thing is a hack to paper over
> something and then we're introducing another hack to make that work in
> a specific scenario.
>
> I also am not clear (perhaps you answered it elsewhere) how you're
> encountering this at a scale for it to be a meaningful issue?

On low-memory Android devices, high memory pressure often requires
killing processes to free memory, which is generally accepted on
Android. When killing a process on Android, there is also an
asynchronous process reap mechanism, which is implemented through
process_mrelease, similar to the oom reaper. OOM events are also not
rare. Therefore, it makes sense to reduce the load on the reaper.

> Also not sure we should be changing core mm to support perf issues with
> using an effectively-deprecated interface (cgroup v1)?

Yeah, it is not that appealing.
>
> On Thu, Aug 14, 2025 at 09:55:55PM +0800, zhongjinji@honor.com wrote:
> > From: zhongjinji <zhongjinji@honor.com>
> >
> > When a process is OOM killed, the OOM reaper and the thread running
> > exit_mmap() may execute at the same time, and both traverse the vma's
> > maple tree along the same path. They can easily unmap the same vma and
> > contend for the pte spinlock, adding unnecessary load and increasing
> > the execution time of both the OOM reaper and the thread running
> > exit_mmap().
>
> You're not giving any numbers, and this seems pretty niche, you really
> exiting that many processes with the reaper running at the exact same
> time that this is an issue? Waiting on a spinlock also?
>
> This commit message is very unconvincing.

Thank you, I will reconfirm this issue.

> >
> > When a process exits, exit_mmap() traverses the vma's maple tree from
> > low to high address. To reduce the chance of unmapping the same vma at
> > the same time, the OOM reaper should traverse the tree from high to
> > low address. This reduces lock contention when they unmap the same vma.
>
> Are they going to run through and do their work in exactly the same
> time, or might one 'run past' the other and you still have an issue?
>
> Seems very vague and timing dependent and again, not convincing.

Well, thank you. I should capture a perf trace for the oom reaper, not
perfetto.

> >
> > Signed-off-by: zhongjinji <zhongjinji@honor.com>
> > ---
> >  include/linux/mm.h | 3 +++
> >  mm/oom_kill.c      | 9 +++++++--
> >  2 files changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 0c44bb8ce544..b665ea3c30eb 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -923,6 +923,9 @@ static inline void vma_iter_set(struct vma_iterator *vmi, unsigned long addr)
> >  #define for_each_vma_range(__vmi, __vma, __end) \
> >  	while (((__vma) = vma_find(&(__vmi), (__end))) != NULL)
> >
> > +#define for_each_vma_reverse(__vmi, __vma) \
> > +	while (((__vma) = vma_prev(&(__vmi))) != NULL)
>
> Please don't casually add an undocumented public VMA iterator hidden in
> a patch doing something else :)

Sorry, I got it.

> Won't this skip the first VMA? Not sure this is really worth having as
> a general thing anyway, it's not many people who want to do this in
> reverse.
>
> > +
> >  #ifdef CONFIG_SHMEM
> >  /*
> >   * The vma_is_shmem is not inline because it is used only by slow
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 7ae4001e47c1..602d6836098a 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -517,7 +517,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> >  {
> >  	struct vm_area_struct *vma;
> >  	bool ret = true;
> > -	VMA_ITERATOR(vmi, mm, 0);
> > +	VMA_ITERATOR(vmi, mm, ULONG_MAX);
> >
> >  	/*
> >  	 * Tell all users of get_user/copy_from_user etc... that the content
> > @@ -527,7 +527,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> >  	 */
> >  	set_bit(MMF_UNSTABLE, &mm->flags);
> >
> > -	for_each_vma(vmi, vma) {
> > +	/*
> > +	 * When two tasks unmap the same vma at the same time, they may contend
> > +	 * for the pte spinlock. To avoid traversing the same vma as exit_mmap()
> > +	 * does, traverse the vma maple tree in reverse order.
> > +	 */
>
> Except you won't necessarily avoid anything, as if one walker is faster
> than the other they'll run ahead, plus of course they'll have a
> cross-over where they share the same PTE anyway.
>
> I feel like maybe you've got a fairly specific situation that indicates
> an issue elsewhere and you're maybe solving the wrong problem here?

Thank you, I will reconfirm this issue.

> > +	for_each_vma_reverse(vmi, vma) {
> >  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
> >  			continue;
> >
> > --
> > 2.17.1
> >
> >
On Fri, Aug 15, 2025 at 03:29:24PM +0100, Lorenzo Stoakes wrote:
> On Thu, Aug 14, 2025 at 09:55:55PM +0800, zhongjinji@honor.com wrote:
> > From: zhongjinji <zhongjinji@honor.com>
> >
> > When a process is OOM killed, the OOM reaper and the thread running
> > exit_mmap() may execute at the same time, and both traverse the vma's
> > maple tree along the same path. They can easily unmap the same vma and
> > contend for the pte spinlock, adding unnecessary load and increasing
> > the execution time of both the OOM reaper and the thread running
> > exit_mmap().
>
> You're not giving any numbers, and this seems pretty niche, you really
> exiting that many processes with the reaper running at the exact same
> time that this is an issue? Waiting on a spinlock also?
>
> This commit message is very unconvincing.
>
> >
> > When a process exits, exit_mmap() traverses the vma's maple tree from
> > low to high address. To reduce the chance of unmapping the same vma at
> > the same time, the OOM reaper should traverse the tree from high to
> > low address. This reduces lock contention when they unmap the same vma.
>
> Are they going to run through and do their work in exactly the same
> time, or might one 'run past' the other and you still have an issue?
>
> Seems very vague and timing dependent and again, not convincing.
>
> >
> > Signed-off-by: zhongjinji <zhongjinji@honor.com>
> > ---
> >  include/linux/mm.h | 3 +++
> >  mm/oom_kill.c      | 9 +++++++--
> >  2 files changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 0c44bb8ce544..b665ea3c30eb 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -923,6 +923,9 @@ static inline void vma_iter_set(struct vma_iterator *vmi, unsigned long addr)
> >  #define for_each_vma_range(__vmi, __vma, __end) \
> >  	while (((__vma) = vma_find(&(__vmi), (__end))) != NULL)
> >
> > +#define for_each_vma_reverse(__vmi, __vma) \
> > +	while (((__vma) = vma_prev(&(__vmi))) != NULL)
>
> Please don't casually add an undocumented public VMA iterator hidden in
> a patch doing something else :)
>
> Won't this skip the first VMA? Not sure this is really worth having as
> a general thing anyway, it's not many people who want to do this in
> reverse.
>
> > +
> >  #ifdef CONFIG_SHMEM
> >  /*
> >   * The vma_is_shmem is not inline because it is used only by slow
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 7ae4001e47c1..602d6836098a 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -517,7 +517,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> >  {
> >  	struct vm_area_struct *vma;
> >  	bool ret = true;
> > -	VMA_ITERATOR(vmi, mm, 0);
> > +	VMA_ITERATOR(vmi, mm, ULONG_MAX);
> >
> >  	/*
> >  	 * Tell all users of get_user/copy_from_user etc... that the content
> > @@ -527,7 +527,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> >  	 */
> >  	set_bit(MMF_UNSTABLE, &mm->flags);
> >
> > -	for_each_vma(vmi, vma) {
> > +	/*
> > +	 * When two tasks unmap the same vma at the same time, they may contend
> > +	 * for the pte spinlock. To avoid traversing the same vma as exit_mmap()
> > +	 * does, traverse the vma maple tree in reverse order.
> > +	 */
>
> Except you won't necessarily avoid anything, as if one walker is faster
> than the other they'll run ahead, plus of course they'll have a
> cross-over where they share the same PTE anyway.

OK I guess what is happening here is very likely one task will be
faster, the other slower, but like a slow train ahead of a fast one on
a single line, if it happens to get the lock first it'll hold up the
first one over and over again if the same PTEs are traversed.

Still, again this is super timing dependent, it still feels like the
wrong solution and something of a hack, and it really needs to be
backed/explained more thoroughly.

The remaining comments still apply.

> I feel like maybe you've got a fairly specific situation that indicates
> an issue elsewhere and you're maybe solving the wrong problem here?
>
> > +	for_each_vma_reverse(vmi, vma) {
> >  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
> >  			continue;
> >
> > --
> > 2.17.1
> >
> >
On Thu, 14 Aug 2025 21:55:55 +0800 <zhongjinji@honor.com> wrote:

> When a process is OOM killed, the OOM reaper and the thread running
> exit_mmap() may execute at the same time, and both traverse the vma's
> maple tree along the same path. They can easily unmap the same vma and
> contend for the pte spinlock, adding unnecessary load and increasing
> the execution time of both the OOM reaper and the thread running
> exit_mmap().

Please tell me what I'm missing here.

OOM kills are a rare event. And this race sounds like it will rarely
occur even if an oom-killing is happening. And the delay will be
relatively short.

If I'm correct then we're addressing rare*rare*small, so why bother?

> When a process exits, exit_mmap() traverses the vma's maple tree from
> low to high address. To reduce the chance of unmapping the same vma at
> the same time, the OOM reaper should traverse the tree from high to
> low address. This reduces lock contention when they unmap the same vma.

Sharing some before-and-after runtime measurements would be useful. Or
at least, detailed anecdotes.
> On Thu, 14 Aug 2025 21:55:55 +0800 <zhongjinji@honor.com> wrote:
> > When a process is OOM killed, the OOM reaper and the thread running
> > exit_mmap() may execute at the same time, and both traverse the vma's
> > maple tree along the same path. They can easily unmap the same vma and
> > contend for the pte spinlock, adding unnecessary load and increasing
> > the execution time of both the OOM reaper and the thread running
> > exit_mmap().
>
> Please tell me what I'm missing here.
>
> OOM kills are a rare event. And this race sounds like it will rarely
> occur even if an oom-killing is happening. And the delay will be
> relatively short.
>
> If I'm correct then we're addressing rare*rare*small, so why bother?

When there are apps that consume a large amount of memory, encountering
OOM on low-memory Android devices is not uncommon. On Android devices,
programs like lmkd (a user-space daemon in the Android system) also call
process_mrelease() to reap memory when an app is killed.

> > When a process exits, exit_mmap() traverses the vma's maple tree from
> > low to high address. To reduce the chance of unmapping the same vma at
> > the same time, the OOM reaper should traverse the tree from high to
> > low address. This reduces lock contention when they unmap the same vma.
>
> Sharing some before-and-after runtime measurements would be useful. Or
> at least, detailed anecdotes.

Here is my test data on Android. The test process is as follows: start
the same app, then kill it, and finally capture the perfetto trace. In
the test, the OOM reaper is triggered by intercepting the kill signal
and actively adding the process to the OOM reaper queue, as the OOM
killer does.

Note: #RxComputationT, vdp:vidtask:m, and tp-background are threads of
the same process, and they are the last threads to exit.

Thread           TID    State    Wall duration (ms)
# with oom reaper and traverse reverse
#RxComputationT  13708  Running  60.690572
oom_reaper       81     Running  46.492032
# with oom reaper but traverses
vdp:vidtask:m    14040  Running  81.848297
oom_reaper       81     Running  69.32
# without oom reaper
tp-background    12424  Running  106.021874