From: zhongjinji <zhongjinji@honor.com>
When a process is OOM killed, the OOM reaper and the thread running
exit_mmap() may execute at the same time, and both traverse the vma's
maple tree along the same path. They can easily unmap the same vma and
contend for the pte spinlock, adding unnecessary load and increasing
the execution time of both the OOM reaper and the thread running
exit_mmap().

When a process exits, exit_mmap() traverses the vma's maple tree from
low to high address. To reduce the chance of unmapping the same vma at
the same time, the OOM reaper should traverse the tree from high to
low address. This reduces lock contention when they unmap the same vma.
Signed-off-by: zhongjinji <zhongjinji@honor.com>
---
 include/linux/mm.h | 3 +++
 mm/oom_kill.c      | 9 +++++++--
2 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0c44bb8ce544..b665ea3c30eb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -923,6 +923,9 @@ static inline void vma_iter_set(struct vma_iterator *vmi, unsigned long addr)
 #define for_each_vma_range(__vmi, __vma, __end)			\
 	while (((__vma) = vma_find(&(__vmi), (__end))) != NULL)
 
+#define for_each_vma_reverse(__vmi, __vma) \
+	while (((__vma) = vma_prev(&(__vmi))) != NULL)
+
 #ifdef CONFIG_SHMEM
 /*
  * The vma_is_shmem is not inline because it is used only by slow
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7ae4001e47c1..602d6836098a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -517,7 +517,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
 	bool ret = true;
-	VMA_ITERATOR(vmi, mm, 0);
+	VMA_ITERATOR(vmi, mm, ULONG_MAX);
 
 	/*
 	 * Tell all users of get_user/copy_from_user etc... that the content
@@ -527,7 +527,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
 	 */
 	set_bit(MMF_UNSTABLE, &mm->flags);
 
-	for_each_vma(vmi, vma) {
+	/*
+	 * When two tasks unmap the same vma at the same time, they may contend
+	 * for the pte spinlock. To avoid traversing the same vma as exit_mmap()
+	 * does, traverse the vma maple tree in reverse order.
+	 */
+	for_each_vma_reverse(vmi, vma) {
 		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
 			continue;
 
--
2.17.1
* zhongjinji@honor.com <zhongjinji@honor.com> [250814 09:56]:
> From: zhongjinji <zhongjinji@honor.com>
>
> When a process is OOM killed, the OOM reaper and the thread running
> exit_mmap() may execute at the same time, and both traverse the vma's
> maple tree along the same path. They can easily unmap the same vma and
> contend for the pte spinlock, adding unnecessary load and increasing
> the execution time of both the OOM reaper and the thread running
> exit_mmap().
>
> When a process exits, exit_mmap() traverses the vma's maple tree from
> low to high address. To reduce the chance of unmapping the same vma at
> the same time, the OOM reaper should traverse the tree from high to
> low address. This reduces lock contention when they unmap the same vma.
>
> Signed-off-by: zhongjinji <zhongjinji@honor.com>
> ---
>  include/linux/mm.h | 3 +++
>  mm/oom_kill.c      | 9 +++++++--
>  2 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0c44bb8ce544..b665ea3c30eb 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -923,6 +923,9 @@ static inline void vma_iter_set(struct vma_iterator *vmi, unsigned long addr)
>  #define for_each_vma_range(__vmi, __vma, __end) \
>  	while (((__vma) = vma_find(&(__vmi), (__end))) != NULL)
>
> +#define for_each_vma_reverse(__vmi, __vma) \
> +	while (((__vma) = vma_prev(&(__vmi))) != NULL)
> +

This does not do what you think it does, nor does it do what others
will think it will do. It's not the opposite of the
for_each_vma_range() above.

vma_find() calls mas_find(), which has a different meaning than
mas_next(). mas_find()'s behaviour is a hold-over from the vma_find()
of yesteryears: it will find the first entry at the address (if it's
the first time called) or the entry after it.

mas_prev() is trying to replace the linked list behaviour of "go to
the previous one", so it'll walk to the index you specified and go to
the previous one. It will skip the index you passed in regardless of
its existence or not.

So what you have here is a broken interface, you just don't see it
with your code because you don't happen to have a mapping at
ULONG_MAX.

This should not be merged as-is.

Also, there was zero mention of the new interface in the subject, so I
almost missed this being added.

>  #ifdef CONFIG_SHMEM
>  /*
>   * The vma_is_shmem is not inline because it is used only by slow
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 7ae4001e47c1..602d6836098a 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -517,7 +517,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
>  {
>  	struct vm_area_struct *vma;
>  	bool ret = true;
> -	VMA_ITERATOR(vmi, mm, 0);
> +	VMA_ITERATOR(vmi, mm, ULONG_MAX);
>
>  	/*
>  	 * Tell all users of get_user/copy_from_user etc... that the content
> @@ -527,7 +527,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
>  	 */
>  	set_bit(MMF_UNSTABLE, &mm->flags);
>
> -	for_each_vma(vmi, vma) {
> +	/*
> +	 * When two tasks unmap the same vma at the same time, they may contend
> +	 * for the pte spinlock. To avoid traversing the same vma as exit_mmap()
> +	 * does, traverse the vma maple tree in reverse order.
> +	 */
> +	for_each_vma_reverse(vmi, vma) {

How is this possible? Both need the same lock..? What part of
exit_mmap() will race here?

Why aren't we using the MMF_UNSTABLE flag set above to avoid it? Or
the MMF_OOM_SKIP?

>  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
>  			continue;
>
> --
> 2.17.1
>
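
To make the asymmetry described above concrete, here is a minimal
sketch of the two walks (illustrative only, not from the thread;
visit() is a hypothetical placeholder, and the mas_find()/mas_prev()
semantics are as Liam describes):

	struct vm_area_struct *vma;

	/* Forward: vma_find() -> mas_find() returns the entry *at* the
	 * start index on the first call, so a VMA spanning address 0
	 * is visited. */
	VMA_ITERATOR(vmi_fwd, mm, 0);
	for_each_vma(vmi_fwd, vma)
		visit(vma);		/* first VMA included */

	/* Reverse via vma_prev() -> mas_prev(): walks to the start
	 * index and returns the entry *before* it, so a VMA covering
	 * ULONG_MAX is silently skipped on the first call. */
	VMA_ITERATOR(vmi_rev, mm, ULONG_MAX);
	while ((vma = vma_prev(&vmi_rev)) != NULL)
		visit(vma);		/* a VMA at ULONG_MAX is missed */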
* Liam R. Howlett <Liam.Howlett@oracle.com> [250815 10:41]:
> * zhongjinji@honor.com <zhongjinji@honor.com> [250814 09:56]:
> > From: zhongjinji <zhongjinji@honor.com>

...

> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 7ae4001e47c1..602d6836098a 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -517,7 +517,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> >  {
> >  	struct vm_area_struct *vma;
> >  	bool ret = true;
> > -	VMA_ITERATOR(vmi, mm, 0);
> > +	VMA_ITERATOR(vmi, mm, ULONG_MAX);
> >
> >  	/*
> >  	 * Tell all users of get_user/copy_from_user etc... that the content
> > @@ -527,7 +527,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> >  	 */
> >  	set_bit(MMF_UNSTABLE, &mm->flags);
> >
> > -	for_each_vma(vmi, vma) {
> > +	/*
> > +	 * When two tasks unmap the same vma at the same time, they may contend
> > +	 * for the pte spinlock. To avoid traversing the same vma as exit_mmap()
> > +	 * does, traverse the vma maple tree in reverse order.
> > +	 */
> > +	for_each_vma_reverse(vmi, vma) {
>
> How is this possible? Both need the same lock..? What part of
> exit_mmap() will race here?

I see, exit_mmap() and the oom both use unmap_page_range() under the
mmap read lock, so they can race but since they'll contend on the pte
lock it doesn't really matter.

This is so rare, I don't think this is worth fixing.

Thanks,
Liam
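
For context, the two racing paths have roughly this shape (a
simplified sketch of the calls named above; the real
exit_mmap()/__oom_reap_task_mm() code differs in detail and this
fragment is schematic, not compilable as-is):

	/* Exiting task, exit_mmap(), simplified: */
	mmap_read_lock(mm);
	unmap_vmas(&tlb, ...);		/* walks VMAs low -> high and calls
					 * unmap_page_range() on each */
	mmap_read_unlock(mm);

	/* oom_reaper, __oom_reap_task_mm(), also under the read lock: */
	for_each_vma(vmi, vma)
		unmap_page_range(&tlb, vma, ...);	/* same PTEs, hence
							 * pte lock contention */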
On Thu, Aug 14, 2025 at 09:55:55PM +0800, zhongjinji@honor.com wrote:
> From: zhongjinji <zhongjinji@honor.com>
>
> When a process is OOM killed, the OOM reaper and the thread running
> exit_mmap() may execute at the same time, and both traverse the vma's
> maple tree along the same path. They can easily unmap the same vma and
> contend for the pte spinlock, adding unnecessary load and increasing
> the execution time of both the OOM reaper and the thread running
> exit_mmap().

You're not giving any numbers, and this seems pretty niche, you really
exiting that many processes with the reaper running at the exact same
time that this is an issue? Waiting on a spinlock also?

This commit message is very unconvincing.

>
> When a process exits, exit_mmap() traverses the vma's maple tree from
> low to high address. To reduce the chance of unmapping the same vma at
> the same time, the OOM reaper should traverse the tree from high to
> low address. This reduces lock contention when they unmap the same vma.

Are they going to run through and do their work in exactly the same
time, or might one 'run past' the other and you still have an issue?

Seems very vague and timing dependent and again, not convincing.

>
> Signed-off-by: zhongjinji <zhongjinji@honor.com>
> ---
>  include/linux/mm.h | 3 +++
>  mm/oom_kill.c      | 9 +++++++--
>  2 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0c44bb8ce544..b665ea3c30eb 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -923,6 +923,9 @@ static inline void vma_iter_set(struct vma_iterator *vmi, unsigned long addr)
>  #define for_each_vma_range(__vmi, __vma, __end) \
>  	while (((__vma) = vma_find(&(__vmi), (__end))) != NULL)
>
> +#define for_each_vma_reverse(__vmi, __vma) \
> +	while (((__vma) = vma_prev(&(__vmi))) != NULL)

Please don't casually add an undocumented public VMA iterator hidden in
a patch doing something else :)

Won't this skip the first VMA? Not sure this is really worth having as
a general thing anyway, it's not many people who want to do this in
reverse.

> +
>  #ifdef CONFIG_SHMEM
>  /*
>   * The vma_is_shmem is not inline because it is used only by slow
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 7ae4001e47c1..602d6836098a 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -517,7 +517,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
>  {
>  	struct vm_area_struct *vma;
>  	bool ret = true;
> -	VMA_ITERATOR(vmi, mm, 0);
> +	VMA_ITERATOR(vmi, mm, ULONG_MAX);
>
>  	/*
>  	 * Tell all users of get_user/copy_from_user etc... that the content
> @@ -527,7 +527,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
>  	 */
>  	set_bit(MMF_UNSTABLE, &mm->flags);
>
> -	for_each_vma(vmi, vma) {
> +	/*
> +	 * When two tasks unmap the same vma at the same time, they may contend
> +	 * for the pte spinlock. To avoid traversing the same vma as exit_mmap()
> +	 * does, traverse the vma maple tree in reverse order.
> +	 */

Except you won't necessarily avoid anything, as if one walker is faster
than the other they'll run ahead, plus of course they'll have a
cross-over where they share the same PTE anyway.

I feel like maybe you've got a fairly specific situation that indicates
an issue elsewhere and you're maybe solving the wrong problem here?

> +	for_each_vma_reverse(vmi, vma) {
>  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
>  			continue;
>
> --
> 2.17.1
>
> On Thu, Aug 14, 2025 at 09:55:55PM +0800, zhongjinji@honor.com wrote:
> > From: zhongjinji <zhongjinji@honor.com>
> >
> > When a process is OOM killed, the OOM reaper and the thread running
> > exit_mmap() may execute at the same time, and both traverse the vma's
> > maple tree along the same path. They can easily unmap the same vma and
> > contend for the pte spinlock, adding unnecessary load and increasing
> > the execution time of both the OOM reaper and the thread running
> > exit_mmap().
>
> You're not giving any numbers, and this seems pretty niche, you really
> exiting that many processes with the reaper running at the exact same
> time that this is an issue? Waiting on a spinlock also?
>
> This commit message is very unconvincing.

This is the perf data: the first one is without this patch applied, and
the second one is with this patch applied. It is obvious that without
this patch, the lock contention on the pte spinlock will be very
intense.

|--99.74%-- oom_reaper
|    |--76.67%-- unmap_page_range
|    |    |--33.70%-- __pte_offset_map_lock
|    |    |    |--98.46%-- _raw_spin_lock
|    |    |--27.61%-- free_swap_and_cache_nr
|    |    |--16.40%-- folio_remove_rmap_ptes
|    |    |--12.25%-- tlb_flush_mmu
|    |--12.61%-- tlb_finish_mmu

|--98.84%-- oom_reaper
|    |--53.45%-- unmap_page_range
|    |    |--24.29%-- [hit in function]
|    |    |--48.06%-- folio_remove_rmap_ptes
|    |    |--17.99%-- tlb_flush_mmu
|    |    |--1.72%-- __pte_offset_map_lock
|    |
|    |--30.43%-- tlb_finish_mmu

> >
> > When a process exits, exit_mmap() traverses the vma's maple tree from
> > low to high address. To reduce the chance of unmapping the same vma at
> > the same time, the OOM reaper should traverse the tree from high to
> > low address. This reduces lock contention when they unmap the same vma.
>
> Are they going to run through and do their work in exactly the same
> time, or might one 'run past' the other and you still have an issue?
>
> Seems very vague and timing dependent and again, not convincing.
>
> >
> > Signed-off-by: zhongjinji <zhongjinji@honor.com>
> > ---
> >  include/linux/mm.h | 3 +++
> >  mm/oom_kill.c      | 9 +++++++--
> >  2 files changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 0c44bb8ce544..b665ea3c30eb 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -923,6 +923,9 @@ static inline void vma_iter_set(struct vma_iterator *vmi, unsigned long addr)
> >  #define for_each_vma_range(__vmi, __vma, __end) \
> >  	while (((__vma) = vma_find(&(__vmi), (__end))) != NULL)
> >
> > +#define for_each_vma_reverse(__vmi, __vma) \
> > +	while (((__vma) = vma_prev(&(__vmi))) != NULL)
>
> Please don't casually add an undocumented public VMA iterator hidden
> in a patch doing something else :)
>
> Won't this skip the first VMA? Not sure this is really worth having as
> a general thing anyway, it's not many people who want to do this in
> reverse.

I got it. mas_find_rev() should be used instead of vma_prev().

> > +
> >  #ifdef CONFIG_SHMEM
> >  /*
> >   * The vma_is_shmem is not inline because it is used only by slow
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 7ae4001e47c1..602d6836098a 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -517,7 +517,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> >  {
> >  	struct vm_area_struct *vma;
> >  	bool ret = true;
> > -	VMA_ITERATOR(vmi, mm, 0);
> > +	VMA_ITERATOR(vmi, mm, ULONG_MAX);
> >
> >  	/*
> >  	 * Tell all users of get_user/copy_from_user etc... that the content
> > @@ -527,7 +527,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> >  	 */
> >  	set_bit(MMF_UNSTABLE, &mm->flags);
> >
> > -	for_each_vma(vmi, vma) {
> > +	/*
> > +	 * When two tasks unmap the same vma at the same time, they may contend
> > +	 * for the pte spinlock. To avoid traversing the same vma as exit_mmap()
> > +	 * does, traverse the vma maple tree in reverse order.
> > +	 */
>
> Except you won't necessarily avoid anything, as if one walker is faster
> than the other they'll run ahead, plus of course they'll have a
> cross-over where they share the same PTE anyway.
>
> I feel like maybe you've got a fairly specific situation that indicates
> an issue elsewhere and you're maybe solving the wrong problem here?
>
> > +	for_each_vma_reverse(vmi, vma) {
> >  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
> >  			continue;
> >
> > --
> > 2.17.1
> >
> >
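
A corrected reverse iterator along the lines suggested above might
look as follows (a hypothetical sketch, not a posted patch; it assumes
the vma_iterator's embedded ma_state is the mas member, as in the
other wrappers in include/linux/mm.h):

	/* Walk the VMA tree from high to low address; unlike vma_prev(),
	 * mas_find_rev() returns the entry at or below the start index
	 * on the first call instead of skipping it. */
	#define for_each_vma_reverse(__vmi, __vma)			\
		while (((__vma) = mas_find_rev(&(__vmi).mas, 0)) != NULL)

	/* Usage, mirroring the patch: */
	VMA_ITERATOR(vmi, mm, ULONG_MAX);
	for_each_vma_reverse(vmi, vma)
		/* ... reap the vma ... */;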
On Tue, Aug 19, 2025 at 11:18:34PM +0800, zhongjinji wrote:
> > On Thu, Aug 14, 2025 at 09:55:55PM +0800, zhongjinji@honor.com wrote:
> > > From: zhongjinji <zhongjinji@honor.com>
> > >
> > > When a process is OOM killed, the OOM reaper and the thread running
> > > exit_mmap() may execute at the same time, and both traverse the vma's
> > > maple tree along the same path. They can easily unmap the same vma and
> > > contend for the pte spinlock, adding unnecessary load and increasing
> > > the execution time of both the OOM reaper and the thread running
> > > exit_mmap().
> >
> > You're not giving any numbers, and this seems pretty niche, you really
> > exiting that many processes with the reaper running at the exact same
> > time that this is an issue? Waiting on a spinlock also?
> >
> > This commit message is very unconvincing.
>
> This is the perf data: the first one is without this patch applied, and
> the second one is with this patch applied. It is obvious that without
> this patch, the lock contention on the pte spinlock will be very
> intense.
>
> |--99.74%-- oom_reaper
> |    |--76.67%-- unmap_page_range
> |    |    |--33.70%-- __pte_offset_map_lock
> |    |    |    |--98.46%-- _raw_spin_lock
> |    |    |--27.61%-- free_swap_and_cache_nr
> |    |    |--16.40%-- folio_remove_rmap_ptes
> |    |    |--12.25%-- tlb_flush_mmu
> |    |--12.61%-- tlb_finish_mmu
>
> |--98.84%-- oom_reaper
> |    |--53.45%-- unmap_page_range
> |    |    |--24.29%-- [hit in function]
> |    |    |--48.06%-- folio_remove_rmap_ptes
> |    |    |--17.99%-- tlb_flush_mmu
> |    |    |--1.72%-- __pte_offset_map_lock
> |    |
> |    |--30.43%-- tlb_finish_mmu

Right yes thanks for providing this.

I'm still not convinced by this approach however, it feels like you're
papering over a crack for a problematic hack that needs to be solved at
a different level.

It feels like the whole waiting around thing is a hack to paper over
something and then we're introducing another hack to make that work in
a specific scenario.

I also am not clear (perhaps you answered it elsewhere) how you're
encountering this at a scale for it to be a meaningful issue?

Also not sure we should be changing core mm to support perf issues with
using an effectively-deprecated interface (cgroup v1)?

> > >
> > > When a process exits, exit_mmap() traverses the vma's maple tree from
> > > low to high address. To reduce the chance of unmapping the same vma at
> > > the same time, the OOM reaper should traverse the tree from high to
> > > low address. This reduces lock contention when they unmap the same vma.
> >
> > Are they going to run through and do their work in exactly the same
> > time, or might one 'run past' the other and you still have an issue?
> >
> > Seems very vague and timing dependent and again, not convincing.
> >
> > >
> > > Signed-off-by: zhongjinji <zhongjinji@honor.com>
> > > ---
> > >  include/linux/mm.h | 3 +++
> > >  mm/oom_kill.c      | 9 +++++++--
> > >  2 files changed, 10 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 0c44bb8ce544..b665ea3c30eb 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -923,6 +923,9 @@ static inline void vma_iter_set(struct vma_iterator *vmi, unsigned long addr)
> > >  #define for_each_vma_range(__vmi, __vma, __end) \
> > >  	while (((__vma) = vma_find(&(__vmi), (__end))) != NULL)
> > >
> > > +#define for_each_vma_reverse(__vmi, __vma) \
> > > +	while (((__vma) = vma_prev(&(__vmi))) != NULL)
> >
> > Please don't casually add an undocumented public VMA iterator hidden
> > in a patch doing something else :)
> >
> > Won't this skip the first VMA? Not sure this is really worth having as
> > a general thing anyway, it's not many people who want to do this in
> > reverse.
>
> I got it. mas_find_rev() should be used instead of vma_prev().
>
> > > +
> > >  #ifdef CONFIG_SHMEM
> > >  /*
> > >   * The vma_is_shmem is not inline because it is used only by slow
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index 7ae4001e47c1..602d6836098a 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -517,7 +517,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> > >  {
> > >  	struct vm_area_struct *vma;
> > >  	bool ret = true;
> > > -	VMA_ITERATOR(vmi, mm, 0);
> > > +	VMA_ITERATOR(vmi, mm, ULONG_MAX);
> > >
> > >  	/*
> > >  	 * Tell all users of get_user/copy_from_user etc... that the content
> > > @@ -527,7 +527,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> > >  	 */
> > >  	set_bit(MMF_UNSTABLE, &mm->flags);
> > >
> > > -	for_each_vma(vmi, vma) {
> > > +	/*
> > > +	 * When two tasks unmap the same vma at the same time, they may contend
> > > +	 * for the pte spinlock. To avoid traversing the same vma as exit_mmap()
> > > +	 * does, traverse the vma maple tree in reverse order.
> > > +	 */
> >
> > Except you won't necessarily avoid anything, as if one walker is faster
> > than the other they'll run ahead, plus of course they'll have a
> > cross-over where they share the same PTE anyway.
> >
> > I feel like maybe you've got a fairly specific situation that indicates
> > an issue elsewhere and you're maybe solving the wrong problem here?
> >
> > > +	for_each_vma_reverse(vmi, vma) {
> > >  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
> > >  			continue;
> > >
> > > --
> > > 2.17.1
> > >
> > >
> >
> > |--99.74%-- oom_reaper
> > |    |--76.67%-- unmap_page_range
> > |    |    |--33.70%-- __pte_offset_map_lock
> > |    |    |    |--98.46%-- _raw_spin_lock
> > |    |    |--27.61%-- free_swap_and_cache_nr
> > |    |    |--16.40%-- folio_remove_rmap_ptes
> > |    |    |--12.25%-- tlb_flush_mmu
> > |    |--12.61%-- tlb_finish_mmu
> >
> > |--98.84%-- oom_reaper
> > |    |--53.45%-- unmap_page_range
> > |    |    |--24.29%-- [hit in function]
> > |    |    |--48.06%-- folio_remove_rmap_ptes
> > |    |    |--17.99%-- tlb_flush_mmu
> > |    |    |--1.72%-- __pte_offset_map_lock
> > |    |
> > |    |--30.43%-- tlb_finish_mmu
>
> Right yes thanks for providing this.
>
> I'm still not convinced by this approach however, it feels like you're
> papering over a crack for a problematic hack that needs to be solved at
> a different level.
>
> It feels like the whole waiting around thing is a hack to paper over
> something and then we're introducing another hack to make that work in
> a specific scenario.
>
> I also am not clear (perhaps you answered it elsewhere) how you're
> encountering this at a scale for it to be a meaningful issue?

On low-memory Android devices, high memory pressure often requires
killing processes to free memory, which is generally accepted on
Android. When killing a process on Android, there is also an
asynchronous process reap mechanism, which is implemented through
process_mrelease, similar to the oom reaper. OOM events are also not
rare. Therefore, it makes sense to reduce the load on the reaper.

> Also not sure we should be changing core mm to support perf issues with
> using an effectively-deprecated interface (cgroup v1)?

Yeah, it is not that appealing.
>
> On Thu, Aug 14, 2025 at 09:55:55PM +0800, zhongjinji@honor.com wrote:
> > From: zhongjinji <zhongjinji@honor.com>
> >
> > When a process is OOM killed, the OOM reaper and the thread running
> > exit_mmap() may execute at the same time, and both traverse the vma's
> > maple tree along the same path. They can easily unmap the same vma and
> > contend for the pte spinlock, adding unnecessary load and increasing
> > the execution time of both the OOM reaper and the thread running
> > exit_mmap().
>
> You're not giving any numbers, and this seems pretty niche, you really
> exiting that many processes with the reaper running at the exact same
> time that this is an issue? Waiting on a spinlock also?
>
> This commit message is very unconvincing.

Thank you, I will reconfirm this issue.

> >
> > When a process exits, exit_mmap() traverses the vma's maple tree from
> > low to high address. To reduce the chance of unmapping the same vma at
> > the same time, the OOM reaper should traverse the tree from high to
> > low address. This reduces lock contention when they unmap the same vma.
>
> Are they going to run through and do their work in exactly the same
> time, or might one 'run past' the other and you still have an issue?
>
> Seems very vague and timing dependent and again, not convincing.

Well, thank you. I should capture a perf trace for the oom reaper, not
perfetto.

> >
> > Signed-off-by: zhongjinji <zhongjinji@honor.com>
> > ---
> >  include/linux/mm.h | 3 +++
> >  mm/oom_kill.c      | 9 +++++++--
> >  2 files changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 0c44bb8ce544..b665ea3c30eb 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -923,6 +923,9 @@ static inline void vma_iter_set(struct vma_iterator *vmi, unsigned long addr)
> >  #define for_each_vma_range(__vmi, __vma, __end) \
> >  	while (((__vma) = vma_find(&(__vmi), (__end))) != NULL)
> >
> > +#define for_each_vma_reverse(__vmi, __vma) \
> > +	while (((__vma) = vma_prev(&(__vmi))) != NULL)
>
> Please don't casually add an undocumented public VMA iterator hidden in
> a patch doing something else :)

Sorry, I got it.

> Won't this skip the first VMA? Not sure this is really worth having as
> a general thing anyway, it's not many people who want to do this in
> reverse.
>
> > +
> >  #ifdef CONFIG_SHMEM
> >  /*
> >   * The vma_is_shmem is not inline because it is used only by slow
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 7ae4001e47c1..602d6836098a 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -517,7 +517,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> >  {
> >  	struct vm_area_struct *vma;
> >  	bool ret = true;
> > -	VMA_ITERATOR(vmi, mm, 0);
> > +	VMA_ITERATOR(vmi, mm, ULONG_MAX);
> >
> >  	/*
> >  	 * Tell all users of get_user/copy_from_user etc... that the content
> > @@ -527,7 +527,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> >  	 */
> >  	set_bit(MMF_UNSTABLE, &mm->flags);
> >
> > -	for_each_vma(vmi, vma) {
> > +	/*
> > +	 * When two tasks unmap the same vma at the same time, they may contend
> > +	 * for the pte spinlock. To avoid traversing the same vma as exit_mmap()
> > +	 * does, traverse the vma maple tree in reverse order.
> > +	 */
>
> Except you won't necessarily avoid anything, as if one walker is faster
> than the other they'll run ahead, plus of course they'll have a
> cross-over where they share the same PTE anyway.
>
> I feel like maybe you've got a fairly specific situation that indicates
> an issue elsewhere and you're maybe solving the wrong problem here?

Thank you, I will reconfirm this issue.

> > +	for_each_vma_reverse(vmi, vma) {
> >  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
> >  			continue;
> >
> > --
> > 2.17.1
> >
> >
On Fri, Aug 15, 2025 at 03:29:24PM +0100, Lorenzo Stoakes wrote:
> On Thu, Aug 14, 2025 at 09:55:55PM +0800, zhongjinji@honor.com wrote:
> > From: zhongjinji <zhongjinji@honor.com>
> >
> > When a process is OOM killed, the OOM reaper and the thread running
> > exit_mmap() may execute at the same time, and both traverse the vma's
> > maple tree along the same path. They can easily unmap the same vma and
> > contend for the pte spinlock, adding unnecessary load and increasing
> > the execution time of both the OOM reaper and the thread running
> > exit_mmap().
>
> You're not giving any numbers, and this seems pretty niche, you really
> exiting that many processes with the reaper running at the exact same
> time that this is an issue? Waiting on a spinlock also?
>
> This commit message is very unconvincing.
>
> >
> > When a process exits, exit_mmap() traverses the vma's maple tree from
> > low to high address. To reduce the chance of unmapping the same vma at
> > the same time, the OOM reaper should traverse the tree from high to
> > low address. This reduces lock contention when they unmap the same vma.
>
> Are they going to run through and do their work in exactly the same
> time, or might one 'run past' the other and you still have an issue?
>
> Seems very vague and timing dependent and again, not convincing.
>
> >
> > Signed-off-by: zhongjinji <zhongjinji@honor.com>
> > ---
> >  include/linux/mm.h | 3 +++
> >  mm/oom_kill.c      | 9 +++++++--
> >  2 files changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 0c44bb8ce544..b665ea3c30eb 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -923,6 +923,9 @@ static inline void vma_iter_set(struct vma_iterator *vmi, unsigned long addr)
> >  #define for_each_vma_range(__vmi, __vma, __end) \
> >  	while (((__vma) = vma_find(&(__vmi), (__end))) != NULL)
> >
> > +#define for_each_vma_reverse(__vmi, __vma) \
> > +	while (((__vma) = vma_prev(&(__vmi))) != NULL)
>
> Please don't casually add an undocumented public VMA iterator hidden in
> a patch doing something else :)
>
> Won't this skip the first VMA? Not sure this is really worth having as
> a general thing anyway, it's not many people who want to do this in
> reverse.
>
> > +
> >  #ifdef CONFIG_SHMEM
> >  /*
> >   * The vma_is_shmem is not inline because it is used only by slow
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 7ae4001e47c1..602d6836098a 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -517,7 +517,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> >  {
> >  	struct vm_area_struct *vma;
> >  	bool ret = true;
> > -	VMA_ITERATOR(vmi, mm, 0);
> > +	VMA_ITERATOR(vmi, mm, ULONG_MAX);
> >
> >  	/*
> >  	 * Tell all users of get_user/copy_from_user etc... that the content
> > @@ -527,7 +527,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> >  	 */
> >  	set_bit(MMF_UNSTABLE, &mm->flags);
> >
> > -	for_each_vma(vmi, vma) {
> > +	/*
> > +	 * When two tasks unmap the same vma at the same time, they may contend
> > +	 * for the pte spinlock. To avoid traversing the same vma as exit_mmap()
> > +	 * does, traverse the vma maple tree in reverse order.
> > +	 */
>
> Except you won't necessarily avoid anything, as if one walker is faster
> than the other they'll run ahead, plus of course they'll have a
> cross-over where they share the same PTE anyway.

OK I guess what is happening here is very likely one task will be
faster, the other slower, but like a slow train ahead of a fast one on
a single line, if it happens to get the lock first it'll hold up the
first one over and over again if the same PTEs are traversed.

Still, again this is super timing dependent, it still feels like the
wrong solution and something of a hack, and it really needs to be
backed/explained more thoroughly.

The remaining comments still apply.

> I feel like maybe you've got a fairly specific situation that indicates
> an issue elsewhere and you're maybe solving the wrong problem here?
>
> > +	for_each_vma_reverse(vmi, vma) {
> >  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
> >  			continue;
> >
> > --
> > 2.17.1
> >
> >
On Thu, 14 Aug 2025 21:55:55 +0800 <zhongjinji@honor.com> wrote:

> When a process is OOM killed, the OOM reaper and the thread running
> exit_mmap() may execute at the same time, and both traverse the vma's
> maple tree along the same path. They can easily unmap the same vma and
> contend for the pte spinlock, adding unnecessary load and increasing
> the execution time of both the OOM reaper and the thread running
> exit_mmap().

Please tell me what I'm missing here.

OOM kills are a rare event. And this race sounds like it will rarely
occur even if an oom-killing is happening. And the delay will be
relatively short.

If I'm correct then we're addressing rare*rare*small, so why bother?

> When a process exits, exit_mmap() traverses the vma's maple tree from
> low to high address. To reduce the chance of unmapping the same vma at
> the same time, the OOM reaper should traverse the tree from high to
> low address. This reduces lock contention when they unmap the same vma.

Sharing some before-and-after runtime measurements would be useful. Or
at least, detailed anecdotes.
> On Thu, 14 Aug 2025 21:55:55 +0800 <zhongjinji@honor.com> wrote:
> > When a process is OOM killed, the OOM reaper and the thread running
> > exit_mmap() may execute at the same time, and both traverse the vma's
> > maple tree along the same path. They can easily unmap the same vma and
> > contend for the pte spinlock, adding unnecessary load and increasing
> > the execution time of both the OOM reaper and the thread running
> > exit_mmap().
>
> Please tell me what I'm missing here.
>
> OOM kills are a rare event. And this race sounds like it will rarely
> occur even if an oom-killing is happening. And the delay will be
> relatively short.
>
> If I'm correct then we're addressing rare*rare*small, so why bother?

When there are apps that consume a large amount of memory, encountering
OOM on low-memory Android devices is not uncommon. On Android devices,
programs like lmkd (a user-space daemon in the Android system) also call
process_mrelease() to reap memory when an app is killed.

> > When a process exits, exit_mmap() traverses the vma's maple tree from
> > low to high address. To reduce the chance of unmapping the same vma at
> > the same time, the OOM reaper should traverse the tree from high to
> > low address. This reduces lock contention when they unmap the same vma.
>
> Sharing some before-and-after runtime measurements would be useful. Or
> at least, detailed anecdotes.

Here is my test data on Android. The test process is as follows: start
the same app, then kill it, and finally capture the perfetto trace. In
the test, the OOM reaper is triggered by intercepting the kill signal
and actively adding the process to the OOM reaper queue, as the OOM
killer does.

Note: #RxComputationT, vdp:vidtask:m, and tp-background are threads of
the same process, and they are the last threads to exit.

Thread           TID    State    Wall duration (ms)
# with oom reaper and traverse reverse
#RxComputationT  13708  Running  60.690572
oom_reaper       81     Running  46.492032
# with oom reaper but traverses
vdp:vidtask:m    14040  Running  81.848297
oom_reaper       81     Running  69.32
# without oom reaper
tp-background    12424  Running  106.021874