When a process is OOM killed without reaper delay, the OOM reaper and the
exit_mmap() thread are likely to run concurrently. They traverse the VMA
maple tree along the same path and can easily try to unmap the same vma,
competing for the pte spinlock.

When a process exits, exit_mmap() traverses the maple tree from low to
high addresses. To reduce the chance of both tasks unmapping the same vma
at the same time, make the OOM reaper traverse the tree from high to low
addresses.

Signed-off-by: zhongjinji <zhongjinji@honor.com>
---
mm/oom_kill.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4b4d73b1e00d..a0650da9ec9c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -516,7 +516,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
 	bool ret = true;
-	VMA_ITERATOR(vmi, mm, 0);
+	MA_STATE(mas, &mm->mm_mt, ULONG_MAX, 0);
 
 	/*
 	 * Tell all users of get_user/copy_from_user etc... that the content
@@ -526,7 +526,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
 	 */
 	set_bit(MMF_UNSTABLE, &mm->flags);
 
-	for_each_vma(vmi, vma) {
+	/*
+	 * When two tasks unmap the same vma at the same time, they may contend for
+	 * the pte spinlock. To reduce the probability of them unmapping the same
+	 * vma, the oom reaper traverses the vma maple tree in reverse order.
+	 */
+	while ((vma = mas_find_rev(&mas, 0)) != NULL) {
 		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
 			continue;

--
2.17.1
On Mon, Aug 25, 2025 at 09:38:55PM +0800, zhongjinji wrote:
> When a process is OOM killed without reaper delay, the OOM reaper and the
> exit_mmap() thread are likely to run concurrently. They traverse the VMA
> maple tree along the same path and can easily try to unmap the same vma,
> competing for the pte spinlock.
>
> When a process exits, exit_mmap() traverses the maple tree from low to
> high addresses. To reduce the chance of both tasks unmapping the same vma
> at the same time, make the OOM reaper traverse the tree from high to low
> addresses.
>
> Signed-off-by: zhongjinji <zhongjinji@honor.com>

I will leave it to Liam to confirm the maple tree bit is ok, but I guess
I'm softening to the idea of doing this - because it should have no impact
on most users, so even if it's some rare edge case that triggers the
situation, then it's worth doing it in reverse just to help you guys out :)

Liam - please confirm this is good from your side, and then I can add a tag!

Cheers, Lorenzo

[...]
> +	/*
> +	 * When two tasks unmap the same vma at the same time, they may contend for
> +	 * the pte spinlock. To reduce the probability of them unmapping the same
> +	 * vma, the oom reaper traverses the vma maple tree in reverse order.
> +	 */
> +	while ((vma = mas_find_rev(&mas, 0)) != NULL) {

It's a pity there isn't a nicer formulation of this but this is probably
the least worst way of doing it.

>  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
>  			continue;
>
> --
> 2.17.1
On Tue 26-08-25 13:53:43, Lorenzo Stoakes wrote:
> On Mon, Aug 25, 2025 at 09:38:55PM +0800, zhongjinji wrote:
[...]
> I will leave it to Liam to confirm the maple tree bit is ok, but I guess
> I'm softening to the idea of doing this - because it should have no impact
> on most users, so even if it's some rare edge case that triggers the
> situation, then it's worth doing it in reverse just to help you guys out :)

I tend to agree on this. I would expect that oom_reaper would race with
exit_mmap only for seriously stalled processes or huge memory consumers
where exit_mmap takes a while. oom_reaper would be quick to catch up with
exit_mmap as it would have less work to do on already released memory.
Going from the other end of the address space might reduce this chance
while not really introducing any actual issues.
--
Michal Hocko
SUSE Labs
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250826 08:53]:
> On Mon, Aug 25, 2025 at 09:38:55PM +0800, zhongjinji wrote:
> > When a process is OOM killed without reaper delay, the OOM reaper and the
> > exit_mmap() thread are likely to run concurrently. They traverse the VMA
> > maple tree along the same path and can easily try to unmap the same vma,
> > competing for the pte spinlock.
> >
> > When a process exits, exit_mmap() traverses the maple tree from low to
> > high addresses. To reduce the chance of both tasks unmapping the same vma
> > at the same time, make the OOM reaper traverse the tree from high to low
> > addresses.
> >
> > Signed-off-by: zhongjinji <zhongjinji@honor.com>
>
> I will leave it to Liam to confirm the maple tree bit is ok, but I guess
> I'm softening to the idea of doing this - because it should have no impact
> on most users, so even if it's some rare edge case that triggers the
> situation, then it's worth doing it in reverse just to help you guys out :)

I really don't think this is worth doing.  We're avoiding a race between
oom and a task unmap - the MMF bits should be used to avoid this race -
or at least mitigate it.

They are probably both under the read lock, but considering how rare it
would be, would a racy flag check be enough - it is hardly critical to
get right.  Either would reduce the probability.

> Liam - please confirm this is good from your side, and then I can add a tag!
>
> Cheers, Lorenzo
>
[...]
> > -	VMA_ITERATOR(vmi, mm, 0);
> > +	MA_STATE(mas, &mm->mm_mt, ULONG_MAX, 0);
                                  ^^^^^^^^^  ^^

You have set the index larger than the last.  It (probably?) works, but
isn't correct and may stop working, so let's fix it.

MA_STATE(mas, &mm->mm_mt, ULONG_MAX, ULONG_MAX);

[...]
> > +	while ((vma = mas_find_rev(&mas, 0)) != NULL) {
>
> It's a pity there isn't a nicer formulation of this but this is probably
> the least worst way of doing it.

mas_for_each_rev() exists for this use case.  You will find that the
implementation is very close to what you see here. :)

> >  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
> >  			continue;
> >
> > --
> > 2.17.1
> >
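
[Editor's note: pulling Liam's two suggestions together - initialising
index and last both to ULONG_MAX, and using the existing helper - the
reworked loop would presumably look something like this untested sketch;
it is not code posted in this thread:

	MA_STATE(mas, &mm->mm_mt, ULONG_MAX, ULONG_MAX);
	...
	/* walk the maple tree high-to-low, down to address 0 */
	mas_for_each_rev(&mas, vma, 0) {
		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
			continue;
		/* reap the vma as the existing loop body does */
	}
]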
On Tue, Aug 26, 2025 at 09:37:22AM -0400, Liam R. Howlett wrote:
> I really don't think this is worth doing.  We're avoiding a race between
> oom and a task unmap - the MMF bits should be used to avoid this race -
> or at least mitigate it.

Yes for sure, as explored at length in previous discussions this feels like
we're papering over cracks here.

_However_, I'm sort of ok with a minimalistic fix that solves the proximate
issue even if it is that, as long as it doesn't cause issues in doing so.

So this is my take on the below and why I'm open to it!

> They are probably both under the read lock, but considering how rare it
> would be, would a racy flag check be enough - it is hardly critical to
> get right.  Either would reduce the probability.

zhongjinji - I'm still not sure that you've really indicated _why_ you're
seeing such a tight and unusual race. Presumably some truly massive number
of tasks being OOM'd and unmapping but... yeah that seems odd anyway.

But again, if we can safely fix this in a way that doesn't hurt stuff too
much I'm ok with it (of course, these are famous last words in the kernel
often...!)

Liam - are you open to a solution on the basis above, or do you feel we
ought simply to fix the underlying issue here?

To me we're at a simple enough implementation of this (esp. utilising the
helper you mention) that probably this is fine (like the meme, or...
hopefully not :)

I will go with your judgment here!

Cheers, Lorenzo
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250826 09:50]:
> On Tue, Aug 26, 2025 at 09:37:22AM -0400, Liam R. Howlett wrote:
> > I really don't think this is worth doing.  We're avoiding a race between
> > oom and a task unmap - the MMF bits should be used to avoid this race -
> > or at least mitigate it.
>
> Yes for sure, as explored at length in previous discussions this feels like
> we're papering over cracks here.
>
> _However_, I'm sort of ok with a minimalistic fix that solves the proximate
> issue even if it is that, as long as it doesn't cause issues in doing so.
>
> So this is my take on the below and why I'm open to it!
>
> > They are probably both under the read lock, but considering how rare it
> > would be, would a racy flag check be enough - it is hardly critical to
> > get right.  Either would reduce the probability.
>
> zhongjinji - I'm still not sure that you've really indicated _why_ you're
> seeing such a tight and unusual race. Presumably some truly massive number
> of tasks being OOM'd and unmapping but... yeah that seems odd anyway.
>
> But again, if we can safely fix this in a way that doesn't hurt stuff too
> much I'm ok with it (of course, these are famous last words in the kernel
> often...!)
>
> Liam - are you open to a solution on the basis above, or do you feel we
> ought simply to fix the underlying issue here?

At least this is a benign race.  I'd think using MMF_ to reduce the race
would achieve the same goal with less risk - which is why I bring it up.

Really, both methods should be low risk, so I'm fine with either way.

But I am interested in hearing how this race is happening enough to
necessitate a fix.  Reversing the iterator is a one-spot fix - if this
happens elsewhere then we're out of options.  Using the MMF_ flags is
more of a scalable fix, if it achieves the same results.

> To me we're at a simple enough implementation of this (esp. utilising the
> helper you mention) that probably this is fine (like the meme, or...
> hopefully not :)
>
> I will go with your judgment here!
>
> Cheers, Lorenzo
On Tue, Aug 26, 2025 at 11:21:13AM -0400, Liam R. Howlett wrote:
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250826 09:50]:
...
>
> At least this is a benign race.

Is this really a race or rather contention? IIUC exit_mmap and the oom
reaper are trying to unmap the address space of the oom-killed process
and can compete on page table locks. If both are running concurrently on
two cpus then the contention can continue for the whole address space and
can slow down the actual memory freeing. Making the oom reaper traverse in
the opposite direction can drastically reduce the contention and speed up
memory freeing.

> I'd think using MMF_ to reduce the race
> would achieve the same goal with less risk - which is why I bring it up.

With the MMF_ flag, are you suggesting the oom reaper skip unmapping the
oom-killed process?

> Really, both methods should be low risk, so I'm fine with either way.
>
> But I am interested in hearing how this race is happening enough to
> necessitate a fix.  Reversing the iterator is a one-spot fix - if this
> happens elsewhere then we're out of options.  Using the MMF_ flags is
> more of a scalable fix, if it achieves the same results.

On the question of whether this is a rare situation worth the patch: I
would say this scenario is not that rare, particularly on low memory
devices and on highly utilized, overcommitted systems. Memory pressure
and oom-kills are the norm on such systems. The point of the oom reaper
is to bring the system out of the oom situation quickly, and having two
cpus unmapping the oom-killed process can potentially bring the system
out of the oom situation faster.

I think the patch (with your suggestions) is simple enough and I don't
see any risk in including it.
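
[Editor's note: a toy user-space model of the behaviour Shakeel and Liam
are debating may help readers outside mm. Two workers sweep one array of
locks standing in for an address space's split pte locks; run them in the
same direction and they can collide on lock after lock, run them in
opposite directions and they contend only near the crossing point. All
names and sizes here are invented, and real pte locks differ in detail:

	/* contention-model.c - illustrative only, NOT kernel code.
	 * Build: gcc -O2 -pthread contention-model.c -o contention-model
	 */
	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdio.h>

	#define NLOCKS 65536

	static pthread_mutex_t locks[NLOCKS];
	static atomic_long contended;	/* sweeps that found a lock busy */

	static void sweep(int forward)
	{
		for (int n = 0; n < NLOCKS; n++) {
			int i = forward ? n : NLOCKS - 1 - n;

			/* trylock first so contention can be counted */
			if (pthread_mutex_trylock(&locks[i]) != 0) {
				atomic_fetch_add(&contended, 1);
				pthread_mutex_lock(&locks[i]);
			}
			/* real code would unmap pages here */
			pthread_mutex_unlock(&locks[i]);
		}
	}

	static void *fwd(void *arg) { (void)arg; sweep(1); return NULL; }
	static void *rev(void *arg) { (void)arg; sweep(0); return NULL; }

	int main(void)
	{
		pthread_t exiter, reaper;

		for (int i = 0; i < NLOCKS; i++)
			pthread_mutex_init(&locks[i], NULL);

		pthread_create(&exiter, NULL, fwd, NULL);
		/* swap rev for fwd below to model the same-direction case */
		pthread_create(&reaper, NULL, rev, NULL);
		pthread_join(exiter, NULL);
		pthread_join(reaper, NULL);

		printf("contended acquisitions: %ld\n", atomic_load(&contended));
		return 0;
	}
]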
+ Cc Suren since he has worked on the exit_mmap() path a lot.

* Shakeel Butt <shakeel.butt@linux.dev> [250826 18:26]:
> On Tue, Aug 26, 2025 at 11:21:13AM -0400, Liam R. Howlett wrote:
...
> > At least this is a benign race.
>
> Is this really a race or rather contention? IIUC exit_mmap and the oom
> reaper are trying to unmap the address space of the oom-killed process
> and can compete on page table locks. If both are running concurrently on
> two cpus then the contention can continue for the whole address space and
> can slow down the actual memory freeing. Making the oom reaper traverse in
> the opposite direction can drastically reduce the contention and speed up
> memory freeing.

It is two readers of the vma tree racing to lock the page tables for
each vma, so I guess you can see it as contention as well.. but since
the pte is a split lock, I see it as racing through vmas to see who hits
which lock first.  The smart money is on the oom killer as it skips some
vmas :)

If it were just contention, then the loop direction wouldn't matter..
but I do see your point.

> > I'd think using MMF_ to reduce the race
> > would achieve the same goal with less risk - which is why I bring it up.
>
> With the MMF_ flag, are you suggesting the oom reaper skip unmapping the
> oom-killed process?

Yes, specifically move the MMF_OOM_SKIP flag to earlier in the exit
path to reduce the possibility of the race/contention.

> > Really, both methods should be low risk, so I'm fine with either way.
> >
> > But I am interested in hearing how this race is happening enough to
> > necessitate a fix.  Reversing the iterator is a one-spot fix - if this
> > happens elsewhere then we're out of options.  Using the MMF_ flags is
> > more of a scalable fix, if it achieves the same results.
>
> On the question of whether this is a rare situation worth the patch: I
> would say this scenario is not that rare, particularly on low memory
> devices and on highly utilized, overcommitted systems. Memory pressure
> and oom-kills are the norm on such systems. The point of the oom reaper
> is to bring the system out of the oom situation quickly, and having two
> cpus unmapping the oom-killed process can potentially bring the system
> out of the oom situation faster.

The exit_mmap() path used to run the oom reaper if it was an oom victim,
until recently [1].  The part that makes me nervous is the exit_mmap()
call to mmu_notifier_release(mm), while the oom reaper uses an
mmu_notifier.  I am not sure if there is an issue in ordering on any of
the platforms of such things.  Or the associated cost of the calls.

I mean, it's already pretty crazy that we have this in the exit:
	mmap_read_lock()
	tlb_gather_mmu_fullmm()
	unmap vmas..
	mmap_read_unlock()
	mmap_write_lock()
	tlb_finish_mmu()..

So not only do we now have two tasks iterating over the vmas, but we
also have mmu notifiers and tlb calls happening across the ranges..  At
least doing all the work on a single cpu means that the hardware view is
consistent.  But I don't see this being worse than a forward race?

As it is written here, we'll have one CPU working in one direction while
the other works in the other, until both hit the end of the VMAs.  Only
when both tasks stop iterating the vmas can the exit continue since it
requires the write lock.

So the tlb_finish_mmu() in exit_mmap() will always be called after
tlb_finish_mmu() on each individual vma has run in the
__oom_reap_task_mm() context (when the race happens).

There is also a window here, between exit_mmap() dropping the read
lock, setting MMF_OOM_SKIP, and taking the write lock - where the oom
reaper will iterate through a list of vmas with zero memory to free and
delay the task exiting.  That is, wasting cpu and stopping the memory
associated with the mm_struct (vmas and such) from being freed.

I'm also not sure about the cpu cache effects of what we are doing and
how much that would play into the speedup.  My guess is that it's
insignificant compared to the time we spend under the pte, but we have
no numbers to go on.

So I'd like to know how likely the simultaneous runs are and if there is
a measurable gain?

I agree that, at face value, two cpus should be able to split the work..
but I don't know about the notifier or the holding up of the mm_struct
associated memory.  And it could slow things down by holding up an
exiting task.

> I think the patch (with your suggestions) is simple enough and I don't
> see any risk in including it.

Actually, the more I look at this, the worse I feel about it.. Am I
overreacting?

Thanks,
Liam

[1] https://elixir.bootlin.com/linux/v6.0.19/source/mm/mmap.c#L3085
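
[Editor's note: a simplified sketch of the window Liam describes,
reconstructed from his sequence above rather than from the actual
exit_mmap() source:

	mmap_read_lock(mm);
	tlb_gather_mmu_fullmm(&tlb, mm);
	unmap_vmas(...);		/* reaper may sweep concurrently */
	mmap_read_unlock(mm);

	/*
	 * Window: the reaper can still take the read lock here and walk
	 * vmas whose memory is already gone, delaying the exit below.
	 */
	set_bit(MMF_OOM_SKIP, &mm->flags);

	mmap_write_lock(mm);		/* waits out the reaper's read lock */
	tlb_finish_mmu(&tlb);
]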
On Wed 27-08-25 00:12:34, Liam R. Howlett wrote:
[...]
> > > I'd think using MMF_ to reduce the race
> > > would achieve the same goal with less risk - which is why I bring it up.
> >
> > With the MMF_ flag, are you suggesting the oom reaper skip unmapping
> > the oom-killed process?
>
> Yes, specifically move the MMF_OOM_SKIP flag to earlier in the exit
> path to reduce the possibility of the race/contention.

MMF_OOM_SKIP is a "hide from oom killer" flag, and that means that as
soon as you set it, the next oom killer invocation would kill another
task while the current one hasn't freed up the memory yet, and therefore
you have a potentially pointless kill.
--
Michal Hocko
SUSE Labs
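
[Editor's note: the select-another-victim behaviour Michal describes
comes from the victim scan in mm/oom_kill.c; roughly, and paraphrased
rather than quoted verbatim:

	/* while evaluating tasks as oom-kill candidates */
	if (tsk_is_oom_victim(task)) {
		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
			goto next;	/* victim hidden: pick a new victim */
		goto abort;		/* victim not reaped yet: no new kill */
	}

Setting MMF_OOM_SKIP earlier therefore widens the window in which a
second, possibly pointless, kill can happen.]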
> + Cc Suren since he has worked on the exit_mmap() path a lot.

Thank you for your assistance. I realize now that I should have
Cc'd Suren earlier.

> * Shakeel Butt <shakeel.butt@linux.dev> [250826 18:26]:
...
>
> So I'd like to know how likely the simultaneous runs are and if there is
> a measurable gain?

Since process killing events are very frequent on Android, the likelihood
of exit_mmap and reaper work (not only OOM, but also some proactive
reaping actions such as process_mrelease) occurring at the same time is
much higher. When lmkd kills a process, it actively reaps the process
using process_mrelease, similar to the way the OOM reaper works. Surenb
may be able to clarify this point, as he is the author of lmkd.

I referenced this data in link [1], and I should have included it in the
cover letter. The attached test data was collected on Android. Before
testing, in order to simulate the OOM kill process, I intercepted the
kill signal and added the killed process to the OOM reaper queue.

The reproduction steps are as follows:
1. Start a process.
2. Kill the process.
3. Capture a perfetto trace.

Below are the load benefit data, measured by process running time:

Note: #RxComputationT, vdp:vidtask:m, and tp-background are threads of
the same process, and they are the last threads to exit.

Thread           TID    State    Wall duration (ms)  Total running (ms)
# with oom reaper, traversing in reverse
#RxComputationT  13708  Running  60.690572
oom_reaper       81     Running  46.492032           107.182604

# with oom reaper
vdp:vidtask:m    14040  Running  81.848297
oom_reaper       81     Running  69.32               151.168297

# without oom reaper
tp-background    12424  Running  106.021874          106.021874

Compared to reaping when a process is killed (forward traversal), this
provides approximately a 30% load benefit (151.17 ms -> 107.18 ms total
running time). Compared to not reaping at all when a process is killed,
it releases memory earlier, with roughly 40% faster memory release
(~60.7 ms vs ~106.0 ms before the last thread finishes).

[1] https://lore.kernel.org/all/20250815163207.7078-1-zhongjinji@honor.com/

> I agree that, at face value, two cpus should be able to split the work..
> but I don't know about the notifier or the holding up of the mm_struct
> associated memory.  And it could slow things down by holding up an
> exiting task.
>
> > I think the patch (with your suggestions) is simple enough and I don't
> > see any risk in including it.
>
> Actually, the more I look at this, the worse I feel about it.. Am I
> overreacting?
>
> Thanks,
> Liam
>
> [1] https://elixir.bootlin.com/linux/v6.0.19/source/mm/mmap.c#L3085
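
[Editor's note: for readers unfamiliar with the lmkd flow described
above, a minimal user-space sketch of "kill, then proactively reap" via
process_mrelease(2), available since Linux 5.15. Error handling is
trimmed and the helper name is invented:

	#include <sys/syscall.h>
	#include <sys/types.h>
	#include <signal.h>
	#include <stddef.h>
	#include <unistd.h>

	/* roughly what lmkd does after choosing a victim */
	static void kill_and_reap(pid_t pid)
	{
		int pidfd = syscall(SYS_pidfd_open, pid, 0);

		if (pidfd < 0)
			return;
		syscall(SYS_pidfd_send_signal, pidfd, SIGKILL, NULL, 0);
		syscall(SYS_process_mrelease, pidfd, 0);  /* flags must be 0 */
		close(pidfd);
	}
]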
On Wed, Aug 27, 2025 at 2:55 AM zhongjinji <zhongjinji@honor.com> wrote:
>
> > + Cc Suren since he has worked on the exit_mmap() path a lot.
>
> Thank you for your assistance. I realize now that I should have
> Cc'd Suren earlier.

Thanks for adding me!

...

> > The exit_mmap() path used to run the oom reaper if it was an oom
> > victim, until recently [1].  The part that makes me nervous is the
> > exit_mmap() call to mmu_notifier_release(mm), while the oom reaper
> > uses an mmu_notifier.  I am not sure if there is an issue in ordering
> > on any of the platforms of such things.  Or the associated cost of
> > the calls.
> >
> > I mean, it's already pretty crazy that we have this in the exit:
> >	mmap_read_lock()
> >	tlb_gather_mmu_fullmm()
> >	unmap vmas..
> >	mmap_read_unlock()
> >	mmap_write_lock()
> >	tlb_finish_mmu()..
> >
> > So not only do we now have two tasks iterating over the vmas, but we
> > also have mmu notifiers and tlb calls happening across the ranges..
> > At least doing all the work on a single cpu means that the hardware
> > view is consistent.  But I don't see this being worse than a forward
> > race?

This part seems to have changed quite a bit since I last looked into
it closely and it's worth re-checking, however that seems orthogonal
to what this patch is trying to do.

> > As it is written here, we'll have one CPU working in one direction
> > while the other works in the other, until both hit the end of the
> > VMAs.  Only when both tasks stop iterating the vmas can the exit
> > continue since it requires the write lock.
> >
> > So the tlb_finish_mmu() in exit_mmap() will always be called after
> > tlb_finish_mmu() on each individual vma has run in the
> > __oom_reap_task_mm() context (when the race happens).

Correct.

> > There is also a window here, between exit_mmap() dropping the read
> > lock, setting MMF_OOM_SKIP, and taking the write lock - where the oom
> > reaper will iterate through a list of vmas with zero memory to free
> > and delay the task exiting.  That is, wasting cpu and stopping the
> > memory associated with the mm_struct (vmas and such) from being freed.

Might be an opportunity to optimize but again, this is happening with
or without this patch, no?

> > I'm also not sure about the cpu cache effects of what we are doing
> > and how much that would play into the speedup.  My guess is that it's
> > insignificant compared to the time we spend under the pte, but we
> > have no numbers to go on.
> >
> > So I'd like to know how likely the simultaneous runs are and if there
> > is a measurable gain?
>
> Since process killing events are very frequent on Android, the likelihood
> of exit_mmap and reaper work (not only OOM, but also some proactive
> reaping actions such as process_mrelease) occurring at the same time is
> much higher. When lmkd kills a process, it actively reaps the process
> using process_mrelease, similar to the way the OOM reaper works. Surenb
> may be able to clarify this point, as he is the author of lmkd.

Yes, on Android process_mrelease() is used after lmkd kills a process
to expedite memory release in case the victim process is blocked for
some reason. This makes the race between __oom_reap_task_mm() and
exit_mmap() much more frequent. That is probably the disconnect in
thinking that this race is rare. I don't see any harm in
__oom_reap_task_mm() walking the tree backwards to minimize the
contention with exit_mmap(). Liam, is there a performance difference
between mas_for_each_rev() and mas_for_each()?

> I referenced this data in link [1], and I should have included it in the
> cover letter. The attached test data was collected on Android.
>
...
>
> Compared to reaping when a process is killed (forward traversal), this
> provides approximately a 30% load benefit (151.17 ms -> 107.18 ms total
> running time). Compared to not reaping at all when a process is killed,
> it releases memory earlier, with roughly 40% faster memory release
> (~60.7 ms vs ~106.0 ms before the last thread finishes).

That looks like a nice performance improvement for reaping the memory
with low churn and risk.

> [1] https://lore.kernel.org/all/20250815163207.7078-1-zhongjinji@honor.com/
* Suren Baghdasaryan <surenb@google.com> [250827 11:57]:

...

> > The exit_mmap() path used to run the oom reaper if it was an oom
> > victim, until recently [1]. ...
>
> This part seems to have changed quite a bit since I last looked into
> it closely and it's worth re-checking, however that seems orthogonal
> to what this patch is trying to do.

I was concerned about how a reverse iterator may affect what is
considered accurate for the mmu/tlb so I thought it worth pointing out.

...

> > There is also a window here, between exit_mmap() dropping the read
> > lock, setting MMF_OOM_SKIP, and taking the write lock - where the oom
> > reaper will iterate through a list of vmas with zero memory to free
> > and delay the task exiting. ...
>
> Might be an opportunity to optimize but again, this is happening with
> or without this patch, no?

Correct, but with the numbers it looks to be better to go with two loops.

...

> Yes, on Android process_mrelease() is used after lmkd kills a process
> to expedite memory release in case the victim process is blocked for
> some reason. This makes the race between __oom_reap_task_mm() and
> exit_mmap() much more frequent. That is probably the disconnect in
> thinking that this race is rare. I don't see any harm in
> __oom_reap_task_mm() walking the tree backwards to minimize the
> contention with exit_mmap(). Liam, is there a performance difference
> between mas_for_each_rev() and mas_for_each()?

There should be no performance difference.

> > I referenced this data in link [1], and I should have included it in
> > the cover letter. The attached test data was collected on Android.

Sorry I missed your response in v4 on this.

...

> > Compared to reaping when a process is killed (forward traversal), this
> > provides approximately a 30% load benefit (151.17 ms -> 107.18 ms total
> > running time). Compared to not reaping at all when a process is killed,
> > it releases memory earlier, with roughly 40% faster memory release
> > (~60.7 ms vs ~106.0 ms before the last thread finishes).
>
> That looks like a nice performance improvement for reaping the memory
> with low churn and risk.

Agreed.  Please include the numbers in the change log in the next
revision so it is recorded in git.

I think all my questions are resolved, thanks.  I look forward to v6.

Regards,
Liam
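
[Editor's note: the symmetry behind "no performance difference" is
visible in the iterator helpers themselves; paraphrased from
include/linux/maple_tree.h, both are the same thin wrapper over the
corresponding find call:

	#define mas_for_each(__mas, __entry, __max) \
		while (((__entry) = mas_find((__mas), (__max))) != NULL)

	#define mas_for_each_rev(__mas, __entry, __min) \
		while (((__entry) = mas_find_rev((__mas), (__min))) != NULL)
]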
..actually add Suren this time.

* Liam R. Howlett <Liam.Howlett@oracle.com> [250827 00:12]:
> + Cc Suren since he has worked on the exit_mmap() path a lot.
>
> [rest of the message quoted unchanged; see the same mail earlier in
> the thread]