Before you read on, please take a moment to acknowledge that David
Hildenbrand asked for this, so I'm blaming mostly him :)

It is possible that the dup_mmap() call fails on allocating or setting
up a vma after the maple tree of the oldmm is copied.  Today, that
failure point is marked by inserting an XA_ZERO entry over the failure
point so that the exact location does not need to be communicated
through to exit_mmap().

However, a race exists in the tear down process because dup_mmap()
drops the mmap lock before exit_mmap() can remove the partially set up
vma tree.  This means that other tasks may get to the mm tree and find
the invalid vma pointer (since it's an XA_ZERO entry), even though the
mm is marked as MMF_OOM_SKIP and MMF_UNSTABLE.

To remove the race fully, the tree must be cleaned up before dropping
the lock.  This is accomplished by extracting the vma cleanup from
exit_mmap() and changing the required functions to pass through the vma
search limit.

This does run the risk of increasing the possibility of finding no vmas
(which is already possible!) in code that isn't careful.

The passing of so many limits and variables was such a mess when
dup_mmap() was introduced that it was avoided in favour of the XA_ZERO
entry marker, but since the swap case was the second time we've hit
cases of walking an almost-dead mm, here's the alternative to checking
MMF_UNSTABLE before wandering into other mm structs.  A rough sketch of
the failure path under discussion follows at the end of this mail.

[1]. https://lore.kernel.org/all/2e8df53b-d953-43fb-9c69-7d7d60e95c9a@redhat.com/

Liam R. Howlett (6):
  mm/mmap: Move exit_mmap() trace point
  mm/mmap: Abstract vma clean up from exit_mmap()
  mm/vma: Add limits to unmap_region() for vmas
  mm/memory: Add tree limit to free_pgtables()
  mm/vma: Add page table limit to unmap_region()
  mm: Change dup_mmap() recovery

 mm/internal.h |  4 ++-
 mm/memory.c   | 13 ++++-----
 mm/mmap.c     | 80 ++++++++++++++++++++++++++++++++++-----------------
 mm/vma.c      | 10 +++++--
 mm/vma.h      |  1 +
 5 files changed, 70 insertions(+), 38 deletions(-)

-- 
2.47.2
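For readers who have not looked at the fork path recently, the failure
handling under discussion looks roughly like the sketch below.  This is
illustrative only, not the actual kernel code: variable names
approximate kernel/fork.c, and tear_down_vmas() is a placeholder rather
than necessarily the helper the series introduces.

	/* Today's dup_mmap() error path (sketch): */
	if (retval) {
		/* Mark the failure point so exit_mmap() can find it later. */
		mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1);
		mas_store(&vmi.mas, XA_ZERO_ENTRY);
		set_bit(MMF_OOM_SKIP, &mm->flags);
		set_bit(MMF_UNSTABLE, &mm->flags);
	}
	mmap_write_unlock(mm);	/* other tasks may now find the XA_ZERO entry */
				/* ... exit_mmap() removes the tree much later */

	/* With this series (sketch): */
	if (retval) {
		/* Tear down the partially set up tree before unlocking. */
		tear_down_vmas(mm, &vmi, mpnt->vm_start);	/* placeholder name */
		set_bit(MMF_OOM_SKIP, &mm->flags);
	}
	mmap_write_unlock(mm);	/* nothing half-built is visible any more */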
On 15.08.25 21:10, Liam R. Howlett wrote:
> Before you read on, please take a moment to acknowledge that David
> Hildenbrand asked for this, so I'm blaming mostly him :)

:)

> 
> It is possible that the dup_mmap() call fails on allocating or setting
> up a vma after the maple tree of the oldmm is copied.  Today, that
> failure point is marked by inserting an XA_ZERO entry over the failure
> point so that the exact location does not need to be communicated
> through to exit_mmap().
> 
> However, a race exists in the tear down process because dup_mmap()
> drops the mmap lock before exit_mmap() can remove the partially set up
> vma tree.  This means that other tasks may get to the mm tree and find
> the invalid vma pointer (since it's an XA_ZERO entry), even though the
> mm is marked as MMF_OOM_SKIP and MMF_UNSTABLE.
> 
> To remove the race fully, the tree must be cleaned up before dropping
> the lock.  This is accomplished by extracting the vma cleanup from
> exit_mmap() and changing the required functions to pass through the vma
> search limit.
> 
> This does run the risk of increasing the possibility of finding no vmas
> (which is already possible!) in code that isn't careful.

Right, it would also happen if __mt_dup() fails I guess.

> 
> The passing of so many limits and variables was such a mess when
> dup_mmap() was introduced that it was avoided in favour of the XA_ZERO
> entry marker, but since the swap case was the second time we've hit
> cases of walking an almost-dead mm, here's the alternative to checking
> MMF_UNSTABLE before wandering into other mm structs.

Changes look fairly small and reasonable, so I really like this.

I agree with Jann that doing a partial teardown might be even better, but
code-wise I suspect it might end up with a lot more churn and weird
allocation-corner-cases to handle.

-- 
Cheers

David / dhildenb
On Mon, Aug 18, 2025 at 11:44:16AM +0200, David Hildenbrand wrote:
> On 15.08.25 21:10, Liam R. Howlett wrote:
> > Before you read on, please take a moment to acknowledge that David
> > Hildenbrand asked for this, so I'm blaming mostly him :)
> 
> :)
> 
> > 
> > It is possible that the dup_mmap() call fails on allocating or setting
> > up a vma after the maple tree of the oldmm is copied.  Today, that
> > failure point is marked by inserting an XA_ZERO entry over the failure
> > point so that the exact location does not need to be communicated
> > through to exit_mmap().
> > 
> > However, a race exists in the tear down process because dup_mmap()
> > drops the mmap lock before exit_mmap() can remove the partially set up
> > vma tree.  This means that other tasks may get to the mm tree and find
> > the invalid vma pointer (since it's an XA_ZERO entry), even though the
> > mm is marked as MMF_OOM_SKIP and MMF_UNSTABLE.
> > 
> > To remove the race fully, the tree must be cleaned up before dropping
> > the lock.  This is accomplished by extracting the vma cleanup from
> > exit_mmap() and changing the required functions to pass through the vma
> > search limit.
> > 
> > This does run the risk of increasing the possibility of finding no vmas
> > (which is already possible!) in code that isn't careful.
> 
> Right, it would also happen if __mt_dup() fails I guess.
> 
> > 
> > The passing of so many limits and variables was such a mess when
> > dup_mmap() was introduced that it was avoided in favour of the XA_ZERO
> > entry marker, but since the swap case was the second time we've hit
> > cases of walking an almost-dead mm, here's the alternative to checking
> > MMF_UNSTABLE before wandering into other mm structs.
> 
> Changes look fairly small and reasonable, so I really like this.
> 
> I agree with Jann that doing a partial teardown might be even better, but
> code-wise I suspect it might end up with a lot more churn and weird
> allocation-corner-cases to handle.

I've yet to review the series and see exactly what's proposed, but on gut
instinct (and based on past experience with the munmap gather/reattach
stuff), some kind of partial thing like this tends to end up a nightmare
of weird-stuff-you-didn't-think-about.

So I'm instinctively against this.

However, I'll take a proper look through this series shortly and hopefully
have more intelligent things to say...

An aside - I was working on a crazy anon idea over the weekend (I know, I
know) and noticed that the mm life cycle is just weird.  I observed
apparent duplicate calls of __mmdrop() for instance (I think the unwinding
just broke), the delayed mmdrop is strange, and the whole area seems rife
with complexity.

So I'm glad David talked you into doing this ;) this particular edge case
was always strange, and we have now hit it twice (this time more
seriously, as it's due to a fatal signal, which is much more likely to
arise than an OOM scenario with too-small-to-fail allocations).

BTW where are we with the hotfix for the swapoff case [0]?  I think we
agreed on setting MMF_UNSTABLE there and using that to decide not to
proceed in unuse_mm(), right?

Cheers, Lorenzo

[0]: https://lore.kernel.org/all/20250808092156.1918973-1-quic_charante@quicinc.com/
On 8/18/2025 3:14 PM, David Hildenbrand wrote:
>> Before you read on, please take a moment to acknowledge that David
>> Hildenbrand asked for this, so I'm blaming mostly him 🙂
>

Just curious if we can go with the suggestion of the MMF_UNSTABLE check
just for 6.12, as it is an LTS.  I can't see this issue on branches
earlier than it.  Also, this patchset does not apply cleanly on 6.12.

> 🙂
* Charan Teja Kalla <quic_charante@quicinc.com> [250818 10:27]:
> 
> On 8/18/2025 3:14 PM, David Hildenbrand wrote:
> >> Before you read on, please take a moment to acknowledge that David
> >> Hildenbrand asked for this, so I'm blaming mostly him 🙂
> > 
> 
> Just curious if we can go with the suggestion of the MMF_UNSTABLE check
> just for 6.12, as it is an LTS.  I can't see this issue on branches
> earlier than it.  Also, this patchset does not apply cleanly on 6.12.

This was developed against mm-new to be applied and potentially
backported, but it's an RFC right now, so it won't be applied anywhere.

I think you should still be checking MMF_UNSTABLE in the swap path, as
David suggested as a hotfix [1].  Considering it's a single mm-global
check before iterating, it should be checked regardless of what happens
with this RFC.

Thanks,
Liam

[1]. https://lore.kernel.org/linux-mm/9178bf98-2ea7-4ad8-ad43-cdcc02ab863d@redhat.com/
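For reference, that hotfix amounts to a single early bail-out before the
vma walk in the swapoff path, roughly along the lines of the sketch below
(abbreviated, and the body may not match the current unuse_mm() exactly;
only the test_bit() check is the new part):

	static int unuse_mm(struct mm_struct *mm, unsigned int type)
	{
		struct vm_area_struct *vma;
		int ret = 0;
		VMA_ITERATOR(vmi, mm, 0);

		mmap_read_lock(mm);
		/* Don't walk a tree that may still contain an XA_ZERO entry. */
		if (unlikely(test_bit(MMF_UNSTABLE, &mm->flags)))
			goto unlock;

		for_each_vma(vmi, vma) {
			if (vma->anon_vma && !is_vm_hugetlb_page(vma)) {
				ret = unuse_vma(vma, type);
				if (ret)
					break;
			}
			cond_resched();
		}
	unlock:
		mmap_read_unlock(mm);
		return ret;
	}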
On Fri, Aug 15, 2025 at 9:10 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> Before you read on, please take a moment to acknowledge that David
> Hildenbrand asked for this, so I'm blaming mostly him :)
>
> It is possible that the dup_mmap() call fails on allocating or setting
> up a vma after the maple tree of the oldmm is copied.  Today, that
> failure point is marked by inserting an XA_ZERO entry over the failure
> point so that the exact location does not need to be communicated
> through to exit_mmap().

Overall: Yes please, I'm in favor of getting rid of that XA_ZERO special case.

> However, a race exists in the tear down process because dup_mmap()
> drops the mmap lock before exit_mmap() can remove the partially set up
> vma tree.  This means that other tasks may get to the mm tree and find
> the invalid vma pointer (since it's an XA_ZERO entry), even though the
> mm is marked as MMF_OOM_SKIP and MMF_UNSTABLE.
>
> To remove the race fully, the tree must be cleaned up before dropping
> the lock.  This is accomplished by extracting the vma cleanup from
> exit_mmap() and changing the required functions to pass through the vma
> search limit.

It really seems to me like, instead of tearing down the whole tree on
this failure path, we should be able to remove those entries in the
cloned vma tree that haven't been copied yet and then proceed as
normal.  I understand that this is complicated because of maple tree
weirdness; but can't we somehow fix the wr_rebalance case to not
allocate more memory when reducing the number of tree nodes?

Surely there's some way to do that.  A really stupid suggestion: As
long as wr_rebalance is guaranteed to not increase the number of
nodes, we could make do with a global-mutex-protected system-global
preallocation of significantly less than 64 maple tree nodes, right?
We could even use that in RCU mode, as long as we are willing to take
a synchronize_rcu() penalty on this "we really want to wipe some tree
elements" slowpath.

It feels like we're adding more and more weird contortions caused by
the kinda bizarre "you can't reliably remove tree elements" property
of maple trees, and I really feel like that complexity should be
pushed down into the maple tree implementation instead.
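To make the trade-off concrete, the "prune and proceed" variant boils down
to something like the sketch below (invented helper name; rmap, accounting,
and locking details for the dropped range are omitted).  The mas_store_gfp()
call is exactly the operation that may need new nodes via wr_rebalance and
can therefore fail in the very out-of-memory situation that got us here:

	/* Hypothetical: wipe the not-yet-copied range from the new tree. */
	static int prune_unfinished(struct mm_struct *mm, unsigned long failed_start)
	{
		MA_STATE(mas, &mm->mm_mt, failed_start, ULONG_MAX);

		mmap_assert_write_locked(mm);
		/* Storing NULL over the range erases the stale entries... */
		return mas_store_gfp(&mas, NULL, GFP_KERNEL);
		/* ...but it may return -ENOMEM, which is the sticking point. */
	}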
* Jann Horn <jannh@google.com> [250815 15:50]:
> On Fri, Aug 15, 2025 at 9:10 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > Before you read on, please take a moment to acknowledge that David
> > Hildenbrand asked for this, so I'm blaming mostly him :)
> >
> > It is possible that the dup_mmap() call fails on allocating or setting
> > up a vma after the maple tree of the oldmm is copied.  Today, that
> > failure point is marked by inserting an XA_ZERO entry over the failure
> > point so that the exact location does not need to be communicated
> > through to exit_mmap().
> 
> Overall: Yes please, I'm in favor of getting rid of that XA_ZERO special case.
> 
> > However, a race exists in the tear down process because dup_mmap()
> > drops the mmap lock before exit_mmap() can remove the partially set up
> > vma tree.  This means that other tasks may get to the mm tree and find
> > the invalid vma pointer (since it's an XA_ZERO entry), even though the
> > mm is marked as MMF_OOM_SKIP and MMF_UNSTABLE.
> >
> > To remove the race fully, the tree must be cleaned up before dropping
> > the lock.  This is accomplished by extracting the vma cleanup from
> > exit_mmap() and changing the required functions to pass through the vma
> > search limit.
> 
> It really seems to me like, instead of tearing down the whole tree on
> this failure path, we should be able to remove those entries in the
> cloned vma tree that haven't been copied yet and then proceed as
> normal.  I understand that this is complicated because of maple tree
> weirdness; but can't we somehow fix the wr_rebalance case to not
> allocate more memory when reducing the number of tree nodes?
> 
> Surely there's some way to do that.  A really stupid suggestion: As
> long as wr_rebalance is guaranteed to not increase the number of
> nodes, we could make do with a global-mutex-protected system-global
> preallocation of significantly less than 64 maple tree nodes, right?
> We could even use that in RCU mode, as long as we are willing to take
> a synchronize_rcu() penalty on this "we really want to wipe some tree
> elements" slowpath.
> 
> It feels like we're adding more and more weird contortions caused by
> the kinda bizarre "you can't reliably remove tree elements" property
> of maple trees, and I really feel like that complexity should be
> pushed down into the maple tree implementation instead.

First, thanks for looking at this.  I appreciate you taking the time to
think through what is going on here and the alternatives.

The maple tree isn't the first place where we need memory to free memory,
and I don't think it'll be the last either.  I have a method of reusing
nodes, but that only works when rcu is not on.

What you are suggesting is very complicated, will have a number of corner
cases, and would probably have zero users.  Having a global reserve of
nodes is something that has come up before, but there is no way to ensure
that we'd have enough to guarantee that things won't fail.  64 is a huge
number of nodes, but there's no way to know if we're going to hit this
situation multiple times in a row in rapid succession.  And since we
cannot depend on it working, there is no real benefit to having the mode -
we have to plan for the rare case of failure regardless.

There is also another project to reduce the possibility of the maple tree
itself failing to allocate [1], but in the dup_mmap() situation the tree
is not the problem.

What you are asking me to do is to add a special mode in the tree to work
around the fact that the mm struct has a life cycle problem.
If this special mode already existed as you have suggested above, I would
advise against using it here, because we are out of memory and we want to
get memory back from the failed dup_mmap() as quickly as possible, and the
tear down is the fastest way to do that.  I don't want code executing to
juggle things around to fix an mm struct so that the unlock/lock dance is
safe for random race conditions.  This is why I think the mode would have
zero users: dup_mmap() shouldn't be doing any extra work after a failed
allocation.

I see where you are coming from, and having compartmentalised code feels
like the right path, but in this case the system failed to allocate enough
memory after reclaim ran, and we'd be adding extra run time for zero
benefit.  From the maple tree perspective, there is one call to destroy
the tree; the rest of the code is to clean up the failed mm setup.

On a higher level, I'm disappointed that the mm struct is half initialized
and unlocked at all.  Really, I think we should go further and not have
the mm exposed for any window of time.  The issue with just calling
exit_mmap() directly is the unlocking/locking dance for performance
reasons.

Equally, the !vma issue seems crazy because it's about avoiding races
where the task has unmapped everything, and we really shouldn't be able to
iterate over tasks that are in this state.  And we have this MMF_UNSTABLE
bit in the mm, but other tasks walk into the mm and access the struct
without checking.  That seems suboptimal, at best.

[1] https://lore.kernel.org/all/20250723-slub-percpu-caches-v5-0-b792cd830f5d@suse.cz/

Thanks,
Liam