Before you read on, please take a moment to acknowledge that David
Hildenbrand asked for this, so I'm blaming mostly him :)

It is possible that the dup_mmap() call fails on allocating or setting
up a vma after the maple tree of the oldmm is copied.  Today, that
failure point is marked by inserting an XA_ZERO entry over the failure
point so that the exact location does not need to be communicated
through to exit_mmap().

However, a race exists in the tear down process because dup_mmap()
drops the mmap lock before exit_mmap() can remove the partially set up
vma tree.  This means that other tasks may get to the mm tree and find
the invalid vma pointer (since it's an XA_ZERO entry), even though the
mm is marked as MMF_OOM_SKIP and MMF_UNSTABLE.

To remove the race fully, the tree must be cleaned up before dropping
the lock.  This is accomplished by extracting the vma cleanup from
exit_mmap() and changing the required functions to pass through the vma
search limit.

This does run the risk of increasing the possibility of finding no vmas
(which is already possible!) in code that isn't careful.

The passing of so many limits and variables was such a mess when
dup_mmap() was introduced that it was avoided in favour of the XA_ZERO
entry marker, but since the swap case was the second time we've hit
cases of walking an almost-dead mm, here's the alternative to checking
MMF_UNSTABLE before wandering into other mm structs.  A rough sketch of
the failure path under discussion follows at the end of this mail.

[1]. https://lore.kernel.org/all/2e8df53b-d953-43fb-9c69-7d7d60e95c9a@redhat.com/

Liam R. Howlett (6):
  mm/mmap: Move exit_mmap() trace point
  mm/mmap: Abstract vma clean up from exit_mmap()
  mm/vma: Add limits to unmap_region() for vmas
  mm/memory: Add tree limit to free_pgtables()
  mm/vma: Add page table limit to unmap_region()
  mm: Change dup_mmap() recovery

 mm/internal.h |  4 ++-
 mm/memory.c   | 13 ++++-----
 mm/mmap.c     | 80 ++++++++++++++++++++++++++++++++++-----------------
 mm/vma.c      | 10 +++++--
 mm/vma.h      |  1 +
 5 files changed, 70 insertions(+), 38 deletions(-)

-- 
2.47.2
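For readers who have not looked at the fork path recently, the failure
handling under discussion looks roughly like the sketch below.  This is
illustrative only, not the actual kernel code: variable names
approximate kernel/fork.c, and tear_down_vmas() is a placeholder rather
than necessarily the helper the series introduces.

	/* Today's dup_mmap() error path (sketch): */
	if (retval) {
		/* Mark the failure point so exit_mmap() can find it later. */
		mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1);
		mas_store(&vmi.mas, XA_ZERO_ENTRY);
		set_bit(MMF_OOM_SKIP, &mm->flags);
		set_bit(MMF_UNSTABLE, &mm->flags);
	}
	mmap_write_unlock(mm);	/* other tasks may now find the XA_ZERO entry */
				/* ... exit_mmap() removes the tree much later */

	/* With this series (sketch): */
	if (retval) {
		/* Tear down the partially set up tree before unlocking. */
		tear_down_vmas(mm, &vmi, mpnt->vm_start);	/* placeholder name */
		set_bit(MMF_OOM_SKIP, &mm->flags);
	}
	mmap_write_unlock(mm);	/* nothing half-built is visible any more */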
On 15.08.25 21:10, Liam R. Howlett wrote:
> Before you read on, please take a moment to acknowledge that David
> Hildenbrand asked for this, so I'm blaming mostly him :)

:)

> 
> It is possible that the dup_mmap() call fails on allocating or setting
> up a vma after the maple tree of the oldmm is copied.  Today, that
> failure point is marked by inserting an XA_ZERO entry over the failure
> point so that the exact location does not need to be communicated
> through to exit_mmap().
> 
> However, a race exists in the tear down process because dup_mmap()
> drops the mmap lock before exit_mmap() can remove the partially set up
> vma tree.  This means that other tasks may get to the mm tree and find
> the invalid vma pointer (since it's an XA_ZERO entry), even though the
> mm is marked as MMF_OOM_SKIP and MMF_UNSTABLE.
> 
> To remove the race fully, the tree must be cleaned up before dropping
> the lock.  This is accomplished by extracting the vma cleanup from
> exit_mmap() and changing the required functions to pass through the vma
> search limit.
> 
> This does run the risk of increasing the possibility of finding no vmas
> (which is already possible!) in code that isn't careful.

Right, it would also happen if __mt_dup() fails I guess.

> 
> The passing of so many limits and variables was such a mess when
> dup_mmap() was introduced that it was avoided in favour of the XA_ZERO
> entry marker, but since the swap case was the second time we've hit
> cases of walking an almost-dead mm, here's the alternative to checking
> MMF_UNSTABLE before wandering into other mm structs.

Changes look fairly small and reasonable, so I really like this.

I agree with Jann that doing a partial teardown might be even better, but
code-wise I suspect it might end up with a lot more churn and weird
allocation-corner-cases to handle.

-- 
Cheers

David / dhildenb
On Mon, Aug 18, 2025 at 11:44:16AM +0200, David Hildenbrand wrote:
> On 15.08.25 21:10, Liam R. Howlett wrote:
> > Before you read on, please take a moment to acknowledge that David
> > Hildenbrand asked for this, so I'm blaming mostly him :)
> 
> :)
> 
> > 
> > It is possible that the dup_mmap() call fails on allocating or setting
> > up a vma after the maple tree of the oldmm is copied.  Today, that
> > failure point is marked by inserting an XA_ZERO entry over the failure
> > point so that the exact location does not need to be communicated
> > through to exit_mmap().
> > 
> > However, a race exists in the tear down process because dup_mmap()
> > drops the mmap lock before exit_mmap() can remove the partially set up
> > vma tree.  This means that other tasks may get to the mm tree and find
> > the invalid vma pointer (since it's an XA_ZERO entry), even though the
> > mm is marked as MMF_OOM_SKIP and MMF_UNSTABLE.
> > 
> > To remove the race fully, the tree must be cleaned up before dropping
> > the lock.  This is accomplished by extracting the vma cleanup from
> > exit_mmap() and changing the required functions to pass through the vma
> > search limit.
> > 
> > This does run the risk of increasing the possibility of finding no vmas
> > (which is already possible!) in code that isn't careful.
> 
> Right, it would also happen if __mt_dup() fails I guess.
> 
> > 
> > The passing of so many limits and variables was such a mess when
> > dup_mmap() was introduced that it was avoided in favour of the XA_ZERO
> > entry marker, but since the swap case was the second time we've hit
> > cases of walking an almost-dead mm, here's the alternative to checking
> > MMF_UNSTABLE before wandering into other mm structs.
> 
> Changes look fairly small and reasonable, so I really like this.
> 
> I agree with Jann that doing a partial teardown might be even better, but
> code-wise I suspect it might end up with a lot more churn and weird
> allocation-corner-cases to handle.

I've yet to review the series and see exactly what's proposed, but on gut
instinct (and based on past experience with the munmap gather/reattach
stuff), some kind of partial thing like this tends to end up a nightmare
of weird-stuff-you-didn't-think-about.

So I'm instinctively against this.

However, I'll take a proper look through this series shortly and hopefully
have more intelligent things to say...

An aside - I was working on a crazy anon idea over the weekend (I know, I
know) and noticed that the mm life cycle is just weird.  I observed
apparent duplicate calls of __mmdrop() for instance (I think the unwinding
just broke), the delayed mmdrop is strange, and the whole area seems rife
with complexity.

So I'm glad David talked you into doing this ;) this particular edge case
was always strange, and we have now hit it twice (this time more
seriously, as it's due to a fatal signal, which is much more likely to
arise than an OOM scenario with too-small-to-fail allocations).

BTW where are we with the hotfix for the swapoff case [0]?  I think we
agreed on setting MMF_UNSTABLE there and using that to decide not to
proceed in unuse_mm(), right?

Cheers, Lorenzo

[0]: https://lore.kernel.org/all/20250808092156.1918973-1-quic_charante@quicinc.com/
On 8/18/2025 3:14 PM, David Hildenbrand wrote:
>> Before you read on, please take a moment to acknowledge that David
>> Hildenbrand asked for this, so I'm blaming mostly him 🙂
>

Just curious if we can go with the suggestion of the MMF_UNSTABLE check
just for 6.12, as it is an LTS.  I can't see this issue on branches
earlier than it.  Also, this patchset does not apply cleanly on 6.12.

> 🙂
* Charan Teja Kalla <quic_charante@quicinc.com> [250818 10:27]:
> 
> On 8/18/2025 3:14 PM, David Hildenbrand wrote:
> >> Before you read on, please take a moment to acknowledge that David
> >> Hildenbrand asked for this, so I'm blaming mostly him 🙂
> > 
> 
> Just curious if we can go with the suggestion of the MMF_UNSTABLE check
> just for 6.12, as it is an LTS.  I can't see this issue on branches
> earlier than it.  Also, this patchset does not apply cleanly on 6.12.

This was developed against mm-new to be applied and potentially
backported, but it's an RFC right now, so it won't be applied anywhere.

I think you should still be checking MMF_UNSTABLE in the swap path, as
David suggested as a hotfix [1].  Considering it's a single mm-global
check before iterating, it should be checked regardless of what happens
with this RFC.

Thanks,
Liam

[1]. https://lore.kernel.org/linux-mm/9178bf98-2ea7-4ad8-ad43-cdcc02ab863d@redhat.com/
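For reference, that hotfix amounts to a single early bail-out before the
vma walk in the swapoff path, roughly along the lines of the sketch below
(abbreviated, and the body may not match the current unuse_mm() exactly;
only the test_bit() check is the new part):

	static int unuse_mm(struct mm_struct *mm, unsigned int type)
	{
		struct vm_area_struct *vma;
		int ret = 0;
		VMA_ITERATOR(vmi, mm, 0);

		mmap_read_lock(mm);
		/* Don't walk a tree that may still contain an XA_ZERO entry. */
		if (unlikely(test_bit(MMF_UNSTABLE, &mm->flags)))
			goto unlock;

		for_each_vma(vmi, vma) {
			if (vma->anon_vma && !is_vm_hugetlb_page(vma)) {
				ret = unuse_vma(vma, type);
				if (ret)
					break;
			}
			cond_resched();
		}
	unlock:
		mmap_read_unlock(mm);
		return ret;
	}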
On Fri, Aug 15, 2025 at 9:10 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> Before you read on, please take a moment to acknowledge that David
> Hildenbrand asked for this, so I'm blaming mostly him :)
>
> It is possible that the dup_mmap() call fails on allocating or setting
> up a vma after the maple tree of the oldmm is copied.  Today, that
> failure point is marked by inserting an XA_ZERO entry over the failure
> point so that the exact location does not need to be communicated
> through to exit_mmap().

Overall: Yes please, I'm in favor of getting rid of that XA_ZERO special case.

> However, a race exists in the tear down process because dup_mmap()
> drops the mmap lock before exit_mmap() can remove the partially set up
> vma tree.  This means that other tasks may get to the mm tree and find
> the invalid vma pointer (since it's an XA_ZERO entry), even though the
> mm is marked as MMF_OOM_SKIP and MMF_UNSTABLE.
>
> To remove the race fully, the tree must be cleaned up before dropping
> the lock.  This is accomplished by extracting the vma cleanup from
> exit_mmap() and changing the required functions to pass through the vma
> search limit.

It really seems to me like, instead of tearing down the whole tree on
this failure path, we should be able to remove those entries in the
cloned vma tree that haven't been copied yet and then proceed as
normal.  I understand that this is complicated because of maple tree
weirdness; but can't we somehow fix the wr_rebalance case to not
allocate more memory when reducing the number of tree nodes?

Surely there's some way to do that.  A really stupid suggestion: As
long as wr_rebalance is guaranteed to not increase the number of
nodes, we could make do with a global-mutex-protected system-global
preallocation of significantly less than 64 maple tree nodes, right?
We could even use that in RCU mode, as long as we are willing to take
a synchronize_rcu() penalty on this "we really want to wipe some tree
elements" slowpath.

It feels like we're adding more and more weird contortions caused by
the kinda bizarre "you can't reliably remove tree elements" property
of maple trees, and I really feel like that complexity should be
pushed down into the maple tree implementation instead.
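To make the trade-off concrete, the "prune and proceed" variant boils down
to something like the sketch below (invented helper name; rmap, accounting,
and locking details for the dropped range are omitted).  The mas_store_gfp()
call is exactly the operation that may need new nodes via wr_rebalance and
can therefore fail in the very out-of-memory situation that got us here:

	/* Hypothetical: wipe the not-yet-copied range from the new tree. */
	static int prune_unfinished(struct mm_struct *mm, unsigned long failed_start)
	{
		MA_STATE(mas, &mm->mm_mt, failed_start, ULONG_MAX);

		mmap_assert_write_locked(mm);
		/* Storing NULL over the range erases the stale entries... */
		return mas_store_gfp(&mas, NULL, GFP_KERNEL);
		/* ...but it may return -ENOMEM, which is the sticking point. */
	}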
* Jann Horn <jannh@google.com> [250815 15:50]:
> On Fri, Aug 15, 2025 at 9:10 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > Before you read on, please take a moment to acknowledge that David
> > Hildenbrand asked for this, so I'm blaming mostly him :)
> >
> > It is possible that the dup_mmap() call fails on allocating or setting
> > up a vma after the maple tree of the oldmm is copied.  Today, that
> > failure point is marked by inserting an XA_ZERO entry over the failure
> > point so that the exact location does not need to be communicated
> > through to exit_mmap().
> 
> Overall: Yes please, I'm in favor of getting rid of that XA_ZERO special case.
> 
> > However, a race exists in the tear down process because dup_mmap()
> > drops the mmap lock before exit_mmap() can remove the partially set up
> > vma tree.  This means that other tasks may get to the mm tree and find
> > the invalid vma pointer (since it's an XA_ZERO entry), even though the
> > mm is marked as MMF_OOM_SKIP and MMF_UNSTABLE.
> >
> > To remove the race fully, the tree must be cleaned up before dropping
> > the lock.  This is accomplished by extracting the vma cleanup from
> > exit_mmap() and changing the required functions to pass through the vma
> > search limit.
> 
> It really seems to me like, instead of tearing down the whole tree on
> this failure path, we should be able to remove those entries in the
> cloned vma tree that haven't been copied yet and then proceed as
> normal.  I understand that this is complicated because of maple tree
> weirdness; but can't we somehow fix the wr_rebalance case to not
> allocate more memory when reducing the number of tree nodes?
> 
> Surely there's some way to do that.  A really stupid suggestion: As
> long as wr_rebalance is guaranteed to not increase the number of
> nodes, we could make do with a global-mutex-protected system-global
> preallocation of significantly less than 64 maple tree nodes, right?
> We could even use that in RCU mode, as long as we are willing to take
> a synchronize_rcu() penalty on this "we really want to wipe some tree
> elements" slowpath.
> 
> It feels like we're adding more and more weird contortions caused by
> the kinda bizarre "you can't reliably remove tree elements" property
> of maple trees, and I really feel like that complexity should be
> pushed down into the maple tree implementation instead.

First, thanks for looking at this.  I appreciate you taking the time to
think through what is going on here and the alternatives.

The maple tree isn't the first place where we need memory to free memory,
and I don't think it'll be the last either.  I have a method of reusing
nodes, but that only works when rcu is not on.

What you are suggesting is very complicated, will have a number of corner
cases, and would probably have zero users.  Having a global reserve of
nodes is something that has come up before, but there is no way to ensure
that we'd have enough to guarantee that things won't fail.  64 is a huge
number of nodes, but there's no way to know if we're going to hit this
situation multiple times in a row in rapid succession.  And since we
cannot depend on it working, there is no real benefit to having the mode -
we have to plan for the rare case of failure regardless.

There is also another project to reduce the possibility of the maple tree
itself failing to allocate [1], but in the dup_mmap() situation the tree
is not the problem.

What you are asking me to do is to add a special mode in the tree to work
around the fact that the mm struct has a life cycle problem.
If this special mode already existed as you have suggested above, I would
advise against using it here, because we are out of memory and we want to
get memory back from the failed dup_mmap() as quickly as possible, and the
tear down is the fastest way to do that.  I don't want code executing to
juggle things around to fix an mm struct so that the unlock/lock dance is
safe for random race conditions.  This is why I think the mode would have
zero users: dup_mmap() shouldn't be doing any extra work after a failed
allocation.

I see where you are coming from, and having compartmentalised code feels
like the right path, but in this case the system failed to allocate enough
memory after reclaim ran, and we'd be adding extra run time for zero
benefit.  From the maple tree perspective, there is one call to destroy
the tree; the rest of the code is to clean up the failed mm setup.

On a higher level, I'm disappointed that the mm struct is half initialized
and unlocked at all.  Really, I think we should go further and not have
the mm exposed for any window of time.  The issue with just calling
exit_mmap() directly is the unlocking/locking dance for performance
reasons.

Equally, the !vma issue seems crazy because it's about avoiding races
where the task has unmapped everything, and we really shouldn't be able to
iterate over tasks that are in this state.  And we have this MMF_UNSTABLE
bit in the mm, but other tasks walk into the mm and access the struct
without checking.  That seems suboptimal, at best.

[1] https://lore.kernel.org/all/20250723-slub-percpu-caches-v5-0-b792cd830f5d@suse.cz/

Thanks,
Liam