[PATCH hotfix 6.12] maple_tree: correct tree corruption on spanning store

Lorenzo Stoakes posted 1 patch 1 month, 3 weeks ago
Writing a data range into a maple tree may involve overwriting a number of
existing entries that span across more than one node. Doing so invokes a
'spanning' store.

Performing a spanning store across two leaf nodes in a maple tree in which
entries are overwritten is achieved by first initialising a 'big' node,
which will store the coalesced entries between the two nodes comprising
entries prior to the newly stored entry, the newly stored entry, and
subsequent entries.

This 'big node' is then merged back into the tree and the tree is
rebalanced, replacing the entries across the spanned nodes with those
contained in the big node.

The operation is performed in mas_wr_spanning_store() which starts by
establishing two maple tree state objects ('mas' objects) on the left of
the range and on the right (l_mas and r_mas respectively).

l_mas traverses to the beginning of the range to be stored in order to copy
the data BEFORE the requested store into the big node.

We then insert our new entry immediately afterwards (both the left copy and
the storing of the new entry are combined and performed by
mas_store_b_node()).

r_mas traverses to the populated slot immediately after, in order to copy
the data AFTER the requested store into the big node.

This copy of the right-hand node is performed by mas_mab_cp() as long as
r_mas indicates that there's data to copy, i.e. r_mas.offset <= r_mas.end.

We traverse r_mas to this position in mas_wr_node_walk() using a simple
loop:

	while (offset < count && mas->index > wr_mas->pivots[offset])
		offset++;

Note here that count is determined to be the (inclusive) index of the last
slot containing data in the node, as determined by ma_data_end().

This means that, even when searching for mas->index - which will have been
set to one plus the end of the target range in order to traverse to the
next slot in mas_wr_spanning_store() - the offset < count condition
terminates the iteration at the end of the node's data even if no pivot
satisfies the search.

The fact this right hand node contains the end of the range being stored is
why we are traversing it, and this loop is why we appear to discover a
viable range within the right node to copy to the big one.

However, if the node that r_mas traverses contains a pivot EQUAL to the end
of the range being stored, and this is the LAST pivot contained within the
node, something unexpected happens:

1. The l_mas traversal copy and insertion of the new entry in the big node
   is performed via mas_store_b_node() correctly.

2. The traversal performed by mas_wr_node_walk() means our r_mas.offset is
   set to the offset of the entry equal to the end of the range we store.

3. We therefore copy this DUPLICATE of the final pivot into the big node,
   and insert this DUPLICATE entry, alongside its invalid slot entry
   immediately after the newly inserted entry.

4. The big node containing this duplicated entry is inserted into the tree,
   which is rebalanced, and therefore the maple tree becomes corrupted.

Note that if the right hand node had one or more entries with pivots of
greater value than the end of the stored range, this would not happen. If
it contained entries with pivots of lesser value it would not be the right
node in this spanning store.

This appears to have been at risk of happening throughout the maple tree's
history, however it seemed significantly less likely to occur until
recently.

The balancing of the tree seems to have made it unlikely that you would
happen to perform a store that both spans two nodes AND would overwrite
precisely the entry with the largest pivot in the right-hand node which
contains no further larger pivots.

The work performed in commit f8d112a4e657 ("mm/mmap: avoid zeroing vma tree
in mmap_region()") seems to have made the probability of this event much
more likely.

Prior to this change, MAP_FIXED mappings which were overwritten would
first be cleared before any subsequent store or importantly - merge of
surrounding entries - would be performed.

After this change, this is no longer the case, and this means that, in the
worst case, a number of entries might be overwritten in combination with a
merge (and subsequent overwriting expansion) between both the prior entry
AND a subsequent entry.

The motivation for this change arose from Bert Karwatzki's report of
encountering mm instability after the release of kernel v6.12-rc1 which,
after the use of CONFIG_DEBUG_VM_MAPLE_TREE and similar configuration
options, was identified as maple tree corruption.

After Bert very generously provided his time and ability to reproduce this
event consistently, I was able to finally identify that the issue discussed
in this commit message was occurring for him.

The solution implemented in this patch is:

1. Adjust mas_wr_walk_index() to return a boolean value indicating whether
   the containing node is actually populated with entries possessing pivots
   equal to or greater than mas->index.

2. When traversing the right node in mas_wr_spanning_store(), use this
   value to determine whether to try to copy from the right node - if it is
   not populated, then do not do so.

This passes all maple tree unit tests and resolves the reported bug.

Reported-and-tested-by: Bert Karwatzki <spasswolf@web.de>
Closes: https://lore.kernel.org/all/20241001023402.3374-1-spasswolf@web.de/
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Closes: https://lore.kernel.org/all/CABXGCsOPwuoNOqSMmAvWO2Fz4TEmPnjFj-b7iF+XFRu1h7-+Dg@mail.gmail.com/
Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 lib/maple_tree.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 20990ecba2dd..f72e1a5a4dfa 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -2196,6 +2196,8 @@ static inline void mas_node_or_none(struct ma_state *mas,

 /*
  * mas_wr_node_walk() - Find the correct offset for the index in the @mas.
+ *                      If @mas->index cannot be found within the containing
+ *                      node, we traverse to the last entry in the node.
  * @wr_mas: The maple write state
  *
  * Uses mas_slot_locked() and does not need to worry about dead nodes.
@@ -3532,6 +3534,12 @@ static bool mas_wr_walk(struct ma_wr_state *wr_mas)
 	return true;
 }

+/*
+ * Traverse the maple tree until the offset of mas->index is reached.
+ *
+ * Return: Is this node actually populated with entries possessing pivots equal
+ *         to or greater than mas->index?
+ */
 static bool mas_wr_walk_index(struct ma_wr_state *wr_mas)
 {
 	struct ma_state *mas = wr_mas->mas;
@@ -3540,8 +3548,11 @@ static bool mas_wr_walk_index(struct ma_wr_state *wr_mas)
 		mas_wr_walk_descend(wr_mas);
 		wr_mas->content = mas_slot_locked(mas, wr_mas->slots,
 						  mas->offset);
-		if (ma_is_leaf(wr_mas->type))
-			return true;
+		if (ma_is_leaf(wr_mas->type)) {
+			unsigned long pivot = wr_mas->pivots[mas->offset];
+
+			return pivot == 0 || mas->index <= pivot;
+		}
 		mas_wr_walk_traverse(wr_mas);

 	}
@@ -3701,6 +3712,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas)
 	struct maple_big_node b_node;
 	struct ma_state *mas;
 	unsigned char height;
+	bool r_populated;

 	/* Left and Right side of spanning store */
 	MA_STATE(l_mas, NULL, 0, 0);
@@ -3742,7 +3754,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas)
 		r_mas.last++;

 	r_mas.index = r_mas.last;
-	mas_wr_walk_index(&r_wr_mas);
+	r_populated = mas_wr_walk_index(&r_wr_mas);
 	r_mas.last = r_mas.index = mas->last;

 	/* Set up left side. */
@@ -3766,7 +3778,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas)
 	/* Copy l_mas and store the value in b_node. */
 	mas_store_b_node(&l_wr_mas, &b_node, l_mas.end);
 	/* Copy r_mas into b_node. */
-	if (r_mas.offset <= r_mas.end)
+	if (r_populated && r_mas.offset <= r_mas.end)
 		mas_mab_cp(&r_mas, r_mas.offset, r_mas.end,
 			   &b_node, b_node.b_end + 1);
 	else
--
2.46.2
Re: [PATCH hotfix 6.12] maple_tree: correct tree corruption on spanning store
Posted by Lorenzo Stoakes 1 month, 3 weeks ago
On Sat, Oct 05, 2024 at 07:41:14AM +0100, Lorenzo Stoakes wrote:
> Writing a data range into a maple tree may involve overwriting a number of
> existing entries that span across more than one node. Doing so invokes a
> 'spanning' store.
>

[snip]

Andrew - just to note that I have intentionally left stable off this, in
order that we can allow this to stabilise in the 6.12 release candidates.

Up until 6.12 this bug seemed much harder to hit, and as far as I'm aware
we've never had a bug report for it prior to this.

I am confident in the patch for 6.12 as all of the (_numerous_) maple tree
userland tests pass and the kernel is stable with it, but as this is such
a subtle algorithmic change I think we should be cautious.

As soon as things settle down I will ping stable to get it backported.

Thanks!
Re: [PATCH hotfix 6.12] maple_tree: correct tree corruption on spanning store
Posted by Bert Karwatzki 1 month, 3 weeks ago
Am Samstag, dem 05.10.2024 um 12:17 +0100 schrieb Lorenzo Stoakes:
> On Sat, Oct 05, 2024 at 07:41:14AM +0100, Lorenzo Stoakes wrote:
> > Writing a data range into a maple tree may involve overwriting a number of
> > existing entries that span across more than one node. Doing so invokes a
> > 'spanning' store.
> >
>
> [snip]
>
> Andrew - just to note that I have intentionally left stable off this, in
> order that wre can allow this to stabilise in the 6.12 release candidates.
>
> Up until 6.12 this bug seemed much harder to hit, and as far as I'm aware
> we've never had a bug report for it prior to this.

I still suspect that this could have been the same error:
https://lkml.org/lkml/2024/8/28/1558
When compiling the kernel without CONFIG_DEBUG_VM, the maple tree bug
results in an unkillable task, and trying to kill it first produced the
rwsem warning (and soon after took down the whole system).
But I couldn't reproduce it with the given reproducer, either.

Bert Karwatzki
Re: [PATCH hotfix 6.12] maple_tree: correct tree corruption on spanning store
Posted by Lorenzo Stoakes 1 month, 3 weeks ago
On Sat, Oct 05, 2024 at 03:24:39PM +0200, Bert Karwatzki wrote:
> Am Samstag, dem 05.10.2024 um 12:17 +0100 schrieb Lorenzo Stoakes:
> > On Sat, Oct 05, 2024 at 07:41:14AM +0100, Lorenzo Stoakes wrote:
> > > Writing a data range into a maple tree may involve overwriting a number of
> > > existing entries that span across more than one node. Doing so invokes a
> > > 'spanning' store.
> > >
> >
> > [snip]
> >
> > Andrew - just to note that I have intentionally left stable off this, in
> > order that wre can allow this to stabilise in the 6.12 release candidates.
> >
> > Up until 6.12 this bug seemed much harder to hit, and as far as I'm aware
> > we've never had a bug report for it prior to this.
>
> I still suspect that this could have been the same error:
> https://lkml.org/lkml/2024/8/28/1558
> When compiling the kernel without CONFIG_DEBUG_VM maple tree bug results in an
> unkillable task, and when trying to kill it first produced the rwsem warning
> (and soon after took down the whole system).
> But I couldn't reproduce it with the given reproducer, either.
>
> Bert Karwatzki
>

Thanks for reminding me of that one!

Yeah, unfortunately that thread was very unproductive in that the reporter
gave no feedback or further information. They spammed the list with a bunch
of such reports, many looking suspect...

So it is possible, and I suspect that this bug may have caused some other
'weird' crashes that were non-repro in the past.

The difference here may be that we (or rather specifically - you! :)
finally found a way to reliably repro this to the point where we could
diagnose it.

As far as I can tell this could happen even with vma_iter_clear_gfp(), so
old unmap/MAP_FIXED behaviour could have hit it.

BUT a difference now is that we essentially combine the MAP_FIXED with a
merge and overwrite everything in between, so this addition of up to 2
extra entries probably pushed it over the edge to make this event
statistically likely enough for you to have hit it.

Note in your case it took unmapping 6 (!) entries and merging another 2 for
a total of an overwrite spanning 8 entries.

Anyway, assuming we see no issues stabilising this in the rc's, I will ping
stable to get this backported and fix this everywhere.