[PATCH v3 00/30] maple_tree: Replace big node with maple copy
Posted by Liam R. Howlett 1 week ago
The big node struct was created for simplicity of splitting,
rebalancing, and spanning store operations by using a copy buffer to
create the data necessary prior to breaking it up into 256B nodes.
Certain operations were rather tricky due to the restriction of keeping
NULL entries together and never at the end of a node (except the
right-most node).

The big node struct is incompatible with future features that are
currently in development, specifically different node types and
different data type sizes for pivots.

The big node struct was also a stack variable, which caused issues with
certain kernel build configurations.

This series removes big node by introducing another node type which will
never be written to the tree: maple_copy.  The maple copy node operates
more like a scatter/gather operation with a number of sources and
destinations of allocated nodes.

The sources are copied to the destinations, in turn, until the sources
are exhausted.  The destination is changed if it is filled or the split
location is reached prior to the source data end.

New data is inserted by using the maple copy node itself as a source
with up to 3 slots and pivots.  The data in the maple copy node is the
data being written to the tree along with any fragment of the range(s)
being overwritten.

As with all nodes, the maple copy node is of size 256B.  Using a node
type allows for the copy operation to treat the new data stored in the
maple copy node the same as any other source node.

Analysis of the runtime shows neither a regression nor a benefit from
removing the larger stack structure.  The motivation is to lay the
groundwork for new node types and to help those with odd configurations
that have had issues.

I tested the change using mm_tests on amd64, and Suren tested it on
android (arm64).  Limited testing on s390 qemu was also performed using
stress-ng virtual memory stressors, which should cover many corner
cases.

v1: https://lore.kernel.org/all/20260115193647.1695937-1-Liam.Howlett@oracle.com/
v2: https://lore.kernel.org/all/20260121164526.2093265-1-Liam.Howlett@oracle.com/

Changes since v2:
 - Whitespace fix in rebalance_data()
 - Added ma_init_slot() for cleaner casting during RCU_INIT_POINTER().
   Thanks test robot & SK [1]
 - Fixed off-by-one in rebalance_data() which caused node overuse,
   reported by linux-next. Thanks Mark [2]
 - Added a new patch with a reproducer test case to the userspace
   testing for the rebalance_data() off-by-one

[1]. https://lore.kernel.org/all/20260122055516.69335-1-sj@kernel.org/
[2]. https://lore.kernel.org/all/bd1c3356-11a1-4d0b-bf58-47eb21bfd24d@sirena.org.uk/

Liam R. Howlett (30):
  maple_tree: Fix mas_dup_alloc() sparse warning
  maple_tree: Move mas_spanning_rebalance loop to function
  maple_tree: Extract use of big node from mas_wr_spanning_store()
  maple_tree: Remove unnecessary assignment of orig_l index
  maple_tree: inline mas_spanning_rebalance() into
    mas_wr_spanning_rebalance()
  maple_tree: Make ma_wr_states reliable for reuse in spanning store
  maple_tree: Remove l_wr_mas from mas_wr_spanning_rebalance
  maple_tree: Don't pass through height in mas_wr_spanning_store
  maple_tree: Move maple_subtree_state from mas_wr_spanning_store to
    mas_wr_spanning_rebalance
  maple_tree: Correct right ma_wr_state end pivot in
    mas_wr_spanning_store()
  maple_tree: Introduce maple_copy node and use it in
    mas_spanning_rebalance()
  maple_tree: Testing update for spanning store
  maple_tree: Inline mas_spanning_rebalance_loop() into
    mas_wr_spanning_rebalance()
  maple_tree: Change initial big node setup in
    mas_wr_spanning_rebalance()
  maple_tree: Introduce ma_leaf_max_gap()
  maple_tree: Add gap support, slot and pivot sizes for maple copy
  maple_tree: Start using maple copy node for destination
  maple_tree: inline mas_wr_spanning_rebalance()
  maple_tree: Remove unnecessary return statements
  maple_tree: Separate wr_split_store and wr_rebalance store type code
    path
  maple_tree: Add cp_is_new_root() helper
  maple_tree: Use maple copy node for mas_wr_rebalance() operation
  maple_tree: Add test for rebalance calculation off-by-one
  maple_tree: Add copy_tree_location() helper
  maple_tree: Add cp_converged() helper
  maple_tree: Use maple copy node for mas_wr_split()
  maple_tree: Remove maple big node and subtree structs
  maple_tree: Pass maple copy node to mas_wmb_replace()
  maple_tree: Don't pass end to mas_wr_append()
  maple_tree: Clean up mas_wr_node_store()

 include/linux/maple_tree.h       |   42 +
 lib/maple_tree.c                 | 2134 +++++++++++++-----------------
 lib/test_maple_tree.c            |   55 +-
 tools/testing/radix-tree/maple.c |  308 ++++-
 4 files changed, 1340 insertions(+), 1199 deletions(-)

-- 
2.47.3
Re: [PATCH v3 00/30] maple_tree: Replace big node with maple copy
Posted by Andrew Morton 6 days, 19 hours ago
On Fri, 30 Jan 2026 15:59:05 -0500 "Liam R. Howlett" <Liam.Howlett@oracle.com> wrote:

> The big node struct was created for simplicity of splitting,
> rebalancing, and spanning store operations by using a copy buffer to
> create the data necessary prior to breaking it up into 256B nodes.
> Certain operations were rather tricky due to the restriction of keeping
> NULL entries together and never at the end of a node (except the
> right-most node).
> 

Updated, thanks.

What are your thoughts on adding this to 6.19?  I'd expect to move it
into mm-stable Feb 17ish.

> Changes since v2:
>  - Whitespace fix in rebalance_data()
>  - Added ma_init_slot() for cleaner casting during RCU_INIT_POINTER().
>    Thanks test robot & SK [1]
>  - Fixed off-by-one in rebalance_data() which caused node overuse,
>    reported by linux-next. Thanks Mark [2]
>  - Added new patch to reproducer to test cases in userspace testing for
>    the rebalance_data() off-by-one

Below is how v3 altered mm.git:

--- a/lib/maple_tree.c~b
+++ a/lib/maple_tree.c
@@ -314,6 +314,13 @@ static inline struct maple_enode *mt_mk_
 			(type << MAPLE_ENODE_TYPE_SHIFT) | MAPLE_ENODE_NULL);
 }
 
+static inline void ma_init_slot(void __rcu **slot, const struct maple_node *mn,
+				const enum maple_type mt)
+{
+	/* WARNING: this is unsafe if the slot is exposed to readers. */
+	RCU_INIT_POINTER(*slot, (void *)mt_mk_node(mn, mt));
+}
+
 static inline void *mte_mk_root(const struct maple_enode *node)
 {
 	return (void *)((unsigned long)node | MAPLE_ROOT_NODE);
@@ -2231,7 +2238,7 @@ static inline void rebalance_data(struct
 {
 	cp_data_calc(cp, wr_mas, wr_mas);
 	sib->end = 0;
-	if (cp->data >= mt_slots[wr_mas->type]) {
+	if (cp->data > mt_slots[wr_mas->type]) {
 		push_data_sib(cp, wr_mas->mas, sib, parent);
 		if (sib->end)
 			goto use_sib;
@@ -2246,7 +2253,8 @@ static inline void rebalance_data(struct
 	return;
 
 use_sib:
-		cp->data += sib->end + 1;
+
+	cp->data += sib->end + 1;
 }
 
 /*
@@ -2553,7 +2561,7 @@ static inline void cp_dst_to_slots(struc
 		 * read-side operations that can view them until they are
 		 * inserted into the tree after an rcu_assign_pointer() call.
 		 */
-		RCU_INIT_POINTER(cp->slot[d], (void *)mt_mk_node(mn, mt));
+		ma_init_slot(&cp->slot[d], mn, mt);
 		cp->pivot[d] = slot_max;
 		if (mt_is_alloc(mas->tree)) {
 			if (ma_is_leaf(mt)) {
@@ -2603,8 +2611,7 @@ static inline bool cp_is_new_root(struct
 		 * read-side operations that can view it until it is insert into
 		 * the tree after an rcu_assign_pointer() call.
 		 */
-		RCU_INIT_POINTER(cp->slot[0],
-				 (void *)mt_mk_node(cp->dst[0].node, mt));
+		RCU_INIT_POINTER(cp->slot[0], mt_mk_node(cp->dst[0].node, mt));
 		cp->height++;
 	}
 	WARN_ON_ONCE(cp->dst[0].node != mte_to_node(
--- a/tools/testing/radix-tree/maple.c~b
+++ a/tools/testing/radix-tree/maple.c
@@ -35899,6 +35899,127 @@ unlock:
 	return ret;
 }
 
+static noinline void __init check_erase_rebalance(struct maple_tree *mt)
+{
+	unsigned long val;
+	void *enode;
+	int ret;
+
+	MA_STATE(mas, mt, 0, 0);
+
+	/*
+	 * During removal of big node, the rebalance started going too high,
+	 * which resulted in too many nodes trying to be used.
+	 *
+	 * Create a rebalance which results in an exactly full parent (0-9) that
+	 * does not need to be rebalanced.  This required two full levels,
+	 * followed by an insufficient level which will be rebalanced into two
+	 * nodes, finally leaves that need to be rebalanced into one node.
+	 *
+	 * The bugs tree:
+	 * root    4      Label     R
+	 *        /\               /\
+	 *       9   X            F
+	 *      /\   /\          /
+	 *     9   X            E
+	 *    /\   /\          /\
+	 *   4  8             C  D
+	 *  /\               /\
+	 * 6  9             A  B
+	 * ^ becomes 5 with the write.
+	 *
+	 * Below, the reconstruction leaves the root with 2 entries, the setup
+	 * uses the letter labels above.
+	 */
+
+	ret = build_full_tree(mt, MT_FLAGS_ALLOC_RANGE, 4);
+	MT_BUG_ON(mt, ret);
+
+	/* Cheap expansion to 5 levels */
+	mtree_store(mt, ULONG_MAX, xa_mk_value(0), GFP_KERNEL);
+	/* rcu is used to ensure node use */
+	mt_set_in_rcu(mt);
+	mas_lock(&mas);
+
+	/* Node A had 6 entries */
+	mas_walk(&mas);
+	MAS_BUG_ON(&mas, mas_data_end(&mas) < 6);
+	while (mas_data_end(&mas) > 6) {
+		mas_erase(&mas);
+		mas_next(&mas, ULONG_MAX);
+	}
+
+	/* Move to Node B */
+	enode = (void*) mas.node;
+	while (mas.node == enode)
+		mas_next(&mas, ULONG_MAX);
+
+	/* Node B had 9 entries */
+	MAS_BUG_ON(&mas, mas_data_end(&mas) < 9);
+	while (mas_data_end(&mas) > 9) {
+		mas_erase(&mas);
+		mas_next(&mas, ULONG_MAX);
+	}
+
+	/* Move to Node C */
+	mas_ascend(&mas);
+	val = mas.max;
+	/* Adjust entries to be 4 */
+	while (mas_data_end(&mas) > 4) {
+		mas_set(&mas, val);
+		mas_erase(&mas);
+		mas_prev(&mas, 0);
+		val = mas.index;
+		mas_ascend(&mas);
+	}
+
+	/* Move to Node D */
+	mas_ascend(&mas);
+	mas.offset = 1;
+	mas_descend(&mas);
+	val = mas.max;
+	/* Adjust entries to be 8 */
+	while (mas_data_end(&mas) < 8) {
+		mas_set(&mas, val--);
+		mas_store_gfp(&mas, &mas, GFP_KERNEL);
+		mas_ascend(&mas);
+	}
+
+	/* Move to Node E */
+	mas_ascend(&mas);
+	val = mas.max;
+	MAS_BUG_ON(&mas, mas_data_end(&mas) > 9);
+	/* Adjust Node E to 9 entries */
+	while (mas_data_end(&mas) < 9) {
+		mas_set(&mas, val--);
+		mas_store_gfp(&mas, &mas, GFP_KERNEL);
+		mas_ascend(&mas);
+		mas_ascend(&mas);
+	}
+
+	/* Move to Node F */
+	mas_ascend(&mas);
+	val = mas.max;
+	MAS_BUG_ON(&mas, mas_data_end(&mas) > 9);
+	/* Adjust Node F to 9 entries */
+	while (mas_data_end(&mas) < 9) {
+		mas_set(&mas, val--);
+		mas_store_gfp(&mas, &mas, GFP_KERNEL);
+		mas_ascend(&mas);
+		mas_ascend(&mas);
+		mas_ascend(&mas);
+	}
+
+	/* Test is set up, walk to first entry */
+	mas_set(&mas, 0);
+	mas_next(&mas, ULONG_MAX);
+	/* overwrite the entry to cause a rebalance, which was 1 too few */
+	mas_set_range(&mas, 0, mas.last);
+	mas_preallocate(&mas, NULL, GFP_KERNEL);
+	mas_store_prealloc(&mas, NULL);
+	mas_unlock(&mas);
+}
+
 static noinline void __init check_mtree_dup(struct maple_tree *mt)
 {
 	DEFINE_MTREE(new);
@@ -36260,6 +36381,10 @@ void farmer_tests(void)
 	check_mtree_dup(&tree);
 	mtree_destroy(&tree);
 
+	mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
+	check_erase_rebalance(&tree);
+	mtree_destroy(&tree);
+
 	/* RCU testing */
 	mt_init_flags(&tree, 0);
 	check_erase_testset(&tree);
_
Re: [PATCH v3 00/30] maple_tree: Replace big node with maple copy
Posted by Liam R. Howlett 4 days, 23 hours ago
* Andrew Morton <akpm@linux-foundation.org> [260131 15:27]:
> On Fri, 30 Jan 2026 15:59:05 -0500 "Liam R. Howlett" <Liam.Howlett@oracle.com> wrote:
> 
> > The big node struct was created for simplicity of splitting,
> > rebalancing, and spanning store operations by using a copy buffer to
> > create the data necessary prior to breaking it up into 256B nodes.
> > Certain operations were rather tricky due to the restriction of keeping
> > NULL entries together and never at the end of a node (except the
> > right-most node).
> > 
> 
> Updated, thanks.
> 
> What are your thoughts on adding this to 6.19?  I'd expect to move it
> into mm-stable Feb 17ish.

I think the chance of making 6.19 has passed.

Can we please target getting this into testing in linux-next as soon as
6.19 ships?

The priority for me is to ensure this is rock-solid prior to release.
My concerns are corner cases and arch specific issues, which I can only
do so much to mitigate on my own.  My hope is that linux-next exposure
will help with those concerns.

So, at this point, I think the right call is delaying until after next
Sunday.

> 
> > Changes since v2:
> >  - Whitespace fix in rebalance_data()
> >  - Added ma_init_slot() for cleaner casting during RCU_INIT_POINTER().
> >    Thanks test robot & SK [1]
> >  - Fixed off-by-one in rebalance_data() which caused node overuse,
> >    reported by linux-next. Thanks Mark [2]
> >  - Added new patch to reproducer to test cases in userspace testing for
> >    the rebalance_data() off-by-one
> 
> Below is how v3 altered mm.git:

This looks correct.  I did check my git diff here between my two
branches and it is identical to what you sent out.

...

> +	 *
> +	 * The bugs tree:
> +	 * root    4      Label     R
> +	 *        /\               /\
> +	 *       9   X            F
> +	 *      /\   /\          /
> +	 *     9   X            E
> +	 *    /\   /\          /\
> +	 *   4  8             C  D
> +	 *  /\               /\
> +	 * 6  9             A  B
> +	 * ^ becomes 5 with the write.
> +	 *

I especially like the ASCII art in the test, totally worth the wait.

Thanks,
Liam
Re: [PATCH v3 00/30] maple_tree: Replace big node with maple copy
Posted by Andrew Morton 4 days, 21 hours ago
On Mon, 2 Feb 2026 10:40:06 -0500 "Liam R. Howlett" <Liam.Howlett@oracle.com> wrote:

> * Andrew Morton <akpm@linux-foundation.org> [260131 15:27]:
> > On Fri, 30 Jan 2026 15:59:05 -0500 "Liam R. Howlett" <Liam.Howlett@oracle.com> wrote:
> > 
> > > The big node struct was created for simplicity of splitting,
> > > rebalancing, and spanning store operations by using a copy buffer to
> > > create the data necessary prior to breaking it up into 256B nodes.
> > > Certain operations were rather tricky due to the restriction of keeping
> > > NULL entries together and never at the end of a node (except the
> > > right-most node).
> > > 
> > 
> > Updated, thanks.
> > 
> > What are your thoughts on adding this to 6.19?  I'd expect to move it
> > into mm-stable Feb 17ish.
> 
> I think the chance of making 6.19 has passed.
> 
> Can we please target getting this into testing in linux-next as soon as
> 6.19 ships?
> 
> The priority for me is to ensure this is rock-solid prior to release.
> My concerns are corner cases and arch specific issues, which I can only
> do so much to mitigate on my own.  My hope is that linux-next exposure
> will help with those concerns.

Sounds good, I'll push this back into mm-new until -rc1.