Currently, alloc_pages_bulk_weighted_interleave can make up to nr_node_ids+1
calls to __alloc_pages_bulk. The additional call happens when the previous
invocation of this function finished its weighted round-robin allocation
partway through a node's weight. To make up for this, the next time this
function is called, an extra allocation is made first to finish cleanly on a
node boundary before the weighted round-robin cycles begin again.
Instead of making an additional call, we can calculate how many additional
pages should be allocated from the first node (the carryover) and add that
value to the number of pages that node should receive as part of the current
round-robin cycle.
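For illustration, a minimal userspace sketch (not the kernel code, and not
part of the patch) of how the carryover folds into the first node's share of
the round-robin split instead of being satisfied by a separate
__alloc_pages_bulk() call. The weights and page counts below are made up.

#include <stdio.h>

int main(void)
{
	unsigned int weights[] = { 10, 7 };	/* hypothetical interleave weights */
	unsigned int nnodes = 2;
	unsigned int weight_total = 17;
	unsigned long rem_pages = 30;		/* pages requested in this call */
	unsigned long carryover = 4;		/* weight left over on il_prev */

	unsigned long rounds = (rem_pages - carryover) / weight_total;
	unsigned long delta = (rem_pages - carryover) % weight_total;

	for (unsigned int i = 0; i < nnodes; i++) {
		unsigned long take = delta < weights[i] ? delta : weights[i];
		/* only the first node (il_prev) receives the carryover */
		unsigned long node_pages = (unsigned long)weights[i] * rounds +
					   take + carryover;

		delta -= take;
		carryover = 0;
		printf("node %u: %lu pages\n", i, node_pages);
	}
	return 0;
}

With the carryover folded in this way, the loop still makes at most one
__alloc_pages_bulk call per node.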
Running a quick benchmark by compiling the kernel shows a small increase
in performance. These experiments were run on a machine with 2 nodes, each
with 125GB memory and 40 CPUs.
time numactl -w 0,1 make -j$(nproc)
+----------+---------+------------+---------+
| Time (s) | 6.16 | With patch | % Delta |
+----------+---------+------------+---------+
| Real | 88.374 | 88.3356 | -0.2019 |
| User | 3631.7 | 3636.263 | 0.0631 |
| Sys | 366.029 | 363.792 | -0.7534 |
+----------+---------+------------+---------+
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
mm/mempolicy.c | 39 ++++++++++++++++++++-------------------
1 file changed, 20 insertions(+), 19 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 78ad74a0e249..0d693f96cf66 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2569,7 +2569,7 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
unsigned long node_pages, delta;
u8 *weights, weight;
unsigned int weight_total = 0;
- unsigned long rem_pages = nr_pages;
+ unsigned long rem_pages = nr_pages, carryover = 0;
nodemask_t nodes;
int nnodes, node;
int resume_node = MAX_NUMNODES - 1;
@@ -2594,18 +2594,12 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
node = me->il_prev;
weight = me->il_weight;
if (weight && node_isset(node, nodes)) {
- node_pages = min(rem_pages, weight);
- nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
- page_array);
- page_array += nr_allocated;
- total_allocated += nr_allocated;
- /* if that's all the pages, no need to interleave */
if (rem_pages <= weight) {
- me->il_weight -= rem_pages;
- return total_allocated;
+ node_pages = rem_pages;
+ me->il_weight -= node_pages;
+ goto allocate;
}
- /* Otherwise we adjust remaining pages, continue from there */
- rem_pages -= weight;
+ carryover = weight;
}
/* clear active weight in case of an allocation failure */
me->il_weight = 0;
@@ -2614,7 +2608,7 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
/* create a local copy of node weights to operate on outside rcu */
weights = kzalloc(nr_node_ids, GFP_KERNEL);
if (!weights)
- return total_allocated;
+ return 0;
rcu_read_lock();
state = rcu_dereference(wi_state);
@@ -2634,16 +2628,17 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
/*
* Calculate rounds/partial rounds to minimize __alloc_pages_bulk calls.
* Track which node weighted interleave should resume from.
+ * Account for carryover. It is always allocated from the first node.
*
* if (rounds > 0) and (delta == 0), resume_node will always be
* the node following prev_node and its weight.
*/
- rounds = rem_pages / weight_total;
- delta = rem_pages % weight_total;
+ rounds = (rem_pages - carryover) / weight_total;
+ delta = (rem_pages - carryover) % weight_total;
resume_node = next_node_in(prev_node, nodes);
resume_weight = weights[resume_node];
+ node = carryover ? prev_node : next_node_in(prev_node, nodes);
for (i = 0; i < nnodes; i++) {
- node = next_node_in(prev_node, nodes);
weight = weights[node];
/* when delta is depleted, resume from that node */
if (delta && delta < weight) {
@@ -2651,12 +2646,14 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
resume_weight = weight - delta;
}
/* Add the node's portion of the delta, if there is one */
- node_pages = weight * rounds + min(delta, weight);
+ node_pages = weight * rounds + min(delta, weight) + carryover;
delta -= min(delta, weight);
+ carryover = 0;
/* node_pages can be 0 if an allocation fails and rounds == 0 */
if (!node_pages)
break;
+allocate:
nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
page_array);
page_array += nr_allocated;
@@ -2664,10 +2661,14 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
if (total_allocated == nr_pages)
break;
prev_node = node;
+ node = next_node_in(prev_node, nodes);
+ }
+
+ if (weights) {
+ me->il_prev = resume_node;
+ me->il_weight = resume_weight;
+ kfree(weights);
}
- me->il_prev = resume_node;
- me->il_weight = resume_weight;
- kfree(weights);
return total_allocated;
}
--
2.47.1
On 26-06-2025 at 22:09, Joshua Hahn wrote:
[...snip...]
> @@ -2594,18 +2594,12 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
> node = me->il_prev;
> weight = me->il_weight;
> if (weight && node_isset(node, nodes)) {
> - node_pages = min(rem_pages, weight);
> - nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> - page_array);
> - page_array += nr_allocated;
> - total_allocated += nr_allocated;
> - /* if that's all the pages, no need to interleave */
> if (rem_pages <= weight) {
> - me->il_weight -= rem_pages;
> - return total_allocated;
> + node_pages = rem_pages;
> + me->il_weight -= node_pages;
> + goto allocate;

This is a goto into the middle of a for-loop.
What do you think is going to happen at the end of that loop?

I think (only tested with a small C program) it will go to the start of
the loop, do the i++, check i<nnodes, and possibly do the loop again.
Variable i is uninitialized at that point. In the loop it hits several
uninitialized variables.

Even if this is legal C code, it is pretty obscure.

[...snip...]
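A minimal standalone C program (an editorial illustration, not Kees's test
program) of the control flow he describes: a goto that jumps into the body of
a for loop, after which control falls through to the loop's increment and
condition even though the loop's initializer never ran. Here i is given a
value before the jump to keep the demo well-defined; in the patch the
corresponding variable is uninitialized at that point.

#include <stdio.h>

int main(void)
{
	int i = -5;	/* stand-in for "whatever happened to be in i" */

	goto inside;

	for (i = 0; i < 3; i++) {
		printf("top of the loop, i = %d\n", i);
inside:
		printf("inside the body, i = %d\n", i);
	}
	return 0;
}

Run as-is, it prints the "inside the body" line with i = -5 and then keeps
iterating from i = -4 up to 2, which is exactly the continuation described
above.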
On Mon, 30 Jun 2025 22:05:48 +0200 Kees Bakker <kees@ijzerbout.nl> wrote:

[...snip...]
> > if (rem_pages <= weight) {
> > - me->il_weight -= rem_pages;
> > - return total_allocated;
> > + node_pages = rem_pages;
> > + me->il_weight -= node_pages;
> > + goto allocate;

Hello Kees,

Thank you for reviewing my code!

> This is a goto into the middle of a for-loop.
> What do you think is going to happen at the end of that loop?
>
> I think (only tested with a small C program) it will go to the start of
> the loop, do the i++, check i<nnodes, and possibly do the loop again.
> Variable i is uninitialized at that point. In the loop it hits several
> uninitialized variables.

From what I can see from my code, I think the only way the goto statement
leads to a second iteration of the for loop is if allocation fails. But
otherwise, it should be OK, since we always hit

	if (total_allocated == nr_pages)
		break;

within the loop. For the branch that takes the goto, we set
node_pages = rem_pages, then jump to the label and allocate. So
nr_allocated = node_pages, and total_allocated = 0 + nr_allocated, so
total_allocated = node_pages.

total_allocated == node_pages == rem_pages == nr_pages, so we will break. Phew!

To cover the case where allocation fails, I think we should be breaking
anyways, so I can definitely add a new check for this.

> Even if this is legal C code, it is pretty obscure.

I agree that it is not very clean. I did this to reduce the amount of
repeated code there is. Even if this code works, it could definitely be
written better to make it more readable and maintainable. As I noted in my
second response to Gregory, I'm not planning on pursuing this version
anymore, so if I decide to send a second version, I'll keep this in mind.

Thank you again for taking the time to review this, and also testing it on
your end! I hope you have a great day :-)
Joshua

Sent using hkml (https://github.com/sjp38/hackermail)
On Mon, 30 Jun 2025 13:21:14 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> > This is a goto into the middle of a for-loop.
> > What do you think is going to happen at the end of that loop?
[...snip...]
> From what I can see from my code, I think the only way the goto statement
> leads to a second iteration of the for loop is if allocation fails. But
> otherwise, it should be OK, since we always hit
>
> 	if (total_allocated == nr_pages)
> 		break;
>
> within the loop.
[...snip...]
> To cover the case where allocation fails, I think we should be breaking
> anyways, so I can definitely add a new check for this.

I do agree, that goto is a "goto too far". That we can do a thing
doesn't mean we should do it!

> > Even if this is legal C code, it is pretty obscure.
>
> I agree that it is not very clean. I did this to reduce the amount of
> repeated code there is. Even if this code works, it could definitely be
> written better to make it more readable and maintainable. As I noted in my
> second response to Gregory, I'm not planning on pursuing this version
> anymore, so if I decide to send a second version, I'll keep this in mind.

Cool, I'll drop this version from mm-unstable.
On Mon, 30 Jun 2025 15:35:01 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

[...snip...]
> I do agree, that goto is a "goto too far". That we can do a thing
> doesn't mean we should do it!

Haha :-)

[...snip...]
> Cool, I'll drop this version from mm-unstable.

Sounds good Andrew, thank you always for all of your help!
Joshua

Sent using hkml (https://github.com/sjp38/hackermail)
On Thu, Jun 26, 2025 at 01:09:34PM -0700, Joshua Hahn wrote:
> Currently, alloc_pages_bulk_weighted_interleave can make up to nr_node_ids+1
> calls to __alloc_pages_bulk. The additional call happens when the previous
> invocation of this function finished its weighted round-robin allocation
> partway through a node's weight. To make up for this, the next time this
> function is called, an extra allocation is made first to finish cleanly on a
> node boundary before the weighted round-robin cycles begin again.
>

task->il_weight can be operated on by other weighted_interleave
functions in mempolicy, so it's not just
alloc_pages_bulk_weighted_interleave that affects this.

Observations here are still true, just a correction for clarity, i.e.:
the additional allocation happens for the current il_node/il_weight.

I *think* I did it this way just because it was easier to reason about
the two chunks separately. So I don't see a reason not to improve this.
I will say that, at least at the time, I took the core math and
validated the edge conditions in a separate program.

If you get it wrong in the kernel, you'd either fail to allocate - or more
likely just get the wrong distribution of pages. The latter is
non-obvious unless you go looking for it, so it might be good to at
least add this test result in the change log. It's hard to write this
in LTP or kernel selftest unfortunately.

[...snip...]
> @@ -2614,7 +2608,7 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
> /* create a local copy of node weights to operate on outside rcu */
> weights = kzalloc(nr_node_ids, GFP_KERNEL);
> if (!weights)
> - return total_allocated;
> + return 0;

may be worth noting that this change means small bulk allocations that
could be covered by the current il_node/weight may now fail if kzalloc
fails. But then there are probably other problems. However, this is a
functional difference between the old and new state of the function.

[...snip...]
On Fri, 27 Jun 2025 00:28:33 -0400 Gregory Price <gourry@gourry.net> wrote:

Hi Gregory,

Hope you are doing well :-) Thanks for the review!

> task->il_weight can be operated on by other weighted_interleave
> functions in mempolicy, so it's not just
> alloc_pages_bulk_weighted_interleave that affects this.

This is true, and I think that's what makes testing this part of the code
tricky. I'll elaborate more below.

> Observations here are still true, just a correction for clarity, i.e.:
> the additional allocation happens for the current il_node/il_weight.
>
> I *think* I did it this way just because it was easier to reason about
> the two chunks separately. So I don't see a reason not to improve this.
> I will say that, at least at the time, I took the core math and
> validated the edge conditions in a separate program.
>
> If you get it wrong in the kernel, you'd either fail to allocate - or more
> likely just get the wrong distribution of pages. The latter is
> non-obvious unless you go looking for it, so it might be good to at
> least add this test result in the change log. It's hard to write this
> in LTP or kernel selftest unfortunately.

I think so too :-(

One test that I wanted to do while developing this feature was to see if
I could figure out how many pages are really allocated from each node. The
difficulty in doing this (as you pointed out above) is that because there are
other ways that move the round robin forward (without necessarily calling the
bulk alloc function), it's hard to directly attribute the page allocations.

If this was the only place that we were modifying these values, then a
correctness check would be equivalent to just adding a printk of each node
and how many pages were allocated on it, then adding all the numbers up to
see if it matches the weight ratios in sysfs.

So I think I will do what you did as well -- I think that performing some
tests, at least on the edge cases, in a separate program will help give
some confidence that the code works as intended. I'll also continue to think
about whether there are better ways to be testing this instead.

Thanks again for reviewing this, have a great day!
Joshua

Sent using hkml (https://github.com/sjp38/hackermail)
On Fri, 27 Jun 2025 09:13:18 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> On Fri, 27 Jun 2025 00:28:33 -0400 Gregory Price <gourry@gourry.net> wrote:
>
> Hi Gregory,

[...snip...]

> > I will say that, at least at the time, I took the core math and
> > validated the edge conditions in a separate program.
[...snip...]
> So I think I will do what you did as well -- I think that performing some
> tests, at least on the edge cases, in a separate program will help give
> some confidence that the code works as intended. I'll also continue to think
> about whether there are better ways to be testing this instead.

Like you suggested, I decided to run a simulation just to see if the number
of pages allocated from each node lined up with the old version, and if the
numbers made sense for both cases. I found a few issues with my version of
the code: the math is just incorrect when rounds == 0.

Let's say there's a 2-node machine with weights [10, 7]. We should start
allocating from node0, which has 7 pages remaining on its weight, and we want
to allocate 14 pages. Here's how the new math goes:

- First node should allocate 7 pages, so let carryover = 7
- Then remaining pages = 14 - 7 = 7
- Allocate rounds * weight + min(weight, delta) + carryover:
  = 0 * 10 + min(10, 7) + 7 = 7 + 7 = 14

This is incorrect, since we will be allocating all 14 pages from the first
node, while the real distribution should be 7 pages from the first node and
7 pages from the second node. I think this can also lead to some
overallocation issues.

So there are a few options now:

- Change the addition to be:
  rounds * weight + min(min(weight, delta + carryover), weight)
  and adjust the remaining math accordingly. But this looks very bad and is
  not intuitive at all, so I don't like this idea.
- This can easily be solved if, instead of figuring out how many pages have
  to be allocated as we iterate through the nodes, we do one pass and figure
  out how many pages must be allocated. The problem here is that we
  re-introduce another loop, which adds to the code complexity and may
  actually be a performance decrease instead. So again, I don't think this is
  the best idea.
- Skip this patch! This is a small optimization on performance, and I think
  it simplifies the code, but it turns out the original math is just much
  harder to do without two separate calculations. I'll keep this running in
  the back of my mind to see if I can figure out a way to solve it later...

I also think it makes sense to drop the first patch as well, since there's
no optimization included with the cleanup. But since it already got a few
reviews, maybe it makes sense to keep that one :-)

Anyways, I really appreciate the review Gregory! Maybe I'll have a
breakthrough idea on how to correctly do the math here sometime in the
future. Have a great day!

Joshua

Sent using hkml (https://github.com/sjp38/hackermail)
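For reference, a hypothetical standalone sketch of the kind of simulation
described above (not the exact program Joshua ran): it replays the patch's
rounds/delta/carryover math for a 2-node machine with weights {10, 7}, 7
pages of weight left on node 0 (il_prev), and a request for 14 pages.

#include <stdio.h>

#define NNODES 2

int main(void)
{
	unsigned int weights[NNODES] = { 10, 7 };
	unsigned long out[NNODES] = { 0, 0 };
	unsigned long nr_pages = 14;
	unsigned long carryover = 7;	/* il_weight remaining on node 0 */
	unsigned int weight_total = 17;
	unsigned long rounds = (nr_pages - carryover) / weight_total;	/* 0 */
	unsigned long delta = (nr_pages - carryover) % weight_total;	/* 7 */
	int node = carryover ? 0 : 1;	/* carryover is served from il_prev */

	for (int i = 0; i < NNODES; i++) {
		unsigned long take = delta < weights[node] ? delta : weights[node];

		out[node] = (unsigned long)weights[node] * rounds + take + carryover;
		delta -= take;
		carryover = 0;
		node = (node + 1) % NNODES;
	}

	/* Prints node 0: 14, node 1: 0 -- not the intended 7 / 7 split. */
	for (int i = 0; i < NNODES; i++)
		printf("node %d: %lu pages\n", i, out[i]);
	return 0;
}

The same program with the pre-patch logic (allocate the leftover 7 pages from
node 0 up front, then split the remaining 7 across the weights) produces the
intended 7/7 split.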