Currently, alloc_pages_bulk_weighted_interleave can make up to nr_node_ids+1
calls to __alloc_pages_bulk. The additional call happens when the previous
invocation of this function finished its weighted round-robin allocation
partway through a node's weight. To make up for this, the next time this
function is called, an extra allocation is made first to finish cleanly on a
node boundary before the weighted round-robin cycles begin again.
Instead of making an additional call, we can calculate how many additional
pages should be allocated from the first node (the carryover) and add that
value to the number of pages that node should receive as part of the current
round-robin cycle.
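For illustration, a minimal userspace sketch (not the kernel code, and not
part of the patch) of how the carryover folds into the first node's share of
the round-robin split instead of being satisfied by a separate
__alloc_pages_bulk() call. The weights and page counts below are made up.

#include <stdio.h>

int main(void)
{
	unsigned int weights[] = { 10, 7 };	/* hypothetical interleave weights */
	unsigned int nnodes = 2;
	unsigned int weight_total = 17;
	unsigned long rem_pages = 30;		/* pages requested in this call */
	unsigned long carryover = 4;		/* weight left over on il_prev */

	unsigned long rounds = (rem_pages - carryover) / weight_total;
	unsigned long delta = (rem_pages - carryover) % weight_total;

	for (unsigned int i = 0; i < nnodes; i++) {
		unsigned long take = delta < weights[i] ? delta : weights[i];
		/* only the first node (il_prev) receives the carryover */
		unsigned long node_pages = (unsigned long)weights[i] * rounds +
					   take + carryover;

		delta -= take;
		carryover = 0;
		printf("node %u: %lu pages\n", i, node_pages);
	}
	return 0;
}

With the carryover folded in this way, the loop still makes at most one
__alloc_pages_bulk call per node.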
Running a quick benchmark by compiling the kernel shows a small increase
in performance. These experiments were run on a machine with 2 nodes, each
with 125GB memory and 40 CPUs.
time numactl -w 0,1 make -j$(nproc)
+----------+---------+------------+---------+
| Time (s) | 6.16 | With patch | % Delta |
+----------+---------+------------+---------+
| Real | 88.374 | 88.3356 | -0.2019 |
| User | 3631.7 | 3636.263 | 0.0631 |
| Sys | 366.029 | 363.792 | -0.7534 |
+----------+---------+------------+---------+
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
mm/mempolicy.c | 39 ++++++++++++++++++++-------------------
1 file changed, 20 insertions(+), 19 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 78ad74a0e249..0d693f96cf66 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2569,7 +2569,7 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
unsigned long node_pages, delta;
u8 *weights, weight;
unsigned int weight_total = 0;
- unsigned long rem_pages = nr_pages;
+ unsigned long rem_pages = nr_pages, carryover = 0;
nodemask_t nodes;
int nnodes, node;
int resume_node = MAX_NUMNODES - 1;
@@ -2594,18 +2594,12 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
node = me->il_prev;
weight = me->il_weight;
if (weight && node_isset(node, nodes)) {
- node_pages = min(rem_pages, weight);
- nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
- page_array);
- page_array += nr_allocated;
- total_allocated += nr_allocated;
- /* if that's all the pages, no need to interleave */
if (rem_pages <= weight) {
- me->il_weight -= rem_pages;
- return total_allocated;
+ node_pages = rem_pages;
+ me->il_weight -= node_pages;
+ goto allocate;
}
- /* Otherwise we adjust remaining pages, continue from there */
- rem_pages -= weight;
+ carryover = weight;
}
/* clear active weight in case of an allocation failure */
me->il_weight = 0;
@@ -2614,7 +2608,7 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
/* create a local copy of node weights to operate on outside rcu */
weights = kzalloc(nr_node_ids, GFP_KERNEL);
if (!weights)
- return total_allocated;
+ return 0;
rcu_read_lock();
state = rcu_dereference(wi_state);
@@ -2634,16 +2628,17 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
/*
* Calculate rounds/partial rounds to minimize __alloc_pages_bulk calls.
* Track which node weighted interleave should resume from.
+ * Account for carryover. It is always allocated from the first node.
*
* if (rounds > 0) and (delta == 0), resume_node will always be
* the node following prev_node and its weight.
*/
- rounds = rem_pages / weight_total;
- delta = rem_pages % weight_total;
+ rounds = (rem_pages - carryover) / weight_total;
+ delta = (rem_pages - carryover) % weight_total;
resume_node = next_node_in(prev_node, nodes);
resume_weight = weights[resume_node];
+ node = carryover ? prev_node : next_node_in(prev_node, nodes);
for (i = 0; i < nnodes; i++) {
- node = next_node_in(prev_node, nodes);
weight = weights[node];
/* when delta is depleted, resume from that node */
if (delta && delta < weight) {
@@ -2651,12 +2646,14 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
resume_weight = weight - delta;
}
/* Add the node's portion of the delta, if there is one */
- node_pages = weight * rounds + min(delta, weight);
+ node_pages = weight * rounds + min(delta, weight) + carryover;
delta -= min(delta, weight);
+ carryover = 0;
/* node_pages can be 0 if an allocation fails and rounds == 0 */
if (!node_pages)
break;
+allocate:
nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
page_array);
page_array += nr_allocated;
@@ -2664,10 +2661,14 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
if (total_allocated == nr_pages)
break;
prev_node = node;
+ node = next_node_in(prev_node, nodes);
+ }
+
+ if (weights) {
+ me->il_prev = resume_node;
+ me->il_weight = resume_weight;
+ kfree(weights);
}
- me->il_prev = resume_node;
- me->il_weight = resume_weight;
- kfree(weights);
return total_allocated;
}
--
2.47.1
On 26-06-2025 at 22:09, Joshua Hahn wrote:
[...snip...]
> @@ -2594,18 +2594,12 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
> node = me->il_prev;
> weight = me->il_weight;
> if (weight && node_isset(node, nodes)) {
> - node_pages = min(rem_pages, weight);
> - nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> - page_array);
> - page_array += nr_allocated;
> - total_allocated += nr_allocated;
> - /* if that's all the pages, no need to interleave */
> if (rem_pages <= weight) {
> - me->il_weight -= rem_pages;
> - return total_allocated;
> + node_pages = rem_pages;
> + me->il_weight -= node_pages;
> + goto allocate;

This is a goto into the middle of a for-loop.
What do you think is going to happen at the end of that loop?

I think (only tested with a small C program) it will go to the start of
the loop, do the i++, check i<nnodes, and possibly do the loop again.
Variable i is uninitialized at that point. In the loop it hits several
uninitialized variables.

Even if this is legal C code, it is pretty obscure.

[...snip...]
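A minimal standalone C program (an editorial illustration, not Kees's test
program) of the control flow he describes: a goto that jumps into the body of
a for loop, after which control falls through to the loop's increment and
condition even though the loop's initializer never ran. Here i is given a
value before the jump to keep the demo well-defined; in the patch the
corresponding variable is uninitialized at that point.

#include <stdio.h>

int main(void)
{
	int i = -5;	/* stand-in for "whatever happened to be in i" */

	goto inside;

	for (i = 0; i < 3; i++) {
		printf("top of the loop, i = %d\n", i);
inside:
		printf("inside the body, i = %d\n", i);
	}
	return 0;
}

Run as-is, it prints the "inside the body" line with i = -5 and then keeps
iterating from i = -4 up to 2, which is exactly the continuation described
above.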
On Mon, 30 Jun 2025 22:05:48 +0200 Kees Bakker <kees@ijzerbout.nl> wrote:

[...snip...]
> > if (rem_pages <= weight) {
> > - me->il_weight -= rem_pages;
> > - return total_allocated;
> > + node_pages = rem_pages;
> > + me->il_weight -= node_pages;
> > + goto allocate;

Hello Kees,

Thank you for reviewing my code!

> This is a goto into the middle of a for-loop.
> What do you think is going to happen at the end of that loop?
>
> I think (only tested with a small C program) it will go to the start of
> the loop, do the i++, check i<nnodes, and possibly do the loop again.
> Variable i is uninitialized at that point. In the loop it hits several
> uninitialized variables.

From what I can see from my code, I think the only way the goto statement
leads to a second iteration of the for loop is if allocation fails. But
otherwise, it should be OK, since we always hit

	if (total_allocated == nr_pages)
		break;

within the loop. For the branch that takes the goto, we set
node_pages = rem_pages, then jump to the label and allocate. So
nr_allocated = node_pages, and total_allocated = 0 + nr_allocated, so
total_allocated = node_pages.

total_allocated == node_pages == rem_pages == nr_pages, so we will break. Phew!

To cover the case where allocation fails, I think we should be breaking
anyways, so I can definitely add a new check for this.

> Even if this is legal C code, it is pretty obscure.

I agree that it is not very clean. I did this to reduce the amount of
repeated code there is. Even if this code works, it could definitely be
written better to make it more readable and maintainable. As I noted in my
second response to Gregory, I'm not planning on pursuing this version
anymore, so if I decide to send a second version, I'll keep this in mind.

Thank you again for taking the time to review this, and also testing it on
your end! I hope you have a great day :-)
Joshua

Sent using hkml (https://github.com/sjp38/hackermail)
On Mon, 30 Jun 2025 13:21:14 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> > This is a goto into the middle of a for-loop.
> > What do you think is going to happen at the end of that loop?
[...snip...]
> From what I can see from my code, I think the only way the goto statement
> leads to a second iteration of the for loop is if allocation fails. But
> otherwise, it should be OK, since we always hit
>
> 	if (total_allocated == nr_pages)
> 		break;
>
> within the loop.
[...snip...]
> To cover the case where allocation fails, I think we should be breaking
> anyways, so I can definitely add a new check for this.

I do agree, that goto is a "goto too far". That we can do a thing
doesn't mean we should do it!

> > Even if this is legal C code, it is pretty obscure.
>
> I agree that it is not very clean. I did this to reduce the amount of
> repeated code there is. Even if this code works, it could definitely be
> written better to make it more readable and maintainable. As I noted in my
> second response to Gregory, I'm not planning on pursuing this version
> anymore, so if I decide to send a second version, I'll keep this in mind.

Cool, I'll drop this version from mm-unstable.
On Mon, 30 Jun 2025 15:35:01 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

[...snip...]
> I do agree, that goto is a "goto too far". That we can do a thing
> doesn't mean we should do it!

Haha :-)

[...snip...]
> Cool, I'll drop this version from mm-unstable.

Sounds good Andrew, thank you always for all of your help!
Joshua

Sent using hkml (https://github.com/sjp38/hackermail)
On Thu, Jun 26, 2025 at 01:09:34PM -0700, Joshua Hahn wrote:
> Currently, alloc_pages_bulk_weighted_interleave can make up to nr_node_ids+1
> calls to __alloc_pages_bulk. The additional call happens when the previous
> invocation of this function finished its weighted round-robin allocation
> partway through a node's weight. To make up for this, the next time this
> function is called, an extra allocation is made first to finish cleanly on a
> node boundary before the weighted round-robin cycles begin again.
>

task->il_weight can be operated on by other weighted_interleave
functions in mempolicy, so it's not just
alloc_pages_bulk_weighted_interleave that affects this.

Observations here are still true, just a correction for clarity, i.e.:
the additional allocation happens for the current il_node/il_weight.

I *think* I did it this way just because it was easier to reason about
the two chunks separately. So I don't see a reason not to improve this.
I will say that, at least at the time, I took the core math and
validated the edge conditions in a separate program.

If you get it wrong in the kernel, you'd either fail to allocate - or more
likely just get the wrong distribution of pages. The latter is
non-obvious unless you go looking for it, so it might be good to at
least add this test result in the change log. It's hard to write this
in LTP or kernel selftest unfortunately.

[...snip...]
> @@ -2614,7 +2608,7 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
> /* create a local copy of node weights to operate on outside rcu */
> weights = kzalloc(nr_node_ids, GFP_KERNEL);
> if (!weights)
> - return total_allocated;
> + return 0;

may be worth noting that this change means small bulk allocations that
could be covered by the current il_node/weight may now fail if kzalloc
fails. But then there are probably other problems. However, this is a
functional difference between the old and new state of the function.

[...snip...]
On Fri, 27 Jun 2025 00:28:33 -0400 Gregory Price <gourry@gourry.net> wrote:

Hi Gregory,

Hope you are doing well :-) Thanks for the review!

> task->il_weight can be operated on by other weighted_interleave
> functions in mempolicy, so it's not just
> alloc_pages_bulk_weighted_interleave that affects this.

This is true, and I think that's what makes testing this part of the code
tricky. I'll elaborate more below.

> Observations here are still true, just a correction for clarity, i.e.:
> the additional allocation happens for the current il_node/il_weight.
>
> I *think* I did it this way just because it was easier to reason about
> the two chunks separately. So I don't see a reason not to improve this.
> I will say that, at least at the time, I took the core math and
> validated the edge conditions in a separate program.
>
> If you get it wrong in the kernel, you'd either fail to allocate - or more
> likely just get the wrong distribution of pages. The latter is
> non-obvious unless you go looking for it, so it might be good to at
> least add this test result in the change log. It's hard to write this
> in LTP or kernel selftest unfortunately.

I think so too :-(

One test that I wanted to do while developing this feature was to see if
I could figure out how many pages are really allocated from each node. The
difficulty in doing this (as you pointed out above) is that because there are
other ways that move the round robin forward (without necessarily calling the
bulk alloc function), it's hard to directly attribute the page allocations.

If this was the only place that we were modifying these values, then a
correctness check would be equivalent to just adding a printk of each node
and how many pages were allocated on it, then adding all the numbers up to
see if it matches the weight ratios in sysfs.

So I think I will do what you did as well -- I think that performing some
tests, at least on the edge cases, in a separate program will help give
some confidence that the code works as intended. I'll also continue to think
about whether there are better ways to be testing this instead.

Thanks again for reviewing this, have a great day!
Joshua

Sent using hkml (https://github.com/sjp38/hackermail)
On Fri, 27 Jun 2025 09:13:18 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> On Fri, 27 Jun 2025 00:28:33 -0400 Gregory Price <gourry@gourry.net> wrote:
>
> Hi Gregory,

[...snip...]

> > I will say that, at least at the time, I took the core math and
> > validated the edge conditions in a separate program.
[...snip...]
> So I think I will do what you did as well -- I think that performing some
> tests, at least on the edge cases, in a separate program will help give
> some confidence that the code works as intended. I'll also continue to think
> about whether there are better ways to be testing this instead.

Like you suggested, I decided to run a simulation just to see if the number
of pages allocated from each node lined up with the old version, and if the
numbers made sense for both cases. I found a few issues with my version of
the code: the math is just incorrect when rounds == 0.

Let's say there's a 2-node machine with weights [10, 7]. We should start
allocating from node0, which has 7 pages remaining on its weight, and we want
to allocate 14 pages. Here's how the new math goes:

- First node should allocate 7 pages, so let carryover = 7
- Then remaining pages = 14 - 7 = 7
- Allocate rounds * weight + min(weight, delta) + carryover:
  = 0 * 10 + min(10, 7) + 7 = 7 + 7 = 14

This is incorrect, since we will be allocating all 14 pages from the first
node, while the real distribution should be 7 pages from the first node and
7 pages from the second node. I think this can also lead to some
overallocation issues.

So there are a few options now:

- Change the addition to be:
  rounds * weight + min(min(weight, delta + carryover), weight)
  and adjust the remaining math accordingly. But this looks very bad and is
  not intuitive at all, so I don't like this idea.
- This can easily be solved if, instead of figuring out how many pages have
  to be allocated as we iterate through the nodes, we do one pass and figure
  out how many pages must be allocated. The problem here is that we
  re-introduce another loop, which adds to the code complexity and may
  actually be a performance decrease instead. So again, I don't think this is
  the best idea.
- Skip this patch! This is a small optimization on performance, and I think
  it simplifies the code, but it turns out the original math is just much
  harder to do without two separate calculations. I'll keep this running in
  the back of my mind to see if I can figure out a way to solve it later...

I also think it makes sense to drop the first patch as well, since there's
no optimization included with the cleanup. But since it already got a few
reviews, maybe it makes sense to keep that one :-)

Anyways, I really appreciate the review Gregory! Maybe I'll have a
breakthrough idea on how to correctly do the math here sometime in the
future. Have a great day!

Joshua

Sent using hkml (https://github.com/sjp38/hackermail)
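For reference, a hypothetical standalone sketch of the kind of simulation
described above (not the exact program Joshua ran): it replays the patch's
rounds/delta/carryover math for a 2-node machine with weights {10, 7}, 7
pages of weight left on node 0 (il_prev), and a request for 14 pages.

#include <stdio.h>

#define NNODES 2

int main(void)
{
	unsigned int weights[NNODES] = { 10, 7 };
	unsigned long out[NNODES] = { 0, 0 };
	unsigned long nr_pages = 14;
	unsigned long carryover = 7;	/* il_weight remaining on node 0 */
	unsigned int weight_total = 17;
	unsigned long rounds = (nr_pages - carryover) / weight_total;	/* 0 */
	unsigned long delta = (nr_pages - carryover) % weight_total;	/* 7 */
	int node = carryover ? 0 : 1;	/* carryover is served from il_prev */

	for (int i = 0; i < NNODES; i++) {
		unsigned long take = delta < weights[node] ? delta : weights[node];

		out[node] = (unsigned long)weights[node] * rounds + take + carryover;
		delta -= take;
		carryover = 0;
		node = (node + 1) % NNODES;
	}

	/* Prints node 0: 14, node 1: 0 -- not the intended 7 / 7 split. */
	for (int i = 0; i < NNODES; i++)
		printf("node %d: %lu pages\n", i, out[i]);
	return 0;
}

The same program with the pre-patch logic (allocate the leftover 7 pages from
node 0 up front, then split the remaining 7 across the weights) produces the
intended 7/7 split.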