* [PATCH 0/2] mm/mempolicy: Cleanup and optimization for weighted interleave @ 2025-06-26 20:09 Joshua Hahn 2025-06-26 20:09 ` [PATCH 1/2] mm/mempolicy: Simplify weighted interleave bulk alloc calculations Joshua Hahn 2025-06-26 20:09 ` [PATCH 2/2] mm/mempolicy: Skip extra call to __alloc_pages_bulk in weighted interleave Joshua Hahn 0 siblings, 2 replies; 14+ messages in thread From: Joshua Hahn @ 2025-06-26 20:09 UTC (permalink / raw) To: Gregory Price Cc: Andrew Morton, Alistair Popple, Byungchul Park, David Hildenbrand, Matthew Brost, Rakie Kim, Ying Huang, Zi Yan, linux-kernel, linux-mm, kernel-team Two small patches for weighted interleave bulk allocation. The first patch simplifies the delta calculation needed for the allocations, removing an if-else and performing unconditional additions instead. The second patch makes a minor improvement to the weighted interleave bulk allocation function by skipping a call to __alloc_pages_bulk. Running a quick benchmark by compiling the kernel shows a small increase in performance. These experiments were run on a machine with 2 nodes, each with 125GB memory and 40 CPUs. time numactl -w 0,1 make -j$(nproc) +----------+---------+------------+---------+ | Time (s) | 6.16 | With patch | % Delta | +----------+---------+------------+---------+ | Real | 88.374 | 88.3356 | -0.2019 | | User | 3631.7 | 3636.263 | 0.0631 | | Sys | 366.029 | 363.792 | -0.7534 | +----------+---------+------------+---------+ Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com> Joshua Hahn (2): mm/mempolicy: Simplify weighted interleave bulk alloc calculations mm/mempolicy: Skip extra call to __alloc_pages_bulk in weighted interleave mm/mempolicy.c | 52 +++++++++++++++++++++++++--------------------------- 1 file changed, 25 insertions(+), 27 deletions(-) base-commit: bf8761eda0930400291552bd314c9d59b720e899 -- 2.47.1 ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 1/2] mm/mempolicy: Simplify weighted interleave bulk alloc calculations 2025-06-26 20:09 [PATCH 0/2] mm/mempolicy: Cleanup and optimization for weighted interleave Joshua Hahn @ 2025-06-26 20:09 ` Joshua Hahn 2025-06-26 21:51 ` David Hildenbrand ` (3 more replies) 2025-06-26 20:09 ` [PATCH 2/2] mm/mempolicy: Skip extra call to __alloc_pages_bulk in weighted interleave Joshua Hahn 1 sibling, 4 replies; 14+ messages in thread From: Joshua Hahn @ 2025-06-26 20:09 UTC (permalink / raw) To: Gregory Price Cc: Andrew Morton, Alistair Popple, Byungchul Park, David Hildenbrand, Matthew Brost, Rakie Kim, Ying Huang, Zi Yan, linux-kernel, linux-mm, kernel-team Simplify the math used to figure out how many pages should be allocated per node. Instead of making conditional additions and deletions, we can just make them unconditional by using min(). No functional changes intended. Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com> --- mm/mempolicy.c | 15 ++++++--------- 1 file changed, 6 insertions(+), 9 deletions(-) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 3b1dfd08338b..78ad74a0e249 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2645,18 +2645,15 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp, for (i = 0; i < nnodes; i++) { node = next_node_in(prev_node, nodes); weight = weights[node]; - node_pages = weight * rounds; - /* If a delta exists, add this node's portion of the delta */ - if (delta > weight) { - node_pages += weight; - delta -= weight; - } else if (delta) { - /* when delta is depleted, resume from that node */ - node_pages += delta; + /* when delta is depleted, resume from that node */ + if (delta && delta < weight) { resume_node = node; resume_weight = weight - delta; - delta = 0; } + /* Add the node's portion of the delta, if there is one */ + node_pages = weight * rounds + min(delta, weight); + delta -= min(delta, weight); + /* node_pages can be 0 if an allocation fails and rounds == 0 */ if (!node_pages) break; -- 2.47.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
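For anyone following the arithmetic, here is a small standalone C sketch (illustrative only: made-up weights and page counts, and none of the nodemask or resume-node bookkeeping of the kernel function) comparing the per-node page counts produced by the old conditional logic and the new min()-based form; both distribute the delta identically.

#include <stdio.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

static void split(const unsigned long *w, int nnodes, unsigned long rem_pages)
{
	unsigned long weight_total = 0, rounds, old_delta, new_delta;
	unsigned long old_pages, new_pages;
	int i;

	for (i = 0; i < nnodes; i++)
		weight_total += w[i];
	rounds = rem_pages / weight_total;
	old_delta = new_delta = rem_pages % weight_total;

	printf("rem_pages=%lu rounds=%lu delta=%lu\n", rem_pages, rounds, new_delta);
	for (i = 0; i < nnodes; i++) {
		/* old logic: conditional additions and subtractions */
		old_pages = w[i] * rounds;
		if (old_delta > w[i]) {
			old_pages += w[i];
			old_delta -= w[i];
		} else if (old_delta) {
			old_pages += old_delta;
			old_delta = 0;
		}

		/* new logic: unconditional, via min() */
		new_pages = w[i] * rounds + MIN(new_delta, w[i]);
		new_delta -= MIN(new_delta, w[i]);

		printf("  node %d: old=%lu new=%lu\n", i, old_pages, new_pages);
	}
}

int main(void)
{
	unsigned long weights[] = { 10, 7, 3 };	/* arbitrary example weights */

	split(weights, 3, 25);	/* rounds = 1, delta = 5 */
	split(weights, 3, 10);	/* delta equals the first node's weight */
	split(weights, 3, 17);	/* delta spills across two nodes */
	return 0;
}

The sketch only checks the page split; the resume_node/resume_weight bookkeeping is left to the patch itself.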
* Re: [PATCH 1/2] mm/mempolicy: Simplify weighted interleave bulk alloc calculations 2025-06-26 20:09 ` [PATCH 1/2] mm/mempolicy: Simplify weighted interleave bulk alloc calculations Joshua Hahn @ 2025-06-26 21:51 ` David Hildenbrand 2025-06-27 4:31 ` Gregory Price ` (2 subsequent siblings) 3 siblings, 0 replies; 14+ messages in thread From: David Hildenbrand @ 2025-06-26 21:51 UTC (permalink / raw) To: Joshua Hahn, Gregory Price Cc: Andrew Morton, Alistair Popple, Byungchul Park, Matthew Brost, Rakie Kim, Ying Huang, Zi Yan, linux-kernel, linux-mm, kernel-team On 26.06.25 22:09, Joshua Hahn wrote: > Simplify the math used to figure out how many pages should be allocated > per node. Instead of making conditional additions and deletions, we can just > make them unconditional by using min(). No functional changes intended. > > Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com> > > --- > mm/mempolicy.c | 15 ++++++--------- > 1 file changed, 6 insertions(+), 9 deletions(-) > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 3b1dfd08338b..78ad74a0e249 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -2645,18 +2645,15 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp, > for (i = 0; i < nnodes; i++) { > node = next_node_in(prev_node, nodes); > weight = weights[node]; > - node_pages = weight * rounds; > - /* If a delta exists, add this node's portion of the delta */ > - if (delta > weight) { > - node_pages += weight; > - delta -= weight; > - } else if (delta) { > - /* when delta is depleted, resume from that node */ > - node_pages += delta; > + /* when delta is depleted, resume from that node */ > + if (delta && delta < weight) { > resume_node = node; > resume_weight = weight - delta; > - delta = 0; > } > + /* Add the node's portion of the delta, if there is one */ > + node_pages = weight * rounds + min(delta, weight); > + delta -= min(delta, weight); > + LGTM, but took me a second (and it's a bit late ...) Acked-by: David Hildenbrand <david@redhat.com> -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: Simplify weighted interleave bulk alloc calculations 2025-06-26 20:09 ` [PATCH 1/2] mm/mempolicy: Simplify weighted interleave bulk alloc calculations Joshua Hahn 2025-06-26 21:51 ` David Hildenbrand @ 2025-06-27 4:31 ` Gregory Price 2025-06-27 7:38 ` Rakie Kim 2025-06-27 7:45 ` Oscar Salvador 3 siblings, 0 replies; 14+ messages in thread From: Gregory Price @ 2025-06-27 4:31 UTC (permalink / raw) To: Joshua Hahn Cc: Andrew Morton, Alistair Popple, Byungchul Park, David Hildenbrand, Matthew Brost, Rakie Kim, Ying Huang, Zi Yan, linux-kernel, linux-mm, kernel-team On Thu, Jun 26, 2025 at 01:09:33PM -0700, Joshua Hahn wrote: > Simplify the math used to figure out how many pages should be allocated > per node. Instead of making conditional additions and deletions, we can just > make them unconditional by using min(). No functional changes intended. > > Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com> you're better at concise math than I am :] Reviewed-by: Gregory Price <gourry@gourry.net> > > --- > mm/mempolicy.c | 15 ++++++--------- > 1 file changed, 6 insertions(+), 9 deletions(-) > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 3b1dfd08338b..78ad74a0e249 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -2645,18 +2645,15 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp, > for (i = 0; i < nnodes; i++) { > node = next_node_in(prev_node, nodes); > weight = weights[node]; > - node_pages = weight * rounds; > - /* If a delta exists, add this node's portion of the delta */ > - if (delta > weight) { > - node_pages += weight; > - delta -= weight; > - } else if (delta) { > - /* when delta is depleted, resume from that node */ > - node_pages += delta; > + /* when delta is depleted, resume from that node */ > + if (delta && delta < weight) { > resume_node = node; > resume_weight = weight - delta; > - delta = 0; > } > + /* Add the node's portion of the delta, if there is one */ > + node_pages = weight * rounds + min(delta, weight); > + delta -= min(delta, weight); > + > /* node_pages can be 0 if an allocation fails and rounds == 0 */ > if (!node_pages) > break; > -- > 2.47.1 ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: Simplify weighted interleave bulk alloc calculations 2025-06-26 20:09 ` [PATCH 1/2] mm/mempolicy: Simplify weighted interleave bulk alloc calculations Joshua Hahn 2025-06-26 21:51 ` David Hildenbrand 2025-06-27 4:31 ` Gregory Price @ 2025-06-27 7:38 ` Rakie Kim 2025-06-27 7:45 ` Oscar Salvador 3 siblings, 0 replies; 14+ messages in thread From: Rakie Kim @ 2025-06-27 7:38 UTC (permalink / raw) To: Joshua Hahn Cc: Andrew Morton, Alistair Popple, Byungchul Park, David Hildenbrand, Matthew Brost, Rakie Kim, Ying Huang, Zi Yan, linux-kernel, linux-mm, kernel-team, Gregory Price, kernel_team On Thu, 26 Jun 2025 13:09:33 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote: > Simplify the math used to figure out how many pages should be allocated > per node. Instead of making conditional additions and deletions, we can just > make them unconditional by using min(). No functional changes intended. > > Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com> > > --- > mm/mempolicy.c | 15 ++++++--------- > 1 file changed, 6 insertions(+), 9 deletions(-) > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 3b1dfd08338b..78ad74a0e249 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -2645,18 +2645,15 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp, > for (i = 0; i < nnodes; i++) { > node = next_node_in(prev_node, nodes); > weight = weights[node]; > - node_pages = weight * rounds; > - /* If a delta exists, add this node's portion of the delta */ > - if (delta > weight) { > - node_pages += weight; > - delta -= weight; > - } else if (delta) { > - /* when delta is depleted, resume from that node */ > - node_pages += delta; > + /* when delta is depleted, resume from that node */ > + if (delta && delta < weight) { > resume_node = node; > resume_weight = weight - delta; > - delta = 0; > } > + /* Add the node's portion of the delta, if there is one */ > + node_pages = weight * rounds + min(delta, weight); > + delta -= min(delta, weight); > + > /* node_pages can be 0 if an allocation fails and rounds == 0 */ > if (!node_pages) > break; > -- > 2.47.1 > The updated logic improves clarity and readability. Well structured change. Reviewed-by: Rakie Kim <rakie.kim@sk.com> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: Simplify weighted interleave bulk alloc calculations 2025-06-26 20:09 ` [PATCH 1/2] mm/mempolicy: Simplify weighted interleave bulk alloc calculations Joshua Hahn ` (2 preceding siblings ...) 2025-06-27 7:38 ` Rakie Kim @ 2025-06-27 7:45 ` Oscar Salvador 3 siblings, 0 replies; 14+ messages in thread From: Oscar Salvador @ 2025-06-27 7:45 UTC (permalink / raw) To: Joshua Hahn Cc: Gregory Price, Andrew Morton, Alistair Popple, Byungchul Park, David Hildenbrand, Matthew Brost, Rakie Kim, Ying Huang, Zi Yan, linux-kernel, linux-mm, kernel-team On Thu, Jun 26, 2025 at 01:09:33PM -0700, Joshua Hahn wrote: > Simplify the math used to figure out how many pages should be allocated > per node. Instead of making conditional additions and deletions, we can just > make them unconditional by using min(). No functional changes intended. > > Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> -- Oscar Salvador SUSE Labs ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 2/2] mm/mempolicy: Skip extra call to __alloc_pages_bulk in weighted interleave 2025-06-26 20:09 [PATCH 0/2] mm/mempolicy: Cleanup and optimization for weighted interleave Joshua Hahn 2025-06-26 20:09 ` [PATCH 1/2] mm/mempolicy: Simplify weighted interleave bulk alloc calculations Joshua Hahn @ 2025-06-26 20:09 ` Joshua Hahn 2025-06-27 4:28 ` Gregory Price 2025-06-30 20:05 ` Kees Bakker 1 sibling, 2 replies; 14+ messages in thread From: Joshua Hahn @ 2025-06-26 20:09 UTC (permalink / raw) To: Gregory Price Cc: Andrew Morton, Alistair Popple, Byungchul Park, David Hildenbrand, Matthew Brost, Rakie Kim, Ying Huang, Zi Yan, linux-kernel, linux-mm, kernel-team Currently, alloc_pages_bulk_weighted_interleave can make up to nr_node_ids+1 calls to __alloc_pages_bulk. The additional allocation can happen if the previous call to this function finished the weighted round robin allocation partially on a node. To make up for this, the next time this function is called, an extra allocation is made to finish cleanly on the node boundaries before performing the weighted round-robin cycles again. Instead of making an additional call, we can calculate how many additional pages should be allocated from the first node (aka carryover) and add that value to the number of pages that should be allocated as part of the current round-robin cycle. Running a quick benchmark by compiling the kernel shows a small increase in performance. These experiments were run on a machine with 2 nodes, each with 125GB memory and 40 CPUs. time numactl -w 0,1 make -j$(nproc) +----------+---------+------------+---------+ | Time (s) | 6.16 | With patch | % Delta | +----------+---------+------------+---------+ | Real | 88.374 | 88.3356 | -0.2019 | | User | 3631.7 | 3636.263 | 0.0631 | | Sys | 366.029 | 363.792 | -0.7534 | +----------+---------+------------+---------+ Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com> --- mm/mempolicy.c | 39 ++++++++++++++++++++------------------- 1 file changed, 20 insertions(+), 19 deletions(-) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 78ad74a0e249..0d693f96cf66 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2569,7 +2569,7 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp, unsigned long node_pages, delta; u8 *weights, weight; unsigned int weight_total = 0; - unsigned long rem_pages = nr_pages; + unsigned long rem_pages = nr_pages, carryover = 0; nodemask_t nodes; int nnodes, node; int resume_node = MAX_NUMNODES - 1; @@ -2594,18 +2594,12 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp, node = me->il_prev; weight = me->il_weight; if (weight && node_isset(node, nodes)) { - node_pages = min(rem_pages, weight); - nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages, - page_array); - page_array += nr_allocated; - total_allocated += nr_allocated; - /* if that's all the pages, no need to interleave */ if (rem_pages <= weight) { - me->il_weight -= rem_pages; - return total_allocated; + node_pages = rem_pages; + me->il_weight -= node_pages; + goto allocate; } - /* Otherwise we adjust remaining pages, continue from there */ - rem_pages -= weight; + carryover = weight; } /* clear active weight in case of an allocation failure */ me->il_weight = 0; @@ -2614,7 +2608,7 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp, /* create a local copy of node weights to operate on outside rcu */ weights = kzalloc(nr_node_ids, GFP_KERNEL); if (!weights) - return total_allocated; + return 0; rcu_read_lock(); state = 
rcu_dereference(wi_state); @@ -2634,16 +2628,17 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp, /* * Calculate rounds/partial rounds to minimize __alloc_pages_bulk calls. * Track which node weighted interleave should resume from. + * Account for carryover. It is always allocated from the first node. * * if (rounds > 0) and (delta == 0), resume_node will always be * the node following prev_node and its weight. */ - rounds = rem_pages / weight_total; - delta = rem_pages % weight_total; + rounds = (rem_pages - carryover) / weight_total; + delta = (rem_pages - carryover) % weight_total; resume_node = next_node_in(prev_node, nodes); resume_weight = weights[resume_node]; + node = carryover ? prev_node : next_node_in(prev_node, nodes); for (i = 0; i < nnodes; i++) { - node = next_node_in(prev_node, nodes); weight = weights[node]; /* when delta is depleted, resume from that node */ if (delta && delta < weight) { @@ -2651,12 +2646,14 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp, resume_weight = weight - delta; } /* Add the node's portion of the delta, if there is one */ - node_pages = weight * rounds + min(delta, weight); + node_pages = weight * rounds + min(delta, weight) + carryover; delta -= min(delta, weight); + carryover = 0; /* node_pages can be 0 if an allocation fails and rounds == 0 */ if (!node_pages) break; +allocate: nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages, page_array); page_array += nr_allocated; @@ -2664,10 +2661,14 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp, if (total_allocated == nr_pages) break; prev_node = node; + node = next_node_in(prev_node, nodes); + } + + if (weights) { + me->il_prev = resume_node; + me->il_weight = resume_weight; + kfree(weights); } - me->il_prev = resume_node; - me->il_weight = resume_weight; - kfree(weights); return total_allocated; } -- 2.47.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] mm/mempolicy: Skip extra call to __alloc_pages_bulk in weighted interleave 2025-06-26 20:09 ` [PATCH 2/2] mm/mempolicy: Skip extra call to __alloc_pages_bulk in weighted interleave Joshua Hahn @ 2025-06-27 4:28 ` Gregory Price 2025-06-27 16:13 ` Joshua Hahn 2025-06-30 20:05 ` Kees Bakker 1 sibling, 1 reply; 14+ messages in thread From: Gregory Price @ 2025-06-27 4:28 UTC (permalink / raw) To: Joshua Hahn Cc: Andrew Morton, Alistair Popple, Byungchul Park, David Hildenbrand, Matthew Brost, Rakie Kim, Ying Huang, Zi Yan, linux-kernel, linux-mm, kernel-team On Thu, Jun 26, 2025 at 01:09:34PM -0700, Joshua Hahn wrote: > Currently, alloc_pages_bulk_weighted_interleave can make up to nr_node_ids+1 > calls to __alloc_pages_bulk. The additional allocation can happen if the > previous call to this function finished the weighted round robin allocation > partially on a node. To make up for this, the next time this function is > called, an extra allocation is made to finish cleanly on the node boundaries > before performing the weighted round-robin cycles again. > task->il_weight can be operated on by other weighted_interleave functions in mempolicy, so it's not just alloc_pages_bulk_weighted_interleave that affects this. Observations here are still true, just a correction for clarity. i.e.: The additional allocation happens for the current il_node/il_weight. I *think* I did it this way just because it was easier to reason about the two chunks separately. So I don't see a reason not to improve this. I will say that, at least at the time, I took the core math and validated the edge conditions in a separate program. If you get it wrong in the kernel, you'd either fail to allocate - or more likely just get the wrong distribution of pages. The latter is non-obvious unless you go looking for it, so it might be good to at least add this test result in the change log. It's hard to write this in LTP or kernel selftest unfortunately. > Instead of making an additional call, we can calculate how many additional > pages should be allocated from the first node (aka carryover) and add that > value to the number of pages that should be allocated as part of the current > round-robin cycle. > > Running a quick benchmark by compiling the kernel shows a small increase > in performance. These experiments were run on a machine with 2 nodes, each > with 125GB memory and 40 CPUs. 
> > time numactl -w 0,1 make -j$(nproc) > > +----------+---------+------------+---------+ > | Time (s) | 6.16 | With patch | % Delta | > +----------+---------+------------+---------+ > | Real | 88.374 | 88.3356 | -0.2019 | > | User | 3631.7 | 3636.263 | 0.0631 | > | Sys | 366.029 | 363.792 | -0.7534 | > +----------+---------+------------+---------+ > > Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com> > > --- > mm/mempolicy.c | 39 ++++++++++++++++++++------------------- > 1 file changed, 20 insertions(+), 19 deletions(-) > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 78ad74a0e249..0d693f96cf66 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -2569,7 +2569,7 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp, > unsigned long node_pages, delta; > u8 *weights, weight; > unsigned int weight_total = 0; > - unsigned long rem_pages = nr_pages; > + unsigned long rem_pages = nr_pages, carryover = 0; > nodemask_t nodes; > int nnodes, node; > int resume_node = MAX_NUMNODES - 1; > @@ -2594,18 +2594,12 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp, > node = me->il_prev; > weight = me->il_weight; > if (weight && node_isset(node, nodes)) { > - node_pages = min(rem_pages, weight); > - nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages, > - page_array); > - page_array += nr_allocated; > - total_allocated += nr_allocated; > - /* if that's all the pages, no need to interleave */ > if (rem_pages <= weight) { > - me->il_weight -= rem_pages; > - return total_allocated; > + node_pages = rem_pages; > + me->il_weight -= node_pages; > + goto allocate; > } > - /* Otherwise we adjust remaining pages, continue from there */ > - rem_pages -= weight; > + carryover = weight; > } > /* clear active weight in case of an allocation failure */ > me->il_weight = 0; > @@ -2614,7 +2608,7 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp, > /* create a local copy of node weights to operate on outside rcu */ > weights = kzalloc(nr_node_ids, GFP_KERNEL); > if (!weights) > - return total_allocated; > + return 0; may be worth noting that this change means small bulk allocations that could be covered by the current il_node/weight may now fail if kzalloc fails. But then there are probably other problems. However, this is a functional difference between the old and new state of the function. > > rcu_read_lock(); > state = rcu_dereference(wi_state); > @@ -2634,16 +2628,17 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp, > /* > * Calculate rounds/partial rounds to minimize __alloc_pages_bulk calls. > * Track which node weighted interleave should resume from. > + * Account for carryover. It is always allocated from the first node. > * > * if (rounds > 0) and (delta == 0), resume_node will always be > * the node following prev_node and its weight. > */ > - rounds = rem_pages / weight_total; > - delta = rem_pages % weight_total; > + rounds = (rem_pages - carryover) / weight_total; > + delta = (rem_pages - carryover) % weight_total; > resume_node = next_node_in(prev_node, nodes); > resume_weight = weights[resume_node]; > + node = carryover ? 
prev_node : next_node_in(prev_node, nodes); > for (i = 0; i < nnodes; i++) { > - node = next_node_in(prev_node, nodes); > weight = weights[node]; > /* when delta is depleted, resume from that node */ > if (delta && delta < weight) { > @@ -2651,12 +2646,14 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp, > resume_weight = weight - delta; > } > /* Add the node's portion of the delta, if there is one */ > - node_pages = weight * rounds + min(delta, weight); > + node_pages = weight * rounds + min(delta, weight) + carryover; > delta -= min(delta, weight); > + carryover = 0; > > /* node_pages can be 0 if an allocation fails and rounds == 0 */ > if (!node_pages) > break; > +allocate: > nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages, > page_array); > page_array += nr_allocated; > @@ -2664,10 +2661,14 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp, > if (total_allocated == nr_pages) > break; > prev_node = node; > + node = next_node_in(prev_node, nodes); > + } > + > + if (weights) { > + me->il_prev = resume_node; > + me->il_weight = resume_weight; > + kfree(weights); > } > - me->il_prev = resume_node; > - me->il_weight = resume_weight; > - kfree(weights); > return total_allocated; > } > > -- > 2.47.1 ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] mm/mempolicy: Skip extra call to __alloc_pages_bulk in weighted interleave 2025-06-27 4:28 ` Gregory Price @ 2025-06-27 16:13 ` Joshua Hahn 2025-06-30 15:39 ` Joshua Hahn 0 siblings, 1 reply; 14+ messages in thread From: Joshua Hahn @ 2025-06-27 16:13 UTC (permalink / raw) To: Gregory Price Cc: Andrew Morton, Alistair Popple, Byungchul Park, David Hildenbrand, Matthew Brost, Rakie Kim, Ying Huang, Zi Yan, linux-kernel, linux-mm, kernel-team On Fri, 27 Jun 2025 00:28:33 -0400 Gregory Price <gourry@gourry.net> wrote: Hi Gregory, Hope you are doing well : -) Thanks for the review! > On Thu, Jun 26, 2025 at 01:09:34PM -0700, Joshua Hahn wrote: > > Currently, alloc_pages_bulk_weighted_interleave can make up to nr_node_ids+1 > > calls to __alloc_pages_bulk. The additional allocation can happen if the > > previous call to this function finished the weighted round robin allocation > > partially on a node. To make up for this, the next time this function is > > called, an extra allocation is made to finish cleanly on the node boundaries > > before performing the weighted round-robin cycles again. > > > > task->il_weight can be operated on by other weighted_interleave > functions in mempolicy, so it's not just > alloc_pages_bulk_weighted_interleave that affects this. This is true, and I think that's what makes testing this part of the code tricky. I'll elaborate more below. > Observations here are still true, just a correction for clarity. > > i.e.: > > The additional allocation happens for the current il_node/il_weight. > > I *think* I did it this way just because it was easier to reason about > the two chunks separately. So I don't see a reason not to improve this. > I will say that, at least at the time, I took the core math and > validated the edge conditions in a separate program. > > If you get it wrong in the kernel, you'd either fail to allocate - or more > likely just get the wrong distribution of pages. The latter is > non-obvious unless you go looking for it, so it might be good to at > least add this test result in the change log. It's hard to write this > in LTP or kernel selftest unfortunately. I think so too : -( One test that I wanted to do while developing this feature was to see if I could figure out how many pages are really allocated from each node. The difficulty in doing this (as you pointed out above) is that because there are other ways that move the round robin forward (without necessarily calling the bulk alloc function), it's hard to directly attribute the page allocations. If this was the only place that we were modifying these values, then a correctness check would be equivalent to just adding a printk of each node and how many pages were allocated on it, then adding all the numbers up to see if it matches the weight ratios in sysfs. So I think I will do what you did as well -- I think that performing some tests, at least on the edge cases in a separate program will help give some confidence that the code works as intended. I'll also continue to think about if there are better ways to be testing this instead. Thanks again for reviewing this, have a great day! Joshua Sent using hkml (https://github.com/sjp38/hackermail) ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] mm/mempolicy: Skip extra call to __alloc_pages_bulk in weighted interleave 2025-06-27 16:13 ` Joshua Hahn @ 2025-06-30 15:39 ` Joshua Hahn 0 siblings, 0 replies; 14+ messages in thread From: Joshua Hahn @ 2025-06-30 15:39 UTC (permalink / raw) To: Joshua Hahn Cc: Gregory Price, Andrew Morton, Alistair Popple, Byungchul Park, David Hildenbrand, Matthew Brost, Rakie Kim, Ying Huang, Zi Yan, linux-kernel, linux-mm, kernel-team On Fri, 27 Jun 2025 09:13:18 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote: > On Fri, 27 Jun 2025 00:28:33 -0400 Gregory Price <gourry@gourry.net> wrote: > > Hi Gregory, [...snip...] > > I will say that, at least at the time, I took the core math and > > validated the edge conditions in a separate program. > > > > If you get it wrong in the kernel, you'd either fail to allocate - or more > > likely just get the wrong distribution of pages. The latter is > > non-obvious unless you go looking for it, so it might be good to at > > least add this test result in the change log. It's hard to write this > > in LTP or kernel selftest unfortunately. > > I think so too : -( > > One test that I wanted to do while developing this feature was to see if > I could figure out how many pages are really allocated from each node. The > difficulty in doing this (as you pointed out above) is that because there are > other ways that move the round robin forward (without necessarily calling the > bulk alloc function), it's hard to directly attribute the page allocations. > > If this was the only place that we were modifying these values, then a > correctness check would be equivalent to just adding a printk of each node > and how many pages were allocated on it, then adding all the numbers up to > see if it matches the weight ratios in sysfs. > > So I think I will do what you did as well -- I think that performing some > tests, at least on the edge cases in a separate program will help give > some confidence that the code works as intended. I'll also continue to think > about if there are better ways to be testing this instead. Like you suggested, I decided to run a simulation just to see if the number of pages allocated from each node lined up with the old version, and if the numbers made sense for both cases. I found a few issues with my version of the code: The math is just incorrect when rounds == 0. Let's say there's a 2-node machine with weights [10, 7]. We should start allocating from node0 with 7 remaining pages, and we want to allocate 14 pages. Here's how the new math goes: - First node should allocate 7 pages, let carryover = 7 - Then remaining pages = 14 - 7 = 7 - Allocate rounds * weight + min(weight, delta) + carryover: = 0 * 10 + min(10, 7) + 7 = 7 + 7 = 14 This is incorrect, since we will be allocating all 14 pages from the first node, and the real distribution should be 7 pages from the first node, and 7 pages from the second node. I think this can also lead to some overallocation issues. So there are a few options now: - Change the addition to be: rounds * weight + min(min(weight, delta + carryover), weight) and adjust the remaining math accordingly. But this looks very bad and is not intuitive at all, so I don't like this idea. - This can easily be solved, if instead of figuring out how many pages have to be allocated as we iterate through the nodes, we do one pass and figure out how many pages must be allocated. The problem here is that we re-introduce another loop, which adds to the code complexity and may actually be a performance decrease instead.
So again, I don't think this is the best idea. - Skip this patch! This is a small optimization on performance, and I think it simplifies the code, but it turns out the original math is just much harder to do without two separate calculations. I'll keep this running in the back of my mind to see if I can figure out a way to solve it later... I also think it makes sense to drop the first patch as well, since there's no optimization included with the cleanup. But since it already got a few reviews, maybe it makes sense to keep that one : -) Anyways, I really appreciate the review Gregory! Maybe I'll have a breakthrough idea on how to correctly do the math here sometime in the future. Have a great day! Joshua Sent using hkml (https://github.com/sjp38/hackermail) ^ permalink raw reply [flat|nested] 14+ messages in thread
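Here is a standalone C sketch of the [10, 7] scenario described above (illustrative only: the loops are unrolled by hand and the kernel bookkeeping is omitted). It reproduces both distributions: the existing code splits the 14 pages 7/7 across the two nodes, while the carryover math places all 14 on node0.

#include <stdio.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

int main(void)
{
	unsigned long w[2] = { 10, 7 };	/* node weights */
	unsigned long il_weight = 7;	/* weight left over on node0 */
	unsigned long nr_pages = 14;
	unsigned long weight_total = w[0] + w[1];
	unsigned long old_pages[2], new_pages[2];
	unsigned long rem, rounds, delta, carryover;

	/* Current code: finish node0's remaining weight, then interleave. */
	old_pages[0] = MIN(nr_pages, il_weight);		/* 7 */
	rem = nr_pages - old_pages[0];				/* 7 */
	rounds = rem / weight_total;				/* 0 */
	delta = rem % weight_total;				/* 7 */
	old_pages[1] = w[1] * rounds + MIN(delta, w[1]);	/* 7, loop starts at node1 */

	/* Patch 2: fold node0's remainder in as carryover instead. */
	carryover = il_weight;					/* 7 */
	rounds = (nr_pages - carryover) / weight_total;		/* 0 */
	delta = (nr_pages - carryover) % weight_total;		/* 7 */
	/* the loop starts at node0 because carryover != 0 */
	new_pages[0] = w[0] * rounds + MIN(delta, w[0]) + carryover;	/* 14 */
	delta -= MIN(delta, w[0]);				/* 0 */
	new_pages[1] = w[1] * rounds + MIN(delta, w[1]);	/* 0 */

	printf("old: node0=%lu node1=%lu\n", old_pages[0], old_pages[1]);
	printf("new: node0=%lu node1=%lu\n", new_pages[0], new_pages[1]);
	return 0;
}

When rounds == 0, the carryover and the delta both land on node0, which is exactly the misdistribution described above.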
* Re: [PATCH 2/2] mm/mempolicy: Skip extra call to __alloc_pages_bulk in weighted interleave 2025-06-26 20:09 ` [PATCH 2/2] mm/mempolicy: Skip extra call to __alloc_pages_bulk in weighted interleave Joshua Hahn 2025-06-27 4:28 ` Gregory Price @ 2025-06-30 20:05 ` Kees Bakker 2025-06-30 20:21 ` Joshua Hahn 1 sibling, 1 reply; 14+ messages in thread From: Kees Bakker @ 2025-06-30 20:05 UTC (permalink / raw) To: Joshua Hahn, Gregory Price Cc: Andrew Morton, Alistair Popple, Byungchul Park, David Hildenbrand, Matthew Brost, Rakie Kim, Ying Huang, Zi Yan, linux-kernel, linux-mm, kernel-team Op 26-06-2025 om 22:09 schreef Joshua Hahn: > Currently, alloc_pages_bulk_weighted_interleave can make up to nr_node_ids+1 > calls to __alloc_pages_bulk. The additional allocation can happen if the > previous call to this function finished the weighted round robin allocation > partially on a node. To make up for this, the next time this function is > called, an extra allocation is made to finish cleanly on the node boundaries > before performing the weighted round-robin cycles again. > > Instead of making an additional call, we can calculate how many additional > pages should be allocated from the first node (aka carryover) and add that > value to the number of pages that should be allocated as part of the current > round-robin cycle. > > Running a quick benchmark by compiling the kernel shows a small increase > in performance. These experiments were run on a machine with 2 nodes, each > with 125GB memory and 40 CPUs. > > time numactl -w 0,1 make -j$(nproc) > > +----------+---------+------------+---------+ > | Time (s) | 6.16 | With patch | % Delta | > +----------+---------+------------+---------+ > | Real | 88.374 | 88.3356 | -0.2019 | > | User | 3631.7 | 3636.263 | 0.0631 | > | Sys | 366.029 | 363.792 | -0.7534 | > +----------+---------+------------+---------+ > > Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com> > > --- > mm/mempolicy.c | 39 ++++++++++++++++++++------------------- > 1 file changed, 20 insertions(+), 19 deletions(-) > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 78ad74a0e249..0d693f96cf66 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -2569,7 +2569,7 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp, > unsigned long node_pages, delta; > u8 *weights, weight; > unsigned int weight_total = 0; > - unsigned long rem_pages = nr_pages; > + unsigned long rem_pages = nr_pages, carryover = 0; > nodemask_t nodes; > int nnodes, node; > int resume_node = MAX_NUMNODES - 1; > @@ -2594,18 +2594,12 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp, > node = me->il_prev; > weight = me->il_weight; > if (weight && node_isset(node, nodes)) { > - node_pages = min(rem_pages, weight); > - nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages, > - page_array); > - page_array += nr_allocated; > - total_allocated += nr_allocated; > - /* if that's all the pages, no need to interleave */ > if (rem_pages <= weight) { > - me->il_weight -= rem_pages; > - return total_allocated; > + node_pages = rem_pages; > + me->il_weight -= node_pages; > + goto allocate; This is a goto into the middle of a for-loop. What do you think is going to happen at the end of that loop? I think (only tested with a small C program) it will go to the start of the loop, do the i++, check i<nnodes, and possibly do the loop again. Variable i is uninitialized at that point. In the loop it hits several uninitialized variables. Even if this is legal C code, it is pretty obscure. 
> } > - /* Otherwise we adjust remaining pages, continue from there */ > - rem_pages -= weight; > + carryover = weight; > } > /* clear active weight in case of an allocation failure */ > me->il_weight = 0; > @@ -2614,7 +2608,7 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp, > /* create a local copy of node weights to operate on outside rcu */ > weights = kzalloc(nr_node_ids, GFP_KERNEL); > if (!weights) > - return total_allocated; > + return 0; > > rcu_read_lock(); > state = rcu_dereference(wi_state); > @@ -2634,16 +2628,17 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp, > /* > * Calculate rounds/partial rounds to minimize __alloc_pages_bulk calls. > * Track which node weighted interleave should resume from. > + * Account for carryover. It is always allocated from the first node. > * > * if (rounds > 0) and (delta == 0), resume_node will always be > * the node following prev_node and its weight. > */ > - rounds = rem_pages / weight_total; > - delta = rem_pages % weight_total; > + rounds = (rem_pages - carryover) / weight_total; > + delta = (rem_pages - carryover) % weight_total; > resume_node = next_node_in(prev_node, nodes); > resume_weight = weights[resume_node]; > + node = carryover ? prev_node : next_node_in(prev_node, nodes); > for (i = 0; i < nnodes; i++) { > - node = next_node_in(prev_node, nodes); > weight = weights[node]; > /* when delta is depleted, resume from that node */ > if (delta && delta < weight) { > @@ -2651,12 +2646,14 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp, > resume_weight = weight - delta; > } > /* Add the node's portion of the delta, if there is one */ > - node_pages = weight * rounds + min(delta, weight); > + node_pages = weight * rounds + min(delta, weight) + carryover; > delta -= min(delta, weight); > + carryover = 0; > > /* node_pages can be 0 if an allocation fails and rounds == 0 */ > if (!node_pages) > break; > +allocate: > nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages, > page_array); > page_array += nr_allocated; > @@ -2664,10 +2661,14 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp, > if (total_allocated == nr_pages) > break; > prev_node = node; > + node = next_node_in(prev_node, nodes); > + } > + > + if (weights) { > + me->il_prev = resume_node; > + me->il_weight = resume_weight; > + kfree(weights); > } > - me->il_prev = resume_node; > - me->il_weight = resume_weight; > - kfree(weights); > return total_allocated; > } > ^ permalink raw reply [flat|nested] 14+ messages in thread
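For reference, a minimal standalone C program in the spirit of the small test program mentioned above (illustrative only; i is given a value here so the behavior is well defined, whereas in the patch i is uninitialized at the jump). It shows that a goto into the loop body skips the init clause, and that when the body finishes, the i++ and the i < limit test still run with whatever value i currently holds.

#include <stdio.h>

int main(void)
{
	int i = 1;	/* stands in for the indeterminate value i would have */

	goto inside;

	for (i = 0; i < 3; i++) {
		printf("top of loop body, i = %d\n", i);
inside:
		printf("after the label, i = %d\n", i);
	}
	/*
	 * Prints "after the label, i = 1", then i++ and the i < 3 test
	 * run, so i = 2 executes both printfs before the loop exits.
	 * The init clause (i = 0) never runs.
	 */
	return 0;
}

Whether the kernel loop is re-entered after the goto'd iteration therefore depends on hitting a break first, which is what the reply below argues happens via the total_allocated == nr_pages check.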
* Re: [PATCH 2/2] mm/mempolicy: Skip extra call to __alloc_pages_bulk in weighted interleave 2025-06-30 20:05 ` Kees Bakker @ 2025-06-30 20:21 ` Joshua Hahn 2025-06-30 22:35 ` Andrew Morton 0 siblings, 1 reply; 14+ messages in thread From: Joshua Hahn @ 2025-06-30 20:21 UTC (permalink / raw) To: Kees Bakker Cc: Gregory Price, Andrew Morton, Alistair Popple, Byungchul Park, David Hildenbrand, Matthew Brost, Rakie Kim, Ying Huang, Zi Yan, linux-kernel, linux-mm, kernel-team On Mon, 30 Jun 2025 22:05:48 +0200 Kees Bakker <kees@ijzerbout.nl> wrote: > > mm/mempolicy.c | 39 ++++++++++++++++++++------------------- > > 1 file changed, 20 insertions(+), 19 deletions(-) > > > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > > index 78ad74a0e249..0d693f96cf66 100644 > > --- a/mm/mempolicy.c > > +++ b/mm/mempolicy.c > > @@ -2569,7 +2569,7 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp, > > unsigned long node_pages, delta; > > u8 *weights, weight; > > unsigned int weight_total = 0; > > - unsigned long rem_pages = nr_pages; > > + unsigned long rem_pages = nr_pages, carryover = 0; > > nodemask_t nodes; > > int nnodes, node; > > int resume_node = MAX_NUMNODES - 1; > > @@ -2594,18 +2594,12 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp, > > node = me->il_prev; > > weight = me->il_weight; > > if (weight && node_isset(node, nodes)) { > > - node_pages = min(rem_pages, weight); > > - nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages, > > - page_array); > > - page_array += nr_allocated; > > - total_allocated += nr_allocated; > > - /* if that's all the pages, no need to interleave */ > > if (rem_pages <= weight) { > > - me->il_weight -= rem_pages; > > - return total_allocated; > > + node_pages = rem_pages; > > + me->il_weight -= node_pages; > > + goto allocate; Hello Kees, Thank you for reviewing my code! > This is a goto into the middle of a for-loop. > What do you think is going to happen at the end of that loop? > > I think (only tested with a small C program) it will go to the start of > the loop, do the i++, check i<nnodes, and possibly do the loop again. > Variable i is uninitialized at that point. In the loop it hits several > uninitialized variables. From what I can see from my code, I think the only way the goto statement leads to a second iteration of the for loop is if allocation fails. But otherwise, it should be ok since we always hit if (total_allocated == nr_pages) break; within the loop. For the branch that takes the goto, we set node_pages = rem_pages, then jump to the label and allocate. So nr_allocated = node_pages, and total_allocated = 0 + nr_allocated so total_allocated = node_pages total_allocated == node_pages == rem_pages == nr_pages, so we will break. Phew! To cover the case where allocation fails, I think we should be breaking anyways, so I can definitely add a new check for this. > Even if this is legal C code, it is pretty obscure. I agree that it is not very clean. I did this to reduce the amount of repeated code there is. Even if this code works, it could definitely be written better to make it more readable and maintainable. As I noted in my second response to Gregory, I'm not planning on pursuing this version anymore, so if I decide to send a second version, I'll keep this in mind. Thank you again for taking the time to review this, and also testing it on your end! I hope you have a great day : -) Joshua Sent using hkml (https://github.com/sjp38/hackermail) ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] mm/mempolicy: Skip extra call to __alloc_pages_bulk in weighted interleave 2025-06-30 20:21 ` Joshua Hahn @ 2025-06-30 22:35 ` Andrew Morton 2025-06-30 23:01 ` Joshua Hahn 0 siblings, 1 reply; 14+ messages in thread From: Andrew Morton @ 2025-06-30 22:35 UTC (permalink / raw) To: Joshua Hahn Cc: Kees Bakker, Gregory Price, Alistair Popple, Byungchul Park, David Hildenbrand, Matthew Brost, Rakie Kim, Ying Huang, Zi Yan, linux-kernel, linux-mm, kernel-team On Mon, 30 Jun 2025 13:21:14 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote: > > This is a goto into the middle of a for-loop. > > What do you think is going to happen at the end of that loop? > > > > I think (only tested with a small C program) it will go to the start of > > the loop, do the i++, check i<nnodes, and possibly do the loop again. > > Variable i is uninitialized at that point. In the loop it hits several > > uninitialized variables. > > >From what I can see from my code, I think the only the goto statement leads > to a second iteration of the for loop is if allocation fails. > But otherwise, it should be ok since we always hit > > if (total_allocated == nr_pages) > break; > > within the loop. For the branch that takes the goto, we set > node_pages = rem_pages, then jump to the label and allocate. > So nr_allocated = node_pages, and total_allocated = 0 + nr_allocated > so total_allocated = node_pages > > total_allocated == node_pages == rem_pages == nr_pages, so we will break. Phew! > > To cover the case where allocation fails, I think we should be breaking > anyways, so I can definitely add a new check for this. I do agree, that goto is a "goto too far". That we can do a thing doesn't mean we should do it! > > Even if this is legal C code, it is pretty obscure. > > I agree that it not very clean. I did this to reduce the amount of repeated > code there is. Even if this code works, it could definitely be written > better to make it more readable and maintainable. As I noted in my second > response to Gregory, I'm not planning on pursuing this version anymore, > so if I decide to send a second version, I'll keep this in mind. Cool, I'll drop this version from mm-unstable. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] mm/mempolicy: Skip extra call to __alloc_pages_bulk in weighted interleave 2025-06-30 22:35 ` Andrew Morton @ 2025-06-30 23:01 ` Joshua Hahn 0 siblings, 0 replies; 14+ messages in thread From: Joshua Hahn @ 2025-06-30 23:01 UTC (permalink / raw) To: Andrew Morton Cc: Kees Bakker, Gregory Price, Alistair Popple, Byungchul Park, David Hildenbrand, Matthew Brost, Rakie Kim, Ying Huang, Zi Yan, linux-kernel, linux-mm, kernel-team On Mon, 30 Jun 2025 15:35:01 -0700 Andrew Morton <akpm@linux-foundation.org> wrote: > On Mon, 30 Jun 2025 13:21:14 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote: > > > > This is a goto into the middle of a for-loop. > > > What do you think is going to happen at the end of that loop? > > > > > > I think (only tested with a small C program) it will go to the start of > > > the loop, do the i++, check i<nnodes, and possibly do the loop again. > > > Variable i is uninitialized at that point. In the loop it hits several > > > uninitialized variables. > > > > >From what I can see from my code, I think the only the goto statement leads > > to a second iteration of the for loop is if allocation fails. > > But otherwise, it should be ok since we always hit > > > > if (total_allocated == nr_pages) > > break; > > > > within the loop. For the branch that takes the goto, we set > > node_pages = rem_pages, then jump to the label and allocate. > > So nr_allocated = node_pages, and total_allocated = 0 + nr_allocated > > so total_allocated = node_pages > > > > total_allocated == node_pages == rem_pages == nr_pages, so we will break. Phew! > > > > To cover the case where allocation fails, I think we should be breaking > > anyways, so I can definitely add a new check for this. > > I do agree, that goto is a "goto too far". That we can do a thing > doesn't mean we should do it! Haha : -) > > > Even if this is legal C code, it is pretty obscure. > > > > I agree that it not very clean. I did this to reduce the amount of repeated > > code there is. Even if this code works, it could definitely be written > > better to make it more readable and maintainable. As I noted in my second > > response to Gregory, I'm not planning on pursuing this version anymore, > > so if I decide to send a second version, I'll keep this in mind. > > Cool, I'll drop this version from mm-unstable. Sounds good Andrew, thank you always for all of your help! Joshua Sent using hkml (https://github.com/sjp38/hackermail) ^ permalink raw reply [flat|nested] 14+ messages in thread