* [PATCH] lib/group_cpus: fix cross-NUMA CPU assignment in group_cpus_evenly
From: Ming Lei @ 2025-10-20 12:46 UTC (permalink / raw)
To: Thomas Gleixner
Cc: linux-kernel, linux-block, Andrew Morton, Jens Axboe, Ming Lei
When numgrps > nodes, group_cpus_evenly() can incorrectly assign CPUs
from different NUMA nodes to the same group because of the wrapping
logic, which causes poor block IO performance due to remote IO
completion. The cross-node assignment can be avoided completely in this
case, since each NUMA node may include more CPUs than a single group
does.
The issue occurs when curgrp reaches last_grp and wraps to 0. This causes
CPUs from later-processed nodes to be added to groups that already contain
CPUs from earlier-processed nodes, violating NUMA locality.
Example with 8 NUMA nodes, 16 groups:
- Each node gets 2 groups allocated
- After processing all nodes, curgrp reaches 16
- Wrapping to 0 causes CPUs from node N to be added to group 0, which
  already has CPUs from node 0
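For illustration, here is a minimal user-space sketch of the curgrp
walk (hypothetical, not the kernel code; the startgrp value of 4 is an
assumed leftover from an earlier spreading pass that already filled
groups 0-3):

/*
 * Hypothetical stand-alone sketch of the buggy wrap: 8 nodes, 16
 * groups, 2 groups per node; startgrp = 4 is an assumed leftover from
 * an earlier spreading pass that put node 0/1 CPUs into groups 0-3.
 */
#include <stdio.h>

int main(void)
{
	unsigned int nodes = 8, grps_per_node = 2;
	unsigned int last_grp = 16;	/* numgrps */
	unsigned int curgrp = 4;	/* assumed startgrp */
	unsigned int node, v;

	for (node = 0; node < nodes; node++) {
		for (v = 0; v < grps_per_node; v++, curgrp++) {
			if (curgrp >= last_grp)
				curgrp = 0;	/* buggy wrap back to group 0 */
			printf("node %u -> group %u\n", node, curgrp);
		}
	}
	return 0;
}

Nodes 6 and 7 end up on groups 0-3, mixing their CPUs into groups that
already hold CPUs from the earlier pass; the fix below redirects the
wrap to a group that already contains CPUs from the same node.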
Fix this by adding a find_next_node_group() helper that searches for the
next group (starting from 0) that already contains CPUs from the same
NUMA node. When wrapping is needed, use this helper instead of blindly
wrapping to 0, ensuring CPUs are only added to groups within the same
NUMA node.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
lib/group_cpus.c | 28 +++++++++++++++++++++++++---
1 file changed, 25 insertions(+), 3 deletions(-)
diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index 6d08ac05f371..54d70271e2dd 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -246,6 +246,24 @@ static void alloc_nodes_groups(unsigned int numgrps,
 	}
 }
 
+/*
+ * Find the next group, in round-robin fashion, that contains CPUs from
+ * the specified NUMA node. Used for wrapping to avoid cross-NUMA assignment.
+ */
+static unsigned int find_next_node_group(struct cpumask *masks,
+					 unsigned int numgrps,
+					 const struct cpumask *node_cpus)
+{
+	unsigned int i;
+
+	for (i = 0; i < numgrps; i++) {
+		if (cpumask_intersects(&masks[i], node_cpus))
+			return i;
+	}
+
+	return 0;
+}
+
 static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
			       cpumask_var_t *node_to_cpumask,
			       const struct cpumask *cpu_mask,
@@ -315,11 +333,15 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
 			}
 
 			/*
-			 * wrapping has to be considered given 'startgrp'
-			 * may start anywhere
+			 * Wrapping has to be considered given 'startgrp'
+			 * may start anywhere. When wrapping, find the next
+			 * group (in round-robin fashion) that already contains
+			 * CPUs from the same NUMA node to avoid mixing CPUs
+			 * from different NUMA nodes in the same group.
 			 */
 			if (curgrp >= last_grp)
-				curgrp = 0;
+				curgrp = find_next_node_group(masks, numgrps,
+							      node_to_cpumask[nv->id]);
 			grp_spread_init_one(&masks[curgrp], nmsk,
					    cpus_per_grp);
 		}
--
2.51.0
* Re: [PATCH] lib/group_cpus: fix cross-NUMA CPU assignment in group_cpus_evenly
From: Ming Lei @ 2025-10-27 1:07 UTC (permalink / raw)
To: Thomas Gleixner; +Cc: linux-kernel, linux-block, Andrew Morton, Jens Axboe
On Mon, Oct 20, 2025 at 08:46:46PM +0800, Ming Lei wrote:
> When numgrps > nodes, group_cpus_evenly() can incorrectly assign CPUs
> from different NUMA nodes to the same group because of the wrapping
> logic, which causes poor block IO performance due to remote IO
> completion. The cross-node assignment can be avoided completely in this
> case, since each NUMA node may include more CPUs than a single group
> does.
>
> The issue occurs when curgrp reaches last_grp and wraps to 0. This causes
> CPUs from later-processed nodes to be added to groups that already contain
> CPUs from earlier-processed nodes, violating NUMA locality.
>
> Example with 8 NUMA nodes, 16 groups:
> - Each node gets 2 groups allocated
> - After processing all nodes, curgrp reaches 16
> - Wrapping to 0 causes CPUs from node N to be added to group 0, which
>   already has CPUs from node 0
>
> Fix this by adding a find_next_node_group() helper that searches for the
> next group (starting from 0) that already contains CPUs from the same
> NUMA node. When wrapping is needed, use this helper instead of blindly
> wrapping to 0, ensuring CPUs are only added to groups within the same
> NUMA node.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
Hello,
ping...
Thanks,
Ming
* Re: [PATCH] lib/group_cpus: fix cross-NUMA CPU assignment in group_cpus_evenly
From: Ming Lei @ 2025-11-05 3:35 UTC (permalink / raw)
To: Thomas Gleixner, Andrew Morton, Jens Axboe; +Cc: linux-kernel, linux-block
On Mon, Oct 27, 2025 at 9:07 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Mon, Oct 20, 2025 at 08:46:46PM +0800, Ming Lei wrote:
> > When numgrps > nodes, group_cpus_evenly() can incorrectly assign CPUs
> > from different NUMA nodes to the same group because of the wrapping
> > logic, which causes poor block IO performance due to remote IO
> > completion. The cross-node assignment can be avoided completely in this
> > case, since each NUMA node may include more CPUs than a single group
> > does.
> >
> > The issue occurs when curgrp reaches last_grp and wraps to 0. This causes
> > CPUs from later-processed nodes to be added to groups that already contain
> > CPUs from earlier-processed nodes, violating NUMA locality.
> >
> > Example with 8 NUMA nodes, 16 groups:
> > - Each node gets 2 groups allocated
> > - After processing all nodes, curgrp reaches 16
> > - Wrapping to 0 causes CPUs from node N to be added to group 0, which
> >   already has CPUs from node 0
> >
> > Fix this by adding a find_next_node_group() helper that searches for the
> > next group (starting from 0) that already contains CPUs from the same
> > NUMA node. When wrapping is needed, use this helper instead of blindly
> > wrapping to 0, ensuring CPUs are only added to groups within the same
> > NUMA node.
> >
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
>
> Hello,
>
> ping...
Hello,
Ping...
Thanks,
Ming
* Re: [PATCH] lib/group_cpus: fix cross-NUMA CPU assignment in group_cpus_evenly
From: Andrew Morton @ 2025-12-21 19:23 UTC (permalink / raw)
To: Ming Lei
Cc: Thomas Gleixner, linux-kernel, linux-block, Jens Axboe,
Wangyang Guo
On Mon, 20 Oct 2025 20:46:46 +0800 Ming Lei <ming.lei@redhat.com> wrote:
> When numgrps > nodes, group_cpus_evenly() can incorrectly assign CPUs
> from different NUMA nodes to the same group because of the wrapping
> logic, which causes poor block IO performance due to remote IO
> completion. The cross-node assignment can be avoided completely in this
> case, since each NUMA node may include more CPUs than a single group
> does.
Please quantify "poor block IO performance", to help people understand
the userspace-visible effect of this change.
> The issue occurs when curgrp reaches last_grp and wraps to 0. This causes
> CPUs from later-processed nodes to be added to groups that already contain
> CPUs from earlier-processed nodes, violating NUMA locality.
>
> Example with 8 NUMA nodes, 16 groups:
> - Each node gets 2 groups allocated
> - After processing all nodes, curgrp reaches 16
> - Wrapping to 0 causes CPUs from node N to be added to group 0, which
>   already has CPUs from node 0
>
> Fix this by adding a find_next_node_group() helper that searches for the
> next group (starting from 0) that already contains CPUs from the same
> NUMA node. When wrapping is needed, use this helper instead of blindly
> wrapping to 0, ensuring CPUs are only added to groups within the same
> NUMA node.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> lib/group_cpus.c | 28 +++++++++++++++++++++++++---
The patch overlaps (a lot) with Wangyang Guo's "lib/group_cpus: make
group CPU cluster aware". I did a lot of surgery but got stuck on the
absence of node_to_cpumask, so I guess the patch has bitrotted.
Please update the changelog as above and redo this patch against
Wangyang's patch (which will be in linux-next very soon).
Also, it would be great if you and Wangyang were to review and test
each other's changes, thanks.
* Re: [PATCH] lib/group_cpus: fix cross-NUMA CPU assignment in group_cpus_evenly
From: Ming Lei @ 2025-12-22 13:50 UTC (permalink / raw)
To: Andrew Morton
Cc: Thomas Gleixner, linux-kernel, linux-block, Jens Axboe,
Wangyang Guo
On Sun, Dec 21, 2025 at 11:23:54AM -0800, Andrew Morton wrote:
> On Mon, 20 Oct 2025 20:46:46 +0800 Ming Lei <ming.lei@redhat.com> wrote:
>
> > When numgrps > nodes, group_cpus_evenly() can incorrectly assign CPUs
> > from different NUMA nodes to the same group because of the wrapping
> > logic, which causes poor block IO performance due to remote IO
> > completion. The cross-node assignment can be avoided completely in this
> > case, since each NUMA node may include more CPUs than a single group
> > does.
>
> Please quantify "poor block IO performance", to help people understand
> the userspace-visible effect of this change.
It is basically a bug, given that fast NVMe IO performance may drop to
1/2 or 1/3 with remote IO completion. Queue mapping shouldn't mix CPUs
from different NUMA nodes when nr_queues >= nr_nodes.
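FWIW, the invariant at stake can be expressed as a check along the
following lines (illustrative sketch only, in the style of
lib/group_cpus.c; groups_numa_local() is a hypothetical helper, not
part of the patch):

/*
 * Illustrative only: when numgrps >= nr_nodes, every resulting group
 * mask should stay within a single NUMA node's cpumask.
 */
static bool groups_numa_local(const struct cpumask *masks,
			      unsigned int numgrps,
			      cpumask_var_t *node_to_cpumask,
			      unsigned int nr_nodes)
{
	unsigned int g, n;

	for (g = 0; g < numgrps; g++) {
		bool local = false;

		for (n = 0; n < nr_nodes; n++) {
			if (cpumask_subset(&masks[g], node_to_cpumask[n])) {
				local = true;
				break;
			}
		}
		if (!local)
			return false;
	}
	return true;
}

With the blind wrap to 0, a check like this fails whenever a later
node's CPUs spill into an earlier node's group.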
>
> > The issue occurs when curgrp reaches last_grp and wraps to 0. This causes
> > CPUs from later-processed nodes to be added to groups that already contain
> > CPUs from earlier-processed nodes, violating NUMA locality.
> >
> > Example with 8 NUMA nodes, 16 groups:
> > - Each node gets 2 groups allocated
> > - After processing all nodes, curgrp reaches 16
> > - Wrapping to 0 causes CPUs from node N to be added to group 0, which
> >   already has CPUs from node 0
> >
> > Fix this by adding a find_next_node_group() helper that searches for the
> > next group (starting from 0) that already contains CPUs from the same
> > NUMA node. When wrapping is needed, use this helper instead of blindly
> > wrapping to 0, ensuring CPUs are only added to groups within the same
> > NUMA node.
> >
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> > lib/group_cpus.c | 28 +++++++++++++++++++++++++---
>
> The patch overlaps (a lot) with Wangyang Guo's "lib/group_cpus: make
> group CPU cluster aware". I did a lot of surgery but got stuck on the
> absence of node_to_cpumask, so I guess the patch has bitrotted.
>
> Please update the changelog as above and redo this patch against
> Wangyang's patch (which will be in linux-next very soon).
Please ignore this patch for now because I can't reproduce the original
issue on either v6.18 or v6.19-rc.
>
> Also, it would be great if you and Wangyang were to review and test
> each other's changes, thanks.
OK.
Thanks,
Ming