[PATCH v5 0/9] lib/group_cpus: rework grp_spread_init

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH v5 0/9] lib/group_cpus: rework grp_spread_init_one() and make it O(1)
@ 2024-01-20  2:50 Yury Norov
  2024-01-20  2:50 ` [PATCH 1/9] cpumask: introduce for_each_cpu_and_from() Yury Norov
                   ` (8 more replies)
  0 siblings, 9 replies; 18+ messages in thread
From: Yury Norov @ 2024-01-20  2:50 UTC (permalink / raw)
  To: Andrew Morton, Thomas Gleixner, Ming Lei, linux-kernel
  Cc: Yury Norov, Andy Shevchenko, Breno Leitao, Nathan Chancellor,
	Rasmus Villemoes, Zi Yan

grp_spread_init_one() implementation is sub-optimal because it
traverses bitmaps from the beginning, instead of picking from the
previous iteration.

Fix it and use find_bit API where appropriate. While here, optimize
cpumasks allocation and drop unneeded cpumask_empty() call.

---
v1: https://lore.kernel.org/all/ZW5MI3rKQueLM0Bz@yury-ThinkPad/T/
v2: https://lore.kernel.org/lkml/ZXKNVRu3AfvjaFhK@fedora/T/
v3: https://lore.kernel.org/lkml/20231212042108.682072-7-yury.norov@gmail.com/T/
v4: https://lore.kernel.org/lkml/20231228200936.2475595-1-yury.norov@gmail.com/T/
v5: add CPUMASK_NULL macro and use it to initialize cpumask_var_t
    variables properly.

On cpumask_var_t initialization issue:

The idea of having different types behind the same typedef has been
considered nasty for quite a while. See a comment in include/linux/cpumask.h
for example.

Now that I'm trying to adopt kernel cleanup machinery to cpumasks, it
reveals another disadvantage of this approach - there's no way to assign
a cpumask_var_t variable at declaration time, which is required by
cleanup implementation.

To fix that, in v5 I added a CPUMASK_NULL macro as a workaround. This
CPUMASK_NULL would be also useful for those converting existing codebase
to enable cleanup variables.

On a long term, it's better to drop CPUMASK_OFFSTACK entirely. Moreover,
it's used only on Power and x86 machines if NR_CPUS >= 8K (unless people
enable it explicitly, and nobody bothers doing that in a real life). But
it requires some more discussions with Power and x64 people...

Meanwhile, I'm going to submit a patchset that deprecates cpumask_var_t,
and adds a new set of allocators which would support initialization at
declaration time.

Yury Norov (9):
  cpumask: introduce for_each_cpu_and_from()
  lib/group_cpus: optimize inner loop in grp_spread_init_one()
  lib/group_cpus: relax atomicity requirement in grp_spread_init_one()
  lib/group_cpus: optimize outer loop in grp_spread_init_one()
  lib/group_cpus: don't zero cpumasks in group_cpus_evenly() on
    allocation
  lib/group_cpus: drop unneeded cpumask_empty() call in
    __group_cpus_evenly()
  cpumask: define cleanup function for cpumasks
  lib/group_cpus: rework group_cpus_evenly()
  lib/group_cpus: simplify group_cpus_evenly() for more

 include/linux/cpumask.h |  16 ++++++
 include/linux/find.h    |   3 ++
 lib/group_cpus.c        | 110 ++++++++++++++++------------------------
 3 files changed, 62 insertions(+), 67 deletions(-)

-- 
2.40.1

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/9] cpumask: introduce for_each_cpu_and_from()
  2024-01-20  2:50 [PATCH v5 0/9] lib/group_cpus: rework grp_spread_init_one() and make it O(1) Yury Norov
@ 2024-01-20  2:50 ` Yury Norov
  2024-01-20  3:03   ` Ming Lei
  2024-01-20  2:50 ` [PATCH 2/9] lib/group_cpus: optimize inner loop in grp_spread_init_one() Yury Norov
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Yury Norov @ 2024-01-20  2:50 UTC (permalink / raw)
  To: Andrew Morton, Thomas Gleixner, Ming Lei, linux-kernel
  Cc: Yury Norov, Andy Shevchenko, Breno Leitao, Nathan Chancellor,
	Rasmus Villemoes, Zi Yan

Similarly to for_each_cpu_and(), introduce a for_each_cpu_and_from(),
which is handy when it's needed to traverse 2 cpumasks or bitmaps,
starting from a given position.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 include/linux/cpumask.h | 11 +++++++++++
 include/linux/find.h    |  3 +++
 2 files changed, 14 insertions(+)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index cfb545841a2c..73ff2e0ef090 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -332,6 +332,17 @@ unsigned int __pure cpumask_next_wrap(int n, const struct cpumask *mask, int sta
 #define for_each_cpu_and(cpu, mask1, mask2)				\
 	for_each_and_bit(cpu, cpumask_bits(mask1), cpumask_bits(mask2), small_cpumask_bits)
 
+/**
+ * for_each_cpu_and_from - iterate over every cpu in both masks starting from a given cpu
+ * @cpu: the (optionally unsigned) integer iterator
+ * @mask1: the first cpumask pointer
+ * @mask2: the second cpumask pointer
+ *
+ * After the loop, cpu is >= nr_cpu_ids.
+ */
+#define for_each_cpu_and_from(cpu, mask1, mask2)				\
+	for_each_and_bit_from(cpu, cpumask_bits(mask1), cpumask_bits(mask2), small_cpumask_bits)
+
 /**
  * for_each_cpu_andnot - iterate over every cpu present in one mask, excluding
  *			 those present in another.
diff --git a/include/linux/find.h b/include/linux/find.h
index 5e4f39ef2e72..dfd3d51ff590 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -563,6 +563,9 @@ unsigned long find_next_bit_le(const void *addr, unsigned
 	     (bit) = find_next_and_bit((addr1), (addr2), (size), (bit)), (bit) < (size);\
 	     (bit)++)
 
+#define for_each_and_bit_from(bit, addr1, addr2, size) \
+	for (; (bit) = find_next_and_bit((addr1), (addr2), (size), (bit)), (bit) < (size); (bit)++)
+
 #define for_each_andnot_bit(bit, addr1, addr2, size) \
 	for ((bit) = 0;									\
 	     (bit) = find_next_andnot_bit((addr1), (addr2), (size), (bit)), (bit) < (size);\
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/9] cpumask: introduce for_each_cpu_and_from()
  2024-01-20  2:50 ` [PATCH 1/9] cpumask: introduce for_each_cpu_and_from() Yury Norov
@ 2024-01-20  3:03   ` Ming Lei
  2024-01-21 19:50     ` Yury Norov
  0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2024-01-20  3:03 UTC (permalink / raw)
  To: Yury Norov
  Cc: Andrew Morton, Thomas Gleixner, linux-kernel, Andy Shevchenko,
	Breno Leitao, Nathan Chancellor, Rasmus Villemoes, Zi Yan

On Fri, Jan 19, 2024 at 06:50:45PM -0800, Yury Norov wrote:
> Similarly to for_each_cpu_and(), introduce a for_each_cpu_and_from(),
> which is handy when it's needed to traverse 2 cpumasks or bitmaps,
> starting from a given position.

The new helper is useless, see

https://lore.kernel.org/lkml/ZZNgDb6bzOscrNmk@fedora/


Thanks,
Ming


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/9] cpumask: introduce for_each_cpu_and_from()
  2024-01-20  3:03   ` Ming Lei
@ 2024-01-21 19:50     ` Yury Norov
  2024-01-22  2:41       ` Ming Lei
  0 siblings, 1 reply; 18+ messages in thread
From: Yury Norov @ 2024-01-21 19:50 UTC (permalink / raw)
  To: Ming Lei
  Cc: Andrew Morton, Thomas Gleixner, linux-kernel, Andy Shevchenko,
	Breno Leitao, Nathan Chancellor, Rasmus Villemoes, Zi Yan

On Sat, Jan 20, 2024 at 11:03:37AM +0800, Ming Lei wrote:
> On Fri, Jan 19, 2024 at 06:50:45PM -0800, Yury Norov wrote:
> > Similarly to for_each_cpu_and(), introduce a for_each_cpu_and_from(),
> > which is handy when it's needed to traverse 2 cpumasks or bitmaps,
> > starting from a given position.
> 
> The new helper is useless, see
> 
> https://lore.kernel.org/lkml/ZZNgDb6bzOscrNmk@fedora/

Let's consider the following configuration.

CPUs: 0b1111
Sibling groups: 0b0011 and 0b1100
nmsk: 0b1111

As the complexity measure we take the number of accesses to nmsk in
the outer loop, and to (nmsk & sibl) in the inner loop in search
routines, so that

	cpumask_first(1111)

requires 1 access to find the first set bit, and

	cpumask_first(1000)

requires 4 such accesses.

Actual find_bit() ops work better than this by using __ffs(), but on long
bitmaps the performance will be as described above.

Now, look at the code. This is yours:

  static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
                                  unsigned int cpus_per_grp)
  {
          const struct cpumask *siblmsk;
          int cpu, sibl;
  
          for ( ; cpus_per_grp > 0; ) {
                  cpu = cpumask_first(nmsk);
  
                  /* Should not happen, but I'm too lazy to think about it */
                  if (cpu >= nr_cpu_ids)
                          return;
  
                  cpumask_clear_cpu(cpu, nmsk);
                  cpumask_set_cpu(cpu, irqmsk);
                  cpus_per_grp--;
  
                  /* If the cpu has siblings, use them first */
                  siblmsk = topology_sibling_cpumask(cpu);
                  for (sibl = -1; cpus_per_grp > 0; ) {
                          sibl = cpumask_next(sibl, siblmsk);
                          if (sibl >= nr_cpu_ids)
                                  break;
                          if (!cpumask_test_and_clear_cpu(sibl, nmsk))
                                  continue;
                          cpumask_set_cpu(sibl, irqmsk);
                          cpus_per_grp--;
                  }
          }
  }

This is your code step-by-step:

#	loop	cpu	match	siblmsk	nmsk	irqmsk
 0	outer	0	yes 		1110	0001
 1	inner	0	no  		1110	0001
 2	inner	1	yes 	0011	1100	0011
 3	inner	2	no 		1100	0011
 4	inner	3	no 		1100	0011
 5	outer	0	no	 	1100	0011
 6	outer	1	no	 	1100	0011
 7	outer	2	yes	 	1000	0111
 8	inner	0	no 	1100	1000	0111
 9	inner	1	no 	1100	1000	0111
10	inner	2	no 	1100	1000	0111
11	inner	3	yes	1100	0000	1111
12	outer	0	no	 	0000	1111
13	outer	1	no	 	0000	1111
14	outer	2	no	 	0000	1111
15	outer	3	no	 	0000	1111

This is mine:

  static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
                                  unsigned int cpus_per_grp)
  {
          const struct cpumask *siblmsk;
          int cpu, sibl;
  
          for_each_cpu(cpu, nmsk) {
                  if (cpus_per_grp-- == 0)
                          return;
  
                  /*
                   * If a caller wants to spread IRQa on offline CPUs, we need to
                   * take care of it explicitly because those offline CPUS are not
                   * included in siblings cpumask.
                   */
                  __cpumask_clear_cpu(cpu, nmsk);
                  __cpumask_set_cpu(cpu, irqmsk);
  
                  /* If the cpu has siblings, use them first */
                  siblmsk = topology_sibling_cpumask(cpu);
                  sibl = cpu + 1;
  
                  for_each_cpu_and_from(sibl, siblmsk, nmsk) {
                          if (cpus_per_grp-- == 0)
                                  return;
  
                          __cpumask_clear_cpu(sibl, nmsk);
                          __cpumask_set_cpu(sibl, irqmsk);
                          cpu = sibl + 1;
                  }
          }
  }

Step-by-step:

#	loop	cpu	match	siblmsk	nmsk	irqmsk
 0	outer	0	yes 		1110	0001
 1	inner	1	yes 	0011	1100	0011
 2	inner	2	no 	0011	1100	0011
 3	inner	3	no 	0011	1100	0011
 4	outer	2	yes	 	1000	0111
 5	inner	3	yes	1100	0000	1111

Your code works worse because it's a Schlemiel the Painter's algorithm.
I mentioned it twice in the commit messages and at least 3 times in
replies to your comments.

Here I'll stop and will not reply to your emails, including the rest of
that Friday's night mailbombing, unless you at least admit you're wrong
in this case and for_each_cpu_and_from() is useful here. 

I'd also recommend you to learn more about atomic operations basics and
revoke your NAK from the patch #3.

Thanks,
	Yury

PS: There's a typo in the series name, I meant that the series makes the
function O(N) of course. But even that is overly optimistic. It's O(N*S),
where S is the number of sibling groups. A couple more patches needed to
make it a true O(N). Still, much better.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/9] cpumask: introduce for_each_cpu_and_from()
  2024-01-21 19:50     ` Yury Norov
@ 2024-01-22  2:41       ` Ming Lei
  0 siblings, 0 replies; 18+ messages in thread
From: Ming Lei @ 2024-01-22  2:41 UTC (permalink / raw)
  To: Yury Norov
  Cc: Andrew Morton, Thomas Gleixner, linux-kernel, Andy Shevchenko,
	Breno Leitao, Nathan Chancellor, Rasmus Villemoes, Zi Yan,
	ming.lei

On Sun, Jan 21, 2024 at 11:50:02AM -0800, Yury Norov wrote:
> On Sat, Jan 20, 2024 at 11:03:37AM +0800, Ming Lei wrote:
> > On Fri, Jan 19, 2024 at 06:50:45PM -0800, Yury Norov wrote:
> > > Similarly to for_each_cpu_and(), introduce a for_each_cpu_and_from(),
> > > which is handy when it's needed to traverse 2 cpumasks or bitmaps,
> > > starting from a given position.
> > 
> > The new helper is useless, see
> > 
> > https://lore.kernel.org/lkml/ZZNgDb6bzOscrNmk@fedora/
> 
> Let's consider the following configuration.
> Step-by-step:

...

> 
> #	loop	cpu	match	siblmsk	nmsk	irqmsk
>  0	outer	0	yes 		1110	0001
>  1	inner	1	yes 	0011	1100	0011
>  2	inner	2	no 	0011	1100	0011
>  3	inner	3	no 	0011	1100	0011
>  4	outer	2	yes	 	1000	0111
>  5	inner	3	yes	1100	0000	1111
> 
> Your code works worse because it's a Schlemiel the Painter's algorithm.
> I mentioned it twice in the commit messages and at least 3 times in
> replies to your comments.

Does it really matter here in reality? Which kind of user visible improvements
can be observed?

I have mentioned several times, for control/management code path, we care
more on maintainability, correctness instead of efficiency.

You are _wasting_ resources in wrong place, if you are really interested in
optimization, please do in fast code path, such as, related and not not limited,
irq handling, io handling, memory allocation, ....

Unfortunately, your V5 still have obvious bug, and as you mentioned,
the patchset title is wrong too.

> 
> Here I'll stop and will not reply to your emails, including the rest of
> that Friday's night mailbombing, unless you at least admit you're wrong
> in this case and for_each_cpu_and_from() is useful here. 

It is easy to get same result without adding for_each_cpu_and_from(), see the
patch I sent:

https://lore.kernel.org/lkml/20240120065543.739203-1-ming.lei@redhat.com/

in which we needn't to update iterator variable inside loop, and fix
the bug in your patch 4 of v5, and it is still O(N). Meantime it is
simpler and easier to get proved.

Here your use of for_each_cpu_and_from() is tricky too, cause the
loop condition variable(part of iterator variable) of cpu mask is being updated
inside the loop. And we can get same result by cpumask_next_and()
without playing the trick.

> 
> I'd also recommend you to learn more about atomic operations basics and
> revoke your NAK from the patch #3.

If you think my comment on the NAK is wrong, please reply on the comment
directly.

> 
> Thanks,
> 	Yury
> 
> PS: There's a typo in the series name, I meant that the series makes the
> function O(N) of course. But even that is overly optimistic. It's O(N*S),
> where S is the number of sibling groups. A couple more patches needed to
> make it a true O(N). Still, much better.

Either O(1) or O(N) isn't one big deal here, cause it is oneshot
slow code path, and nr_cpu_ids is not big enough in reality.

Even you can't make real O(N) because your patch 4 has logic
mistake, see my comment:

https://lore.kernel.org/lkml/ZatlggW%2F8SH6od9O@fedora/

Thanks,
Ming

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 2/9] lib/group_cpus: optimize inner loop in grp_spread_init_one()
  2024-01-20  2:50 [PATCH v5 0/9] lib/group_cpus: rework grp_spread_init_one() and make it O(1) Yury Norov
  2024-01-20  2:50 ` [PATCH 1/9] cpumask: introduce for_each_cpu_and_from() Yury Norov
@ 2024-01-20  2:50 ` Yury Norov
  2024-01-20  3:17   ` Ming Lei
  2024-01-20  2:50 ` [PATCH 3/9] lib/group_cpus: relax atomicity requirement " Yury Norov
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Yury Norov @ 2024-01-20  2:50 UTC (permalink / raw)
  To: Andrew Morton, Thomas Gleixner, Ming Lei, linux-kernel
  Cc: Yury Norov, Andy Shevchenko, Breno Leitao, Nathan Chancellor,
	Rasmus Villemoes, Zi Yan

The loop starts from the beginning every time we switch to the next
sibling mask. This is the Schlemiel the Painter's style of coding
because we know for sure that nmsk is clear up to current CPU, and we
can just continue from the next CPU.

Also, we can do it nicer if leverage the dedicated for_each() iterator,
and simplify the logic of clearing a bit in nmsk.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 lib/group_cpus.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index ee272c4cefcc..063ed9ae1b8d 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -30,14 +30,14 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
 
 		/* If the cpu has siblings, use them first */
 		siblmsk = topology_sibling_cpumask(cpu);
-		for (sibl = -1; cpus_per_grp > 0; ) {
-			sibl = cpumask_next(sibl, siblmsk);
-			if (sibl >= nr_cpu_ids)
-				break;
-			if (!cpumask_test_and_clear_cpu(sibl, nmsk))
-				continue;
+		sibl = cpu + 1;
+
+		for_each_cpu_and_from(sibl, siblmsk, nmsk) {
+			if (cpus_per_grp-- == 0)
+				return;
+
+			cpumask_clear_cpu(sibl, nmsk);
 			cpumask_set_cpu(sibl, irqmsk);
-			cpus_per_grp--;
 		}
 	}
 }
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/9] lib/group_cpus: optimize inner loop in grp_spread_init_one()
  2024-01-20  2:50 ` [PATCH 2/9] lib/group_cpus: optimize inner loop in grp_spread_init_one() Yury Norov
@ 2024-01-20  3:17   ` Ming Lei
  2024-01-20  7:03     ` Ming Lei
  0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2024-01-20  3:17 UTC (permalink / raw)
  To: Yury Norov
  Cc: Andrew Morton, Thomas Gleixner, linux-kernel, Andy Shevchenko,
	Breno Leitao, Nathan Chancellor, Rasmus Villemoes, Zi Yan,
	ming.lei

On Fri, Jan 19, 2024 at 06:50:46PM -0800, Yury Norov wrote:
> The loop starts from the beginning every time we switch to the next
> sibling mask. This is the Schlemiel the Painter's style of coding
> because we know for sure that nmsk is clear up to current CPU, and we
> can just continue from the next CPU.
> 
> Also, we can do it nicer if leverage the dedicated for_each() iterator,
> and simplify the logic of clearing a bit in nmsk.
> 
> Signed-off-by: Yury Norov <yury.norov@gmail.com>
> ---
>  lib/group_cpus.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/lib/group_cpus.c b/lib/group_cpus.c
> index ee272c4cefcc..063ed9ae1b8d 100644
> --- a/lib/group_cpus.c
> +++ b/lib/group_cpus.c
> @@ -30,14 +30,14 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
>  
>  		/* If the cpu has siblings, use them first */
>  		siblmsk = topology_sibling_cpumask(cpu);
> -		for (sibl = -1; cpus_per_grp > 0; ) {
> -			sibl = cpumask_next(sibl, siblmsk);
> -			if (sibl >= nr_cpu_ids)
> -				break;
> -			if (!cpumask_test_and_clear_cpu(sibl, nmsk))
> -				continue;
> +		sibl = cpu + 1;

No, it is silly to let 'sibl' point to 'cpu + 1', cause we just
want to iterate over 'siblmsk & nmsk', and nothing to do with
the next cpu('cpu + 1').

> +
> +		for_each_cpu_and_from(sibl, siblmsk, nmsk) {
> +			if (cpus_per_grp-- == 0)
> +				return;
> +
> +			cpumask_clear_cpu(sibl, nmsk);
>  			cpumask_set_cpu(sibl, irqmsk);
> -			cpus_per_grp--;

Andrew, please replace the 1st two patches with the following one:

From 7a983ee5e1b4f05e5ae26c025dffd801b909e2f3 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Sat, 20 Jan 2024 11:07:26 +0800
Subject: [PATCH] lib/group_cpus.c: simplify grp_spread_init_one()

What the inner loop needs to do is to iterate over `siblmsk & nmsk`, and
clear the cpu in 'nmsk' and set it in 'irqmsk'.

Clean it by for_each_cpu_and().

This is based on Yury Norov's patch, which needs one extra
for_each_cpu_and_from(), which is really not necessary.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 lib/group_cpus.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index ee272c4cefcc..564d8e817f65 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -30,14 +30,11 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
 
 		/* If the cpu has siblings, use them first */
 		siblmsk = topology_sibling_cpumask(cpu);
-		for (sibl = -1; cpus_per_grp > 0; ) {
-			sibl = cpumask_next(sibl, siblmsk);
-			if (sibl >= nr_cpu_ids)
-				break;
-			if (!cpumask_test_and_clear_cpu(sibl, nmsk))
-				continue;
+		for_each_cpu_and(sibl, siblmsk, nmsk) {
+			cpumask_clear_cpu(sibl, nmsk);
 			cpumask_set_cpu(sibl, irqmsk);
-			cpus_per_grp--;
+			if (--cpus_per_grp == 0)
+				return;
 		}
 	}
 }
-- 
2.42.0





Thanks,
Ming


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/9] lib/group_cpus: optimize inner loop in grp_spread_init_one()
  2024-01-20  3:17   ` Ming Lei
@ 2024-01-20  7:03     ` Ming Lei
  0 siblings, 0 replies; 18+ messages in thread
From: Ming Lei @ 2024-01-20  7:03 UTC (permalink / raw)
  To: Yury Norov
  Cc: Andrew Morton, Thomas Gleixner, linux-kernel, Andy Shevchenko,
	Breno Leitao, Nathan Chancellor, Rasmus Villemoes, Zi Yan

On Sat, Jan 20, 2024 at 11:17:00AM +0800, Ming Lei wrote:
> On Fri, Jan 19, 2024 at 06:50:46PM -0800, Yury Norov wrote:
> > The loop starts from the beginning every time we switch to the next
> > sibling mask. This is the Schlemiel the Painter's style of coding
> > because we know for sure that nmsk is clear up to current CPU, and we
> > can just continue from the next CPU.
> > 
> > Also, we can do it nicer if leverage the dedicated for_each() iterator,
> > and simplify the logic of clearing a bit in nmsk.
> > 
> > Signed-off-by: Yury Norov <yury.norov@gmail.com>
> > ---
> >  lib/group_cpus.c | 14 +++++++-------
> >  1 file changed, 7 insertions(+), 7 deletions(-)
> > 
> > diff --git a/lib/group_cpus.c b/lib/group_cpus.c
> > index ee272c4cefcc..063ed9ae1b8d 100644
> > --- a/lib/group_cpus.c
> > +++ b/lib/group_cpus.c
> > @@ -30,14 +30,14 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
> >  
> >  		/* If the cpu has siblings, use them first */
> >  		siblmsk = topology_sibling_cpumask(cpu);
> > -		for (sibl = -1; cpus_per_grp > 0; ) {
> > -			sibl = cpumask_next(sibl, siblmsk);
> > -			if (sibl >= nr_cpu_ids)
> > -				break;
> > -			if (!cpumask_test_and_clear_cpu(sibl, nmsk))
> > -				continue;
> > +		sibl = cpu + 1;
> 
> No, it is silly to let 'sibl' point to 'cpu + 1', cause we just
> want to iterate over 'siblmsk & nmsk', and nothing to do with
> the next cpu('cpu + 1').
> 
> > +
> > +		for_each_cpu_and_from(sibl, siblmsk, nmsk) {
> > +			if (cpus_per_grp-- == 0)
> > +				return;
> > +
> > +			cpumask_clear_cpu(sibl, nmsk);
> >  			cpumask_set_cpu(sibl, irqmsk);
> > -			cpus_per_grp--;
> 
> Andrew, please replace the 1st two patches with the following one:
> 
> From 7a983ee5e1b4f05e5ae26c025dffd801b909e2f3 Mon Sep 17 00:00:00 2001
> From: Ming Lei <ming.lei@redhat.com>
> Date: Sat, 20 Jan 2024 11:07:26 +0800
> Subject: [PATCH] lib/group_cpus.c: simplify grp_spread_init_one()
> 
> What the inner loop needs to do is to iterate over `siblmsk & nmsk`, and
> clear the cpu in 'nmsk' and set it in 'irqmsk'.
> 
> Clean it by for_each_cpu_and().
> 
> This is based on Yury Norov's patch, which needs one extra
> for_each_cpu_and_from(), which is really not necessary.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  lib/group_cpus.c | 11 ++++-------
>  1 file changed, 4 insertions(+), 7 deletions(-)
> 
> diff --git a/lib/group_cpus.c b/lib/group_cpus.c
> index ee272c4cefcc..564d8e817f65 100644
> --- a/lib/group_cpus.c
> +++ b/lib/group_cpus.c
> @@ -30,14 +30,11 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
>  
>  		/* If the cpu has siblings, use them first */
>  		siblmsk = topology_sibling_cpumask(cpu);
> -		for (sibl = -1; cpus_per_grp > 0; ) {
> -			sibl = cpumask_next(sibl, siblmsk);
> -			if (sibl >= nr_cpu_ids)
> -				break;
> -			if (!cpumask_test_and_clear_cpu(sibl, nmsk))
> -				continue;
> +		for_each_cpu_and(sibl, siblmsk, nmsk) {
> +			cpumask_clear_cpu(sibl, nmsk);
>  			cpumask_set_cpu(sibl, irqmsk);
> -			cpus_per_grp--;
> +			if (--cpus_per_grp == 0)
> +				return;

Iterator variable of 'nmsk' is updated inside loop, and it is still
tricky, so please ignore it, I just sent one formal & revised patch:

https://lkml.org/lkml/2024/1/20/43


Thanks,
Ming


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 3/9] lib/group_cpus: relax atomicity requirement in grp_spread_init_one()
  2024-01-20  2:50 [PATCH v5 0/9] lib/group_cpus: rework grp_spread_init_one() and make it O(1) Yury Norov
  2024-01-20  2:50 ` [PATCH 1/9] cpumask: introduce for_each_cpu_and_from() Yury Norov
  2024-01-20  2:50 ` [PATCH 2/9] lib/group_cpus: optimize inner loop in grp_spread_init_one() Yury Norov
@ 2024-01-20  2:50 ` Yury Norov
  2024-01-20  2:50 ` [PATCH 4/9] lib/group_cpus: optimize outer loop " Yury Norov
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Yury Norov @ 2024-01-20  2:50 UTC (permalink / raw)
  To: Andrew Morton, Thomas Gleixner, Ming Lei, linux-kernel
  Cc: Yury Norov, Andy Shevchenko, Breno Leitao, Nathan Chancellor,
	Rasmus Villemoes, Zi Yan

Because nmsk and irqmsk are stable, extra atomicity is not required.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
NAKed-by: Ming Lei <ming.lei@redhat.com>
---

Regarding the NAK discussion:

  > > > > I think this kind of change should be avoided, here the code is
  > > > > absolutely in slow path, and we care code cleanness and readability
  > > > > much more than the saved cycle from non atomicity.
  > > >
  > > > Atomic ops have special meaning and special function. This 'atomic' way
  > > > of moving a bit from one bitmap to another looks completely non-trivial
  > > > and puzzling to me.
  > > >
  > > > A sequence of atomic ops is not atomic itself. Normally it's a sing of
  > > > a bug. But in this case, both masks are stable, and we don't need
  > > > atomicity at all.
  > >
  > > Here we don't care the atomicity.
  > >
  > > >
  > > > It's not about performance, it's about readability.
  > >
  > > __cpumask_clear_cpu() and __cpumask_set_cpu() are more like private
  > > helper, and more hard to follow.
  >
  > No that's not true. Non-atomic version of the function is not a
  > private helper of course.
  >
  > > [@linux]$ git grep -n -w -E "cpumask_clear_cpu|cpumask_set_cpu" ./ | wc
  > >     674    2055   53954
  > > [@linux]$ git grep -n -w -E "__cpumask_clear_cpu|__cpumask_set_cpu" ./ | wc
  > >      21      74    1580
  > >
  > > I don't object to comment the current usage, but NAK for this change.
  >
  > No problem, I'll add you NAK.
  
  You can add the following words meantime:
  
  __cpumask_clear_cpu() and __cpumask_set_cpu() are added in commit 6c8557bdb28d
  ("smp, cpumask: Use non-atomic cpumask_{set,clear}_cpu()") for fast code path(
  smp_call_function_many()).
  
  We have ~670 users of cpumask_clear_cpu & cpumask_set_cpu, lots of them
  fall into same category with group_cpus.c(doesn't care atomicity, not in fast
  code path), and needn't change to __cpumask_clear_cpu() and __cpumask_set_cpu().
  Otherwise, this way may encourage to update others into the __cpumask_* version.

 lib/group_cpus.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index 063ed9ae1b8d..0a8ac7cb1a5d 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -24,8 +24,8 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
 		if (cpu >= nr_cpu_ids)
 			return;
 
-		cpumask_clear_cpu(cpu, nmsk);
-		cpumask_set_cpu(cpu, irqmsk);
+		__cpumask_clear_cpu(cpu, nmsk);
+		__cpumask_set_cpu(cpu, irqmsk);
 		cpus_per_grp--;
 
 		/* If the cpu has siblings, use them first */
@@ -36,8 +36,8 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
 			if (cpus_per_grp-- == 0)
 				return;
 
-			cpumask_clear_cpu(sibl, nmsk);
-			cpumask_set_cpu(sibl, irqmsk);
+			__cpumask_clear_cpu(sibl, nmsk);
+			__cpumask_set_cpu(sibl, irqmsk);
 		}
 	}
 }
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 4/9] lib/group_cpus: optimize outer loop in grp_spread_init_one()
  2024-01-20  2:50 [PATCH v5 0/9] lib/group_cpus: rework grp_spread_init_one() and make it O(1) Yury Norov
                   ` (2 preceding siblings ...)
  2024-01-20  2:50 ` [PATCH 3/9] lib/group_cpus: relax atomicity requirement " Yury Norov
@ 2024-01-20  2:50 ` Yury Norov
  2024-01-20  3:51   ` Ming Lei
  2024-01-20  2:50 ` [PATCH 5/9] lib/group_cpus: don't zero cpumasks in group_cpus_evenly() on allocation Yury Norov
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Yury Norov @ 2024-01-20  2:50 UTC (permalink / raw)
  To: Andrew Morton, Thomas Gleixner, Ming Lei, linux-kernel
  Cc: Yury Norov, Andy Shevchenko, Breno Leitao, Nathan Chancellor,
	Rasmus Villemoes, Zi Yan

Similarly to the inner loop, in the outer loop we can use for_each_cpu()
macro, and skip CPUs that have been moved.

With this patch, the function becomes O(1), despite that it's a
double-loop.

While here, add a comment why we can't merge outer logic into the inner
loop.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 lib/group_cpus.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index 0a8ac7cb1a5d..952aac9eaa81 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -17,16 +17,17 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
 	const struct cpumask *siblmsk;
 	int cpu, sibl;
 
-	for ( ; cpus_per_grp > 0; ) {
-		cpu = cpumask_first(nmsk);
-
-		/* Should not happen, but I'm too lazy to think about it */
-		if (cpu >= nr_cpu_ids)
+	for_each_cpu(cpu, nmsk) {
+		if (cpus_per_grp-- == 0)
 			return;
 
+		/*
+		 * If a caller wants to spread IRQa on offline CPUs, we need to
+		 * take care of it explicitly because those offline CPUS are not
+		 * included in siblings cpumask.
+		 */
 		__cpumask_clear_cpu(cpu, nmsk);
 		__cpumask_set_cpu(cpu, irqmsk);
-		cpus_per_grp--;
 
 		/* If the cpu has siblings, use them first */
 		siblmsk = topology_sibling_cpumask(cpu);
@@ -38,6 +39,7 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
 
 			__cpumask_clear_cpu(sibl, nmsk);
 			__cpumask_set_cpu(sibl, irqmsk);
+			cpu = sibl + 1;
 		}
 	}
 }
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/9] lib/group_cpus: optimize outer loop in grp_spread_init_one()
  2024-01-20  2:50 ` [PATCH 4/9] lib/group_cpus: optimize outer loop " Yury Norov
@ 2024-01-20  3:51   ` Ming Lei
  2024-01-20  6:17     ` Ming Lei
  0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2024-01-20  3:51 UTC (permalink / raw)
  To: Yury Norov
  Cc: Andrew Morton, Thomas Gleixner, linux-kernel, Andy Shevchenko,
	Breno Leitao, Nathan Chancellor, Rasmus Villemoes, Zi Yan,
	ming.lei

On Fri, Jan 19, 2024 at 06:50:48PM -0800, Yury Norov wrote:
> Similarly to the inner loop, in the outer loop we can use for_each_cpu()
> macro, and skip CPUs that have been moved.
> 
> With this patch, the function becomes O(1), despite that it's a
> double-loop.
> 
> While here, add a comment why we can't merge outer logic into the inner
> loop.
> 
> Signed-off-by: Yury Norov <yury.norov@gmail.com>
> ---
>  lib/group_cpus.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/lib/group_cpus.c b/lib/group_cpus.c
> index 0a8ac7cb1a5d..952aac9eaa81 100644
> --- a/lib/group_cpus.c
> +++ b/lib/group_cpus.c
> @@ -17,16 +17,17 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
>  	const struct cpumask *siblmsk;
>  	int cpu, sibl;
>  
> -	for ( ; cpus_per_grp > 0; ) {
> -		cpu = cpumask_first(nmsk);
> -
> -		/* Should not happen, but I'm too lazy to think about it */
> -		if (cpu >= nr_cpu_ids)
> +	for_each_cpu(cpu, nmsk) {
> +		if (cpus_per_grp-- == 0)
>  			return;
>  
> +		/*
> +		 * If a caller wants to spread IRQa on offline CPUs, we need to
> +		 * take care of it explicitly because those offline CPUS are not
> +		 * included in siblings cpumask.
> +		 */
>  		__cpumask_clear_cpu(cpu, nmsk);
>  		__cpumask_set_cpu(cpu, irqmsk);
> -		cpus_per_grp--;
>  
>  		/* If the cpu has siblings, use them first */
>  		siblmsk = topology_sibling_cpumask(cpu);
> @@ -38,6 +39,7 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
>  
>  			__cpumask_clear_cpu(sibl, nmsk);
>  			__cpumask_set_cpu(sibl, irqmsk);
> +			cpu = sibl + 1;

It has been tricky enough to update condition variable of for_each_cpu()
(such kind of pattern can't build in Rust at all), and the above line could
be more tricky actually.

You can get O(1)(not sure it matters here) by using cpumask_next(), which
is more readable, isn't it?

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index 564d8e817f65..e0ce878ac4c4 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -15,10 +15,10 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
 				unsigned int cpus_per_grp)
 {
 	const struct cpumask *siblmsk;
-	int cpu, sibl;
+	int cpu = -1;
 
-	for ( ; cpus_per_grp > 0; ) {
-		cpu = cpumask_first(nmsk);
+	while (cpus_per_grp > 0) {
+		cpu = cpumask_next(cpu, nmsk);
 
 		/* Should not happen, but I'm too lazy to think about it */
 		if (cpu >= nr_cpu_ids)
@@ -30,9 +30,9 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
 
 		/* If the cpu has siblings, use them first */
 		siblmsk = topology_sibling_cpumask(cpu);
-		for_each_cpu_and(sibl, siblmsk, nmsk) {
-			cpumask_clear_cpu(sibl, nmsk);
-			cpumask_set_cpu(sibl, irqmsk);
+		for_each_cpu_and(cpu, siblmsk, nmsk) {
+			cpumask_clear_cpu(cpu, nmsk);
+			cpumask_set_cpu(cpu, irqmsk);
 			if (--cpus_per_grp == 0)
 				return;
 		}




Thanks,
Ming


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/9] lib/group_cpus: optimize outer loop in grp_spread_init_one()
  2024-01-20  3:51   ` Ming Lei
@ 2024-01-20  6:17     ` Ming Lei
  0 siblings, 0 replies; 18+ messages in thread
From: Ming Lei @ 2024-01-20  6:17 UTC (permalink / raw)
  To: Yury Norov
  Cc: Andrew Morton, Thomas Gleixner, linux-kernel, Andy Shevchenko,
	Breno Leitao, Nathan Chancellor, Rasmus Villemoes, Zi Yan,
	ming.lei

On Sat, Jan 20, 2024 at 11:51:58AM +0800, Ming Lei wrote:
> On Fri, Jan 19, 2024 at 06:50:48PM -0800, Yury Norov wrote:
> > Similarly to the inner loop, in the outer loop we can use for_each_cpu()
> > macro, and skip CPUs that have been moved.
> > 
> > With this patch, the function becomes O(1), despite that it's a
> > double-loop.
> > 
> > While here, add a comment why we can't merge outer logic into the inner
> > loop.
> > 
> > Signed-off-by: Yury Norov <yury.norov@gmail.com>
> > ---
> >  lib/group_cpus.c | 14 ++++++++------
> >  1 file changed, 8 insertions(+), 6 deletions(-)
> > 
> > diff --git a/lib/group_cpus.c b/lib/group_cpus.c
> > index 0a8ac7cb1a5d..952aac9eaa81 100644
> > --- a/lib/group_cpus.c
> > +++ b/lib/group_cpus.c
> > @@ -17,16 +17,17 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
> >  	const struct cpumask *siblmsk;
> >  	int cpu, sibl;
> >  
> > -	for ( ; cpus_per_grp > 0; ) {
> > -		cpu = cpumask_first(nmsk);
> > -
> > -		/* Should not happen, but I'm too lazy to think about it */
> > -		if (cpu >= nr_cpu_ids)
> > +	for_each_cpu(cpu, nmsk) {
> > +		if (cpus_per_grp-- == 0)
> >  			return;
> >  
> > +		/*
> > +		 * If a caller wants to spread IRQa on offline CPUs, we need to
> > +		 * take care of it explicitly because those offline CPUS are not
> > +		 * included in siblings cpumask.
> > +		 */
> >  		__cpumask_clear_cpu(cpu, nmsk);
> >  		__cpumask_set_cpu(cpu, irqmsk);
> > -		cpus_per_grp--;
> >  
> >  		/* If the cpu has siblings, use them first */
> >  		siblmsk = topology_sibling_cpumask(cpu);
> > @@ -38,6 +39,7 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
> >  
> >  			__cpumask_clear_cpu(sibl, nmsk);
> >  			__cpumask_set_cpu(sibl, irqmsk);
> > +			cpu = sibl + 1;
> 
> It has been tricky enough to update condition variable of for_each_cpu()
> (such kind of pattern can't build in Rust at all), and the above line could
> be more tricky actually.

Not only the above line is tricky, but also it is wrong, because 'cpu'
local variable should always point to the 1st bit in 'nmsk'. However, if
you set it to 'sibl + 1', some bits in 'nmsk' are skipped in the loop,
aren't they?


Thanks,
Ming


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 5/9] lib/group_cpus: don't zero cpumasks in group_cpus_evenly() on allocation
  2024-01-20  2:50 [PATCH v5 0/9] lib/group_cpus: rework grp_spread_init_one() and make it O(1) Yury Norov
                   ` (3 preceding siblings ...)
  2024-01-20  2:50 ` [PATCH 4/9] lib/group_cpus: optimize outer loop " Yury Norov
@ 2024-01-20  2:50 ` Yury Norov
  2024-01-20  2:50 ` [PATCH 6/9] lib/group_cpus: drop unneeded cpumask_empty() call in __group_cpus_evenly() Yury Norov
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Yury Norov @ 2024-01-20  2:50 UTC (permalink / raw)
  To: Andrew Morton, Thomas Gleixner, Ming Lei, linux-kernel
  Cc: Yury Norov, Andy Shevchenko, Breno Leitao, Nathan Chancellor,
	Rasmus Villemoes, Zi Yan

nmsk and npresmsk are both allocated with zalloc_cpumask_var(), but they
are initialized by copying later in the code, and so may be allocated
uninitialized.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 lib/group_cpus.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index 952aac9eaa81..72c308f8c322 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -354,10 +354,10 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
 	int ret = -ENOMEM;
 	struct cpumask *masks = NULL;
 
-	if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
+	if (!alloc_cpumask_var(&nmsk, GFP_KERNEL))
 		return NULL;
 
-	if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL))
+	if (!alloc_cpumask_var(&npresmsk, GFP_KERNEL))
 		goto fail_nmsk;
 
 	node_to_cpumask = alloc_node_to_cpumask();
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 6/9] lib/group_cpus: drop unneeded cpumask_empty() call in __group_cpus_evenly()
  2024-01-20  2:50 [PATCH v5 0/9] lib/group_cpus: rework grp_spread_init_one() and make it O(1) Yury Norov
                   ` (4 preceding siblings ...)
  2024-01-20  2:50 ` [PATCH 5/9] lib/group_cpus: don't zero cpumasks in group_cpus_evenly() on allocation Yury Norov
@ 2024-01-20  2:50 ` Yury Norov
  2024-01-20  2:50 ` [PATCH 7/9] cpumask: define cleanup function for cpumasks Yury Norov
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Yury Norov @ 2024-01-20  2:50 UTC (permalink / raw)
  To: Andrew Morton, Thomas Gleixner, Ming Lei, linux-kernel
  Cc: Yury Norov, Andy Shevchenko, Breno Leitao, Nathan Chancellor,
	Rasmus Villemoes, Zi Yan

The function is called twice. First time it's called with
cpumask_present as a parameter, which can't be empty. Second time it's
called with a mask created with cpumask_andnot(), which returns false if
the result is an empty mask.

We can safely drop redundant cpumask_empty() call from the
__group_cpus_evenly() and save few cycles.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
---
 lib/group_cpus.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index 72c308f8c322..b8c0c3ae2bbd 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -259,9 +259,6 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
 	nodemask_t nodemsk = NODE_MASK_NONE;
 	struct node_groups *node_groups;
 
-	if (cpumask_empty(cpu_mask))
-		return 0;
-
 	nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_mask, &nodemsk);
 
 	/*
@@ -401,9 +398,14 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
 		curgrp = 0;
 	else
 		curgrp = nr_present;
-	cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk);
-	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
-				  npresmsk, nmsk, masks);
+
+	if (cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk))
+		/* If npresmsk is not empty */
+		ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
+					  npresmsk, nmsk, masks);
+	else
+		ret = 0;
+
 	if (ret >= 0)
 		nr_others = ret;
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 7/9] cpumask: define cleanup function for cpumasks
  2024-01-20  2:50 [PATCH v5 0/9] lib/group_cpus: rework grp_spread_init_one() and make it O(1) Yury Norov
                   ` (5 preceding siblings ...)
  2024-01-20  2:50 ` [PATCH 6/9] lib/group_cpus: drop unneeded cpumask_empty() call in __group_cpus_evenly() Yury Norov
@ 2024-01-20  2:50 ` Yury Norov
  2024-01-20  2:50 ` [PATCH 8/9] lib/group_cpus: rework group_cpus_evenly() Yury Norov
  2024-01-20  2:50 ` [PATCH 9/9] lib/group_cpus: simplify group_cpus_evenly() for more Yury Norov
  8 siblings, 0 replies; 18+ messages in thread
From: Yury Norov @ 2024-01-20  2:50 UTC (permalink / raw)
  To: Andrew Morton, Thomas Gleixner, Ming Lei, linux-kernel
  Cc: Yury Norov, Andy Shevchenko, Breno Leitao, Nathan Chancellor,
	Rasmus Villemoes, Zi Yan

Now we can simplify a code that allocates cpumasks for local needs.

Automatic variables have to be initialized at declaration, or at least
before any possibility for the logic to return, so that compiler
wouldn't try to call an associate destructor function on a random stack
number.

Because cpumask_var_t, depending on the CPUMASK_OFFSTACK config, is
either a pointer or an array, we have to have a macro for initialization.

So define a CPUMASK_NULL macro, which allows to init struct cpumask
pointer with NULL when CPUMASK_OFFSTACK is enabled, and effectively
a no-op when CPUMASK_OFFSTACK is disabled.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 include/linux/cpumask.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 73ff2e0ef090..0dd8e810200f 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -7,6 +7,7 @@
  * set of CPUs in a system, one bit position per CPU number.  In general,
  * only nr_cpu_ids (<= NR_CPUS) bits are valid.
  */
+#include <linux/cleanup.h>
 #include <linux/kernel.h>
 #include <linux/threads.h>
 #include <linux/bitmap.h>
@@ -898,6 +899,7 @@ typedef struct cpumask *cpumask_var_t;
 
 #define this_cpu_cpumask_var_ptr(x)	this_cpu_read(x)
 #define __cpumask_var_read_mostly	__read_mostly
+#define CPUMASK_NULL			NULL
 
 bool alloc_cpumask_var_node(cpumask_var_t *mask, gfp_t flags, int node);
 
@@ -945,6 +947,7 @@ typedef struct cpumask cpumask_var_t[1];
 
 #define this_cpu_cpumask_var_ptr(x) this_cpu_ptr(x)
 #define __cpumask_var_read_mostly
+#define CPUMASK_NULL {}
 
 static inline bool alloc_cpumask_var(cpumask_var_t *mask, gfp_t flags)
 {
@@ -988,6 +991,8 @@ static inline bool cpumask_available(cpumask_var_t mask)
 }
 #endif /* CONFIG_CPUMASK_OFFSTACK */
 
+DEFINE_FREE(free_cpumask_var, struct cpumask *, if (_T) free_cpumask_var(_T));
+
 /* It's common to want to use cpu_all_mask in struct member initializers,
  * so it has to refer to an address rather than a pointer. */
 extern const DECLARE_BITMAP(cpu_all_bits, NR_CPUS);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 8/9] lib/group_cpus: rework group_cpus_evenly()
  2024-01-20  2:50 [PATCH v5 0/9] lib/group_cpus: rework grp_spread_init_one() and make it O(1) Yury Norov
                   ` (6 preceding siblings ...)
  2024-01-20  2:50 ` [PATCH 7/9] cpumask: define cleanup function for cpumasks Yury Norov
@ 2024-01-20  2:50 ` Yury Norov
  2024-01-20  2:50 ` [PATCH 9/9] lib/group_cpus: simplify group_cpus_evenly() for more Yury Norov
  8 siblings, 0 replies; 18+ messages in thread
From: Yury Norov @ 2024-01-20  2:50 UTC (permalink / raw)
  To: Andrew Morton, Thomas Gleixner, Ming Lei, linux-kernel
  Cc: Yury Norov, Andy Shevchenko, Breno Leitao, Nathan Chancellor,
	Rasmus Villemoes, Zi Yan

Leverage cleanup machinery and drop most of housekeeping code.
Particularly, drop unneeded and erroneously initialized with -ENOMEM
variable ret.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 lib/group_cpus.c | 79 +++++++++++++++---------------------------------
 1 file changed, 25 insertions(+), 54 deletions(-)

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index b8c0c3ae2bbd..4c09df9eb886 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -76,6 +76,8 @@ static void free_node_to_cpumask(cpumask_var_t *masks)
 	kfree(masks);
 }
 
+DEFINE_FREE(free_node_to_cpumask, cpumask_var_t *, if (_T) free_node_to_cpumask(_T));
+
 static void build_node_to_cpumask(cpumask_var_t *masks)
 {
 	int cpu;
@@ -345,26 +347,16 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
  */
 struct cpumask *group_cpus_evenly(unsigned int numgrps)
 {
-	unsigned int curgrp = 0, nr_present = 0, nr_others = 0;
-	cpumask_var_t *node_to_cpumask;
-	cpumask_var_t nmsk, npresmsk;
-	int ret = -ENOMEM;
-	struct cpumask *masks = NULL;
-
-	if (!alloc_cpumask_var(&nmsk, GFP_KERNEL))
+	cpumask_var_t *node_to_cpumask __free(free_node_to_cpumask) = alloc_node_to_cpumask();
+	struct cpumask *masks __free(kfree) = kcalloc(numgrps, sizeof(*masks), GFP_KERNEL);
+	cpumask_var_t npresmsk __free(free_cpumask_var) = CPUMASK_NULL;
+	cpumask_var_t nmsk __free(free_cpumask_var) = CPUMASK_NULL;
+	int curgrp, nr_present, nr_others;
+
+	if (!masks || !node_to_cpumask || !alloc_cpumask_var(&nmsk, GFP_KERNEL)
+			|| !alloc_cpumask_var(&npresmsk, GFP_KERNEL))
 		return NULL;
 
-	if (!alloc_cpumask_var(&npresmsk, GFP_KERNEL))
-		goto fail_nmsk;
-
-	node_to_cpumask = alloc_node_to_cpumask();
-	if (!node_to_cpumask)
-		goto fail_npresmsk;
-
-	masks = kcalloc(numgrps, sizeof(*masks), GFP_KERNEL);
-	if (!masks)
-		goto fail_node_to_cpumask;
-
 	build_node_to_cpumask(node_to_cpumask);
 
 	/*
@@ -382,11 +374,15 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
 	cpumask_copy(npresmsk, data_race(cpu_present_mask));
 
 	/* grouping present CPUs first */
-	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
-				  npresmsk, nmsk, masks);
-	if (ret < 0)
-		goto fail_build_affinity;
-	nr_present = ret;
+	nr_present = __group_cpus_evenly(0, numgrps, node_to_cpumask, npresmsk, nmsk, masks);
+	if (nr_present < 0)
+		return NULL;
+
+	/* If npresmsk is empty */
+	if (!cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk))
+		return_ptr(masks);
+
+	curgrp = nr_present < numgrps ? nr_present : 0;
 
 	/*
 	 * Allocate non present CPUs starting from the next group to be
@@ -394,38 +390,13 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
 	 * group space, assign the non present CPUs to the already
 	 * allocated out groups.
 	 */
-	if (nr_present >= numgrps)
-		curgrp = 0;
-	else
-		curgrp = nr_present;
-
-	if (cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk))
-		/* If npresmsk is not empty */
-		ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
-					  npresmsk, nmsk, masks);
-	else
-		ret = 0;
-
-	if (ret >= 0)
-		nr_others = ret;
-
- fail_build_affinity:
-	if (ret >= 0)
-		WARN_ON(nr_present + nr_others < numgrps);
-
- fail_node_to_cpumask:
-	free_node_to_cpumask(node_to_cpumask);
-
- fail_npresmsk:
-	free_cpumask_var(npresmsk);
-
- fail_nmsk:
-	free_cpumask_var(nmsk);
-	if (ret < 0) {
-		kfree(masks);
+	nr_others = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
+					npresmsk, nmsk, masks);
+	if (nr_others < 0)
 		return NULL;
-	}
-	return masks;
+
+	WARN_ON(nr_present + nr_others < numgrps);
+	return_ptr(masks);
 }
 #else /* CONFIG_SMP */
 struct cpumask *group_cpus_evenly(unsigned int numgrps)
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 9/9] lib/group_cpus: simplify group_cpus_evenly() for more
  2024-01-20  2:50 [PATCH v5 0/9] lib/group_cpus: rework grp_spread_init_one() and make it O(1) Yury Norov
                   ` (7 preceding siblings ...)
  2024-01-20  2:50 ` [PATCH 8/9] lib/group_cpus: rework group_cpus_evenly() Yury Norov
@ 2024-01-20  2:50 ` Yury Norov
  8 siblings, 0 replies; 18+ messages in thread
From: Yury Norov @ 2024-01-20  2:50 UTC (permalink / raw)
  To: Andrew Morton, Thomas Gleixner, Ming Lei, linux-kernel
  Cc: Yury Norov, Andy Shevchenko, Breno Leitao, Nathan Chancellor,
	Rasmus Villemoes, Zi Yan

The nmsk parameter is used only in helper function, so move it there.

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 lib/group_cpus.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index 4c09df9eb886..71e802fca35f 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -253,13 +253,17 @@ static void alloc_nodes_groups(unsigned int numgrps,
 static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
 			       cpumask_var_t *node_to_cpumask,
 			       const struct cpumask *cpu_mask,
-			       struct cpumask *nmsk, struct cpumask *masks)
+			       struct cpumask *masks)
 {
 	unsigned int i, n, nodes, cpus_per_grp, extra_grps, done = 0;
 	unsigned int last_grp = numgrps;
 	unsigned int curgrp = startgrp;
 	nodemask_t nodemsk = NODE_MASK_NONE;
 	struct node_groups *node_groups;
+	cpumask_var_t nmsk __free(free_cpumask_var) = CPUMASK_NULL;
+
+	if (!alloc_cpumask_var(&nmsk, GFP_KERNEL))
+		return -ENOMEM;
 
 	nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_mask, &nodemsk);
 
@@ -350,11 +354,9 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
 	cpumask_var_t *node_to_cpumask __free(free_node_to_cpumask) = alloc_node_to_cpumask();
 	struct cpumask *masks __free(kfree) = kcalloc(numgrps, sizeof(*masks), GFP_KERNEL);
 	cpumask_var_t npresmsk __free(free_cpumask_var) = CPUMASK_NULL;
-	cpumask_var_t nmsk __free(free_cpumask_var) = CPUMASK_NULL;
 	int curgrp, nr_present, nr_others;
 
-	if (!masks || !node_to_cpumask || !alloc_cpumask_var(&nmsk, GFP_KERNEL)
-			|| !alloc_cpumask_var(&npresmsk, GFP_KERNEL))
+	if (!masks || !node_to_cpumask || !alloc_cpumask_var(&npresmsk, GFP_KERNEL))
 		return NULL;
 
 	build_node_to_cpumask(node_to_cpumask);
@@ -374,7 +376,7 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
 	cpumask_copy(npresmsk, data_race(cpu_present_mask));
 
 	/* grouping present CPUs first */
-	nr_present = __group_cpus_evenly(0, numgrps, node_to_cpumask, npresmsk, nmsk, masks);
+	nr_present = __group_cpus_evenly(0, numgrps, node_to_cpumask, npresmsk, masks);
 	if (nr_present < 0)
 		return NULL;
 
@@ -390,8 +392,7 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
 	 * group space, assign the non present CPUs to the already
 	 * allocated out groups.
 	 */
-	nr_others = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
-					npresmsk, nmsk, masks);
+	nr_others = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask, npresmsk, masks);
 	if (nr_others < 0)
 		return NULL;
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 0/9] lib/group_cpus: rework grp_spread_init_one() and make it O(1)
@ 2023-12-28 20:09 Yury Norov
  2023-12-28 20:09 ` [PATCH 1/9] cpumask: introduce for_each_cpu_and_from() Yury Norov
  0 siblings, 1 reply; 18+ messages in thread
From: Yury Norov @ 2023-12-28 20:09 UTC (permalink / raw)
  To: Andrew Morton, Thomas Gleixner, Ming Lei, linux-kernel
  Cc: Yury Norov, Andy Shevchenko, Rasmus Villemoes

Hi Andrew, Ming,

Now that we've got a couple more weeks, let's try to merge this series in
the upcoming merge window? In addition to addressing the v3 comments, in
v4 I added a few more patches that simplify the code for more by using
cleanup machinery.

Thanks,
	Yury
--

grp_spread_init_one() implementation is sub-optimal because it
traverses bitmaps from the beginning, instead of picking from the
previous iteration.

Fix it and use find_bit API where appropriate. While here, optimize
cpumasks allocation and drop unneeded cpumask_empty() call.

---
v1: https://lore.kernel.org/all/ZW5MI3rKQueLM0Bz@yury-ThinkPad/T/
v2: https://lore.kernel.org/lkml/ZXKNVRu3AfvjaFhK@fedora/T/
v3: https://lore.kernel.org/all/ZYnD4Bp8R9oIz19s@yury-ThinkPad/T/
v4:
 - drop patch v3 #7 and add a comment in patch #4;
 - patch #2: fix cpus_per_grp decrement order;
 - patch #3: add NAK fro Ming Lei;
 - add patches #7-9 that simplify the code for more.

Yury Norov (9):
  cpumask: introduce for_each_cpu_and_from()
  lib/group_cpus: optimize inner loop in grp_spread_init_one()
  lib/group_cpus: relax atomicity requirement in grp_spread_init_one()
  lib/group_cpus: optimize outer loop in grp_spread_init_one()
  lib/group_cpus: don't zero cpumasks in group_cpus_evenly() on
    allocation
  lib/group_cpus: drop unneeded cpumask_empty() call in
    __group_cpus_evenly()
  cpumask: define cleanup function for cpumasks
  lib/group_cpus: rework group_cpus_evenly()
  lib/group_cpus: simplify group_cpus_evenly() for more

 include/linux/cpumask.h |  14 ++++++
 include/linux/find.h    |   3 ++
 lib/group_cpus.c        | 104 +++++++++++++++++-----------------------
 3 files changed, 62 insertions(+), 59 deletions(-)

-- 
2.40.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/9] cpumask: introduce for_each_cpu_and_from()
  2023-12-28 20:09 [PATCH v4 0/9] lib/group_cpus: rework grp_spread_init_one() and make it O(1) Yury Norov
@ 2023-12-28 20:09 ` Yury Norov
  0 siblings, 0 replies; 18+ messages in thread
From: Yury Norov @ 2023-12-28 20:09 UTC (permalink / raw)
  To: Andrew Morton, Thomas Gleixner, Ming Lei, linux-kernel
  Cc: Yury Norov, Andy Shevchenko, Rasmus Villemoes

Similarly to for_each_cpu_and(), introduce a for_each_cpu_and_from(),
which is handy when it's needed to traverse 2 cpumasks or bitmaps,
starting from a given position.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 include/linux/cpumask.h | 11 +++++++++++
 include/linux/find.h    |  3 +++
 2 files changed, 14 insertions(+)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index cfb545841a2c..73ff2e0ef090 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -332,6 +332,17 @@ unsigned int __pure cpumask_next_wrap(int n, const struct cpumask *mask, int sta
 #define for_each_cpu_and(cpu, mask1, mask2)				\
 	for_each_and_bit(cpu, cpumask_bits(mask1), cpumask_bits(mask2), small_cpumask_bits)
 
+/**
+ * for_each_cpu_and_from - iterate over every cpu in both masks starting from a given cpu
+ * @cpu: the (optionally unsigned) integer iterator
+ * @mask1: the first cpumask pointer
+ * @mask2: the second cpumask pointer
+ *
+ * After the loop, cpu is >= nr_cpu_ids.
+ */
+#define for_each_cpu_and_from(cpu, mask1, mask2)				\
+	for_each_and_bit_from(cpu, cpumask_bits(mask1), cpumask_bits(mask2), small_cpumask_bits)
+
 /**
  * for_each_cpu_andnot - iterate over every cpu present in one mask, excluding
  *			 those present in another.
diff --git a/include/linux/find.h b/include/linux/find.h
index 5e4f39ef2e72..dfd3d51ff590 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -563,6 +563,9 @@ unsigned long find_next_bit_le(const void *addr, unsigned
 	     (bit) = find_next_and_bit((addr1), (addr2), (size), (bit)), (bit) < (size);\
 	     (bit)++)
 
+#define for_each_and_bit_from(bit, addr1, addr2, size) \
+	for (; (bit) = find_next_and_bit((addr1), (addr2), (size), (bit)), (bit) < (size); (bit)++)
+
 #define for_each_andnot_bit(bit, addr1, addr2, size) \
 	for ((bit) = 0;									\
 	     (bit) = find_next_andnot_bit((addr1), (addr2), (size), (bit)), (bit) < (size);\
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2024-01-22  2:41 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-01-20  2:50 [PATCH v5 0/9] lib/group_cpus: rework grp_spread_init_one() and make it O(1) Yury Norov
2024-01-20  2:50 ` [PATCH 1/9] cpumask: introduce for_each_cpu_and_from() Yury Norov
2024-01-20  3:03   ` Ming Lei
2024-01-21 19:50     ` Yury Norov
2024-01-22  2:41       ` Ming Lei
2024-01-20  2:50 ` [PATCH 2/9] lib/group_cpus: optimize inner loop in grp_spread_init_one() Yury Norov
2024-01-20  3:17   ` Ming Lei
2024-01-20  7:03     ` Ming Lei
2024-01-20  2:50 ` [PATCH 3/9] lib/group_cpus: relax atomicity requirement " Yury Norov
2024-01-20  2:50 ` [PATCH 4/9] lib/group_cpus: optimize outer loop " Yury Norov
2024-01-20  3:51   ` Ming Lei
2024-01-20  6:17     ` Ming Lei
2024-01-20  2:50 ` [PATCH 5/9] lib/group_cpus: don't zero cpumasks in group_cpus_evenly() on allocation Yury Norov
2024-01-20  2:50 ` [PATCH 6/9] lib/group_cpus: drop unneeded cpumask_empty() call in __group_cpus_evenly() Yury Norov
2024-01-20  2:50 ` [PATCH 7/9] cpumask: define cleanup function for cpumasks Yury Norov
2024-01-20  2:50 ` [PATCH 8/9] lib/group_cpus: rework group_cpus_evenly() Yury Norov
2024-01-20  2:50 ` [PATCH 9/9] lib/group_cpus: simplify group_cpus_evenly() for more Yury Norov
  -- strict thread matches above, loose matches on Subject: below --
2023-12-28 20:09 [PATCH v4 0/9] lib/group_cpus: rework grp_spread_init_one() and make it O(1) Yury Norov
2023-12-28 20:09 ` [PATCH 1/9] cpumask: introduce for_each_cpu_and_from() Yury Norov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox