* [PATCH] cpumask: Optimize cpumask_any_but()
@ 2025-01-17 14:26 Kuan-Wei Chiu
2025-01-17 14:59 ` I Hsin Cheng
0 siblings, 1 reply; 6+ messages in thread
From: Kuan-Wei Chiu @ 2025-01-17 14:26 UTC (permalink / raw)
To: yury.norov
Cc: linux, richard120310, jserv, mark.rutland, linux-kernel,
Kuan-Wei Chiu, Yu-Chun Lin
The cpumask_any_but() function can avoid using a loop to determine the
CPU index to return. If the first set bit in the cpumask is not equal
to the specified CPU, we can directly return the index of the first set
bit. Otherwise, we return the next set bit's index.
This optimization replaces the loop with a single if statement,
allowing the compiler to generate more concise and efficient code.
As a result, the size of the bzImage built with x86 defconfig is
reduced by 4096 bytes:
* Before:
$ size arch/x86/boot/bzImage
text data bss dec hex filename
13537280 1024 0 13538304 ce9400 arch/x86/boot/bzImage
* After:
$ size arch/x86/boot/bzImage
text data bss dec hex filename
13533184 1024 0 13534208 ce8400 arch/x86/boot/bzImage
Co-developed-by: Yu-Chun Lin <eleanor15x@gmail.com>
Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
---
Not sure how to measure the efficiency difference, but I guess this
patch might be slightly more efficient or nearly the same as before. If
you have any good ideas for measuring efficiency, please let me know!
include/linux/cpumask.h | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 9278a50d514f..b769fcdbaa10 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -404,10 +404,10 @@ unsigned int cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
unsigned int i;
cpumask_check(cpu);
- for_each_cpu(i, mask)
- if (i != cpu)
- break;
- return i;
+ i = find_first_bit(cpumask_bits(mask), small_cpumask_bits);
+ if (i != cpu)
+ return i;
+ return find_next_bit(cpumask_bits(mask), small_cpumask_bits, i + 1);
}
/**
--
2.34.1
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH] cpumask: Optimize cpumask_any_but()
2025-01-17 14:26 [PATCH] cpumask: Optimize cpumask_any_but() Kuan-Wei Chiu
@ 2025-01-17 14:59 ` I Hsin Cheng
2025-01-17 16:32 ` Kuan-Wei Chiu
2025-01-17 16:32 ` Yury Norov
0 siblings, 2 replies; 6+ messages in thread
From: I Hsin Cheng @ 2025-01-17 14:59 UTC (permalink / raw)
To: Kuan-Wei Chiu
Cc: yury.norov, linux, jserv, mark.rutland, linux-kernel, eleanor15x
On Fri, Jan 17, 2025 at 10:26:58PM +0800, Kuan-Wei Chiu wrote:
> The cpumask_any_but() function can avoid using a loop to determine the
> CPU index to return. If the first set bit in the cpumask is not equal
> to the specified CPU, we can directly return the index of the first set
> bit. Otherwise, we return the next set bit's index.
>
> This optimization replaces the loop with a single if statement,
> allowing the compiler to generate more concise and efficient code.
>
> As a result, the size of the bzImage built with x86 defconfig is
> reduced by 4096 bytes:
>
> * Before:
> $ size arch/x86/boot/bzImage
> text data bss dec hex filename
> 13537280 1024 0 13538304 ce9400 arch/x86/boot/bzImage
>
> * After:
> $ size arch/x86/boot/bzImage
> text data bss dec hex filename
> 13533184 1024 0 13534208 ce8400 arch/x86/boot/bzImage
>
> Co-developed-by: Yu-Chun Lin <eleanor15x@gmail.com>
> Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
> Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
> ---
> Not sure how to measure the efficiency difference, but I guess this
> patch might be slightly more efficient or nearly the same as before. If
> you have any good ideas for measuring efficiency, please let me know!
>
> include/linux/cpumask.h | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> index 9278a50d514f..b769fcdbaa10 100644
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -404,10 +404,10 @@ unsigned int cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
> unsigned int i;
>
> cpumask_check(cpu);
> - for_each_cpu(i, mask)
> - if (i != cpu)
> - break;
> - return i;
> + i = find_first_bit(cpumask_bits(mask), small_cpumask_bits);
Hi Kuan-Wei,
How about using cpumask_first(mask) here to keep better consistency?
> + if (i != cpu)
> + return i;
Wouldn't it benefit abit to check "i >= nr_cpu_ids" prior to
find_next_bit() ? if "i >= nr_cpu_ids" holds it would be a waste to
perform find_next_bit().
> + return find_next_bit(cpumask_bits(mask), small_cpumask_bits, i + 1);
> }
>
Regards,
I Hsin
> /**
> --
> 2.34.1
>
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] cpumask: Optimize cpumask_any_but()
2025-01-17 14:59 ` I Hsin Cheng
@ 2025-01-17 16:32 ` Kuan-Wei Chiu
2025-01-17 16:32 ` Yury Norov
1 sibling, 0 replies; 6+ messages in thread
From: Kuan-Wei Chiu @ 2025-01-17 16:32 UTC (permalink / raw)
To: I Hsin Cheng
Cc: yury.norov, linux, jserv, mark.rutland, linux-kernel, eleanor15x
On Fri, Jan 17, 2025 at 10:59:31PM +0800, I Hsin Cheng wrote:
> On Fri, Jan 17, 2025 at 10:26:58PM +0800, Kuan-Wei Chiu wrote:
> > The cpumask_any_but() function can avoid using a loop to determine the
> > CPU index to return. If the first set bit in the cpumask is not equal
> > to the specified CPU, we can directly return the index of the first set
> > bit. Otherwise, we return the next set bit's index.
> >
> > This optimization replaces the loop with a single if statement,
> > allowing the compiler to generate more concise and efficient code.
> >
> > As a result, the size of the bzImage built with x86 defconfig is
> > reduced by 4096 bytes:
> >
> > * Before:
> > $ size arch/x86/boot/bzImage
> > text data bss dec hex filename
> > 13537280 1024 0 13538304 ce9400 arch/x86/boot/bzImage
> >
> > * After:
> > $ size arch/x86/boot/bzImage
> > text data bss dec hex filename
> > 13533184 1024 0 13534208 ce8400 arch/x86/boot/bzImage
> >
> > Co-developed-by: Yu-Chun Lin <eleanor15x@gmail.com>
> > Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
> > Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
> > ---
> > Not sure how to measure the efficiency difference, but I guess this
> > patch might be slightly more efficient or nearly the same as before. If
> > you have any good ideas for measuring efficiency, please let me know!
> >
> > include/linux/cpumask.h | 8 ++++----
> > 1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> > index 9278a50d514f..b769fcdbaa10 100644
> > --- a/include/linux/cpumask.h
> > +++ b/include/linux/cpumask.h
> > @@ -404,10 +404,10 @@ unsigned int cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
> > unsigned int i;
> >
> > cpumask_check(cpu);
> > - for_each_cpu(i, mask)
> > - if (i != cpu)
> > - break;
> > - return i;
> > + i = find_first_bit(cpumask_bits(mask), small_cpumask_bits);
>
> Hi Kuan-Wei,
>
> How about using cpumask_first(mask) here to keep better consistency?
>
Sure.
> > + if (i != cpu)
> > + return i;
> Wouldn't it benefit abit to check "i >= nr_cpu_ids" prior to
> find_next_bit() ? if "i >= nr_cpu_ids" holds it would be a waste to
> perform find_next_bit().
>
Hmm, adding this check increases the image size back to what it was
before the patch. Also, it only saves the execution time of
find_next_bit() when the cpumask is entirely zero. I'm not sure how
often a fully zero cpumask actually occurs in practice, but if we
assume a uniform distribution, the probability of this case would
approach zero.
Regards,
Kuan-Wei
> > + return find_next_bit(cpumask_bits(mask), small_cpumask_bits, i + 1);
> > }
> >
>
> Regards,
> I Hsin
>
> > /**
> > --
> > 2.34.1
> >
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] cpumask: Optimize cpumask_any_but()
2025-01-17 14:59 ` I Hsin Cheng
2025-01-17 16:32 ` Kuan-Wei Chiu
@ 2025-01-17 16:32 ` Yury Norov
2025-01-18 7:32 ` Kuan-Wei Chiu
1 sibling, 1 reply; 6+ messages in thread
From: Yury Norov @ 2025-01-17 16:32 UTC (permalink / raw)
To: I Hsin Cheng
Cc: Kuan-Wei Chiu, linux, jserv, mark.rutland, linux-kernel,
eleanor15x
On Fri, Jan 17, 2025 at 10:59:31PM +0800, I Hsin Cheng wrote:
> On Fri, Jan 17, 2025 at 10:26:58PM +0800, Kuan-Wei Chiu wrote:
> > The cpumask_any_but() function can avoid using a loop to determine the
> > CPU index to return. If the first set bit in the cpumask is not equal
> > to the specified CPU, we can directly return the index of the first set
> > bit. Otherwise, we return the next set bit's index.
> >
> > This optimization replaces the loop with a single if statement,
> > allowing the compiler to generate more concise and efficient code.
I thought compilers are smart enough to unroll loop in this case. Can
you show disassembled code before and after?
> >
> > As a result, the size of the bzImage built with x86 defconfig is
> > reduced by 4096 bytes:
> >
> > * Before:
> > $ size arch/x86/boot/bzImage
> > text data bss dec hex filename
> > 13537280 1024 0 13538304 ce9400 arch/x86/boot/bzImage
> >
> > * After:
> > $ size arch/x86/boot/bzImage
> > text data bss dec hex filename
> > 13533184 1024 0 13534208 ce8400 arch/x86/boot/bzImage
Comparing zipped images tells little about code generation. Please use
scripts/bloat-o-meter.
> >
> > Co-developed-by: Yu-Chun Lin <eleanor15x@gmail.com>
> > Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
> > Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
> > ---
> > Not sure how to measure the efficiency difference, but I guess this
> > patch might be slightly more efficient or nearly the same as before. If
> > you have any good ideas for measuring efficiency, please let me know!
Check lib/find_bit_benchmark.c
> >
> > include/linux/cpumask.h | 8 ++++----
> > 1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> > index 9278a50d514f..b769fcdbaa10 100644
> > --- a/include/linux/cpumask.h
> > +++ b/include/linux/cpumask.h
> > @@ -404,10 +404,10 @@ unsigned int cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
> > unsigned int i;
> >
> > cpumask_check(cpu);
> > - for_each_cpu(i, mask)
> > - if (i != cpu)
> > - break;
> > - return i;
> > + i = find_first_bit(cpumask_bits(mask), small_cpumask_bits);
>
> Hi Kuan-Wei,
>
> How about using cpumask_first(mask) here to keep better consistency?
I would do it the other way: introduce find_first_but_bit(), and then
make cpumask_any_but() a wrapper around it. Doing this you'll be able
to leverage find_bit_benchmark infrastructure to measure performance
difference, if any.
> > + if (i != cpu)
> > + return i;
> Wouldn't it benefit abit to check "i >= nr_cpu_ids" prior to
> find_next_bit() ?
Yes it would.
Thanks,
Yury
> if "i >= nr_cpu_ids" holds it would be a waste to
> perform find_next_bit().
>
> > + return find_next_bit(cpumask_bits(mask), small_cpumask_bits, i + 1);
> > }
> >
>
> Regards,
> I Hsin
>
> > /**
> > --
> > 2.34.1
> >
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] cpumask: Optimize cpumask_any_but()
2025-01-17 16:32 ` Yury Norov
@ 2025-01-18 7:32 ` Kuan-Wei Chiu
2025-01-23 22:39 ` Yury Norov
0 siblings, 1 reply; 6+ messages in thread
From: Kuan-Wei Chiu @ 2025-01-18 7:32 UTC (permalink / raw)
To: Yury Norov
Cc: I Hsin Cheng, linux, jserv, mark.rutland, linux-kernel,
eleanor15x
Hi Yury,
On Fri, Jan 17, 2025 at 11:32:54AM -0500, Yury Norov wrote:
> On Fri, Jan 17, 2025 at 10:59:31PM +0800, I Hsin Cheng wrote:
> > On Fri, Jan 17, 2025 at 10:26:58PM +0800, Kuan-Wei Chiu wrote:
> > > The cpumask_any_but() function can avoid using a loop to determine the
> > > CPU index to return. If the first set bit in the cpumask is not equal
> > > to the specified CPU, we can directly return the index of the first set
> > > bit. Otherwise, we return the next set bit's index.
> > >
> > > This optimization replaces the loop with a single if statement,
> > > allowing the compiler to generate more concise and efficient code.
>
> I thought compilers are smart enough to unroll loop in this case. Can
> you show disassembled code before and after?
>
Since cpumask_any_but() is an inline function, I added the following to
lib/cpumask.c for convenience:
unsigned int non_inline_cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
unsigned int non_inline_cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
{
return cpumask_any_but(mask, cpu);
}
I used objdump -d ./lib/cpumask.o to compare the differences.
* Before the patch:
00000000000001f0 <non_inline_cpumask_any_but>:
1f0: f3 0f 1e fa endbr64
1f4: 48 8b 3f mov (%rdi),%rdi
1f7: b8 40 00 00 00 mov $0x40,%eax
1fc: 48 85 ff test %rdi,%rdi
1ff: 74 4b je 24c <non_inline_cpumask_any_but+0x5c>
201: f3 48 0f bc d7 tzcnt %rdi,%rdx
206: 89 d0 mov %edx,%eax
208: 39 d6 cmp %edx,%esi
20a: 75 40 jne 24c <non_inline_cpumask_any_but+0x5c>
20c: 83 fa 3f cmp $0x3f,%edx
20f: 77 3b ja 24c <non_inline_cpumask_any_but+0x5c>
211: 41 b8 01 00 00 00 mov $0x1,%r8d
217: 83 c0 01 add $0x1,%eax
21a: 83 f8 40 cmp $0x40,%eax
21d: 74 2d je 24c <non_inline_cpumask_any_but+0x5c>
21f: 89 c1 mov %eax,%ecx
221: 4c 89 c2 mov %r8,%rdx
224: 48 d3 e2 shl %cl,%rdx
227: 48 89 d0 mov %rdx,%rax
22a: 48 f7 d8 neg %rax
22d: 48 21 f8 and %rdi,%rax
230: 74 15 je 247 <non_inline_cpumask_any_but+0x57>
232: f3 48 0f bc d0 tzcnt %rax,%rdx
237: 89 d0 mov %edx,%eax
239: 39 d6 cmp %edx,%esi
23b: 75 0f jne 24c <non_inline_cpumask_any_but+0x5c>
23d: 83 fa 3f cmp $0x3f,%edx
240: 76 d5 jbe 217 <non_inline_cpumask_any_but+0x27>
242: e9 00 00 00 00 jmp 247 <non_inline_cpumask_any_but+0x57>
247: b8 40 00 00 00 mov $0x40,%eax
24c: e9 00 00 00 00 jmp 251 <non_inline_cpumask_any_but+0x61>
* After the patch:
00000000000001f0 <non_inline_cpumask_any_but>:
1f0: f3 0f 1e fa endbr64
1f4: 48 8b 17 mov (%rdi),%rdx
1f7: 48 85 d2 test %rdx,%rdx
1fa: 74 34 je 230 <non_inline_cpumask_any_but+0x40>
1fc: f3 48 0f bc ca tzcnt %rdx,%rcx
201: 89 c8 mov %ecx,%eax
203: 39 ce cmp %ecx,%esi
205: 75 2e jne 235 <non_inline_cpumask_any_but+0x45>
207: 83 c1 01 add $0x1,%ecx
20a: 83 f9 3f cmp $0x3f,%ecx
20d: 77 21 ja 230 <non_inline_cpumask_any_but+0x40>
20f: 48 c7 c0 ff ff ff ff mov $0xffffffffffffffff,%rax
216: 48 d3 e0 shl %cl,%rax
219: 48 89 c1 mov %rax,%rcx
21c: b8 40 00 00 00 mov $0x40,%eax
221: 48 21 d1 and %rdx,%rcx
224: 74 0f je 235 <non_inline_cpumask_any_but+0x45>
226: f3 48 0f bc c1 tzcnt %rcx,%rax
22b: e9 00 00 00 00 jmp 230 <non_inline_cpumask_any_but+0x40>
230: b8 40 00 00 00 mov $0x40,%eax
235: e9 00 00 00 00 jmp 23a <non_inline_cpumask_any_but+0x4a>
> > >
> > > As a result, the size of the bzImage built with x86 defconfig is
> > > reduced by 4096 bytes:
> > >
> > > * Before:
> > > $ size arch/x86/boot/bzImage
> > > text data bss dec hex filename
> > > 13537280 1024 0 13538304 ce9400 arch/x86/boot/bzImage
> > >
> > > * After:
> > > $ size arch/x86/boot/bzImage
> > > text data bss dec hex filename
> > > 13533184 1024 0 13534208 ce8400 arch/x86/boot/bzImage
>
> Comparing zipped images tells little about code generation. Please use
> scripts/bloat-o-meter.
>
$ ./scripts/bloat-o-meter ./old_cpumask.o ./new_cpumask.o
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-23 (-23)
Function old new delta
non_inline_cpumask_any_but 97 74 -23
Total: Before=522, After=499, chg -4.41%
> > >
> > > Co-developed-by: Yu-Chun Lin <eleanor15x@gmail.com>
> > > Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
> > > Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
> > > ---
> > > Not sure how to measure the efficiency difference, but I guess this
> > > patch might be slightly more efficient or nearly the same as before. If
> > > you have any good ideas for measuring efficiency, please let me know!
>
> Check lib/find_bit_benchmark.c
>
> > >
> > > include/linux/cpumask.h | 8 ++++----
> > > 1 file changed, 4 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> > > index 9278a50d514f..b769fcdbaa10 100644
> > > --- a/include/linux/cpumask.h
> > > +++ b/include/linux/cpumask.h
> > > @@ -404,10 +404,10 @@ unsigned int cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
> > > unsigned int i;
> > >
> > > cpumask_check(cpu);
> > > - for_each_cpu(i, mask)
> > > - if (i != cpu)
> > > - break;
> > > - return i;
> > > + i = find_first_bit(cpumask_bits(mask), small_cpumask_bits);
> >
> > Hi Kuan-Wei,
> >
> > How about using cpumask_first(mask) here to keep better consistency?
>
> I would do it the other way: introduce find_first_but_bit(), and then
> make cpumask_any_but() a wrapper around it. Doing this you'll be able
> to leverage find_bit_benchmark infrastructure to measure performance
> difference, if any.
>
I'll try to conduct this experiment.
Regards,
Kuan-Wei
> > > + if (i != cpu)
> > > + return i;
> > Wouldn't it benefit abit to check "i >= nr_cpu_ids" prior to
> > find_next_bit() ?
>
> Yes it would.
>
> Thanks,
> Yury
>
> > if "i >= nr_cpu_ids" holds it would be a waste to
> > perform find_next_bit().
> >
> > > + return find_next_bit(cpumask_bits(mask), small_cpumask_bits, i + 1);
> > > }
> > >
> >
> > Regards,
> > I Hsin
> >
> > > /**
> > > --
> > > 2.34.1
> > >
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] cpumask: Optimize cpumask_any_but()
2025-01-18 7:32 ` Kuan-Wei Chiu
@ 2025-01-23 22:39 ` Yury Norov
0 siblings, 0 replies; 6+ messages in thread
From: Yury Norov @ 2025-01-23 22:39 UTC (permalink / raw)
To: Kuan-Wei Chiu
Cc: I Hsin Cheng, linux, jserv, mark.rutland, linux-kernel,
eleanor15x
On Sat, Jan 18, 2025 at 03:32:29PM +0800, Kuan-Wei Chiu wrote:
> Hi Yury,
>
> On Fri, Jan 17, 2025 at 11:32:54AM -0500, Yury Norov wrote:
> > On Fri, Jan 17, 2025 at 10:59:31PM +0800, I Hsin Cheng wrote:
> > > On Fri, Jan 17, 2025 at 10:26:58PM +0800, Kuan-Wei Chiu wrote:
> > > > The cpumask_any_but() function can avoid using a loop to determine the
> > > > CPU index to return. If the first set bit in the cpumask is not equal
> > > > to the specified CPU, we can directly return the index of the first set
> > > > bit. Otherwise, we return the next set bit's index.
> > > >
> > > > This optimization replaces the loop with a single if statement,
> > > > allowing the compiler to generate more concise and efficient code.
> >
> > I thought compilers are smart enough to unroll loop in this case. Can
> > you show disassembled code before and after?
> >
> Since cpumask_any_but() is an inline function, I added the following to
> lib/cpumask.c for convenience:
>
> unsigned int non_inline_cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
> unsigned int non_inline_cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
> {
> return cpumask_any_but(mask, cpu);
> }
>
> I used objdump -d ./lib/cpumask.o to compare the differences.
>
> * Before the patch:
>
> 00000000000001f0 <non_inline_cpumask_any_but>:
> 1f0: f3 0f 1e fa endbr64
> 1f4: 48 8b 3f mov (%rdi),%rdi
> 1f7: b8 40 00 00 00 mov $0x40,%eax
> 1fc: 48 85 ff test %rdi,%rdi
> 1ff: 74 4b je 24c <non_inline_cpumask_any_but+0x5c>
> 201: f3 48 0f bc d7 tzcnt %rdi,%rdx
> 206: 89 d0 mov %edx,%eax
> 208: 39 d6 cmp %edx,%esi
> 20a: 75 40 jne 24c <non_inline_cpumask_any_but+0x5c>
> 20c: 83 fa 3f cmp $0x3f,%edx
> 20f: 77 3b ja 24c <non_inline_cpumask_any_but+0x5c>
> 211: 41 b8 01 00 00 00 mov $0x1,%r8d
> 217: 83 c0 01 add $0x1,%eax
> 21a: 83 f8 40 cmp $0x40,%eax
> 21d: 74 2d je 24c <non_inline_cpumask_any_but+0x5c>
> 21f: 89 c1 mov %eax,%ecx
> 221: 4c 89 c2 mov %r8,%rdx
> 224: 48 d3 e2 shl %cl,%rdx
> 227: 48 89 d0 mov %rdx,%rax
> 22a: 48 f7 d8 neg %rax
> 22d: 48 21 f8 and %rdi,%rax
> 230: 74 15 je 247 <non_inline_cpumask_any_but+0x57>
> 232: f3 48 0f bc d0 tzcnt %rax,%rdx
> 237: 89 d0 mov %edx,%eax
> 239: 39 d6 cmp %edx,%esi
> 23b: 75 0f jne 24c <non_inline_cpumask_any_but+0x5c>
> 23d: 83 fa 3f cmp $0x3f,%edx
> 240: 76 d5 jbe 217 <non_inline_cpumask_any_but+0x27>
> 242: e9 00 00 00 00 jmp 247 <non_inline_cpumask_any_but+0x57>
> 247: b8 40 00 00 00 mov $0x40,%eax
> 24c: e9 00 00 00 00 jmp 251 <non_inline_cpumask_any_but+0x61>
>
> * After the patch:
>
> 00000000000001f0 <non_inline_cpumask_any_but>:
> 1f0: f3 0f 1e fa endbr64
> 1f4: 48 8b 17 mov (%rdi),%rdx
> 1f7: 48 85 d2 test %rdx,%rdx
> 1fa: 74 34 je 230 <non_inline_cpumask_any_but+0x40>
> 1fc: f3 48 0f bc ca tzcnt %rdx,%rcx
> 201: 89 c8 mov %ecx,%eax
> 203: 39 ce cmp %ecx,%esi
> 205: 75 2e jne 235 <non_inline_cpumask_any_but+0x45>
> 207: 83 c1 01 add $0x1,%ecx
> 20a: 83 f9 3f cmp $0x3f,%ecx
> 20d: 77 21 ja 230 <non_inline_cpumask_any_but+0x40>
> 20f: 48 c7 c0 ff ff ff ff mov $0xffffffffffffffff,%rax
> 216: 48 d3 e0 shl %cl,%rax
> 219: 48 89 c1 mov %rax,%rcx
> 21c: b8 40 00 00 00 mov $0x40,%eax
> 221: 48 21 d1 and %rdx,%rcx
> 224: 74 0f je 235 <non_inline_cpumask_any_but+0x45>
> 226: f3 48 0f bc c1 tzcnt %rcx,%rax
> 22b: e9 00 00 00 00 jmp 230 <non_inline_cpumask_any_but+0x40>
> 230: b8 40 00 00 00 mov $0x40,%eax
> 235: e9 00 00 00 00 jmp 23a <non_inline_cpumask_any_but+0x4a>
>
> > > >
> > > > As a result, the size of the bzImage built with x86 defconfig is
> > > > reduced by 4096 bytes:
> > > >
> > > > * Before:
> > > > $ size arch/x86/boot/bzImage
> > > > text data bss dec hex filename
> > > > 13537280 1024 0 13538304 ce9400 arch/x86/boot/bzImage
> > > >
> > > > * After:
> > > > $ size arch/x86/boot/bzImage
> > > > text data bss dec hex filename
> > > > 13533184 1024 0 13534208 ce8400 arch/x86/boot/bzImage
> >
> > Comparing zipped images tells little about code generation. Please use
> > scripts/bloat-o-meter.
> >
> $ ./scripts/bloat-o-meter ./old_cpumask.o ./new_cpumask.o
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-23 (-23)
> Function old new delta
> non_inline_cpumask_any_but 97 74 -23
> Total: Before=522, After=499, chg -4.41%
No need to introduce a wrapper. You need to build allyesconfig (or
defconfig) before and after your patch, and then run bloat-o-meter
against old and new vmlinux.
And specifically for cpumasks, can you please run this experiment
with NR_CPUS == 32 and NR_CPUS == 4096, for example. That way you
will test the change against small_cpumask_bits optimization.
Thanks,
Yury
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2025-01-23 22:39 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-17 14:26 [PATCH] cpumask: Optimize cpumask_any_but() Kuan-Wei Chiu
2025-01-17 14:59 ` I Hsin Cheng
2025-01-17 16:32 ` Kuan-Wei Chiu
2025-01-17 16:32 ` Yury Norov
2025-01-18 7:32 ` Kuan-Wei Chiu
2025-01-23 22:39 ` Yury Norov
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox