* [PATCH] cpumask: Optimize cpumask_any_but()
From: Kuan-Wei Chiu @ 2025-01-17 14:26 UTC
To: yury.norov
Cc: linux, richard120310, jserv, mark.rutland, linux-kernel, Kuan-Wei Chiu, Yu-Chun Lin

The cpumask_any_but() function can avoid using a loop to determine the
CPU index to return. If the first set bit in the cpumask is not equal
to the specified CPU, we can directly return the index of the first set
bit. Otherwise, we return the next set bit's index.

This optimization replaces the loop with a single if statement,
allowing the compiler to generate more concise and efficient code.

As a result, the size of the bzImage built with x86 defconfig is
reduced by 4096 bytes:

* Before:
$ size arch/x86/boot/bzImage
    text    data     bss       dec     hex filename
13537280    1024       0  13538304  ce9400 arch/x86/boot/bzImage

* After:
$ size arch/x86/boot/bzImage
    text    data     bss       dec     hex filename
13533184    1024       0  13534208  ce8400 arch/x86/boot/bzImage

Co-developed-by: Yu-Chun Lin <eleanor15x@gmail.com>
Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
---
Not sure how to measure the efficiency difference, but I guess this
patch might be slightly more efficient or nearly the same as before. If
you have any good ideas for measuring efficiency, please let me know!

 include/linux/cpumask.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 9278a50d514f..b769fcdbaa10 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -404,10 +404,10 @@ unsigned int cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
 	unsigned int i;
 
 	cpumask_check(cpu);
-	for_each_cpu(i, mask)
-		if (i != cpu)
-			break;
-	return i;
+	i = find_first_bit(cpumask_bits(mask), small_cpumask_bits);
+	if (i != cpu)
+		return i;
+	return find_next_bit(cpumask_bits(mask), small_cpumask_bits, i + 1);
 }
 
 /**
-- 
2.34.1

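The claim that the loop can be replaced by one branch rests on a small observation: the loop body can run at most twice, because the second set bit is by definition different from the first, and so different from cpu whenever the first one matched. A minimal userspace sketch of that equivalence follows. It models the mask as a single 64-bit word; NBITS, first_bit(), next_bit() and the any_but_*() names are stand-ins invented here, not kernel symbols, and __builtin_ctzll assumes GCC or Clang.

```c
#include <assert.h>

#define NBITS 64u	/* assumption: stands in for small_cpumask_bits */

/* Index of the lowest set bit, or NBITS if the word is empty. */
static unsigned int first_bit(unsigned long long word)
{
	return word ? (unsigned int)__builtin_ctzll(word) : NBITS;
}

/* Index of the lowest set bit at or above start, or NBITS if none. */
static unsigned int next_bit(unsigned long long word, unsigned int start)
{
	if (start >= NBITS)	/* also avoids an out-of-range shift */
		return NBITS;
	return first_bit(word & (~0ULL << start));
}

/* Old shape: walk the set bits until one differs from cpu. */
static unsigned int any_but_loop(unsigned long long mask, unsigned int cpu)
{
	unsigned int i;

	for (i = first_bit(mask); i < NBITS; i = next_bit(mask, i + 1))
		if (i != cpu)
			break;
	return i;
}

/* New shape: at most two searches and no loop. */
static unsigned int any_but_branch(unsigned long long mask, unsigned int cpu)
{
	unsigned int i = first_bit(mask);

	if (i != cpu)
		return i;
	return next_bit(mask, i + 1);
}

/* Compare both shapes on the empty mask and on every one- and two-bit mask. */
static void check_equivalence(void)
{
	unsigned int a, b, cpu;

	for (cpu = 0; cpu < NBITS; cpu++) {
		assert(any_but_loop(0, cpu) == any_but_branch(0, cpu));
		for (a = 0; a < NBITS; a++)
			for (b = 0; b < NBITS; b++) {
				unsigned long long mask = (1ULL << a) | (1ULL << b);

				assert(any_but_loop(mask, cpu) ==
				       any_but_branch(mask, cpu));
			}
	}
}
```

Calling check_equivalence() from a test program exercises the cases where the two shapes could plausibly diverge. It says nothing about generated code size or speed, which is what the rest of the thread is about.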
* Re: [PATCH] cpumask: Optimize cpumask_any_but()
From: I Hsin Cheng @ 2025-01-17 14:59 UTC
To: Kuan-Wei Chiu
Cc: yury.norov, linux, jserv, mark.rutland, linux-kernel, eleanor15x

On Fri, Jan 17, 2025 at 10:26:58PM +0800, Kuan-Wei Chiu wrote:
> The cpumask_any_but() function can avoid using a loop to determine the
> CPU index to return. If the first set bit in the cpumask is not equal
> to the specified CPU, we can directly return the index of the first set
> bit. Otherwise, we return the next set bit's index.
>
> This optimization replaces the loop with a single if statement,
> allowing the compiler to generate more concise and efficient code.
>
> As a result, the size of the bzImage built with x86 defconfig is
> reduced by 4096 bytes:
>
> * Before:
> $ size arch/x86/boot/bzImage
>     text    data     bss       dec     hex filename
> 13537280    1024       0  13538304  ce9400 arch/x86/boot/bzImage
>
> * After:
> $ size arch/x86/boot/bzImage
>     text    data     bss       dec     hex filename
> 13533184    1024       0  13534208  ce8400 arch/x86/boot/bzImage
>
> Co-developed-by: Yu-Chun Lin <eleanor15x@gmail.com>
> Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
> Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
> ---
> Not sure how to measure the efficiency difference, but I guess this
> patch might be slightly more efficient or nearly the same as before. If
> you have any good ideas for measuring efficiency, please let me know!
>
>  include/linux/cpumask.h | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> index 9278a50d514f..b769fcdbaa10 100644
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -404,10 +404,10 @@ unsigned int cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
>  	unsigned int i;
>
>  	cpumask_check(cpu);
> -	for_each_cpu(i, mask)
> -		if (i != cpu)
> -			break;
> -	return i;
> +	i = find_first_bit(cpumask_bits(mask), small_cpumask_bits);

Hi Kuan-Wei,

How about using cpumask_first(mask) here to keep better consistency?

> +	if (i != cpu)
> +		return i;

Wouldn't it benefit a bit to check "i >= nr_cpu_ids" prior to
find_next_bit() ? if "i >= nr_cpu_ids" holds it would be a waste to
perform find_next_bit().

> +	return find_next_bit(cpumask_bits(mask), small_cpumask_bits, i + 1);
>  }
>

Regards,
I Hsin

>  /**
> --
> 2.34.1
>

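One way to see when the proposed "i >= nr_cpu_ids" check would actually skip work is to count how often the second search runs. The sketch below is a userspace model, not kernel code: NBITS stands in for both small_cpumask_bits and nr_cpu_ids, the helper names are made up for illustration, and the position of the extra check is only one guess at what the reviewer had in mind. Reading the posted hunk, an empty mask makes the first search return a value at or past the bitmap size, which already differs from any in-range cpu, so the existing `i != cpu` test appears to return before the second search is reached. Under that reading the extra check would only change behaviour if cpu itself were out of range.

```c
#include <assert.h>

#define NBITS 64u	/* assumption: models small_cpumask_bits and nr_cpu_ids */

static unsigned int second_searches;	/* how many times next_bit() ran */

/* Index of the lowest set bit, or NBITS if the word is empty. */
static unsigned int first_bit(unsigned long long word)
{
	return word ? (unsigned int)__builtin_ctzll(word) : NBITS;
}

/* Instrumented second search: lowest set bit at or above start. */
static unsigned int next_bit(unsigned long long word, unsigned int start)
{
	second_searches++;
	if (start >= NBITS)
		return NBITS;
	return first_bit(word & (~0ULL << start));
}

/* The patched body as posted, modelled on one 64-bit word. */
static unsigned int any_but(unsigned long long mask, unsigned int cpu)
{
	unsigned int i = first_bit(mask);

	if (i != cpu)
		return i;
	return next_bit(mask, i + 1);
}

/* Variant with the suggested early exit (hypothetical placement). */
static unsigned int any_but_early_exit(unsigned long long mask, unsigned int cpu)
{
	unsigned int i = first_bit(mask);

	if (i != cpu || i >= NBITS)
		return i;
	return next_bit(mask, i + 1);
}
```

With an empty mask and an in-range cpu, both functions return NBITS and leave second_searches untouched. The second search only runs when the first set bit equals cpu, for example any_but(1ULL << 5, 5). This model cannot show what the real compiler emits, so it does not settle the code-size question raised later in the thread.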
* Re: [PATCH] cpumask: Optimize cpumask_any_but()
From: Kuan-Wei Chiu @ 2025-01-17 16:32 UTC
To: I Hsin Cheng
Cc: yury.norov, linux, jserv, mark.rutland, linux-kernel, eleanor15x

On Fri, Jan 17, 2025 at 10:59:31PM +0800, I Hsin Cheng wrote:
> On Fri, Jan 17, 2025 at 10:26:58PM +0800, Kuan-Wei Chiu wrote:
> > The cpumask_any_but() function can avoid using a loop to determine the
> > CPU index to return. If the first set bit in the cpumask is not equal
> > to the specified CPU, we can directly return the index of the first set
> > bit. Otherwise, we return the next set bit's index.
> >
> > This optimization replaces the loop with a single if statement,
> > allowing the compiler to generate more concise and efficient code.
> >
> > As a result, the size of the bzImage built with x86 defconfig is
> > reduced by 4096 bytes:
> >
> > * Before:
> > $ size arch/x86/boot/bzImage
> >     text    data     bss       dec     hex filename
> > 13537280    1024       0  13538304  ce9400 arch/x86/boot/bzImage
> >
> > * After:
> > $ size arch/x86/boot/bzImage
> >     text    data     bss       dec     hex filename
> > 13533184    1024       0  13534208  ce8400 arch/x86/boot/bzImage
> >
> > Co-developed-by: Yu-Chun Lin <eleanor15x@gmail.com>
> > Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
> > Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
> > ---
> > Not sure how to measure the efficiency difference, but I guess this
> > patch might be slightly more efficient or nearly the same as before. If
> > you have any good ideas for measuring efficiency, please let me know!
> >
> >  include/linux/cpumask.h | 8 ++++----
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> > index 9278a50d514f..b769fcdbaa10 100644
> > --- a/include/linux/cpumask.h
> > +++ b/include/linux/cpumask.h
> > @@ -404,10 +404,10 @@ unsigned int cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
> >  	unsigned int i;
> >
> >  	cpumask_check(cpu);
> > -	for_each_cpu(i, mask)
> > -		if (i != cpu)
> > -			break;
> > -	return i;
> > +	i = find_first_bit(cpumask_bits(mask), small_cpumask_bits);
>
> Hi Kuan-Wei,
>
> How about using cpumask_first(mask) here to keep better consistency?
>
Sure.

> > +	if (i != cpu)
> > +		return i;
> Wouldn't it benefit a bit to check "i >= nr_cpu_ids" prior to
> find_next_bit() ? if "i >= nr_cpu_ids" holds it would be a waste to
> perform find_next_bit().
>
Hmm, adding this check increases the image size back to what it was
before the patch. Also, it only saves the execution time of
find_next_bit() when the cpumask is entirely zero. I'm not sure how
often a fully zero cpumask actually occurs in practice, but if we
assume a uniform distribution, the probability of this case would
approach zero.

Regards,
Kuan-Wei

> > +	return find_next_bit(cpumask_bits(mask), small_cpumask_bits, i + 1);
> >  }
> >
>
> Regards,
> I Hsin
>
> >  /**
> > --
> > 2.34.1
> >

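The probability remark can be made concrete under its own stated assumption. If every one of the 2^N possible N-bit masks were equally likely (a modelling assumption taken from the message above, not something measured in this thread), the all-zero mask is one outcome among 2^N. Taking N = 64 purely as an example width:

```latex
P(\text{mask} = 0) = \frac{1}{2^{N}},
\qquad
N = 64 \;\Rightarrow\; P = 2^{-64} \approx 5.4 \times 10^{-20}
```

Real cpumasks are unlikely to be uniformly distributed, so this illustrates the argument as stated rather than how often empty masks occur in practice.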
* Re: [PATCH] cpumask: Optimize cpumask_any_but()
From: Yury Norov @ 2025-01-17 16:32 UTC
To: I Hsin Cheng
Cc: Kuan-Wei Chiu, linux, jserv, mark.rutland, linux-kernel, eleanor15x

On Fri, Jan 17, 2025 at 10:59:31PM +0800, I Hsin Cheng wrote:
> On Fri, Jan 17, 2025 at 10:26:58PM +0800, Kuan-Wei Chiu wrote:
> > The cpumask_any_but() function can avoid using a loop to determine the
> > CPU index to return. If the first set bit in the cpumask is not equal
> > to the specified CPU, we can directly return the index of the first set
> > bit. Otherwise, we return the next set bit's index.
> >
> > This optimization replaces the loop with a single if statement,
> > allowing the compiler to generate more concise and efficient code.

I thought compilers are smart enough to unroll loop in this case. Can
you show disassembled code before and after?

> >
> > As a result, the size of the bzImage built with x86 defconfig is
> > reduced by 4096 bytes:
> >
> > * Before:
> > $ size arch/x86/boot/bzImage
> >     text    data     bss       dec     hex filename
> > 13537280    1024       0  13538304  ce9400 arch/x86/boot/bzImage
> >
> > * After:
> > $ size arch/x86/boot/bzImage
> >     text    data     bss       dec     hex filename
> > 13533184    1024       0  13534208  ce8400 arch/x86/boot/bzImage

Comparing zipped images tells little about code generation. Please use
scripts/bloat-o-meter.

> >
> > Co-developed-by: Yu-Chun Lin <eleanor15x@gmail.com>
> > Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
> > Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
> > ---
> > Not sure how to measure the efficiency difference, but I guess this
> > patch might be slightly more efficient or nearly the same as before. If
> > you have any good ideas for measuring efficiency, please let me know!

Check lib/find_bit_benchmark.c

> >
> >  include/linux/cpumask.h | 8 ++++----
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> > index 9278a50d514f..b769fcdbaa10 100644
> > --- a/include/linux/cpumask.h
> > +++ b/include/linux/cpumask.h
> > @@ -404,10 +404,10 @@ unsigned int cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
> >  	unsigned int i;
> >
> >  	cpumask_check(cpu);
> > -	for_each_cpu(i, mask)
> > -		if (i != cpu)
> > -			break;
> > -	return i;
> > +	i = find_first_bit(cpumask_bits(mask), small_cpumask_bits);
>
> Hi Kuan-Wei,
>
> How about using cpumask_first(mask) here to keep better consistency?

I would do it the other way: introduce find_first_but_bit(), and then
make cpumask_any_but() a wrapper around it. Doing this you'll be able
to leverage find_bit_benchmark infrastructure to measure performance
difference, if any.

> > +	if (i != cpu)
> > +		return i;
> Wouldn't it benefit a bit to check "i >= nr_cpu_ids" prior to
> find_next_bit() ?

Yes it would.

Thanks,
Yury

> if "i >= nr_cpu_ids" holds it would be a waste to
> perform find_next_bit().
>
> > +	return find_next_bit(cpumask_bits(mask), small_cpumask_bits, i + 1);
> >  }
> >
>
> Regards,
> I Hsin
>
> >  /**
> > --
> > 2.34.1
> >

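The thread gives only a name for the proposed helper, find_first_but_bit(), so the following is a guess at what such a function might look like, written as plain userspace C. The signature, the `_sketch` suffixes, the return-size-when-not-found convention (borrowed from how the posted patch uses find_first_bit()), and the wrapper are all assumptions made here. __builtin_ctzl assumes GCC or Clang.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/*
 * Hypothetical helper: index of the lowest set bit in addr[0..size)
 * other than bit `but`, or size if there is no such bit.
 */
static unsigned long find_first_but_bit_sketch(const unsigned long *addr,
					       unsigned long size,
					       unsigned long but)
{
	unsigned long idx;

	for (idx = 0; idx * BITS_PER_WORD < size; idx++) {
		unsigned long word = addr[idx];
		unsigned long bit;

		/* Hide the excluded bit if it lives in this word. */
		if (but / BITS_PER_WORD == idx)
			word &= ~(1UL << (but % BITS_PER_WORD));
		if (!word)
			continue;

		bit = idx * BITS_PER_WORD + (unsigned long)__builtin_ctzl(word);
		/* Bits past size can only appear in the last word. */
		return bit < size ? bit : size;
	}
	return size;
}

/* cpumask_any_but() would then reduce to a thin wrapper (sketch only). */
static unsigned int cpumask_any_but_sketch(const unsigned long *bits,
					   unsigned int nbits,
					   unsigned int cpu)
{
	return (unsigned int)find_first_but_bit_sketch(bits, nbits, cpu);
}
```

Masking out the excluded bit before searching keeps the helper to a single pass over the words. Whether a real implementation would take this route or the two-search route from the patch is not decided in the thread.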
* Re: [PATCH] cpumask: Optimize cpumask_any_but()
From: Kuan-Wei Chiu @ 2025-01-18 7:32 UTC
To: Yury Norov
Cc: I Hsin Cheng, linux, jserv, mark.rutland, linux-kernel, eleanor15x

Hi Yury,

On Fri, Jan 17, 2025 at 11:32:54AM -0500, Yury Norov wrote:
> On Fri, Jan 17, 2025 at 10:59:31PM +0800, I Hsin Cheng wrote:
> > On Fri, Jan 17, 2025 at 10:26:58PM +0800, Kuan-Wei Chiu wrote:
> > > The cpumask_any_but() function can avoid using a loop to determine the
> > > CPU index to return. If the first set bit in the cpumask is not equal
> > > to the specified CPU, we can directly return the index of the first set
> > > bit. Otherwise, we return the next set bit's index.
> > >
> > > This optimization replaces the loop with a single if statement,
> > > allowing the compiler to generate more concise and efficient code.
>
> I thought compilers are smart enough to unroll loop in this case. Can
> you show disassembled code before and after?
>
Since cpumask_any_but() is an inline function, I added the following to
lib/cpumask.c for convenience:

unsigned int non_inline_cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
unsigned int non_inline_cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
{
	return cpumask_any_but(mask, cpu);
}

I used objdump -d ./lib/cpumask.o to compare the differences.

* Before the patch:

00000000000001f0 <non_inline_cpumask_any_but>:
 1f0:   f3 0f 1e fa             endbr64
 1f4:   48 8b 3f                mov    (%rdi),%rdi
 1f7:   b8 40 00 00 00          mov    $0x40,%eax
 1fc:   48 85 ff                test   %rdi,%rdi
 1ff:   74 4b                   je     24c <non_inline_cpumask_any_but+0x5c>
 201:   f3 48 0f bc d7          tzcnt  %rdi,%rdx
 206:   89 d0                   mov    %edx,%eax
 208:   39 d6                   cmp    %edx,%esi
 20a:   75 40                   jne    24c <non_inline_cpumask_any_but+0x5c>
 20c:   83 fa 3f                cmp    $0x3f,%edx
 20f:   77 3b                   ja     24c <non_inline_cpumask_any_but+0x5c>
 211:   41 b8 01 00 00 00       mov    $0x1,%r8d
 217:   83 c0 01                add    $0x1,%eax
 21a:   83 f8 40                cmp    $0x40,%eax
 21d:   74 2d                   je     24c <non_inline_cpumask_any_but+0x5c>
 21f:   89 c1                   mov    %eax,%ecx
 221:   4c 89 c2                mov    %r8,%rdx
 224:   48 d3 e2                shl    %cl,%rdx
 227:   48 89 d0                mov    %rdx,%rax
 22a:   48 f7 d8                neg    %rax
 22d:   48 21 f8                and    %rdi,%rax
 230:   74 15                   je     247 <non_inline_cpumask_any_but+0x57>
 232:   f3 48 0f bc d0          tzcnt  %rax,%rdx
 237:   89 d0                   mov    %edx,%eax
 239:   39 d6                   cmp    %edx,%esi
 23b:   75 0f                   jne    24c <non_inline_cpumask_any_but+0x5c>
 23d:   83 fa 3f                cmp    $0x3f,%edx
 240:   76 d5                   jbe    217 <non_inline_cpumask_any_but+0x27>
 242:   e9 00 00 00 00          jmp    247 <non_inline_cpumask_any_but+0x57>
 247:   b8 40 00 00 00          mov    $0x40,%eax
 24c:   e9 00 00 00 00          jmp    251 <non_inline_cpumask_any_but+0x61>

* After the patch:

00000000000001f0 <non_inline_cpumask_any_but>:
 1f0:   f3 0f 1e fa             endbr64
 1f4:   48 8b 17                mov    (%rdi),%rdx
 1f7:   48 85 d2                test   %rdx,%rdx
 1fa:   74 34                   je     230 <non_inline_cpumask_any_but+0x40>
 1fc:   f3 48 0f bc ca          tzcnt  %rdx,%rcx
 201:   89 c8                   mov    %ecx,%eax
 203:   39 ce                   cmp    %ecx,%esi
 205:   75 2e                   jne    235 <non_inline_cpumask_any_but+0x45>
 207:   83 c1 01                add    $0x1,%ecx
 20a:   83 f9 3f                cmp    $0x3f,%ecx
 20d:   77 21                   ja     230 <non_inline_cpumask_any_but+0x40>
 20f:   48 c7 c0 ff ff ff ff    mov    $0xffffffffffffffff,%rax
 216:   48 d3 e0                shl    %cl,%rax
 219:   48 89 c1                mov    %rax,%rcx
 21c:   b8 40 00 00 00          mov    $0x40,%eax
 221:   48 21 d1                and    %rdx,%rcx
 224:   74 0f                   je     235 <non_inline_cpumask_any_but+0x45>
 226:   f3 48 0f bc c1          tzcnt  %rcx,%rax
 22b:   e9 00 00 00 00          jmp    230 <non_inline_cpumask_any_but+0x40>
 230:   b8 40 00 00 00          mov    $0x40,%eax
 235:   e9 00 00 00 00          jmp    23a <non_inline_cpumask_any_but+0x4a>

> > >
> > > As a result, the size of the bzImage built with x86 defconfig is
> > > reduced by 4096 bytes:
> > >
> > > * Before:
> > > $ size arch/x86/boot/bzImage
> > >     text    data     bss       dec     hex filename
> > > 13537280    1024       0  13538304  ce9400 arch/x86/boot/bzImage
> > >
> > > * After:
> > > $ size arch/x86/boot/bzImage
> > >     text    data     bss       dec     hex filename
> > > 13533184    1024       0  13534208  ce8400 arch/x86/boot/bzImage
>
> Comparing zipped images tells little about code generation. Please use
> scripts/bloat-o-meter.
>
$ ./scripts/bloat-o-meter ./old_cpumask.o ./new_cpumask.o
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-23 (-23)
Function                                     old     new   delta
non_inline_cpumask_any_but                    97      74     -23
Total: Before=522, After=499, chg -4.41%

> > >
> > > Co-developed-by: Yu-Chun Lin <eleanor15x@gmail.com>
> > > Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
> > > Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
> > > ---
> > > Not sure how to measure the efficiency difference, but I guess this
> > > patch might be slightly more efficient or nearly the same as before. If
> > > you have any good ideas for measuring efficiency, please let me know!
>
> Check lib/find_bit_benchmark.c
>
> > >
> > >  include/linux/cpumask.h | 8 ++++----
> > >  1 file changed, 4 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> > > index 9278a50d514f..b769fcdbaa10 100644
> > > --- a/include/linux/cpumask.h
> > > +++ b/include/linux/cpumask.h
> > > @@ -404,10 +404,10 @@ unsigned int cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
> > >  	unsigned int i;
> > >
> > >  	cpumask_check(cpu);
> > > -	for_each_cpu(i, mask)
> > > -		if (i != cpu)
> > > -			break;
> > > -	return i;
> > > +	i = find_first_bit(cpumask_bits(mask), small_cpumask_bits);
> >
> > Hi Kuan-Wei,
> >
> > How about using cpumask_first(mask) here to keep better consistency?
>
> I would do it the other way: introduce find_first_but_bit(), and then
> make cpumask_any_but() a wrapper around it. Doing this you'll be able
> to leverage find_bit_benchmark infrastructure to measure performance
> difference, if any.
>
I'll try to conduct this experiment.

Regards,
Kuan-Wei

> > > +	if (i != cpu)
> > > +		return i;
> > Wouldn't it benefit a bit to check "i >= nr_cpu_ids" prior to
> > find_next_bit() ?
>
> Yes it would.
>
> Thanks,
> Yury
>
> > if "i >= nr_cpu_ids" holds it would be a waste to
> > perform find_next_bit().
> >
> > > +	return find_next_bit(cpumask_bits(mask), small_cpumask_bits, i + 1);
> > >  }
> > >
> >
> > Regards,
> > I Hsin
> >
> > >  /**
> > > --
> > > 2.34.1
> > >

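As a reading aid for the two listings in the message above: both appear to build the same "all bits at or above n" mask before the `and`, just by different routes. The "before" listing uses `mov $0x1` / `shl %cl` / `neg`, while the "after" listing uses `mov $0xffffffffffffffff` / `shl %cl`. The sketch below expresses each route in C and checks that they agree. The function names are made up here, and this is an interpretation of the disassembly rather than a statement about the kernel's own source.

```c
#include <assert.h>
#include <stdint.h>

/* "Before" route: 1 << n, then negate. Valid for n in 0..63. */
static uint64_t mask_from_neg(unsigned int n)
{
	return (uint64_t)0 - ((uint64_t)1 << n);
}

/* "After" route: all ones shifted left by n. Valid for n in 0..63. */
static uint64_t mask_from_shift(unsigned int n)
{
	return ~(uint64_t)0 << n;
}

/* Lowest set bit at or above n in one 64-bit word, or 64 if none (n < 64). */
static unsigned int next_bit_in_word(uint64_t word, unsigned int n)
{
	uint64_t hit = word & mask_from_shift(n);

	return hit ? (unsigned int)__builtin_ctzll(hit) : 64u;
}
```

Since the masks are identical, the 23-byte difference reported by bloat-o-meter would come from the removed loop-back branch and its extra bound checks rather than from the bit search itself, at least as far as these two listings show.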
* Re: [PATCH] cpumask: Optimize cpumask_any_but()
From: Yury Norov @ 2025-01-23 22:39 UTC
To: Kuan-Wei Chiu
Cc: I Hsin Cheng, linux, jserv, mark.rutland, linux-kernel, eleanor15x

On Sat, Jan 18, 2025 at 03:32:29PM +0800, Kuan-Wei Chiu wrote:
> Hi Yury,
>
> On Fri, Jan 17, 2025 at 11:32:54AM -0500, Yury Norov wrote:
> > On Fri, Jan 17, 2025 at 10:59:31PM +0800, I Hsin Cheng wrote:
> > > On Fri, Jan 17, 2025 at 10:26:58PM +0800, Kuan-Wei Chiu wrote:
> > > > The cpumask_any_but() function can avoid using a loop to determine the
> > > > CPU index to return. If the first set bit in the cpumask is not equal
> > > > to the specified CPU, we can directly return the index of the first set
> > > > bit. Otherwise, we return the next set bit's index.
> > > >
> > > > This optimization replaces the loop with a single if statement,
> > > > allowing the compiler to generate more concise and efficient code.
> >
> > I thought compilers are smart enough to unroll loop in this case. Can
> > you show disassembled code before and after?
> >
> Since cpumask_any_but() is an inline function, I added the following to
> lib/cpumask.c for convenience:
>
> unsigned int non_inline_cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
> unsigned int non_inline_cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
> {
> 	return cpumask_any_but(mask, cpu);
> }
>
> I used objdump -d ./lib/cpumask.o to compare the differences.
>
> * Before the patch:
>
> 00000000000001f0 <non_inline_cpumask_any_but>:
>  1f0:   f3 0f 1e fa             endbr64
>  1f4:   48 8b 3f                mov    (%rdi),%rdi
>  1f7:   b8 40 00 00 00          mov    $0x40,%eax
>  1fc:   48 85 ff                test   %rdi,%rdi
>  1ff:   74 4b                   je     24c <non_inline_cpumask_any_but+0x5c>
>  201:   f3 48 0f bc d7          tzcnt  %rdi,%rdx
>  206:   89 d0                   mov    %edx,%eax
>  208:   39 d6                   cmp    %edx,%esi
>  20a:   75 40                   jne    24c <non_inline_cpumask_any_but+0x5c>
>  20c:   83 fa 3f                cmp    $0x3f,%edx
>  20f:   77 3b                   ja     24c <non_inline_cpumask_any_but+0x5c>
>  211:   41 b8 01 00 00 00       mov    $0x1,%r8d
>  217:   83 c0 01                add    $0x1,%eax
>  21a:   83 f8 40                cmp    $0x40,%eax
>  21d:   74 2d                   je     24c <non_inline_cpumask_any_but+0x5c>
>  21f:   89 c1                   mov    %eax,%ecx
>  221:   4c 89 c2                mov    %r8,%rdx
>  224:   48 d3 e2                shl    %cl,%rdx
>  227:   48 89 d0                mov    %rdx,%rax
>  22a:   48 f7 d8                neg    %rax
>  22d:   48 21 f8                and    %rdi,%rax
>  230:   74 15                   je     247 <non_inline_cpumask_any_but+0x57>
>  232:   f3 48 0f bc d0          tzcnt  %rax,%rdx
>  237:   89 d0                   mov    %edx,%eax
>  239:   39 d6                   cmp    %edx,%esi
>  23b:   75 0f                   jne    24c <non_inline_cpumask_any_but+0x5c>
>  23d:   83 fa 3f                cmp    $0x3f,%edx
>  240:   76 d5                   jbe    217 <non_inline_cpumask_any_but+0x27>
>  242:   e9 00 00 00 00          jmp    247 <non_inline_cpumask_any_but+0x57>
>  247:   b8 40 00 00 00          mov    $0x40,%eax
>  24c:   e9 00 00 00 00          jmp    251 <non_inline_cpumask_any_but+0x61>
>
> * After the patch:
>
> 00000000000001f0 <non_inline_cpumask_any_but>:
>  1f0:   f3 0f 1e fa             endbr64
>  1f4:   48 8b 17                mov    (%rdi),%rdx
>  1f7:   48 85 d2                test   %rdx,%rdx
>  1fa:   74 34                   je     230 <non_inline_cpumask_any_but+0x40>
>  1fc:   f3 48 0f bc ca          tzcnt  %rdx,%rcx
>  201:   89 c8                   mov    %ecx,%eax
>  203:   39 ce                   cmp    %ecx,%esi
>  205:   75 2e                   jne    235 <non_inline_cpumask_any_but+0x45>
>  207:   83 c1 01                add    $0x1,%ecx
>  20a:   83 f9 3f                cmp    $0x3f,%ecx
>  20d:   77 21                   ja     230 <non_inline_cpumask_any_but+0x40>
>  20f:   48 c7 c0 ff ff ff ff    mov    $0xffffffffffffffff,%rax
>  216:   48 d3 e0                shl    %cl,%rax
>  219:   48 89 c1                mov    %rax,%rcx
>  21c:   b8 40 00 00 00          mov    $0x40,%eax
>  221:   48 21 d1                and    %rdx,%rcx
>  224:   74 0f                   je     235 <non_inline_cpumask_any_but+0x45>
>  226:   f3 48 0f bc c1          tzcnt  %rcx,%rax
>  22b:   e9 00 00 00 00          jmp    230 <non_inline_cpumask_any_but+0x40>
>  230:   b8 40 00 00 00          mov    $0x40,%eax
>  235:   e9 00 00 00 00          jmp    23a <non_inline_cpumask_any_but+0x4a>
>
> > > >
> > > > As a result, the size of the bzImage built with x86 defconfig is
> > > > reduced by 4096 bytes:
> > > >
> > > > * Before:
> > > > $ size arch/x86/boot/bzImage
> > > >     text    data     bss       dec     hex filename
> > > > 13537280    1024       0  13538304  ce9400 arch/x86/boot/bzImage
> > > >
> > > > * After:
> > > > $ size arch/x86/boot/bzImage
> > > >     text    data     bss       dec     hex filename
> > > > 13533184    1024       0  13534208  ce8400 arch/x86/boot/bzImage
> >
> > Comparing zipped images tells little about code generation. Please use
> > scripts/bloat-o-meter.
> >
> $ ./scripts/bloat-o-meter ./old_cpumask.o ./new_cpumask.o
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-23 (-23)
> Function                                     old     new   delta
> non_inline_cpumask_any_but                    97      74     -23
> Total: Before=522, After=499, chg -4.41%

No need to introduce a wrapper. You need to build allyesconfig (or
defconfig) before and after your patch, and then run bloat-o-meter
against old and new vmlinux.

And specifically for cpumasks, can you please run this experiment with
NR_CPUS == 32 and NR_CPUS == 4096, for example. That way you will test
the change against small_cpumask_bits optimization.

Thanks,
Yury

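The two NR_CPUS values requested above sit on opposite sides of a word boundary: 32 bits fit in one unsigned long, while 4096 bits need many. The sketch below only covers that word-count arithmetic. How include/linux/cpumask.h actually selects small_cpumask_bits in each configuration is not shown anywhere in this thread, so nothing here should be read as a description of that header. The macro and function names are invented for illustration.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG_SKETCH (sizeof(unsigned long) * CHAR_BIT)

/* Number of unsigned long words needed to hold nr_cpus bits, rounded up. */
static unsigned long mask_words(unsigned long nr_cpus)
{
	return (nr_cpus + BITS_PER_LONG_SKETCH - 1) / BITS_PER_LONG_SKETCH;
}

/* True when the whole mask fits in one word, so no per-word loop is needed. */
static int fits_single_word(unsigned long nr_cpus)
{
	return mask_words(nr_cpus) == 1;
}
```

On a 64-bit build, mask_words(32) is 1 and mask_words(4096) is 64. The single-word case is the one the listings earlier in the thread appear to show, with 0x40 as the not-found value, so a multi-word configuration would exercise code paths those listings do not cover.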
Thread overview: 6+ messages (newest: 2025-01-23 22:39 UTC)
2025-01-17 14:26 [PATCH] cpumask: Optimize cpumask_any_but() Kuan-Wei Chiu
2025-01-17 14:59 ` I Hsin Cheng
2025-01-17 16:32   ` Kuan-Wei Chiu
2025-01-17 16:32   ` Yury Norov
2025-01-18  7:32     ` Kuan-Wei Chiu
2025-01-23 22:39       ` Yury Norov