* + bitops-optimize-fns-for-improved-performance.patch added to mm-nonmm-unstable branch
@ 2024-05-02 15:18 Andrew Morton
0 siblings, 0 replies; 11+ messages in thread
From: Andrew Morton @ 2024-05-02 15:18 UTC (permalink / raw)
To: mm-commits, yury.norov, linux, jserv, visitorckw, akpm
The patch titled
Subject: bitops: optimize fns() for improved performance
has been added to the -mm mm-nonmm-unstable branch. Its filename is
bitops-optimize-fns-for-improved-performance.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/bitops-optimize-fns-for-improved-performance.patch
This patch will later appear in the mm-nonmm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Kuan-Wei Chiu <visitorckw@gmail.com>
Subject: bitops: optimize fns() for improved performance
Date: Thu, 2 May 2024 17:24:43 +0800
The current fns() repeatedly uses __ffs() to find the index of the least
significant bit and then clears the corresponding bit using __clear_bit().
The method for clearing the least significant bit can be optimized by
using word &= word - 1 instead.
Typically, the execution time of one __ffs() plus one __clear_bit() is
longer than that of a bitwise AND operation and a subtraction. To improve
performance, the loop for clearing the least significant bit has been
replaced with word &= word - 1, followed by a single __ffs() operation to
obtain the answer. This change reduces the number of __ffs() iterations
from n to just one, enhancing overall performance.
This modification significantly accelerates the fns() function in the
test_bitops benchmark, improving its speed by approximately 7.6 times.
Additionally, it enhances the performance of find_nth_bit() in the
find_bit benchmark by approximately 26%.
Before:
test_bitops: fns: 58033164 ns
find_nth_bit: 4254313 ns, 16525 iterations
After:
test_bitops: fns: 7637268 ns
find_nth_bit: 3362863 ns, 16501 iterations
Link: https://lkml.kernel.org/r/20240502092443.6845-3-visitorckw@gmail.com
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/bitops.h | 12 +++---------
1 file changed, 3 insertions(+), 9 deletions(-)
--- a/include/linux/bitops.h~bitops-optimize-fns-for-improved-performance
+++ a/include/linux/bitops.h
@@ -254,16 +254,10 @@ static inline unsigned long __ffs64(u64
*/
static inline unsigned long fns(unsigned long word, unsigned int n)
{
- unsigned int bit;
+ while (word && n--)
+ word &= word - 1;
- while (word) {
- bit = __ffs(word);
- if (n-- == 0)
- return bit;
- __clear_bit(bit, &word);
- }
-
- return BITS_PER_LONG;
+ return word ? __ffs(word) : BITS_PER_LONG;
}
/**
_
Patches currently in -mm which might be from visitorckw@gmail.com are
lib-test_bitops-add-benchmark-test-for-fns.patch
bitops-optimize-fns-for-improved-performance.patch
^ permalink raw reply [flat|nested] 11+ messages in thread* + bitops-optimize-fns-for-improved-performance.patch added to mm-nonmm-unstable branch
@ 2024-04-26 19:08 Andrew Morton
2024-04-26 19:48 ` Yury Norov
0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2024-04-26 19:08 UTC (permalink / raw)
To: mm-commits, yury.norov, jserv, visitorckw, akpm
The patch titled
Subject: bitops: optimize fns() for improved performance
has been added to the -mm mm-nonmm-unstable branch. Its filename is
bitops-optimize-fns-for-improved-performance.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/bitops-optimize-fns-for-improved-performance.patch
This patch will later appear in the mm-nonmm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Kuan-Wei Chiu <visitorckw@gmail.com>
Subject: bitops: optimize fns() for improved performance
Date: Fri, 26 Apr 2024 11:51:52 +0800
The current fns() repeatedly uses __ffs() to find the index of the least
significant bit and then clears the corresponding bit using __clear_bit().
The method for clearing the least significant bit can be optimized by
using word &= word - 1 instead.
Typically, the execution time of one __ffs() plus one __clear_bit() is
longer than that of a bitwise AND operation and a subtraction. To improve
performance, the loop for clearing the least significant bit has been
replaced with word &= word - 1, followed by a single __ffs() operation to
obtain the answer. This change reduces the number of __ffs() iterations
from n to just one, enhancing overall performance.
The following microbenchmark data, conducted on my x86-64 machine, shows
the execution time (in microseconds) required for 1000000 test data
generated by get_random_u64() and executed by fns() under different values
of n:
+-----+---------------+---------------+
| n | time_old | time_new |
+-----+---------------+---------------+
| 0 | 29194 | 25878 |
| 1 | 25510 | 25497 |
| 2 | 27836 | 25721 |
| 3 | 30140 | 25673 |
| 4 | 32569 | 25426 |
| 5 | 34792 | 25690 |
| 6 | 37117 | 25651 |
| 7 | 39742 | 25383 |
| 8 | 42360 | 25657 |
| 9 | 44672 | 25897 |
| 10 | 47237 | 25819 |
| 11 | 49884 | 26530 |
| 12 | 51864 | 26647 |
| 13 | 54265 | 28915 |
| 14 | 56440 | 28373 |
| 15 | 58839 | 28616 |
| 16 | 62383 | 29128 |
| 17 | 64257 | 30041 |
| 18 | 66805 | 29773 |
| 19 | 69368 | 33203 |
| 20 | 72942 | 33688 |
| 21 | 77006 | 34518 |
| 22 | 80926 | 34298 |
| 23 | 85723 | 35586 |
| 24 | 90324 | 36376 |
| 25 | 95992 | 37465 |
| 26 | 101101 | 37599 |
| 27 | 106520 | 37466 |
| 28 | 113287 | 38163 |
| 29 | 120552 | 38810 |
| 30 | 128040 | 39373 |
| 31 | 135624 | 40500 |
| 32 | 142580 | 40343 |
| 33 | 148915 | 40460 |
| 34 | 154005 | 41294 |
| 35 | 157996 | 41730 |
| 36 | 160806 | 41523 |
| 37 | 162975 | 42088 |
| 38 | 163426 | 41530 |
| 39 | 164872 | 41789 |
| 40 | 164477 | 42505 |
| 41 | 164758 | 41879 |
| 42 | 164182 | 41415 |
| 43 | 164842 | 42119 |
| 44 | 164881 | 42297 |
| 45 | 164870 | 42145 |
| 46 | 164673 | 42066 |
| 47 | 164616 | 42051 |
| 48 | 165055 | 41902 |
| 49 | 164847 | 41862 |
| 50 | 165171 | 41960 |
| 51 | 164851 | 42089 |
| 52 | 164763 | 41717 |
| 53 | 164635 | 42154 |
| 54 | 164757 | 41983 |
| 55 | 165095 | 41419 |
| 56 | 164641 | 42381 |
| 57 | 164601 | 41654 |
| 58 | 164864 | 41834 |
| 59 | 164594 | 41920 |
| 60 | 165207 | 42020 |
| 61 | 165056 | 41185 |
| 62 | 165160 | 41722 |
| 63 | 164923 | 41702 |
| 64 | 164777 | 41880 |
+-----+---------------+---------------+
Link: https://lkml.kernel.org/r/20240426035152.956702-1-visitorckw@gmail.com
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/bitops.h | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)
--- a/include/linux/bitops.h~bitops-optimize-fns-for-improved-performance
+++ a/include/linux/bitops.h
@@ -254,16 +254,12 @@ static inline unsigned long __ffs64(u64
*/
static inline unsigned long fns(unsigned long word, unsigned int n)
{
- unsigned int bit;
+ unsigned int i;
- while (word) {
- bit = __ffs(word);
- if (n-- == 0)
- return bit;
- __clear_bit(bit, &word);
- }
+ for (i = 0; word && i < n; i++)
+ word &= word - 1;
- return BITS_PER_LONG;
+ return word ? __ffs(word) : BITS_PER_LONG;
}
/**
_
Patches currently in -mm which might be from visitorckw@gmail.com are
bitops-optimize-fns-for-improved-performance.patch
^ permalink raw reply [flat|nested] 11+ messages in thread* Re: + bitops-optimize-fns-for-improved-performance.patch added to mm-nonmm-unstable branch 2024-04-26 19:08 Andrew Morton @ 2024-04-26 19:48 ` Yury Norov 2024-04-27 5:33 ` Kuan-Wei Chiu 0 siblings, 1 reply; 11+ messages in thread From: Yury Norov @ 2024-04-26 19:48 UTC (permalink / raw) To: Andrew Morton; +Cc: mm-commits, jserv, visitorckw On Fri, Apr 26, 2024 at 12:08 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > > The patch titled > Subject: bitops: optimize fns() for improved performance > has been added to the -mm mm-nonmm-unstable branch. Its filename is > bitops-optimize-fns-for-improved-performance.patch > > This patch will shortly appear at > https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/bitops-optimize-fns-for-improved-performance.patch > > This patch will later appear in the mm-nonmm-unstable branch at > git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm > > Before you just go and hit "reply", please: > a) Consider who else should be cc'ed > b) Prefer to cc a suitable mailing list as well > c) Ideally: find the original patch on the mailing list and do a > reply-to-all to that, adding suitable additional cc's > > *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** > > The -mm tree is included into linux-next via the mm-everything > branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm > and is updated there every 2-3 working days > > ------------------------------------------------------ > From: Kuan-Wei Chiu <visitorckw@gmail.com> > Subject: bitops: optimize fns() for improved performance > Date: Fri, 26 Apr 2024 11:51:52 +0800 > > The current fns() repeatedly uses __ffs() to find the index of the least > significant bit and then clears the corresponding bit using __clear_bit(). > The method for clearing the least significant bit can be optimized by > using word &= word - 1 instead. > > Typically, the execution time of one __ffs() plus one __clear_bit() is > longer than that of a bitwise AND operation and a subtraction. To improve > performance, the loop for clearing the least significant bit has been > replaced with word &= word - 1, followed by a single __ffs() operation to > obtain the answer. This change reduces the number of __ffs() iterations > from n to just one, enhancing overall performance. > > The following microbenchmark data, conducted on my x86-64 machine, shows > the execution time (in microseconds) required for 1000000 test data > generated by get_random_u64() and executed by fns() under different values > of n: > > +-----+---------------+---------------+ > | n | time_old | time_new | > +-----+---------------+---------------+ > | 0 | 29194 | 25878 | > | 1 | 25510 | 25497 | > | 2 | 27836 | 25721 | > | 3 | 30140 | 25673 | > | 4 | 32569 | 25426 | > | 5 | 34792 | 25690 | > | 6 | 37117 | 25651 | > | 7 | 39742 | 25383 | > | 8 | 42360 | 25657 | > | 9 | 44672 | 25897 | > | 10 | 47237 | 25819 | > | 11 | 49884 | 26530 | > | 12 | 51864 | 26647 | > | 13 | 54265 | 28915 | > | 14 | 56440 | 28373 | > | 15 | 58839 | 28616 | > | 16 | 62383 | 29128 | > | 17 | 64257 | 30041 | > | 18 | 66805 | 29773 | > | 19 | 69368 | 33203 | > | 20 | 72942 | 33688 | > | 21 | 77006 | 34518 | > | 22 | 80926 | 34298 | > | 23 | 85723 | 35586 | > | 24 | 90324 | 36376 | > | 25 | 95992 | 37465 | > | 26 | 101101 | 37599 | > | 27 | 106520 | 37466 | > | 28 | 113287 | 38163 | > | 29 | 120552 | 38810 | > | 30 | 128040 | 39373 | > | 31 | 135624 | 40500 | > | 32 | 142580 | 40343 | > | 33 | 148915 | 40460 | > | 34 | 154005 | 41294 | > | 35 | 157996 | 41730 | > | 36 | 160806 | 41523 | > | 37 | 162975 | 42088 | > | 38 | 163426 | 41530 | > | 39 | 164872 | 41789 | > | 40 | 164477 | 42505 | > | 41 | 164758 | 41879 | > | 42 | 164182 | 41415 | > | 43 | 164842 | 42119 | > | 44 | 164881 | 42297 | > | 45 | 164870 | 42145 | > | 46 | 164673 | 42066 | > | 47 | 164616 | 42051 | > | 48 | 165055 | 41902 | > | 49 | 164847 | 41862 | > | 50 | 165171 | 41960 | > | 51 | 164851 | 42089 | > | 52 | 164763 | 41717 | > | 53 | 164635 | 42154 | > | 54 | 164757 | 41983 | > | 55 | 165095 | 41419 | > | 56 | 164641 | 42381 | > | 57 | 164601 | 41654 | > | 58 | 164864 | 41834 | > | 59 | 164594 | 41920 | > | 60 | 165207 | 42020 | > | 61 | 165056 | 41185 | > | 62 | 165160 | 41722 | > | 63 | 164923 | 41702 | > | 64 | 164777 | 41880 | > +-----+---------------+---------------+ Hi Kuan-Wei, I didn't receive the original email for some reason... We've got a performance test for the function in find_bit_benchmark. Can you print before/after here? Thanks, Yury > Link: https://lkml.kernel.org/r/20240426035152.956702-1-visitorckw@gmail.com > Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> > Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> > Cc: Yury Norov <yury.norov@gmail.com> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > --- > > include/linux/bitops.h | 12 ++++-------- > 1 file changed, 4 insertions(+), 8 deletions(-) > > --- a/include/linux/bitops.h~bitops-optimize-fns-for-improved-performance > +++ a/include/linux/bitops.h > @@ -254,16 +254,12 @@ static inline unsigned long __ffs64(u64 > */ > static inline unsigned long fns(unsigned long word, unsigned int n) > { > - unsigned int bit; > + unsigned int i; > > - while (word) { > - bit = __ffs(word); > - if (n-- == 0) > - return bit; > - __clear_bit(bit, &word); > - } > + for (i = 0; word && i < n; i++) > + word &= word - 1; > > - return BITS_PER_LONG; > + return word ? __ffs(word) : BITS_PER_LONG; > } > > /** > _ > > Patches currently in -mm which might be from visitorckw@gmail.com are > > bitops-optimize-fns-for-improved-performance.patch > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: + bitops-optimize-fns-for-improved-performance.patch added to mm-nonmm-unstable branch 2024-04-26 19:48 ` Yury Norov @ 2024-04-27 5:33 ` Kuan-Wei Chiu 2024-04-28 16:08 ` Yury Norov 0 siblings, 1 reply; 11+ messages in thread From: Kuan-Wei Chiu @ 2024-04-27 5:33 UTC (permalink / raw) To: Yury Norov; +Cc: Andrew Morton, mm-commits, jserv, n26122115 On Fri, Apr 26, 2024 at 12:48:48PM -0700, Yury Norov wrote: > On Fri, Apr 26, 2024 at 12:08 PM Andrew Morton > <akpm@linux-foundation.org> wrote: > > > > > > The patch titled > > Subject: bitops: optimize fns() for improved performance > > has been added to the -mm mm-nonmm-unstable branch. Its filename is > > bitops-optimize-fns-for-improved-performance.patch > > > > This patch will shortly appear at > > https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/bitops-optimize-fns-for-improved-performance.patch > > > > This patch will later appear in the mm-nonmm-unstable branch at > > git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm > > > > Before you just go and hit "reply", please: > > a) Consider who else should be cc'ed > > b) Prefer to cc a suitable mailing list as well > > c) Ideally: find the original patch on the mailing list and do a > > reply-to-all to that, adding suitable additional cc's > > > > *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** > > > > The -mm tree is included into linux-next via the mm-everything > > branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm > > and is updated there every 2-3 working days > > > > ------------------------------------------------------ > > From: Kuan-Wei Chiu <visitorckw@gmail.com> > > Subject: bitops: optimize fns() for improved performance > > Date: Fri, 26 Apr 2024 11:51:52 +0800 > > > > The current fns() repeatedly uses __ffs() to find the index of the least > > significant bit and then clears the corresponding bit using __clear_bit(). > > The method for clearing the least significant bit can be optimized by > > using word &= word - 1 instead. > > > > Typically, the execution time of one __ffs() plus one __clear_bit() is > > longer than that of a bitwise AND operation and a subtraction. To improve > > performance, the loop for clearing the least significant bit has been > > replaced with word &= word - 1, followed by a single __ffs() operation to > > obtain the answer. This change reduces the number of __ffs() iterations > > from n to just one, enhancing overall performance. > > > > The following microbenchmark data, conducted on my x86-64 machine, shows > > the execution time (in microseconds) required for 1000000 test data > > generated by get_random_u64() and executed by fns() under different values > > of n: > > > > +-----+---------------+---------------+ > > | n | time_old | time_new | > > +-----+---------------+---------------+ > > | 0 | 29194 | 25878 | > > | 1 | 25510 | 25497 | > > | 2 | 27836 | 25721 | > > | 3 | 30140 | 25673 | > > | 4 | 32569 | 25426 | > > | 5 | 34792 | 25690 | > > | 6 | 37117 | 25651 | > > | 7 | 39742 | 25383 | > > | 8 | 42360 | 25657 | > > | 9 | 44672 | 25897 | > > | 10 | 47237 | 25819 | > > | 11 | 49884 | 26530 | > > | 12 | 51864 | 26647 | > > | 13 | 54265 | 28915 | > > | 14 | 56440 | 28373 | > > | 15 | 58839 | 28616 | > > | 16 | 62383 | 29128 | > > | 17 | 64257 | 30041 | > > | 18 | 66805 | 29773 | > > | 19 | 69368 | 33203 | > > | 20 | 72942 | 33688 | > > | 21 | 77006 | 34518 | > > | 22 | 80926 | 34298 | > > | 23 | 85723 | 35586 | > > | 24 | 90324 | 36376 | > > | 25 | 95992 | 37465 | > > | 26 | 101101 | 37599 | > > | 27 | 106520 | 37466 | > > | 28 | 113287 | 38163 | > > | 29 | 120552 | 38810 | > > | 30 | 128040 | 39373 | > > | 31 | 135624 | 40500 | > > | 32 | 142580 | 40343 | > > | 33 | 148915 | 40460 | > > | 34 | 154005 | 41294 | > > | 35 | 157996 | 41730 | > > | 36 | 160806 | 41523 | > > | 37 | 162975 | 42088 | > > | 38 | 163426 | 41530 | > > | 39 | 164872 | 41789 | > > | 40 | 164477 | 42505 | > > | 41 | 164758 | 41879 | > > | 42 | 164182 | 41415 | > > | 43 | 164842 | 42119 | > > | 44 | 164881 | 42297 | > > | 45 | 164870 | 42145 | > > | 46 | 164673 | 42066 | > > | 47 | 164616 | 42051 | > > | 48 | 165055 | 41902 | > > | 49 | 164847 | 41862 | > > | 50 | 165171 | 41960 | > > | 51 | 164851 | 42089 | > > | 52 | 164763 | 41717 | > > | 53 | 164635 | 42154 | > > | 54 | 164757 | 41983 | > > | 55 | 165095 | 41419 | > > | 56 | 164641 | 42381 | > > | 57 | 164601 | 41654 | > > | 58 | 164864 | 41834 | > > | 59 | 164594 | 41920 | > > | 60 | 165207 | 42020 | > > | 61 | 165056 | 41185 | > > | 62 | 165160 | 41722 | > > | 63 | 164923 | 41702 | > > | 64 | 164777 | 41880 | > > +-----+---------------+---------------+ > > Hi Kuan-Wei, > > I didn't receive the original email for some reason... > We've got a performance test for the function in find_bit_benchmark. > Can you print before/after here? > > Thanks, > Yury > Hi Yury, Here are the benchmark results: Before: Start testing find_bit() with random-filled bitmap [ 0.299085] fbcon: Taking over console [ 0.299820] find_next_bit: 606286 ns, 164169 iterations [ 0.300463] find_next_zero_bit: 641072 ns, 163512 iterations [ 0.300996] find_last_bit: 531027 ns, 164169 iterations [ 0.305233] find_nth_bit: 4235859 ns, 16454 iterations [ 0.306434] find_first_bit: 1199357 ns, 16455 iterations [ 0.321616] find_first_and_bit: 15179667 ns, 32869 iterations [ 0.321917] find_next_and_bit: 298836 ns, 73875 iterations [ 0.321918] Start testing find_bit() with sparse bitmap [ 0.321953] find_next_bit: 7931 ns, 656 iterations [ 0.323201] find_next_zero_bit: 1246980 ns, 327025 iterations [ 0.323210] find_last_bit: 8000 ns, 656 iterations [ 0.324427] find_nth_bit: 1213161 ns, 655 iterations [ 0.324813] find_first_bit: 384747 ns, 656 iterations [ 0.324817] find_first_and_bit: 2220 ns, 1 iterations [ 0.324820] find_next_and_bit: 1831 ns, 1 iterations After: Start testing find_bit() with random-filled bitmap [ 0.305081] fbcon: Taking over console [ 0.306126] find_next_bit: 854517 ns, 163960 iterations [ 0.307041] find_next_zero_bit: 911725 ns, 163721 iterations [ 0.307711] find_last_bit: 668261 ns, 163960 iterations [ 0.311160] find_nth_bit: 3447530 ns, 16372 iterations [ 0.312358] find_first_bit: 1196633 ns, 16373 iterations [ 0.327191] find_first_and_bit: 14830129 ns, 32951 iterations [ 0.327503] find_next_and_bit: 310560 ns, 73719 iterations [ 0.327504] Start testing find_bit() with sparse bitmap [ 0.327539] find_next_bit: 7633 ns, 656 iterations [ 0.328787] find_next_zero_bit: 1247398 ns, 327025 iterations [ 0.328797] find_last_bit: 8425 ns, 656 iterations [ 0.330034] find_nth_bit: 1234044 ns, 655 iterations [ 0.330428] find_first_bit: 392086 ns, 656 iterations [ 0.330431] find_first_and_bit: 1980 ns, 1 iterations [ 0.330434] find_next_and_bit: 1831 ns, 1 iterations Some benchmarks seem to have worsened after applying this patch. However, unless I'm mistaken, the fns() changes should only affect the results of find_nth_bit, while the others are just random fluctuations. Should I include the above benchmark data in the commit message and send a v2 patch? Additionally, I apologize for you not receiving the email. I received the following "Message not delivered" email, but I'm unsure if it's related and what caused the error: Date: Sat, 27 Apr 2024 04:29:04 +0000 (UTC) From: do-not-reply@sophosemail.com To: visitorckw@gmail.com Subject: Undelivered Mail This is an automated message from mail service of vivek.yagnik@sophosemail.com ⚠ Message not delivered ------------------ Message details ------------------ From: visitorckw@gmail.com To: vivek.yagnik@sophosemail.com Sent: 2024-04-27T04:29:03.000Z Subject: [PATCH] bitops: Optimize fns() for improved performance Failure reason: <vivek.yagnik@sophosemail.com>: host sophosemail-com.mail.protection.outlook.com[52.101.144.3] +said: 451 4.4.4 Mail received as unauthenticated, incoming to a recipient domain configured in a hosted tenant +which has no mail-enabled subscriptions. ATTR5 [MA1PEPF000072B2.INDPRD01.PROD.OUTLOOK.COM 2024-04-27T04:29:03.836Z +08DC631634A0BBEB] (in reply to end of DATA command) Regards, Kuan-Wei > > Link: https://lkml.kernel.org/r/20240426035152.956702-1-visitorckw@gmail.com > > Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> > > Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> > > Cc: Yury Norov <yury.norov@gmail.com> > > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > > --- > > > > include/linux/bitops.h | 12 ++++-------- > > 1 file changed, 4 insertions(+), 8 deletions(-) > > > > --- a/include/linux/bitops.h~bitops-optimize-fns-for-improved-performance > > +++ a/include/linux/bitops.h > > @@ -254,16 +254,12 @@ static inline unsigned long __ffs64(u64 > > */ > > static inline unsigned long fns(unsigned long word, unsigned int n) > > { > > - unsigned int bit; > > + unsigned int i; > > > > - while (word) { > > - bit = __ffs(word); > > - if (n-- == 0) > > - return bit; > > - __clear_bit(bit, &word); > > - } > > + for (i = 0; word && i < n; i++) > > + word &= word - 1; > > > > - return BITS_PER_LONG; > > + return word ? __ffs(word) : BITS_PER_LONG; > > } > > > > /** > > _ > > > > Patches currently in -mm which might be from visitorckw@gmail.com are > > > > bitops-optimize-fns-for-improved-performance.patch > > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: + bitops-optimize-fns-for-improved-performance.patch added to mm-nonmm-unstable branch 2024-04-27 5:33 ` Kuan-Wei Chiu @ 2024-04-28 16:08 ` Yury Norov 2024-04-29 8:05 ` Rasmus Villemoes 2024-04-29 15:31 ` Kuan-Wei Chiu 0 siblings, 2 replies; 11+ messages in thread From: Yury Norov @ 2024-04-28 16:08 UTC (permalink / raw) To: Kuan-Wei Chiu Cc: Andrew Morton, mm-commits, jserv, n26122115, Rasmus Villemoes + Rasmus On Sat, Apr 27, 2024 at 01:33:55PM +0800, Kuan-Wei Chiu wrote: > Before: > Start testing find_bit() with random-filled bitmap > [ 0.299085] fbcon: Taking over console > [ 0.299820] find_next_bit: 606286 ns, 164169 iterations > [ 0.300463] find_next_zero_bit: 641072 ns, 163512 iterations > [ 0.300996] find_last_bit: 531027 ns, 164169 iterations > [ 0.305233] find_nth_bit: 4235859 ns, 16454 iterations > [ 0.306434] find_first_bit: 1199357 ns, 16455 iterations > [ 0.321616] find_first_and_bit: 15179667 ns, 32869 iterations > [ 0.321917] find_next_and_bit: 298836 ns, 73875 iterations > [ 0.321918] > Start testing find_bit() with sparse bitmap > [ 0.321953] find_next_bit: 7931 ns, 656 iterations > [ 0.323201] find_next_zero_bit: 1246980 ns, 327025 iterations > [ 0.323210] find_last_bit: 8000 ns, 656 iterations > [ 0.324427] find_nth_bit: 1213161 ns, 655 iterations > [ 0.324813] find_first_bit: 384747 ns, 656 iterations > [ 0.324817] find_first_and_bit: 2220 ns, 1 iterations > [ 0.324820] find_next_and_bit: 1831 ns, 1 iterations > > After: > Start testing find_bit() with random-filled bitmap > [ 0.305081] fbcon: Taking over console > [ 0.306126] find_next_bit: 854517 ns, 163960 iterations > [ 0.307041] find_next_zero_bit: 911725 ns, 163721 iterations > [ 0.307711] find_last_bit: 668261 ns, 163960 iterations > [ 0.311160] find_nth_bit: 3447530 ns, 16372 iterations > [ 0.312358] find_first_bit: 1196633 ns, 16373 iterations > [ 0.327191] find_first_and_bit: 14830129 ns, 32951 iterations > [ 0.327503] find_next_and_bit: 310560 ns, 73719 iterations > [ 0.327504] > Start testing find_bit() with sparse bitmap > [ 0.327539] find_next_bit: 7633 ns, 656 iterations > [ 0.328787] find_next_zero_bit: 1247398 ns, 327025 iterations > [ 0.328797] find_last_bit: 8425 ns, 656 iterations > [ 0.330034] find_nth_bit: 1234044 ns, 655 iterations > [ 0.330428] find_first_bit: 392086 ns, 656 iterations > [ 0.330431] find_first_and_bit: 1980 ns, 1 iterations > [ 0.330434] find_next_and_bit: 1831 ns, 1 iterations > > Some benchmarks seem to have worsened after applying this patch. > However, unless I'm mistaken, the fns() changes should only affect the > results of find_nth_bit, while the others are just random fluctuations. So... The patch itself looks good, and indeed it should not affect anything except find_nth_bit(). But -40% for find_next_bit() and find_next_zero_bit(), and -25% for find_last_bit() don't look like a fluctuation by any measure. Looking at the numbers, I can only guess that your testing machine isn't trustworthy, which means that +18% for find_nth_bit() is not trustworthy as well. Maybe it's a sort of virtualized environment? Can you share more about your hardware? Can you make sure it's a bare metal machine and run your test compiled in the kernel? That way it will run at boot time, at least before any possible userspace payload. Can you also run test_bitmap? It has some functional testing for the function? > Should I include the above benchmark data in the commit message and > send a v2 patch? Yes, I'd like to do so. Because the current benchmark numbers look so weird, can you run it again? Maybe several times before and after, to estimate the scatter. Also please run the test_bitmap to check functional correctness. Since you already have a code testing the performance of fns(), can you add it in lib/find_bit_benchmark or lib/test_bitops in a following patch? Thanks, Yury > Additionally, I apologize for you not receiving the email. I received > the following "Message not delivered" email, but I'm unsure if it's > related and what caused the error: ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: + bitops-optimize-fns-for-improved-performance.patch added to mm-nonmm-unstable branch 2024-04-28 16:08 ` Yury Norov @ 2024-04-29 8:05 ` Rasmus Villemoes 2024-04-29 15:40 ` Kuan-Wei Chiu 2024-04-29 16:26 ` Yury Norov 2024-04-29 15:31 ` Kuan-Wei Chiu 1 sibling, 2 replies; 11+ messages in thread From: Rasmus Villemoes @ 2024-04-29 8:05 UTC (permalink / raw) To: Yury Norov, Kuan-Wei Chiu; +Cc: Andrew Morton, mm-commits, jserv, n26122115 On 28/04/2024 18.08, Yury Norov wrote: > + Rasmus > > On Sat, Apr 27, 2024 at 01:33:55PM +0800, Kuan-Wei Chiu wrote: >> Before: >> Start testing find_bit() with random-filled bitmap >> [ 0.299085] fbcon: Taking over console >> [ 0.299820] find_next_bit: 606286 ns, 164169 iterations >> [ 0.300463] find_next_zero_bit: 641072 ns, 163512 iterations >> [ 0.300996] find_last_bit: 531027 ns, 164169 iterations >> [ 0.305233] find_nth_bit: 4235859 ns, 16454 iterations >> [ 0.306434] find_first_bit: 1199357 ns, 16455 iterations >> [ 0.321616] find_first_and_bit: 15179667 ns, 32869 iterations >> [ 0.321917] find_next_and_bit: 298836 ns, 73875 iterations >> [ 0.321918] >> Start testing find_bit() with sparse bitmap >> [ 0.321953] find_next_bit: 7931 ns, 656 iterations >> [ 0.323201] find_next_zero_bit: 1246980 ns, 327025 iterations >> [ 0.323210] find_last_bit: 8000 ns, 656 iterations >> [ 0.324427] find_nth_bit: 1213161 ns, 655 iterations >> [ 0.324813] find_first_bit: 384747 ns, 656 iterations >> [ 0.324817] find_first_and_bit: 2220 ns, 1 iterations >> [ 0.324820] find_next_and_bit: 1831 ns, 1 iterations >> >> After: >> Start testing find_bit() with random-filled bitmap >> [ 0.305081] fbcon: Taking over console >> [ 0.306126] find_next_bit: 854517 ns, 163960 iterations >> [ 0.307041] find_next_zero_bit: 911725 ns, 163721 iterations >> [ 0.307711] find_last_bit: 668261 ns, 163960 iterations >> [ 0.311160] find_nth_bit: 3447530 ns, 16372 iterations >> [ 0.312358] find_first_bit: 1196633 ns, 16373 iterations >> [ 0.327191] find_first_and_bit: 14830129 ns, 32951 iterations >> [ 0.327503] find_next_and_bit: 310560 ns, 73719 iterations >> [ 0.327504] >> Start testing find_bit() with sparse bitmap >> [ 0.327539] find_next_bit: 7633 ns, 656 iterations >> [ 0.328787] find_next_zero_bit: 1247398 ns, 327025 iterations >> [ 0.328797] find_last_bit: 8425 ns, 656 iterations >> [ 0.330034] find_nth_bit: 1234044 ns, 655 iterations >> [ 0.330428] find_first_bit: 392086 ns, 656 iterations >> [ 0.330431] find_first_and_bit: 1980 ns, 1 iterations >> [ 0.330434] find_next_and_bit: 1831 ns, 1 iterations >> >> Some benchmarks seem to have worsened after applying this patch. >> However, unless I'm mistaken, the fns() changes should only affect the >> results of find_nth_bit, while the others are just random fluctuations. > > So... > > The patch itself looks good, Well, I think it could be even better. While I agree that bit=ffs(); clear_bit(bit, ) is probably a bad way of doing it, I think the basic structure of the function is good. Introducing a "count from 0 up to n" loop is rarely a good thing, keeping the n counting down to 0 is likely better. So I'd instead just change the function to static inline unsigned long fns(unsigned long word, unsigned int n) { while (word) { if (n-- == 0) return __ffs(word); word &= word - 1; } return BITS_PER_LONG; } Now that I look closer, I think the * Returns the bit number of the N'th set bit. * If no such, returns @size. in the find_nth_bit and friends' docs is wrong. Tell me what happens here: DECLARE_BITMAP(x, 12); int i; x[0] = 3; i = find_nth_bit(&x, 12, 7); So I'm asking for the seventh (counting from 0) bit set in a bitmap of 12 bits, but only two bits are set. So i should be 12? No, i will be BITS_PER_LONG, because fns() doesn't know anything about the limit of 12. Do we really not have any tests covering that? Or indeed covering any of the small_const_nbits optimizations? I've said this before, and I repeat. It was a mistake for the bitmap functions to promise a return of exactly @size when stuff is not found, they should always have just promised to return something >= @size. The checking in the callers would be just as easy (and some indeed do >= instead of ==), but the implementation can be somewhat cheaper. I'm afraid that ship has sailed, but it annoys me every time I stumble on this. Rasmus ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: + bitops-optimize-fns-for-improved-performance.patch added to mm-nonmm-unstable branch 2024-04-29 8:05 ` Rasmus Villemoes @ 2024-04-29 15:40 ` Kuan-Wei Chiu 2024-04-29 16:34 ` Yury Norov 2024-04-29 16:26 ` Yury Norov 1 sibling, 1 reply; 11+ messages in thread From: Kuan-Wei Chiu @ 2024-04-29 15:40 UTC (permalink / raw) To: Rasmus Villemoes; +Cc: Yury Norov, Andrew Morton, mm-commits, jserv, n26122115 On Mon, Apr 29, 2024 at 10:05:12AM +0200, Rasmus Villemoes wrote: > On 28/04/2024 18.08, Yury Norov wrote: > > + Rasmus > > > > On Sat, Apr 27, 2024 at 01:33:55PM +0800, Kuan-Wei Chiu wrote: > >> Before: > >> Start testing find_bit() with random-filled bitmap > >> [ 0.299085] fbcon: Taking over console > >> [ 0.299820] find_next_bit: 606286 ns, 164169 iterations > >> [ 0.300463] find_next_zero_bit: 641072 ns, 163512 iterations > >> [ 0.300996] find_last_bit: 531027 ns, 164169 iterations > >> [ 0.305233] find_nth_bit: 4235859 ns, 16454 iterations > >> [ 0.306434] find_first_bit: 1199357 ns, 16455 iterations > >> [ 0.321616] find_first_and_bit: 15179667 ns, 32869 iterations > >> [ 0.321917] find_next_and_bit: 298836 ns, 73875 iterations > >> [ 0.321918] > >> Start testing find_bit() with sparse bitmap > >> [ 0.321953] find_next_bit: 7931 ns, 656 iterations > >> [ 0.323201] find_next_zero_bit: 1246980 ns, 327025 iterations > >> [ 0.323210] find_last_bit: 8000 ns, 656 iterations > >> [ 0.324427] find_nth_bit: 1213161 ns, 655 iterations > >> [ 0.324813] find_first_bit: 384747 ns, 656 iterations > >> [ 0.324817] find_first_and_bit: 2220 ns, 1 iterations > >> [ 0.324820] find_next_and_bit: 1831 ns, 1 iterations > >> > >> After: > >> Start testing find_bit() with random-filled bitmap > >> [ 0.305081] fbcon: Taking over console > >> [ 0.306126] find_next_bit: 854517 ns, 163960 iterations > >> [ 0.307041] find_next_zero_bit: 911725 ns, 163721 iterations > >> [ 0.307711] find_last_bit: 668261 ns, 163960 iterations > >> [ 0.311160] find_nth_bit: 3447530 ns, 16372 iterations > >> [ 0.312358] find_first_bit: 1196633 ns, 16373 iterations > >> [ 0.327191] find_first_and_bit: 14830129 ns, 32951 iterations > >> [ 0.327503] find_next_and_bit: 310560 ns, 73719 iterations > >> [ 0.327504] > >> Start testing find_bit() with sparse bitmap > >> [ 0.327539] find_next_bit: 7633 ns, 656 iterations > >> [ 0.328787] find_next_zero_bit: 1247398 ns, 327025 iterations > >> [ 0.328797] find_last_bit: 8425 ns, 656 iterations > >> [ 0.330034] find_nth_bit: 1234044 ns, 655 iterations > >> [ 0.330428] find_first_bit: 392086 ns, 656 iterations > >> [ 0.330431] find_first_and_bit: 1980 ns, 1 iterations > >> [ 0.330434] find_next_and_bit: 1831 ns, 1 iterations > >> > >> Some benchmarks seem to have worsened after applying this patch. > >> However, unless I'm mistaken, the fns() changes should only affect the > >> results of find_nth_bit, while the others are just random fluctuations. > > > > So... > > > > The patch itself looks good, > > Well, I think it could be even better. While I agree that bit=ffs(); > clear_bit(bit, ) is probably a bad way of doing it, I think the basic > structure of the function is good. Introducing a "count from 0 up to n" > loop is rarely a good thing, keeping the n counting down to 0 is likely > better. > > So I'd instead just change the function to > > static inline unsigned long fns(unsigned long word, unsigned int n) > { > while (word) { > if (n-- == 0) > return __ffs(word); > word &= word - 1; > } > > return BITS_PER_LONG; > } > How about rewriting it as follows: static inline unsigned long fns(unsigned long word, unsigned int n) { while (word && n--) word &= word - 1; return word ? __ffs(word) : BITS_PER_LONG; } IMHO, this way the code can be shorter and more elegant. Regards, Kuan-Wei > > Now that I look closer, I think the > > * Returns the bit number of the N'th set bit. > * If no such, returns @size. > > in the find_nth_bit and friends' docs is wrong. Tell me what happens here: > > > DECLARE_BITMAP(x, 12); > int i; > > x[0] = 3; > i = find_nth_bit(&x, 12, 7); > > So I'm asking for the seventh (counting from 0) bit set in a bitmap of > 12 bits, but only two bits are set. So i should be 12? No, i will be > BITS_PER_LONG, because fns() doesn't know anything about the limit of > 12. Do we really not have any tests covering that? Or indeed covering > any of the small_const_nbits optimizations? > > I've said this before, and I repeat. It was a mistake for the bitmap > functions to promise a return of exactly @size when stuff is not found, > they should always have just promised to return something >= @size. The > checking in the callers would be just as easy (and some indeed do >= > instead of ==), but the implementation can be somewhat cheaper. I'm > afraid that ship has sailed, but it annoys me every time I stumble on this. > > Rasmus > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: + bitops-optimize-fns-for-improved-performance.patch added to mm-nonmm-unstable branch 2024-04-29 15:40 ` Kuan-Wei Chiu @ 2024-04-29 16:34 ` Yury Norov 2024-04-29 17:06 ` Kuan-Wei Chiu 0 siblings, 1 reply; 11+ messages in thread From: Yury Norov @ 2024-04-29 16:34 UTC (permalink / raw) To: Kuan-Wei Chiu Cc: Rasmus Villemoes, Andrew Morton, mm-commits, jserv, n26122115 On Mon, Apr 29, 2024 at 11:40:34PM +0800, Kuan-Wei Chiu wrote: > On Mon, Apr 29, 2024 at 10:05:12AM +0200, Rasmus Villemoes wrote: > > On 28/04/2024 18.08, Yury Norov wrote: > > > + Rasmus > > > > > > On Sat, Apr 27, 2024 at 01:33:55PM +0800, Kuan-Wei Chiu wrote: > > >> Before: > > >> Start testing find_bit() with random-filled bitmap > > >> [ 0.299085] fbcon: Taking over console > > >> [ 0.299820] find_next_bit: 606286 ns, 164169 iterations > > >> [ 0.300463] find_next_zero_bit: 641072 ns, 163512 iterations > > >> [ 0.300996] find_last_bit: 531027 ns, 164169 iterations > > >> [ 0.305233] find_nth_bit: 4235859 ns, 16454 iterations > > >> [ 0.306434] find_first_bit: 1199357 ns, 16455 iterations > > >> [ 0.321616] find_first_and_bit: 15179667 ns, 32869 iterations > > >> [ 0.321917] find_next_and_bit: 298836 ns, 73875 iterations > > >> [ 0.321918] > > >> Start testing find_bit() with sparse bitmap > > >> [ 0.321953] find_next_bit: 7931 ns, 656 iterations > > >> [ 0.323201] find_next_zero_bit: 1246980 ns, 327025 iterations > > >> [ 0.323210] find_last_bit: 8000 ns, 656 iterations > > >> [ 0.324427] find_nth_bit: 1213161 ns, 655 iterations > > >> [ 0.324813] find_first_bit: 384747 ns, 656 iterations > > >> [ 0.324817] find_first_and_bit: 2220 ns, 1 iterations > > >> [ 0.324820] find_next_and_bit: 1831 ns, 1 iterations > > >> > > >> After: > > >> Start testing find_bit() with random-filled bitmap > > >> [ 0.305081] fbcon: Taking over console > > >> [ 0.306126] find_next_bit: 854517 ns, 163960 iterations > > >> [ 0.307041] find_next_zero_bit: 911725 ns, 163721 iterations > > >> [ 0.307711] find_last_bit: 668261 ns, 163960 iterations > > >> [ 0.311160] find_nth_bit: 3447530 ns, 16372 iterations > > >> [ 0.312358] find_first_bit: 1196633 ns, 16373 iterations > > >> [ 0.327191] find_first_and_bit: 14830129 ns, 32951 iterations > > >> [ 0.327503] find_next_and_bit: 310560 ns, 73719 iterations > > >> [ 0.327504] > > >> Start testing find_bit() with sparse bitmap > > >> [ 0.327539] find_next_bit: 7633 ns, 656 iterations > > >> [ 0.328787] find_next_zero_bit: 1247398 ns, 327025 iterations > > >> [ 0.328797] find_last_bit: 8425 ns, 656 iterations > > >> [ 0.330034] find_nth_bit: 1234044 ns, 655 iterations > > >> [ 0.330428] find_first_bit: 392086 ns, 656 iterations > > >> [ 0.330431] find_first_and_bit: 1980 ns, 1 iterations > > >> [ 0.330434] find_next_and_bit: 1831 ns, 1 iterations > > >> > > >> Some benchmarks seem to have worsened after applying this patch. > > >> However, unless I'm mistaken, the fns() changes should only affect the > > >> results of find_nth_bit, while the others are just random fluctuations. > > > > > > So... > > > > > > The patch itself looks good, > > > > Well, I think it could be even better. While I agree that bit=ffs(); > > clear_bit(bit, ) is probably a bad way of doing it, I think the basic > > structure of the function is good. Introducing a "count from 0 up to n" > > loop is rarely a good thing, keeping the n counting down to 0 is likely > > better. > > > > So I'd instead just change the function to > > > > static inline unsigned long fns(unsigned long word, unsigned int n) > > { > > while (word) { > > if (n-- == 0) > > return __ffs(word); > > word &= word - 1; > > } > > > > return BITS_PER_LONG; > > } > > > How about rewriting it as follows: > > static inline unsigned long fns(unsigned long word, unsigned int n) > { > while (word && n--) > word &= word - 1; > > return word ? __ffs(word) : BITS_PER_LONG; > } > > IMHO, this way the code can be shorter and more elegant. Rasmus' code looks shorter because it tests the 'word != 0' only once when n == 0, for example. But we never sure how the compiler would optimize it. Can you show a disassembly of both and pick the best? Thanks, Yury ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: + bitops-optimize-fns-for-improved-performance.patch added to mm-nonmm-unstable branch 2024-04-29 16:34 ` Yury Norov @ 2024-04-29 17:06 ` Kuan-Wei Chiu 0 siblings, 0 replies; 11+ messages in thread From: Kuan-Wei Chiu @ 2024-04-29 17:06 UTC (permalink / raw) To: Yury Norov; +Cc: Rasmus Villemoes, Andrew Morton, mm-commits, jserv, n26122115 On Mon, Apr 29, 2024 at 09:34:06AM -0700, Yury Norov wrote: > On Mon, Apr 29, 2024 at 11:40:34PM +0800, Kuan-Wei Chiu wrote: > > On Mon, Apr 29, 2024 at 10:05:12AM +0200, Rasmus Villemoes wrote: > > > On 28/04/2024 18.08, Yury Norov wrote: > > > > + Rasmus > > > > > > > > On Sat, Apr 27, 2024 at 01:33:55PM +0800, Kuan-Wei Chiu wrote: > > > >> Before: > > > >> Start testing find_bit() with random-filled bitmap > > > >> [ 0.299085] fbcon: Taking over console > > > >> [ 0.299820] find_next_bit: 606286 ns, 164169 iterations > > > >> [ 0.300463] find_next_zero_bit: 641072 ns, 163512 iterations > > > >> [ 0.300996] find_last_bit: 531027 ns, 164169 iterations > > > >> [ 0.305233] find_nth_bit: 4235859 ns, 16454 iterations > > > >> [ 0.306434] find_first_bit: 1199357 ns, 16455 iterations > > > >> [ 0.321616] find_first_and_bit: 15179667 ns, 32869 iterations > > > >> [ 0.321917] find_next_and_bit: 298836 ns, 73875 iterations > > > >> [ 0.321918] > > > >> Start testing find_bit() with sparse bitmap > > > >> [ 0.321953] find_next_bit: 7931 ns, 656 iterations > > > >> [ 0.323201] find_next_zero_bit: 1246980 ns, 327025 iterations > > > >> [ 0.323210] find_last_bit: 8000 ns, 656 iterations > > > >> [ 0.324427] find_nth_bit: 1213161 ns, 655 iterations > > > >> [ 0.324813] find_first_bit: 384747 ns, 656 iterations > > > >> [ 0.324817] find_first_and_bit: 2220 ns, 1 iterations > > > >> [ 0.324820] find_next_and_bit: 1831 ns, 1 iterations > > > >> > > > >> After: > > > >> Start testing find_bit() with random-filled bitmap > > > >> [ 0.305081] fbcon: Taking over console > > > >> [ 0.306126] find_next_bit: 854517 ns, 163960 iterations > > > >> [ 0.307041] find_next_zero_bit: 911725 ns, 163721 iterations > > > >> [ 0.307711] find_last_bit: 668261 ns, 163960 iterations > > > >> [ 0.311160] find_nth_bit: 3447530 ns, 16372 iterations > > > >> [ 0.312358] find_first_bit: 1196633 ns, 16373 iterations > > > >> [ 0.327191] find_first_and_bit: 14830129 ns, 32951 iterations > > > >> [ 0.327503] find_next_and_bit: 310560 ns, 73719 iterations > > > >> [ 0.327504] > > > >> Start testing find_bit() with sparse bitmap > > > >> [ 0.327539] find_next_bit: 7633 ns, 656 iterations > > > >> [ 0.328787] find_next_zero_bit: 1247398 ns, 327025 iterations > > > >> [ 0.328797] find_last_bit: 8425 ns, 656 iterations > > > >> [ 0.330034] find_nth_bit: 1234044 ns, 655 iterations > > > >> [ 0.330428] find_first_bit: 392086 ns, 656 iterations > > > >> [ 0.330431] find_first_and_bit: 1980 ns, 1 iterations > > > >> [ 0.330434] find_next_and_bit: 1831 ns, 1 iterations > > > >> > > > >> Some benchmarks seem to have worsened after applying this patch. > > > >> However, unless I'm mistaken, the fns() changes should only affect the > > > >> results of find_nth_bit, while the others are just random fluctuations. > > > > > > > > So... > > > > > > > > The patch itself looks good, > > > > > > Well, I think it could be even better. While I agree that bit=ffs(); > > > clear_bit(bit, ) is probably a bad way of doing it, I think the basic > > > structure of the function is good. Introducing a "count from 0 up to n" > > > loop is rarely a good thing, keeping the n counting down to 0 is likely > > > better. > > > > > > So I'd instead just change the function to > > > > > > static inline unsigned long fns(unsigned long word, unsigned int n) > > > { > > > while (word) { > > > if (n-- == 0) > > > return __ffs(word); > > > word &= word - 1; > > > } > > > > > > return BITS_PER_LONG; > > > } > > > > > How about rewriting it as follows: > > > > static inline unsigned long fns(unsigned long word, unsigned int n) > > { > > while (word && n--) > > word &= word - 1; > > > > return word ? __ffs(word) : BITS_PER_LONG; > > } > > > > IMHO, this way the code can be shorter and more elegant. > > Rasmus' code looks shorter because it tests the 'word != 0' only once > when n == 0, for example. But we never sure how the compiler would > optimize it. Can you show a disassembly of both and pick the best? > > Thanks, > Yury It appears that gcc with the O2 flag is smart enough to generate the same code. The only difference in my disassembly results is the generated label names. Tested with gcc 11.4.0. Regards, Kuan-Wei ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: + bitops-optimize-fns-for-improved-performance.patch added to mm-nonmm-unstable branch 2024-04-29 8:05 ` Rasmus Villemoes 2024-04-29 15:40 ` Kuan-Wei Chiu @ 2024-04-29 16:26 ` Yury Norov 1 sibling, 0 replies; 11+ messages in thread From: Yury Norov @ 2024-04-29 16:26 UTC (permalink / raw) To: Rasmus Villemoes Cc: Kuan-Wei Chiu, Andrew Morton, mm-commits, jserv, n26122115 On Mon, Apr 29, 2024 at 10:05:12AM +0200, Rasmus Villemoes wrote: > On 28/04/2024 18.08, Yury Norov wrote: > > + Rasmus > > > > On Sat, Apr 27, 2024 at 01:33:55PM +0800, Kuan-Wei Chiu wrote: > >> Before: > >> Start testing find_bit() with random-filled bitmap > >> [ 0.299085] fbcon: Taking over console > >> [ 0.299820] find_next_bit: 606286 ns, 164169 iterations > >> [ 0.300463] find_next_zero_bit: 641072 ns, 163512 iterations > >> [ 0.300996] find_last_bit: 531027 ns, 164169 iterations > >> [ 0.305233] find_nth_bit: 4235859 ns, 16454 iterations > >> [ 0.306434] find_first_bit: 1199357 ns, 16455 iterations > >> [ 0.321616] find_first_and_bit: 15179667 ns, 32869 iterations > >> [ 0.321917] find_next_and_bit: 298836 ns, 73875 iterations > >> [ 0.321918] > >> Start testing find_bit() with sparse bitmap > >> [ 0.321953] find_next_bit: 7931 ns, 656 iterations > >> [ 0.323201] find_next_zero_bit: 1246980 ns, 327025 iterations > >> [ 0.323210] find_last_bit: 8000 ns, 656 iterations > >> [ 0.324427] find_nth_bit: 1213161 ns, 655 iterations > >> [ 0.324813] find_first_bit: 384747 ns, 656 iterations > >> [ 0.324817] find_first_and_bit: 2220 ns, 1 iterations > >> [ 0.324820] find_next_and_bit: 1831 ns, 1 iterations > >> > >> After: > >> Start testing find_bit() with random-filled bitmap > >> [ 0.305081] fbcon: Taking over console > >> [ 0.306126] find_next_bit: 854517 ns, 163960 iterations > >> [ 0.307041] find_next_zero_bit: 911725 ns, 163721 iterations > >> [ 0.307711] find_last_bit: 668261 ns, 163960 iterations > >> [ 0.311160] find_nth_bit: 3447530 ns, 16372 iterations > >> [ 0.312358] find_first_bit: 1196633 ns, 16373 iterations > >> [ 0.327191] find_first_and_bit: 14830129 ns, 32951 iterations > >> [ 0.327503] find_next_and_bit: 310560 ns, 73719 iterations > >> [ 0.327504] > >> Start testing find_bit() with sparse bitmap > >> [ 0.327539] find_next_bit: 7633 ns, 656 iterations > >> [ 0.328787] find_next_zero_bit: 1247398 ns, 327025 iterations > >> [ 0.328797] find_last_bit: 8425 ns, 656 iterations > >> [ 0.330034] find_nth_bit: 1234044 ns, 655 iterations > >> [ 0.330428] find_first_bit: 392086 ns, 656 iterations > >> [ 0.330431] find_first_and_bit: 1980 ns, 1 iterations > >> [ 0.330434] find_next_and_bit: 1831 ns, 1 iterations > >> > >> Some benchmarks seem to have worsened after applying this patch. > >> However, unless I'm mistaken, the fns() changes should only affect the > >> results of find_nth_bit, while the others are just random fluctuations. > > > > So... > > > > The patch itself looks good, > > Well, I think it could be even better. While I agree that bit=ffs(); > clear_bit(bit, ) is probably a bad way of doing it, I think the basic > structure of the function is good. Introducing a "count from 0 up to n" > loop is rarely a good thing, keeping the n counting down to 0 is likely > better. > > So I'd instead just change the function to > > static inline unsigned long fns(unsigned long word, unsigned int n) > { > while (word) { > if (n-- == 0) > return __ffs(word); > word &= word - 1; > } > > return BITS_PER_LONG; > } Agree. Even better. > Now that I look closer, I think the > > * Returns the bit number of the N'th set bit. > * If no such, returns @size. > > in the find_nth_bit and friends' docs is wrong. Tell me what happens here: > > > DECLARE_BITMAP(x, 12); > int i; > > x[0] = 3; > i = find_nth_bit(&x, 12, 7); > > So I'm asking for the seventh (counting from 0) bit set in a bitmap of > 12 bits, but only two bits are set. So i should be 12? No, i will be > BITS_PER_LONG, because fns() doesn't know anything about the limit of > 12. Do we really not have any tests covering that? Or indeed covering > any of the small_const_nbits optimizations? We test non-small_const_nbits() case in test_find_nth_bit() tests at line 257: expect_eq_uint(64 * 3 - 1, find_nth_bit(bmap, 64 * 3 - 1, 8)); And this is enforced in FIND_NTH_BIT() like this: sz = min(idx * BITS_PER_LONG + fns(tmp, nr), sz); We don't test small_const_nbits() case, and it seems violating that rule. > I've said this before, and I repeat. It was a mistake for the bitmap > functions to promise a return of exactly @size when stuff is not found, > they should always have just promised to return something >= @size. The > checking in the callers would be just as easy (and some indeed do >= > instead of ==), but the implementation can be somewhat cheaper. I'm > afraid that ship has sailed, but it annoys me every time I stumble on this. The ship has sailed for an old API. For the new one we can make new rules. Moreover, almost all kernel users of the function use it as cpumask_nth(), and in cpumask API we always claim to return >= @size if nothing is found. So, we need to either fix small_const_bit() branch, or relax the '== @size' requirement in FIND_NTH_BIT() and in the comment. I'd rather do 2nd despite it breaks consistency... Can you send a patch? Or I can do it myself if you prefer. Thanks, Yury ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: + bitops-optimize-fns-for-improved-performance.patch added to mm-nonmm-unstable branch 2024-04-28 16:08 ` Yury Norov 2024-04-29 8:05 ` Rasmus Villemoes @ 2024-04-29 15:31 ` Kuan-Wei Chiu 1 sibling, 0 replies; 11+ messages in thread From: Kuan-Wei Chiu @ 2024-04-29 15:31 UTC (permalink / raw) To: Yury Norov; +Cc: Andrew Morton, mm-commits, jserv, n26122115, Rasmus Villemoes On Sun, Apr 28, 2024 at 09:08:26AM -0700, Yury Norov wrote: > + Rasmus > > On Sat, Apr 27, 2024 at 01:33:55PM +0800, Kuan-Wei Chiu wrote: > > Before: > > Start testing find_bit() with random-filled bitmap > > [ 0.299085] fbcon: Taking over console > > [ 0.299820] find_next_bit: 606286 ns, 164169 iterations > > [ 0.300463] find_next_zero_bit: 641072 ns, 163512 iterations > > [ 0.300996] find_last_bit: 531027 ns, 164169 iterations > > [ 0.305233] find_nth_bit: 4235859 ns, 16454 iterations > > [ 0.306434] find_first_bit: 1199357 ns, 16455 iterations > > [ 0.321616] find_first_and_bit: 15179667 ns, 32869 iterations > > [ 0.321917] find_next_and_bit: 298836 ns, 73875 iterations > > [ 0.321918] > > Start testing find_bit() with sparse bitmap > > [ 0.321953] find_next_bit: 7931 ns, 656 iterations > > [ 0.323201] find_next_zero_bit: 1246980 ns, 327025 iterations > > [ 0.323210] find_last_bit: 8000 ns, 656 iterations > > [ 0.324427] find_nth_bit: 1213161 ns, 655 iterations > > [ 0.324813] find_first_bit: 384747 ns, 656 iterations > > [ 0.324817] find_first_and_bit: 2220 ns, 1 iterations > > [ 0.324820] find_next_and_bit: 1831 ns, 1 iterations > > > > After: > > Start testing find_bit() with random-filled bitmap > > [ 0.305081] fbcon: Taking over console > > [ 0.306126] find_next_bit: 854517 ns, 163960 iterations > > [ 0.307041] find_next_zero_bit: 911725 ns, 163721 iterations > > [ 0.307711] find_last_bit: 668261 ns, 163960 iterations > > [ 0.311160] find_nth_bit: 3447530 ns, 16372 iterations > > [ 0.312358] find_first_bit: 1196633 ns, 16373 iterations > > [ 0.327191] find_first_and_bit: 14830129 ns, 32951 iterations > > [ 0.327503] find_next_and_bit: 310560 ns, 73719 iterations > > [ 0.327504] > > Start testing find_bit() with sparse bitmap > > [ 0.327539] find_next_bit: 7633 ns, 656 iterations > > [ 0.328787] find_next_zero_bit: 1247398 ns, 327025 iterations > > [ 0.328797] find_last_bit: 8425 ns, 656 iterations > > [ 0.330034] find_nth_bit: 1234044 ns, 655 iterations > > [ 0.330428] find_first_bit: 392086 ns, 656 iterations > > [ 0.330431] find_first_and_bit: 1980 ns, 1 iterations > > [ 0.330434] find_next_and_bit: 1831 ns, 1 iterations > > > > Some benchmarks seem to have worsened after applying this patch. > > However, unless I'm mistaken, the fns() changes should only affect the > > results of find_nth_bit, while the others are just random fluctuations. > > So... > > The patch itself looks good, and indeed it should not affect > anything except find_nth_bit(). But -40% for find_next_bit() > and find_next_zero_bit(), and -25% for find_last_bit() don't > look like a fluctuation by any measure. > > Looking at the numbers, I can only guess that your testing machine > isn't trustworthy, which means that +18% for find_nth_bit() is not > trustworthy as well. Maybe it's a sort of virtualized environment? > I've retested the patch against Linux v6.9-rc6, running three iterations before and after applying the patch. The data now appears to be much more consistent and shows a clear efficiency improvement in the benchmark for find_nth_bit. I believe there may have been an error in my previous testing that I didn't catch. Here are the results from the three runs before and after applying the patch: Before: # run1: [ 0.295555] Start testing find_bit() with random-filled bitmap [ 0.295557] fbcon: Taking over console [ 0.296287] find_next_bit: 602426 ns, 163845 iterations [ 0.296933] find_next_zero_bit: 643572 ns, 163836 iterations [ 0.297456] find_last_bit: 521223 ns, 163844 iterations [ 0.301680] find_nth_bit: 4223086 ns, 16358 iterations [ 0.302865] find_first_bit: 1183479 ns, 16359 iterations [ 0.318173] find_first_and_bit: 15304868 ns, 32955 iterations [ 0.318473] find_next_and_bit: 297802 ns, 74031 iterations [ 0.318474] Start testing find_bit() with sparse bitmap [ 0.318508] find_next_bit: 7892 ns, 653 iterations [ 0.319758] find_next_zero_bit: 1248034 ns, 327028 iterations [ 0.319767] find_last_bit: 7818 ns, 653 iterations [ 0.320995] find_nth_bit: 1224578 ns, 652 iterations [ 0.321383] find_first_bit: 387053 ns, 653 iterations [ 0.321387] find_first_and_bit: 2005 ns, 1 iterations [ 0.321390] find_next_and_bit: 1833 ns, 1 iterations [ 0.321391] test_bitmap: loaded. [ 0.321454] test_bitmap: parselist: 14: input is '0-2047:128/256' OK, Time: 189 [ 0.321465] test_bitmap: bitmap_print_to_pagebuf: input is '0-32767 ', Time: 564 [ 0.322496] test_bitmap: all 6564 tests passed # run2: [ 0.295442] Start testing find_bit() with random-filled bitmap [ 0.295445] fbcon: Taking over console [ 0.296182] find_next_bit: 603891 ns, 164170 iterations [ 0.296825] find_next_zero_bit: 642025 ns, 163511 iterations [ 0.297357] find_last_bit: 529755 ns, 164170 iterations [ 0.301671] find_nth_bit: 4312387 ns, 16590 iterations [ 0.302876] find_first_bit: 1203730 ns, 16591 iterations [ 0.317946] find_first_and_bit: 15067894 ns, 32773 iterations [ 0.318247] find_next_and_bit: 297303 ns, 73781 iterations [ 0.318248] Start testing find_bit() with sparse bitmap [ 0.318283] find_next_bit: 8193 ns, 654 iterations [ 0.319532] find_next_zero_bit: 1247954 ns, 327027 iterations [ 0.319541] find_last_bit: 7530 ns, 654 iterations [ 0.320750] find_nth_bit: 1205862 ns, 653 iterations [ 0.321134] find_first_bit: 381843 ns, 654 iterations [ 0.321138] find_first_and_bit: 2087 ns, 1 iterations [ 0.321141] find_next_and_bit: 1843 ns, 1 iterations [ 0.321142] test_bitmap: loaded. [ 0.321207] test_bitmap: parselist: 14: input is '0-2047:128/256' OK, Time: 162 [ 0.321225] test_bitmap: bitmap_print_to_pagebuf: input is '0-32767 ', Time: 1079 [ 0.322286] test_bitmap: all 6564 tests passed # run3: [ 0.295315] Start testing find_bit() with random-filled bitmap [ 0.295318] fbcon: Taking over console [ 0.296054] find_next_bit: 604012 ns, 163612 iterations [ 0.296697] find_next_zero_bit: 641051 ns, 164069 iterations [ 0.297220] find_last_bit: 521720 ns, 163611 iterations [ 0.301476] find_nth_bit: 4254313 ns, 16525 iterations [ 0.302667] find_first_bit: 1189415 ns, 16526 iterations [ 0.317626] find_first_and_bit: 14955907 ns, 32801 iterations [ 0.317924] find_next_and_bit: 296492 ns, 73456 iterations [ 0.317925] Start testing find_bit() with sparse bitmap [ 0.317959] find_next_bit: 8187 ns, 656 iterations [ 0.319208] find_next_zero_bit: 1247622 ns, 327025 iterations [ 0.319218] find_last_bit: 8146 ns, 656 iterations [ 0.320406] find_nth_bit: 1184997 ns, 655 iterations [ 0.320784] find_first_bit: 376883 ns, 656 iterations [ 0.320788] find_first_and_bit: 2270 ns, 1 iterations [ 0.320791] find_next_and_bit: 1844 ns, 1 iterations [ 0.320793] test_bitmap: loaded. [ 0.320856] test_bitmap: parselist: 14: input is '0-2047:128/256' OK, Time: 176 [ 0.320868] test_bitmap: bitmap_print_to_pagebuf: input is '0-32767 ', Time: 624 [ 0.321903] test_bitmap: all 6564 tests passed After: # run1: [ 0.299444] Start testing find_bit() with random-filled bitmap [ 0.299446] fbcon: Taking over console [ 0.300176] find_next_bit: 601818 ns, 163940 iterations [ 0.300820] find_next_zero_bit: 642027 ns, 163741 iterations [ 0.301350] find_last_bit: 528024 ns, 163939 iterations [ 0.304680] find_nth_bit: 3329159 ns, 16479 iterations [ 0.305868] find_first_bit: 1186153 ns, 16480 iterations [ 0.320205] find_first_and_bit: 14334269 ns, 32836 iterations [ 0.320504] find_next_and_bit: 297451 ns, 73788 iterations [ 0.320505] Start testing find_bit() with sparse bitmap [ 0.320539] find_next_bit: 7372 ns, 655 iterations [ 0.321789] find_next_zero_bit: 1248188 ns, 327026 iterations [ 0.321799] find_last_bit: 8284 ns, 655 iterations [ 0.323032] find_nth_bit: 1229792 ns, 654 iterations [ 0.323422] find_first_bit: 389751 ns, 655 iterations [ 0.323426] find_first_and_bit: 2166 ns, 1 iterations [ 0.323430] find_next_and_bit: 1834 ns, 1 iterations [ 0.323431] test_bitmap: loaded. [ 0.323494] test_bitmap: parselist: 14: input is '0-2047:128/256' OK, Time: 184 [ 0.323505] test_bitmap: bitmap_print_to_pagebuf: input is '0-32767 ', Time: 503 [ 0.324560] test_bitmap: all 6564 tests passed # run2: [ 0.295331] Start testing find_bit() with random-filled bitmap [ 0.295333] fbcon: Taking over console [ 0.296066] find_next_bit: 600626 ns, 163710 iterations [ 0.296713] find_next_zero_bit: 644866 ns, 163971 iterations [ 0.297243] find_last_bit: 528601 ns, 163711 iterations [ 0.300570] find_nth_bit: 3325597 ns, 16300 iterations [ 0.301751] find_first_bit: 1179476 ns, 16301 iterations [ 0.316235] find_first_and_bit: 14480647 ns, 32619 iterations [ 0.316536] find_next_and_bit: 298766 ns, 73682 iterations [ 0.316537] Start testing find_bit() with sparse bitmap [ 0.316571] find_next_bit: 7277 ns, 656 iterations [ 0.317820] find_next_zero_bit: 1248426 ns, 327025 iterations [ 0.317829] find_last_bit: 7801 ns, 656 iterations [ 0.319052] find_nth_bit: 1219313 ns, 655 iterations [ 0.319439] find_first_bit: 386162 ns, 656 iterations [ 0.319443] find_first_and_bit: 1869 ns, 1 iterations [ 0.319446] find_next_and_bit: 1827 ns, 1 iterations [ 0.319448] test_bitmap: loaded. [ 0.319511] test_bitmap: parselist: 14: input is '0-2047:128/256' OK, Time: 172 [ 0.319522] test_bitmap: bitmap_print_to_pagebuf: input is '0-32767 ', Time: 492 [ 0.320568] test_bitmap: all 6564 tests passed # run3: [ 0.299355] Start testing find_bit() with random-filled bitmap [ 0.299357] fbcon: Taking over console [ 0.300087] find_next_bit: 601452 ns, 163877 iterations [ 0.300735] find_next_zero_bit: 646048 ns, 163804 iterations [ 0.301256] find_last_bit: 519870 ns, 163877 iterations [ 0.304620] find_nth_bit: 3362863 ns, 16501 iterations [ 0.305818] find_first_bit: 1195522 ns, 16502 iterations [ 0.320875] find_first_and_bit: 15054509 ns, 32826 iterations [ 0.321172] find_next_and_bit: 295444 ns, 73697 iterations [ 0.321173] Start testing find_bit() with sparse bitmap [ 0.321207] find_next_bit: 7366 ns, 656 iterations [ 0.322456] find_next_zero_bit: 1248232 ns, 327025 iterations [ 0.322466] find_last_bit: 8266 ns, 656 iterations [ 0.323732] find_nth_bit: 1262568 ns, 655 iterations [ 0.324133] find_first_bit: 400010 ns, 656 iterations [ 0.324137] find_first_and_bit: 2000 ns, 1 iterations [ 0.324140] find_next_and_bit: 1824 ns, 1 iterations [ 0.324141] test_bitmap: loaded. [ 0.324204] test_bitmap: parselist: 14: input is '0-2047:128/256' OK, Time: 187 [ 0.324218] test_bitmap: bitmap_print_to_pagebuf: input is '0-32767 ', Time: 540 [ 0.325256] test_bitmap: all 6564 tests passed > Can you share more about your hardware? Can you make sure it's a bare > metal machine and run your test compiled in the kernel? That way it > will run at boot time, at least before any possible userspace payload. > Can you also run test_bitmap? It has some functional testing for the > function? > Both the previous and current tests were conducted on real hardware during the boot phase. While I did test in a qemu environment previously, it was solely for validation purposes and not for performance testing. Below is the hardware information I used: $ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 39 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: GenuineIntel Model name: 11th Gen Intel(R) Core(TM) i7-11700 @ 2.50GHz CPU family: 6 Model: 167 Thread(s) per core: 2 Core(s) per socket: 8 Socket(s): 1 Stepping: 1 CPU max MHz: 4900.0000 CPU min MHz: 800.0000 BogoMIPS: 4992.00 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe sy scall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc _known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic mo vbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb stibp ibr s_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512 dq rdseed adx smap avx512ifma clflushopt intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes v pclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear flush_l1d arch_capabilities Virtualization features: Virtualization: VT-x Caches (sum of all): L1d: 384 KiB (8 instances) L1i: 256 KiB (8 instances) L2: 4 MiB (8 instances) L3: 16 MiB (1 instance) NUMA: NUMA node(s): 1 NUMA node0 CPU(s): 0-15 Vulnerabilities: Gather data sampling: Mitigation; Microcode Itlb multihit: Not affected L1tf: Not affected Mds: Not affected Meltdown: Not affected Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable Reg file data sampling: Not affected Retbleed: Mitigation; Enhanced IBRS Spec rstack overflow: Not affected Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop Srbds: Not affected Tsx async abort: Not affected > > Should I include the above benchmark data in the commit message and > > send a v2 patch? > > Yes, I'd like to do so. Because the current benchmark numbers look so > weird, can you run it again? Maybe several times before and after, to > estimate the scatter. Also please run the test_bitmap to check > functional correctness. > > Since you already have a code testing the performance of fns(), can > you add it in lib/find_bit_benchmark or lib/test_bitops in a following > patch? > Sure, I can do that. I'll include the benchmark data along with the testing code in the v2 patch series that I'll send. Regards, Kuan-Wei > > > Additionally, I apologize for you not receiving the email. I received > > the following "Message not delivered" email, but I'm unsure if it's > > related and what caused the error: ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2024-05-02 15:18 UTC | newest] Thread overview: 11+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-05-02 15:18 + bitops-optimize-fns-for-improved-performance.patch added to mm-nonmm-unstable branch Andrew Morton -- strict thread matches above, loose matches on Subject: below -- 2024-04-26 19:08 Andrew Morton 2024-04-26 19:48 ` Yury Norov 2024-04-27 5:33 ` Kuan-Wei Chiu 2024-04-28 16:08 ` Yury Norov 2024-04-29 8:05 ` Rasmus Villemoes 2024-04-29 15:40 ` Kuan-Wei Chiu 2024-04-29 16:34 ` Yury Norov 2024-04-29 17:06 ` Kuan-Wei Chiu 2024-04-29 16:26 ` Yury Norov 2024-04-29 15:31 ` Kuan-Wei Chiu
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.