* Re: Linux 5.19-rc8
       [not found] ` <CAHk-=wgYpJTMMxmfbpqc=JVtSK0Zj4b15G=AvEYk6vPNySDSsA@mail.gmail.com>
@ 2022-07-26 18:18 ` Yury Norov
  2022-07-26 18:36   ` Linus Torvalds
  0 siblings, 1 reply; 11+ messages in thread
From: Yury Norov @ 2022-07-26 18:18 UTC (permalink / raw)
To: Linus Torvalds
Cc: Dennis Zhou, Guenter Roeck, Russell King - ARM Linux, Catalin Marinas,
	Linux Kernel Mailing List, Geert Uytterhoeven, linux-m68k

+ Geert Uytterhoeven <geert@linux-m68k.org>
+ linux-m68k@lists.linux-m68k.org

On Tue, Jul 26, 2022 at 10:51:01AM -0700, Linus Torvalds wrote:
> On Tue, Jul 26, 2022 at 10:39 AM Dennis Zhou <dennis@kernel.org> wrote:
> >
> > Are we okay with adding the contract that find_*_bit() operations must
> > handle asking for past size properly? FWIW, we'd have to modify most
> > of the iterators in find.h.
>
> So I think we're ok with it, if only it makes the macros simpler.
>
> I also think we should probably look at the m68k case, because while
> that one seems to not have the bug that the arm case had, if we remove
> the arm case the m68k code is now the only non-generic case remaining.
>
> And it just makes me go "maybe we should get rid of the whole
> 'override the generic code' thing entirely?"
>
> I don't think that inlining the first word (like the m68k code does)
> is worth it, but it *is* possible that the architecture-specific
> functions generate better code for some common cases,

We have find_bit_benchmark to check how it works in practice. Would
be great if someone with access to the hardware can share numbers.

> so I think this
> is a "needs looking at the generated code" and not just a blind
> removal.
>
>             Linus

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: Linux 5.19-rc8
  2022-07-26 18:18 ` Linux 5.19-rc8 Yury Norov
@ 2022-07-26 18:36   ` Linus Torvalds
  2022-07-26 19:44     ` Russell King (Oracle)
  0 siblings, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2022-07-26 18:36 UTC (permalink / raw)
To: Yury Norov
Cc: Dennis Zhou, Guenter Roeck, Russell King - ARM Linux, Catalin Marinas,
	Linux Kernel Mailing List, Geert Uytterhoeven, linux-m68k

On Tue, Jul 26, 2022 at 11:18 AM Yury Norov <yury.norov@gmail.com> wrote:
>
> We have find_bit_benchmark to check how it works in practice. Would
> be great if someone with access to the hardware can share numbers.

Honestly, I doubt benchmarking find_bit in a loop is all that sensible.

These are helper functions that are probably seldom super-hot in cache,
and as with so many of these things, I suspect the cool-I$ numbers are
the ones that matter most in real life.

When some filesystem ends up searching for a free block or similar, it
will probably have done other things before that that mean that the L1
I$ has long been flushed, and branch history is quite possibly entirely
gone too. The same is quite possibly true for the bitmap itself in D$
too.

That said, looking at the x86 code generation (not only because I have
the build environment, but because that's where I can actually read and
make sense of the asm), the only thing that looks bad is the
conditional bswap.

So the code gen looks fairly good,

              Linus
* Re: Linux 5.19-rc8
  2022-07-26 18:36 ` Linus Torvalds
@ 2022-07-26 19:44   ` Russell King (Oracle)
  2022-07-26 20:20     ` Linus Torvalds
  0 siblings, 1 reply; 11+ messages in thread
From: Russell King (Oracle) @ 2022-07-26 19:44 UTC (permalink / raw)
To: Linus Torvalds
Cc: Yury Norov, Dennis Zhou, Guenter Roeck, Catalin Marinas,
	Linux Kernel Mailing List, Geert Uytterhoeven, linux-m68k

On Tue, Jul 26, 2022 at 11:36:21AM -0700, Linus Torvalds wrote:
> On Tue, Jul 26, 2022 at 11:18 AM Yury Norov <yury.norov@gmail.com> wrote:
> >
> > We have find_bit_benchmark to check how it works in practice. Would
> > be great if someone with access to the hardware can share numbers.
>
> Honestly, I doubt benchmarking find_bit in a loop is all that sensible.

Yes, that's what I was thinking - I've never seen it crop up in any of
the perf traces I've seen. Nevertheless, here are some numbers from a
single run of the find_bit_benchmark module, kernel built with:
arm-linux-gnueabihf-gcc (Debian 10.2.1-6) 10.2.1 20210110

Current native implementation:

[ 46.184565] Start testing find_bit() with random-filled bitmap
[ 46.195127] find_next_bit: 2440833 ns, 163112 iterations
[ 46.204226] find_next_zero_bit: 2372128 ns, 164569 iterations
[ 46.213152] find_last_bit: 2199779 ns, 163112 iterations
[ 46.299398] find_first_bit: 79526013 ns, 16234 iterations
[ 46.684026] find_first_and_bit: 377912990 ns, 32617 iterations
[ 46.692020] find_next_and_bit: 1269071 ns, 73562 iterations
[ 46.698745] Start testing find_bit() with sparse bitmap
[ 46.705711] find_next_bit: 118652 ns, 656 iterations
[ 46.716621] find_next_zero_bit: 4183472 ns, 327025 iterations
[ 46.723395] find_last_bit: 50448 ns, 656 iterations
[ 46.762308] find_first_bit: 32190802 ns, 656 iterations
[ 46.769093] find_first_and_bit: 52129 ns, 1 iterations
[ 46.775882] find_next_and_bit: 62522 ns, 1 iterations

Generic implementation:

[ 25.149238] Start testing find_bit() with random-filled bitmap
[ 25.160002] find_next_bit: 2640943 ns, 163537 iterations
[ 25.169567] find_next_zero_bit: 2838485 ns, 164144 iterations
[ 25.178595] find_last_bit: 2302372 ns, 163538 iterations
[ 25.204016] find_first_bit: 18697630 ns, 16373 iterations
[ 25.602571] find_first_and_bit: 391841480 ns, 32555 iterations
[ 25.610563] find_next_and_bit: 1260306 ns, 73587 iterations
[ 25.617295] Start testing find_bit() with sparse bitmap
[ 25.624222] find_next_bit: 70289 ns, 656 iterations
[ 25.636478] find_next_zero_bit: 5527050 ns, 327025 iterations
[ 25.643253] find_last_bit: 52147 ns, 656 iterations
[ 25.657304] find_first_bit: 7328573 ns, 656 iterations
[ 25.664087] find_first_and_bit: 48518 ns, 1 iterations
[ 25.670871] find_next_and_bit: 59750 ns, 1 iterations

Overall, I would say it's pretty similar (some generic perform
marginally better, some native perform marginally better) with the
exception of find_first_bit() being much better with the generic
implementation, but find_next_zero_bit() being noticeably worse.

So, pretty much nothing of any relevance between them, which may come
as a surprise given the byte vs word access differences between the
two implementations.

I suspect the reason may be that the native implementation code is
smaller than the generic implementation, outweighing the effects of
by-byte rather than by-word access. I would also suspect that, because
of the smaller implementation, the native version performs better in
an I$-cool situation than the generic.

Lastly, I suspect that if we fixed the bug in the native version and
converted it to use word loads, it would probably be better than the
generic version. I haven't anything to base that on other than gut
feeling at the moment, but I can make the changes to the native
implementation and see what effect that has, possibly tomorrow.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
* Re: Linux 5.19-rc8
  2022-07-26 19:44 ` Russell King (Oracle)
@ 2022-07-26 20:20   ` Linus Torvalds
  2022-07-27  0:15     ` Russell King (Oracle)
  0 siblings, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2022-07-26 20:20 UTC (permalink / raw)
To: Russell King (Oracle)
Cc: Yury Norov, Dennis Zhou, Guenter Roeck, Catalin Marinas,
	Linux Kernel Mailing List, Geert Uytterhoeven, linux-m68k

On Tue, Jul 26, 2022 at 12:44 PM Russell King (Oracle)
<linux@armlinux.org.uk> wrote:
>
> Overall, I would say it's pretty similar (some generic perform
> marginally better, some native perform marginally better) with the
> exception of find_first_bit() being much better with the generic
> implementation, but find_next_zero_bit() being noticeably worse.

The generic _find_first_bit() code is actually sane and simple. It
loops over words until it finds a non-zero one, and then does trivial
calculations on that last word.

That explains why the generic code does so much better than your
byte-wise asm.

In contrast, the generic _find_next_bit() I find almost offensively
silly - which in turn explains why your byte-wide asm does better.

I think the generic _find_next_bit() should actually do what the m68k
find_next_bit code does: handle the first special word itself, and
then just call find_first_bit() on the rest of it.

And it should *not* try to handle the dynamic "bswap and/or bit sense
invert" thing at all. That should be just four different (trivial)
cases for the first word.

              Linus
* Re: Linux 5.19-rc8
  2022-07-26 20:20 ` Linus Torvalds
@ 2022-07-27  0:15   ` Russell King (Oracle)
  2022-07-27  1:33     ` Yury Norov
  0 siblings, 1 reply; 11+ messages in thread
From: Russell King (Oracle) @ 2022-07-27  0:15 UTC (permalink / raw)
To: Linus Torvalds
Cc: Yury Norov, Dennis Zhou, Guenter Roeck, Catalin Marinas,
	Linux Kernel Mailing List, Geert Uytterhoeven, linux-m68k

On Tue, Jul 26, 2022 at 01:20:23PM -0700, Linus Torvalds wrote:
> On Tue, Jul 26, 2022 at 12:44 PM Russell King (Oracle)
> <linux@armlinux.org.uk> wrote:
> >
> > Overall, I would say it's pretty similar (some generic perform
> > marginally better, some native perform marginally better) with the
> > exception of find_first_bit() being much better with the generic
> > implementation, but find_next_zero_bit() being noticeably worse.
>
> The generic _find_first_bit() code is actually sane and simple. It
> loops over words until it finds a non-zero one, and then does trivial
> calculations on that last word.
>
> That explains why the generic code does so much better than your
> byte-wise asm.
>
> In contrast, the generic _find_next_bit() I find almost offensively
> silly - which in turn explains why your byte-wide asm does better.
>
> I think the generic _find_next_bit() should actually do what the m68k
> find_next_bit code does: handle the first special word itself, and
> then just call find_first_bit() on the rest of it.
>
> And it should *not* try to handle the dynamic "bswap and/or bit sense
> invert" thing at all. That should be just four different (trivial)
> cases for the first word.

Here's the results for the native version converted to use word loads:

[ 37.319937] Start testing find_bit() with random-filled bitmap
[ 37.330289] find_next_bit: 2222703 ns, 163781 iterations
[ 37.339186] find_next_zero_bit: 2154375 ns, 163900 iterations
[ 37.348118] find_last_bit: 2208104 ns, 163780 iterations
[ 37.372564] find_first_bit: 17722203 ns, 16370 iterations
[ 37.737415] find_first_and_bit: 358135191 ns, 32453 iterations
[ 37.745420] find_next_and_bit: 1280537 ns, 73644 iterations
[ 37.752143] Start testing find_bit() with sparse bitmap
[ 37.759032] find_next_bit: 41256 ns, 655 iterations
[ 37.769905] find_next_zero_bit: 4148410 ns, 327026 iterations
[ 37.776675] find_last_bit: 48742 ns, 655 iterations
[ 37.790961] find_first_bit: 7562371 ns, 655 iterations
[ 37.797743] find_first_and_bit: 47366 ns, 1 iterations
[ 37.804527] find_next_and_bit: 59924 ns, 1 iterations

which is generally faster than the generic version, with the exception
of the sparse find_first_bit (generic was:
[ 25.657304] find_first_bit: 7328573 ns, 656 iterations)

find_next_{,zero_}bit() in the sparse case are quite a bit faster than
the generic code.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
* Re: Linux 5.19-rc8
  2022-07-27  0:15 ` Russell King (Oracle)
@ 2022-07-27  1:33   ` Yury Norov
  2022-07-27  7:43     ` Russell King (Oracle)
  2022-07-27  7:46     ` David Laight
  0 siblings, 2 replies; 11+ messages in thread
From: Yury Norov @ 2022-07-27  1:33 UTC (permalink / raw)
To: Russell King (Oracle)
Cc: Linus Torvalds, Dennis Zhou, Guenter Roeck, Catalin Marinas,
	Linux Kernel Mailing List, Geert Uytterhoeven, linux-m68k

On Tue, Jul 26, 2022 at 5:15 PM Russell King (Oracle)
<linux@armlinux.org.uk> wrote:
>
> On Tue, Jul 26, 2022 at 01:20:23PM -0700, Linus Torvalds wrote:
> > On Tue, Jul 26, 2022 at 12:44 PM Russell King (Oracle)
> > <linux@armlinux.org.uk> wrote:
> > >
> > > Overall, I would say it's pretty similar (some generic perform
> > > marginally better, some native perform marginally better) with the
> > > exception of find_first_bit() being much better with the generic
> > > implementation, but find_next_zero_bit() being noticeably worse.
> >
> > The generic _find_first_bit() code is actually sane and simple. It
> > loops over words until it finds a non-zero one, and then does trivial
> > calculations on that last word.
> >
> > That explains why the generic code does so much better than your
> > byte-wise asm.
> >
> > In contrast, the generic _find_next_bit() I find almost offensively
> > silly - which in turn explains why your byte-wide asm does better.
> >
> > I think the generic _find_next_bit() should actually do what the m68k
> > find_next_bit code does: handle the first special word itself, and
> > then just call find_first_bit() on the rest of it.
> >
> > And it should *not* try to handle the dynamic "bswap and/or bit sense
> > invert" thing at all. That should be just four different (trivial)
> > cases for the first word.
>
> Here's the results for the native version converted to use word loads:
>
> [ 37.319937] Start testing find_bit() with random-filled bitmap
> [ 37.330289] find_next_bit: 2222703 ns, 163781 iterations
> [ 37.339186] find_next_zero_bit: 2154375 ns, 163900 iterations
> [ 37.348118] find_last_bit: 2208104 ns, 163780 iterations
> [ 37.372564] find_first_bit: 17722203 ns, 16370 iterations
> [ 37.737415] find_first_and_bit: 358135191 ns, 32453 iterations
> [ 37.745420] find_next_and_bit: 1280537 ns, 73644 iterations
> [ 37.752143] Start testing find_bit() with sparse bitmap
> [ 37.759032] find_next_bit: 41256 ns, 655 iterations
> [ 37.769905] find_next_zero_bit: 4148410 ns, 327026 iterations
> [ 37.776675] find_last_bit: 48742 ns, 655 iterations
> [ 37.790961] find_first_bit: 7562371 ns, 655 iterations
> [ 37.797743] find_first_and_bit: 47366 ns, 1 iterations
> [ 37.804527] find_next_and_bit: 59924 ns, 1 iterations
>
> which is generally faster than the generic version, with the exception
> of the sparse find_first_bit (generic was:
> [ 25.657304] find_first_bit: 7328573 ns, 656 iterations)
>
> find_next_{,zero_}bit() in the sparse case are quite a bit faster than
> the generic code.

Look at the find_{first,next}_and_bit results. Those two have no arch
version and in both cases use generic code. In theory they should be
equally fast before and after, but your testing says that the generic
case is slower even for them, and the difference is comparable with the
numbers for the real arch functions. It makes me feel like:
 - there's something unrelated, like governor/throttling, that affects
   the results;
 - the numbers are identical, taking the dispersion into account.

If the difference really concerns you, I'd suggest running the test
several times to measure confidence intervals.

Thanks,
Yury
* Re: Linux 5.19-rc8
  2022-07-27  1:33 ` Yury Norov
@ 2022-07-27  7:43   ` Russell King (Oracle)
  2022-07-30 21:38     ` Yury Norov
  0 siblings, 1 reply; 11+ messages in thread
From: Russell King (Oracle) @ 2022-07-27  7:43 UTC (permalink / raw)
To: Yury Norov
Cc: Linus Torvalds, Dennis Zhou, Guenter Roeck, Catalin Marinas,
	Linux Kernel Mailing List, Geert Uytterhoeven, linux-m68k

On Tue, Jul 26, 2022 at 06:33:55PM -0700, Yury Norov wrote:
> On Tue, Jul 26, 2022 at 5:15 PM Russell King (Oracle)
> <linux@armlinux.org.uk> wrote:
> >
> > On Tue, Jul 26, 2022 at 01:20:23PM -0700, Linus Torvalds wrote:
> > > On Tue, Jul 26, 2022 at 12:44 PM Russell King (Oracle)
> > > <linux@armlinux.org.uk> wrote:
> > > >
> > > > Overall, I would say it's pretty similar (some generic perform
> > > > marginally better, some native perform marginally better) with the
> > > > exception of find_first_bit() being much better with the generic
> > > > implementation, but find_next_zero_bit() being noticeably worse.
> > >
> > > The generic _find_first_bit() code is actually sane and simple. It
> > > loops over words until it finds a non-zero one, and then does trivial
> > > calculations on that last word.
> > >
> > > That explains why the generic code does so much better than your
> > > byte-wise asm.
> > >
> > > In contrast, the generic _find_next_bit() I find almost offensively
> > > silly - which in turn explains why your byte-wide asm does better.
> > >
> > > I think the generic _find_next_bit() should actually do what the m68k
> > > find_next_bit code does: handle the first special word itself, and
> > > then just call find_first_bit() on the rest of it.
> > >
> > > And it should *not* try to handle the dynamic "bswap and/or bit sense
> > > invert" thing at all. That should be just four different (trivial)
> > > cases for the first word.
> >
> > Here's the results for the native version converted to use word loads:
> >
> > [ 37.319937] Start testing find_bit() with random-filled bitmap
> > [ 37.330289] find_next_bit: 2222703 ns, 163781 iterations
> > [ 37.339186] find_next_zero_bit: 2154375 ns, 163900 iterations
> > [ 37.348118] find_last_bit: 2208104 ns, 163780 iterations
> > [ 37.372564] find_first_bit: 17722203 ns, 16370 iterations
> > [ 37.737415] find_first_and_bit: 358135191 ns, 32453 iterations
> > [ 37.745420] find_next_and_bit: 1280537 ns, 73644 iterations
> > [ 37.752143] Start testing find_bit() with sparse bitmap
> > [ 37.759032] find_next_bit: 41256 ns, 655 iterations
> > [ 37.769905] find_next_zero_bit: 4148410 ns, 327026 iterations
> > [ 37.776675] find_last_bit: 48742 ns, 655 iterations
> > [ 37.790961] find_first_bit: 7562371 ns, 655 iterations
> > [ 37.797743] find_first_and_bit: 47366 ns, 1 iterations
> > [ 37.804527] find_next_and_bit: 59924 ns, 1 iterations
> >
> > which is generally faster than the generic version, with the exception
> > of the sparse find_first_bit (generic was:
> > [ 25.657304] find_first_bit: 7328573 ns, 656 iterations)
> >
> > find_next_{,zero_}bit() in the sparse case are quite a bit faster than
> > the generic code.
>
> Look at the find_{first,next}_and_bit results. Those two have no arch
> version and in both cases use generic code. In theory they should be
> equally fast before and after, but your testing says that the generic
> case is slower even for them, and the difference is comparable with the
> numbers for the real arch functions. It makes me feel like:
>  - there's something unrelated, like governor/throttling, that affects
>    the results;
>  - the numbers are identical, taking the dispersion into account.
>
> If the difference really concerns you, I'd suggest running the test
> several times to measure confidence intervals.

Given that the benchmark is run against random bitmaps and with
interrupts enabled, there is going to be noise in the results.
Here's the second run:

[26234.429389] Start testing find_bit() with random-filled bitmap
[26234.439722] find_next_bit: 2206687 ns, 164277 iterations
[26234.448664] find_next_zero_bit: 2188368 ns, 163404 iterations
[26234.457612] find_last_bit: 2223742 ns, 164278 iterations
[26234.482056] find_first_bit: 17720726 ns, 16384 iterations
[26234.859374] find_first_and_bit: 370602019 ns, 32877 iterations
[26234.867379] find_next_and_bit: 1280651 ns, 74091 iterations
[26234.874107] Start testing find_bit() with sparse bitmap
[26234.881014] find_next_bit: 46142 ns, 656 iterations
[26234.891900] find_next_zero_bit: 4158987 ns, 327025 iterations
[26234.898672] find_last_bit: 49727 ns, 656 iterations
[26234.912504] find_first_bit: 7107862 ns, 656 iterations
[26234.919290] find_first_and_bit: 52092 ns, 1 iterations
[26234.926076] find_next_and_bit: 60856 ns, 1 iterations

And a third run:

[26459.679524] Start testing find_bit() with random-filled bitmap
[26459.689871] find_next_bit: 2199418 ns, 163311 iterations
[26459.698798] find_next_zero_bit: 2181289 ns, 164370 iterations
[26459.707738] find_last_bit: 2213638 ns, 163311 iterations
[26459.732224] find_first_bit: 17764152 ns, 16429 iterations
[26460.133823] find_first_and_bit: 394886375 ns, 32672 iterations
[26460.141818] find_next_and_bit: 1269693 ns, 73485 iterations
[26460.148545] Start testing find_bit() with sparse bitmap
[26460.155433] find_next_bit: 40753 ns, 653 iterations
[26460.166307] find_next_zero_bit: 4148211 ns, 327028 iterations
[26460.173078] find_last_bit: 50017 ns, 653 iterations
[26460.187007] find_first_bit: 7205325 ns, 653 iterations
[26460.193790] find_first_and_bit: 49358 ns, 1 iterations
[26460.200577] find_next_and_bit: 62332 ns, 1 iterations

My gut feeling is that yes, there is some variance, but not of an order
significant enough to let us say "there's no difference".

find_next_bit results for random are: 2222703, 2206687, 2199418, which
is an average of 2209603 and a variance of around 0.5%. The difference
between this and the single generic figure I have is on the order of
20%.

I'll do the same with find_first_bit for random: 17722203, 17720726,
and 17764152. Average is 17735694. Variance is around 0.1% or 0.2%.
The difference between this and the single generic figure I have is
on the order of 5%. Not so large, but still quite a big difference
compared to the variance.

find_first_bit for sparse: 7562371, 7107862, 7205325. Average is
7291853. Variance is higher at about 4%. The difference between this
and the generic figure is 0.5%, so this one is not significantly
different.

The best result looks to be find_next_zero_bit for the sparse bitmap
case. The generic code measures 5.5ms, the native code is sitting
around 4.1ms. That's a difference of around 34%, and by just looking
at the range in the figures above we can see this is a significant
result without needing to do the calculations. The same is true of
find_next_bit for the sparse bitmap.

So, I think the results are significant in most cases and variance
doesn't account for the differences. The only one which isn't is
find_first_bit for the sparse case.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
* Re: Linux 5.19-rc8
  2022-07-27  7:43 ` Russell King (Oracle)
@ 2022-07-30 21:38   ` Yury Norov
  2022-08-01 15:48     ` Russell King (Oracle)
  0 siblings, 1 reply; 11+ messages in thread
From: Yury Norov @ 2022-07-30 21:38 UTC (permalink / raw)
To: Russell King (Oracle)
Cc: Linus Torvalds, Dennis Zhou, Guenter Roeck, Catalin Marinas,
	Linux Kernel Mailing List, Geert Uytterhoeven, Alexey Klimov,
	linux-m68k

On Wed, Jul 27, 2022 at 08:43:22AM +0100, Russell King (Oracle) wrote:
> On Tue, Jul 26, 2022 at 06:33:55PM -0700, Yury Norov wrote:
> > On Tue, Jul 26, 2022 at 5:15 PM Russell King (Oracle)
> > <linux@armlinux.org.uk> wrote:
> > >
> > > On Tue, Jul 26, 2022 at 01:20:23PM -0700, Linus Torvalds wrote:
> > > > On Tue, Jul 26, 2022 at 12:44 PM Russell King (Oracle)
> > > > <linux@armlinux.org.uk> wrote:
> > > > >
> > > > > Overall, I would say it's pretty similar (some generic perform
> > > > > marginally better, some native perform marginally better) with the
> > > > > exception of find_first_bit() being much better with the generic
> > > > > implementation, but find_next_zero_bit() being noticeably worse.
> > > >
> > > > The generic _find_first_bit() code is actually sane and simple. It
> > > > loops over words until it finds a non-zero one, and then does trivial
> > > > calculations on that last word.
> > > >
> > > > That explains why the generic code does so much better than your
> > > > byte-wise asm.
> > > >
> > > > In contrast, the generic _find_next_bit() I find almost offensively
> > > > silly - which in turn explains why your byte-wide asm does better.
> > > >
> > > > I think the generic _find_next_bit() should actually do what the m68k
> > > > find_next_bit code does: handle the first special word itself, and
> > > > then just call find_first_bit() on the rest of it.
> > > >
> > > > And it should *not* try to handle the dynamic "bswap and/or bit sense
> > > > invert" thing at all. That should be just four different (trivial)
> > > > cases for the first word.
> > >
> > > Here's the results for the native version converted to use word loads:
> > >
> > > [ 37.319937] Start testing find_bit() with random-filled bitmap
> > > [ 37.330289] find_next_bit: 2222703 ns, 163781 iterations
> > > [ 37.339186] find_next_zero_bit: 2154375 ns, 163900 iterations
> > > [ 37.348118] find_last_bit: 2208104 ns, 163780 iterations
> > > [ 37.372564] find_first_bit: 17722203 ns, 16370 iterations
> > > [ 37.737415] find_first_and_bit: 358135191 ns, 32453 iterations
> > > [ 37.745420] find_next_and_bit: 1280537 ns, 73644 iterations
> > > [ 37.752143] Start testing find_bit() with sparse bitmap
> > > [ 37.759032] find_next_bit: 41256 ns, 655 iterations
> > > [ 37.769905] find_next_zero_bit: 4148410 ns, 327026 iterations
> > > [ 37.776675] find_last_bit: 48742 ns, 655 iterations
> > > [ 37.790961] find_first_bit: 7562371 ns, 655 iterations
> > > [ 37.797743] find_first_and_bit: 47366 ns, 1 iterations
> > > [ 37.804527] find_next_and_bit: 59924 ns, 1 iterations
> > >
> > > which is generally faster than the generic version, with the exception
> > > of the sparse find_first_bit (generic was:
> > > [ 25.657304] find_first_bit: 7328573 ns, 656 iterations)
> > >
> > > find_next_{,zero_}bit() in the sparse case are quite a bit faster than
> > > the generic code.
> >
> > Look at the find_{first,next}_and_bit results. Those two have no arch
> > version and in both cases use generic code. In theory they should be
> > equally fast before and after, but your testing says that the generic
> > case is slower even for them, and the difference is comparable with the
> > numbers for the real arch functions. It makes me feel like:
> >  - there's something unrelated, like governor/throttling, that affects
> >    the results;
> >  - the numbers are identical, taking the dispersion into account.
> >
> > If the difference really concerns you, I'd suggest running the test
> > several times to measure confidence intervals.
>
> Given that the benchmark is run against random bitmaps and with
> interrupts enabled, there is going to be noise in the results.
>
> Here's the second run:
>
> [26234.429389] Start testing find_bit() with random-filled bitmap
> [26234.439722] find_next_bit: 2206687 ns, 164277 iterations
> [26234.448664] find_next_zero_bit: 2188368 ns, 163404 iterations
> [26234.457612] find_last_bit: 2223742 ns, 164278 iterations
> [26234.482056] find_first_bit: 17720726 ns, 16384 iterations
> [26234.859374] find_first_and_bit: 370602019 ns, 32877 iterations
> [26234.867379] find_next_and_bit: 1280651 ns, 74091 iterations
> [26234.874107] Start testing find_bit() with sparse bitmap
> [26234.881014] find_next_bit: 46142 ns, 656 iterations
> [26234.891900] find_next_zero_bit: 4158987 ns, 327025 iterations
> [26234.898672] find_last_bit: 49727 ns, 656 iterations
> [26234.912504] find_first_bit: 7107862 ns, 656 iterations
> [26234.919290] find_first_and_bit: 52092 ns, 1 iterations
> [26234.926076] find_next_and_bit: 60856 ns, 1 iterations
>
> And a third run:
>
> [26459.679524] Start testing find_bit() with random-filled bitmap
> [26459.689871] find_next_bit: 2199418 ns, 163311 iterations
> [26459.698798] find_next_zero_bit: 2181289 ns, 164370 iterations
> [26459.707738] find_last_bit: 2213638 ns, 163311 iterations
> [26459.732224] find_first_bit: 17764152 ns, 16429 iterations
> [26460.133823] find_first_and_bit: 394886375 ns, 32672 iterations
> [26460.141818] find_next_and_bit: 1269693 ns, 73485 iterations
> [26460.148545] Start testing find_bit() with sparse bitmap
> [26460.155433] find_next_bit: 40753 ns, 653 iterations
> [26460.166307] find_next_zero_bit: 4148211 ns, 327028 iterations
> [26460.173078] find_last_bit: 50017 ns, 653 iterations
> [26460.187007] find_first_bit: 7205325 ns, 653 iterations
> [26460.193790] find_first_and_bit: 49358 ns, 1 iterations
> [26460.200577] find_next_and_bit: 62332 ns, 1 iterations
>
> My gut feeling is that yes, there is some variance, but not of an order
> significant enough to let us say "there's no difference".
>
> find_next_bit results for random are: 2222703, 2206687, 2199418, which
> is an average of 2209603 and a variance of around 0.5%. The difference
> between this and the single generic figure I have is on the order of
> 20%.
>
> I'll do the same with find_first_bit for random: 17722203, 17720726,
> and 17764152. Average is 17735694. Variance is around 0.1% or 0.2%.
> The difference between this and the single generic figure I have is
> on the order of 5%. Not so large, but still quite a big difference
> compared to the variance.
>
> find_first_bit for sparse: 7562371, 7107862, 7205325. Average is
> 7291853. Variance is higher at about 4%. The difference between this
> and the generic figure is 0.5%, so this one is not significantly
> different.
>
> The best result looks to be find_next_zero_bit for the sparse bitmap
> case. The generic code measures 5.5ms, the native code is sitting
> around 4.1ms. That's a difference of around 34%, and by just looking
> at the range in the figures above we can see this is a significant
> result without needing to do the calculations. The same is true of
> find_next_bit for the sparse bitmap.
>
> So, I think the results are significant in most cases and variance
> doesn't account for the differences. The only one which isn't is
> find_first_bit for the sparse case.

Hi Russell,

+ Alexey Klimov <klimov.linux@gmail.com>

This is my testing of native vs generic find_bit operations on A15 and
A7. The raw numbers were collected by Alexey Klimov on an Odroid-xu3;
all cpu frequencies were fixed at 1000MHz. (Thanks a lot!)
For each native/generic @ A15/A7 configuration, the find_bit_benchmark
was run 5 times, and the results are summarized below:

A15                     Native      Generic     Difference
Dense                   ns          ns          %      sigmas
find_next_bit:          3746929     3231641     14     8.3
find_next_zero_bit:     3935354     3202608     19     10.4
find_last_bit:          3134713     3129717     0      0.1
find_first_bit:         85626542    20498669    76     172.4
find_first_and_bit:     409252997   414820417   -1     -0.2
find_next_and_bit:      1678706     1654420     1      0.4

Sparse
find_next_bit:          143208      77924       46     29.4
find_next_zero_bit:     6893375     6059177     12     14.3
find_last_bit:          67174       68616       -2     -1.0
find_first_bit:         33689256    8151493     76     47.8
find_first_and_bit:     124758      156974      -26    -1.3
find_next_and_bit:      53391       56716       -6     -0.2

A7                      Native      Generic     Difference
Dense                   ns          ns          %      sigmas
find_next_bit:          4207627     5532764     -31    -14.9
find_next_zero_bit:     4259961     5236880     -23    -10.0
find_last_bit:          4281386     4201025     2      1.5
find_first_bit:         236913620   50970424    78     163.3
find_first_and_bit:     728069762   750580781   -3     -0.7
find_next_and_bit:      2696263     2766077     -3     -0.9

Sparse
find_next_bit:          327241      143776      56     40.7
find_next_zero_bit:     6987249     10235989    -46    -21.8
find_last_bit:          97758       94725       3      0.6
find_first_bit:         94628040    21051964    78     41.8
find_first_and_bit:     248133      241267      3      0.3
find_next_and_bit:      136475      154000      -13    -0.5

The last column is the difference between native and generic code
performance, normalized to a standard deviation:

  (mean(native) - mean(generic)) / max(std(native), std(generic))

The results look consistent to me, because the 'and' subtests, which
are always generic, differ by less than one sigma.

On A15 the generic code is a clear winner. On A7 the results are
inconsistent, although significant. Maybe it's worth retesting on A7.

Regarding the choice between native and generic code - I would prefer
the generic version even if it's slightly slower, because it is tested
and maintained better. And because the results of the test are at
least on par, to me it's a no-brainer.
It would be really interesting to compare the performance of your
LDRB->LDR patch with the generic code using the same procedure.

Thanks,
Yury
* Re: Linux 5.19-rc8
  2022-07-30 21:38             ` Yury Norov
@ 2022-08-01 15:48               ` Russell King (Oracle)
  2022-08-01 15:54                 ` Russell King (Oracle)
  0 siblings, 1 reply; 11+ messages in thread
From: Russell King (Oracle) @ 2022-08-01 15:48 UTC (permalink / raw)
To: Yury Norov
Cc: Linus Torvalds, Dennis Zhou, Guenter Roeck, Catalin Marinas,
	Linux Kernel Mailing List, Geert Uytterhoeven, Alexey Klimov,
	linux-m68k

Oh FFS.

I see you decided off your own back to remove the ARM version of the
find_bit functions, with NO agreement from the arch maintainer. This
is not on.

On Sat, Jul 30, 2022 at 02:38:38PM -0700, Yury Norov wrote:
> On Wed, Jul 27, 2022 at 08:43:22AM +0100, Russell King (Oracle) wrote:
> > On Tue, Jul 26, 2022 at 06:33:55PM -0700, Yury Norov wrote:
> > > On Tue, Jul 26, 2022 at 5:15 PM Russell King (Oracle)
> > > <linux@armlinux.org.uk> wrote:
> > > >
> > > > On Tue, Jul 26, 2022 at 01:20:23PM -0700, Linus Torvalds wrote:
> > > > > On Tue, Jul 26, 2022 at 12:44 PM Russell King (Oracle)
> > > > > <linux@armlinux.org.uk> wrote:
> > > > > >
> > > > > > Overall, I would say it's pretty similar (some generic perform
> > > > > > marginally better, some native perform marginally better) with the
> > > > > > exception of find_first_bit() being much better with the generic
> > > > > > implementation, but find_next_zero_bit() being noticably worse.
> > > > >
> > > > > The generic _find_first_bit() code is actually sane and simple. It
> > > > > loops over words until it finds a non-zero one, and then does trivial
> > > > > calculations on that last word.
> > > > >
> > > > > That explains why the generic code does so much better than your byte-wise asm.
> > > > >
> > > > > In contrast, the generic _find_next_bit() I find almost offensively
> > > > > silly - which in turn explains why your byte-wide asm does better.
> > > > >
> > > > > I think the generic _find_next_bit() should actually do what the m68k
> > > > > find_next_bit code does: handle the first special word itself, and
> > > > > then just call find_first_bit() on the rest of it.
> > > > >
> > > > > And it should *not* try to handle the dynamic "bswap and/or bit sense
> > > > > invert" thing at all. That should be just four different (trivial)
> > > > > cases for the first word.
> > > >
> > > > Here's the results for the native version converted to use word loads:
> > > >
> > > > [   37.319937]
> > > > Start testing find_bit() with random-filled bitmap
> > > > [   37.330289] find_next_bit: 2222703 ns, 163781 iterations
> > > > [   37.339186] find_next_zero_bit: 2154375 ns, 163900 iterations
> > > > [   37.348118] find_last_bit: 2208104 ns, 163780 iterations
> > > > [   37.372564] find_first_bit: 17722203 ns, 16370 iterations
> > > > [   37.737415] find_first_and_bit: 358135191 ns, 32453 iterations
> > > > [   37.745420] find_next_and_bit: 1280537 ns, 73644 iterations
> > > > [   37.752143]
> > > > Start testing find_bit() with sparse bitmap
> > > > [   37.759032] find_next_bit: 41256 ns, 655 iterations
> > > > [   37.769905] find_next_zero_bit: 4148410 ns, 327026 iterations
> > > > [   37.776675] find_last_bit: 48742 ns, 655 iterations
> > > > [   37.790961] find_first_bit: 7562371 ns, 655 iterations
> > > > [   37.797743] find_first_and_bit: 47366 ns, 1 iterations
> > > > [   37.804527] find_next_and_bit: 59924 ns, 1 iterations
> > > >
> > > > which is generally faster than the generic version, with the exception
> > > > of the sparse find_first_bit (generic was:
> > > > [   25.657304] find_first_bit: 7328573 ns, 656 iterations)
> > > >
> > > > find_next_{,zero_}bit() in the sparse case are quite a bit faster than
> > > > the generic code.
> > >
> > > Look at find_{first,next}_and_bit results. Those two have no arch version
> > > and in both cases use generic code. In theory they should be equally fast
> > > before and after, but your testing says that generic case is slower even
> > > for them, and the difference is comparable with real arch functions numbers.
> > > It makes me feel like:
> > > - there's something unrelated, like governor/throttling that affect results;
> > > - the numbers are identical, taking the dispersion into account.
> > >
> > > If the difference really concerns you, I'd suggest running the test
> > > several times to measure confidence intervals.
> >
> > Given that the benchmark is run against random bitmaps and with
> > interrupts enabled, there is going to be noise in the results.
> >
> > Here's the second run:
> >
> > [26234.429389]
> > Start testing find_bit() with random-filled bitmap
> > [26234.439722] find_next_bit: 2206687 ns, 164277 iterations
> > [26234.448664] find_next_zero_bit: 2188368 ns, 163404 iterations
> > [26234.457612] find_last_bit: 2223742 ns, 164278 iterations
> > [26234.482056] find_first_bit: 17720726 ns, 16384 iterations
> > [26234.859374] find_first_and_bit: 370602019 ns, 32877 iterations
> > [26234.867379] find_next_and_bit: 1280651 ns, 74091 iterations
> > [26234.874107]
> > Start testing find_bit() with sparse bitmap
> > [26234.881014] find_next_bit: 46142 ns, 656 iterations
> > [26234.891900] find_next_zero_bit: 4158987 ns, 327025 iterations
> > [26234.898672] find_last_bit: 49727 ns, 656 iterations
> > [26234.912504] find_first_bit: 7107862 ns, 656 iterations
> > [26234.919290] find_first_and_bit: 52092 ns, 1 iterations
> > [26234.926076] find_next_and_bit: 60856 ns, 1 iterations
> >
> > And a third run:
> >
> > [26459.679524]
> > Start testing find_bit() with random-filled bitmap
> > [26459.689871] find_next_bit: 2199418 ns, 163311 iterations
> > [26459.698798] find_next_zero_bit: 2181289 ns, 164370 iterations
> > [26459.707738] find_last_bit: 2213638 ns, 163311 iterations
> > [26459.732224] find_first_bit: 17764152 ns, 16429 iterations
> > [26460.133823] find_first_and_bit: 394886375 ns, 32672 iterations
> > [26460.141818] find_next_and_bit: 1269693 ns, 73485 iterations
> > [26460.148545]
> > Start testing find_bit() with sparse bitmap
> > [26460.155433] find_next_bit: 40753 ns, 653 iterations
> > [26460.166307] find_next_zero_bit: 4148211 ns, 327028 iterations
> > [26460.173078] find_last_bit: 50017 ns, 653 iterations
> > [26460.187007] find_first_bit: 7205325 ns, 653 iterations
> > [26460.193790] find_first_and_bit: 49358 ns, 1 iterations
> > [26460.200577] find_next_and_bit: 62332 ns, 1 iterations
> >
> > My gut feeling is that yes, there is some variance, but not on an
> > order that is significant that would allow us to say "there's no
> > difference".
> >
> > find_next_bit results for random are: 2222703, 2206687, 2199418,
> > which is an average of 2209603 and a variance of around 0.5%.
> > The difference between this and the single generic figure I have
> > is on the order of 20%.
> >
> > I'll do the same with find_first_bit for random: 17722203, 17720726,
> > and 17764152. Average is 17735694. Variance is around 0.1% or 0.2%.
> > The difference between this and the single generic figure I have is
> > on the order of 5%. Not so large, but still quite a big difference
> > compared to the variance.
> >
> > find_first_bit for sparse: 7562371, 7107862, 7205325. Average is
> > 7291853. Variance is higher at about 4%. Difference between this and
> > the generic figure is 0.5%, so this one is not significantly
> > different.
> >
> > The best result looks to be find_next_zero_bit for the sparse bitmap
> > case. The generic code measures 5.5ms, the native code is sitting
> > around 4.1ms. That's a difference of around 34%, and by just looking
> > at the range in the figures above we can see this is a significant
> > result without needing to do the calculations. Similar is true of
> > find_next_bit for the sparse bitmap.
> >
> > So, I think the results are significant in most cases and variance
> > doesn't account for the differences. The only one which isn't is
> > find_first_bit for the sparse case.
>
> Hi Russell,
>
> + Alexey Klimov <klimov.linux@gmail.com>
>
> This is my testing of native vs generic find_bit operations on A15
> and A7.
>
> The raw numbers are collected by Alexey Klimov on Odroid-xu3. All cpu
> frequencies were fixed at 1000MHz. (Thanks a lot!)
>
> For each native/generic @ a15/a7 configuration, the find_bit_benchmark
> was run 5 times, and the results are summarized below:
>
> A15                   Native     Generic    Difference
> Dense                 ns         ns          %   sigmas
> find_next_bit:        3746929    3231641    14      8.3
> find_next_zero_bit:   3935354    3202608    19     10.4
> find_last_bit:        3134713    3129717     0      0.1
> find_first_bit:       85626542   20498669   76    172.4
> find_first_and_bit:   409252997  414820417  -1     -0.2
> find_next_and_bit:    1678706    1654420     1      0.4
>
> Sparse
> find_next_bit:        143208     77924      46     29.4
> find_next_zero_bit:   6893375    6059177    12     14.3
> find_last_bit:        67174      68616      -2     -1.0
> find_first_bit:       33689256   8151493    76     47.8
> find_first_and_bit:   124758     156974    -26     -1.3
> find_next_and_bit:    53391      56716      -6     -0.2
>
> A7                    Native     Generic    Difference
> Dense                 ns         ns          %   sigmas
> find_next_bit:        4207627    5532764   -31    -14.9
> find_next_zero_bit:   4259961    5236880   -23    -10.0
> find_last_bit:        4281386    4201025     2      1.5
> find_first_bit:       236913620  50970424   78    163.3
> find_first_and_bit:   728069762  750580781  -3     -0.7
> find_next_and_bit:    2696263    2766077    -3     -0.9
>
> Sparse
> find_next_bit:        327241     143776     56     40.7
> find_next_zero_bit:   6987249    10235989  -46    -21.8
> find_last_bit:        97758      94725       3      0.6
> find_first_bit:       94628040   21051964   78     41.8
> find_first_and_bit:   248133     241267      3      0.3
> find_next_and_bit:    136475     154000    -13     -0.5
>
> The last column is the difference between native and generic code
> performance normalized to a standard deviation:
> (mean(native) - mean(generic)) / max(std(native), std(generic))
>
> The results look consistent to me because the 'and' subtests, which
> are always generic, differ by less than one sigma.
>
> On A15 the generic code is a clear winner. On A7 the results are
> inconsistent although significant. Maybe it's worth retesting on A7.
>
> Regarding the choice between native and generic code - I would prefer
> the generic version even if it's slightly slower, because it is tested
> and maintained better. And because the results of the test are at
> least on par, to me it's a no-brainer.
>
> Would be really interesting to compare performance of your LDRB->LDR
> patch with the generic code using the same procedure.
>
> Thanks,
> Yury

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: Linux 5.19-rc8
  2022-08-01 15:48               ` Russell King (Oracle)
@ 2022-08-01 15:54                 ` Russell King (Oracle)
  0 siblings, 0 replies; 11+ messages in thread
From: Russell King (Oracle) @ 2022-08-01 15:54 UTC (permalink / raw)
To: Yury Norov
Cc: Linus Torvalds, Dennis Zhou, Guenter Roeck, Catalin Marinas,
	Linux Kernel Mailing List, Geert Uytterhoeven, Alexey Klimov,
	linux-m68k

On Mon, Aug 01, 2022 at 04:48:21PM +0100, Russell King (Oracle) wrote:
> Oh FFS.
>
> I see you decided off your own back to remove the ARM version of the
> find_bit functions, with NO agreement from the arch maintainer. This
> is not on.

Sorry, my mistake, I'm getting confused with git over conflicts which
aren't making much sense.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 11+ messages in thread
* RE: Linux 5.19-rc8
  2022-07-27  1:33           ` Yury Norov
  2022-07-27  7:43             ` Russell King (Oracle)
@ 2022-07-27  7:46             ` David Laight
  1 sibling, 0 replies; 11+ messages in thread
From: David Laight @ 2022-07-27  7:46 UTC (permalink / raw)
To: 'Yury Norov', Russell King (Oracle)
Cc: Linus Torvalds, Dennis Zhou, Guenter Roeck, Catalin Marinas,
	Linux Kernel Mailing List, Geert Uytterhoeven,
	linux-m68k@lists.linux-m68k.org

From: Yury Norov
> Sent: 27 July 2022 02:34
...
> If the difference really concerns you, I'd suggest running the test
> several times to measure confidence intervals.

Or, do what I've been doing and get an accurate cycle count for a
single call and repeat about 10 times.

The first value is typically slow (loading I$), the rest typically
identical. They can often even be matched to the expected value!
The fastest value is the one to quote!

On x86 you have to use the perf cycle counter, the tsc is just useless.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 11+ messages in thread
Thread overview: 11+ messages
[not found] <CAHk-=wiWwDYxNhnStS0e+p-NTFAQSHvab=2-8LwxunCVuY5-2A@mail.gmail.com>
[not found] ` <20220725161141.GA1306881@roeck-us.net>
[not found] ` <CAHk-=whtGUwJwHUSNsXd4g7cok=n0Zwje7nACp8skh1fa2NtJA@mail.gmail.com>
[not found] ` <YuAm5h1B6bsrR/9q@fedora>
[not found] ` <CAHk-=wgYpJTMMxmfbpqc=JVtSK0Zj4b15G=AvEYk6vPNySDSsA@mail.gmail.com>
2022-07-26 18:18 ` Linux 5.19-rc8 Yury Norov
2022-07-26 18:36 ` Linus Torvalds
2022-07-26 19:44 ` Russell King (Oracle)
2022-07-26 20:20 ` Linus Torvalds
2022-07-27 0:15 ` Russell King (Oracle)
2022-07-27 1:33 ` Yury Norov
2022-07-27 7:43 ` Russell King (Oracle)
2022-07-30 21:38 ` Yury Norov
2022-08-01 15:48 ` Russell King (Oracle)
2022-08-01 15:54 ` Russell King (Oracle)
2022-07-27 7:46 ` David Laight