public inbox for linux-m68k@lists.linux-m68k.org
* Re: Linux 5.19-rc8
       [not found]       ` <CAHk-=wgYpJTMMxmfbpqc=JVtSK0Zj4b15G=AvEYk6vPNySDSsA@mail.gmail.com>
@ 2022-07-26 18:18         ` Yury Norov
  2022-07-26 18:36           ` Linus Torvalds
  0 siblings, 1 reply; 11+ messages in thread
From: Yury Norov @ 2022-07-26 18:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dennis Zhou, Guenter Roeck, Russell King - ARM Linux,
	Catalin Marinas, Linux Kernel Mailing List, Geert Uytterhoeven,
	linux-m68k

+ Geert Uytterhoeven <geert@linux-m68k.org>
+ linux-m68k@lists.linux-m68k.org

On Tue, Jul 26, 2022 at 10:51:01AM -0700, Linus Torvalds wrote:
> On Tue, Jul 26, 2022 at 10:39 AM Dennis Zhou <dennis@kernel.org> wrote:
> >
> > Are we okay with adding the contract find_*_bit() operations must handle
> > asking for past size properly? FWIW, we'd have to modify most of the
> > iterators in find.h.
> 
> So I think we're ok with it, if only it makes the macros simpler.
> 
> I also think we should probably look at the m68k case, because while
> that one seems to not have the bug that the arm case had, if we remove
> the arm case the m68k code is now the only non-generic case remaining.
> 
> And it just makes me go "maybe we should get rid of the whole
> 'override the generic code' thing entirely?"
> 
> I don't think that inlining the first word (like the m68k code does)
> is worth it, but it *is* possible that the architecture-specific
> functions generate better code for some common cases,

We have find_bit_benchmark to check how the implementations perform in
practice. It would be great if someone with access to the hardware
could share numbers.
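
For reference, the "past size" contract Dennis asks about above amounts
to something like the following userspace sketch (the
clamped_find_next_bit name is made up for illustration; this is not the
kernel's code):

```c
#include <limits.h>

#define BITS_PER_LONG (CHAR_BIT * sizeof(unsigned long))

/*
 * Sketch of the proposed contract: any find_*_bit() call whose start
 * offset is at or past `size` must simply report "not found", i.e.
 * return `size`, so that iterators built on top of it terminate
 * cleanly without a separate bounds check at every step.
 */
static unsigned long clamped_find_next_bit(const unsigned long *addr,
					   unsigned long size,
					   unsigned long offset)
{
	if (offset >= size)	/* past-size request: well-defined result */
		return size;

	for (; offset < size; offset++) {
		if (addr[offset / BITS_PER_LONG] &
		    (1UL << (offset % BITS_PER_LONG)))
			return offset;
	}
	return size;
}
```

The bit-at-a-time loop is deliberately naive; it only pins down the
boundary behaviour, not the performance being discussed here.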

> so I think this
> is a "needs looking at the generated code" and not just a blind
> removal.
> 
>               Linus

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Linux 5.19-rc8
  2022-07-26 18:18         ` Linux 5.19-rc8 Yury Norov
@ 2022-07-26 18:36           ` Linus Torvalds
  2022-07-26 19:44             ` Russell King (Oracle)
  0 siblings, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2022-07-26 18:36 UTC (permalink / raw)
  To: Yury Norov
  Cc: Dennis Zhou, Guenter Roeck, Russell King - ARM Linux,
	Catalin Marinas, Linux Kernel Mailing List, Geert Uytterhoeven,
	linux-m68k

On Tue, Jul 26, 2022 at 11:18 AM Yury Norov <yury.norov@gmail.com> wrote:
>
> We have find_bit_benchmark to check how it works in practice. Would
> be great if someone with access to the hardware can share numbers.

Honestly, I doubt benchmarking find_bit in a loop is all that sensible.

These are helper functions that are probably seldom super-hot in
cache, and as with so many of these things, I suspect the cold-I$
numbers are the ones that matter most in real life.

When some filesystem ends up searching for a free block or similar, it
will probably have done enough other work beforehand that the L1 I$
has long since been flushed, and the branch history is quite possibly
entirely gone too.

The same is quite possibly true for the bitmap itself in D$ too.

That said, looking at the x86 code generation (not only because I have
the build environment, but because x86 is the asm I can actually read
and make sense of), the only thing that looks bad is the conditional
bswap.

So the code gen looks fairly good,

                 Linus

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Linux 5.19-rc8
  2022-07-26 18:36           ` Linus Torvalds
@ 2022-07-26 19:44             ` Russell King (Oracle)
  2022-07-26 20:20               ` Linus Torvalds
  0 siblings, 1 reply; 11+ messages in thread
From: Russell King (Oracle) @ 2022-07-26 19:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Yury Norov, Dennis Zhou, Guenter Roeck, Catalin Marinas,
	Linux Kernel Mailing List, Geert Uytterhoeven, linux-m68k

On Tue, Jul 26, 2022 at 11:36:21AM -0700, Linus Torvalds wrote:
> On Tue, Jul 26, 2022 at 11:18 AM Yury Norov <yury.norov@gmail.com> wrote:
> >
> > We have find_bit_benchmark to check how it works in practice. Would
> > be great if someone with access to the hardware can share numbers.
> 
> Honestly, I doubt benchmarking find_bit in a loop is all that sensible.

Yes, that's what I was thinking - it has never cropped up in any of
the perf traces I've looked at.

Nevertheless, here's some numbers from a single run of the
find_bit_benchmark module, kernel built with:
arm-linux-gnueabihf-gcc (Debian 10.2.1-6) 10.2.1 20210110

Current native implementation:

[   46.184565]
               Start testing find_bit() with random-filled bitmap
[   46.195127] find_next_bit:                 2440833 ns, 163112 iterations
[   46.204226] find_next_zero_bit:            2372128 ns, 164569 iterations
[   46.213152] find_last_bit:                 2199779 ns, 163112 iterations
[   46.299398] find_first_bit:               79526013 ns,  16234 iterations
[   46.684026] find_first_and_bit:          377912990 ns,  32617 iterations
[   46.692020] find_next_and_bit:             1269071 ns,  73562 iterations
[   46.698745]
               Start testing find_bit() with sparse bitmap
[   46.705711] find_next_bit:                  118652 ns,    656 iterations
[   46.716621] find_next_zero_bit:            4183472 ns, 327025 iterations
[   46.723395] find_last_bit:                   50448 ns,    656 iterations
[   46.762308] find_first_bit:               32190802 ns,    656 iterations
[   46.769093] find_first_and_bit:              52129 ns,      1 iterations
[   46.775882] find_next_and_bit:               62522 ns,      1 iterations

Generic implementation:

[   25.149238]
               Start testing find_bit() with random-filled bitmap
[   25.160002] find_next_bit:                 2640943 ns, 163537 iterations
[   25.169567] find_next_zero_bit:            2838485 ns, 164144 iterations
[   25.178595] find_last_bit:                 2302372 ns, 163538 iterations
[   25.204016] find_first_bit:               18697630 ns,  16373 iterations
[   25.602571] find_first_and_bit:          391841480 ns,  32555 iterations
[   25.610563] find_next_and_bit:             1260306 ns,  73587 iterations
[   25.617295]
               Start testing find_bit() with sparse bitmap
[   25.624222] find_next_bit:                   70289 ns,    656 iterations
[   25.636478] find_next_zero_bit:            5527050 ns, 327025 iterations
[   25.643253] find_last_bit:                   52147 ns,    656 iterations
[   25.657304] find_first_bit:                7328573 ns,    656 iterations
[   25.664087] find_first_and_bit:              48518 ns,      1 iterations
[   25.670871] find_next_and_bit:               59750 ns,      1 iterations

Overall, I would say it's pretty similar (some generic functions
perform marginally better, some native ones do), with the exception of
find_first_bit() being much better with the generic implementation,
but find_next_zero_bit() being noticeably worse.

So, pretty much no difference of any significance between them, which
may come as a surprise given the byte vs word access differences
between the two implementations.

I suspect the reason is that the native implementation is smaller than
the generic one, which outweighs the cost of accessing the bitmap byte
by byte rather than word by word. I would also suspect that, because
of its smaller size, the native version performs better than the
generic one when the I$ is cold. Lastly, I suspect that if we fixed
the bug in the native version and converted it to use word loads, it
would probably beat the generic version. I have nothing to base that
on other than gut feeling at the moment, but I can make the changes to
the native implementation and see what effect they have, possibly
tomorrow.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Linux 5.19-rc8
  2022-07-26 19:44             ` Russell King (Oracle)
@ 2022-07-26 20:20               ` Linus Torvalds
  2022-07-27  0:15                 ` Russell King (Oracle)
  0 siblings, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2022-07-26 20:20 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Yury Norov, Dennis Zhou, Guenter Roeck, Catalin Marinas,
	Linux Kernel Mailing List, Geert Uytterhoeven, linux-m68k

On Tue, Jul 26, 2022 at 12:44 PM Russell King (Oracle)
<linux@armlinux.org.uk> wrote:
>
> Overall, I would say it's pretty similar (some generic perform
> marginally better, some native perform marginally better) with the
> exception of find_first_bit() being much better with the generic
> implementation, but find_next_zero_bit() being noticably worse.

The generic _find_first_bit() code is actually sane and simple. It
loops over words until it finds a non-zero one, and then does trivial
calculations on that last word.

That explains why the generic code does so much better than your byte-wise asm.

In contrast, the generic _find_next_bit() I find almost offensively
silly - which in turn explains why your byte-wide asm does better.

I think the generic _find_next_bit() should actually do what the m68k
find_next_bit code does: handle the first special word itself, and
then just call find_first_bit() on the rest of it.

And it should *not* try to handle the dynamic "bswap and/or bit sense
invert" thing at all. That should be just four different (trivial)
cases for the first word.
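
The structure described above can be sketched in userspace C (an
illustration only, with made-up my_find_* names; this is not the
kernel's lib/find_bit.c):

```c
#include <limits.h>

#define BITS_PER_LONG (CHAR_BIT * sizeof(unsigned long))

/*
 * Word-at-a-time first-bit search, as described for the generic
 * _find_first_bit(): skip zero words, then resolve the final word.
 * Returns `size` when no bit is set (the past-size contract).
 */
static unsigned long my_find_first_bit(const unsigned long *addr,
				       unsigned long size)
{
	unsigned long idx;

	for (idx = 0; idx * BITS_PER_LONG < size; idx++) {
		if (addr[idx]) {
			unsigned long ret = idx * BITS_PER_LONG +
				(unsigned long)__builtin_ctzl(addr[idx]);
			return ret < size ? ret : size;
		}
	}
	return size;
}

/*
 * Proposed shape for find_next_bit: handle the first, possibly
 * partial word specially, then hand the rest to find_first_bit().
 */
static unsigned long my_find_next_bit(const unsigned long *addr,
				      unsigned long size,
				      unsigned long offset)
{
	unsigned long idx, first;

	if (offset >= size)
		return size;

	idx = offset / BITS_PER_LONG;
	/* Mask off bits below `offset` in the first word. */
	first = addr[idx] & (~0UL << (offset % BITS_PER_LONG));
	if (first) {
		unsigned long ret = idx * BITS_PER_LONG +
			(unsigned long)__builtin_ctzl(first);
		return ret < size ? ret : size;
	}

	/* Nothing in the first word: search from the next word boundary. */
	offset = (idx + 1) * BITS_PER_LONG;
	if (offset >= size)
		return size;
	return offset + my_find_first_bit(addr + idx + 1, size - offset);
}
```

The zero/bswap/invert variants would then be three more trivial
specializations of the first-word handling, rather than runtime flags.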

             Linus

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Linux 5.19-rc8
  2022-07-26 20:20               ` Linus Torvalds
@ 2022-07-27  0:15                 ` Russell King (Oracle)
  2022-07-27  1:33                   ` Yury Norov
  0 siblings, 1 reply; 11+ messages in thread
From: Russell King (Oracle) @ 2022-07-27  0:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Yury Norov, Dennis Zhou, Guenter Roeck, Catalin Marinas,
	Linux Kernel Mailing List, Geert Uytterhoeven, linux-m68k

On Tue, Jul 26, 2022 at 01:20:23PM -0700, Linus Torvalds wrote:
> On Tue, Jul 26, 2022 at 12:44 PM Russell King (Oracle)
> <linux@armlinux.org.uk> wrote:
> >
> > Overall, I would say it's pretty similar (some generic perform
> > marginally better, some native perform marginally better) with the
> > exception of find_first_bit() being much better with the generic
> > implementation, but find_next_zero_bit() being noticably worse.
> 
> The generic _find_first_bit() code is actually sane and simple. It
> loops over words until it finds a non-zero one, and then does trivial
> calculations on that last word.
> 
> That explains why the generic code does so much better than your byte-wise asm.
> 
> In contrast, the generic _find_next_bit() I find almost offensively
> silly - which in turn explains why your byte-wide asm does better.
> 
> I think the generic _find_next_bit() should actually do what the m68k
> find_next_bit code does: handle the first special word itself, and
> then just call find_first_bit() on the rest of it.
> 
> And it should *not* try to handle the dynamic "bswap and/or bit sense
> invert" thing at all. That should be just four different (trivial)
> cases for the first word.

Here's the results for the native version converted to use word loads:

[   37.319937]
               Start testing find_bit() with random-filled bitmap
[   37.330289] find_next_bit:                 2222703 ns, 163781 iterations
[   37.339186] find_next_zero_bit:            2154375 ns, 163900 iterations
[   37.348118] find_last_bit:                 2208104 ns, 163780 iterations
[   37.372564] find_first_bit:               17722203 ns,  16370 iterations
[   37.737415] find_first_and_bit:          358135191 ns,  32453 iterations
[   37.745420] find_next_and_bit:             1280537 ns,  73644 iterations
[   37.752143]
               Start testing find_bit() with sparse bitmap
[   37.759032] find_next_bit:                   41256 ns,    655 iterations
[   37.769905] find_next_zero_bit:            4148410 ns, 327026 iterations
[   37.776675] find_last_bit:                   48742 ns,    655 iterations
[   37.790961] find_first_bit:                7562371 ns,    655 iterations
[   37.797743] find_first_and_bit:              47366 ns,      1 iterations
[   37.804527] find_next_and_bit:               59924 ns,      1 iterations

which is generally faster than the generic version, with the exception
of the sparse find_first_bit (generic was:
[   25.657304] find_first_bit:                7328573 ns,    656 iterations)

find_next_{,zero_}bit() in the sparse case are quite a bit faster than
the generic code.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Linux 5.19-rc8
  2022-07-27  0:15                 ` Russell King (Oracle)
@ 2022-07-27  1:33                   ` Yury Norov
  2022-07-27  7:43                     ` Russell King (Oracle)
  2022-07-27  7:46                     ` David Laight
  0 siblings, 2 replies; 11+ messages in thread
From: Yury Norov @ 2022-07-27  1:33 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Linus Torvalds, Dennis Zhou, Guenter Roeck, Catalin Marinas,
	Linux Kernel Mailing List, Geert Uytterhoeven, linux-m68k

On Tue, Jul 26, 2022 at 5:15 PM Russell King (Oracle)
<linux@armlinux.org.uk> wrote:
>
> On Tue, Jul 26, 2022 at 01:20:23PM -0700, Linus Torvalds wrote:
> > On Tue, Jul 26, 2022 at 12:44 PM Russell King (Oracle)
> > <linux@armlinux.org.uk> wrote:
> > >
> > > Overall, I would say it's pretty similar (some generic perform
> > > marginally better, some native perform marginally better) with the
> > > exception of find_first_bit() being much better with the generic
> > > implementation, but find_next_zero_bit() being noticably worse.
> >
> > The generic _find_first_bit() code is actually sane and simple. It
> > loops over words until it finds a non-zero one, and then does trivial
> > calculations on that last word.
> >
> > That explains why the generic code does so much better than your byte-wise asm.
> >
> > In contrast, the generic _find_next_bit() I find almost offensively
> > silly - which in turn explains why your byte-wide asm does better.
> >
> > I think the generic _find_next_bit() should actually do what the m68k
> > find_next_bit code does: handle the first special word itself, and
> > then just call find_first_bit() on the rest of it.
> >
> > And it should *not* try to handle the dynamic "bswap and/or bit sense
> > invert" thing at all. That should be just four different (trivial)
> > cases for the first word.
>
> Here's the results for the native version converted to use word loads:
>
> [   37.319937]
>                Start testing find_bit() with random-filled bitmap
> [   37.330289] find_next_bit:                 2222703 ns, 163781 iterations
> [   37.339186] find_next_zero_bit:            2154375 ns, 163900 iterations
> [   37.348118] find_last_bit:                 2208104 ns, 163780 iterations
> [   37.372564] find_first_bit:               17722203 ns,  16370 iterations
> [   37.737415] find_first_and_bit:          358135191 ns,  32453 iterations
> [   37.745420] find_next_and_bit:             1280537 ns,  73644 iterations
> [   37.752143]
>                Start testing find_bit() with sparse bitmap
> [   37.759032] find_next_bit:                   41256 ns,    655 iterations
> [   37.769905] find_next_zero_bit:            4148410 ns, 327026 iterations
> [   37.776675] find_last_bit:                   48742 ns,    655 iterations
> [   37.790961] find_first_bit:                7562371 ns,    655 iterations
> [   37.797743] find_first_and_bit:              47366 ns,      1 iterations
> [   37.804527] find_next_and_bit:               59924 ns,      1 iterations
>
> which is generally faster than the generic version, with the exception
> of the sparse find_first_bit (generic was:
> [   25.657304] find_first_bit:                7328573 ns,    656 iterations)
>
> find_next_{,zero_}bit() in the sparse case are quite a bit faster than
> the generic code.

Look at the find_{first,next}_and_bit() results. Those two have no
arch version and use generic code in both cases. In theory they should
be equally fast before and after, but your testing says the generic
case is slower even for them, and the difference is comparable to that
of the real arch functions. It makes me think that either:
 - something unrelated, like governor/throttling, is affecting the
   results; or
 - the numbers are identical once dispersion is taken into account.

If the difference really concerns you, I'd suggest running the test
several times to measure confidence intervals.

Thanks,
Yury

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Linux 5.19-rc8
  2022-07-27  1:33                   ` Yury Norov
@ 2022-07-27  7:43                     ` Russell King (Oracle)
  2022-07-30 21:38                       ` Yury Norov
  2022-07-27  7:46                     ` David Laight
  1 sibling, 1 reply; 11+ messages in thread
From: Russell King (Oracle) @ 2022-07-27  7:43 UTC (permalink / raw)
  To: Yury Norov
  Cc: Linus Torvalds, Dennis Zhou, Guenter Roeck, Catalin Marinas,
	Linux Kernel Mailing List, Geert Uytterhoeven, linux-m68k

On Tue, Jul 26, 2022 at 06:33:55PM -0700, Yury Norov wrote:
> On Tue, Jul 26, 2022 at 5:15 PM Russell King (Oracle)
> <linux@armlinux.org.uk> wrote:
> >
> > On Tue, Jul 26, 2022 at 01:20:23PM -0700, Linus Torvalds wrote:
> > > On Tue, Jul 26, 2022 at 12:44 PM Russell King (Oracle)
> > > <linux@armlinux.org.uk> wrote:
> > > >
> > > > Overall, I would say it's pretty similar (some generic perform
> > > > marginally better, some native perform marginally better) with the
> > > > exception of find_first_bit() being much better with the generic
> > > > implementation, but find_next_zero_bit() being noticably worse.
> > >
> > > The generic _find_first_bit() code is actually sane and simple. It
> > > loops over words until it finds a non-zero one, and then does trivial
> > > calculations on that last word.
> > >
> > > That explains why the generic code does so much better than your byte-wise asm.
> > >
> > > In contrast, the generic _find_next_bit() I find almost offensively
> > > silly - which in turn explains why your byte-wide asm does better.
> > >
> > > I think the generic _find_next_bit() should actually do what the m68k
> > > find_next_bit code does: handle the first special word itself, and
> > > then just call find_first_bit() on the rest of it.
> > >
> > > And it should *not* try to handle the dynamic "bswap and/or bit sense
> > > invert" thing at all. That should be just four different (trivial)
> > > cases for the first word.
> >
> > Here's the results for the native version converted to use word loads:
> >
> > [   37.319937]
> >                Start testing find_bit() with random-filled bitmap
> > [   37.330289] find_next_bit:                 2222703 ns, 163781 iterations
> > [   37.339186] find_next_zero_bit:            2154375 ns, 163900 iterations
> > [   37.348118] find_last_bit:                 2208104 ns, 163780 iterations
> > [   37.372564] find_first_bit:               17722203 ns,  16370 iterations
> > [   37.737415] find_first_and_bit:          358135191 ns,  32453 iterations
> > [   37.745420] find_next_and_bit:             1280537 ns,  73644 iterations
> > [   37.752143]
> >                Start testing find_bit() with sparse bitmap
> > [   37.759032] find_next_bit:                   41256 ns,    655 iterations
> > [   37.769905] find_next_zero_bit:            4148410 ns, 327026 iterations
> > [   37.776675] find_last_bit:                   48742 ns,    655 iterations
> > [   37.790961] find_first_bit:                7562371 ns,    655 iterations
> > [   37.797743] find_first_and_bit:              47366 ns,      1 iterations
> > [   37.804527] find_next_and_bit:               59924 ns,      1 iterations
> >
> > which is generally faster than the generic version, with the exception
> > of the sparse find_first_bit (generic was:
> > [   25.657304] find_first_bit:                7328573 ns,    656 iterations)
> >
> > find_next_{,zero_}bit() in the sparse case are quite a bit faster than
> > the generic code.
> 
> Look at find_{first,next}_and_bit results. Those two have no arch version
> and in both cases use generic code. In theory they should be equally fast
> before and after, but your testing says that generic case is slower even
> for them, and the difference is comparable with real arch functions numbers.
> It makes me feel like:
>  - there's something unrelated, like governor/throttling that affect results;
>  - the numbers are identical, taking the dispersion into account.
> 
> If the difference really concerns you, I'd suggest running the test
> several times
> to measure confidence intervals.

Given that the benchmark is run against random bitmaps and with
interrupts enabled, there is going to be noise in the results.

Here's the second run:

[26234.429389]
               Start testing find_bit() with random-filled bitmap
[26234.439722] find_next_bit:                 2206687 ns, 164277 iterations
[26234.448664] find_next_zero_bit:            2188368 ns, 163404 iterations
[26234.457612] find_last_bit:                 2223742 ns, 164278 iterations
[26234.482056] find_first_bit:               17720726 ns,  16384 iterations
[26234.859374] find_first_and_bit:          370602019 ns,  32877 iterations
[26234.867379] find_next_and_bit:             1280651 ns,  74091 iterations
[26234.874107]
               Start testing find_bit() with sparse bitmap
[26234.881014] find_next_bit:                   46142 ns,    656 iterations
[26234.891900] find_next_zero_bit:            4158987 ns, 327025 iterations
[26234.898672] find_last_bit:                   49727 ns,    656 iterations
[26234.912504] find_first_bit:                7107862 ns,    656 iterations
[26234.919290] find_first_and_bit:              52092 ns,      1 iterations
[26234.926076] find_next_and_bit:               60856 ns,      1 iterations

And a third run:

[26459.679524]
               Start testing find_bit() with random-filled bitmap
[26459.689871] find_next_bit:                 2199418 ns, 163311 iterations
[26459.698798] find_next_zero_bit:            2181289 ns, 164370 iterations
[26459.707738] find_last_bit:                 2213638 ns, 163311 iterations
[26459.732224] find_first_bit:               17764152 ns,  16429 iterations
[26460.133823] find_first_and_bit:          394886375 ns,  32672 iterations
[26460.141818] find_next_and_bit:             1269693 ns,  73485 iterations
[26460.148545]
               Start testing find_bit() with sparse bitmap
[26460.155433] find_next_bit:                   40753 ns,    653 iterations
[26460.166307] find_next_zero_bit:            4148211 ns, 327028 iterations
[26460.173078] find_last_bit:                   50017 ns,    653 iterations
[26460.187007] find_first_bit:                7205325 ns,    653 iterations
[26460.193790] find_first_and_bit:              49358 ns,      1 iterations
[26460.200577] find_next_and_bit:               62332 ns,      1 iterations

My gut feeling is that yes, there is some variance, but not of an
order significant enough to allow us to say "there's no difference".

find_next_bit results for random are: 2222703, 2206687, 2199418,
which is an average of 2209603 and a variance of around 0.5%.
The difference between this and the single generic figure I have
is on the order of 20%.

I'll do the same with find_first_bit for random: 17722203, 17720726,
and 17764152. Average is 17735694. Variance is around 0.1% or 0.2%.
The difference between this and the single generic figure I have is
on the order of 5%. Not so large, but still quite a big difference
compared to the variance.

find_first_bit for sparse: 7562371, 7107862, 7205325. Average is
7291853. Variance is higher at about 4%. Difference between this and
the generic figure is 0.5%, so this one is not significantly
different.

The best result looks to be find_next_zero_bit for the sparse bitmap
case. The generic code measures 5.5ms, the native code is sitting
around 4.1ms. That's a difference of around 34%, and by just looking
at the range in the figures above we can see this is a significant
result without needing to do the calculations. Similar is true of
find_next_bit for the sparse bitmap.

So, I think the results are significant in most cases and variance
doesn't account for the differences. The only one which isn't is
find_first_bit for the sparse case.
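
As a cross-check of the arithmetic above (a sketch; rel_var is a
made-up helper, and "variance" is read here as relative standard
deviation, which reproduces the ~0.5% figure):

```c
/*
 * rel_var() returns the population variance of the samples divided by
 * the square of their mean, i.e. the squared relative standard
 * deviation.  Its square root is the percentage "variance" quoted in
 * the analysis above.
 */
static double rel_var(const double *v, int n, double *mean_out)
{
	double mean = 0.0, var = 0.0;
	int i;

	for (i = 0; i < n; i++)
		mean += v[i];
	mean /= n;
	for (i = 0; i < n; i++)
		var += (v[i] - mean) * (v[i] - mean);
	var /= n;
	*mean_out = mean;
	return var / (mean * mean);
}
```

Feeding in the three native find_next_bit runs gives a mean of about
2209603 ns, a relative standard deviation of about 0.44%, and a ~20%
gap to the single generic figure of 2640943 ns, consistent with the
numbers quoted above.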

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: Linux 5.19-rc8
  2022-07-27  1:33                   ` Yury Norov
  2022-07-27  7:43                     ` Russell King (Oracle)
@ 2022-07-27  7:46                     ` David Laight
  1 sibling, 0 replies; 11+ messages in thread
From: David Laight @ 2022-07-27  7:46 UTC (permalink / raw)
  To: 'Yury Norov', Russell King (Oracle)
  Cc: Linus Torvalds, Dennis Zhou, Guenter Roeck, Catalin Marinas,
	Linux Kernel Mailing List, Geert Uytterhoeven,
	linux-m68k@lists.linux-m68k.org

From: Yury Norov
> Sent: 27 July 2022 02:34
...
> If the difference really concerns you, I'd suggest running the test
> several times to measure confidence intervals.

Or, do what I've been doing and get an accurate cycle count
for a single call and repeat about 10 times.
The first value is typically slow (loading I$), the
rest typically identical.
They can often even be matched to the expected value!
The fastest value is the one to quote!

On x86 you have to use the perf cycle counter, the tsc
is just useless.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Linux 5.19-rc8
  2022-07-27  7:43                     ` Russell King (Oracle)
@ 2022-07-30 21:38                       ` Yury Norov
  2022-08-01 15:48                         ` Russell King (Oracle)
  0 siblings, 1 reply; 11+ messages in thread
From: Yury Norov @ 2022-07-30 21:38 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Linus Torvalds, Dennis Zhou, Guenter Roeck, Catalin Marinas,
	Linux Kernel Mailing List, Geert Uytterhoeven, Alexey Klimov,
	linux-m68k

On Wed, Jul 27, 2022 at 08:43:22AM +0100, Russell King (Oracle) wrote:
> On Tue, Jul 26, 2022 at 06:33:55PM -0700, Yury Norov wrote:
> > On Tue, Jul 26, 2022 at 5:15 PM Russell King (Oracle)
> > <linux@armlinux.org.uk> wrote:
> > >
> > > On Tue, Jul 26, 2022 at 01:20:23PM -0700, Linus Torvalds wrote:
> > > > On Tue, Jul 26, 2022 at 12:44 PM Russell King (Oracle)
> > > > <linux@armlinux.org.uk> wrote:
> > > > >
> > > > > Overall, I would say it's pretty similar (some generic perform
> > > > > marginally better, some native perform marginally better) with the
> > > > > exception of find_first_bit() being much better with the generic
> > > > > implementation, but find_next_zero_bit() being noticably worse.
> > > >
> > > > The generic _find_first_bit() code is actually sane and simple. It
> > > > loops over words until it finds a non-zero one, and then does trivial
> > > > calculations on that last word.
> > > >
> > > > That explains why the generic code does so much better than your byte-wise asm.
> > > >
> > > > In contrast, the generic _find_next_bit() I find almost offensively
> > > > silly - which in turn explains why your byte-wide asm does better.
> > > >
> > > > I think the generic _find_next_bit() should actually do what the m68k
> > > > find_next_bit code does: handle the first special word itself, and
> > > > then just call find_first_bit() on the rest of it.
> > > >
> > > > And it should *not* try to handle the dynamic "bswap and/or bit sense
> > > > invert" thing at all. That should be just four different (trivial)
> > > > cases for the first word.
> > >
> > > Here's the results for the native version converted to use word loads:
> > >
> > > [   37.319937]
> > >                Start testing find_bit() with random-filled bitmap
> > > [   37.330289] find_next_bit:                 2222703 ns, 163781 iterations
> > > [   37.339186] find_next_zero_bit:            2154375 ns, 163900 iterations
> > > [   37.348118] find_last_bit:                 2208104 ns, 163780 iterations
> > > [   37.372564] find_first_bit:               17722203 ns,  16370 iterations
> > > [   37.737415] find_first_and_bit:          358135191 ns,  32453 iterations
> > > [   37.745420] find_next_and_bit:             1280537 ns,  73644 iterations
> > > [   37.752143]
> > >                Start testing find_bit() with sparse bitmap
> > > [   37.759032] find_next_bit:                   41256 ns,    655 iterations
> > > [   37.769905] find_next_zero_bit:            4148410 ns, 327026 iterations
> > > [   37.776675] find_last_bit:                   48742 ns,    655 iterations
> > > [   37.790961] find_first_bit:                7562371 ns,    655 iterations
> > > [   37.797743] find_first_and_bit:              47366 ns,      1 iterations
> > > [   37.804527] find_next_and_bit:               59924 ns,      1 iterations
> > >
> > > which is generally faster than the generic version, with the exception
> > > of the sparse find_first_bit (generic was:
> > > [   25.657304] find_first_bit:                7328573 ns,    656 iterations)
> > >
> > > find_next_{,zero_}bit() in the sparse case are quite a bit faster than
> > > the generic code.
> > 
> > Look at find_{first,next}_and_bit results. Those two have no arch version
> > and in both cases use generic code. In theory they should be equally fast
> > before and after, but your testing says that generic case is slower even
> > for them, and the difference is comparable with real arch functions numbers.
> > It makes me feel like:
> >  - there's something unrelated, like governor/throttling that affect results;
> >  - the numbers are identical, taking the dispersion into account.
> > 
> > If the difference really concerns you, I'd suggest running the test
> > several times
> > to measure confidence intervals.
> 
> Given that the benchmark is run against random bitmaps and with
> interrupts enabled, there is going to be noise in the results.
> 
> Here's the second run:
> 
> [26234.429389]
>                Start testing find_bit() with random-filled bitmap
> [26234.439722] find_next_bit:                 2206687 ns, 164277 iterations
> [26234.448664] find_next_zero_bit:            2188368 ns, 163404 iterations
> [26234.457612] find_last_bit:                 2223742 ns, 164278 iterations
> [26234.482056] find_first_bit:               17720726 ns,  16384 iterations
> [26234.859374] find_first_and_bit:          370602019 ns,  32877 iterations
> [26234.867379] find_next_and_bit:             1280651 ns,  74091 iterations
> [26234.874107]
>                Start testing find_bit() with sparse bitmap
> [26234.881014] find_next_bit:                   46142 ns,    656 iterations
> [26234.891900] find_next_zero_bit:            4158987 ns, 327025 iterations
> [26234.898672] find_last_bit:                   49727 ns,    656 iterations
> [26234.912504] find_first_bit:                7107862 ns,    656 iterations
> [26234.919290] find_first_and_bit:              52092 ns,      1 iterations
> [26234.926076] find_next_and_bit:               60856 ns,      1 iterations
> 
> And a third run:
> 
> [26459.679524]
>                Start testing find_bit() with random-filled bitmap
> [26459.689871] find_next_bit:                 2199418 ns, 163311 iterations
> [26459.698798] find_next_zero_bit:            2181289 ns, 164370 iterations
> [26459.707738] find_last_bit:                 2213638 ns, 163311 iterations
> [26459.732224] find_first_bit:               17764152 ns,  16429 iterations
> [26460.133823] find_first_and_bit:          394886375 ns,  32672 iterations
> [26460.141818] find_next_and_bit:             1269693 ns,  73485 iterations
> [26460.148545]
>                Start testing find_bit() with sparse bitmap
> [26460.155433] find_next_bit:                   40753 ns,    653 iterations
> [26460.166307] find_next_zero_bit:            4148211 ns, 327028 iterations
> [26460.173078] find_last_bit:                   50017 ns,    653 iterations
> [26460.187007] find_first_bit:                7205325 ns,    653 iterations
> [26460.193790] find_first_and_bit:              49358 ns,      1 iterations
> [26460.200577] find_next_and_bit:               62332 ns,      1 iterations
> 
> My gut feeling is that yes, there is some variance, but not of an
> order significant enough to allow us to say "there's no
> difference".
> 
> find_next_bit results for random are: 2222703, 2206687, 2199418,
> which is an average of 2209603 and a variance of around 0.5%.
> The difference between this and the single generic figure I have
> is on the order of 20%.
> 
> I'll do the same with find_first_bit for random: 17722203, 17720726,
> and 17764152. Average is 17735694. Variance is around 0.1% or 0.2%.
> The difference between this and the single generic figure I have is
> on the order of 5%. Not so large, but still quite a big difference
> compared to the variance.
> 
> find_first_bit for sparse: 7562371, 7107862, 7205325. Average is
> 7291853. Variance is higher at about 4%. Difference between this and
> the generic figure is 0.5%, so this one is not significantly
> different.
> 
> The best result looks to be find_next_zero_bit for the sparse bitmap
> case. The generic code measures 5.5ms, the native code is sitting
> around 4.1ms. That's a difference of around 34%, and by just looking
> at the range in the figures above we can see this is a significant
> result without needing to do the calculations. Similar is true of
> find_next_bit for the sparse bitmap.
> 
> So, I think the results are significant in most cases and variance
> doesn't account for the differences. The only one which isn't is
> find_first_bit for the sparse case.

Hi Russell,

+ Alexey Klimov <klimov.linux@gmail.com>

These are my test results for native vs generic find_bit operations
on A15 and A7.

The raw numbers were collected by Alexey Klimov on an Odroid-XU3, with
all CPU frequencies fixed at 1000 MHz. (Thanks a lot!)

For each native/generic @ a15/a7 configuration, the find_bit_benchmark 
was run 5 times, and the results are summarized below:

A15                      Native     Generic       Difference
Dense                        ns          ns       %   sigmas
find_next_bit:          3746929     3231641      14      8.3
find_next_zero_bit:     3935354     3202608      19     10.4
find_last_bit:          3134713     3129717       0      0.1
find_first_bit:        85626542    20498669      76    172.4
find_first_and_bit:   409252997   414820417      -1     -0.2
find_next_and_bit:      1678706     1654420       1      0.4
                                              
Sparse                                        
find_next_bit:          143208        77924      46     29.4
find_next_zero_bit:    6893375      6059177      12     14.3
find_last_bit:           67174        68616      -2     -1.0
find_first_bit:       33689256      8151493      76     47.8
find_first_and_bit:     124758       156974     -26     -1.3
find_next_and_bit:       53391        56716      -6     -0.2

A7                      Native      Generic       Difference
Dense                       ns           ns       %   sigmas
find_next_bit:         4207627      5532764     -31    -14.9
find_next_zero_bit:    4259961      5236880     -23    -10.0
find_last_bit:         4281386      4201025       2      1.5
find_first_bit:      236913620     50970424      78    163.3
find_first_and_bit:  728069762    750580781      -3     -0.7
find_next_and_bit:     2696263      2766077      -3     -0.9

Sparse
find_next_bit:          327241       143776      56     40.7
find_next_zero_bit:    6987249     10235989     -46    -21.8
find_last_bit:           97758        94725       3      0.6
find_first_bit:       94628040     21051964      78     41.8
find_first_and_bit:     248133       241267       3      0.3
find_next_and_bit:      136475       154000     -13     -0.5

The last column is the difference between native and generic code
performance normalized to a standard deviation:
        (mean(native) - mean(generic)) / max(std(native), std(generic))

The results look consistent to me because 'and' subtests that are always
generic differ by less than one sigma.

On A15, the generic code is a clear winner. On A7, the results are
inconsistent although significant. Maybe it's worth retesting on A7.
 
Regarding the choice between native and generic code: I would prefer
the generic version even if it were slightly slower, because it is
better tested and maintained. And since the test results are at least
on par, to me it's a no-brainer.

It would be really interesting to compare the performance of your
LDRB->LDR patch with the generic code using the same procedure.

Thanks,
Yury

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Linux 5.19-rc8
  2022-07-30 21:38                       ` Yury Norov
@ 2022-08-01 15:48                         ` Russell King (Oracle)
  2022-08-01 15:54                           ` Russell King (Oracle)
  0 siblings, 1 reply; 11+ messages in thread
From: Russell King (Oracle) @ 2022-08-01 15:48 UTC (permalink / raw)
  To: Yury Norov
  Cc: Linus Torvalds, Dennis Zhou, Guenter Roeck, Catalin Marinas,
	Linux Kernel Mailing List, Geert Uytterhoeven, Alexey Klimov,
	linux-m68k

Oh FFS.

I see you decided off your own bat to remove the ARM version of the
find_bit functions, with NO agreement from the arch maintainer. This
is not on.


On Sat, Jul 30, 2022 at 02:38:38PM -0700, Yury Norov wrote:
> On Wed, Jul 27, 2022 at 08:43:22AM +0100, Russell King (Oracle) wrote:
> > On Tue, Jul 26, 2022 at 06:33:55PM -0700, Yury Norov wrote:
> > > On Tue, Jul 26, 2022 at 5:15 PM Russell King (Oracle)
> > > <linux@armlinux.org.uk> wrote:
> > > >
> > > > On Tue, Jul 26, 2022 at 01:20:23PM -0700, Linus Torvalds wrote:
> > > > > On Tue, Jul 26, 2022 at 12:44 PM Russell King (Oracle)
> > > > > <linux@armlinux.org.uk> wrote:
> > > > > >
> > > > > > Overall, I would say it's pretty similar (some generic perform
> > > > > > marginally better, some native perform marginally better) with the
> > > > > > exception of find_first_bit() being much better with the generic
> > > > > > implementation, but find_next_zero_bit() being noticably worse.
> > > > >
> > > > > The generic _find_first_bit() code is actually sane and simple. It
> > > > > loops over words until it finds a non-zero one, and then does trivial
> > > > > calculations on that last word.
> > > > >
> > > > > That explains why the generic code does so much better than your byte-wise asm.
> > > > >
> > > > > In contrast, the generic _find_next_bit() I find almost offensively
> > > > > silly - which in turn explains why your byte-wide asm does better.
> > > > >
> > > > > I think the generic _find_next_bit() should actually do what the m68k
> > > > > find_next_bit code does: handle the first special word itself, and
> > > > > then just call find_first_bit() on the rest of it.
> > > > >
> > > > > And it should *not* try to handle the dynamic "bswap and/or bit sense
> > > > > invert" thing at all. That should be just four different (trivial)
> > > > > cases for the first word.
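The structure described above (handle the partial first word inline, then fall back to a plain word-at-a-time first-bit search) can be sketched roughly as follows. This is an illustrative sketch only, not the kernel's actual implementation; the `sketch_` names are mine:

```c
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Word-at-a-time search: skip zero words, then resolve the first
 * non-zero word with a count-trailing-zeros.  Returns @size if no
 * set bit is found. */
static unsigned long sketch_find_first_bit(const unsigned long *addr,
					   unsigned long size)
{
	unsigned long idx;

	for (idx = 0; idx * BITS_PER_LONG < size; idx++) {
		if (addr[idx]) {
			unsigned long bit = idx * BITS_PER_LONG +
					    __builtin_ctzl(addr[idx]);
			return bit < size ? bit : size;
		}
	}
	return size;
}

static unsigned long sketch_find_next_bit(const unsigned long *addr,
					  unsigned long size,
					  unsigned long offset)
{
	unsigned long idx = offset / BITS_PER_LONG;
	unsigned long word;

	if (offset >= size)
		return size;

	/* Special-case the first word: mask off bits below @offset. */
	word = addr[idx] & (~0UL << (offset % BITS_PER_LONG));
	if (word) {
		unsigned long bit = idx * BITS_PER_LONG + __builtin_ctzl(word);
		return bit < size ? bit : size;
	}

	/* Remaining words: a plain first-bit search. */
	offset = (idx + 1) * BITS_PER_LONG;
	if (offset >= size)
		return size;
	return offset + sketch_find_first_bit(addr + idx + 1, size - offset);
}
```

The four trivial variants (byte-swapped and/or bit-inverted) would differ only in how the first word is loaded and masked.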
> > > >
> > > > Here's the results for the native version converted to use word loads:
> > > >
> > > > [   37.319937]
> > > >                Start testing find_bit() with random-filled bitmap
> > > > [   37.330289] find_next_bit:                 2222703 ns, 163781 iterations
> > > > [   37.339186] find_next_zero_bit:            2154375 ns, 163900 iterations
> > > > [   37.348118] find_last_bit:                 2208104 ns, 163780 iterations
> > > > [   37.372564] find_first_bit:               17722203 ns,  16370 iterations
> > > > [   37.737415] find_first_and_bit:          358135191 ns,  32453 iterations
> > > > [   37.745420] find_next_and_bit:             1280537 ns,  73644 iterations
> > > > [   37.752143]
> > > >                Start testing find_bit() with sparse bitmap
> > > > [   37.759032] find_next_bit:                   41256 ns,    655 iterations
> > > > [   37.769905] find_next_zero_bit:            4148410 ns, 327026 iterations
> > > > [   37.776675] find_last_bit:                   48742 ns,    655 iterations
> > > > [   37.790961] find_first_bit:                7562371 ns,    655 iterations
> > > > [   37.797743] find_first_and_bit:              47366 ns,      1 iterations
> > > > [   37.804527] find_next_and_bit:               59924 ns,      1 iterations
> > > >
> > > > which is generally faster than the generic version, with the exception
> > > > of the sparse find_first_bit (generic was:
> > > > [   25.657304] find_first_bit:                7328573 ns,    656 iterations)
> > > >
> > > > find_next_{,zero_}bit() in the sparse case are quite a bit faster than
> > > > the generic code.
> > > 
> > > Look at find_{first,next}_and_bit results. Those two have no arch version
> > > and in both cases use generic code. In theory they should be equally fast
> > > before and after, but your testing says that generic case is slower even
> > > for them, and the difference is comparable with real arch functions numbers.
> > > It makes me feel like:
> > >  - there's something unrelated, like governor/throttling, that affects results;
> > >  - the numbers are identical, taking the dispersion into account.
> > > 
> > > If the difference really concerns you, I'd suggest running the test
> > > several times
> > > to measure confidence intervals.
> > 
> > Given that the benchmark is run against random bitmaps and with
> > interrupts enabled, there is going to be noise in the results.
> > 
> > Here's the second run:
> > 
> > [26234.429389]
> >                Start testing find_bit() with random-filled bitmap
> > [26234.439722] find_next_bit:                 2206687 ns, 164277 iterations
> > [26234.448664] find_next_zero_bit:            2188368 ns, 163404 iterations
> > [26234.457612] find_last_bit:                 2223742 ns, 164278 iterations
> > [26234.482056] find_first_bit:               17720726 ns,  16384 iterations
> > [26234.859374] find_first_and_bit:          370602019 ns,  32877 iterations
> > [26234.867379] find_next_and_bit:             1280651 ns,  74091 iterations
> > [26234.874107]
> >                Start testing find_bit() with sparse bitmap
> > [26234.881014] find_next_bit:                   46142 ns,    656 iterations
> > [26234.891900] find_next_zero_bit:            4158987 ns, 327025 iterations
> > [26234.898672] find_last_bit:                   49727 ns,    656 iterations
> > [26234.912504] find_first_bit:                7107862 ns,    656 iterations
> > [26234.919290] find_first_and_bit:              52092 ns,      1 iterations
> > [26234.926076] find_next_and_bit:               60856 ns,      1 iterations
> > 
> > And a third run:
> > 
> > [26459.679524]
> >                Start testing find_bit() with random-filled bitmap
> > [26459.689871] find_next_bit:                 2199418 ns, 163311 iterations
> > [26459.698798] find_next_zero_bit:            2181289 ns, 164370 iterations
> > [26459.707738] find_last_bit:                 2213638 ns, 163311 iterations
> > [26459.732224] find_first_bit:               17764152 ns,  16429 iterations
> > [26460.133823] find_first_and_bit:          394886375 ns,  32672 iterations
> > [26460.141818] find_next_and_bit:             1269693 ns,  73485 iterations
> > [26460.148545]
> >                Start testing find_bit() with sparse bitmap
> > [26460.155433] find_next_bit:                   40753 ns,    653 iterations
> > [26460.166307] find_next_zero_bit:            4148211 ns, 327028 iterations
> > [26460.173078] find_last_bit:                   50017 ns,    653 iterations
> > [26460.187007] find_first_bit:                7205325 ns,    653 iterations
> > [26460.193790] find_first_and_bit:              49358 ns,      1 iterations
> > [26460.200577] find_next_and_bit:               62332 ns,      1 iterations
> > 
> > My gut feeling is that yes, there is some variance, but not of an
> > order significant enough to allow us to say "there's no
> > difference".
> > 
> > find_next_bit results for random are: 2222703, 2206687, 2199418,
> > which is an average of 2209603 and a variance of around 0.5%.
> > The difference between this and the single generic figure I have
> > is on the order of 20%.
> > 
> > I'll do the same with find_first_bit for random: 17722203, 17720726,
> > and 17764152. Average is 17735694. Variance is around 0.1% or 0.2%.
> > The difference between this and the single generic figure I have is
> > on the order of 5%. Not so large, but still quite a big difference
> > compared to the variance.
> > 
> > find_first_bit for sparse: 7562371, 7107862, 7205325. Average is
> > 7291853. Variance is higher at about 4%. Difference between this and
> > the generic figure is 0.5%, so this one is not significantly
> > different.
> > 
> > The best result looks to be find_next_zero_bit for the sparse bitmap
> > case. The generic code measures 5.5ms, the native code is sitting
> > around 4.1ms. That's a difference of around 34%, and by just looking
> > at the range in the figures above we can see this is a significant
> > result without needing to do the calculations. Similar is true of
> > find_next_bit for the sparse bitmap.
> > 
> > So, I think the results are significant in most cases and variance
> > doesn't account for the differences. The only one which isn't is
> > find_first_bit for the sparse case.
> 
> Hi Russell,
> 
> + Alexey Klimov <klimov.linux@gmail.com>
> 
> These are my test results for native vs generic find_bit operations
> on A15 and A7.
> 
> The raw numbers were collected by Alexey Klimov on an Odroid-XU3, with
> all CPU frequencies fixed at 1000 MHz. (Thanks a lot!)
> 
> For each native/generic @ a15/a7 configuration, the find_bit_benchmark 
> was run 5 times, and the results are summarized below:
> 
> A15                      Native     Generic       Difference
> Dense                        ns          ns       %   sigmas
> find_next_bit:          3746929     3231641      14      8.3
> find_next_zero_bit:     3935354     3202608      19     10.4
> find_last_bit:          3134713     3129717       0      0.1
> find_first_bit:        85626542    20498669      76    172.4
> find_first_and_bit:   409252997   414820417      -1     -0.2
> find_next_and_bit:      1678706     1654420       1      0.4
>                                               
> Sparse                                        
> find_next_bit:          143208        77924      46     29.4
> find_next_zero_bit:    6893375      6059177      12     14.3
> find_last_bit:           67174        68616      -2     -1.0
> find_first_bit:       33689256      8151493      76     47.8
> find_first_and_bit:     124758       156974     -26     -1.3
> find_next_and_bit:       53391        56716      -6     -0.2
> 
> A7                      Native      Generic       Difference
> Dense                       ns           ns       %   sigmas
> find_next_bit:         4207627      5532764     -31    -14.9
> find_next_zero_bit:    4259961      5236880     -23    -10.0
> find_last_bit:         4281386      4201025       2      1.5
> find_first_bit:      236913620     50970424      78    163.3
> find_first_and_bit:  728069762    750580781      -3     -0.7
> find_next_and_bit:     2696263      2766077      -3     -0.9
> 
> Sparse
> find_next_bit:          327241       143776      56     40.7
> find_next_zero_bit:    6987249     10235989     -46    -21.8
> find_last_bit:           97758        94725       3      0.6
> find_first_bit:       94628040     21051964      78     41.8
> find_first_and_bit:     248133       241267       3      0.3
> find_next_and_bit:      136475       154000     -13     -0.5
> 
> The last column is the difference between native and generic code
> performance normalized to a standard deviation:
>         (mean(native) - mean(generic)) / max(std(native), std(generic))
> 
> The results look consistent to me because 'and' subtests that are always
> generic differ by less than one sigma.
> 
> On A15, the generic code is a clear winner. On A7, the results are
> inconsistent although significant. Maybe it's worth retesting on A7.
>  
> Regarding the choice between native and generic code: I would prefer
> the generic version even if it were slightly slower, because it is
> better tested and maintained. And since the test results are at least
> on par, to me it's a no-brainer.
> 
> It would be really interesting to compare the performance of your
> LDRB->LDR patch with the generic code using the same procedure.
> 
> Thanks,
> Yury
> 

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Linux 5.19-rc8
  2022-08-01 15:48                         ` Russell King (Oracle)
@ 2022-08-01 15:54                           ` Russell King (Oracle)
  0 siblings, 0 replies; 11+ messages in thread
From: Russell King (Oracle) @ 2022-08-01 15:54 UTC (permalink / raw)
  To: Yury Norov
  Cc: Linus Torvalds, Dennis Zhou, Guenter Roeck, Catalin Marinas,
	Linux Kernel Mailing List, Geert Uytterhoeven, Alexey Klimov,
	linux-m68k

On Mon, Aug 01, 2022 at 04:48:21PM +0100, Russell King (Oracle) wrote:
> Oh FFS.
> 
> I see you decided off your own bat to remove the ARM version of the
> find_bit functions, with NO agreement from the arch maintainer. This
> is not on.

Sorry, my mistake, I'm getting confused with git over conflicts which
aren't making much sense.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-08-01 15:55 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAHk-=wiWwDYxNhnStS0e+p-NTFAQSHvab=2-8LwxunCVuY5-2A@mail.gmail.com>
     [not found] ` <20220725161141.GA1306881@roeck-us.net>
     [not found]   ` <CAHk-=whtGUwJwHUSNsXd4g7cok=n0Zwje7nACp8skh1fa2NtJA@mail.gmail.com>
     [not found]     ` <YuAm5h1B6bsrR/9q@fedora>
     [not found]       ` <CAHk-=wgYpJTMMxmfbpqc=JVtSK0Zj4b15G=AvEYk6vPNySDSsA@mail.gmail.com>
2022-07-26 18:18         ` Linux 5.19-rc8 Yury Norov
2022-07-26 18:36           ` Linus Torvalds
2022-07-26 19:44             ` Russell King (Oracle)
2022-07-26 20:20               ` Linus Torvalds
2022-07-27  0:15                 ` Russell King (Oracle)
2022-07-27  1:33                   ` Yury Norov
2022-07-27  7:43                     ` Russell King (Oracle)
2022-07-30 21:38                       ` Yury Norov
2022-08-01 15:48                         ` Russell King (Oracle)
2022-08-01 15:54                           ` Russell King (Oracle)
2022-07-27  7:46                     ` David Laight

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox