* block-sha1: improve code on large-register-set machines
@ 2009-08-10 23:52 Linus Torvalds
2009-08-11 6:15 ` Nicolas Pitre
0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2009-08-10 23:52 UTC (permalink / raw)
To: Git Mailing List, Junio C Hamano
For x86 performance (especially in 32-bit mode) I added that hack to write
the SHA1 internal temporary hash using a volatile pointer, in order to get
gcc to not try to cache the array contents. Because gcc will do all the
wrong things, and then spill things in insane random ways.
But on architectures like PPC, where you have 32 registers, it's actually
perfectly reasonable to put the whole temporary array[] into the register
set, and gcc can do so.
So make the 'volatile unsigned int *' cast be dependent on a
SMALL_REGISTER_SET preprocessor symbol, and enable it (currently) on just
x86 and x86-64. With that, the routine is fairly reasonable even when
compared to the hand-scheduled PPC version. Ben Herrenschmidt reports on
a G5:
* Paulus asm version: about 3.67s
* Yours with no change: about 5.74s
* Yours without "volatile": about 3.78s
so with this the C version is within about 3% of the asm one.
And add a lot of commentary on what the heck is going on.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
I also asked David Miller to test the non-volatile version on Sparc, but I
suspect it will have the same pattern. ia64 likewise (but I have not asked
anybody to test).
Of the other architectures, ARM probably would wants SMALL_REGISTER_SET,
but I suspect the problem there is the htonl() (on little-endian), and
possibly the unaligned loads - at least on older ARM. The latter is
something gcc could be taught about, though (the SHA_SRC macro would just
need to use a pointer that goes through a packed struct member or
something).
block-sha1/sha1.c | 25 ++++++++++++++++++++++++-
1 files changed, 24 insertions(+), 1 deletions(-)
diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index 36da763..9bc8b8a 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -82,6 +82,7 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
#define SHA_ASM(op, x, n) ({ unsigned int __res; asm(op " %1,%0":"=r" (__res):"i" (n), "0" (x)); __res; })
#define SHA_ROL(x,n) SHA_ASM("rol", x, n)
#define SHA_ROR(x,n) SHA_ASM("ror", x, n)
+#define SMALL_REGISTER_SET
#else
@@ -93,7 +94,29 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
/* This "rolls" over the 512-bit array */
#define W(x) (array[(x)&15])
-#define setW(x, val) (*(volatile unsigned int *)&W(x) = (val))
+
+/*
+ * If you have 32 registers or more, the compiler can (and should)
+ * try to change the array[] accesses into registers. However, on
+ * machines with less than ~25 registers, that won't really work,
+ * and at least gcc will make an unholy mess of it.
+ *
+ * So to avoid that mess which just slows things down, we force
+ * the stores to memory to actually happen (we might be better off
+ * with a 'W(t)=(val);asm("":"+m" (W(t))' there instead, as
+ * suggested by Artur Skawina - that will also make gcc unable to
+ * try to do the silly "optimize away loads" part because it won't
+ * see what the value will be).
+ *
+ * Ben Herrenschmidt reports that on PPC, the C version comes close
+ * to the optimized asm with this (ie on PPC you don't want that
+ * 'volatile', since there are lots of registers).
+ */
+#ifdef SMALL_REGISTER_SET
+ #define setW(x, val) (*(volatile unsigned int *)&W(x) = (val))
+#else
+ #define setW(x, val) (W(x) = (val))
+#endif
/*
* Where do we get the source from? The first 16 iterations get it from
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: block-sha1: improve code on large-register-set machines
2009-08-10 23:52 block-sha1: improve code on large-register-set machines Linus Torvalds
@ 2009-08-11 6:15 ` Nicolas Pitre
2009-08-11 15:04 ` Linus Torvalds
2009-08-11 15:43 ` Linus Torvalds
0 siblings, 2 replies; 16+ messages in thread
From: Nicolas Pitre @ 2009-08-11 6:15 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano
On Mon, 10 Aug 2009, Linus Torvalds wrote:
>
> For x86 performance (especially in 32-bit mode) I added that hack to write
> the SHA1 internal temporary hash using a volatile pointer, in order to get
> gcc to not try to cache the array contents. Because gcc will do all the
> wrong things, and then spill things in insane random ways.
>
> But on architectures like PPC, where you have 32 registers, it's actually
> perfectly reasonable to put the whole temporary array[] into the register
> set, and gcc can do so.
>
> So make the 'volatile unsigned int *' cast be dependent on a
> SMALL_REGISTER_SET preprocessor symbol, and enable it (currently) on just
> x86 and x86-64. With that, the routine is fairly reasonable even when
> compared to the hand-scheduled PPC version. Ben Herrenschmidt reports on
> a G5:
>
> * Paulus asm version: about 3.67s
> * Yours with no change: about 5.74s
> * Yours without "volatile": about 3.78s
>
> so with this the C version is within about 3% of the asm one.
>
> And add a lot of commentary on what the heck is going on.
>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> ---
>
> I also asked David Miller to test the non-volatile version on Sparc, but I
> suspect it will have the same pattern. ia64 likewise (but I have not asked
> anybody to test).
>
> Of the other architectures, ARM probably would wants SMALL_REGISTER_SET,
> but I suspect the problem there is the htonl() (on little-endian), and
> possibly the unaligned loads - at least on older ARM. The latter is
> something gcc could be taught about, though (the SHA_SRC macro would just
> need to use a pointer that goes through a packed struct member or
> something).
The "older" ARM (those that don't perform unaligned accesses in
hardware) are still the majority by far in the field.
Here some numbers on ARM for 203247018 bytes.
MOZILLA_SHA1: 14.520s
ARM_SHA1: 5.600s
OPENSSL: 5.530s
BLK_SHA1: 5.280s [original]
BLK_SHA1: 7.410s [with SMALL_REGISTER_SET defined]
BLK_SHA1: 7.480s [with 'W(x)=(val);asm("":"+m" (W(x)))']
BLK_SHA1: 4.980s [with 'W(x)=(val);asm("":::"memory")']
At this point the generated assembly is pretty slick. I bet the full
memory barrier might help on x86 as well.
However the above BLK_SHA1 works only for aligned source buffers. So
let's define our own SHA_SRC to replace the htonl() (which should
probably be ntohl() by the way) like this:
#define SHA_SRC(t) \
({ unsigned char *__d = (unsigned char *)&data[t]; \
(__d[0] << 24) | (__d[1] << 16) | (__d[2] << 8) | (__d[3] << 0); })
And this provides the exact same performance as the ntohl() based
version (4.980s) except that this now cope with unaligned buffers too.
Of course the BLK_SHA1 version is a pig since it is totally unrolled
text data bss dec hex filename
1220 0 0 1220 4c4 mozilla-sha1/sha1.o
852 0 0 852 354 arm/sha1_arm.o
6292 0 0 6292 1894 block-sha1/sha1.o
so the speed advantage has a significant (but relative) code size cost.
Nicolas
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: block-sha1: improve code on large-register-set machines
2009-08-11 6:15 ` Nicolas Pitre
@ 2009-08-11 15:04 ` Linus Torvalds
2009-08-11 18:00 ` Nicolas Pitre
2009-08-11 15:43 ` Linus Torvalds
1 sibling, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2009-08-11 15:04 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Git Mailing List, Junio C Hamano
On Tue, 11 Aug 2009, Nicolas Pitre wrote:
>
> #define SHA_SRC(t) \
> ({ unsigned char *__d = (unsigned char *)&data[t]; \
> (__d[0] << 24) | (__d[1] << 16) | (__d[2] << 8) | (__d[3] << 0); })
>
> And this provides the exact same performance as the ntohl() based
> version (4.980s) except that this now cope with unaligned buffers too.
Is it better to do a (conditional) memcpy up front? Or is the byte-based
one better just because you always end up doing the shifting anyway due to
most ARM situations being little-endian?
I _suspect_ that most large SHA1 calls from git are pre-aligned. The big
SHA1 calls are for pack-file verification in fsck, which should all be
aligned. Same goes for index file integrity checking.
The actual object SHA1 calculations are likely not aligned (we do that
object header thing), and if you can't do the htonl() any better way I
guess the byte-based thing is the way to go..
Linus
---
block-sha1/sha1.c | 13 ++++++++++++-
1 files changed, 12 insertions(+), 1 deletions(-)
diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index 9bc8b8a..df27e66 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -25,6 +25,12 @@ void blk_SHA1_Init(blk_SHA_CTX *ctx)
ctx->H[4] = 0xc3d2e1f0;
}
+#ifdef REALLY_SLOW_UNALIGNED
+ #define is_unaligned(ptr) (3 & (unsigned long)(ptr))
+#else
+ #define is_unaligned(ptr) 0
+#endif
+
void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len)
{
@@ -47,7 +53,12 @@ void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len)
blk_SHA1Block(ctx, ctx->W);
}
while (len >= 64) {
- blk_SHA1Block(ctx, data);
+ const unsigned int *block = data;
+ if (is_unaligned(data)) {
+ memcpy(ctx->W, data, 64);
+ block = ctx->W;
+ }
+ blk_SHA1Block(ctx, block);
data += 64;
len -= 64;
}
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: block-sha1: improve code on large-register-set machines
2009-08-11 6:15 ` Nicolas Pitre
2009-08-11 15:04 ` Linus Torvalds
@ 2009-08-11 15:43 ` Linus Torvalds
2009-08-11 20:03 ` Nicolas Pitre
1 sibling, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2009-08-11 15:43 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Git Mailing List, Junio C Hamano
On Tue, 11 Aug 2009, Nicolas Pitre wrote:
>
> BLK_SHA1: 5.280s [original]
> BLK_SHA1: 7.410s [with SMALL_REGISTER_SET defined]
> BLK_SHA1: 7.480s [with 'W(x)=(val);asm("":"+m" (W(x)))']
> BLK_SHA1: 4.980s [with 'W(x)=(val);asm("":::"memory")']
>
> At this point the generated assembly is pretty slick. I bet the full
> memory barrier might help on x86 as well.
No, I had tested that earlier - single-word memory barrier for some reason
gets _much_ better numbers at least on x86-64. We're talking
linus 1.46 418.2
vs
linus 2.004 304.6
kind of differences. With the "+m" it outperforms openssl (375-380MB/s).
The "volatile unsigned int *" cast looks pretty much like the "+m" version
to me, but Arthur got a speedup from whatever gcc code generation
differences on his P4.
The really fundamental and basic problem with gcc on this code is that gcc
does not see _any_ difference what-so-ever between the five variables
declared with
unsigned int A, B, C, D, E;
and the sixteen variables declared with
unsigned int array[16];
and considers those all to be 21 local variables. It really seems to think
that they are all 100% equivalent, and gcc totally ignores me doing things
like adding "register" to the A-E ones etc.
And if you are a compiler, and think that the routine has 21 equal
register variables, you're going to do crazy reload sh*t when you have
only 7 (or 15) GP registers. So doing that full memory barrier seems to
just take that random situation, and force some random variable to be
spilled (this is all from looking at the generated code, not from looking
at gcc).
In contrast, with the _targeted_ thing ("you'd better write back into
array[]") we force gcc to spill the array[16] values, and not the A-E
ones, and that's why it seems to make such a big difference.
And no, I'm not sure why ARM apparently doesn't show the same behavior. Or
maybe it does, but with an in-order core it doesn't matter as much which
registers you keep reloading - you'll be serialized all the time _anyway_.
Linus
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: block-sha1: improve code on large-register-set machines
2009-08-11 15:04 ` Linus Torvalds
@ 2009-08-11 18:00 ` Nicolas Pitre
2009-08-11 19:31 ` Nicolas Pitre
0 siblings, 1 reply; 16+ messages in thread
From: Nicolas Pitre @ 2009-08-11 18:00 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano
On Tue, 11 Aug 2009, Linus Torvalds wrote:
> On Tue, 11 Aug 2009, Nicolas Pitre wrote:
> >
> > #define SHA_SRC(t) \
> > ({ unsigned char *__d = (unsigned char *)&data[t]; \
> > (__d[0] << 24) | (__d[1] << 16) | (__d[2] << 8) | (__d[3] << 0); })
> >
> > And this provides the exact same performance as the ntohl() based
> > version (4.980s) except that this now cope with unaligned buffers too.
>
> Is it better to do a (conditional) memcpy up front? Or is the byte-based
> one better just because you always end up doing the shifting anyway due to
> most ARM situations being little-endian?
The vast majority of ARM processors where git might be run are using a
LE environment.
> I _suspect_ that most large SHA1 calls from git are pre-aligned. The big
> SHA1 calls are for pack-file verification in fsck, which should all be
> aligned. Same goes for index file integrity checking.
>
> The actual object SHA1 calculations are likely not aligned (we do that
> object header thing), and if you can't do the htonl() any better way I
> guess the byte-based thing is the way to go..
Let's see. The default ntohl() provided by glibc generates this code:
ldr r3, [r0, #0]
mov r0, r3, lsr #24
and r2, r3, #0x00ff0000
orr r0, r0, r3, asl #24
orr r0, r0, r2, lsr #8
and r3, r3, #0x0000ff00
orr r0, r0, r3, asl #8
Ignoring the load result delay that gcc should properly schedule anyway,
that makes for 7 cycles.
Using the smarter ARM swab32 implementation from Linux we get:
ldr r3, [r0, #0]
eor r0, r3, r3, ror #16
bic r0, r0, #0x00ff0000
mov r0, r0, lsr #8
eor r0, r0, r3, ror #8
So we're down to 5 cycles. And the SHA1 test is a bit faster too:
4.570s down from 4.980s. However this is still using purely aligned
buffers.
Adding your patch using memcpy() to align the data in the unaligned case
gives me wild results. Sometimes it is 4.930s, sometimes it is 5.560s.
I suspect the icache starts to get tight and sometimes the SHA1 code
and/or the special unaligned memcpy path get evicted sometimes.
Using the byte access version we get:
ldrb r3, [r0, #3]
ldrb r2, [r0, #0]
ldrb r1, [r0, #1]
orr r3, r3, r2, asl #24
ldrb r0, [r0, #2]
orr r3, r3, r1, asl #16
orr r0, r3, r0, asl #8
Again 7 cycles, like the ntohl() based version, which is coherent with
the fact that they both make for 4.980s.. However this time any buffer
alignment is supported. And in fact the extra 2 cycles over the
swab32() version should actually be less overhead per word than the
unaligned memcpy which is around 4 cycles per word.
Nicolas
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: block-sha1: improve code on large-register-set machines
2009-08-11 18:00 ` Nicolas Pitre
@ 2009-08-11 19:31 ` Nicolas Pitre
2009-08-11 21:20 ` Brandon Casey
0 siblings, 1 reply; 16+ messages in thread
From: Nicolas Pitre @ 2009-08-11 19:31 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano
On Tue, 11 Aug 2009, Nicolas Pitre wrote:
> On Tue, 11 Aug 2009, Linus Torvalds wrote:
>
> > On Tue, 11 Aug 2009, Nicolas Pitre wrote:
> > >
> > > #define SHA_SRC(t) \
> > > ({ unsigned char *__d = (unsigned char *)&data[t]; \
> > > (__d[0] << 24) | (__d[1] << 16) | (__d[2] << 8) | (__d[3] << 0); })
> > >
> > > And this provides the exact same performance as the ntohl() based
> > > version (4.980s) except that this now cope with unaligned buffers too.
> >
> > The actual object SHA1 calculations are likely not aligned (we do that
> > object header thing), and if you can't do the htonl() any better way I
> > guess the byte-based thing is the way to go..
OK, even better: 4.400s.
This is with this instead of the above:
#define SHA_SRC(t) \
({ unsigned char *__d = (unsigned char *)data; \
(__d[(t)*4 + 0] << 24) | (__d[(t)*4 + 1] << 16) | \
(__d[(t)*4 + 2] << 8) | (__d[(t)*4 + 3] << 0); })
The previous version would allocate a new register for __d and then
index it with an offset of 0, 1, 2 or 3. This version always uses the
register containing the data pointer with absolute offsets. The binary
is a bit smaller too.
Nicolas
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: block-sha1: improve code on large-register-set machines
2009-08-11 15:43 ` Linus Torvalds
@ 2009-08-11 20:03 ` Nicolas Pitre
2009-08-11 22:53 ` Linus Torvalds
0 siblings, 1 reply; 16+ messages in thread
From: Nicolas Pitre @ 2009-08-11 20:03 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano
On Tue, 11 Aug 2009, Linus Torvalds wrote:
> On Tue, 11 Aug 2009, Nicolas Pitre wrote:
> >
> > BLK_SHA1: 5.280s [original]
> > BLK_SHA1: 7.410s [with SMALL_REGISTER_SET defined]
> > BLK_SHA1: 7.480s [with 'W(x)=(val);asm("":"+m" (W(x)))']
> > BLK_SHA1: 4.980s [with 'W(x)=(val);asm("":::"memory")']
> >
> > At this point the generated assembly is pretty slick. I bet the full
> > memory barrier might help on x86 as well.
>
> No, I had tested that earlier - single-word memory barrier for some reason
> gets _much_ better numbers at least on x86-64. We're talking
>
> linus 1.46 418.2
> vs
> linus 2.004 304.6
>
> kind of differences. With the "+m" it outperforms openssl (375-380MB/s).
>
> The "volatile unsigned int *" cast looks pretty much like the "+m" version
> to me, but Arthur got a speedup from whatever gcc code generation
> differences on his P4.
The volatile pointer forces a write to memory but the cached value in
the processor's register remains valid, whereas the "+m" forces gcc to
assume the register copy is not valid anymore. That certainly gives the
compiler a different clue about register availability, etc.
> The really fundamental and basic problem with gcc on this code is that gcc
> does not see _any_ difference what-so-ever between the five variables
> declared with
>
> unsigned int A, B, C, D, E;
>
> and the sixteen variables declared with
>
> unsigned int array[16];
>
> and considers those all to be 21 local variables. It really seems to think
> that they are all 100% equivalent, and gcc totally ignores me doing things
> like adding "register" to the A-E ones etc.
>
> And if you are a compiler, and think that the routine has 21 equal
> register variables, you're going to do crazy reload sh*t when you have
> only 7 (or 15) GP registers. So doing that full memory barrier seems to
> just take that random situation, and force some random variable to be
> spilled (this is all from looking at the generated code, not from looking
> at gcc).
>
> In contrast, with the _targeted_ thing ("you'd better write back into
> array[]") we force gcc to spill the array[16] values, and not the A-E
> ones, and that's why it seems to make such a big difference.
>
> And no, I'm not sure why ARM apparently doesn't show the same behavior. Or
> maybe it does, but with an in-order core it doesn't matter as much which
> registers you keep reloading - you'll be serialized all the time _anyway_.
Well... gcc is really strange in this case (and similar other ones) with
ARM compilation. A good indicator of the quality of the code is the
size of the stack frame. When using the "+m" then gcc creates a 816
byte stack frame, the generated binary grows by approx 3000 bytes, and
performances is almost halved (7.600s). Looking at the assembly result
I just can't figure out all the crazy moves taking place. Even the
version with no barrier what so ever produces better assembly with a
stack frame of 560 bytes.
The volatile version is second to worst with a 808 byte stack frame with
similar bad performances.
With the full "memory" the stack frame shrinks to 280 bytes and best
performances so far is obtained. And none of the A B C D E or data
variables are ever spilled onto the stack either, only the array[16]
gets allocated stack slots, and the TEMP variable.
Nicolas
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: block-sha1: improve code on large-register-set machines
2009-08-11 19:31 ` Nicolas Pitre
@ 2009-08-11 21:20 ` Brandon Casey
2009-08-11 21:36 ` Nicolas Pitre
2009-08-11 22:57 ` Linus Torvalds
0 siblings, 2 replies; 16+ messages in thread
From: Brandon Casey @ 2009-08-11 21:20 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Linus Torvalds, Git Mailing List, Junio C Hamano
Nicolas Pitre wrote:
> On Tue, 11 Aug 2009, Nicolas Pitre wrote:
>
>> On Tue, 11 Aug 2009, Linus Torvalds wrote:
>>
>>> On Tue, 11 Aug 2009, Nicolas Pitre wrote:
>>>> #define SHA_SRC(t) \
>>>> ({ unsigned char *__d = (unsigned char *)&data[t]; \
>>>> (__d[0] << 24) | (__d[1] << 16) | (__d[2] << 8) | (__d[3] << 0); })
>>>>
>>>> And this provides the exact same performance as the ntohl() based
>>>> version (4.980s) except that this now cope with unaligned buffers too.
>>> The actual object SHA1 calculations are likely not aligned (we do that
>>> object header thing), and if you can't do the htonl() any better way I
>>> guess the byte-based thing is the way to go..
>
> OK, even better: 4.400s.
>
> This is with this instead of the above:
>
> #define SHA_SRC(t) \
> ({ unsigned char *__d = (unsigned char *)data; \
> (__d[(t)*4 + 0] << 24) | (__d[(t)*4 + 1] << 16) | \
> (__d[(t)*4 + 2] << 8) | (__d[(t)*4 + 3] << 0); })
>
> The previous version would allocate a new register for __d and then
> index it with an offset of 0, 1, 2 or 3. This version always uses the
> register containing the data pointer with absolute offsets. The binary
> is a bit smaller too.
In that case, why not change the interface of blk_SHA1Block() so that its
second argument is const unsigned char* and get rid of __d and the { } ?
Then it will look like this:
static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned char *data);
...
#define SHA_SRC(t) \
( (data[(t)*4 + 0] << 24) | (data[(t)*4 + 1] << 16) | \
(data[(t)*4 + 2] << 8) | (data[(t)*4 + 3] << 0) )
Plus, we need something like the following to handle storing the hash to
an unaligned buffer (warning copy/pasted):
@@ -73,8 +74,12 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *c
/* Output hash
*/
- for (i = 0; i < 5; i++)
- ((unsigned int *)hashout)[i] = htonl(ctx->H[i]);
+ for (i = 0; i < 5; i++) {
+ *hashout++ = (unsigned char) (ctx->H[i] >> 24);
+ *hashout++ = (unsigned char) (ctx->H[i] >> 16);
+ *hashout++ = (unsigned char) (ctx->H[i] >> 8);
+ *hashout++ = (unsigned char) (ctx->H[i] >> 0);
+ }
}
#if defined(__i386__) || defined(__x86_64__)
With these two changes plus a few other minor tweaks, the block-sha1 code compiles
and passes the test suite on sparc (solaris 7) and mips (irix 6.5).
-brandon
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: block-sha1: improve code on large-register-set machines
2009-08-11 21:20 ` Brandon Casey
@ 2009-08-11 21:36 ` Nicolas Pitre
2009-08-11 21:49 ` Brandon Casey
2009-08-11 22:57 ` Linus Torvalds
1 sibling, 1 reply; 16+ messages in thread
From: Nicolas Pitre @ 2009-08-11 21:36 UTC (permalink / raw)
To: Brandon Casey; +Cc: Linus Torvalds, Git Mailing List, Junio C Hamano
On Tue, 11 Aug 2009, Brandon Casey wrote:
> Nicolas Pitre wrote:
> > On Tue, 11 Aug 2009, Nicolas Pitre wrote:
> >
> >> On Tue, 11 Aug 2009, Linus Torvalds wrote:
> >>
> >>> On Tue, 11 Aug 2009, Nicolas Pitre wrote:
> >>>> #define SHA_SRC(t) \
> >>>> ({ unsigned char *__d = (unsigned char *)&data[t]; \
> >>>> (__d[0] << 24) | (__d[1] << 16) | (__d[2] << 8) | (__d[3] << 0); })
> >>>>
> >>>> And this provides the exact same performance as the ntohl() based
> >>>> version (4.980s) except that this now cope with unaligned buffers too.
> >>> The actual object SHA1 calculations are likely not aligned (we do that
> >>> object header thing), and if you can't do the htonl() any better way I
> >>> guess the byte-based thing is the way to go..
> >
> > OK, even better: 4.400s.
> >
> > This is with this instead of the above:
> >
> > #define SHA_SRC(t) \
> > ({ unsigned char *__d = (unsigned char *)data; \
> > (__d[(t)*4 + 0] << 24) | (__d[(t)*4 + 1] << 16) | \
> > (__d[(t)*4 + 2] << 8) | (__d[(t)*4 + 3] << 0); })
> >
> > The previous version would allocate a new register for __d and then
> > index it with an offset of 0, 1, 2 or 3. This version always uses the
> > register containing the data pointer with absolute offsets. The binary
> > is a bit smaller too.
>
> In that case, why not change the interface of blk_SHA1Block() so that its
> second argument is const unsigned char* and get rid of __d and the { } ?
Because not all architectures care to access the data bytewise. Please
see the original SHA_SRC definition.
Nicolas
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: block-sha1: improve code on large-register-set machines
2009-08-11 21:36 ` Nicolas Pitre
@ 2009-08-11 21:49 ` Brandon Casey
0 siblings, 0 replies; 16+ messages in thread
From: Brandon Casey @ 2009-08-11 21:49 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Linus Torvalds, Git Mailing List, Junio C Hamano
Nicolas Pitre wrote:
> On Tue, 11 Aug 2009, Brandon Casey wrote:
>
>> Nicolas Pitre wrote:
>>> On Tue, 11 Aug 2009, Nicolas Pitre wrote:
>>>
>>>> On Tue, 11 Aug 2009, Linus Torvalds wrote:
>>>>
>>>>> On Tue, 11 Aug 2009, Nicolas Pitre wrote:
>>>>>> #define SHA_SRC(t) \
>>>>>> ({ unsigned char *__d = (unsigned char *)&data[t]; \
>>>>>> (__d[0] << 24) | (__d[1] << 16) | (__d[2] << 8) | (__d[3] << 0); })
>>>>>>
>>>>>> And this provides the exact same performance as the ntohl() based
>>>>>> version (4.980s) except that this now cope with unaligned buffers too.
>>>>> The actual object SHA1 calculations are likely not aligned (we do that
>>>>> object header thing), and if you can't do the htonl() any better way I
>>>>> guess the byte-based thing is the way to go..
>>> OK, even better: 4.400s.
>>>
>>> This is with this instead of the above:
>>>
>>> #define SHA_SRC(t) \
>>> ({ unsigned char *__d = (unsigned char *)data; \
>>> (__d[(t)*4 + 0] << 24) | (__d[(t)*4 + 1] << 16) | \
>>> (__d[(t)*4 + 2] << 8) | (__d[(t)*4 + 3] << 0); })
>>>
>>> The previous version would allocate a new register for __d and then
>>> index it with an offset of 0, 1, 2 or 3. This version always uses the
>>> register containing the data pointer with absolute offsets. The binary
>>> is a bit smaller too.
>> In that case, why not change the interface of blk_SHA1Block() so that its
>> second argument is const unsigned char* and get rid of __d and the { } ?
>
> Because not all architectures care to access the data bytewise. Please
> see the original SHA_SRC definition.
You mean this:
#define SHA_SRC(t) htonl(data[t])
? Or was there a definition before this one?
I don't follow what you are saying. Are you saying that the following two
examples are different?
unsigned int *data;
#define SHA_SRC(t) \
({ unsigned char *__d = (unsigned char *)data; \
(__d[(t)*4 + 0] << 24) | (__d[(t)*4 + 1] << 16) | \
(__d[(t)*4 + 2] << 8) | (__d[(t)*4 + 3] << 0); })
and
unsigned char *data;
#define SHA_SRC(t) \
( (data[(t)*4 + 0] << 24) | (data[(t)*4 + 1] << 16) | \
(data[(t)*4 + 2] << 8) | (data[(t)*4 + 3] << 0) )
-brandon
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: block-sha1: improve code on large-register-set machines
2009-08-11 20:03 ` Nicolas Pitre
@ 2009-08-11 22:53 ` Linus Torvalds
2009-08-11 23:14 ` Linus Torvalds
2009-08-11 23:45 ` Artur Skawina
0 siblings, 2 replies; 16+ messages in thread
From: Linus Torvalds @ 2009-08-11 22:53 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Git Mailing List, Junio C Hamano
On Tue, 11 Aug 2009, Nicolas Pitre wrote:
>
> Well... gcc is really strange in this case (and similar other ones) with
> ARM compilation. A good indicator of the quality of the code is the
> size of the stack frame. When using the "+m" then gcc creates a 816
> byte stack frame, the generated binary grows by approx 3000 bytes, and
> performances is almost halved (7.600s). Looking at the assembly result
> I just can't figure out all the crazy moves taking place. Even the
> version with no barrier what so ever produces better assembly with a
> stack frame of 560 bytes.
Ok, that's just crazy. That function has a required stack size of exactly
64 bytes, and anything more than that is just spilling. And if you end up
with a stack frame of 560 bytes, that means that gcc is doing some _crazy_
spilling.
One thing that strikes me is that I've been just testing with gcc-4.4, and
BenH (who did some tests on PPC where SHA1 is just _trivial_ because it
all fits in the normal register space) noticed that older versions of gcc
that he tested did much worse on this.
I think Artur also posted (x86) numbers with older gcc versions doing
worse. Maybe you're seeing some of that?
Linus
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: block-sha1: improve code on large-register-set machines
2009-08-11 21:20 ` Brandon Casey
2009-08-11 21:36 ` Nicolas Pitre
@ 2009-08-11 22:57 ` Linus Torvalds
2009-08-11 23:13 ` Brandon Casey
1 sibling, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2009-08-11 22:57 UTC (permalink / raw)
To: Brandon Casey; +Cc: Nicolas Pitre, Git Mailing List, Junio C Hamano
On Tue, 11 Aug 2009, Brandon Casey wrote:
>
> In that case, why not change the interface of blk_SHA1Block() so that its
> second argument is const unsigned char* and get rid of __d and the { } ?
Because on big-endian, or on architectures like x86 that have an efficient
byte swap, that would be horrible.
You absoluetoy MUST NOT do things a byte at a time in those cases. The
memory operations and the shifting just kills you.
The reason you want to do things a byte at a time on ARM is that ARM
cannot do unaligned accesses well (very modern cores are better, but
rare), and that ARM has no bswap instruction and has fairly cheap shifts.
On no sane architecture is that true. Unaligned loads are fast (and quite
frankly, hardware where unaliged loads aren't fast is just crazy sh*t),
and doing 'bswap' is way faster than doing many shifts and masks.
So everything should be fundamentally word-oriented. Then, broken
architectures that can't handle it should split up the words, not the
other way around.
Linus
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: block-sha1: improve code on large-register-set machines
2009-08-11 22:57 ` Linus Torvalds
@ 2009-08-11 23:13 ` Brandon Casey
0 siblings, 0 replies; 16+ messages in thread
From: Brandon Casey @ 2009-08-11 23:13 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Nicolas Pitre, Git Mailing List, Junio C Hamano
Linus Torvalds wrote:
>
> On Tue, 11 Aug 2009, Brandon Casey wrote:
>> In that case, why not change the interface of blk_SHA1Block() so that its
>> second argument is const unsigned char* and get rid of __d and the { } ?
>
> Because on big-endian, or on architectures like x86 that have an efficient
> byte swap, that would be horrible.
Sorry, I missed Nicolas's first message where he said his SHA_SRC macro was
for arm only.
I started at your reply to him which only has the snippet which says
"...this provides the exact same performance as the ntohl() based version
except that this now cope with unaligned buffers too".
My mistake.
-brandon
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: block-sha1: improve code on large-register-set machines
2009-08-11 22:53 ` Linus Torvalds
@ 2009-08-11 23:14 ` Linus Torvalds
2009-08-12 2:26 ` Nicolas Pitre
2009-08-11 23:45 ` Artur Skawina
1 sibling, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2009-08-11 23:14 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Git Mailing List, Junio C Hamano
On Tue, 11 Aug 2009, Linus Torvalds wrote:
>
>
> On Tue, 11 Aug 2009, Nicolas Pitre wrote:
> >
> > Well... gcc is really strange in this case (and similar other ones) with
> > ARM compilation. A good indicator of the quality of the code is the
> > size of the stack frame. When using the "+m" then gcc creates a 816
> > byte stack frame, the generated binary grows by approx 3000 bytes, and
> > performances is almost halved (7.600s). Looking at the assembly result
> > I just can't figure out all the crazy moves taking place. Even the
> > version with no barrier what so ever produces better assembly with a
> > stack frame of 560 bytes.
>
> Ok, that's just crazy. That function has a required stack size of exactly
> 64 bytes, and anything more than that is just spilling. And if you end up
> with a stack frame of 560 bytes, that means that gcc is doing some _crazy_
> spilling.
Btw, what I think happens is:
- gcc turns all those array accesses into pseudo's
So now the 'array[16]' is seen as just another 16 variables rather than
an array.
- gcc then turns it into SSA, where each assignment basically creates a
new variable. So the 16 array variables (and 5 regular variables) are
now expanded to 80 SSA asignments (one array assignment per SHA1 round)
plus an additional 2 assignments to the "regular" variables per round
(B and E are changed each round). So in SSA form, you actually end up
having 240 pseudo's associated with the actual variables. Plus all
the temporaries of course.
- the thing then goes crazy and tries to generate great code from that
internal SSA model. And since there are never more than ~25 things
_live_ at any particular point, it works fine with lots of registers,
but on small-register machines gcc just goes crazy and has to spill.
And it doesn't spill 'array[x]' entries - it spills the _pseudo's_ it
has created - hundreds of them.
- End result: if the spill code doesn't share slots, it's going to create
a totally unholy mess of crap.
That's what the whole 'volatile unsigned int *' game tried to avoid. But
it really sounds like it's not working too well for you. And the _big_
memory barrier ends up helping just because with that in place, you end up
being almost entirely unable to schedule _anything_ between the different
SHA rounds, so you end up with only six or seven variables "live" in
between those barriers, and the stupid register allocator/spill logic
doesn't break down too badly.
The thing is, if you do full memory barriers, then you're probably better
off making both the loads and the stores be "volatile". That should have
similar effects.
The downside with that is that it really limits the loads. So (like the
full memory barrier) it's a big hammer approach. But it probably generates
better code for you, because it avoids the mental breakdown of gcc
spilling its pseudo's.
Linus
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: block-sha1: improve code on large-register-set machines
2009-08-11 22:53 ` Linus Torvalds
2009-08-11 23:14 ` Linus Torvalds
@ 2009-08-11 23:45 ` Artur Skawina
1 sibling, 0 replies; 16+ messages in thread
From: Artur Skawina @ 2009-08-11 23:45 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Nicolas Pitre, Git Mailing List, Junio C Hamano
Linus Torvalds wrote:
>
> One thing that strikes me is that I've been just testing with gcc-4.4, and
> BenH (who did some tests on PPC where SHA1 is just _trivial_ because it
> all fits in the normal register space) noticed that older versions of gcc
> that he tested did much worse on this.
>
> I think Artur also posted (x86) numbers with older gcc versions doing
> worse. Maybe you're seeing some of that?
FWIW, this is how it looks here. On 32-bit x86 gcc4.4 makes a large
difference[1], but your code does fairly well w/ most gccs, relatively
to all the other C implementations.
artur
P4: [linusv is the recent one w/ the volatile stores]
### sha1bench-gcc295: GCCVER 2.95.4 20030502 (prerelease)
rfc3174 0.9438 64.67
linus 0.9081 67.21
linusv 0.4155 146.9
linusp4 0.8761 69.66
linusas 0.9619 63.45
linusas2 1.025 59.52
mozilla 1.314 46.46
mozillaas 1.132 53.92
### sha1bench-gcc31: GCCVER 3.2 2002-07-26 (prerelease)
rfc3174 0.8582 71.12
linus 0.7943 76.84
linusv 0.5667 107.7
linusp4 0.7224 84.48
linusas 0.7127 85.64
linusas2 0.5109 119.5
mozilla 1.251 48.79
mozillaas 1.239 49.27
### sha1bench-gcc32: GCCVER 3.2.3
rfc3174 0.9062 67.35
linus 0.5555 109.9
linusv 0.3647 167.4
linusp4 0.5337 114.4
linusas 0.7126 85.66
linusas2 0.5089 119.9
mozilla 1.138 53.64
mozillaas 1.075 56.78
### sha1bench-gcc33: GCCVER 3.3.6
rfc3174 0.9029 67.6
linus 0.6059 100.7
linusv 0.3734 163.4
linusp4 0.6695 91.16
linusas 0.7832 77.93
linusas2 0.571 106.9
mozilla 1.083 56.36
mozillaas 1.078 56.62
### sha1bench-gcc34: GCCVER 3.4.6 20060121 (prerelease)
rfc3174 0.9277 65.79
linus 0.6583 92.71
linusv 0.6096 100.1
linusp4 0.7326 83.31
linusas 0.7383 82.67
linusas2 0.6264 97.44
mozilla 1.398 43.67
mozillaas 1.392 43.84
### sha1bench-gcc40: GCCVER 4.0.4 20061113 (prerelease)
rfc3174 0.9889 61.72
linus 0.7508 81.29
linusv 0.7752 78.73
linusp4 0.6548 93.21
linusas 0.4904 124.5
linusas2 0.6378 95.7
mozilla 1.528 39.93
mozillaas 1.596 38.24
### sha1bench-gcc41: GCCVER 4.1.3 20080704 (prerelease)
rfc3174 0.9798 62.29
linus 0.6993 87.28
linusv 0.767 79.57
linusp4 0.6785 89.95
linusas 0.6555 93.11
linusas2 0.691 88.32
mozilla 1.594 38.3
mozillaas 1.566 38.98
### sha1bench-gcc42: GCCVER 4.2.5 20090330 (prerelease)
rfc3174 1.138 53.63
linus 0.7772 78.53
linusv 0.6138 99.43
linusp4 0.7018 86.97
linusas 0.8164 74.76
linusas2 0.7038 86.73
mozilla 1.697 35.97
mozillaas 1.491 40.94
### sha1bench-gcc43: GCCVER 4.3.5 20090810 (prerelease)
rfc3174 1.148 53.15
linus 0.7085 86.14
linusv 0.5474 111.5
linusp4 0.5399 113
linusas 0.7522 81.14
linusas2 0.5341 114.3
mozilla 1.723 35.43
mozillaas 1.502 40.64
### sha1bench-gcc44: GCCVER 4.4.2 20090809 (prerelease)
rfc3174 1.451 42.06
linus 0.5871 104
linusv 0.3713 164.4
linusp4 0.4367 139.8
linusas 0.4083 149.5
linusas2 0.4372 139.6
mozilla 1.104 55.27
mozillaas 1.314 46.44
And on Atom:
### sha1bench-gcc295: GCCVER 2.95.4 20030502 (prerelease)
rfc3174 1.905 32.04
linus 1.089 56.06
linusv 0.8134 75.04
linusp4 1.086 56.19
linusas 1.009 60.52
linusas2 1.255 48.63
mozilla 2.663 22.92
### sha1bench-gcc31: GCCVER 3.2 2002-07-26 (prerelease)
rfc3174 2.141 28.51
linus 1.022 59.75
linusv 0.8323 73.34
linusp4 1.061 57.54
linusas 0.9889 61.72
linusas2 0.9204 66.32
mozilla 2.573 23.72
### sha1bench-gcc32: GCCVER 3.2.3
rfc3174 2.155 28.32
linus 0.9031 67.58
linusv 0.7849 77.76
linusp4 0.847 72.06
linusas 0.9888 61.73
linusas2 0.912 66.93
mozilla 2.485 24.56
### sha1bench-gcc33: GCCVER 3.3.6
rfc3174 2.178 28.02
linus 0.9489 64.32
linusv 0.8642 70.63
linusp4 0.8784 69.48
linusas 1.017 60.03
linusas2 0.906 67.37
mozilla 2.541 24.02
### sha1bench-gcc34: GCCVER 3.4.6 20060121 (prerelease)
rfc3174 2.157 28.3
linus 0.9481 64.37
linusv 0.8383 72.8
linusp4 0.965 63.25
linusas 0.9852 61.95
linusas2 0.9809 62.22
mozilla 3.143 19.42
### sha1bench-gcc40: GCCVER 4.0.4 20061113 (prerelease)
rfc3174 2.088 29.24
linus 0.9706 62.89
linusv 0.928 65.77
linusp4 1.003 60.85
linusas 0.9478 64.4
linusas2 0.9475 64.42
mozilla 2.742 22.26
### sha1bench-gcc41: GCCVER 4.1.3 20080704 (prerelease)
rfc3174 2.047 29.81
linus 0.9778 62.42
linusv 1.051 58.06
linusp4 1.062 57.46
linusas 1.052 58.01
linusas2 1.069 57.12
mozilla 2.6 23.47
### sha1bench-gcc42: GCCVER 4.2.5 20090330 (prerelease)
rfc3174 2.025 30.14
linus 0.9622 63.43
linusv 0.7984 76.44
linusp4 0.8967 68.07
linusas 1.018 59.94
linusas2 0.9048 67.46
mozilla 2.748 22.21
### sha1bench-gcc43: GCCVER 4.3.5 20090810 (prerelease)
rfc3174 2.043 29.88
linus 0.9436 64.69
linusv 0.8532 71.54
linusp4 0.8531 71.54
linusas 1.04 58.71
linusas2 0.8495 71.85
mozilla 2.678 22.79
### sha1bench-gcc44: GCCVER 4.4.2 20090809 (prerelease)
rfc3174 2.119 28.8
linus 0.9132 66.84
linusv 0.8632 70.71
linusp4 0.9842 62.02
linusas 1.027 59.45
linusas2 0.9844 62
mozilla 2.214 27.57
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: block-sha1: improve code on large-register-set machines
2009-08-11 23:14 ` Linus Torvalds
@ 2009-08-12 2:26 ` Nicolas Pitre
0 siblings, 0 replies; 16+ messages in thread
From: Nicolas Pitre @ 2009-08-12 2:26 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano
On Tue, 11 Aug 2009, Linus Torvalds wrote:
>
>
> On Tue, 11 Aug 2009, Linus Torvalds wrote:
>
> >
> >
> > On Tue, 11 Aug 2009, Nicolas Pitre wrote:
> > >
> > > Well... gcc is really strange in this case (and similar other ones) with
> > > ARM compilation. A good indicator of the quality of the code is the
> > > size of the stack frame. When using the "+m" then gcc creates a 816
> > > byte stack frame, the generated binary grows by approx 3000 bytes, and
> > > performances is almost halved (7.600s). Looking at the assembly result
> > > I just can't figure out all the crazy moves taking place. Even the
> > > version with no barrier what so ever produces better assembly with a
> > > stack frame of 560 bytes.
> >
> > Ok, that's just crazy. That function has a required stack size of exactly
> > 64 bytes, and anything more than that is just spilling. And if you end up
> > with a stack frame of 560 bytes, that means that gcc is doing some _crazy_
> > spilling.
>
> Btw, what I think happens is:
>
> - gcc turns all those array accesses into pseudo's
>
> So now the 'array[16]' is seen as just another 16 variables rather than
> an array.
>
> - gcc then turns it into SSA, where each assignment basically creates a
> new variable. So the 16 array variables (and 5 regular variables) are
> now expanded to 80 SSA asignments (one array assignment per SHA1 round)
> plus an additional 2 assignments to the "regular" variables per round
> (B and E are changed each round). So in SSA form, you actually end up
> having 240 pseudo's associated with the actual variables. Plus all
> the temporaries of course.
>
> - the thing then goes crazy and tries to generate great code from that
> internal SSA model. And since there are never more than ~25 things
> _live_ at any particular point, it works fine with lots of registers,
> but on small-register machines gcc just goes crazy and has to spill.
> And it doesn't spill 'array[x]' entries - it spills the _pseudo's_ it
> has created - hundreds of them.
>
> - End result: if the spill code doesn't share slots, it's going to create
> a totally unholy mess of crap.
>
> That's what the whole 'volatile unsigned int *' game tried to avoid. But
> it really sounds like it's not working too well for you. And the _big_
> memory barrier ends up helping just because with that in place, you end up
> being almost entirely unable to schedule _anything_ between the different
> SHA rounds, so you end up with only six or seven variables "live" in
> between those barriers, and the stupid register allocator/spill logic
> doesn't break down too badly.
>
> The thing is, if you do full memory barriers, then you're probably better
> off making both the loads and the stores be "volatile". That should have
> similar effects.
If the loads are volatile then gcc has less flexibility when scheduling
them.
> The downside with that is that it really limits the loads. So (like the
> full memory barrier) it's a big hammer approach. But it probably generates
> better code for you, because it avoids the mental breakdown of gcc
> spilling its pseudo's.
Actually, all my previous tests were done with gcc-4.3.2. I now have
installed Fedora 11 which has gcc-4.4.0. And now the stack frame is a
nice 64 bytes. ;-)
That's with the "memory" though. With the volatile, stack frame goes up
to 224 bytes and performance, although not as bad as before, is like
5.160s instead of 4.410s. The "+m" version is not much better: 208 byte
stack frame and similar performance.
The version with no barrier what so ever runs in 4.580s and uses a 88
byte stack frame. The generated assembly contains stupid things, but
this is still the second best version, even better than the "+m" and
volatile ptr ones.
Conclusion: the full "memory" barrier remains the best choice on ARM.
Nicolas
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2009-08-12 2:27 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-08-10 23:52 block-sha1: improve code on large-register-set machines Linus Torvalds
2009-08-11 6:15 ` Nicolas Pitre
2009-08-11 15:04 ` Linus Torvalds
2009-08-11 18:00 ` Nicolas Pitre
2009-08-11 19:31 ` Nicolas Pitre
2009-08-11 21:20 ` Brandon Casey
2009-08-11 21:36 ` Nicolas Pitre
2009-08-11 21:49 ` Brandon Casey
2009-08-11 22:57 ` Linus Torvalds
2009-08-11 23:13 ` Brandon Casey
2009-08-11 15:43 ` Linus Torvalds
2009-08-11 20:03 ` Nicolas Pitre
2009-08-11 22:53 ` Linus Torvalds
2009-08-11 23:14 ` Linus Torvalds
2009-08-12 2:26 ` Nicolas Pitre
2009-08-11 23:45 ` Artur Skawina
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.