block-sha1: improve code on large-register-set machines

* block-sha1: improve code on large-register-set machines
@ 2009-08-10 23:52 Linus Torvalds
  2009-08-11  6:15 ` Nicolas Pitre
  0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2009-08-10 23:52 UTC (permalink / raw)
  To: Git Mailing List, Junio C Hamano

For x86 performance (especially in 32-bit mode) I added that hack to write 
the SHA1 internal temporary hash using a volatile pointer, in order to get 
gcc to not try to cache the array contents. Because gcc will do all the 
wrong things, and then spill things in insane random ways.

But on architectures like PPC, where you have 32 registers, it's actually 
perfectly reasonable to put the whole temporary array[] into the register 
set, and gcc can do so.

So make the 'volatile unsigned int *' cast be dependent on a 
SMALL_REGISTER_SET preprocessor symbol, and enable it (currently) on just 
x86 and x86-64.  With that, the routine is fairly reasonable even when 
compared to the hand-scheduled PPC version. Ben Herrenschmidt reports on 
a G5:

 * Paulus asm version:       about 3.67s
 * Yours with no change:     about 5.74s
 * Yours without "volatile": about 3.78s

so with this the C version is within about 3% of the asm one.

And add a lot of commentary on what the heck is going on.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

I also asked David Miller to test the non-volatile version on Sparc, but I 
suspect it will have the same pattern. ia64 likewise (but I have not asked 
anybody to test).

Of the other architectures, ARM would probably want SMALL_REGISTER_SET, 
but I suspect the problem there is the htonl() (on little-endian), and 
possibly the unaligned loads - at least on older ARM. The latter is 
something gcc could be taught about, though (the SHA_SRC macro would just 
need to use a pointer that goes through a packed struct member or 
something).
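
The packed-struct idea mentioned above could be sketched roughly as follows. This is only an illustration under stated assumptions: it relies on gcc's `packed` and `may_alias` type attributes, and the names `struct unaligned_u32` and `load_unaligned_u32` are made up here and are not part of block-sha1.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not from the patch.  Wrapping the word in a
 * packed struct tells gcc that the access may be unaligned, so it emits
 * whatever load sequence is safe for the target.  may_alias is added so
 * the struct can be laid over an arbitrary byte buffer.
 */
struct unaligned_u32 {
	uint32_t v;
} __attribute__((packed, may_alias));

/* Load a native-endian 32-bit word from a possibly unaligned address. */
static inline uint32_t load_unaligned_u32(const void *p)
{
	return ((const struct unaligned_u32 *)p)->v;
}
```

The byte swap (htonl() or equivalent) would still be applied to the loaded word; only the load itself changes.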

 block-sha1/sha1.c |   25 ++++++++++++++++++++++++-
 1 files changed, 24 insertions(+), 1 deletions(-)

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index 36da763..9bc8b8a 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -82,6 +82,7 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
 #define SHA_ASM(op, x, n) ({ unsigned int __res; asm(op " %1,%0":"=r" (__res):"i" (n), "0" (x)); __res; })
 #define SHA_ROL(x,n)	SHA_ASM("rol", x, n)
 #define SHA_ROR(x,n)	SHA_ASM("ror", x, n)
+#define SMALL_REGISTER_SET

 #else

@@ -93,7 +94,29 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)

 /* This "rolls" over the 512-bit array */
 #define W(x) (array[(x)&15])
-#define setW(x, val) (*(volatile unsigned int *)&W(x) = (val))
+
+/*
+ * If you have 32 registers or more, the compiler can (and should)
+ * try to change the array[] accesses into registers. However, on
+ * machines with less than ~25 registers, that won't really work,
+ * and at least gcc will make an unholy mess of it.
+ *
+ * So to avoid that mess which just slows things down, we force
+ * the stores to memory to actually happen (we might be better off
+ * with a 'W(t)=(val);asm("":"+m" (W(t))' there instead, as
+ * suggested by Artur Skawina - that will also make gcc unable to
+ * try to do the silly "optimize away loads" part because it won't
+ * see what the value will be).
+ *
+ * Ben Herrenschmidt reports that on PPC, the C version comes close
+ * to the optimized asm with this (ie on PPC you don't want that
+ * 'volatile', since there are lots of registers).
+ */
+#ifdef SMALL_REGISTER_SET
+  #define setW(x, val) (*(volatile unsigned int *)&W(x) = (val))
+#else
+  #define setW(x, val) (W(x) = (val))
+#endif

 /*
  * Where do we get the source from? The first 16 iterations get it from

* Re: block-sha1: improve code on large-register-set machines
  2009-08-10 23:52 block-sha1: improve code on large-register-set machines Linus Torvalds
@ 2009-08-11  6:15 ` Nicolas Pitre
  2009-08-11 15:04   ` Linus Torvalds
  2009-08-11 15:43   ` Linus Torvalds
  0 siblings, 2 replies; 16+ messages in thread
From: Nicolas Pitre @ 2009-08-11  6:15 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano

On Mon, 10 Aug 2009, Linus Torvalds wrote:

> 
> For x86 performance (especially in 32-bit mode) I added that hack to write 
> the SHA1 internal temporary hash using a volatile pointer, in order to get 
> gcc to not try to cache the array contents. Because gcc will do all the 
> wrong things, and then spill things in insane random ways.
> 
> But on architectures like PPC, where you have 32 registers, it's actually 
> perfectly reasonable to put the whole temporary array[] into the register 
> set, and gcc can do so.
> 
> So make the 'volatile unsigned int *' cast be dependent on a 
> SMALL_REGISTER_SET preprocessor symbol, and enable it (currently) on just 
> x86 and x86-64.  With that, the routine is fairly reasonable even when 
> compared to the hand-scheduled PPC version. Ben Herrenschmidt reports on 
> a G5:
> 
>  * Paulus asm version:       about 3.67s
>  * Yours with no change:     about 5.74s
>  * Yours without "volatile": about 3.78s
> 
> so with this the C version is within about 3% of the asm one.
> 
> And add a lot of commentary on what the heck is going on.
> 
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> ---
> 
> I also asked David Miller to test the non-volatile version on Sparc, but I 
> suspect it will have the same pattern. ia64 likewise (but I have not asked 
> anybody to test).
> 
> Of the other architectures, ARM would probably want SMALL_REGISTER_SET, 
> but I suspect the problem there is the htonl() (on little-endian), and 
> possibly the unaligned loads - at least on older ARM. The latter is 
> something gcc could be taught about, though (the SHA_SRC macro would just 
> need to use a pointer that goes through a packed struct member or 
> something).

The "older" ARM (those that don't perform unaligned accesses in 
hardware) are still the majority by far in the field.

Here are some numbers on ARM for 203247018 bytes.

MOZILLA_SHA1:	14.520s
ARM_SHA1:	 5.600s
OPENSSL:	 5.530s

BLK_SHA1:	 5.280s		[original]
BLK_SHA1:	 7.410s		[with SMALL_REGISTER_SET defined]
BLK_SHA1:	 7.480s		[with 'W(x)=(val);asm("":"+m" (W(x)))']
BLK_SHA1:	 4.980s		[with 'W(x)=(val);asm("":::"memory")']

At this point the generated assembly is pretty slick.  I bet the full 
memory barrier might help on x86 as well.

However the above BLK_SHA1 works only for aligned source buffers.  So 
let's define our own SHA_SRC to replace the htonl() (which should 
probably be ntohl() by the way) like this:

#define SHA_SRC(t) \
  ({ unsigned char *__d = (unsigned char *)&data[t]; \
     (__d[0] << 24) | (__d[1] << 16) | (__d[2] << 8) | (__d[3] << 0); })

And this provides the exact same performance as the ntohl() based 
version (4.980s) except that this now copes with unaligned buffers too.
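
For illustration, the same byte-wise load can be written as a plain function. This is a sketch, not the code that was benchmarked: the name `get_be32_bytes` is hypothetical, and the casts to `uint32_t` are an addition so that a top byte of 0x80 or more is not shifted into the sign bit of a promoted `int`.

```c
#include <stdint.h>

/*
 * Hypothetical function form of the byte-wise SHA_SRC idea: assemble a
 * big-endian 32-bit word from four single-byte loads, so the source
 * pointer needs no particular alignment.
 */
static inline uint32_t get_be32_bytes(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) |
	       ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] <<  8) |
	       ((uint32_t)p[3] <<  0);
}
```

Because the result is built from byte values rather than from a word load, it is the same on little-endian and big-endian hosts.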

Of course the BLK_SHA1 version is a pig, since it is totally unrolled:

   text    data     bss     dec     hex filename
   1220       0       0    1220     4c4 mozilla-sha1/sha1.o
    852       0       0     852     354 arm/sha1_arm.o
   6292       0       0    6292    1894 block-sha1/sha1.o

so the speed advantage has a significant (but relative) code size cost.


Nicolas

* Re: block-sha1: improve code on large-register-set machines
  2009-08-11  6:15 ` Nicolas Pitre
@ 2009-08-11 15:04   ` Linus Torvalds
  2009-08-11 18:00     ` Nicolas Pitre
  2009-08-11 15:43   ` Linus Torvalds
  1 sibling, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2009-08-11 15:04 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Git Mailing List, Junio C Hamano



On Tue, 11 Aug 2009, Nicolas Pitre wrote:
> 
> #define SHA_SRC(t) \
>   ({ unsigned char *__d = (unsigned char *)&data[t]; \
>      (__d[0] << 24) | (__d[1] << 16) | (__d[2] << 8) | (__d[3] << 0); })
> 
> And this provides the exact same performance as the ntohl() based 
> version (4.980s) except that this now copes with unaligned buffers too.

Is it better to do a (conditional) memcpy up front? Or is the byte-based 
one better just because you always end up doing the shifting anyway due to 
most ARM situations being little-endian?

I _suspect_ that most large SHA1 calls from git are pre-aligned. The big 
SHA1 calls are for pack-file verification in fsck, which should all be 
aligned. Same goes for index file integrity checking.

The actual object SHA1 calculations are likely not aligned (we do that 
object header thing), and if you can't do the htonl() any better way I 
guess the byte-based thing is the way to go..

		Linus

---
 block-sha1/sha1.c |   13 ++++++++++++-
 1 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index 9bc8b8a..df27e66 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -25,6 +25,12 @@ void blk_SHA1_Init(blk_SHA_CTX *ctx)
 	ctx->H[4] = 0xc3d2e1f0;
 }
 
+#ifdef REALLY_SLOW_UNALIGNED
+  #define is_unaligned(ptr) (3 & (unsigned long)(ptr))
+#else
+  #define is_unaligned(ptr) 0
+#endif
+
 
 void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len)
 {
@@ -47,7 +53,12 @@ void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len)
 		blk_SHA1Block(ctx, ctx->W);
 	}
 	while (len >= 64) {
-		blk_SHA1Block(ctx, data);
+		const unsigned int *block = data;
+		if (is_unaligned(data)) {
+			memcpy(ctx->W, data, 64);
+			block = ctx->W;
+		}
+		blk_SHA1Block(ctx, block);
 		data += 64;
 		len -= 64;
 	}

* Re: block-sha1: improve code on large-register-set machines
  2009-08-11  6:15 ` Nicolas Pitre
  2009-08-11 15:04   ` Linus Torvalds
@ 2009-08-11 15:43   ` Linus Torvalds
  2009-08-11 20:03     ` Nicolas Pitre
  1 sibling, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2009-08-11 15:43 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Git Mailing List, Junio C Hamano

On Tue, 11 Aug 2009, Nicolas Pitre wrote:
> 
> BLK_SHA1:	 5.280s		[original]
> BLK_SHA1:	 7.410s		[with SMALL_REGISTER_SET defined]
> BLK_SHA1:	 7.480s		[with 'W(x)=(val);asm("":"+m" (W(x)))']
> BLK_SHA1:	 4.980s		[with 'W(x)=(val);asm("":::"memory")']
> 
> At this point the generated assembly is pretty slick.  I bet the full 
> memory barrier might help on x86 as well.

No, I had tested that earlier - single-word memory barrier for some reason 
gets _much_ better numbers at least on x86-64. We're talking

	linus            1.46       418.2
vs
	linus           2.004       304.6

kind of differences. With the "+m" it outperforms openssl (375-380MB/s).

The "volatile unsigned int *" cast looks pretty much like the "+m" version 
to me, but Artur got a speedup from whatever gcc code generation 
differences on his P4.

The really fundamental and basic problem with gcc on this code is that gcc 
does not see _any_ difference what-so-ever between the five variables 
declared with

	unsigned int A, B, C, D, E;

and the sixteen variables declared with

	unsigned int array[16];

and considers those all to be 21 local variables. It really seems to think 
that they are all 100% equivalent, and gcc totally ignores me doing things 
like adding "register" to the A-E ones etc.

And if you are a compiler, and think that the routine has 21 equal 
register variables, you're going to do crazy reload sh*t when you have 
only 7 (or 15) GP registers. So doing that full memory barrier seems to 
just take that random situation, and force some random variable to be 
spilled (this is all from looking at the generated code, not from looking 
at gcc).

In contrast, with the _targeted_ thing ("you'd better write back into 
array[]") we force gcc to spill the array[16] values, and not the A-E 
ones, and that's why it seems to make such a big difference.

And no, I'm not sure why ARM apparently doesn't show the same behavior. Or 
maybe it does, but with an in-order core it doesn't matter as much which 
registers you keep reloading - you'll be serialized all the time _anyway_. 

			Linus

* Re: block-sha1: improve code on large-register-set machines
  2009-08-11 15:04   ` Linus Torvalds
@ 2009-08-11 18:00     ` Nicolas Pitre
  2009-08-11 19:31       ` Nicolas Pitre
  0 siblings, 1 reply; 16+ messages in thread
From: Nicolas Pitre @ 2009-08-11 18:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano

On Tue, 11 Aug 2009, Linus Torvalds wrote:

> On Tue, 11 Aug 2009, Nicolas Pitre wrote:
> > 
> > #define SHA_SRC(t) \
> >   ({ unsigned char *__d = (unsigned char *)&data[t]; \
> >      (__d[0] << 24) | (__d[1] << 16) | (__d[2] << 8) | (__d[3] << 0); })
> > 
> > And this provides the exact same performance as the ntohl() based 
> > version (4.980s) except that this now copes with unaligned buffers too.
> 
> Is it better to do a (conditional) memcpy up front? Or is the byte-based 
> one better just because you always end up doing the shifting anyway due to 
> most ARM situations being little-endian?

The vast majority of ARM processors where git might be run are using a 
LE environment.

> I _suspect_ that most large SHA1 calls from git are pre-aligned. The big 
> SHA1 calls are for pack-file verification in fsck, which should all be 
> aligned. Same goes for index file integrity checking.
> 
> The actual object SHA1 calculations are likely not aligned (we do that 
> object header thing), and if you can't do the htonl() any better way I 
> guess the byte-based thing is the way to go..

Let's see.  The default ntohl() provided by glibc generates this code:

        ldr     r3, [r0, #0]
        mov     r0, r3, lsr #24
        and     r2, r3, #0x00ff0000
        orr     r0, r0, r3, asl #24
        orr     r0, r0, r2, lsr #8
        and     r3, r3, #0x0000ff00
        orr     r0, r0, r3, asl #8

Ignoring the load result delay that gcc should properly schedule anyway, 
that makes for 7 cycles.

Using the smarter ARM swab32 implementation from Linux we get:

        ldr     r3, [r0, #0]
        eor     r0, r3, r3, ror #16
        bic     r0, r0, #0x00ff0000
        mov     r0, r0, lsr #8
        eor     r0, r0, r3, ror #8

So we're down to 5 cycles.  And the SHA1 test is a bit faster too: 
4.570s down from 4.980s.  However this is still using purely aligned 
buffers.
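
The four-instruction swap above can be followed step by step in C. This is a sketch for reading along, with hypothetical names (`ror32`, `swab32_sketch`); it mirrors the instruction sequence shown above rather than quoting the Linux source.

```c
#include <stdint.h>

/* Rotate right; n must be in the range 1..31. */
static inline uint32_t ror32(uint32_t x, unsigned int n)
{
	return (x >> n) | (x << (32 - n));
}

/*
 * With x = [a b c d] (a is the most significant byte):
 *   t = x ^ ror(x, 16)      -> [a^c b^d a^c b^d]
 *   t &= ~0x00ff0000        -> [a^c  0  a^c b^d]
 *   t >>= 8                 -> [ 0  a^c  0  a^c]
 *   t ^ ror(x, 8)           -> [ d   c   b   a ]
 */
static inline uint32_t swab32_sketch(uint32_t x)
{
	uint32_t t = x ^ ror32(x, 16);	/* eor r0, r3, r3, ror #16 */
	t &= ~(uint32_t)0x00ff0000;	/* bic r0, r0, #0x00ff0000 */
	t >>= 8;			/* mov r0, r0, lsr #8      */
	return t ^ ror32(x, 8);		/* eor r0, r0, r3, ror #8  */
}
```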

Adding your patch using memcpy() to align the data in the unaligned case 
gives me wild results.  Sometimes it is 4.930s, sometimes it is 5.560s.  
I suspect the icache starts to get tight, and the SHA1 code and/or the 
special unaligned memcpy path sometimes get evicted.

Using the byte access version we get:

        ldrb    r3, [r0, #3]
        ldrb    r2, [r0, #0]
        ldrb    r1, [r0, #1]
        orr     r3, r3, r2, asl #24
        ldrb    r0, [r0, #2]
        orr     r3, r3, r1, asl #16
        orr     r0, r3, r0, asl #8

Again 7 cycles, like the ntohl() based version, which is consistent with 
the fact that they both take 4.980s.  However this time any buffer 
alignment is supported.  And in fact the extra 2 cycles over the 
swab32() version should actually be less overhead per word than the 
unaligned memcpy which is around 4 cycles per word.

Nicolas

* Re: block-sha1: improve code on large-register-set machines
  2009-08-11 18:00     ` Nicolas Pitre
@ 2009-08-11 19:31       ` Nicolas Pitre
  2009-08-11 21:20         ` Brandon Casey
  0 siblings, 1 reply; 16+ messages in thread
From: Nicolas Pitre @ 2009-08-11 19:31 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano

On Tue, 11 Aug 2009, Nicolas Pitre wrote:

> On Tue, 11 Aug 2009, Linus Torvalds wrote:
> 
> > On Tue, 11 Aug 2009, Nicolas Pitre wrote:
> > > 
> > > #define SHA_SRC(t) \
> > >   ({ unsigned char *__d = (unsigned char *)&data[t]; \
> > >      (__d[0] << 24) | (__d[1] << 16) | (__d[2] << 8) | (__d[3] << 0); })
> > > 
> > > And this provides the exact same performance as the ntohl() based 
> > > version (4.980s) except that this now copes with unaligned buffers too.
> > 
> > The actual object SHA1 calculations are likely not aligned (we do that 
> > object header thing), and if you can't do the htonl() any better way I 
> > guess the byte-based thing is the way to go..

OK, even better: 4.400s.

This is with this instead of the above:

#define SHA_SRC(t) \
   ({   unsigned char *__d = (unsigned char *)data; \
        (__d[(t)*4 + 0] << 24) | (__d[(t)*4 + 1] << 16) | \
        (__d[(t)*4 + 2] <<  8) | (__d[(t)*4 + 3] <<  0); })

The previous version would allocate a new register for __d and then 
index it with an offset of 0, 1, 2 or 3.  This version always uses the 
register containing the data pointer with absolute offsets.  The binary 
is a bit smaller too.


Nicolas

* Re: block-sha1: improve code on large-register-set machines
  2009-08-11 15:43   ` Linus Torvalds
@ 2009-08-11 20:03     ` Nicolas Pitre
  2009-08-11 22:53       ` Linus Torvalds
  0 siblings, 1 reply; 16+ messages in thread
From: Nicolas Pitre @ 2009-08-11 20:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano

On Tue, 11 Aug 2009, Linus Torvalds wrote:

> On Tue, 11 Aug 2009, Nicolas Pitre wrote:
> > 
> > BLK_SHA1:	 5.280s		[original]
> > BLK_SHA1:	 7.410s		[with SMALL_REGISTER_SET defined]
> > BLK_SHA1:	 7.480s		[with 'W(x)=(val);asm("":"+m" (W(x)))']
> > BLK_SHA1:	 4.980s		[with 'W(x)=(val);asm("":::"memory")']
> > 
> > At this point the generated assembly is pretty slick.  I bet the full 
> > memory barrier might help on x86 as well.
> 
> No, I had tested that earlier - single-word memory barrier for some reason 
> gets _much_ better numbers at least on x86-64. We're talking
> 
> 	linus            1.46       418.2
> vs
> 	linus           2.004       304.6
> 
> kind of differences. With the "+m" it outperforms openssl (375-380MB/s).
> 
> The "volatile unsigned int *" cast looks pretty much like the "+m" version 
> > to me, but Artur got a speedup from whatever gcc code generation 
> differences on his P4.

The volatile pointer forces a write to memory but the cached value in 
the processor's register remains valid, whereas the "+m" forces gcc to 
assume the register copy is not valid anymore.  That certainly gives the 
compiler a different clue about register availability, etc.
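
To make the comparison concrete, here are the four store variants discussed in this thread side by side. This is a sketch: the `setW_*` and `fill_*` names are hypothetical, the inline asm is a gcc extension, and the comments paraphrase the explanation above rather than any documented guarantee. All four store the same values; they differ only in what the compiler is allowed to assume afterwards.

```c
/* This "rolls" over the 16-word array, as in block-sha1. */
#define W(x) (array[(x) & 15])

/* Plain store: the compiler may keep the whole array in registers. */
#define setW_plain(x, val)    (W(x) = (val))

/* The store must reach memory; the register copy can still be reused. */
#define setW_volatile(x, val) (*(volatile unsigned int *)&W(x) = (val))

/* The compiler must assume the asm may have changed this one word. */
#define setW_plusm(x, val) \
	do { W(x) = (val); __asm__("" : "+m" (W(x))); } while (0)

/* The compiler must assume the asm may have changed any memory. */
#define setW_memory(x, val) \
	do { W(x) = (val); __asm__("" ::: "memory"); } while (0)

/* Generate one small test function per variant (illustration only). */
#define DEFINE_FILL(name, setW)				\
static unsigned int name(void)				\
{							\
	unsigned int array[16], sum = 0;		\
	int t;						\
	for (t = 0; t < 32; t++)			\
		setW(t, (unsigned int)t * 3u);		\
	for (t = 0; t < 16; t++)			\
		sum += W(t);				\
	return sum;					\
}

DEFINE_FILL(fill_plain, setW_plain)
DEFINE_FILL(fill_volatile, setW_volatile)
DEFINE_FILL(fill_plusm, setW_plusm)
DEFINE_FILL(fill_memory, setW_memory)
```

Comparing the assembly gcc generates for the four functions (for example with `gcc -O2 -S`) is one way to see the register allocation differences on a given target.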

> The really fundamental and basic problem with gcc on this code is that gcc 
> does not see _any_ difference what-so-ever between the five variables 
> declared with
> 
> 	unsigned int A, B, C, D, E;
> 
> and the sixteen variables declared with
> 
> 	unsigned int array[16];
> 
> and considers those all to be 21 local variables. It really seems to think 
> that they are all 100% equivalent, and gcc totally ignores me doing things 
> like adding "register" to the A-E ones etc.
> 
> And if you are a compiler, and think that the routine has 21 equal 
> register variables, you're going to do crazy reload sh*t when you have 
> only 7 (or 15) GP registers. So doing that full memory barrier seems to 
> just take that random situation, and force some random variable to be 
> spilled (this is all from looking at the generated code, not from looking 
> at gcc).
> 
> In contrast, with the _targeted_ thing ("you'd better write back into 
> array[]") we force gcc to spill the array[16] values, and not the A-E 
> ones, and that's why it seems to make such a big difference.
> 
> And no, I'm not sure why ARM apparently doesn't show the same behavior. Or 
> maybe it does, but with an in-order core it doesn't matter as much which 
> registers you keep reloading - you'll be serialized all the time _anyway_. 

Well... gcc is really strange in this case (and similar other ones) with 
ARM compilation.  A good indicator of the quality of the code is the 
size of the stack frame.  With the "+m" version, gcc creates an 816 
byte stack frame, the generated binary grows by approx 3000 bytes, and 
performance is almost halved (7.600s).  Looking at the assembly result 
I just can't figure out all the crazy moves taking place.  Even the 
version with no barrier whatsoever produces better assembly, with a 
stack frame of 560 bytes.

The volatile version is second worst, with an 808 byte stack frame and 
similarly bad performance.

With the full "memory" clobber the stack frame shrinks to 280 bytes and 
the best performance so far is obtained.  None of the A B C D E or data 
variables are ever spilled onto the stack either; only the array[16] 
gets allocated stack slots, along with the TEMP variable.

Nicolas

* Re: block-sha1: improve code on large-register-set machines
  2009-08-11 19:31       ` Nicolas Pitre
@ 2009-08-11 21:20         ` Brandon Casey
  2009-08-11 21:36           ` Nicolas Pitre
  2009-08-11 22:57           ` Linus Torvalds
  0 siblings, 2 replies; 16+ messages in thread
From: Brandon Casey @ 2009-08-11 21:20 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Linus Torvalds, Git Mailing List, Junio C Hamano

Nicolas Pitre wrote:
> On Tue, 11 Aug 2009, Nicolas Pitre wrote:
> 
>> On Tue, 11 Aug 2009, Linus Torvalds wrote:
>>
>>> On Tue, 11 Aug 2009, Nicolas Pitre wrote:
>>>> #define SHA_SRC(t) \
>>>>   ({ unsigned char *__d = (unsigned char *)&data[t]; \
>>>>      (__d[0] << 24) | (__d[1] << 16) | (__d[2] << 8) | (__d[3] << 0); })
>>>>
>>>> And this provides the exact same performance as the ntohl() based 
>>>> version (4.980s) except that this now copes with unaligned buffers too.
>>> The actual object SHA1 calculations are likely not aligned (we do that 
>>> object header thing), and if you can't do the htonl() any better way I 
>>> guess the byte-based thing is the way to go..
> 
> OK, even better: 4.400s.
> 
> This is with this instead of the above:
> 
> #define SHA_SRC(t) \
>    ({   unsigned char *__d = (unsigned char *)data; \
>         (__d[(t)*4 + 0] << 24) | (__d[(t)*4 + 1] << 16) | \
>         (__d[(t)*4 + 2] <<  8) | (__d[(t)*4 + 3] <<  0); })
> 
> The previous version would allocate a new register for __d and then 
> index it with an offset of 0, 1, 2 or 3.  This version always uses the 
> register containing the data pointer with absolute offsets.  The binary 
> is a bit smaller too.

In that case, why not change the interface of blk_SHA1Block() so that its
second argument is const unsigned char* and get rid of __d and the { } ?

Then it will look like this:

   static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned char *data);

   ...

   #define SHA_SRC(t) \
       ( (data[(t)*4 + 0] << 24) | (data[(t)*4 + 1] << 16) | \
         (data[(t)*4 + 2] <<  8) | (data[(t)*4 + 3] <<  0) )


Plus, we need something like the following to handle storing the hash to
an unaligned buffer (warning copy/pasted):

@@ -73,8 +74,12 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *c
 
        /* Output hash
         */
-       for (i = 0; i < 5; i++)
-               ((unsigned int *)hashout)[i] = htonl(ctx->H[i]);
+       for (i = 0; i < 5; i++) {
+               *hashout++ = (unsigned char) (ctx->H[i] >> 24);
+               *hashout++ = (unsigned char) (ctx->H[i] >> 16);
+               *hashout++ = (unsigned char) (ctx->H[i] >> 8);
+               *hashout++ = (unsigned char) (ctx->H[i] >> 0);
+       }
 }
 
 #if defined(__i386__) || defined(__x86_64__)


With these two changes plus a few other minor tweaks, the block-sha1 code compiles
and passes the test suite on sparc (solaris 7) and mips (irix 6.5).
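
The byte-wise output loop from the hunk above can also be written as a standalone sketch. The names `put_be32` and `store_hash` are hypothetical and the state is passed as a plain array instead of a `blk_SHA_CTX`, purely so the example is self-contained.

```c
#include <stdint.h>

/* Store one 32-bit word big-endian, one byte at a time. */
static void put_be32(unsigned char *out, uint32_t v)
{
	out[0] = (unsigned char)(v >> 24);
	out[1] = (unsigned char)(v >> 16);
	out[2] = (unsigned char)(v >>  8);
	out[3] = (unsigned char)(v >>  0);
}

/*
 * Write the five state words to a 20-byte buffer of any alignment.
 * No word-sized store is ever issued to hashout.
 */
static void store_hash(unsigned char hashout[20], const uint32_t H[5])
{
	int i;

	for (i = 0; i < 5; i++)
		put_be32(hashout + 4 * i, H[i]);
}
```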

-brandon

* Re: block-sha1: improve code on large-register-set machines
  2009-08-11 21:20         ` Brandon Casey
@ 2009-08-11 21:36           ` Nicolas Pitre
  2009-08-11 21:49             ` Brandon Casey
  2009-08-11 22:57           ` Linus Torvalds
  1 sibling, 1 reply; 16+ messages in thread
From: Nicolas Pitre @ 2009-08-11 21:36 UTC (permalink / raw)
  To: Brandon Casey; +Cc: Linus Torvalds, Git Mailing List, Junio C Hamano

On Tue, 11 Aug 2009, Brandon Casey wrote:

> Nicolas Pitre wrote:
> > On Tue, 11 Aug 2009, Nicolas Pitre wrote:
> > 
> >> On Tue, 11 Aug 2009, Linus Torvalds wrote:
> >>
> >>> On Tue, 11 Aug 2009, Nicolas Pitre wrote:
> >>>> #define SHA_SRC(t) \
> >>>>   ({ unsigned char *__d = (unsigned char *)&data[t]; \
> >>>>      (__d[0] << 24) | (__d[1] << 16) | (__d[2] << 8) | (__d[3] << 0); })
> >>>>
> >>>> And this provides the exact same performance as the ntohl() based 
> >>>> version (4.980s) except that this now copes with unaligned buffers too.
> >>> The actual object SHA1 calculations are likely not aligned (we do that 
> >>> object header thing), and if you can't do the htonl() any better way I 
> >>> guess the byte-based thing is the way to go..
> > 
> > OK, even better: 4.400s.
> > 
> > This is with this instead of the above:
> > 
> > #define SHA_SRC(t) \
> >    ({   unsigned char *__d = (unsigned char *)data; \
> >         (__d[(t)*4 + 0] << 24) | (__d[(t)*4 + 1] << 16) | \
> >         (__d[(t)*4 + 2] <<  8) | (__d[(t)*4 + 3] <<  0); })
> > 
> > The previous version would allocate a new register for __d and then 
> > index it with an offset of 0, 1, 2 or 3.  This version always uses the 
> > register containing the data pointer with absolute offsets.  The binary 
> > is a bit smaller too.
> 
> In that case, why not change the interface of blk_SHA1Block() so that its
> second argument is const unsigned char* and get rid of __d and the { } ?

Because not all architectures care to access the data bytewise.  Please 
see the original SHA_SRC definition.


Nicolas

* Re: block-sha1: improve code on large-register-set machines
  2009-08-11 21:36           ` Nicolas Pitre
@ 2009-08-11 21:49             ` Brandon Casey
  0 siblings, 0 replies; 16+ messages in thread
From: Brandon Casey @ 2009-08-11 21:49 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Linus Torvalds, Git Mailing List, Junio C Hamano

Nicolas Pitre wrote:
> On Tue, 11 Aug 2009, Brandon Casey wrote:
> 
>> Nicolas Pitre wrote:
>>> On Tue, 11 Aug 2009, Nicolas Pitre wrote:
>>>
>>>> On Tue, 11 Aug 2009, Linus Torvalds wrote:
>>>>
>>>>> On Tue, 11 Aug 2009, Nicolas Pitre wrote:
>>>>>> #define SHA_SRC(t) \
>>>>>>   ({ unsigned char *__d = (unsigned char *)&data[t]; \
>>>>>>      (__d[0] << 24) | (__d[1] << 16) | (__d[2] << 8) | (__d[3] << 0); })
>>>>>>
>>>>>> And this provides the exact same performance as the ntohl() based 
>>>>>> version (4.980s) except that this now copes with unaligned buffers too.
>>>>> The actual object SHA1 calculations are likely not aligned (we do that 
>>>>> object header thing), and if you can't do the htonl() any better way I 
>>>>> guess the byte-based thing is the way to go..
>>> OK, even better: 4.400s.
>>>
>>> This is with this instead of the above:
>>>
>>> #define SHA_SRC(t) \
>>>    ({   unsigned char *__d = (unsigned char *)data; \
>>>         (__d[(t)*4 + 0] << 24) | (__d[(t)*4 + 1] << 16) | \
>>>         (__d[(t)*4 + 2] <<  8) | (__d[(t)*4 + 3] <<  0); })
>>>
>>> The previous version would allocate a new register for __d and then 
>>> index it with an offset of 0, 1, 2 or 3.  This version always uses the 
>>> register containing the data pointer with absolute offsets.  The binary 
>>> is a bit smaller too.
>> In that case, why not change the interface of blk_SHA1Block() so that its
>> second argument is const unsigned char* and get rid of __d and the { } ?
> 
> Because not all architectures care to access the data bytewise.  Please 
> see the original SHA_SRC definition.

You mean this:

   #define SHA_SRC(t) htonl(data[t])

?  Or was there a definition before this one?

I don't follow what you are saying.  Are you saying that the following two
examples are different?

   unsigned int *data;

   #define SHA_SRC(t) \
      ({   unsigned char *__d = (unsigned char *)data; \
         (__d[(t)*4 + 0] << 24) | (__d[(t)*4 + 1] << 16) | \
         (__d[(t)*4 + 2] <<  8) | (__d[(t)*4 + 3] <<  0); })

and

   unsigned char *data;

   #define SHA_SRC(t) \
      ( (data[(t)*4 + 0] << 24) | (data[(t)*4 + 1] << 16) | \
        (data[(t)*4 + 2] <<  8) | (data[(t)*4 + 3] <<  0) )

-brandon

* Re: block-sha1: improve code on large-register-set machines
  2009-08-11 20:03     ` Nicolas Pitre
@ 2009-08-11 22:53       ` Linus Torvalds
  2009-08-11 23:14         ` Linus Torvalds
  2009-08-11 23:45         ` Artur Skawina
  0 siblings, 2 replies; 16+ messages in thread
From: Linus Torvalds @ 2009-08-11 22:53 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Git Mailing List, Junio C Hamano

On Tue, 11 Aug 2009, Nicolas Pitre wrote:
> 
> Well... gcc is really strange in this case (and similar other ones) with 
> ARM compilation.  A good indicator of the quality of the code is the 
> size of the stack frame.  With the "+m" version, gcc creates an 816 
> byte stack frame, the generated binary grows by approx 3000 bytes, and 
> performance is almost halved (7.600s).  Looking at the assembly result 
> I just can't figure out all the crazy moves taking place.  Even the 
> version with no barrier whatsoever produces better assembly, with a 
> stack frame of 560 bytes.

Ok, that's just crazy. That function has a required stack size of exactly 
64 bytes, and anything more than that is just spilling. And if you end up 
with a stack frame of 560 bytes, that means that gcc is doing some _crazy_ 
spilling.

One thing that strikes me is that I've been just testing with gcc-4.4, and 
BenH (who did some tests on PPC where SHA1 is just _trivial_ because it 
all fits in the normal register space) noticed that older versions of gcc 
that he tested did much worse on this.

I think Artur also posted (x86) numbers with older gcc versions doing 
worse. Maybe you're seeing some of that?

			Linus

* Re: block-sha1: improve code on large-register-set machines
  2009-08-11 21:20         ` Brandon Casey
  2009-08-11 21:36           ` Nicolas Pitre
@ 2009-08-11 22:57           ` Linus Torvalds
  2009-08-11 23:13             ` Brandon Casey
  1 sibling, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2009-08-11 22:57 UTC (permalink / raw)
  To: Brandon Casey; +Cc: Nicolas Pitre, Git Mailing List, Junio C Hamano

On Tue, 11 Aug 2009, Brandon Casey wrote:
> 
> In that case, why not change the interface of blk_SHA1Block() so that its
> second argument is const unsigned char* and get rid of __d and the { } ?

Because on big-endian, or on architectures like x86 that have an efficient 
byte swap, that would be horrible.

You absolutely MUST NOT do things a byte at a time in those cases. The 
memory operations and the shifting just kills you.

The reason you want to do things a byte at a time on ARM is that ARM 
cannot do unaligned accesses well (very modern cores are better, but 
rare), and that ARM has no bswap instruction and has fairly cheap shifts.

On no sane architecture is that true. Unaligned loads are fast (and quite 
frankly, hardware where unaligned loads aren't fast is just crazy sh*t), 
and doing 'bswap' is way faster than doing many shifts and masks.

So everything should be fundamentally word-oriented. Then, broken 
architectures that can't handle it should split up the words, not the 
other way around.
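
One way to read "word-oriented by default, split up only where needed" is the following sketch. The macro `EXAMPLE_SLOW_UNALIGNED` and the function `load_be32` are hypothetical names for illustration; `__builtin_bswap32` and `__BYTE_ORDER__` are gcc facilities, and a target where `__BYTE_ORDER__` is not defined is assumed here to be little-endian.

```c
#include <stdint.h>
#include <string.h>

#ifdef EXAMPLE_SLOW_UNALIGNED
/* Targets with slow unaligned access and cheap shifts: go byte by byte. */
static inline uint32_t load_be32(const void *ptr)
{
	const unsigned char *p = ptr;

	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] <<  8) | ((uint32_t)p[3] <<  0);
}
#else
/* Everyone else: one word load, plus a byte swap on little-endian. */
static inline uint32_t load_be32(const void *ptr)
{
	uint32_t v;

	memcpy(&v, ptr, sizeof(v));	/* compiles to a plain load on x86 */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
	return v;
#else
	return __builtin_bswap32(v);	/* assumes little-endian otherwise */
#endif
}
#endif
```

Both variants return the same value for the same input bytes, so the choice is purely a per-architecture performance decision.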

			Linus

* Re: block-sha1: improve code on large-register-set machines
  2009-08-11 22:57           ` Linus Torvalds
@ 2009-08-11 23:13             ` Brandon Casey
  0 siblings, 0 replies; 16+ messages in thread
From: Brandon Casey @ 2009-08-11 23:13 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, Git Mailing List, Junio C Hamano

Linus Torvalds wrote:
> 
> On Tue, 11 Aug 2009, Brandon Casey wrote:
>> In that case, why not change the interface of blk_SHA1Block() so that its
>> second argument is const unsigned char* and get rid of __d and the { } ?
> 
> Because on big-endian, or on architectures like x86 that have an efficient 
> byte swap, that would be horrible.

Sorry, I missed Nicolas's first message where he said his SHA_SRC macro was
for ARM only.

I started at your reply to him which only has the snippet which says
"...this provides the exact same performance as the ntohl() based version
except that this now cope with unaligned buffers too".

My mistake.

-brandon

* Re: block-sha1: improve code on large-register-set machines
  2009-08-11 22:53       ` Linus Torvalds
@ 2009-08-11 23:14         ` Linus Torvalds
  2009-08-12  2:26           ` Nicolas Pitre
  2009-08-11 23:45         ` Artur Skawina
  1 sibling, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2009-08-11 23:14 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Git Mailing List, Junio C Hamano

On Tue, 11 Aug 2009, Linus Torvalds wrote:

> 
> 
> On Tue, 11 Aug 2009, Nicolas Pitre wrote:
> > 
> > Well... gcc is really strange in this case (and similar other ones) with 
> > ARM compilation.  A good indicator of the quality of the code is the 
> > size of the stack frame.  When using the "+m" then gcc creates an 816 
> > byte stack frame, the generated binary grows by approx 3000 bytes, and 
> > performance is almost halved (7.600s).  Looking at the assembly result 
> > I just can't figure out all the crazy moves taking place.  Even the 
> > version with no barrier whatsoever produces better assembly with a 
> > stack frame of 560 bytes.
> 
> Ok, that's just crazy. That function has a required stack size of exactly 
> 64 bytes, and anything more than that is just spilling. And if you end up 
> with a stack frame of 560 bytes, that means that gcc is doing some _crazy_ 
> spilling.

Btw, what I think happens is:

 - gcc turns all those array accesses into pseudo's 

   So now the 'array[16]' is seen as just another 16 variables rather than 
   an array.

 - gcc then turns it into SSA, where each assignment basically creates a 
   new variable. So the 16 array variables (and 5 regular variables) are 
   now expanded to 80 SSA assignments (one array assignment per SHA1 round) 
   plus an additional 2 assignments to the "regular" variables per round 
   (B and E are changed each round). So in SSA form, you actually end up 
   having 240 pseudo's associated with the actual variables. Plus all 
   the temporaries of course.

 - the thing then goes crazy and tries to generate great code from that 
   internal SSA model. And since there are never more than ~25 things 
   _live_ at any particular point, it works fine with lots of registers, 
   but on small-register machines gcc just goes crazy and has to spill. 
   And it doesn't spill 'array[x]' entries - it spills the _pseudo's_ it 
   has created - hundreds of them.

 - End result: if the spill code doesn't share slots, it's going to create 
   a totally unholy mess of crap.

That's what the whole 'volatile unsigned int *' game tried to avoid. But 
it really sounds like it's not working too well for you. And the _big_ 
memory barrier ends up helping just because with that in place, you end up 
being almost entirely unable to schedule _anything_ between the different 
SHA rounds, so you end up with only six or seven variables "live" in 
between those barriers, and the stupid register allocator/spill logic 
doesn't break down too badly.

The thing is, if you do full memory barriers, then you're probably better 
off making both the loads and the stores be "volatile". That should have 
similar effects.
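
The store strategies being compared in this thread can be sketched as 
follows. This is an illustration under assumptions, not the code from 
block-sha1/sha1.c: the STORE_* and expand_* names are hypothetical, 
GCC-style inline asm is assumed, and the rounds are written as a loop 
whereas the real routine is fully unrolled.

```c
#include <stdint.h>
#include <string.h>

#define ROL32(x, n)	(((x) << (n)) | ((x) >> (32 - (n))))
#define W(arr, t)	((arr)[(t) & 15])	/* 16-word circular buffer */

/* SHA-1 message schedule: W[t] = rol1(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16]) */
#define MIX(arr, t) \
	ROL32(W(arr, (t) + 13) ^ W(arr, (t) + 8) ^ W(arr, (t) + 2) ^ W(arr, t), 1)

/* 1. Plain store: the compiler is free to turn array[] into pseudos. */
#define STORE_PLAIN(arr, t, val)	(W(arr, t) = (val))

/* 2. Volatile store: the write must go to memory, loads stay free. */
#define STORE_VOLATILE(arr, t, val) \
	(*(volatile uint32_t *)&W(arr, t) = (val))

/* 3. "+m" constraint: tells gcc that just this one slot lives in memory. */
#define STORE_PLUS_M(arr, t, val) \
	do { W(arr, t) = (val); \
	     __asm__ __volatile__("" : "+m" (W(arr, t))); } while (0)

/* 4. Full "memory" clobber: nothing cached in registers crosses this. */
#define STORE_BARRIER(arr, t, val) \
	do { W(arr, t) = (val); \
	     __asm__ __volatile__("" : : : "memory"); } while (0)

/* Generate one schedule-expansion routine per store strategy. */
#define DEFINE_EXPAND(name, STORE)				\
static uint32_t name(const uint32_t in[16])			\
{								\
	uint32_t w[16], sum = 0;				\
	int t;							\
								\
	memcpy(w, in, sizeof(w));				\
	for (t = 16; t < 80; t++) {				\
		uint32_t v = MIX(w, t);				\
		STORE(w, t, v);					\
		sum ^= v;	/* keep the result observable */	\
	}							\
	return sum;						\
}

DEFINE_EXPAND(expand_plain, STORE_PLAIN)
DEFINE_EXPAND(expand_volatile, STORE_VOLATILE)
DEFINE_EXPAND(expand_plus_m, STORE_PLUS_M)
DEFINE_EXPAND(expand_barrier, STORE_BARRIER)
```

All four variants compute the same values. They differ only in how much 
freedom the compiler keeps to hold array[] entries in registers and to 
move work across rounds, which is what shows up as stack frame size in 
the generated assembly.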

The downside with that is that it really limits the loads. So (like the 
full memory barrier) it's a big hammer approach. But it probably generates 
better code for you, because it avoids the mental breakdown of gcc 
spilling its pseudo's.

			Linus

* Re: block-sha1: improve code on large-register-set machines
  2009-08-11 22:53       ` Linus Torvalds
  2009-08-11 23:14         ` Linus Torvalds
@ 2009-08-11 23:45         ` Artur Skawina
  1 sibling, 0 replies; 16+ messages in thread
From: Artur Skawina @ 2009-08-11 23:45 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, Git Mailing List, Junio C Hamano

Linus Torvalds wrote:
> 
> One thing that strikes me is that I've been just testing with gcc-4.4, and 
> BenH (who did some tests on PPC where SHA1 is just _trivial_ because it 
> all fits in the normal register space) noticed that older versions of gcc 
> that he tested did much worse on this.
> 
> I think Artur also posted (x86) numbers with older gcc versions doing 
> worse. Maybe you're seeing some of that?

FWIW, this is how it looks here. On 32-bit x86 gcc4.4 makes a large
difference[1], but your code does fairly well w/ most gccs, relatively
to all the other C implementations.

artur

P4: [linusv is the recent one w/ the volatile stores]

### sha1bench-gcc295: GCCVER 2.95.4 20030502 (prerelease)
rfc3174        0.9438       64.67
linus          0.9081       67.21
linusv         0.4155       146.9
linusp4        0.8761       69.66
linusas        0.9619       63.45
linusas2        1.025       59.52
mozilla         1.314       46.46
mozillaas       1.132       53.92

### sha1bench-gcc31: GCCVER 3.2 2002-07-26 (prerelease)
rfc3174        0.8582       71.12
linus          0.7943       76.84
linusv         0.5667       107.7
linusp4        0.7224       84.48
linusas        0.7127       85.64
linusas2       0.5109       119.5
mozilla         1.251       48.79
mozillaas       1.239       49.27

### sha1bench-gcc32: GCCVER 3.2.3
rfc3174        0.9062       67.35
linus          0.5555       109.9
linusv         0.3647       167.4
linusp4        0.5337       114.4
linusas        0.7126       85.66
linusas2       0.5089       119.9
mozilla         1.138       53.64
mozillaas       1.075       56.78

### sha1bench-gcc33: GCCVER 3.3.6
rfc3174        0.9029        67.6
linus          0.6059       100.7
linusv         0.3734       163.4
linusp4        0.6695       91.16
linusas        0.7832       77.93
linusas2        0.571       106.9
mozilla         1.083       56.36
mozillaas       1.078       56.62

### sha1bench-gcc34: GCCVER 3.4.6 20060121 (prerelease)
rfc3174        0.9277       65.79
linus          0.6583       92.71
linusv         0.6096       100.1
linusp4        0.7326       83.31
linusas        0.7383       82.67
linusas2       0.6264       97.44
mozilla         1.398       43.67
mozillaas       1.392       43.84

### sha1bench-gcc40: GCCVER 4.0.4 20061113 (prerelease)
rfc3174        0.9889       61.72
linus          0.7508       81.29
linusv         0.7752       78.73
linusp4        0.6548       93.21
linusas        0.4904       124.5
linusas2       0.6378        95.7
mozilla         1.528       39.93
mozillaas       1.596       38.24

### sha1bench-gcc41: GCCVER 4.1.3 20080704 (prerelease)
rfc3174        0.9798       62.29
linus          0.6993       87.28
linusv          0.767       79.57
linusp4        0.6785       89.95
linusas        0.6555       93.11
linusas2        0.691       88.32
mozilla         1.594        38.3
mozillaas       1.566       38.98

### sha1bench-gcc42: GCCVER 4.2.5 20090330 (prerelease)
rfc3174         1.138       53.63
linus          0.7772       78.53
linusv         0.6138       99.43
linusp4        0.7018       86.97
linusas        0.8164       74.76
linusas2       0.7038       86.73
mozilla         1.697       35.97
mozillaas       1.491       40.94

### sha1bench-gcc43: GCCVER 4.3.5 20090810 (prerelease)
rfc3174         1.148       53.15
linus          0.7085       86.14
linusv         0.5474       111.5
linusp4        0.5399         113
linusas        0.7522       81.14
linusas2       0.5341       114.3
mozilla         1.723       35.43
mozillaas       1.502       40.64

### sha1bench-gcc44: GCCVER 4.4.2 20090809 (prerelease)
rfc3174         1.451       42.06
linus          0.5871         104
linusv         0.3713       164.4
linusp4        0.4367       139.8
linusas        0.4083       149.5
linusas2       0.4372       139.6
mozilla         1.104       55.27
mozillaas       1.314       46.44


And on Atom:

### sha1bench-gcc295: GCCVER 2.95.4 20030502 (prerelease)
rfc3174         1.905       32.04
linus           1.089       56.06
linusv         0.8134       75.04
linusp4         1.086       56.19
linusas         1.009       60.52
linusas2        1.255       48.63
mozilla         2.663       22.92

### sha1bench-gcc31: GCCVER 3.2 2002-07-26 (prerelease)
rfc3174         2.141       28.51
linus           1.022       59.75
linusv         0.8323       73.34
linusp4         1.061       57.54
linusas        0.9889       61.72
linusas2       0.9204       66.32
mozilla         2.573       23.72

### sha1bench-gcc32: GCCVER 3.2.3
rfc3174         2.155       28.32
linus          0.9031       67.58
linusv         0.7849       77.76
linusp4         0.847       72.06
linusas        0.9888       61.73
linusas2        0.912       66.93
mozilla         2.485       24.56

### sha1bench-gcc33: GCCVER 3.3.6
rfc3174         2.178       28.02
linus          0.9489       64.32
linusv         0.8642       70.63
linusp4        0.8784       69.48
linusas         1.017       60.03
linusas2        0.906       67.37
mozilla         2.541       24.02

### sha1bench-gcc34: GCCVER 3.4.6 20060121 (prerelease)
rfc3174         2.157        28.3
linus          0.9481       64.37
linusv         0.8383        72.8
linusp4         0.965       63.25
linusas        0.9852       61.95
linusas2       0.9809       62.22
mozilla         3.143       19.42

### sha1bench-gcc40: GCCVER 4.0.4 20061113 (prerelease)
rfc3174         2.088       29.24
linus          0.9706       62.89
linusv          0.928       65.77
linusp4         1.003       60.85
linusas        0.9478        64.4
linusas2       0.9475       64.42
mozilla         2.742       22.26

### sha1bench-gcc41: GCCVER 4.1.3 20080704 (prerelease)
rfc3174         2.047       29.81
linus          0.9778       62.42
linusv          1.051       58.06
linusp4         1.062       57.46
linusas         1.052       58.01
linusas2        1.069       57.12
mozilla           2.6       23.47

### sha1bench-gcc42: GCCVER 4.2.5 20090330 (prerelease)
rfc3174         2.025       30.14
linus          0.9622       63.43
linusv         0.7984       76.44
linusp4        0.8967       68.07
linusas         1.018       59.94
linusas2       0.9048       67.46
mozilla         2.748       22.21

### sha1bench-gcc43: GCCVER 4.3.5 20090810 (prerelease)
rfc3174         2.043       29.88
linus          0.9436       64.69
linusv         0.8532       71.54
linusp4        0.8531       71.54
linusas          1.04       58.71
linusas2       0.8495       71.85
mozilla         2.678       22.79

### sha1bench-gcc44: GCCVER 4.4.2 20090809 (prerelease)
rfc3174         2.119        28.8
linus          0.9132       66.84
linusv         0.8632       70.71
linusp4        0.9842       62.02
linusas         1.027       59.45
linusas2       0.9844          62
mozilla         2.214       27.57

* Re: block-sha1: improve code on large-register-set machines
  2009-08-11 23:14         ` Linus Torvalds
@ 2009-08-12  2:26           ` Nicolas Pitre
  0 siblings, 0 replies; 16+ messages in thread
From: Nicolas Pitre @ 2009-08-12  2:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano

On Tue, 11 Aug 2009, Linus Torvalds wrote:

> 
> 
> On Tue, 11 Aug 2009, Linus Torvalds wrote:
> 
> > 
> > 
> > On Tue, 11 Aug 2009, Nicolas Pitre wrote:
> > > 
> > > Well... gcc is really strange in this case (and similar other ones) with 
> > > ARM compilation.  A good indicator of the quality of the code is the 
> > > size of the stack frame.  When using the "+m" then gcc creates an 816 
> > > byte stack frame, the generated binary grows by approx 3000 bytes, and 
> > > performance is almost halved (7.600s).  Looking at the assembly result 
> > > I just can't figure out all the crazy moves taking place.  Even the 
> > > version with no barrier whatsoever produces better assembly with a 
> > > stack frame of 560 bytes.
> > 
> > Ok, that's just crazy. That function has a required stack size of exactly 
> > 64 bytes, and anything more than that is just spilling. And if you end up 
> > with a stack frame of 560 bytes, that means that gcc is doing some _crazy_ 
> > spilling.
> 
> Btw, what I think happens is:
> 
>  - gcc turns all those array accesses into pseudo's 
> 
>    So now the 'array[16]' is seen as just another 16 variables rather than 
>    an array.
> 
>  - gcc then turns it into SSA, where each assignment basically creates a 
>    new variable. So the 16 array variables (and 5 regular variables) are 
>    now expanded to 80 SSA asignments (one array assignment per SHA1 round) 
>    plus an additional 2 assignments to the "regular" variables per round 
>    (B and E are changed each round). So in SSA form, you actually end up 
>    having 240 pseudo's associated with the actual variables. Plus all 
>    the temporaries of course.
> 
>  - the thing then goes crazy and tries to generate great code from that 
>    internal SSA model. And since there are never more than ~25 things 
>    _live_ at any particular point, it works fine with lots of registers, 
>    but on small-register machines gcc just goes crazy and has to spill. 
>    And it doesn't spill 'array[x]' entries - it spills the _pseudo's_ it 
>    has created - hundreds of them.
> 
>  - End result: if the spill code doesn't share slots, it's going to create 
>    a totally unholy mess of crap.
> 
> That's what the whole 'volatile unsigned int *' game tried to avoid. But 
> it really sounds like it's not working too well for you. And the _big_ 
> memory barrier ends up helping just because with that in place, you end up 
> being almost entirely unable to schedule _anything_ between the different 
> SHA rounds, so you end up with only six or seven variables "live" in 
> between those barriers, and the stupid register allocator/spill logic 
> doesn't break down too badly.
> 
> The thing is, if you do full memory barriers, then you're probably better 
> off making both the loads and the stores be "volatile". That should have 
> similar effects.

If the loads are volatile then gcc has less flexibility when scheduling 
them.

> The downside with that is that it really limits the loads. So (like the 
> full memory barrier) it's a big hammer approach. But it probably generates 
> better code for you, because it avoids the mental breakdown of gcc 
> spilling its pseudo's.

Actually, all my previous tests were done with gcc-4.3.2.  I now have 
installed Fedora 11 which has gcc-4.4.0.  And now the stack frame is a 
nice 64 bytes.  ;-)

That's with the "memory" barrier though.  With the volatile, the stack 
frame goes up to 224 bytes and performance, although not as bad as 
before, is around 5.160s instead of 4.410s.  The "+m" version is not much 
better: 208 byte stack frame and similar performance.

The version with no barrier whatsoever runs in 4.580s and uses an 88 
byte stack frame.  The generated assembly contains stupid things, but 
this is still the second best version, even better than the "+m" and 
volatile ptr ones.

Conclusion: the full "memory" barrier remains the best choice on ARM.


Nicolas

end of thread, other threads:[~2009-08-12  2:27 UTC | newest]

Thread overview: 16+ messages
2009-08-10 23:52 block-sha1: improve code on large-register-set machines Linus Torvalds
2009-08-11  6:15 ` Nicolas Pitre
2009-08-11 15:04   ` Linus Torvalds
2009-08-11 18:00     ` Nicolas Pitre
2009-08-11 19:31       ` Nicolas Pitre
2009-08-11 21:20         ` Brandon Casey
2009-08-11 21:36           ` Nicolas Pitre
2009-08-11 21:49             ` Brandon Casey
2009-08-11 22:57           ` Linus Torvalds
2009-08-11 23:13             ` Brandon Casey
2009-08-11 15:43   ` Linus Torvalds
2009-08-11 20:03     ` Nicolas Pitre
2009-08-11 22:53       ` Linus Torvalds
2009-08-11 23:14         ` Linus Torvalds
2009-08-12  2:26           ` Nicolas Pitre
2009-08-11 23:45         ` Artur Skawina
