Re: [PATCH] block-sha1: more good unaligned memory access candidates

Git development
 help / color / mirror / Atom feed

* Re: [PATCH] block-sha1: more good unaligned memory access candidates
@ 2009-08-13 20:15 George Spelvin
  2009-08-13 21:28 ` Nicolas Pitre
  0 siblings, 1 reply; 8+ messages in thread
From: George Spelvin @ 2009-08-13 20:15 UTC (permalink / raw)
  To: git, gitster; +Cc: linux, nico, torvalds

> Wow.  Is it now faster than the arm/ and ppc/ hand-tweaked assembly?

It's probably faster than the ARM, which was tuned for size rather
than speed, but if you want to rework the assembly for speed, the ARM's
rotate-and-add operations allow tricks which I doubt GCC can pick up on.
(You have to notice that the F(b,c,d) function is bitwise, so you can
do it on rotated data and do the rotate when you add the result to e.)

I'd be surprised if it were faster than PPC code, especially on the
in-order G3 and G4 cores where careful scheduling really pays off.
But maybe I just get to be surprised...

For automatic assembly tuning, I was thinking of having a .c file that
has a bunch of #ifdef __PPC__ statements that gets run through $(CC) -E.
That should be a fairly portable way to 

The other question about unaligned access is whether it's beneficial
to make the fetch loop work like this:

	char const *in;
	uint32_t *out
	unsigned lsb = (unsigned)p & 3;
	uint32_t const *p32 = (uint32_t const *)(in - lsb);
	uint32_t t = ntohl(*p32);

	switch (lsb) {

	case 0:
		*out++ = t;
		for (i = 1; i < 16; i++)
			*out++ = ntohl(*++p32);
		break;
	case 1:
		for (i = 0; i < 16; i++) {
			uint32_t s = t << 8;
			t = ntohl(*++p32);
			*out++ = s | t >> 24;
		}
		break;
	case 1:
		for (i = 0; i < 16; i++) {
			uint32_t s = t << 16;
			t = ntohl(*++p32);
			*out++ = s | t >> 16;
		}
		break;
	case 1:
		for (i = 0; i < 16; i++) {
			uint32_t s = t << 24;
			t = ntohl(*++p32);
			*out++ = s | t >> 8;
		}
		break;
	}

On the ARM, at least, ntohl() isn't particularly cheap, so loading 4
bytes and assembling them turns out to be cheaper.  But it's a thought.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] block-sha1: more good unaligned memory access candidates
  2009-08-13 20:15 [PATCH] block-sha1: more good unaligned memory access candidates George Spelvin
@ 2009-08-13 21:28 ` Nicolas Pitre
  0 siblings, 0 replies; 8+ messages in thread
From: Nicolas Pitre @ 2009-08-13 21:28 UTC (permalink / raw)
  To: George Spelvin; +Cc: git, gitster, torvalds

On Thu, 13 Aug 2009, George Spelvin wrote:

> > Wow.  Is it now faster than the arm/ and ppc/ hand-tweaked assembly?
> 
> It's probably faster than the ARM, which was tuned for size rather
> than speed, but if you want to rework the assembly for speed, the ARM's
> rotate-and-add operations allow tricks which I doubt GCC can pick up on.
> (You have to notice that the F(b,c,d) function is bitwise, so you can
> do it on rotated data and do the rotate when you add the result to e.)

gcc is not too bad at merging ALU and shift/rotate operations into the 
same instruction.  However, to make a really optimal ARM version, some 
custom SHA_ROUND macros with inline assembly could be made.  I suspect 
that wouldn't gain much though, as the pure shift/rotate mov 
instructions really aren't that many in the generated code.

> I'd be surprised if it were faster than PPC code, especially on the
> in-order G3 and G4 cores where careful scheduling really pays off.
> But maybe I just get to be surprised...

Given that PPC has enough register to hold the entire state, it is then 
only a matter of proper instruction scheduling which modern gcc ought to 
do right.  If not then this is a good test case for gcc people to fix 
the PPC machine pipeline description.

> For automatic assembly tuning, I was thinking of having a .c file that
> has a bunch of #ifdef __PPC__ statements that gets run through $(CC) -E.
> That should be a fairly portable way to 

??

> The other question about unaligned access is whether it's beneficial
> to make the fetch loop work like this:
> 
> 	char const *in;
> 	uint32_t *out
> 	unsigned lsb = (unsigned)p & 3;
> 	uint32_t const *p32 = (uint32_t const *)(in - lsb);
> 	uint32_t t = ntohl(*p32);
> 
> 	switch (lsb) {
> 
> 	case 0:
> 		*out++ = t;
> 		for (i = 1; i < 16; i++)
> 			*out++ = ntohl(*++p32);
> 		break;
> 	case 1:
> 		for (i = 0; i < 16; i++) {
> 			uint32_t s = t << 8;
> 			t = ntohl(*++p32);
> 			*out++ = s | t >> 24;
> 		}
> 		break;
> 	case 1:
> 		for (i = 0; i < 16; i++) {
> 			uint32_t s = t << 16;
> 			t = ntohl(*++p32);
> 			*out++ = s | t >> 16;
> 		}
> 		break;
> 	case 1:
> 		for (i = 0; i < 16; i++) {
> 			uint32_t s = t << 24;
> 			t = ntohl(*++p32);
> 			*out++ = s | t >> 8;
> 		}
> 		break;
> 	}
> 
> On the ARM, at least, ntohl() isn't particularly cheap, so loading 4
> bytes and assembling them turns out to be cheaper.  But it's a thought.

Well, that would have to be tested.  This could possibly only be a gain 
if you have a fast ntohl() though.

And the other question is also to decide when this is good enough for a 
generic version.  Gaining 5% speedup on raw SHA1 throughput with ugly 
code might not be worth the maintenance hassle.  At that point you might 
as well go back to a purely asm version if you really want to get the 
extra edge.

Nicolas

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH] block-sha1: more good unaligned memory access candidates
@ 2009-08-13  4:29 Nicolas Pitre
  2009-08-13 16:45 ` Linus Torvalds
  0 siblings, 1 reply; 8+ messages in thread
From: Nicolas Pitre @ 2009-08-13  4:29 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, git

In addition to X86, PowerPC and S390 are capable of unaligned memory 
accesses.

Signed-off-by: Nicolas Pitre <nico@cam.org>

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index d3121f7..e5a1007 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -67,7 +67,10 @@
  * and is faster on architectures with memory alignment issues.
  */
 
-#if defined(__i386__) || defined(__x86_64__)
+#if defined(__i386__) || defined(__x86_64__) || \
+    defined(__ppc__) || defined(__ppc64__) || \
+    defined(__powerpc__) || defined(__powerpc64__) || \
+    defined(__s390__) || defined(__s390x__)
 
 #define get_be32(p)	ntohl(*(unsigned int *)(p))
 #define put_be32(p, v)	do { *(unsigned int *)(p) = htonl(v); } while (0)

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] block-sha1: more good unaligned memory access candidates
  2009-08-13  4:29 Nicolas Pitre
@ 2009-08-13 16:45 ` Linus Torvalds
  2009-08-13 17:23   ` Nicolas Pitre
  0 siblings, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2009-08-13 16:45 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git



On Thu, 13 Aug 2009, Nicolas Pitre wrote:
>
> In addition to X86, PowerPC and S390 are capable of unaligned memory 
> accesses.
> 
> Signed-off-by: Nicolas Pitre <nico@cam.org>

Ack on all your patches (1-3 + this). Looks fine to me.

I do wonder if we should try to do basically "per-architecture hack 
header-files", and then have for each architecture a small trivial 
'hack-x86.h' kind of thing that just does

	/* x86 hacks */
	#define get_be32(p)	ntohl(*(unsigned int *)(p))
	#define put_be32(p, v)	do { *(unsigned int *)(p) = htonl(v); } while (0)
	#define setW(x, val)	(*(volatile unsigned int *)&W(x) = (val))

	#define SHA_ASM(op, x, n) ({ unsigned int __res; __asm__(op " %1,%0":"=r" (__res):"i" (n), "0" (x)); __res; })
	#define SHA_ROL(x,n)   SHA_ASM("rol", x, n)
	#define SHA_ROR(x,n)   SHA_ASM("ror", x, n)

and then we'd have each architecture separated out. Add a few 
"generic helpers":

 - be32-generic.h:

	#define get_be32(p)    ( \
		(*((unsigned char *)(p) + 0) << 24) | \
		(*((unsigned char *)(p) + 1) << 16) | \
		(*((unsigned char *)(p) + 2) <<  8) | \
		(*((unsigned char *)(p) + 3) <<  0) )

	#define put_be32(p, v) do { \
		unsigned int __v = (v); \
		*((unsigned char *)(p) + 0) = __v >> 24; \
		*((unsigned char *)(p) + 1) = __v >> 16; \
		*((unsigned char *)(p) + 2) = __v >>  8; \
		*((unsigned char *)(p) + 3) = __v >>  0; } while (0)

 - rotate-generic.h:

	#define SHA_ROT(X,l,r)  (((X) << (l)) | ((X) >> (r)))
	#define SHA_ROL(X,n)    SHA_ROT(X,n,32-(n))
	#define SHA_ROR(X,n)    SHA_ROT(X,32-(n),n)

that architectures could use when they want to use some particular
portable version.  Then, add a final "hack-generic.h" with the fallback
cases that just does

	#include "be32-generic.h"
	#include "rotate-generic.h"
	#define setW(x,val) (W(x) = (val))

and you'd have all the hacks separated out and fairly easily used by
different architectures.. (ie the ARM version would just look like

 - hack-arm.h:

	#include "be32-generic.h"  
	#include "rotate-generic.h"
	#define setW(x, val) do { W(x) = (val); __asm__("":::"memory"); } while (0)

and you'd be all done.

Hmm? I don't know if this kind of generalization is strictly needed, so 
I'm just throwing it out as an idea.

		Linus

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] block-sha1: more good unaligned memory access candidates
  2009-08-13 16:45 ` Linus Torvalds
@ 2009-08-13 17:23   ` Nicolas Pitre
  2009-08-13 19:33     ` Junio C Hamano
  0 siblings, 1 reply; 8+ messages in thread
From: Nicolas Pitre @ 2009-08-13 17:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, git

On Thu, 13 Aug 2009, Linus Torvalds wrote:

> 
> 
> On Thu, 13 Aug 2009, Nicolas Pitre wrote:
> >
> > In addition to X86, PowerPC and S390 are capable of unaligned memory 
> > accesses.
> > 
> > Signed-off-by: Nicolas Pitre <nico@cam.org>
> 
> Ack on all your patches (1-3 + this). Looks fine to me.
> 
> I do wonder if we should try to do basically "per-architecture hack 
> header-files", and then have for each architecture a small trivial 
> 'hack-x86.h' kind of thing that just does
> 
> 	/* x86 hacks */
> 	#define get_be32(p)	ntohl(*(unsigned int *)(p))
> 	#define put_be32(p, v)	do { *(unsigned int *)(p) = htonl(v); } while (0)
> 	#define setW(x, val)	(*(volatile unsigned int *)&W(x) = (val))
> 
> 	#define SHA_ASM(op, x, n) ({ unsigned int __res; __asm__(op " %1,%0":"=r" (__res):"i" (n), "0" (x)); __res; })
> 	#define SHA_ROL(x,n)   SHA_ASM("rol", x, n)
> 	#define SHA_ROR(x,n)   SHA_ASM("ror", x, n)
> 
> and then we'd have each architecture separated out. Add a few 
> "generic helpers":
> 
>  - be32-generic.h:
> 
> 	#define get_be32(p)    ( \
> 		(*((unsigned char *)(p) + 0) << 24) | \
> 		(*((unsigned char *)(p) + 1) << 16) | \
> 		(*((unsigned char *)(p) + 2) <<  8) | \
> 		(*((unsigned char *)(p) + 3) <<  0) )
> 
> 	#define put_be32(p, v) do { \
> 		unsigned int __v = (v); \
> 		*((unsigned char *)(p) + 0) = __v >> 24; \
> 		*((unsigned char *)(p) + 1) = __v >> 16; \
> 		*((unsigned char *)(p) + 2) = __v >>  8; \
> 		*((unsigned char *)(p) + 3) = __v >>  0; } while (0)
> 
>  - rotate-generic.h:
> 
> 	#define SHA_ROT(X,l,r)  (((X) << (l)) | ((X) >> (r)))
> 	#define SHA_ROL(X,n)    SHA_ROT(X,n,32-(n))
> 	#define SHA_ROR(X,n)    SHA_ROT(X,32-(n),n)
> 
> that architectures could use when they want to use some particular
> portable version.  Then, add a final "hack-generic.h" with the fallback
> cases that just does
> 
> 	#include "be32-generic.h"
> 	#include "rotate-generic.h"
> 	#define setW(x,val) (W(x) = (val))
> 
> and you'd have all the hacks separated out and fairly easily used by
> different architectures.. (ie the ARM version would just look like
> 
>  - hack-arm.h:
> 
> 	#include "be32-generic.h"  
> 	#include "rotate-generic.h"
> 	#define setW(x, val) do { W(x) = (val); __asm__("":::"memory"); } while (0)
> 
> and you'd be all done.
> 
> Hmm? I don't know if this kind of generalization is strictly needed, so 
> I'm just throwing it out as an idea.

Well... first I don't think there are much else we can do with that 
code, i.e. there is probably not so many more hacks to add if any.  With 
my last patch I consider this ready for general consumption.

And given that the only user is this very sha1 implementation then there 
is not much value in abstracting them in header files.

And the point of patch 2/3 was to move hack variants for the same 
purpose together making it easier to document and understand (and 
possibly modify) them.  Your suggestion would go in the opposite 
direction entirely.

As it is now, I was about to suggest:

	git mv block-sha1/sha1.[ch] .
	rmdir block-sha1
	rm -r mozilla-sha1
	rm -r arm
	rm -r ppc 

and remove support for openssl's SHA1 usage, making this implementation 
unconditional.  After all it is faster, or so close to be faster than 
the alternatives, that we should probably cut on the extra dependency 
and simplify portability issues at the same time.


Nicolas

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] block-sha1: more good unaligned memory access candidates
  2009-08-13 17:23   ` Nicolas Pitre
@ 2009-08-13 19:33     ` Junio C Hamano
  2009-08-13 19:54       ` Linus Torvalds
  2009-08-13 20:13       ` Nicolas Pitre
  0 siblings, 2 replies; 8+ messages in thread
From: Junio C Hamano @ 2009-08-13 19:33 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Linus Torvalds, git

Nicolas Pitre <nico@cam.org> writes:

> As it is now, I was about to suggest:
>
> 	git mv block-sha1/sha1.[ch] .
> 	rmdir block-sha1
> 	rm -r mozilla-sha1
> 	rm -r arm
> 	rm -r ppc 
>
> and remove support for openssl's SHA1 usage, making this implementation 
> unconditional.  After all it is faster, or so close to be faster than 
> the alternatives, that we should probably cut on the extra dependency 
> and simplify portability issues at the same time.

Wow.  Is it now faster than the arm/ and ppc/ hand-tweaked assembly?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] block-sha1: more good unaligned memory access candidates
  2009-08-13 19:33     ` Junio C Hamano
@ 2009-08-13 19:54       ` Linus Torvalds
  2009-08-13 20:13       ` Nicolas Pitre
  1 sibling, 0 replies; 8+ messages in thread
From: Linus Torvalds @ 2009-08-13 19:54 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Nicolas Pitre, git

On Thu, 13 Aug 2009, Junio C Hamano wrote:
> 
> Wow.  Is it now faster than the arm/ and ppc/ hand-tweaked assembly?

For the good cases, yes.

For POWER, with gcc-4.4, the C code apparently outperforms the asm code on 
POWER6. The asm code is scheduled for POWER4, and I think outperforms the 
C code there. Also, when compiling in 64-bit mode (with "-m64"), at least 
some versions of gcc seem to do some stupid things and add extra zero 
extension stuff, and that performed suboptimally at least on a PPC G5.

So it's certainly not a clear case of "the C code outperforms the asm 
code", but in BenH's tests, the best numbers really did come from the C 
version. With some silly cases of at least some versions gcc screwing up 
(not reload, but zero extension), and making it noticeably slower.

IOW, the PPC situation really isn't that different from x86. 

			Linus

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] block-sha1: more good unaligned memory access candidates
  2009-08-13 19:33     ` Junio C Hamano
  2009-08-13 19:54       ` Linus Torvalds
@ 2009-08-13 20:13       ` Nicolas Pitre
  1 sibling, 0 replies; 8+ messages in thread
From: Nicolas Pitre @ 2009-08-13 20:13 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, git

On Thu, 13 Aug 2009, Junio C Hamano wrote:

> Nicolas Pitre <nico@cam.org> writes:
> 
> > As it is now, I was about to suggest:
> >
> > 	git mv block-sha1/sha1.[ch] .
> > 	rmdir block-sha1
> > 	rm -r mozilla-sha1
> > 	rm -r arm
> > 	rm -r ppc 
> >
> > and remove support for openssl's SHA1 usage, making this implementation 
> > unconditional.  After all it is faster, or so close to be faster than 
> > the alternatives, that we should probably cut on the extra dependency 
> > and simplify portability issues at the same time.
> 
> Wow.  Is it now faster than the arm/ and ppc/ hand-tweaked assembly?

It is indeed faster than the ARM assembly version by far, and faster 
than all the alternative implementations too, but with a 7x increase in 
compiled code size.  In the context of Git I think this is a good 
compromize.  Making the assembly version faster than the C version could 
be possible, but that would require quite some work and I don't expect 
the gain to be significant, certainly not worth the trouble.

Furthermore the C version can be used to generate ARM Thumb code while 
the asm version cannot without yet more work.

Nicolas

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2009-08-13 21:29 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-08-13 20:15 [PATCH] block-sha1: more good unaligned memory access candidates George Spelvin
2009-08-13 21:28 ` Nicolas Pitre
  -- strict thread matches above, loose matches on Subject: below --
2009-08-13  4:29 Nicolas Pitre
2009-08-13 16:45 ` Linus Torvalds
2009-08-13 17:23   ` Nicolas Pitre
2009-08-13 19:33     ` Junio C Hamano
2009-08-13 19:54       ` Linus Torvalds
2009-08-13 20:13       ` Nicolas Pitre

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox