[PATCH] block-sha1: more good unaligned memory access candidates

Git development
 help / color / mirror / Atom feed

* [PATCH] block-sha1: more good unaligned memory access candidates
@ 2009-08-13  4:29 Nicolas Pitre
  2009-08-13 16:45 ` Linus Torvalds
  2009-08-14 22:52 ` [PATCH] block-sha1/sha1.c: silence compiler complaints by casting void* to uintptr_t Brandon Casey
  0 siblings, 2 replies; 17+ messages in thread
From: Nicolas Pitre @ 2009-08-13  4:29 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, git

In addition to X86, PowerPC and S390 are capable of unaligned memory 
accesses.

Signed-off-by: Nicolas Pitre <nico@cam.org>

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index d3121f7..e5a1007 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -67,7 +67,10 @@
  * and is faster on architectures with memory alignment issues.
  */
 
-#if defined(__i386__) || defined(__x86_64__)
+#if defined(__i386__) || defined(__x86_64__) || \
+    defined(__ppc__) || defined(__ppc64__) || \
+    defined(__powerpc__) || defined(__powerpc64__) || \
+    defined(__s390__) || defined(__s390x__)
 
 #define get_be32(p)	ntohl(*(unsigned int *)(p))
 #define put_be32(p, v)	do { *(unsigned int *)(p) = htonl(v); } while (0)

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] block-sha1: more good unaligned memory access candidates
  2009-08-13  4:29 [PATCH] block-sha1: more good unaligned memory access candidates Nicolas Pitre
@ 2009-08-13 16:45 ` Linus Torvalds
  2009-08-13 17:23   ` Nicolas Pitre
  2009-08-14 22:52 ` [PATCH] block-sha1/sha1.c: silence compiler complaints by casting void* to uintptr_t Brandon Casey
  1 sibling, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2009-08-13 16:45 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git



On Thu, 13 Aug 2009, Nicolas Pitre wrote:
>
> In addition to X86, PowerPC and S390 are capable of unaligned memory 
> accesses.
> 
> Signed-off-by: Nicolas Pitre <nico@cam.org>

Ack on all your patches (1-3 + this). Looks fine to me.

I do wonder if we should try to do basically "per-architecture hack 
header-files", and then have for each architecture a small trivial 
'hack-x86.h' kind of thing that just does

	/* x86 hacks */
	#define get_be32(p)	ntohl(*(unsigned int *)(p))
	#define put_be32(p, v)	do { *(unsigned int *)(p) = htonl(v); } while (0)
	#define setW(x, val)	(*(volatile unsigned int *)&W(x) = (val))

	#define SHA_ASM(op, x, n) ({ unsigned int __res; __asm__(op " %1,%0":"=r" (__res):"i" (n), "0" (x)); __res; })
	#define SHA_ROL(x,n)   SHA_ASM("rol", x, n)
	#define SHA_ROR(x,n)   SHA_ASM("ror", x, n)

and then we'd have each architecture separated out. Add a few 
"generic helpers":

 - be32-generic.h:

	#define get_be32(p)    ( \
		(*((unsigned char *)(p) + 0) << 24) | \
		(*((unsigned char *)(p) + 1) << 16) | \
		(*((unsigned char *)(p) + 2) <<  8) | \
		(*((unsigned char *)(p) + 3) <<  0) )

	#define put_be32(p, v) do { \
		unsigned int __v = (v); \
		*((unsigned char *)(p) + 0) = __v >> 24; \
		*((unsigned char *)(p) + 1) = __v >> 16; \
		*((unsigned char *)(p) + 2) = __v >>  8; \
		*((unsigned char *)(p) + 3) = __v >>  0; } while (0)

 - rotate-generic.h:

	#define SHA_ROT(X,l,r)  (((X) << (l)) | ((X) >> (r)))
	#define SHA_ROL(X,n)    SHA_ROT(X,n,32-(n))
	#define SHA_ROR(X,n)    SHA_ROT(X,32-(n),n)

that architectures could use when they want to use some particular
portable version.  Then, add a final "hack-generic.h" with the fallback
cases that just does

	#include "be32-generic.h"
	#include "rotate-generic.h"
	#define setW(x,val) (W(x) = (val))

and you'd have all the hacks separated out and fairly easily used by
different architectures.. (ie the ARM version would just look like

 - hack-arm.h:

	#include "be32-generic.h"  
	#include "rotate-generic.h"
	#define setW(x, val) do { W(x) = (val); __asm__("":::"memory"); } while (0)

and you'd be all done.

Hmm? I don't know if this kind of generalization is strictly needed, so 
I'm just throwing it out as an idea.

		Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] block-sha1: more good unaligned memory access candidates
  2009-08-13 16:45 ` Linus Torvalds
@ 2009-08-13 17:23   ` Nicolas Pitre
  2009-08-13 19:33     ` Junio C Hamano
  0 siblings, 1 reply; 17+ messages in thread
From: Nicolas Pitre @ 2009-08-13 17:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, git

On Thu, 13 Aug 2009, Linus Torvalds wrote:

> 
> 
> On Thu, 13 Aug 2009, Nicolas Pitre wrote:
> >
> > In addition to X86, PowerPC and S390 are capable of unaligned memory 
> > accesses.
> > 
> > Signed-off-by: Nicolas Pitre <nico@cam.org>
> 
> Ack on all your patches (1-3 + this). Looks fine to me.
> 
> I do wonder if we should try to do basically "per-architecture hack 
> header-files", and then have for each architecture a small trivial 
> 'hack-x86.h' kind of thing that just does
> 
> 	/* x86 hacks */
> 	#define get_be32(p)	ntohl(*(unsigned int *)(p))
> 	#define put_be32(p, v)	do { *(unsigned int *)(p) = htonl(v); } while (0)
> 	#define setW(x, val)	(*(volatile unsigned int *)&W(x) = (val))
> 
> 	#define SHA_ASM(op, x, n) ({ unsigned int __res; __asm__(op " %1,%0":"=r" (__res):"i" (n), "0" (x)); __res; })
> 	#define SHA_ROL(x,n)   SHA_ASM("rol", x, n)
> 	#define SHA_ROR(x,n)   SHA_ASM("ror", x, n)
> 
> and then we'd have each architecture separated out. Add a few 
> "generic helpers":
> 
>  - be32-generic.h:
> 
> 	#define get_be32(p)    ( \
> 		(*((unsigned char *)(p) + 0) << 24) | \
> 		(*((unsigned char *)(p) + 1) << 16) | \
> 		(*((unsigned char *)(p) + 2) <<  8) | \
> 		(*((unsigned char *)(p) + 3) <<  0) )
> 
> 	#define put_be32(p, v) do { \
> 		unsigned int __v = (v); \
> 		*((unsigned char *)(p) + 0) = __v >> 24; \
> 		*((unsigned char *)(p) + 1) = __v >> 16; \
> 		*((unsigned char *)(p) + 2) = __v >>  8; \
> 		*((unsigned char *)(p) + 3) = __v >>  0; } while (0)
> 
>  - rotate-generic.h:
> 
> 	#define SHA_ROT(X,l,r)  (((X) << (l)) | ((X) >> (r)))
> 	#define SHA_ROL(X,n)    SHA_ROT(X,n,32-(n))
> 	#define SHA_ROR(X,n)    SHA_ROT(X,32-(n),n)
> 
> that architectures could use when they want to use some particular
> portable version.  Then, add a final "hack-generic.h" with the fallback
> cases that just does
> 
> 	#include "be32-generic.h"
> 	#include "rotate-generic.h"
> 	#define setW(x,val) (W(x) = (val))
> 
> and you'd have all the hacks separated out and fairly easily used by
> different architectures.. (ie the ARM version would just look like
> 
>  - hack-arm.h:
> 
> 	#include "be32-generic.h"  
> 	#include "rotate-generic.h"
> 	#define setW(x, val) do { W(x) = (val); __asm__("":::"memory"); } while (0)
> 
> and you'd be all done.
> 
> Hmm? I don't know if this kind of generalization is strictly needed, so 
> I'm just throwing it out as an idea.

Well... first I don't think there are much else we can do with that 
code, i.e. there is probably not so many more hacks to add if any.  With 
my last patch I consider this ready for general consumption.

And given that the only user is this very sha1 implementation then there 
is not much value in abstracting them in header files.

And the point of patch 2/3 was to move hack variants for the same 
purpose together making it easier to document and understand (and 
possibly modify) them.  Your suggestion would go in the opposite 
direction entirely.

As it is now, I was about to suggest:

	git mv block-sha1/sha1.[ch] .
	rmdir block-sha1
	rm -r mozilla-sha1
	rm -r arm
	rm -r ppc 

and remove support for openssl's SHA1 usage, making this implementation 
unconditional.  After all it is faster, or so close to be faster than 
the alternatives, that we should probably cut on the extra dependency 
and simplify portability issues at the same time.


Nicolas

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] block-sha1: more good unaligned memory access candidates
  2009-08-13 17:23   ` Nicolas Pitre
@ 2009-08-13 19:33     ` Junio C Hamano
  2009-08-13 19:54       ` Linus Torvalds
  2009-08-13 20:13       ` Nicolas Pitre
  0 siblings, 2 replies; 17+ messages in thread
From: Junio C Hamano @ 2009-08-13 19:33 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Linus Torvalds, git

Nicolas Pitre <nico@cam.org> writes:

> As it is now, I was about to suggest:
>
> 	git mv block-sha1/sha1.[ch] .
> 	rmdir block-sha1
> 	rm -r mozilla-sha1
> 	rm -r arm
> 	rm -r ppc 
>
> and remove support for openssl's SHA1 usage, making this implementation 
> unconditional.  After all it is faster, or so close to be faster than 
> the alternatives, that we should probably cut on the extra dependency 
> and simplify portability issues at the same time.

Wow.  Is it now faster than the arm/ and ppc/ hand-tweaked assembly?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] block-sha1: more good unaligned memory access candidates
  2009-08-13 19:33     ` Junio C Hamano
@ 2009-08-13 19:54       ` Linus Torvalds
  2009-08-13 20:13       ` Nicolas Pitre
  1 sibling, 0 replies; 17+ messages in thread
From: Linus Torvalds @ 2009-08-13 19:54 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Nicolas Pitre, git

On Thu, 13 Aug 2009, Junio C Hamano wrote:
> 
> Wow.  Is it now faster than the arm/ and ppc/ hand-tweaked assembly?

For the good cases, yes.

For POWER, with gcc-4.4, the C code apparently outperforms the asm code on 
POWER6. The asm code is scheduled for POWER4, and I think outperforms the 
C code there. Also, when compiling in 64-bit mode (with "-m64"), at least 
some versions of gcc seem to do some stupid things and add extra zero 
extension stuff, and that performed suboptimally at least on a PPC G5.

So it's certainly not a clear case of "the C code outperforms the asm 
code", but in BenH's tests, the best numbers really did come from the C 
version. With some silly cases of at least some versions gcc screwing up 
(not reload, but zero extension), and making it noticeably slower.

IOW, the PPC situation really isn't that different from x86. 

			Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] block-sha1: more good unaligned memory access candidates
  2009-08-13 19:33     ` Junio C Hamano
  2009-08-13 19:54       ` Linus Torvalds
@ 2009-08-13 20:13       ` Nicolas Pitre
  1 sibling, 0 replies; 17+ messages in thread
From: Nicolas Pitre @ 2009-08-13 20:13 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, git

On Thu, 13 Aug 2009, Junio C Hamano wrote:

> Nicolas Pitre <nico@cam.org> writes:
> 
> > As it is now, I was about to suggest:
> >
> > 	git mv block-sha1/sha1.[ch] .
> > 	rmdir block-sha1
> > 	rm -r mozilla-sha1
> > 	rm -r arm
> > 	rm -r ppc 
> >
> > and remove support for openssl's SHA1 usage, making this implementation 
> > unconditional.  After all it is faster, or so close to be faster than 
> > the alternatives, that we should probably cut on the extra dependency 
> > and simplify portability issues at the same time.
> 
> Wow.  Is it now faster than the arm/ and ppc/ hand-tweaked assembly?

It is indeed faster than the ARM assembly version by far, and faster 
than all the alternative implementations too, but with a 7x increase in 
compiled code size.  In the context of Git I think this is a good 
compromize.  Making the assembly version faster than the C version could 
be possible, but that would require quite some work and I don't expect 
the gain to be significant, certainly not worth the trouble.

Furthermore the C version can be used to generate ARM Thumb code while 
the asm version cannot without yet more work.

Nicolas

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] block-sha1/sha1.c: silence compiler complaints by casting void* to uintptr_t
  2009-08-13  4:29 [PATCH] block-sha1: more good unaligned memory access candidates Nicolas Pitre
  2009-08-13 16:45 ` Linus Torvalds
@ 2009-08-14 22:52 ` Brandon Casey
  2009-08-15  0:08   ` Johannes Schindelin
  1 sibling, 1 reply; 17+ messages in thread
From: Brandon Casey @ 2009-08-14 22:52 UTC (permalink / raw)
  To: nico; +Cc: gitster, torvalds, git, Brandon Casey

From: Brandon Casey <drafnel@gmail.com>

Some compilers produce errors when arithmetic is attempted on pointers to
void.  So cast to uintptr_t when performing arithmetic on void pointers.

Signed-off-by: Brandon Casey <casey@nrlssc.navy.mil>
---


This, on top of Nico's last patch "block-sha1: more good unaligned memory
access candidates", allows this series to compile and run successfully on
Sun Sparc and IRIX MIPS for me.

It produces no differences on Linux using gcc.  gcc 3.4.6 and 4.1.2 produce
an identical binary for me on x86 and x86-64 with and without this change.
Any conceivable negative effects on other arch's?

-brandon


 block-sha1/sha1.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index e5a1007..ccaba9e 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -246,14 +246,14 @@ void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len)
 		memcpy(lenW + (char *)ctx->W, data, left);
 		lenW = (lenW + left) & 63;
 		len -= left;
-		data += left;
+		data = (const void*) ((uintptr_t) data + left);
 		if (lenW)
 			return;
 		blk_SHA1_Block(ctx, ctx->W);
 	}
 	while (len >= 64) {
 		blk_SHA1_Block(ctx, data);
-		data += 64;
+		data = (const void*) ((uintptr_t) data + 64);
 		len -= 64;
 	}
 	if (len)
-- 
1.6.4

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] block-sha1/sha1.c: silence compiler complaints by casting void* to uintptr_t
  2009-08-14 22:52 ` [PATCH] block-sha1/sha1.c: silence compiler complaints by casting void* to uintptr_t Brandon Casey
@ 2009-08-15  0:08   ` Johannes Schindelin
  2009-08-15  0:19     ` Linus Torvalds
  2009-08-15  0:52     ` Brandon Casey
  0 siblings, 2 replies; 17+ messages in thread
From: Johannes Schindelin @ 2009-08-15  0:08 UTC (permalink / raw)
  To: Brandon Casey; +Cc: nico, gitster, torvalds, git, Brandon Casey

Hi,

On Fri, 14 Aug 2009, Brandon Casey wrote:

> Some compilers produce errors when arithmetic is attempted on pointers to
> void.  So cast to uintptr_t when performing arithmetic on void pointers.

I am confused.  Is sizeof(*(uintptr_t)NULL) not larger than 1, and as a 
consequence ((uintptr_t)p)+1 not different from ((void *)p)+1?

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] block-sha1/sha1.c: silence compiler complaints by casting void* to uintptr_t
  2009-08-15  0:08   ` Johannes Schindelin
@ 2009-08-15  0:19     ` Linus Torvalds
  2009-08-15  0:44       ` Brandon Casey
  2009-08-15  0:52     ` Brandon Casey
  1 sibling, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2009-08-15  0:19 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Brandon Casey, nico, gitster, git, Brandon Casey

On Sat, 15 Aug 2009, Johannes Schindelin wrote:
> 
> I am confused.  Is sizeof(*(uintptr_t)NULL) not larger than 1

You can't do that. "uintptr_t" isn't actually a pointer. It's just a 
unsigned integer value large enough to contain a pointer.

So it's _not_ a "pointer to uint". It's "uint that can contain all the 
bits of a poitner".

> and as a consequence ((uintptr_t)p)+1 not different from ((void *)p)+1?

No, they're the same.

That said, I suspect it might as well be cast to "const char *", and then 
hopefully you only need one cast.

So maybe it could be written as

	data = (const char *) data + len;

instead, and avoid the second cast (because the assignment should be ok, 
since it's assigning back to a "const void *").

		Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] block-sha1/sha1.c: silence compiler complaints by casting void* to uintptr_t
  2009-08-15  0:19     ` Linus Torvalds
@ 2009-08-15  0:44       ` Brandon Casey
  2009-08-15  1:46         ` Junio C Hamano
  0 siblings, 1 reply; 17+ messages in thread
From: Brandon Casey @ 2009-08-15  0:44 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Johannes Schindelin, nico, gitster, git, Brandon Casey

Linus Torvalds wrote:
> I suspect it might as well be cast to "const char *", and then 
> hopefully you only need one cast.
> 
> So maybe it could be written as
> 
> 	data = (const char *) data + len;
> 
> instead, and avoid the second cast (because the assignment should be ok, 
> since it's assigning back to a "const void *").

Yep, that's enough.  It produces an identical binary on all platforms
except Solaris x86 using SUNWspro compiler.

-brandon

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] block-sha1/sha1.c: silence compiler complaints by casting void* to uintptr_t
  2009-08-15  0:44       ` Brandon Casey
@ 2009-08-15  1:46         ` Junio C Hamano
  2009-08-15  2:34           ` Brandon Casey
  0 siblings, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2009-08-15  1:46 UTC (permalink / raw)
  To: Brandon Casey
  Cc: Linus Torvalds, Johannes Schindelin, nico, git, Brandon Casey

Brandon Casey <casey@nrlssc.navy.mil> writes:

> Linus Torvalds wrote:
>> I suspect it might as well be cast to "const char *", and then 
>> hopefully you only need one cast.
>> 
>> So maybe it could be written as
>> 
>> 	data = (const char *) data + len;
>> 
>> instead, and avoid the second cast (because the assignment should be ok, 
>> since it's assigning back to a "const void *").
>
> Yep, that's enough.  It produces an identical binary on all platforms
> except Solaris x86 using SUNWspro compiler.

Casting to "const char *" to tell compilers that we are interested in byte
address differences/increments makes sense to me, but I do not know how to
interpret your last comment.

Do you mean that SUNWspro compiler misbehave with the "cast to char *" and
do you meed your "casting to uintptr_t to explicitly compute byte
addresses as int" to make it behave?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] block-sha1/sha1.c: silence compiler complaints by casting void* to uintptr_t
  2009-08-15  1:46         ` Junio C Hamano
@ 2009-08-15  2:34           ` Brandon Casey
  2009-08-15  3:16             ` Junio C Hamano
  0 siblings, 1 reply; 17+ messages in thread
From: Brandon Casey @ 2009-08-15  2:34 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Linus Torvalds, Johannes Schindelin, nico, git, Brandon Casey

Junio C Hamano wrote:
> Brandon Casey <casey@nrlssc.navy.mil> writes:
> 
>> Linus Torvalds wrote:
>>> I suspect it might as well be cast to "const char *", and then 
>>> hopefully you only need one cast.
>>>
>>> So maybe it could be written as
>>>
>>> 	data = (const char *) data + len;
>>>
>>> instead, and avoid the second cast (because the assignment should be ok, 
>>> since it's assigning back to a "const void *").
>> Yep, that's enough.  It produces an identical binary on all platforms
>> except Solaris x86 using SUNWspro compiler.
> 
> Casting to "const char *" to tell compilers that we are interested in byte
> address differences/increments makes sense to me,

Yes, I agree.

> but I do not know how to interpret your last comment.
> Do you mean that SUNWspro compiler misbehave with the "cast to char *" and
> do you meed your "casting to uintptr_t to explicitly compute byte
> addresses as int" to make it behave?

No, sorry, the SUNWspro compiler produces a working binary with any of the
three versions: the original, cast to uintptr_t, or cast to const char*.
Though it emits a warning with the original.

I was just pointing out that the SUNWspro compiler does not produce an
identical binary on x86 for each of the three versions like the GNU
compiler does.

Sorry for the confusion.  Linus's suggestion of casting to "const char*"
should be adopted.

-brandon

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] block-sha1/sha1.c: silence compiler complaints by casting void* to uintptr_t
  2009-08-15  2:34           ` Brandon Casey
@ 2009-08-15  3:16             ` Junio C Hamano
  2009-08-15 18:43               ` Linus Torvalds
  0 siblings, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2009-08-15  3:16 UTC (permalink / raw)
  To: Brandon Casey
  Cc: Junio C Hamano, Linus Torvalds, Johannes Schindelin, nico, git,
	Brandon Casey

Brandon Casey <casey@nrlssc.navy.mil> writes:

> I was just pointing out that the SUNWspro compiler does not produce an
> identical binary on x86 for each of the three versions like the GNU
> compiler does.
>
> Sorry for the confusion.  Linus's suggestion of casting to "const char*"
> should be adopted.

Ah, I see.  Thanks.

Let's queue this, then.

-- >8 --
From: Brandon Casey <drafnel@gmail.com>
Date: Fri, 14 Aug 2009 17:52:15 -0500
Subject: [PATCH] block-sha1/sha1.c: silence compiler complaints by casting void * to char *

Some compilers produce errors when arithmetic is attempted on pointers to
void.  We want computations done on byte addresses, so cast them to char *
to work them around.

Signed-off-by: Brandon Casey <casey@nrlssc.navy.mil>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 block-sha1/sha1.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index e5a1007..464cb25 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -246,14 +246,14 @@ void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len)
 		memcpy(lenW + (char *)ctx->W, data, left);
 		lenW = (lenW + left) & 63;
 		len -= left;
-		data += left;
+		data = ((const char *)data + left);
 		if (lenW)
 			return;
 		blk_SHA1_Block(ctx, ctx->W);
 	}
 	while (len >= 64) {
 		blk_SHA1_Block(ctx, data);
-		data += 64;
+		data = ((const char *)data + 64);
 		len -= 64;
 	}
 	if (len)

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] block-sha1/sha1.c: silence compiler complaints by casting void* to uintptr_t
  2009-08-15  3:16             ` Junio C Hamano
@ 2009-08-15 18:43               ` Linus Torvalds
  0 siblings, 0 replies; 17+ messages in thread
From: Linus Torvalds @ 2009-08-15 18:43 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Brandon Casey, Johannes Schindelin, nico, git, Brandon Casey



On Fri, 14 Aug 2009, Junio C Hamano wrote:
>
> From: Brandon Casey <drafnel@gmail.com>
> Date: Fri, 14 Aug 2009 17:52:15 -0500
> Subject: [PATCH] block-sha1/sha1.c: silence compiler complaints by casting void * to char *
> 
> Some compilers produce errors when arithmetic is attempted on pointers to
> void.  We want computations done on byte addresses, so cast them to char *
> to work them around.
> 
> Signed-off-by: Brandon Casey <casey@nrlssc.navy.mil>
> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Unnecessary outer parentheses, but ack anyway.

		Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] block-sha1/sha1.c: silence compiler complaints by casting void* to uintptr_t
  2009-08-15  0:08   ` Johannes Schindelin
  2009-08-15  0:19     ` Linus Torvalds
@ 2009-08-15  0:52     ` Brandon Casey
  1 sibling, 0 replies; 17+ messages in thread
From: Brandon Casey @ 2009-08-15  0:52 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: nico, gitster, torvalds, git, Brandon Casey

Johannes Schindelin wrote:
> Hi,
> 
> On Fri, 14 Aug 2009, Brandon Casey wrote:
> 
>> Some compilers produce errors when arithmetic is attempted on pointers to
>> void.  So cast to uintptr_t when performing arithmetic on void pointers.
> 
> I am confused.  Is sizeof(*(uintptr_t)NULL) not larger than 1, and as a 
> consequence ((uintptr_t)p)+1 not different from ((void *)p)+1?

If you try this:

   printf("NULL + 1: %u\n", (void*)NULL + 1);

the MIPSpro compiler complains like:

   The expression must be a pointer to a complete object type.

        printf("NULL + 1: %u\n", (void*)NULL + 1);
                                 ^

   1 error detected in the compilation of "test.c".


Compilers other than gcc at least issue a warning, if they do
not also fail.

-brandon

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] block-sha1: more good unaligned memory access candidates
@ 2009-08-13 20:15 George Spelvin
  2009-08-13 21:28 ` Nicolas Pitre
  0 siblings, 1 reply; 17+ messages in thread
From: George Spelvin @ 2009-08-13 20:15 UTC (permalink / raw)
  To: git, gitster; +Cc: linux, nico, torvalds

> Wow.  Is it now faster than the arm/ and ppc/ hand-tweaked assembly?

It's probably faster than the ARM, which was tuned for size rather
than speed, but if you want to rework the assembly for speed, the ARM's
rotate-and-add operations allow tricks which I doubt GCC can pick up on.
(You have to notice that the F(b,c,d) function is bitwise, so you can
do it on rotated data and do the rotate when you add the result to e.)

I'd be surprised if it were faster than PPC code, especially on the
in-order G3 and G4 cores where careful scheduling really pays off.
But maybe I just get to be surprised...

For automatic assembly tuning, I was thinking of having a .c file that
has a bunch of #ifdef __PPC__ statements that gets run through $(CC) -E.
That should be a fairly portable way to 

The other question about unaligned access is whether it's beneficial
to make the fetch loop work like this:

	char const *in;
	uint32_t *out
	unsigned lsb = (unsigned)p & 3;
	uint32_t const *p32 = (uint32_t const *)(in - lsb);
	uint32_t t = ntohl(*p32);

	switch (lsb) {

	case 0:
		*out++ = t;
		for (i = 1; i < 16; i++)
			*out++ = ntohl(*++p32);
		break;
	case 1:
		for (i = 0; i < 16; i++) {
			uint32_t s = t << 8;
			t = ntohl(*++p32);
			*out++ = s | t >> 24;
		}
		break;
	case 1:
		for (i = 0; i < 16; i++) {
			uint32_t s = t << 16;
			t = ntohl(*++p32);
			*out++ = s | t >> 16;
		}
		break;
	case 1:
		for (i = 0; i < 16; i++) {
			uint32_t s = t << 24;
			t = ntohl(*++p32);
			*out++ = s | t >> 8;
		}
		break;
	}

On the ARM, at least, ntohl() isn't particularly cheap, so loading 4
bytes and assembling them turns out to be cheaper.  But it's a thought.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] block-sha1: more good unaligned memory access candidates
  2009-08-13 20:15 [PATCH] block-sha1: more good unaligned memory access candidates George Spelvin
@ 2009-08-13 21:28 ` Nicolas Pitre
  0 siblings, 0 replies; 17+ messages in thread
From: Nicolas Pitre @ 2009-08-13 21:28 UTC (permalink / raw)
  To: George Spelvin; +Cc: git, gitster, torvalds

On Thu, 13 Aug 2009, George Spelvin wrote:

> > Wow.  Is it now faster than the arm/ and ppc/ hand-tweaked assembly?
> 
> It's probably faster than the ARM, which was tuned for size rather
> than speed, but if you want to rework the assembly for speed, the ARM's
> rotate-and-add operations allow tricks which I doubt GCC can pick up on.
> (You have to notice that the F(b,c,d) function is bitwise, so you can
> do it on rotated data and do the rotate when you add the result to e.)

gcc is not too bad at merging ALU and shift/rotate operations into the 
same instruction.  However, to make a really optimal ARM version, some 
custom SHA_ROUND macros with inline assembly could be made.  I suspect 
that wouldn't gain much though, as the pure shift/rotate mov 
instructions really aren't that many in the generated code.

> I'd be surprised if it were faster than PPC code, especially on the
> in-order G3 and G4 cores where careful scheduling really pays off.
> But maybe I just get to be surprised...

Given that PPC has enough register to hold the entire state, it is then 
only a matter of proper instruction scheduling which modern gcc ought to 
do right.  If not then this is a good test case for gcc people to fix 
the PPC machine pipeline description.

> For automatic assembly tuning, I was thinking of having a .c file that
> has a bunch of #ifdef __PPC__ statements that gets run through $(CC) -E.
> That should be a fairly portable way to 

??

> The other question about unaligned access is whether it's beneficial
> to make the fetch loop work like this:
> 
> 	char const *in;
> 	uint32_t *out
> 	unsigned lsb = (unsigned)p & 3;
> 	uint32_t const *p32 = (uint32_t const *)(in - lsb);
> 	uint32_t t = ntohl(*p32);
> 
> 	switch (lsb) {
> 
> 	case 0:
> 		*out++ = t;
> 		for (i = 1; i < 16; i++)
> 			*out++ = ntohl(*++p32);
> 		break;
> 	case 1:
> 		for (i = 0; i < 16; i++) {
> 			uint32_t s = t << 8;
> 			t = ntohl(*++p32);
> 			*out++ = s | t >> 24;
> 		}
> 		break;
> 	case 1:
> 		for (i = 0; i < 16; i++) {
> 			uint32_t s = t << 16;
> 			t = ntohl(*++p32);
> 			*out++ = s | t >> 16;
> 		}
> 		break;
> 	case 1:
> 		for (i = 0; i < 16; i++) {
> 			uint32_t s = t << 24;
> 			t = ntohl(*++p32);
> 			*out++ = s | t >> 8;
> 		}
> 		break;
> 	}
> 
> On the ARM, at least, ntohl() isn't particularly cheap, so loading 4
> bytes and assembling them turns out to be cheaper.  But it's a thought.

Well, that would have to be tested.  This could possibly only be a gain 
if you have a fast ntohl() though.

And the other question is also to decide when this is good enough for a 
generic version.  Gaining 5% speedup on raw SHA1 throughput with ugly 
code might not be worth the maintenance hassle.  At that point you might 
as well go back to a purely asm version if you really want to get the 
extra edge.

Nicolas

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2009-08-15 18:45 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-08-13  4:29 [PATCH] block-sha1: more good unaligned memory access candidates Nicolas Pitre
2009-08-13 16:45 ` Linus Torvalds
2009-08-13 17:23   ` Nicolas Pitre
2009-08-13 19:33     ` Junio C Hamano
2009-08-13 19:54       ` Linus Torvalds
2009-08-13 20:13       ` Nicolas Pitre
2009-08-14 22:52 ` [PATCH] block-sha1/sha1.c: silence compiler complaints by casting void* to uintptr_t Brandon Casey
2009-08-15  0:08   ` Johannes Schindelin
2009-08-15  0:19     ` Linus Torvalds
2009-08-15  0:44       ` Brandon Casey
2009-08-15  1:46         ` Junio C Hamano
2009-08-15  2:34           ` Brandon Casey
2009-08-15  3:16             ` Junio C Hamano
2009-08-15 18:43               ` Linus Torvalds
2009-08-15  0:52     ` Brandon Casey
  -- strict thread matches above, loose matches on Subject: below --
2009-08-13 20:15 [PATCH] block-sha1: more good unaligned memory access candidates George Spelvin
2009-08-13 21:28 ` Nicolas Pitre

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox