* [PATCH] block-sha1: more good unaligned memory access candidates @ 2009-08-13 4:29 Nicolas Pitre 2009-08-13 16:45 ` Linus Torvalds 2009-08-14 22:52 ` [PATCH] block-sha1/sha1.c: silence compiler complaints by casting void* to uintptr_t Brandon Casey 0 siblings, 2 replies; 17+ messages in thread From: Nicolas Pitre @ 2009-08-13 4:29 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, git In addition to X86, PowerPC and S390 are capable of unaligned memory accesses. Signed-off-by: Nicolas Pitre <nico@cam.org> diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c index d3121f7..e5a1007 100644 --- a/block-sha1/sha1.c +++ b/block-sha1/sha1.c @@ -67,7 +67,10 @@ * and is faster on architectures with memory alignment issues. */ -#if defined(__i386__) || defined(__x86_64__) +#if defined(__i386__) || defined(__x86_64__) || \ + defined(__ppc__) || defined(__ppc64__) || \ + defined(__powerpc__) || defined(__powerpc64__) || \ + defined(__s390__) || defined(__s390x__) #define get_be32(p) ntohl(*(unsigned int *)(p)) #define put_be32(p, v) do { *(unsigned int *)(p) = htonl(v); } while (0) ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH] block-sha1: more good unaligned memory access candidates 2009-08-13 4:29 [PATCH] block-sha1: more good unaligned memory access candidates Nicolas Pitre @ 2009-08-13 16:45 ` Linus Torvalds 2009-08-13 17:23 ` Nicolas Pitre 2009-08-14 22:52 ` [PATCH] block-sha1/sha1.c: silence compiler complaints by casting void* to uintptr_t Brandon Casey 1 sibling, 1 reply; 17+ messages in thread From: Linus Torvalds @ 2009-08-13 16:45 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Junio C Hamano, git On Thu, 13 Aug 2009, Nicolas Pitre wrote: > > In addition to X86, PowerPC and S390 are capable of unaligned memory > accesses. > > Signed-off-by: Nicolas Pitre <nico@cam.org> Ack on all your patches (1-3 + this). Looks fine to me. I do wonder if we should try to do basically "per-architecture hack header-files", and then have for each architecture a small trivial 'hack-x86.h' kind of thing that just does /* x86 hacks */ #define get_be32(p) ntohl(*(unsigned int *)(p)) #define put_be32(p, v) do { *(unsigned int *)(p) = htonl(v); } while (0) #define setW(x, val) (*(volatile unsigned int *)&W(x) = (val)) #define SHA_ASM(op, x, n) ({ unsigned int __res; __asm__(op " %1,%0":"=r" (__res):"i" (n), "0" (x)); __res; }) #define SHA_ROL(x,n) SHA_ASM("rol", x, n) #define SHA_ROR(x,n) SHA_ASM("ror", x, n) and then we'd have each architecture separated out. Add a few "generic helpers": - be32-generic.h: #define get_be32(p) ( \ (*((unsigned char *)(p) + 0) << 24) | \ (*((unsigned char *)(p) + 1) << 16) | \ (*((unsigned char *)(p) + 2) << 8) | \ (*((unsigned char *)(p) + 3) << 0) ) #define put_be32(p, v) do { \ unsigned int __v = (v); \ *((unsigned char *)(p) + 0) = __v >> 24; \ *((unsigned char *)(p) + 1) = __v >> 16; \ *((unsigned char *)(p) + 2) = __v >> 8; \ *((unsigned char *)(p) + 3) = __v >> 0; } while (0) - rotate-generic.h: #define SHA_ROT(X,l,r) (((X) << (l)) | ((X) >> (r))) #define SHA_ROL(X,n) SHA_ROT(X,n,32-(n)) #define SHA_ROR(X,n) SHA_ROT(X,32-(n),n) that architectures could use when they want to use some particular portable version. Then, add a final "hack-generic.h" with the fallback cases that just does #include "be32-generic.h" #include "rotate-generic.h" #define setW(x,val) (W(x) = (val)) and you'd have all the hacks separated out and fairly easily used by different architectures.. (ie the ARM version would just look like - hack-arm.h: #include "be32-generic.h" #include "rotate-generic.h" #define setW(x, val) do { W(x) = (val); __asm__("":::"memory"); } while (0) and you'd be all done. Hmm? I don't know if this kind of generalization is strictly needed, so I'm just throwing it out as an idea. Linus ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] block-sha1: more good unaligned memory access candidates 2009-08-13 16:45 ` Linus Torvalds @ 2009-08-13 17:23 ` Nicolas Pitre 2009-08-13 19:33 ` Junio C Hamano 0 siblings, 1 reply; 17+ messages in thread From: Nicolas Pitre @ 2009-08-13 17:23 UTC (permalink / raw) To: Linus Torvalds; +Cc: Junio C Hamano, git On Thu, 13 Aug 2009, Linus Torvalds wrote: > > > On Thu, 13 Aug 2009, Nicolas Pitre wrote: > > > > In addition to X86, PowerPC and S390 are capable of unaligned memory > > accesses. > > > > Signed-off-by: Nicolas Pitre <nico@cam.org> > > Ack on all your patches (1-3 + this). Looks fine to me. > > I do wonder if we should try to do basically "per-architecture hack > header-files", and then have for each architecture a small trivial > 'hack-x86.h' kind of thing that just does > > /* x86 hacks */ > #define get_be32(p) ntohl(*(unsigned int *)(p)) > #define put_be32(p, v) do { *(unsigned int *)(p) = htonl(v); } while (0) > #define setW(x, val) (*(volatile unsigned int *)&W(x) = (val)) > > #define SHA_ASM(op, x, n) ({ unsigned int __res; __asm__(op " %1,%0":"=r" (__res):"i" (n), "0" (x)); __res; }) > #define SHA_ROL(x,n) SHA_ASM("rol", x, n) > #define SHA_ROR(x,n) SHA_ASM("ror", x, n) > > and then we'd have each architecture separated out. Add a few > "generic helpers": > > - be32-generic.h: > > #define get_be32(p) ( \ > (*((unsigned char *)(p) + 0) << 24) | \ > (*((unsigned char *)(p) + 1) << 16) | \ > (*((unsigned char *)(p) + 2) << 8) | \ > (*((unsigned char *)(p) + 3) << 0) ) > > #define put_be32(p, v) do { \ > unsigned int __v = (v); \ > *((unsigned char *)(p) + 0) = __v >> 24; \ > *((unsigned char *)(p) + 1) = __v >> 16; \ > *((unsigned char *)(p) + 2) = __v >> 8; \ > *((unsigned char *)(p) + 3) = __v >> 0; } while (0) > > - rotate-generic.h: > > #define SHA_ROT(X,l,r) (((X) << (l)) | ((X) >> (r))) > #define SHA_ROL(X,n) SHA_ROT(X,n,32-(n)) > #define SHA_ROR(X,n) SHA_ROT(X,32-(n),n) > > that architectures could use when they want to use some particular > portable version. Then, add a final "hack-generic.h" with the fallback > cases that just does > > #include "be32-generic.h" > #include "rotate-generic.h" > #define setW(x,val) (W(x) = (val)) > > and you'd have all the hacks separated out and fairly easily used by > different architectures.. (ie the ARM version would just look like > > - hack-arm.h: > > #include "be32-generic.h" > #include "rotate-generic.h" > #define setW(x, val) do { W(x) = (val); __asm__("":::"memory"); } while (0) > > and you'd be all done. > > Hmm? I don't know if this kind of generalization is strictly needed, so > I'm just throwing it out as an idea. Well... first I don't think there are much else we can do with that code, i.e. there is probably not so many more hacks to add if any. With my last patch I consider this ready for general consumption. And given that the only user is this very sha1 implementation then there is not much value in abstracting them in header files. And the point of patch 2/3 was to move hack variants for the same purpose together making it easier to document and understand (and possibly modify) them. Your suggestion would go in the opposite direction entirely. As it is now, I was about to suggest: git mv block-sha1/sha1.[ch] . rmdir block-sha1 rm -r mozilla-sha1 rm -r arm rm -r ppc and remove support for openssl's SHA1 usage, making this implementation unconditional. After all it is faster, or so close to be faster than the alternatives, that we should probably cut on the extra dependency and simplify portability issues at the same time. Nicolas ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] block-sha1: more good unaligned memory access candidates 2009-08-13 17:23 ` Nicolas Pitre @ 2009-08-13 19:33 ` Junio C Hamano 2009-08-13 19:54 ` Linus Torvalds 2009-08-13 20:13 ` Nicolas Pitre 0 siblings, 2 replies; 17+ messages in thread From: Junio C Hamano @ 2009-08-13 19:33 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Linus Torvalds, git Nicolas Pitre <nico@cam.org> writes: > As it is now, I was about to suggest: > > git mv block-sha1/sha1.[ch] . > rmdir block-sha1 > rm -r mozilla-sha1 > rm -r arm > rm -r ppc > > and remove support for openssl's SHA1 usage, making this implementation > unconditional. After all it is faster, or so close to be faster than > the alternatives, that we should probably cut on the extra dependency > and simplify portability issues at the same time. Wow. Is it now faster than the arm/ and ppc/ hand-tweaked assembly? ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] block-sha1: more good unaligned memory access candidates 2009-08-13 19:33 ` Junio C Hamano @ 2009-08-13 19:54 ` Linus Torvalds 2009-08-13 20:13 ` Nicolas Pitre 1 sibling, 0 replies; 17+ messages in thread From: Linus Torvalds @ 2009-08-13 19:54 UTC (permalink / raw) To: Junio C Hamano; +Cc: Nicolas Pitre, git On Thu, 13 Aug 2009, Junio C Hamano wrote: > > Wow. Is it now faster than the arm/ and ppc/ hand-tweaked assembly? For the good cases, yes. For POWER, with gcc-4.4, the C code apparently outperforms the asm code on POWER6. The asm code is scheduled for POWER4, and I think outperforms the C code there. Also, when compiling in 64-bit mode (with "-m64"), at least some versions of gcc seem to do some stupid things and add extra zero extension stuff, and that performed suboptimally at least on a PPC G5. So it's certainly not a clear case of "the C code outperforms the asm code", but in BenH's tests, the best numbers really did come from the C version. With some silly cases of at least some versions gcc screwing up (not reload, but zero extension), and making it noticeably slower. IOW, the PPC situation really isn't that different from x86. Linus ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] block-sha1: more good unaligned memory access candidates 2009-08-13 19:33 ` Junio C Hamano 2009-08-13 19:54 ` Linus Torvalds @ 2009-08-13 20:13 ` Nicolas Pitre 1 sibling, 0 replies; 17+ messages in thread From: Nicolas Pitre @ 2009-08-13 20:13 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, git On Thu, 13 Aug 2009, Junio C Hamano wrote: > Nicolas Pitre <nico@cam.org> writes: > > > As it is now, I was about to suggest: > > > > git mv block-sha1/sha1.[ch] . > > rmdir block-sha1 > > rm -r mozilla-sha1 > > rm -r arm > > rm -r ppc > > > > and remove support for openssl's SHA1 usage, making this implementation > > unconditional. After all it is faster, or so close to be faster than > > the alternatives, that we should probably cut on the extra dependency > > and simplify portability issues at the same time. > > Wow. Is it now faster than the arm/ and ppc/ hand-tweaked assembly? It is indeed faster than the ARM assembly version by far, and faster than all the alternative implementations too, but with a 7x increase in compiled code size. In the context of Git I think this is a good compromize. Making the assembly version faster than the C version could be possible, but that would require quite some work and I don't expect the gain to be significant, certainly not worth the trouble. Furthermore the C version can be used to generate ARM Thumb code while the asm version cannot without yet more work. Nicolas ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH] block-sha1/sha1.c: silence compiler complaints by casting void* to uintptr_t 2009-08-13 4:29 [PATCH] block-sha1: more good unaligned memory access candidates Nicolas Pitre 2009-08-13 16:45 ` Linus Torvalds @ 2009-08-14 22:52 ` Brandon Casey 2009-08-15 0:08 ` Johannes Schindelin 1 sibling, 1 reply; 17+ messages in thread From: Brandon Casey @ 2009-08-14 22:52 UTC (permalink / raw) To: nico; +Cc: gitster, torvalds, git, Brandon Casey From: Brandon Casey <drafnel@gmail.com> Some compilers produce errors when arithmetic is attempted on pointers to void. So cast to uintptr_t when performing arithmetic on void pointers. Signed-off-by: Brandon Casey <casey@nrlssc.navy.mil> --- This, on top of Nico's last patch "block-sha1: more good unaligned memory access candidates", allows this series to compile and run successfully on Sun Sparc and IRIX MIPS for me. It produces no differences on Linux using gcc. gcc 3.4.6 and 4.1.2 produce an identical binary for me on x86 and x86-64 with and without this change. Any conceivable negative effects on other arch's? -brandon block-sha1/sha1.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c index e5a1007..ccaba9e 100644 --- a/block-sha1/sha1.c +++ b/block-sha1/sha1.c @@ -246,14 +246,14 @@ void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len) memcpy(lenW + (char *)ctx->W, data, left); lenW = (lenW + left) & 63; len -= left; - data += left; + data = (const void*) ((uintptr_t) data + left); if (lenW) return; blk_SHA1_Block(ctx, ctx->W); } while (len >= 64) { blk_SHA1_Block(ctx, data); - data += 64; + data = (const void*) ((uintptr_t) data + 64); len -= 64; } if (len) -- 1.6.4 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH] block-sha1/sha1.c: silence compiler complaints by casting void* to uintptr_t 2009-08-14 22:52 ` [PATCH] block-sha1/sha1.c: silence compiler complaints by casting void* to uintptr_t Brandon Casey @ 2009-08-15 0:08 ` Johannes Schindelin 2009-08-15 0:19 ` Linus Torvalds 2009-08-15 0:52 ` Brandon Casey 0 siblings, 2 replies; 17+ messages in thread From: Johannes Schindelin @ 2009-08-15 0:08 UTC (permalink / raw) To: Brandon Casey; +Cc: nico, gitster, torvalds, git, Brandon Casey Hi, On Fri, 14 Aug 2009, Brandon Casey wrote: > Some compilers produce errors when arithmetic is attempted on pointers to > void. So cast to uintptr_t when performing arithmetic on void pointers. I am confused. Is sizeof(*(uintptr_t)NULL) not larger than 1, and as a consequence ((uintptr_t)p)+1 not different from ((void *)p)+1? Ciao, Dscho ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] block-sha1/sha1.c: silence compiler complaints by casting void* to uintptr_t 2009-08-15 0:08 ` Johannes Schindelin @ 2009-08-15 0:19 ` Linus Torvalds 2009-08-15 0:44 ` Brandon Casey 2009-08-15 0:52 ` Brandon Casey 1 sibling, 1 reply; 17+ messages in thread From: Linus Torvalds @ 2009-08-15 0:19 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Brandon Casey, nico, gitster, git, Brandon Casey On Sat, 15 Aug 2009, Johannes Schindelin wrote: > > I am confused. Is sizeof(*(uintptr_t)NULL) not larger than 1 You can't do that. "uintptr_t" isn't actually a pointer. It's just a unsigned integer value large enough to contain a pointer. So it's _not_ a "pointer to uint". It's "uint that can contain all the bits of a poitner". > and as a consequence ((uintptr_t)p)+1 not different from ((void *)p)+1? No, they're the same. That said, I suspect it might as well be cast to "const char *", and then hopefully you only need one cast. So maybe it could be written as data = (const char *) data + len; instead, and avoid the second cast (because the assignment should be ok, since it's assigning back to a "const void *"). Linus ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] block-sha1/sha1.c: silence compiler complaints by casting void* to uintptr_t 2009-08-15 0:19 ` Linus Torvalds @ 2009-08-15 0:44 ` Brandon Casey 2009-08-15 1:46 ` Junio C Hamano 0 siblings, 1 reply; 17+ messages in thread From: Brandon Casey @ 2009-08-15 0:44 UTC (permalink / raw) To: Linus Torvalds; +Cc: Johannes Schindelin, nico, gitster, git, Brandon Casey Linus Torvalds wrote: > I suspect it might as well be cast to "const char *", and then > hopefully you only need one cast. > > So maybe it could be written as > > data = (const char *) data + len; > > instead, and avoid the second cast (because the assignment should be ok, > since it's assigning back to a "const void *"). Yep, that's enough. It produces an identical binary on all platforms except Solaris x86 using SUNWspro compiler. -brandon ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] block-sha1/sha1.c: silence compiler complaints by casting void* to uintptr_t 2009-08-15 0:44 ` Brandon Casey @ 2009-08-15 1:46 ` Junio C Hamano 2009-08-15 2:34 ` Brandon Casey 0 siblings, 1 reply; 17+ messages in thread From: Junio C Hamano @ 2009-08-15 1:46 UTC (permalink / raw) To: Brandon Casey Cc: Linus Torvalds, Johannes Schindelin, nico, git, Brandon Casey Brandon Casey <casey@nrlssc.navy.mil> writes: > Linus Torvalds wrote: >> I suspect it might as well be cast to "const char *", and then >> hopefully you only need one cast. >> >> So maybe it could be written as >> >> data = (const char *) data + len; >> >> instead, and avoid the second cast (because the assignment should be ok, >> since it's assigning back to a "const void *"). > > Yep, that's enough. It produces an identical binary on all platforms > except Solaris x86 using SUNWspro compiler. Casting to "const char *" to tell compilers that we are interested in byte address differences/increments makes sense to me, but I do not know how to interpret your last comment. Do you mean that SUNWspro compiler misbehave with the "cast to char *" and do you meed your "casting to uintptr_t to explicitly compute byte addresses as int" to make it behave? ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] block-sha1/sha1.c: silence compiler complaints by casting void* to uintptr_t 2009-08-15 1:46 ` Junio C Hamano @ 2009-08-15 2:34 ` Brandon Casey 2009-08-15 3:16 ` Junio C Hamano 0 siblings, 1 reply; 17+ messages in thread From: Brandon Casey @ 2009-08-15 2:34 UTC (permalink / raw) To: Junio C Hamano Cc: Linus Torvalds, Johannes Schindelin, nico, git, Brandon Casey Junio C Hamano wrote: > Brandon Casey <casey@nrlssc.navy.mil> writes: > >> Linus Torvalds wrote: >>> I suspect it might as well be cast to "const char *", and then >>> hopefully you only need one cast. >>> >>> So maybe it could be written as >>> >>> data = (const char *) data + len; >>> >>> instead, and avoid the second cast (because the assignment should be ok, >>> since it's assigning back to a "const void *"). >> Yep, that's enough. It produces an identical binary on all platforms >> except Solaris x86 using SUNWspro compiler. > > Casting to "const char *" to tell compilers that we are interested in byte > address differences/increments makes sense to me, Yes, I agree. > but I do not know how to interpret your last comment. > Do you mean that SUNWspro compiler misbehave with the "cast to char *" and > do you meed your "casting to uintptr_t to explicitly compute byte > addresses as int" to make it behave? No, sorry, the SUNWspro compiler produces a working binary with any of the three versions: the original, cast to uintptr_t, or cast to const char*. Though it emits a warning with the original. I was just pointing out that the SUNWspro compiler does not produce an identical binary on x86 for each of the three versions like the GNU compiler does. Sorry for the confusion. Linus's suggestion of casting to "const char*" should be adopted. -brandon ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] block-sha1/sha1.c: silence compiler complaints by casting void* to uintptr_t 2009-08-15 2:34 ` Brandon Casey @ 2009-08-15 3:16 ` Junio C Hamano 2009-08-15 18:43 ` Linus Torvalds 0 siblings, 1 reply; 17+ messages in thread From: Junio C Hamano @ 2009-08-15 3:16 UTC (permalink / raw) To: Brandon Casey Cc: Junio C Hamano, Linus Torvalds, Johannes Schindelin, nico, git, Brandon Casey Brandon Casey <casey@nrlssc.navy.mil> writes: > I was just pointing out that the SUNWspro compiler does not produce an > identical binary on x86 for each of the three versions like the GNU > compiler does. > > Sorry for the confusion. Linus's suggestion of casting to "const char*" > should be adopted. Ah, I see. Thanks. Let's queue this, then. -- >8 -- From: Brandon Casey <drafnel@gmail.com> Date: Fri, 14 Aug 2009 17:52:15 -0500 Subject: [PATCH] block-sha1/sha1.c: silence compiler complaints by casting void * to char * Some compilers produce errors when arithmetic is attempted on pointers to void. We want computations done on byte addresses, so cast them to char * to work them around. Signed-off-by: Brandon Casey <casey@nrlssc.navy.mil> Signed-off-by: Junio C Hamano <gitster@pobox.com> --- block-sha1/sha1.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c index e5a1007..464cb25 100644 --- a/block-sha1/sha1.c +++ b/block-sha1/sha1.c @@ -246,14 +246,14 @@ void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len) memcpy(lenW + (char *)ctx->W, data, left); lenW = (lenW + left) & 63; len -= left; - data += left; + data = ((const char *)data + left); if (lenW) return; blk_SHA1_Block(ctx, ctx->W); } while (len >= 64) { blk_SHA1_Block(ctx, data); - data += 64; + data = ((const char *)data + 64); len -= 64; } if (len) ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH] block-sha1/sha1.c: silence compiler complaints by casting void* to uintptr_t 2009-08-15 3:16 ` Junio C Hamano @ 2009-08-15 18:43 ` Linus Torvalds 0 siblings, 0 replies; 17+ messages in thread From: Linus Torvalds @ 2009-08-15 18:43 UTC (permalink / raw) To: Junio C Hamano Cc: Brandon Casey, Johannes Schindelin, nico, git, Brandon Casey On Fri, 14 Aug 2009, Junio C Hamano wrote: > > From: Brandon Casey <drafnel@gmail.com> > Date: Fri, 14 Aug 2009 17:52:15 -0500 > Subject: [PATCH] block-sha1/sha1.c: silence compiler complaints by casting void * to char * > > Some compilers produce errors when arithmetic is attempted on pointers to > void. We want computations done on byte addresses, so cast them to char * > to work them around. > > Signed-off-by: Brandon Casey <casey@nrlssc.navy.mil> > Signed-off-by: Junio C Hamano <gitster@pobox.com> Unnecessary outer parentheses, but ack anyway. Linus ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] block-sha1/sha1.c: silence compiler complaints by casting void* to uintptr_t 2009-08-15 0:08 ` Johannes Schindelin 2009-08-15 0:19 ` Linus Torvalds @ 2009-08-15 0:52 ` Brandon Casey 1 sibling, 0 replies; 17+ messages in thread From: Brandon Casey @ 2009-08-15 0:52 UTC (permalink / raw) To: Johannes Schindelin; +Cc: nico, gitster, torvalds, git, Brandon Casey Johannes Schindelin wrote: > Hi, > > On Fri, 14 Aug 2009, Brandon Casey wrote: > >> Some compilers produce errors when arithmetic is attempted on pointers to >> void. So cast to uintptr_t when performing arithmetic on void pointers. > > I am confused. Is sizeof(*(uintptr_t)NULL) not larger than 1, and as a > consequence ((uintptr_t)p)+1 not different from ((void *)p)+1? If you try this: printf("NULL + 1: %u\n", (void*)NULL + 1); the MIPSpro compiler complains like: The expression must be a pointer to a complete object type. printf("NULL + 1: %u\n", (void*)NULL + 1); ^ 1 error detected in the compilation of "test.c". Compilers other than gcc at least issue a warning, if they do not also fail. -brandon ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] block-sha1: more good unaligned memory access candidates
@ 2009-08-13 20:15 George Spelvin
2009-08-13 21:28 ` Nicolas Pitre
0 siblings, 1 reply; 17+ messages in thread
From: George Spelvin @ 2009-08-13 20:15 UTC (permalink / raw)
To: git, gitster; +Cc: linux, nico, torvalds
> Wow. Is it now faster than the arm/ and ppc/ hand-tweaked assembly?
It's probably faster than the ARM, which was tuned for size rather
than speed, but if you want to rework the assembly for speed, the ARM's
rotate-and-add operations allow tricks which I doubt GCC can pick up on.
(You have to notice that the F(b,c,d) function is bitwise, so you can
do it on rotated data and do the rotate when you add the result to e.)
I'd be surprised if it were faster than PPC code, especially on the
in-order G3 and G4 cores where careful scheduling really pays off.
But maybe I just get to be surprised...
For automatic assembly tuning, I was thinking of having a .c file that
has a bunch of #ifdef __PPC__ statements that gets run through $(CC) -E.
That should be a fairly portable way to
The other question about unaligned access is whether it's beneficial
to make the fetch loop work like this:
char const *in;
uint32_t *out
unsigned lsb = (unsigned)p & 3;
uint32_t const *p32 = (uint32_t const *)(in - lsb);
uint32_t t = ntohl(*p32);
switch (lsb) {
case 0:
*out++ = t;
for (i = 1; i < 16; i++)
*out++ = ntohl(*++p32);
break;
case 1:
for (i = 0; i < 16; i++) {
uint32_t s = t << 8;
t = ntohl(*++p32);
*out++ = s | t >> 24;
}
break;
case 1:
for (i = 0; i < 16; i++) {
uint32_t s = t << 16;
t = ntohl(*++p32);
*out++ = s | t >> 16;
}
break;
case 1:
for (i = 0; i < 16; i++) {
uint32_t s = t << 24;
t = ntohl(*++p32);
*out++ = s | t >> 8;
}
break;
}
On the ARM, at least, ntohl() isn't particularly cheap, so loading 4
bytes and assembling them turns out to be cheaper. But it's a thought.
^ permalink raw reply [flat|nested] 17+ messages in thread* Re: [PATCH] block-sha1: more good unaligned memory access candidates 2009-08-13 20:15 [PATCH] block-sha1: more good unaligned memory access candidates George Spelvin @ 2009-08-13 21:28 ` Nicolas Pitre 0 siblings, 0 replies; 17+ messages in thread From: Nicolas Pitre @ 2009-08-13 21:28 UTC (permalink / raw) To: George Spelvin; +Cc: git, gitster, torvalds On Thu, 13 Aug 2009, George Spelvin wrote: > > Wow. Is it now faster than the arm/ and ppc/ hand-tweaked assembly? > > It's probably faster than the ARM, which was tuned for size rather > than speed, but if you want to rework the assembly for speed, the ARM's > rotate-and-add operations allow tricks which I doubt GCC can pick up on. > (You have to notice that the F(b,c,d) function is bitwise, so you can > do it on rotated data and do the rotate when you add the result to e.) gcc is not too bad at merging ALU and shift/rotate operations into the same instruction. However, to make a really optimal ARM version, some custom SHA_ROUND macros with inline assembly could be made. I suspect that wouldn't gain much though, as the pure shift/rotate mov instructions really aren't that many in the generated code. > I'd be surprised if it were faster than PPC code, especially on the > in-order G3 and G4 cores where careful scheduling really pays off. > But maybe I just get to be surprised... Given that PPC has enough register to hold the entire state, it is then only a matter of proper instruction scheduling which modern gcc ought to do right. If not then this is a good test case for gcc people to fix the PPC machine pipeline description. > For automatic assembly tuning, I was thinking of having a .c file that > has a bunch of #ifdef __PPC__ statements that gets run through $(CC) -E. > That should be a fairly portable way to ?? > The other question about unaligned access is whether it's beneficial > to make the fetch loop work like this: > > char const *in; > uint32_t *out > unsigned lsb = (unsigned)p & 3; > uint32_t const *p32 = (uint32_t const *)(in - lsb); > uint32_t t = ntohl(*p32); > > switch (lsb) { > > case 0: > *out++ = t; > for (i = 1; i < 16; i++) > *out++ = ntohl(*++p32); > break; > case 1: > for (i = 0; i < 16; i++) { > uint32_t s = t << 8; > t = ntohl(*++p32); > *out++ = s | t >> 24; > } > break; > case 1: > for (i = 0; i < 16; i++) { > uint32_t s = t << 16; > t = ntohl(*++p32); > *out++ = s | t >> 16; > } > break; > case 1: > for (i = 0; i < 16; i++) { > uint32_t s = t << 24; > t = ntohl(*++p32); > *out++ = s | t >> 8; > } > break; > } > > On the ARM, at least, ntohl() isn't particularly cheap, so loading 4 > bytes and assembling them turns out to be cheaper. But it's a thought. Well, that would have to be tested. This could possibly only be a gain if you have a fast ntohl() though. And the other question is also to decide when this is good enough for a generic version. Gaining 5% speedup on raw SHA1 throughput with ugly code might not be worth the maintenance hassle. At that point you might as well go back to a purely asm version if you really want to get the extra edge. Nicolas ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2009-08-15 18:45 UTC | newest] Thread overview: 17+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2009-08-13 4:29 [PATCH] block-sha1: more good unaligned memory access candidates Nicolas Pitre 2009-08-13 16:45 ` Linus Torvalds 2009-08-13 17:23 ` Nicolas Pitre 2009-08-13 19:33 ` Junio C Hamano 2009-08-13 19:54 ` Linus Torvalds 2009-08-13 20:13 ` Nicolas Pitre 2009-08-14 22:52 ` [PATCH] block-sha1/sha1.c: silence compiler complaints by casting void* to uintptr_t Brandon Casey 2009-08-15 0:08 ` Johannes Schindelin 2009-08-15 0:19 ` Linus Torvalds 2009-08-15 0:44 ` Brandon Casey 2009-08-15 1:46 ` Junio C Hamano 2009-08-15 2:34 ` Brandon Casey 2009-08-15 3:16 ` Junio C Hamano 2009-08-15 18:43 ` Linus Torvalds 2009-08-15 0:52 ` Brandon Casey -- strict thread matches above, loose matches on Subject: below -- 2009-08-13 20:15 [PATCH] block-sha1: more good unaligned memory access candidates George Spelvin 2009-08-13 21:28 ` Nicolas Pitre
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox