Git development
* Re: [PATCH] block-sha1: more good unaligned memory access candidates
@ 2009-08-13 20:15 George Spelvin
  2009-08-13 21:28 ` Nicolas Pitre
From: George Spelvin @ 2009-08-13 20:15 UTC
  To: git, gitster; +Cc: linux, nico, torvalds

> Wow.  Is it now faster than the arm/ and ppc/ hand-tweaked assembly?

It's probably faster than the ARM, which was tuned for size rather
than speed, but if you want to rework the assembly for speed, the ARM's
rotate-and-add operations allow tricks which I doubt GCC can pick up on.
(You have to notice that the F(b,c,d) function is bitwise, so you can
do it on rotated data and do the rotate when you add the result to e.)
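
A minimal sketch of why that works, not taken from any of the assembly
versions discussed here: each SHA-1 round function combines its inputs one
bit position at a time, so rotating all three inputs by n bits simply
rotates the result by n bits. The helper names (rotl32, f_ch, f_parity,
f_maj, check_rotation_commutes) and the sample values are illustrative
assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word left by n bits; valid for 0 < n < 32. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
	return (x << n) | (x >> (32 - n));
}

/* The three SHA-1 round functions; each acts on every bit position independently. */
static uint32_t f_ch(uint32_t b, uint32_t c, uint32_t d)
{
	return (b & c) | (~b & d);
}

static uint32_t f_parity(uint32_t b, uint32_t c, uint32_t d)
{
	return b ^ c ^ d;
}

static uint32_t f_maj(uint32_t b, uint32_t c, uint32_t d)
{
	return (b & c) | (b & d) | (c & d);
}

/*
 * Hypothetical self-check: for every rotate count, rotating the result of F
 * equals applying F to the rotated inputs.
 */
static void check_rotation_commutes(void)
{
	uint32_t b = 0x67452301, c = 0xefcdab89, d = 0x98badcfe;
	unsigned n;

	for (n = 1; n < 32; n++) {
		uint32_t rb = rotl32(b, n), rc = rotl32(c, n), rd = rotl32(d, n);

		assert(rotl32(f_ch(b, c, d), n) == f_ch(rb, rc, rd));
		assert(rotl32(f_parity(b, c, d), n) == f_parity(rb, rc, rd));
		assert(rotl32(f_maj(b, c, d), n) == f_maj(rb, rc, rd));
	}
}
```

Because of this, an implementation can leave the working variables in
rotated form and let the final add into e undo the rotation, which is where
an add with a rotated operand would come in.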

I'd be surprised if it were faster than PPC code, especially on the
in-order G3 and G4 cores where careful scheduling really pays off.
But maybe I just get to be surprised...

For automatic assembly tuning, I was thinking of having a .c file that
has a bunch of #ifdef __PPC__ statements that gets run through $(CC) -E.
That should be a fairly portable way to 


The other question about unaligned access is whether it's beneficial
to make the fetch loop work like this:

	char const *in;
	uint32_t *out;
	int i;
	unsigned lsb = (uintptr_t)in & 3;	/* byte offset of in within its aligned word */
	uint32_t const *p32 = (uint32_t const *)(in - lsb);
	uint32_t t = ntohl(*p32);

	switch (lsb) {

	case 0:
		*out++ = t;
		for (i = 1; i < 16; i++)
			*out++ = ntohl(*++p32);
		break;
	case 1:
		for (i = 0; i < 16; i++) {
			uint32_t s = t << 8;
			t = ntohl(*++p32);
			*out++ = s | t >> 24;
		}
		break;
	case 2:
		for (i = 0; i < 16; i++) {
			uint32_t s = t << 16;
			t = ntohl(*++p32);
			*out++ = s | t >> 16;
		}
		break;
	case 3:
		for (i = 0; i < 16; i++) {
			uint32_t s = t << 24;
			t = ntohl(*++p32);
			*out++ = s | t >> 8;
		}
		break;
	}

On the ARM, at least, ntohl() isn't particularly cheap, so loading 4
bytes and assembling them turns out to be cheaper.  But it's a thought.
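
For comparison, here is a hedged sketch of both approaches side by side. The
function names (load_be32_bytes, fetch_bytes, load_aligned_be32,
fetch_shift) are made up for illustration, and fetch_shift folds the three
unaligned cases into one loop with a variable shift instead of the unrolled
switch above. It assumes a POSIX ntohl() and that the aligned words
overlapping the input (up to 3 bytes before and after it) are readable.

```c
#include <arpa/inet.h>	/* ntohl(); assumes a POSIX system */
#include <stdint.h>
#include <string.h>

/* Byte-wise variant: load four bytes and assemble one big-endian word. */
static uint32_t load_be32_bytes(const unsigned char *p)
{
	return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
	       (uint32_t)p[2] << 8  | (uint32_t)p[3];
}

/* Fill out[0..15] from 64 bytes at in, whatever its alignment. */
static void fetch_bytes(const unsigned char *in, uint32_t out[16])
{
	int i;

	for (i = 0; i < 16; i++)
		out[i] = load_be32_bytes(in + 4 * i);
}

/* Load one aligned word and convert it from big-endian to host order. */
static uint32_t load_aligned_be32(const unsigned char *p)
{
	uint32_t w;

	memcpy(&w, p, sizeof(w));	/* memcpy sidesteps aliasing concerns */
	return ntohl(w);
}

/*
 * Shift-and-merge variant: only reads aligned words, then stitches
 * neighbouring words together.  Reads the aligned words that overlap
 * the input, so the caller must guarantee those bytes are readable.
 */
static void fetch_shift(const unsigned char *in, uint32_t out[16])
{
	unsigned lsb = (unsigned)((uintptr_t)in & 3);
	const unsigned char *p = in - lsb;
	uint32_t t = load_aligned_be32(p);
	int i;

	if (lsb == 0) {
		out[0] = t;
		for (i = 1; i < 16; i++)
			out[i] = load_aligned_be32(p + 4 * i);
		return;
	}
	for (i = 0; i < 16; i++) {
		uint32_t s = t << (8 * lsb);	/* tail of the previous word */

		p += 4;
		t = load_aligned_be32(p);
		out[i] = s | t >> (32 - 8 * lsb);	/* head of the next word */
	}
}
```

Both functions should produce identical output for any alignment; which one
is faster depends on the cost of ntohl() and of shifts on the target, so it
would need measuring.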

* [PATCH] block-sha1: more good unaligned memory access candidates
@ 2009-08-13  4:29 Nicolas Pitre
  2009-08-13 16:45 ` Linus Torvalds
From: Nicolas Pitre @ 2009-08-13  4:29 UTC
  To: Junio C Hamano; +Cc: Linus Torvalds, git

In addition to X86, PowerPC and S390 are capable of unaligned memory 
accesses.

Signed-off-by: Nicolas Pitre <nico@cam.org>

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index d3121f7..e5a1007 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -67,7 +67,10 @@
  * and is faster on architectures with memory alignment issues.
  */
 
-#if defined(__i386__) || defined(__x86_64__)
+#if defined(__i386__) || defined(__x86_64__) || \
+    defined(__ppc__) || defined(__ppc64__) || \
+    defined(__powerpc__) || defined(__powerpc64__) || \
+    defined(__s390__) || defined(__s390x__)
 
 #define get_be32(p)	ntohl(*(unsigned int *)(p))
 #define put_be32(p, v)	do { *(unsigned int *)(p) = htonl(v); } while (0)

