block-sha1: improve code on large-register-set machines

Git development
 help / color / mirror / Atom feed

From: Linus Torvalds <torvalds@linux-foundation.org>
To: Git Mailing List <git@vger.kernel.org>,
	Junio C Hamano <gitster@pobox.com>
Subject: block-sha1: improve code on large-register-set machines
Date: Mon, 10 Aug 2009 16:52:07 -0700 (PDT)	[thread overview]
Message-ID: <alpine.LFD.2.01.0908101637440.3417@localhost.localdomain> (raw)

For x86 performance (especially in 32-bit mode) I added that hack to write 
the SHA1 internal temporary hash using a volatile pointer, in order to get 
gcc to not try to cache the array contents. Because gcc will do all the 
wrong things, and then spill things in insane random ways.

But on architectures like PPC, where you have 32 registers, it's actually 
perfectly reasonable to put the whole temporary array[] into the register 
set, and gcc can do so.

So make the 'volatile unsigned int *' cast be dependent on a 
SMALL_REGISTER_SET preprocessor symbol, and enable it (currently) on just 
x86 and x86-64.  With that, the routine is fairly reasonable even when 
compared to the hand-scheduled PPC version. Ben Herrenschmidt reports on 
a G5:

 * Paulus asm version:       about 3.67s
 * Yours with no change:     about 5.74s
 * Yours without "volatile": about 3.78s

so with this the C version is within about 3% of the asm one.

And add a lot of commentary on what the heck is going on.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

I also asked David Miller to test the non-volatile version on Sparc, but I 
suspect it will have the same pattern. ia64 likewise (but I have not asked 
anybody to test).

Of the other architectures, ARM probably would wants SMALL_REGISTER_SET, 
but I suspect the problem there is the htonl() (on little-endian), and 
possibly the unaligned loads - at least on older ARM. The latter is 
something gcc could be taught about, though (the SHA_SRC macro would just 
need to use a pointer that goes through a packed struct member or 
something).

 block-sha1/sha1.c |   25 ++++++++++++++++++++++++-
 1 files changed, 24 insertions(+), 1 deletions(-)

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index 36da763..9bc8b8a 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -82,6 +82,7 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
 #define SHA_ASM(op, x, n) ({ unsigned int __res; asm(op " %1,%0":"=r" (__res):"i" (n), "0" (x)); __res; })
 #define SHA_ROL(x,n)	SHA_ASM("rol", x, n)
 #define SHA_ROR(x,n)	SHA_ASM("ror", x, n)
+#define SMALL_REGISTER_SET

 #else

@@ -93,7 +94,29 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)

 /* This "rolls" over the 512-bit array */
 #define W(x) (array[(x)&15])
-#define setW(x, val) (*(volatile unsigned int *)&W(x) = (val))
+
+/*
+ * If you have 32 registers or more, the compiler can (and should)
+ * try to change the array[] accesses into registers. However, on
+ * machines with less than ~25 registers, that won't really work,
+ * and at least gcc will make an unholy mess of it.
+ *
+ * So to avoid that mess which just slows things down, we force
+ * the stores to memory to actually happen (we might be better off
+ * with a 'W(t)=(val);asm("":"+m" (W(t))' there instead, as
+ * suggested by Artur Skawina - that will also make gcc unable to
+ * try to do the silly "optimize away loads" part because it won't
+ * see what the value will be).
+ *
+ * Ben Herrenschmidt reports that on PPC, the C version comes close
+ * to the optimized asm with this (ie on PPC you don't want that
+ * 'volatile', since there are lots of registers).
+ */
+#ifdef SMALL_REGISTER_SET
+  #define setW(x, val) (*(volatile unsigned int *)&W(x) = (val))
+#else
+  #define setW(x, val) (W(x) = (val))
+#endif

 /*
  * Where do we get the source from? The first 16 iterations get it from

next             reply	other threads:[~2009-08-10 23:52 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-08-10 23:52 Linus Torvalds [this message]
2009-08-11  6:15 ` block-sha1: improve code on large-register-set machines Nicolas Pitre
2009-08-11 15:04   ` Linus Torvalds
2009-08-11 18:00     ` Nicolas Pitre
2009-08-11 19:31       ` Nicolas Pitre
2009-08-11 21:20         ` Brandon Casey
2009-08-11 21:36           ` Nicolas Pitre
2009-08-11 21:49             ` Brandon Casey
2009-08-11 22:57           ` Linus Torvalds
2009-08-11 23:13             ` Brandon Casey
2009-08-11 15:43   ` Linus Torvalds
2009-08-11 20:03     ` Nicolas Pitre
2009-08-11 22:53       ` Linus Torvalds
2009-08-11 23:14         ` Linus Torvalds
2009-08-12  2:26           ` Nicolas Pitre
2009-08-11 23:45         ` Artur Skawina

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:36da763 dfblob:9bc8b8a )
 OR (
bs:"block-sha1: improve code on large-register-set machines" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=alpine.LFD.2.01.0908101637440.3417@localhost.localdomain \
    --to=torvalds@linux-foundation.org \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox