* [RFC/PATCH] micro-optimize get_sha1_hex()
@ 2006-09-09 21:55 Junio C Hamano
2006-09-09 22:33 ` Jeff Garzik
2006-09-10 0:06 ` Linus Torvalds
0 siblings, 2 replies; 4+ messages in thread
From: Junio C Hamano @ 2006-09-09 21:55 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
I was profiling 'git-rev-list v2.6.12..', because I suspected
insert_by_date() might be expensive (the function inserts into a
singly-linked ordered list, so the data structure would have to
become array-based to allow optimization). But profiling showed it
was not the bottleneck, probably because the kernel history is not
that bushy and we do not have too many "active" heads during
traversal.
But I noticed something else.
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
21.52      0.17     0.17  2800320     0.00     0.00  hexval
15.19      0.29     0.12    70008     0.00     0.00  get_sha1_hex
15.19      0.41     0.12    70008     0.00     0.00  lookup_object
15.19      0.53     0.12    68495     0.00     0.00  find_pack_entry_one
11.39      0.62     0.09   198667     0.00     0.00  insert_obj_hash
 7.60      0.68     0.06    34675     0.00     0.00  unpack_object_header_gently
 3.80      0.71     0.03    33819     0.00     0.02  parse_commit_buffer
 1.27      0.72     0.01   103822     0.00     0.00  commit_list_insert
 1.27      0.73     0.01    67640     0.00     0.00  prepare_packed_git
 1.27      0.74     0.01    67611     0.00     0.00  created_object
 1.27      0.75     0.01    33820     0.00     0.00  use_packed_git
 1.27      0.76     0.01    33819     0.00     0.00  lookup_tree
 1.27      0.77     0.01        1    10.00    10.00  prepare_packed_git_one
 1.27      0.78     0.01                             parse_tree_indirect
 1.27      0.79     0.01                             verify_filename
...
The attached brings get_sha1_hex() down from 15.19% to 5.41%,
but I feel we should be able to do better.
Is this barking up the wrong tree? Or did I pick a good target
but the shooter wasn't skilled enough?
diff --git a/sha1_file.c b/sha1_file.c
index 428d791..00aa364 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -26,26 +26,30 @@ const unsigned char null_sha1[20];
static unsigned int sha1_file_open_flag = O_NOATIME;
-static unsigned hexval(char c)
-{
- if (c >= '0' && c <= '9')
- return c - '0';
- if (c >= 'a' && c <= 'f')
- return c - 'a' + 10;
- if (c >= 'A' && c <= 'F')
- return c - 'A' + 10;
- return ~0;
-}
+static const unsigned char hexval[] = {
+ 0, 1, 2, 3, 4, 5, 6, 7, /* 30-37 */
+ 8, 9, 255, 255, 255, 255, 255, 255, /* 38-3F */
+ 255, 10, 11, 12, 13, 14, 15, 255, /* 40-47 */
+ 255, 255, 255, 255, 255, 255, 255, 255, /* 48-4F */
+ 255, 255, 255, 255, 255, 255, 255, 255, /* 50-57 */
+ 255, 255, 255, 255, 255, 255, 255, 255, /* 58-5F */
+ 255, 10, 11, 12, 13, 14, 15, 255, /* 60-67 */
+};
int get_sha1_hex(const char *hex, unsigned char *sha1)
{
int i;
for (i = 0; i < 20; i++) {
- unsigned int val = (hexval(hex[0]) << 4) | hexval(hex[1]);
- if (val & ~0xff)
+ unsigned int v, w, val;
+ v = *hex++;
+ if ((v < '0') || ('f' < v) ||
+ ((v = hexval[v-'0']) == 255))
+ return -1;
+ w = *hex++;
+ if ((w < '0') || ('f' < w) ||
+ ((w = hexval[w-'0']) == 255))
return -1;
- *sha1++ = val;
- hex += 2;
+ *sha1++ = (v << 4) | w;
}
return 0;
}
* Re: [RFC/PATCH] micro-optimize get_sha1_hex()
2006-09-09 21:55 [RFC/PATCH] micro-optimize get_sha1_hex() Junio C Hamano
@ 2006-09-09 22:33 ` Jeff Garzik
2006-09-10 0:06 ` Linus Torvalds
1 sibling, 0 replies; 4+ messages in thread
From: Jeff Garzik @ 2006-09-09 22:33 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Linus Torvalds, git
Junio C Hamano wrote:
> int get_sha1_hex(const char *hex, unsigned char *sha1)
> {
> int i;
> for (i = 0; i < 20; i++) {
> - unsigned int val = (hexval(hex[0]) << 4) | hexval(hex[1]);
> - if (val & ~0xff)
> + unsigned int v, w, val;
> + v = *hex++;
> + if ((v < '0') || ('f' < v) ||
> + ((v = hexval[v-'0']) == 255))
> + return -1;
> + w = *hex++;
> + if ((w < '0') || ('f' < w) ||
> + ((w = hexval[w-'0']) == 255))
> return -1;
> - *sha1++ = val;
> - hex += 2;
> + *sha1++ = (v << 4) | w;
Why not just make the table include the range of all possible
characters? That would eliminate some comparisons and subtractions
against '0'.
The end result would look more like the current (unpatched) form, except
with a function replaced by the table.
Jeff
* Re: [RFC/PATCH] micro-optimize get_sha1_hex()
2006-09-09 21:55 [RFC/PATCH] micro-optimize get_sha1_hex() Junio C Hamano
2006-09-09 22:33 ` Jeff Garzik
@ 2006-09-10 0:06 ` Linus Torvalds
2006-09-10 0:53 ` Linus Torvalds
1 sibling, 1 reply; 4+ messages in thread
From: Linus Torvalds @ 2006-09-10 0:06 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Sat, 9 Sep 2006, Junio C Hamano wrote:
>
> I was profiling 'git-rev-list v2.6.12..', because I suspected
> insert_by_date() might be expensive (the function inserts into a
> singly-linked ordered list, so the data structure would have to
> become array-based to allow optimization).
I _suspect_ that you profiled using "gprof" and the "-pg" flag to
the compiler?
The thing is, that generates almost totally bogus profiles. It makes
simple function calls look extremely - and unrealistically - expensive,
because the profiling code itself adds lots of overhead, and it then ends
up showing on the tick-based profile too in a way that it wouldn't show
in real life.
It tends to be much better to profile unmodified binaries using
"oprofile", which doesn't end up giving the exact number of calls, but
where the time-based profiling is a _lot_ more reliable.
That said, clearly "hexval()" can be optimized, and it's quite possible
that it would show up even in oprofile data too.
> The attached brings get_sha1_hex() down from 15.19% to 5.41%,
> but I feel we should be able to do better.
I doubt it's 5% in real life, but your optimization looks fine to me.
You can make it even _more_ optimized by not even bothering to test at
each point, but just inlining hexval(), and making it a single array
lookup. The point being that if you get -1 for _either_ of the two
lookups, since we shift them and or them together, the end result will be
invalid, and you only need to _test_ once.
So this patch should generate much better code, where the inner loop is
something like
<get_sha1_hex+16>:
add $0x1,%esi # count of result bytes..
cmp $0x14,%esi # did we have 20 already?
mov %dl,(%ebx) # save the "last loop" result
je 0x807630b <get_sha1_hex+75> # break out
add $0x1,%ebx # update dest pointer (one byte at a time)
add $0x2,%ecx # update source pointer (two hex chars at a time)
movsbl (%ecx),%eax # get first character
movsbl 0x80989e0(%eax),%edx # get the hexval for it
movsbl 0x1(%ecx),%eax # get the second character
shl $0x4,%edx # shift first character hexval up by 4
movsbl 0x80989e0(%eax),%eax # get second character hexval
or %eax,%edx # get the byte value
test $0xffffff00,%edx # was it ok (0x00 - 0xff)?
je <get_sha1_hex+16> # yeah, continue
which should perform very well (just two comparisons - one for the "wrong
bits set" at the end, and one for the 20-byte-counter test at the
beginning). Small and sweet.
In fact, I don't see how you can much improve on that code even if you
were to hand-code it, so gcc actually generates good code once the
"hexval()" function has been written like this.
Not very much tested, but it looks obvious enough.. And it's actually
conceptually a smaller diff than yours (ie it doesn't change anything but
the "hexval()" function, and all the bulk of it is the stupid array
initializer).
Linus
---
diff --git a/sha1_file.c b/sha1_file.c
index 428d791..b64b92d 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -26,15 +26,43 @@ const unsigned char null_sha1[20];
static unsigned int sha1_file_open_flag = O_NOATIME;
-static unsigned hexval(char c)
-{
- if (c >= '0' && c <= '9')
- return c - '0';
- if (c >= 'a' && c <= 'f')
- return c - 'a' + 10;
- if (c >= 'A' && c <= 'F')
- return c - 'A' + 10;
- return ~0;
+static inline unsigned int hexval(unsigned int c)
+{
+ static signed char val[256] = {
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 00-07 */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 08-0f */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 10-17 */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 18-1f */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 20-27 */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 28-2f */
+ 0, 1, 2, 3, 4, 5, 6, 7, /* 30-37 */
+ 8, 9, -1, -1, -1, -1, -1, -1, /* 38-3f */
+ -1, 10, 11, 12, 13, 14, 15, -1, /* 40-47 */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 48-4f */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 50-57 */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 58-5f */
+ -1, 10, 11, 12, 13, 14, 15, -1, /* 60-67 */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 68-6f */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 70-77 */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 78-7f */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 80-87 */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 88-8f */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 90-97 */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 98-9f */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* a0-a7 */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* a8-af */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* b0-b7 */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* b8-bf */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* c0-c7 */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* c8-cf */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* d0-d7 */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* d8-df */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* e0-e7 */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* e8-ef */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* f0-f7 */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* f8-ff */
+ };
+ return val[c];
}
int get_sha1_hex(const char *hex, unsigned char *sha1)
* Re: [RFC/PATCH] micro-optimize get_sha1_hex()
2006-09-10 0:06 ` Linus Torvalds
@ 2006-09-10 0:53 ` Linus Torvalds
0 siblings, 0 replies; 4+ messages in thread
From: Linus Torvalds @ 2006-09-10 0:53 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Sat, 9 Sep 2006, Linus Torvalds wrote:
>
> I _suspect_ that you profiled using "gprof" and the "-pg" flag to
> the compiler?
Btw, in the absence of oprofile (which is very useful, and has the
advantage that you can run many programs multiple times and get a
"combined data" output from it all), what you can do (and which is more
reliable) is:
- the "minor pagefault" count that /usr/bin/time prints out is very
useful. It gives a pretty much direct view into what the total memory
use of the program was over its lifetime, in a way that very few other
things do.
- using "gprof" to try to find _potential_ hotspots, but then not
trusting the profiling numbers at all for actual improvement, but
simply recompiling (without profiling) and timing it for real after any
change. The real timings will almost certainly not match what you
thought you'd get from profiling, but now they'll be real numbers.
- using "-pg" to link in the profiling code, BUT ONLY AT THE FINAL LINK
TIME! This will give you the "% time" and "cumulative seconds" part,
but it will mean that you will _not_ get the call-graph, because now
the actual code generation is not affected, and the compiler won't be
inserting all the call-graph-generation code.
Note that the last use makes gprof much more accurate, but it also means
that it won't _work_ very well for things that are fast. You usually have
a 1/100th of a second profiling setup, so anything that is less than a
second won't have much of a profile. So this only works for longer-running
things.
Linus