* {RFC/PATCH] micro-optimize get_sha1_hex()
@ 2006-09-09 21:55 Junio C Hamano
2006-09-09 22:33 ` Jeff Garzik
2006-09-10 0:06 ` Linus Torvalds
0 siblings, 2 replies; 4+ messages in thread
From: Junio C Hamano @ 2006-09-09 21:55 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
I was profiling 'git-rev-list v2.16.12..', because I suspected
insert_by_date() might be expensive (the function inserts into
singly-linked ordered list, so the data structure has to become
array based to allow optimization). But profiling showed it was
not the bottleneck. Probably because the kernel history is not
that bushy and we do not have too many "active" heads during
traversal.
But I noticed something else.
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
21.52 0.17 0.17 2800320 0.00 0.00 hexval
15.19 0.29 0.12 70008 0.00 0.00 get_sha1_hex
15.19 0.41 0.12 70008 0.00 0.00 lookup_object
15.19 0.53 0.12 68495 0.00 0.00 find_pack_entry_one
11.39 0.62 0.09 198667 0.00 0.00 insert_obj_hash
7.60 0.68 0.06 34675 0.00 0.00 unpack_object_header_gently
3.80 0.71 0.03 33819 0.00 0.02 parse_commit_buffer
1.27 0.72 0.01 103822 0.00 0.00 commit_list_insert
1.27 0.73 0.01 67640 0.00 0.00 prepare_packed_git
1.27 0.74 0.01 67611 0.00 0.00 created_object
1.27 0.75 0.01 33820 0.00 0.00 use_packed_git
1.27 0.76 0.01 33819 0.00 0.00 lookup_tree
1.27 0.77 0.01 1 10.00 10.00 prepare_packed_git_one
1.27 0.78 0.01 parse_tree_indirect
1.27 0.79 0.01 verify_filename
...
The attached brings get_sha1_hex() down from 15.19% to 5.41%,
but I feel we should be able to do better.
Is this barking up the wrong tree? Or did I pick a good target
but the shooter wasn't skilled enough?
diff --git a/sha1_file.c b/sha1_file.c
index 428d791..00aa364 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -26,26 +26,30 @@ const unsigned char null_sha1[20];
static unsigned int sha1_file_open_flag = O_NOATIME;
-static unsigned hexval(char c)
-{
- if (c >= '0' && c <= '9')
- return c - '0';
- if (c >= 'a' && c <= 'f')
- return c - 'a' + 10;
- if (c >= 'A' && c <= 'F')
- return c - 'A' + 10;
- return ~0;
-}
+static const unsigned char hexval[] = {
+ 0, 1, 2, 3, 4, 5, 6, 7, /* 30-37 */
+ 8, 9, 255, 255, 255, 255, 255, 255, /* 38-3F */
+ 255, 10, 11, 12, 13, 14, 15, 255, /* 40-47 */
+ 255, 255, 255, 255, 255, 255, 255, 255, /* 48-4F */
+ 255, 255, 255, 255, 255, 255, 255, 255, /* 50-57 */
+ 255, 255, 255, 255, 255, 255, 255, 255, /* 58-5F */
+ 255, 10, 11, 12, 13, 14, 15, 255, /* 60-67 */
+};
int get_sha1_hex(const char *hex, unsigned char *sha1)
{
int i;
for (i = 0; i < 20; i++) {
- unsigned int val = (hexval(hex[0]) << 4) | hexval(hex[1]);
- if (val & ~0xff)
+ unsigned int v, w, val;
+ v = *hex++;
+ if ((v < '0') || ('f' < v) ||
+ ((v = hexval[v-'0']) == 255))
+ return -1;
+ w = *hex++;
+ if ((w < '0') || ('f' < w) ||
+ ((w = hexval[w-'0']) == 255))
return -1;
- *sha1++ = val;
- hex += 2;
+ *sha1++ = (v << 4) | w;
}
return 0;
}
^ permalink raw reply related [flat|nested] 4+ messages in thread* Re: {RFC/PATCH] micro-optimize get_sha1_hex() 2006-09-09 21:55 {RFC/PATCH] micro-optimize get_sha1_hex() Junio C Hamano @ 2006-09-09 22:33 ` Jeff Garzik 2006-09-10 0:06 ` Linus Torvalds 1 sibling, 0 replies; 4+ messages in thread From: Jeff Garzik @ 2006-09-09 22:33 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, git Junio C Hamano wrote: > int get_sha1_hex(const char *hex, unsigned char *sha1) > { > int i; > for (i = 0; i < 20; i++) { > - unsigned int val = (hexval(hex[0]) << 4) | hexval(hex[1]); > - if (val & ~0xff) > + unsigned int v, w, val; > + v = *hex++; > + if ((v < '0') || ('f' < v) || > + ((v = hexval[v-'0']) == 255)) > + return -1; > + w = *hex++; > + if ((w < '0') || ('f' < w) || > + ((w = hexval[w-'0']) == 255)) > return -1; > - *sha1++ = val; > - hex += 2; > + *sha1++ = (v << 4) | w; Why not just make the table include the range of all possible characters? That would eliminate some comparisons and subtractions against '0'. The end result would look more like the current (unpatched) form, except with a function replaced by the table. Jeff ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: {RFC/PATCH] micro-optimize get_sha1_hex() 2006-09-09 21:55 {RFC/PATCH] micro-optimize get_sha1_hex() Junio C Hamano 2006-09-09 22:33 ` Jeff Garzik @ 2006-09-10 0:06 ` Linus Torvalds 2006-09-10 0:53 ` Linus Torvalds 1 sibling, 1 reply; 4+ messages in thread From: Linus Torvalds @ 2006-09-10 0:06 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Sat, 9 Sep 2006, Junio C Hamano wrote: > > I was profiling 'git-rev-list v2.16.12..', because I suspected > insert_by_date() might be expensive (the function inserts into > singly-linked ordered list, so the data structure has to become > array based to allow optimization). I _suspect_ that you profiled using "gprof" and the "-pg" flag to the compiler? The thing is, that generates almost totally bogus profiles. It makes simple function calls look extremely - and unrealistically - expensive, because the profiling code itself adds lots of overhead, and it then ends up showing on the tick-based profile too in a way that it wouldn't show in real life. It tends to be much better to profile unmodified binaries using "oprofile", which doesn't end up giving the exact number of calls, but where the time-based profiling is a _lot_ more reliable. That said, clearly "hexval()" can be optimized, and it's quite possible that it would show up even in oprofile data too. > The attached brings get_sha1_hex() down from 15.19% to 5.41%, > but I feel we should be able to do better. I doubt it's 5% in real life, but your optimization looks fine to me. You can make it even _more_ optimized by not even bothering to test at each point, but just inlining hexval(), and making it a single array lookup. The point being that if you get -1 for _either_ of the two lookups, since we shift them and or them together, the end result will be invalid, and you only need to _test_ once. So this patch should generate much better code, where the inner loop is something like <get_sha1_hex+16>: add $0x1,%esi # count of result bytes.. cmp $0x14,%esi # did we have 20 already? mov %dl,(%ebx) # save the "last loop" result je 0x807630b <get_sha1_hex+75> # break out add $0x1,%ebx # update dest pointer (one byte at a time) add $0x2,%ecx # update source pointer (two hex chars at a time) movsbl (%ecx),%eax # get first character movsbl 0x80989e0(%eax),%edx # get the hexval for it movsbl 0x1(%ecx),%eax # get the second character shl $0x4,%edx # shift first character hexval up by 4 movsbl 0x80989e0(%eax),%eax # get second character hexval or %eax,%edx # get the byte value test $0xffffff00,%edx # was it ok (0x00 - 0xff)? je <get_sha1_hex+16> # yeah, continue which should perform very well (just two comparisons - one for the "wrong bits set" at the end, and one for the 20-byte-counter test at the beginning). Small and sweet. In fact, I don't see how you can much improve on that code even if you were to hand-code it, so gcc actually generates good code once the "hexval()" function has been written like this. Not very much tested, but it looks obvious enough.. And it's actually conceptually a smaller diff than yours (ie it doesn't change anything but the "hexval()" function, and all the bulk of it is the stupid array initializer). Linus --- diff --git a/sha1_file.c b/sha1_file.c index 428d791..b64b92d 100644 --- a/sha1_file.c +++ b/sha1_file.c @@ -26,15 +26,43 @@ const unsigned char null_sha1[20]; static unsigned int sha1_file_open_flag = O_NOATIME; -static unsigned hexval(char c) -{ - if (c >= '0' && c <= '9') - return c - '0'; - if (c >= 'a' && c <= 'f') - return c - 'a' + 10; - if (c >= 'A' && c <= 'F') - return c - 'A' + 10; - return ~0; +static inline unsigned int hexval(unsigned int c) +{ + static signed char val[256] = { + -1, -1, -1, -1, -1, -1, -1, -1, /* 00-07 */ + -1, -1, -1, -1, -1, -1, -1, -1, /* 08-0f */ + -1, -1, -1, -1, -1, -1, -1, -1, /* 10-17 */ + -1, -1, -1, -1, -1, -1, -1, -1, /* 18-1f */ + -1, -1, -1, -1, -1, -1, -1, -1, /* 20-27 */ + -1, -1, -1, -1, -1, -1, -1, -1, /* 28-2f */ + 0, 1, 2, 3, 4, 5, 6, 7, /* 30-37 */ + 8, 9, -1, -1, -1, -1, -1, -1, /* 38-3f */ + -1, 10, 11, 12, 13, 14, 15, -1, /* 40-47 */ + -1, -1, -1, -1, -1, -1, -1, -1, /* 48-4f */ + -1, -1, -1, -1, -1, -1, -1, -1, /* 50-57 */ + -1, -1, -1, -1, -1, -1, -1, -1, /* 58-5f */ + -1, 10, 11, 12, 13, 14, 15, -1, /* 60-67 */ + -1, -1, -1, -1, -1, -1, -1, -1, /* 68-67 */ + -1, -1, -1, -1, -1, -1, -1, -1, /* 70-77 */ + -1, -1, -1, -1, -1, -1, -1, -1, /* 78-7f */ + -1, -1, -1, -1, -1, -1, -1, -1, /* 80-87 */ + -1, -1, -1, -1, -1, -1, -1, -1, /* 88-8f */ + -1, -1, -1, -1, -1, -1, -1, -1, /* 90-97 */ + -1, -1, -1, -1, -1, -1, -1, -1, /* 98-9f */ + -1, -1, -1, -1, -1, -1, -1, -1, /* a0-a7 */ + -1, -1, -1, -1, -1, -1, -1, -1, /* a8-af */ + -1, -1, -1, -1, -1, -1, -1, -1, /* b0-b7 */ + -1, -1, -1, -1, -1, -1, -1, -1, /* b8-bf */ + -1, -1, -1, -1, -1, -1, -1, -1, /* c0-c7 */ + -1, -1, -1, -1, -1, -1, -1, -1, /* c8-cf */ + -1, -1, -1, -1, -1, -1, -1, -1, /* d0-d7 */ + -1, -1, -1, -1, -1, -1, -1, -1, /* d8-df */ + -1, -1, -1, -1, -1, -1, -1, -1, /* e0-e7 */ + -1, -1, -1, -1, -1, -1, -1, -1, /* e8-ef */ + -1, -1, -1, -1, -1, -1, -1, -1, /* f0-f7 */ + -1, -1, -1, -1, -1, -1, -1, -1, /* f8-ff */ + }; + return val[c]; } int get_sha1_hex(const char *hex, unsigned char *sha1) ^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: {RFC/PATCH] micro-optimize get_sha1_hex() 2006-09-10 0:06 ` Linus Torvalds @ 2006-09-10 0:53 ` Linus Torvalds 0 siblings, 0 replies; 4+ messages in thread From: Linus Torvalds @ 2006-09-10 0:53 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Sat, 9 Sep 2006, Linus Torvalds wrote: > > I _suspect_ that you profiled using "gprof" and the "-pg" flag to > the compiler? Btw, in the absense of oprofile (which is very useful, and has the advantage that you can run many programs multiple times and get a "combined data" output from it all), what you can do (and which is more reliable) is: - the "minor pagefault" count that /usr/bin/time prints out is very useful. It gives a pretty much direct view into what the total memory use of the program was over its lifetime, in a way that very few other things do. - using "gprof" to try to find _potential_ hotspots, but then not trusting the profiling numbers at all for actual improvement, but simply recompiling (witout profiling) and timing it for real after any change. The real timings will almost certainly not match what you thought you'd get from profiling, but now they'll be real numbers. - using "-pg" to link in the profiling code, BUT ONLY AT THE FINAL LINK TIME! This will give you the "% time" and "cumulative seconds" part, but it will mean that you will _not_ get the call-graph, because now the actual code generation is not affected, and the compiler won't be inserting all the call-graph-generation code. Note that the last use makes gprof much more accurate, but it also means that it won't _work_ very well for things that are fast. You usually have a 1/100th of a second profiling setup, so anything that is less than a second won't have much of a profile. So this only works for longer-running things. Linus ^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2006-09-10 0:53 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-09-09 21:55 {RFC/PATCH] micro-optimize get_sha1_hex() Junio C Hamano
2006-09-09 22:33 ` Jeff Garzik
2006-09-10 0:06 ` Linus Torvalds
2006-09-10 0:53 ` Linus Torvalds
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.