* [RFC/PATCH] micro-optimize get_sha1_hex()
@ 2006-09-09 21:55 Junio C Hamano
2006-09-09 22:33 ` Jeff Garzik
2006-09-10 0:06 ` Linus Torvalds
0 siblings, 2 replies; 4+ messages in thread
From: Junio C Hamano @ 2006-09-09 21:55 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
I was profiling 'git-rev-list v2.6.12..', because I suspected
insert_by_date() might be expensive (the function inserts into a
singly-linked ordered list, so the data structure would have to
become array-based to allow optimization). But profiling showed it
was not the bottleneck, probably because the kernel history is not
that bushy and we do not have too many "active" heads during
traversal.
But I noticed something else.
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
21.52      0.17     0.17  2800320     0.00     0.00  hexval
15.19      0.29     0.12    70008     0.00     0.00  get_sha1_hex
15.19      0.41     0.12    70008     0.00     0.00  lookup_object
15.19      0.53     0.12    68495     0.00     0.00  find_pack_entry_one
11.39      0.62     0.09   198667     0.00     0.00  insert_obj_hash
 7.60      0.68     0.06    34675     0.00     0.00  unpack_object_header_gently
 3.80      0.71     0.03    33819     0.00     0.02  parse_commit_buffer
 1.27      0.72     0.01   103822     0.00     0.00  commit_list_insert
 1.27      0.73     0.01    67640     0.00     0.00  prepare_packed_git
 1.27      0.74     0.01    67611     0.00     0.00  created_object
 1.27      0.75     0.01    33820     0.00     0.00  use_packed_git
 1.27      0.76     0.01    33819     0.00     0.00  lookup_tree
 1.27      0.77     0.01        1    10.00    10.00  prepare_packed_git_one
 1.27      0.78     0.01                             parse_tree_indirect
 1.27      0.79     0.01                             verify_filename
...
The attached brings get_sha1_hex() down from 15.19% to 5.41%,
but I feel we should be able to do better.
Is this barking up the wrong tree? Or did I pick a good target
but the shooter wasn't skilled enough?
diff --git a/sha1_file.c b/sha1_file.c
index 428d791..00aa364 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -26,26 +26,30 @@ const unsigned char null_sha1[20];
static unsigned int sha1_file_open_flag = O_NOATIME;
-static unsigned hexval(char c)
-{
- if (c >= '0' && c <= '9')
- return c - '0';
- if (c >= 'a' && c <= 'f')
- return c - 'a' + 10;
- if (c >= 'A' && c <= 'F')
- return c - 'A' + 10;
- return ~0;
-}
+static const unsigned char hexval[] = {
+ 0, 1, 2, 3, 4, 5, 6, 7, /* 30-37 */
+ 8, 9, 255, 255, 255, 255, 255, 255, /* 38-3F */
+ 255, 10, 11, 12, 13, 14, 15, 255, /* 40-47 */
+ 255, 255, 255, 255, 255, 255, 255, 255, /* 48-4F */
+ 255, 255, 255, 255, 255, 255, 255, 255, /* 50-57 */
+ 255, 255, 255, 255, 255, 255, 255, 255, /* 58-5F */
+ 255, 10, 11, 12, 13, 14, 15, 255, /* 60-67 */
+};
int get_sha1_hex(const char *hex, unsigned char *sha1)
{
int i;
for (i = 0; i < 20; i++) {
- unsigned int val = (hexval(hex[0]) << 4) | hexval(hex[1]);
- if (val & ~0xff)
+ unsigned int v, w, val;
+ v = *hex++;
+ if ((v < '0') || ('f' < v) ||
+ ((v = hexval[v-'0']) == 255))
+ return -1;
+ w = *hex++;
+ if ((w < '0') || ('f' < w) ||
+ ((w = hexval[w-'0']) == 255))
return -1;
- *sha1++ = val;
- hex += 2;
+ *sha1++ = (v << 4) | w;
}
return 0;
}
* Re: [RFC/PATCH] micro-optimize get_sha1_hex()
2006-09-09 21:55 [RFC/PATCH] micro-optimize get_sha1_hex() Junio C Hamano
@ 2006-09-09 22:33 ` Jeff Garzik
2006-09-10 0:06 ` Linus Torvalds
1 sibling, 0 replies; 4+ messages in thread
From: Jeff Garzik @ 2006-09-09 22:33 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Linus Torvalds, git
Junio C Hamano wrote:
> int get_sha1_hex(const char *hex, unsigned char *sha1)
> {
> int i;
> for (i = 0; i < 20; i++) {
> - unsigned int val = (hexval(hex[0]) << 4) | hexval(hex[1]);
> - if (val & ~0xff)
> + unsigned int v, w, val;
> + v = *hex++;
> + if ((v < '0') || ('f' < v) ||
> + ((v = hexval[v-'0']) == 255))
> + return -1;
> + w = *hex++;
> + if ((w < '0') || ('f' < w) ||
> + ((w = hexval[w-'0']) == 255))
> return -1;
> - *sha1++ = val;
> - hex += 2;
> + *sha1++ = (v << 4) | w;
Why not just make the table include the range of all possible
characters? That would eliminate some comparisons and subtractions
against '0'.
The end result would look more like the current (unpatched) form, except
with a function replaced by the table.
Jeff
* Re: [RFC/PATCH] micro-optimize get_sha1_hex()
2006-09-09 21:55 [RFC/PATCH] micro-optimize get_sha1_hex() Junio C Hamano
2006-09-09 22:33 ` Jeff Garzik
@ 2006-09-10 0:06 ` Linus Torvalds
2006-09-10 0:53 ` Linus Torvalds
1 sibling, 1 reply; 4+ messages in thread
From: Linus Torvalds @ 2006-09-10 0:06 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Sat, 9 Sep 2006, Junio C Hamano wrote:
>
> I was profiling 'git-rev-list v2.6.12..', because I suspected
> insert_by_date() might be expensive (the function inserts into a
> singly-linked ordered list, so the data structure would have to
> become array-based to allow optimization).
I _suspect_ that you profiled using "gprof" and the "-pg" flag to
the compiler?
The thing is, that generates almost totally bogus profiles. It makes
simple function calls look extremely - and unrealistically - expensive,
because the profiling code itself adds lots of overhead, and it then ends
up showing on the tick-based profile too in a way that it wouldn't show
in real life.
It tends to be much better to profile unmodified binaries using
"oprofile", which doesn't end up giving the exact number of calls, but
where the time-based profiling is a _lot_ more reliable.
That said, clearly "hexval()" can be optimized, and it's quite possible
that it would show up even in oprofile data too.
> The attached brings get_sha1_hex() down from 15.19% to 5.41%,
> but I feel we should be able to do better.
I doubt it's 5% in real life, but your optimization looks fine to me.
You can make it even _more_ optimized by not even bothering to test at
each point, but just inlining hexval(), and making it a single array
lookup. The point being that if you get -1 for _either_ of the two
lookups, since we shift them and or them together, the end result will be
invalid, and you only need to _test_ once.
So this patch should generate much better code, where the inner loop is
something like
<get_sha1_hex+16>:
add $0x1,%esi # count of result bytes..
cmp $0x14,%esi # did we have 20 already?
mov %dl,(%ebx) # save the "last loop" result
je 0x807630b <get_sha1_hex+75> # break out
add $0x1,%ebx # update dest pointer (one byte at a time)
add $0x2,%ecx # update source pointer (two hex chars at a time)
movsbl (%ecx),%eax # get first character
movsbl 0x80989e0(%eax),%edx # get the hexval for it
movsbl 0x1(%ecx),%eax # get the second character
shl $0x4,%edx # shift first character hexval up by 4
movsbl 0x80989e0(%eax),%eax # get second character hexval
or %eax,%edx # get the byte value
test $0xffffff00,%edx # was it ok (0x00 - 0xff)?
je <get_sha1_hex+16> # yeah, continue
which should perform very well (just two comparisons - one for the "wrong
bits set" at the end, and one for the 20-byte-counter test at the
beginning). Small and sweet.
In fact, I don't see how you can much improve on that code even if you
were to hand-code it, so gcc actually generates good code once the
"hexval()" function has been written like this.
Not very much tested, but it looks obvious enough.. And it's actually
conceptually a smaller diff than yours (ie it doesn't change anything but
the "hexval()" function, and all the bulk of it is the stupid array
initializer).
Linus
---
diff --git a/sha1_file.c b/sha1_file.c
index 428d791..b64b92d 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -26,15 +26,43 @@ const unsigned char null_sha1[20];
static unsigned int sha1_file_open_flag = O_NOATIME;
-static unsigned hexval(char c)
-{
- if (c >= '0' && c <= '9')
- return c - '0';
- if (c >= 'a' && c <= 'f')
- return c - 'a' + 10;
- if (c >= 'A' && c <= 'F')
- return c - 'A' + 10;
- return ~0;
+static inline unsigned int hexval(unsigned int c)
+{
+ static signed char val[256] = {
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 00-07 */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 08-0f */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 10-17 */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 18-1f */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 20-27 */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 28-2f */
+ 0, 1, 2, 3, 4, 5, 6, 7, /* 30-37 */
+ 8, 9, -1, -1, -1, -1, -1, -1, /* 38-3f */
+ -1, 10, 11, 12, 13, 14, 15, -1, /* 40-47 */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 48-4f */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 50-57 */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 58-5f */
+ -1, 10, 11, 12, 13, 14, 15, -1, /* 60-67 */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 68-6f */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 70-77 */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 78-7f */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 80-87 */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 88-8f */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 90-97 */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* 98-9f */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* a0-a7 */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* a8-af */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* b0-b7 */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* b8-bf */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* c0-c7 */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* c8-cf */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* d0-d7 */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* d8-df */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* e0-e7 */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* e8-ef */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* f0-f7 */
+ -1, -1, -1, -1, -1, -1, -1, -1, /* f8-ff */
+ };
+ return val[c];
}
int get_sha1_hex(const char *hex, unsigned char *sha1)
* Re: [RFC/PATCH] micro-optimize get_sha1_hex()
2006-09-10 0:06 ` Linus Torvalds
@ 2006-09-10 0:53 ` Linus Torvalds
0 siblings, 0 replies; 4+ messages in thread
From: Linus Torvalds @ 2006-09-10 0:53 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Sat, 9 Sep 2006, Linus Torvalds wrote:
>
> I _suspect_ that you profiled using "gprof" and the "-pg" flag to
> the compiler?
Btw, in the absence of oprofile (which is very useful, and has the
advantage that you can run many programs multiple times and get a
"combined data" output from it all), what you can do (and which is more
reliable) is:
- the "minor pagefault" count that /usr/bin/time prints out is very
useful. It gives a pretty much direct view into what the total memory
use of the program was over its lifetime, in a way that very few other
things do.
- using "gprof" to try to find _potential_ hotspots, but then not
trusting the profiling numbers at all for actual improvement, but
simply recompiling (without profiling) and timing it for real after any
change. The real timings will almost certainly not match what you
thought you'd get from profiling, but now they'll be real numbers.
- using "-pg" to link in the profiling code, BUT ONLY AT THE FINAL LINK
TIME! This will give you the "% time" and "cumulative seconds" part,
but it will mean that you will _not_ get the call-graph, because now
the actual code generation is not affected, and the compiler won't be
inserting all the call-graph-generation code.
Note that the last use makes gprof much more accurate, but it also means
that it won't _work_ very well for things that are fast. You usually have
a 1/100th of a second profiling setup, so anything that is less than a
second won't have much of a profile. So this only works for longer-running
things.
Linus