git.vger.kernel.org archive mirror
* Performance issue of 'git branch'
@ 2009-07-22 23:59 Carlos R. Mafra
  2009-07-23  0:21 ` Linus Torvalds
  2009-07-23  0:23 ` SZEDER Gábor
  0 siblings, 2 replies; 129+ messages in thread
From: Carlos R. Mafra @ 2009-07-22 23:59 UTC (permalink / raw)
  To: git

Hi,

When I run 'git branch' in the linux-2.6 repo I think it takes
too long to finish (with cold cache):

[mafra@Pilar:linux-2.6]$ time git branch
  27-stable
  28-stable
  29-stable
  30-stable
  dev-private
* master
  option
  sparse
  stern
0.00user 0.05system 0:05.73elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (209major+1380minor)pagefaults 0swaps

This is with git 1.6.4.rc1.10.g2a67 and the kernel is 2.6.31-rc3+. The
machine is a 64bit Vaio laptop which is 1+ year old (so it is not "slow").

Repeating the command a second time takes basically zero seconds, but
that is more or less what I would expect the first time too.

I have been using git to track linux-2.6 for 2 years now, and I remember
that 'git branch' has been slow for quite some time, so it is not a
regression or anything. It is just now that I have worked up the courage
to report this small issue.

I did a 'strace' and this is where it spent most of the time:

1248301060.654911 open(".git/refs/heads/sparse", O_RDONLY) = 6
1248301060.654985 read(6, "60afdf6a4065a170ad829b4d79a86ec0"..., 255) = 41
1248301060.655056 read(6, "", 214)      = 0
1248301060.655116 close(6)              = 0
1248301060.680754 lstat(".git/refs/heads/stern", 0x7fff80bfa8d0) = -1 ENOENT (No such file or directory)
1248301064.018491 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
1248301064.018641 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f409ffa7000
1248301064.018722 write(1, "  27-stable\33[m\n", 15) = 15

I don't know why .git/refs/heads/stern does not exist and why it takes
so long with it. That branch is functional ('git checkout stern' succeeds),
as well as all the others. But strangely .git/refs/heads/ contains only

[mafra@Pilar:linux-2.6]$ ls .git/refs/heads/
dev-private  master  sparse

which, apart from "master", are the last branches that I created.

I occasionally run 'git gc --aggressive --prune' to optimize the repo,
but other than that I don't do anything fancy, just 'pull' almost
every day and 'bisect' (which is becoming a rare event now :-)

So I would like to ask what should I do to recover the missing files
in .git/refs/heads/ (which apparently is the cause for my issue) and
how I can avoid losing them in the first place.

Also, is there a way to "fix" the 4-second pause in that lstat() in
case the files in .git/refs/heads/ get lost again?

Thanks in advance,
Carlos

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-22 23:59 Performance issue of 'git branch' Carlos R. Mafra
@ 2009-07-23  0:21 ` Linus Torvalds
  2009-07-23  0:51   ` Linus Torvalds
  2009-07-23  1:22   ` Carlos R. Mafra
  2009-07-23  0:23 ` SZEDER Gábor
  1 sibling, 2 replies; 129+ messages in thread
From: Linus Torvalds @ 2009-07-23  0:21 UTC (permalink / raw)
  To: Carlos R. Mafra; +Cc: git



On Thu, 23 Jul 2009, Carlos R. Mafra wrote:
> 
> When I run 'git branch' in the linux-2.6 repo I think it takes
> too long to finish (with cold cache):
> 
> [mafra@Pilar:linux-2.6]$ time git branch
>   27-stable
>   28-stable
>   29-stable
>   30-stable
>   dev-private
> * master
>   option
>   sparse
>   stern
> 0.00user 0.05system 0:05.73elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (209major+1380minor)pagefaults 0swaps
> 
> This is with git 1.6.4.rc1.10.g2a67 and the kernel is 2.6.31-rc3+. The
> machine is a 64bit Vaio laptop which is 1+ year old (so it is not "slow").

When have you last repacked the repository?

What you're describing is basically IO overhead, and if you don't have 
packed references, it's going to read a lot of small files.

> I have been using git to track linux-2.6 for 2 years now, and I remember
> that 'git branch' has been slow for quite some time, so it is not a
> regression or anything. It is just now that I have worked up the courage
> to report this small issue.
> 
> I did a 'strace' and this is where it spent most of the time:
> 
> 1248301060.654911 open(".git/refs/heads/sparse", O_RDONLY) = 6
> 1248301060.654985 read(6, "60afdf6a4065a170ad829b4d79a86ec0"..., 255) = 41
> 1248301060.655056 read(6, "", 214)      = 0
> 1248301060.655116 close(6)              = 0
> 1248301060.680754 lstat(".git/refs/heads/stern", 0x7fff80bfa8d0) = -1 ENOENT (No such file or directory)
> 1248301064.018491 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
> 1248301064.018641 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f409ffa7000
> 1248301064.018722 write(1, "  27-stable\33[m\n", 15) = 15
> 
> I don't know why .git/refs/heads/stern does not exist and why it takes
> so long with it. That branch is functional ('git checkout stern' succeeds),
> as well as all the others. But strangely .git/refs/heads/ contains only
> 
> [mafra@Pilar:linux-2.6]$ ls .git/refs/heads/
> dev-private  master  sparse
> 
> which, apart from "master", are the last branches that I created.

Ok, this actually means that you _have_ repacked the repo, and the rest of 
the branches are all nicely packed in .git/packed-refs.

But that four _second_ lstat() is really disgusting.

Let me guess: if you do a "ls -ld .git/refs/heads" you get a very big 
directory, despite it only having three entries in it. And your filesystem 
doesn't have name hashing enabled, so searching for a non-existent file 
involves looking through _all_ of the empty slots.

Try this:

	git pack-refs --all

	rmdir .git/refs/heads
	rmdir .git/refs/tags

	mkdir .git/refs/heads
	mkdir .git/refs/tags

and see if it magically speeds up.

			Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-22 23:59 Performance issue of 'git branch' Carlos R. Mafra
  2009-07-23  0:21 ` Linus Torvalds
@ 2009-07-23  0:23 ` SZEDER Gábor
  2009-07-23  2:25   ` Carlos R. Mafra
  1 sibling, 1 reply; 129+ messages in thread
From: SZEDER Gábor @ 2009-07-23  0:23 UTC (permalink / raw)
  To: Carlos R. Mafra; +Cc: git

Hi,


On Thu, Jul 23, 2009 at 01:59:14AM +0200, Carlos R. Mafra wrote:

> I don't know why .git/refs/heads/stern does not exist and why it takes
> so long with it. That branch is functional ('git checkout stern' succeeds),
> as well as all the others. But strangely .git/refs/heads/ contains only
> 
> [mafra@Pilar:linux-2.6]$ ls .git/refs/heads/
> dev-private  master  sparse
> 
> which, apart from "master", are the last branches that I created.
> 
> I occasionally run 'git gc --aggressive --prune' to optimize the repo,
> but other than that I don't do anything fancy, just 'pull' almost
> every day and 'bisect' (which is becoming a rare event now :-)
> 
> So I would like to ask what should I do to recover the missing files
> in .git/refs/heads/ (which apparently is the cause for my issue) and
> how I can avoid losing them in the first place.

have a look at .git/packed-refs and 'git pack-refs'.


Best,
Gábor

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23  0:21 ` Linus Torvalds
@ 2009-07-23  0:51   ` Linus Torvalds
  2009-07-23  0:55     ` Linus Torvalds
  2009-07-23  1:22   ` Carlos R. Mafra
  1 sibling, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2009-07-23  0:51 UTC (permalink / raw)
  To: Carlos R. Mafra; +Cc: git



On Wed, 22 Jul 2009, Linus Torvalds wrote:
> 
> Try this:
> 
> 	git pack-refs --all
> 
> 	rmdir .git/refs/heads
> 	rmdir .git/refs/tags
> 
> 	mkdir .git/refs/heads
> 	mkdir .git/refs/tags
> 
> and see if it magically speeds up.

In fact, you could also just try

	mv .git/refs .git/temp-refs &&
	cp -a .git/temp-refs .git/refs &&
	rm -rf .git/temp-refs

which will re-create other subdirectories too (like .git/refs/remotes 
etc).

Of course, depending on your particular filesystem, a better fix might be 
to enable filename hashing, which gets rid of the whole "look through all 
the old empty stale directory entries to see if there's a filename there" 
issue. That won't fix 'readdir()' performance, but it should fix your 
insane 4-second lstat() thing.

If you have ext3, you'd do something like

	tune2fs -O dir_index /dev/<node-of-your-filesystem-goes-here>

but as mentioned, even with directory indexing it can actually make sense 
to recreate directories that at some point _used_ to be large, but got 
shrunk down to something much smaller. It's a generic directory problem 
(not just ext3, not just unix, it's a common issue across filesystems. 
It's not _universal_ - some smarter filesystems really do shrink their 
directories - but it's certainly not unusual).

		Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23  0:51   ` Linus Torvalds
@ 2009-07-23  0:55     ` Linus Torvalds
  2009-07-23  2:02       ` Carlos R. Mafra
  0 siblings, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2009-07-23  0:55 UTC (permalink / raw)
  To: Carlos R. Mafra; +Cc: git



On Wed, 22 Jul 2009, Linus Torvalds wrote:
> 
> If you have ext3, you'd do something like
> 
> 	tune2fs -O dir_index /dev/<node-of-your-filesystem-goes-here>

One last email note on this subject. Really. Promise.

If you do that "tune2fs -O dir_index" thing, it will only take effect for 
_newly_ created directories. So you'll still need to do that whole 
"mv+cp+rm" dance, just to make sure that the refs directories are all new.

I think you can also force all directories to be indexed by using fsck, 
but I forget the details. I'm sure man-pages will have it. Or google.
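
(If memory serves, the fsck variant is something along the lines of

	e2fsck -fD /dev/<node-of-your-filesystem-goes-here>

where "-D" tells e2fsck to re-index and optimize all directories - but it
has to run on an unmounted filesystem, and you should double-check the
man-page rather than trust my memory.)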

		Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23  0:21 ` Linus Torvalds
  2009-07-23  0:51   ` Linus Torvalds
@ 2009-07-23  1:22   ` Carlos R. Mafra
  2009-07-23  2:20     ` Linus Torvalds
  1 sibling, 1 reply; 129+ messages in thread
From: Carlos R. Mafra @ 2009-07-23  1:22 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

On Wed 22.Jul'09 at 17:21:48 -0700, Linus Torvalds wrote:
> 
> When have you last repacked the repository?

Last week or so, with 'git repack -d -a'


> > [mafra@Pilar:linux-2.6]$ ls .git/refs/heads/
> > dev-private  master  sparse
> > 
> > which, apart from "master", are the last branches that I created.
> 
> Ok, this actually means that you _have_ repacked the repo, and the rest of 
> the branches are all nicely packed in .git/packed-refs.

Yes, now I saw the other branches inside packed-refs.

> But that four _second_ lstat() is really disgusting.
> 
> Let me guess: if you do a "ls -ld .git/refs/heads" you get a very big 
> directory, despite it only having three entries in it. 

[mafra@Pilar:linux-2.6]$ ls -ld .git/refs/heads
drwxr-xr-x 2 mafra mafra 4096 2009-07-22 23:01 .git/refs/heads/

> And your filesystem 
> doesn't have name hashing enabled, so searching for a non-existent file 
> involves looking through _all_ of the empty slots.

I use ext3 without changing any defaults that I know of (I simply compile
and boot the kernel of the day), and I have no idea if name hashing
is enabled here.

> Try this:
> 
> 	git pack-refs --all
> 
> 	rmdir .git/refs/heads
> 	rmdir .git/refs/tags
> 
> 	mkdir .git/refs/heads
> 	mkdir .git/refs/tags
> 
> and see if it magically speeds up.

It didn't change things, unfortunately.

After 'echo 3 > /proc/sys/vm/drop_caches' it still takes too long,

1248310449.693085 munmap(0x7f50bcd11000, 164) = 0
1248310449.693187 lstat(".git/refs/heads/sparse", 0x7fff618c0960) = -1 ENOENT (No such file or directory)
1248310449.719112 lstat(".git/refs/heads/stern", 0x7fff618c0960) = -1 ENOENT (No such file or directory)
1248310453.014041 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 3), ...}) = 0
1248310453.014183 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f50bcd11000

Perhaps I should delete the "stern" branch, but I would like to learn why
it is slowing things, because it also happened before (in fact it is always
like this, afaicr)

Do you have another theory? (now .git/refs/heads is empty)

Thanks,
Carlos

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23  0:55     ` Linus Torvalds
@ 2009-07-23  2:02       ` Carlos R. Mafra
  2009-07-23  2:28         ` Linus Torvalds
  0 siblings, 1 reply; 129+ messages in thread
From: Carlos R. Mafra @ 2009-07-23  2:02 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

On Wed 22.Jul'09 at 17:55:51 -0700, Linus Torvalds wrote:
> On Wed, 22 Jul 2009, Linus Torvalds wrote:
> > 
> > If you have ext3, you'd do something like
> > 
> > 	tune2fs -O dir_index /dev/<node-of-your-filesystem-goes-here>
> 
> One last email note on this subject. Really. Promise.
> 
> If you do that "tune2fs -O dir_index" thing, it will only take effect for 
> _newly_ created directories. So you'll still need to do that whole 
> "mv+cp+rm" dance, just to make sure that the refs directories are all new.

Ok, now I also did the "dir_index" thing followed by the mv+cp+rm instructions.
It doesn't change the 3.5 sec delay in that single line,

1248313742.355195 lstat(".git/refs/heads/sparse", 0x7fff0c663ab0) = -1 ENOENT (No such file or directory)
1248313742.381178 lstat(".git/refs/heads/stern", 0x7fff0c663ab0) = -1 ENOENT (No such file or directory)
1248313745.804637 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0

Just to double check,

[root@Pilar linux-2.6]# tune2fs -l /dev/sda5 |grep dir_index
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file

(and I did the mv+cp+rm after setting "dir_index")

Is there another way to check what is going on with that anomalous lstat()?
[ perhaps I will try 'perf' after I read how to use it ]

Thanks,
Carlos

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23  1:22   ` Carlos R. Mafra
@ 2009-07-23  2:20     ` Linus Torvalds
  2009-07-23  2:23       ` Linus Torvalds
  0 siblings, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2009-07-23  2:20 UTC (permalink / raw)
  To: Carlos R. Mafra; +Cc: Git Mailing List, Junio C Hamano



On Thu, 23 Jul 2009, Carlos R. Mafra wrote:
> > Let me guess: if you do a "ls -ld .git/refs/heads" you get a very big 
> > directory, despite it only having three entries in it. 
> 
> [mafra@Pilar:linux-2.6]$ ls -ld .git/refs/heads
> drwxr-xr-x 2 mafra mafra 4096 2009-07-22 23:01 .git/refs/heads/

Hmm. That's just a single block. 

Then I really don't see why the lstat takes so long.

> After 'echo 3 > /proc/sys/vm/drop_caches' it still takes too long,
> 
> 1248310449.693085 munmap(0x7f50bcd11000, 164) = 0
> 1248310449.693187 lstat(".git/refs/heads/sparse", 0x7fff618c0960) = -1 ENOENT (No such file or directory)
> 1248310449.719112 lstat(".git/refs/heads/stern", 0x7fff618c0960) = -1 ENOENT (No such file or directory)
> 1248310453.014041 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 3), ...}) = 0
> 1248310453.014183 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f50bcd11000

Use 'strace -T', which shows how long the actual system calls take, rather 
than '-tt' which just shows when they started.
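
Something like

	strace -T -o /tmp/branch.trace git branch

should do it - the time spent in each call shows up in angle brackets at
the end of every line, so just look for the big ones.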

Maybe the four seconds is something else than the lstat - page faults on 
the pack-file in between the lstat and the fstat, for example.

> Perhaps I should delete the "stern" branch, but I would like to learn why
> it is slowing things, because it also happened before (in fact it is always
> like this, afaicr)

Absolutely. Don't delete it until we figure out what takes so long there.

> Do you have another theory? (now .git/refs/heads is empty)

Clearly it's IO, but if that 'lstat()' was just a red herring, then I 
suspect it's IO on the pack-file. If so, I'd further guess that your VAIO 
has some pitiful 4200rpm harddisk that is slow as hell and has horrible 
seek latencies, and the CPU is way overpowered compared to the cruddy 
disk.

It probably does the object lookup. You can see some debug output if you 
do

	GIT_DEBUG_LOOKUP=1 git branch

and that will show you the patterns. It won't be very pretty, especially 
if you have several pack-files, but maybe we can figure out what's up.

Hmm. I wonder.. I suspect 'git branch' looks up _all_ refs, and then 
afterwards it filters them. So even though it only prints out a few 
branches, maybe it will look at all the tags etc of the whole repository.

Ooh yes. That would do it. It's going to peel and look up every single ref 
it finds, so it's going to look up _hundreds_ of objects (all the tags, 
all the commits they point to, etc etc). Even if it then only shows a 
couple of branches.

Junio, any ideas?

		Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23  2:20     ` Linus Torvalds
@ 2009-07-23  2:23       ` Linus Torvalds
  2009-07-23  3:08         ` Linus Torvalds
                           ` (3 more replies)
  0 siblings, 4 replies; 129+ messages in thread
From: Linus Torvalds @ 2009-07-23  2:23 UTC (permalink / raw)
  To: Carlos R. Mafra; +Cc: Git Mailing List, Junio C Hamano



On Wed, 22 Jul 2009, Linus Torvalds wrote:
> 
> Ooh yes. That would do it. It's going to peel and look up every single ref 
> it finds, so it's going to look up _hundreds_ of objects (all the tags, 
> all the commits they point to, etc etc). Even if it then only shows a 
> couple of branches.
> 
> Junio, any ideas?

I had one of my own.

Does this fix it?

It uses the "raw" version of 'for_each_ref()' (which doesn't verify that 
the ref is valid), and then does the "type verification" before it starts 
doing any gentle commit lookup.

That should hopefully mean that it no longer does tons of object lookups 
on refs that it's not actually interested in. 

		Linus

---
 builtin-branch.c |   10 +++++-----
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/builtin-branch.c b/builtin-branch.c
index 5687d60..54a89ff 100644
--- a/builtin-branch.c
+++ b/builtin-branch.c
@@ -240,6 +240,10 @@ static int append_ref(const char *refname, const unsigned char *sha1, int flags,
 	if (ARRAY_SIZE(ref_kind) <= i)
 		return 0;
 
+	/* Don't add types the caller doesn't want */
+	if ((kind & ref_list->kinds) == 0)
+		return 0;
+
 	commit = lookup_commit_reference_gently(sha1, 1);
 	if (!commit)
 		return error("branch '%s' does not point at a commit", refname);
@@ -248,10 +252,6 @@ static int append_ref(const char *refname, const unsigned char *sha1, int flags,
 	if (!is_descendant_of(commit, ref_list->with_commit))
 		return 0;
 
-	/* Don't add types the caller doesn't want */
-	if ((kind & ref_list->kinds) == 0)
-		return 0;
-
 	if (merge_filter != NO_FILTER)
 		add_pending_object(&ref_list->revs,
 				   (struct object *)commit, refname);
@@ -426,7 +426,7 @@ static void print_ref_list(int kinds, int detached, int verbose, int abbrev, str
 	ref_list.with_commit = with_commit;
 	if (merge_filter != NO_FILTER)
 		init_revisions(&ref_list.revs, NULL);
-	for_each_ref(append_ref, &ref_list);
+	for_each_rawref(append_ref, &ref_list);
 	if (merge_filter != NO_FILTER) {
 		struct commit *filter;
 		filter = lookup_commit_reference_gently(merge_filter_ref, 0);

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23  0:23 ` SZEDER Gábor
@ 2009-07-23  2:25   ` Carlos R. Mafra
  0 siblings, 0 replies; 129+ messages in thread
From: Carlos R. Mafra @ 2009-07-23  2:25 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: git

Hi,

On Wed 22.Jul'09 at 19:23:23 -0500, SZEDER Gábor wrote:
> > So I would like to ask what should I do to recover the missing files
> > in .git/refs/heads/ (which apparently is the cause for my issue) and
> > how I can avoid losing them in the first place.
> 
> have a look at .git/packed-refs and 'git pack-refs'.

Yes, now I learned that the files were not really missing
as in "there is something wrong".

I will also start to use 'git pack-refs --prune' from time to time
now, in addition to 'git gc --prune' and 'git repack -d -a'.

But the takes-too-long 'git branch' issue is apparently caused
by something else.

Thanks Gábor,
Carlos

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23  2:02       ` Carlos R. Mafra
@ 2009-07-23  2:28         ` Linus Torvalds
  2009-07-23 12:42           ` Jakub Narebski
  0 siblings, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2009-07-23  2:28 UTC (permalink / raw)
  To: Carlos R. Mafra; +Cc: git



On Thu, 23 Jul 2009, Carlos R. Mafra wrote:
> 
> Is there another way to check what is going on with that anomalous lstat()?

I really don't think it's the lstat any more. Your directories look small 
and simple, and clearly the indexing made no difference.

See earlier email about using "strace -T" instead of "-tt". Also, I sent 
you a patch to try out just a minute ago, I think that may be it.

> [ perhaps I will try 'perf' after I read how to use it ]

I really like 'perf' (it does what oprofile did for me, but without the 
headaches), but it doesn't help with IO profiling.

I've actually often wanted to have a 'strace' that shows page faults as 
special system calls, but it's sadly nontrivial ;(

			Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23  2:23       ` Linus Torvalds
@ 2009-07-23  3:08         ` Linus Torvalds
  2009-07-23  3:21           ` Linus Torvalds
  2009-07-23  3:18         ` Carlos R. Mafra
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2009-07-23  3:08 UTC (permalink / raw)
  To: Carlos R. Mafra; +Cc: Git Mailing List, Junio C Hamano



On Wed, 22 Jul 2009, Linus Torvalds wrote:
> 
> It uses the "raw" version of 'for_each_ref()' (which doesn't verify that 
> the ref is valid), and then does the "type verification" before it starts 
> doing any gentle commit lookup.
> 
> That should hopefully mean that it no longer does tons of object lookups 
> on refs that it's not actually interested in. 

Hmm. On my kernel repo, doing

	GIT_DEBUG_LOOKUP=1 git branch | wc -l

I get
 - before: 2121
 - after: 39

(where two of the lines are the actual 'git branch' output). So yeah, this 
should make a big difference. It now looks up just two objects (one of 
them duplicated because it checks "HEAD" - but the duplicate lookup won't 
result in any extra IO, so it's only two _uncached_ accesses).

The GIT_DEBUG_LOOKUP debug output probably does match the number of 
cold-cache IO's fairly well for something like this (at least to a first 
approximation), so I really hope my patch will fix your problem.

			Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23  2:23       ` Linus Torvalds
  2009-07-23  3:08         ` Linus Torvalds
@ 2009-07-23  3:18         ` Carlos R. Mafra
  2009-07-23  3:27           ` Carlos R. Mafra
                             ` (2 more replies)
  2009-07-23  4:40         ` Junio C Hamano
  2009-07-23 16:48         ` Anders Kaseorg
  3 siblings, 3 replies; 129+ messages in thread
From: Carlos R. Mafra @ 2009-07-23  3:18 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano

First of all:
      * yes, my VAIO has a slow 4200 rpm disc :-(
      * strace -T indeed showed that lstat() was not guilty
      * GIT_DEBUG_LOOKUP=1 git branch produced ugly 2200+ lines

Now to the patch,

On Wed 22.Jul'09 at 19:23:39 -0700, Linus Torvalds wrote:
> > Ooh yes. That would do it. It's going to peel and look up every single ref 
> > it finds, so it's going to look up _hundreds_ of objects (all the tags, 
> > all the commits they point to, etc etc). Even if it then only shows a 
> > couple of branches.
> > 
> > Junio, any ideas?
> 
> I had one of my own.
> 
> Does this fix it?

Yes!

[mafra@Pilar:linux-2.6]$ time git branch
  27-stable
  28-stable
  29-stable
  30-stable
  dev-private
* master
  option
  sparse
  stern
0.00user 0.01system 0:01.50elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (42major+757minor)pagefaults 0swaps

01.50 is not that good, but it doesn't "feel" as terrible as 4 seconds.
[ It is incredible how 4 secs feels really bad while 2 is acceptable... ]

So thank you very much, Linus! A 50% improvement here!

And I am happy to have finally reported it, after quietly suffering for so long 
thinking that "git is as fast as possible, so it is probably my fault".

PS: Out of curiosity, how many femtoseconds does it take in your 
state-of-the-art machine? :-)

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23  3:08         ` Linus Torvalds
@ 2009-07-23  3:21           ` Linus Torvalds
  2009-07-23 17:47             ` Tony Finch
  0 siblings, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2009-07-23  3:21 UTC (permalink / raw)
  To: Carlos R. Mafra; +Cc: Git Mailing List, Junio C Hamano



On Wed, 22 Jul 2009, Linus Torvalds wrote:
> 
> The GIT_DEBUG_LOOKUP debug output probably does match the number of 
> cold-cache IO's fairly well for something like this (at least to a first 
> approximation), so I really hope my patch will fix your problem.

Side note: the object lookup binary search we do is simple and reasonably 
efficient, but it is _not_ very cache-friendly (where "cache-friendly" 
also in this case means IO caches).

There are more cache-friendly ways of searching, although the really 
clever ones would require us to switch the format of the pack-file index 
around. Which would be a fairly big pain (in addition to making the lookup 
a lot more complex).

The _simpler_ cache-friendly alternative is likely to try the "guess 
location by assuming the SHA1's are evenly spread out" thing, which 
doesn't jump back-and-forth like a binary search does.

We tried it a few years ago, but didn't do cold-cache numbers. And 
repositories were smaller too.

With something like the kernel repo, with 1.2+ million objects, a binary 
search needs about 21 comparisons for each object we look up. The index 
has a first-level fan-out of 256, so that takes away 8 of them, but we're 
still talking about 13 comparisons. With bad locality except for the very 
last ones.

Assuming a 4kB page-size, and about 170 index entries per page (~7 binary 
search levels), that's 6 pages we have to page-fault in for each search. 
And we probably won't start seeing lots of cache reuse until we hit 
hundreds or thousands of objects searched for.

With something like "three iterations of newton-raphson + linear search", 
we might end up with more index entries looked at, but we'd quite possibly 
get much better locality.

I suspect the old newton-raphson patches we had (Discussions and patches 
back in April 2007 on this list) could be resurrected pretty easily.
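
Just to show what I mean, here is a totally untested sketch of that "guess
by assuming an even spread, then scan linearly" approach over a sorted
table of raw 20-byte SHA1 entries (purely illustrative - not the old
patch, and obviously not what a real implementation would look like):

	#include <string.h>
	#include <stdint.h>

	/* 'table' is 'nr' entries of 20 raw bytes each, sorted */
	static int guess_and_scan(const unsigned char *table, unsigned nr,
				  const unsigned char *sha1)
	{
		/* treat the first four bytes as a fraction of the keyspace */
		uint32_t want = ((uint32_t)sha1[0] << 24) | (sha1[1] << 16) |
				(sha1[2] << 8) | sha1[3];
		unsigned pos = (unsigned)(((uint64_t)want * nr) >> 32);

		/* walk linearly from the guess: that's where the locality comes from */
		while (pos < nr && memcmp(table + 20 * pos, sha1, 20) < 0)
			pos++;
		while (pos > 0 && memcmp(table + 20 * (pos - 1), sha1, 20) >= 0)
			pos--;
		if (pos < nr && !memcmp(table + 20 * pos, sha1, 20))
			return pos;
		return -1;
	}

The interesting part is that all the probing happens in one small
neighborhood of the table instead of bouncing across the whole index.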

		Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23  3:18         ` Carlos R. Mafra
@ 2009-07-23  3:27           ` Carlos R. Mafra
  2009-07-23  3:40           ` Carlos R. Mafra
  2009-07-23  3:47           ` Linus Torvalds
  2 siblings, 0 replies; 129+ messages in thread
From: Carlos R. Mafra @ 2009-07-23  3:27 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano

On Thu 23.Jul'09 at  5:18:44 +0200, Carlos R. Mafra wrote:

>       * GIT_DEBUG_LOOKUP=1 git branch produced ugly 2200+ lines

With your patch applied it went down to 132 lines.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23  3:18         ` Carlos R. Mafra
  2009-07-23  3:27           ` Carlos R. Mafra
@ 2009-07-23  3:40           ` Carlos R. Mafra
  2009-07-23  3:47           ` Linus Torvalds
  2 siblings, 0 replies; 129+ messages in thread
From: Carlos R. Mafra @ 2009-07-23  3:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano

On Thu 23.Jul'09 at  5:18:44 +0200, Carlos R. Mafra wrote:

> 0.00user 0.01system 0:01.50elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (42major+757minor)pagefaults 0swaps
> 
> 01.50 is not that good, but it doesn't "feel" as terrible as 4 seconds.
> [ It is incredible how 4 secs feels really bad while 2 is acceptable... ]

I need to sleep, as the number 4 seconds got stuck in my head. In my original
report it was much worse

0.00user 0.05system 0:05.73elapsed

So now it was a 75% improvement!

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23  3:18         ` Carlos R. Mafra
  2009-07-23  3:27           ` Carlos R. Mafra
  2009-07-23  3:40           ` Carlos R. Mafra
@ 2009-07-23  3:47           ` Linus Torvalds
  2009-07-23  4:10             ` Linus Torvalds
  2 siblings, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2009-07-23  3:47 UTC (permalink / raw)
  To: Carlos R. Mafra; +Cc: Git Mailing List, Junio C Hamano



On Thu, 23 Jul 2009, Carlos R. Mafra wrote:
> 
> PS: Out of curiosity, how many femtoseconds does it take in your 
> state-of-the-art machine? :-)

Cold cache? 0.15s before the patch. 0.03s after.

So we're not talking femto-seconds, but I've got Intel SSD's that do 
random reads in well under a millisecond. Your pitiful 4200rpm drive 
probably takes 20ms for each seek. You don't really need that many IO's 
for it to take a second or two. Or four.

The kernel will do IO in bigger chunks than a single page, and there is 
_some_ locality to it all, so you won't see IO for each lookup. But with 
2000+ lines of GIT_DEBUG_LOOKUP, you probably do end up having a 
noticeable fraction of them being IO-causing, and another fraction causing 
seeks.

But I'll see if I can dig up my non-binary-search patch and see if I can 
make it go faster. My machine is fast, but not so fast that I can't 
measure it ;)

		Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23  3:47           ` Linus Torvalds
@ 2009-07-23  4:10             ` Linus Torvalds
  2009-07-23  5:13               ` Junio C Hamano
  2009-07-23  5:17               ` Carlos R. Mafra
  0 siblings, 2 replies; 129+ messages in thread
From: Linus Torvalds @ 2009-07-23  4:10 UTC (permalink / raw)
  To: Carlos R. Mafra; +Cc: Git Mailing List, Junio C Hamano



On Wed, 22 Jul 2009, Linus Torvalds wrote:
> 
> But I'll see if I can dig up my non-binary-search patch and see if I can 
> make it go faster. My machine is fast, but not so fast that I can't 
> measure it ;)

Oh. We actually merged a fixed version of it. I'd completely forgotten.
 
Enabled with 'GIT_USE_LOOKUP'. But it seems to give worse performance, 
despite giving me fewer searches: I get 2121 probes with binary searching, 
but only 1325 with the newton-raphson method (for the non-fixed 'git 
branch' case).

Using GIT_USE_LOOKUP actually results in fewer pagefaults (1391 vs 1473), 
but it's still slower. Interesting. Carlos, try it on your machine (just 
do

	export GIT_USE_LOOKUP=1
	time git branch

to try it, and 'unset GIT_USE_LOOKUP' to disable it.)

(And note that the "=1" part isn't important - the only thing that matters 
is whether the environment variable is set or not - setting it to '0' will 
_not_ disable it, you need to 'unset' it).

With my fix to 'git branch', it doesn't matter. I get the same 
performance, and same number of page faults (676) regardless. So my patch 
makes the GIT_USE_LOOKUP=1 thing irrelevant.

		Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23  2:23       ` Linus Torvalds
  2009-07-23  3:08         ` Linus Torvalds
  2009-07-23  3:18         ` Carlos R. Mafra
@ 2009-07-23  4:40         ` Junio C Hamano
  2009-07-23  5:36           ` Linus Torvalds
  2009-07-23 16:07           ` Carlos R. Mafra
  2009-07-23 16:48         ` Anders Kaseorg
  3 siblings, 2 replies; 129+ messages in thread
From: Junio C Hamano @ 2009-07-23  4:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Carlos R. Mafra, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, 22 Jul 2009, Linus Torvalds wrote:
>> 
>> Ooh yes. That would do it. It's going to peel and look up every single ref 
>> it finds, so it's going to look up _hundreds_ of objects (all the tags, 
>> all the commits they point to, etc etc). Even if it then only shows a 
>> couple of branches.
>> 
>> Junio, any ideas?
>
> I had one of my own.

It seems that I missed all the fun while going out to dinner.

> It uses the "raw" version of 'for_each_ref()' (which doesn't verify that 
> the ref is valid), and then does the "type verification" before it starts 
> doing any gentle commit lookup.

Hmm, we now have to remember what this patch did, if we ever wanted to
introduce negative refs later (see ef06b91 do_for_each_ref: perform the
same sanity check for leftovers., 2006-11-18).  Not exactly nice to spread
the codepaths that need to be updated.  Is the cold cache performance of
"git branch" to list your local branches that important?

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23  4:10             ` Linus Torvalds
@ 2009-07-23  5:13               ` Junio C Hamano
  2009-07-23  5:17               ` Carlos R. Mafra
  1 sibling, 0 replies; 129+ messages in thread
From: Junio C Hamano @ 2009-07-23  5:13 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Carlos R. Mafra, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, 22 Jul 2009, Linus Torvalds wrote:
>> 
>> But I'll see if I can dig up my non-binary-search patch and see if I can 
>> make it go faster. My machine is fast, but not so fast that I can't 
>> measure it ;)
>
> Oh. We actually merged a fixed version of it. I'd completely forgotten.

As the commit message of 628522e (sha1-lookup: more memory efficient
search in sorted list of SHA-1, 2007-12-29) shows, it didn't get any great
performance improvements, even though it did make the probing quite a lot
less memory intensive.

Perhaps you can spot an obvious inefficiency in the code that I failed to
see, just like you recently did for the "show --cc" codepath?

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23  4:10             ` Linus Torvalds
  2009-07-23  5:13               ` Junio C Hamano
@ 2009-07-23  5:17               ` Carlos R. Mafra
  1 sibling, 0 replies; 129+ messages in thread
From: Carlos R. Mafra @ 2009-07-23  5:17 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano

On Wed 22.Jul'09 at 21:10:49 -0700, Linus Torvalds wrote:
> Enabled with 'GIT_USE_LOOKUP'. But it seems to give worse performance, 
> despite giving me fewer searches: I get 2121 probes with binary searching, 
> but only 1325 with the newton-raphson method (for the non-fixed 'git 
> branch' case).
> 
> Using GIT_USE_LOOKUP actually results in fewer pagefaults (1391 vs 1473), 
> but it's still slower. Interesting. Carlos, try it on your machine (just 
> do
> 
> 	export GIT_USE_LOOKUP=1
> 	time git branch
> 
> to try it, and 'unset GIT_USE_LOOKUP' to disable it.)


GIT_USE_LOOKUP=1 makes it a bit slower overall.

Without your patch, I get fewer pagefaults (1254 vs 1404) when
it is set, but it takes ~0.5s longer (it varies a bit).

> With my fix to 'git branch', it doesn't matter. I get the same 
> performance, and same number of page faults (676) regardless. So my patch 
> makes the GIT_USE_LOOKUP=1 thing irrelevant.

With your patch and GIT_USE_LOOKUP=1 I get 751 pagefaults, versus 775
if GIT_USE_LOOKUP is unset, but it is faster when unset.

So your patch without GIT_USE_LOOKUP=1 is the fastest option.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23  4:40         ` Junio C Hamano
@ 2009-07-23  5:36           ` Linus Torvalds
  2009-07-23  5:52             ` Junio C Hamano
  2009-07-23 16:07           ` Carlos R. Mafra
  1 sibling, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2009-07-23  5:36 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Carlos R. Mafra, Git Mailing List



On Wed, 22 Jul 2009, Junio C Hamano wrote:
> 
> Hmm, we now have to remember what this patch did, if we ever wanted to
> introduce negative refs later (see ef06b91 do_for_each_ref: perform the
> same sanity check for leftovers., 2006-11-18).  Not exactly nice to spread
> the codepaths that need to be updated.  Is the cold cache performance of
> "git branch" to list your local branches that important?

Hmm. I do think that 7.5s is _way_ too long to wait for something as 
simple as "what branches do I have?".

And yes, it's also an operation that I'd expect to be quite possibly the 
first one you do when moving to a new repo, so cold-cache is realistic.

And the 'rawref' thing is exactly the same as the 'ref' version, except it 
doesn't do the null_sha1 check and the 'has_sha1_file()' check.

And since git branch will do something _better_ than the 'has_sha1_file()' 
check (by virtue of actually looking up the commit), I don't think that 
part is an issue. So the only issue is the is_null_sha1() thing.

And quite frankly, while the null-sha1 check may make sense, the way the 
flag is named right now (DO_FOR_EACH_INCLUDE_BROKEN), I think we might be 
better off re-thinking things later if we ever end up caring. That 
'is_null_sha1()' check should possibly be under a separate flag.

That said, while I think my patch was the simplest and most straightforward 
fix, the problem could certainly have been fixed differently.

For example, instead of using 'for_each_ref()' and then splitting them by 
kind with that "detect kind" loop, it could instead have done two loops, 
ie

	if (kinds & REF_LOCAL_BRANCH)
		for_each_ref_in("refs/heads/", append_local, &ref_list);
	if (kinds & REF_REMOTE_BRANCH)
		for_each_ref_in("refs/remotes/", append_remote, &ref_list);

and avoided the other refs we aren't interested in _that_ way instead.

But it would be a bigger and more involved patch. It gets really messy too (I 
tried), because when you use 'for_each_ref_in()' it removes the prefix as 
it goes along, but then the code in builtin-branch.c wants the prefix 
after all.

		Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23  5:36           ` Linus Torvalds
@ 2009-07-23  5:52             ` Junio C Hamano
  2009-07-23  6:04               ` Junio C Hamano
  0 siblings, 1 reply; 129+ messages in thread
From: Junio C Hamano @ 2009-07-23  5:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Carlos R. Mafra, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, 22 Jul 2009, Junio C Hamano wrote:
>> 
>> Hmm, we now have to remember what this patch did, if we ever wanted to
>> introduce negative refs later (see ef06b91 do_for_each_ref: perform the
>> same sanity check for leftovers., 2006-11-18).  Not exactly nice to spread
>> the codepaths that need to be updated.
> ...
> And since git branch will do something _better_ than the 'has_sha1_file()' 
> check (by virtue of actually looking up the commit), I don't think that 
> part is an issue. So the only issue is the is_null_sha1() thing.

Exactly.

That is_null_sha1() thing was a remnant of your idea to represent deleted
ref that has a packed counterpart by storing 0{40} in a loose ref, so that
we can implement deletion efficiently.

Since we currently implement deletion by repacking packed refs if the ref
has a packed (possibly stale) one, we do not use such a "negative ref",
and skipping 0{40} done by the normal (i.e. non-raw) for_each_ref() family
is not necessary.

I was inclined to say that, because I never saw anybody complain that
deleting refs was too slow, we declare that we would forever stick to the
current implementation of ref deletion, and remove the is_null_sha1()
check from the do_one_ref() function, even for the include-broken case.

But after thinking about it again, I'd say "if null, then skip" should be
outside the DO_FOR_EACH_INCLUDE_BROKEN anyway, because the null check is
not about brokenness of the ref, but is about a possible future expansion
to represent a deleted ref with such a "negative ref" entry.

If we remove is_null_sha1() from do_one_ref(), or if we move it out of the
"include broken" thing, my "Not exactly nice" comment can be rescinded, as
doing the former (i.e. removal of is_null_sha1() check) is a promise that
we will never have to worry about negative refs, and doing the latter will
still protect callers of do_for_each_rawref() from negative refs if we
ever introduce them in some future.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23  5:52             ` Junio C Hamano
@ 2009-07-23  6:04               ` Junio C Hamano
  2009-07-23 17:19                 ` Linus Torvalds
  0 siblings, 1 reply; 129+ messages in thread
From: Junio C Hamano @ 2009-07-23  6:04 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, Carlos R. Mafra, Git Mailing List

Junio C Hamano <gitster@pobox.com> writes:

> Exactly.
>
> That is_null_sha1() thing was a remnant of your idea to represent deleted
> ref that has a packed counterpart by storing 0{40} in a loose ref, so that
> we can implement deletion efficiently.
>
> Since we currently implement deletion by repacking packed refs if the ref
> has a packed (possibly stale) one, we do not use such a "negative ref",
> and skipping 0{40} done by the normal (i.e. non-raw) for_each_ref() family
> is not necessary.
>
> I was inclined to say that, because I never saw anybody complain that
> deleting refs was too slow, we declare that we would forever stick to the
> current implementation of ref deletion, and remove the is_null_sha1()
> check from the do_one_ref() function, even for the include-broken case.
>
> But after thinking about it again, I'd say "if null, then skip" should be
> outside the DO_FOR_EACH_INCLUDE_BROKEN anyway, because the null check is
> not about brokenness of the ref, but is about a possible future expansion
> to represent a deleted ref with such a "negative ref" entry.
>
> If we remove is_null_sha1() from do_one_ref(), or if we move it out of the
> "include broken" thing, my "Not exactly nice" comment can be rescinded, as
> doing the former (i.e. removal of is_null_sha1() check) is a promise that
> we will never have to worry about negative refs, and doing the latter will
> still protect callers of do_for_each_rawref() from negative refs if we
> ever introduce them in some future.

That is, a patch like this (this should go to 'maint'), and my worries
will go away.

-- >8 --
Subject: do_one_ref(): null_sha1 check is not about broken ref

f8948e2 (remote prune: warn dangling symrefs, 2009-02-08) introduced a
more dangerous variant of for_each_ref() family that skips the check for
dangling refs, but it also made another unrelated check optional by
mistake.

The check to see if a ref points at 0{40} is not about brokenness, but is
about a possible future plan to represent a deleted ref by writing 40 "0"
in a loose ref when there is a stale version of the same ref already in
.git/packed-refs, so that we can implement deletion of a ref without
having to rewrite the packed refs file excluding the ref being deleted.
This check has to be outside of the conditional.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 refs.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/refs.c b/refs.c
index bb0762e..3da3c8c 100644
--- a/refs.c
+++ b/refs.c
@@ -531,9 +531,10 @@ static int do_one_ref(const char *base, each_ref_fn fn, int trim,
 {
 	if (strncmp(base, entry->name, trim))
 		return 0;
+	/* Is this a "negative ref" that represents a deleted ref? */
+	if (is_null_sha1(entry->sha1))
+		return 0;
 	if (!(flags & DO_FOR_EACH_INCLUDE_BROKEN)) {
-		if (is_null_sha1(entry->sha1))
-			return 0;
 		if (!has_sha1_file(entry->sha1)) {
 			error("%s does not point to a valid object!", entry->name);
 			return 0;

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23  2:28         ` Linus Torvalds
@ 2009-07-23 12:42           ` Jakub Narebski
  2009-07-23 14:45             ` Carlos R. Mafra
  2009-07-23 16:25             ` Linus Torvalds
  0 siblings, 2 replies; 129+ messages in thread
From: Jakub Narebski @ 2009-07-23 12:42 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Carlos R. Mafra, git

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Thu, 23 Jul 2009, Carlos R. Mafra wrote:
> > 
> > Is there another way to check what is going on with that anomalous lstat()?
> 
> I really don't think it's the lstat any more. Your directories look small 
> and simple, and clearly the indexing made no difference.
> 
> See earlier email about using "strace -T" instead of "-tt". Also, I sent 
> you a patch to try out just a minute ago, I think that may be it.
> 
> > [ perhaps I will try 'perf' after I read how to use it ]
> 
> I really like 'perf' (it does what oprofile did for me, but without the 
> headaches), but it doesn't help with IO profiling.
> 
> I've actually often wanted to have a 'strace' that shows page faults as 
> special system calls, but it's sadly nontrivial ;(

BTW. Would SystemTap help there?  Among contributed scripts there is
iotimes, so perhaps it would be possible to have iotrace...

-- 
Jakub Narebski
Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23 12:42           ` Jakub Narebski
@ 2009-07-23 14:45             ` Carlos R. Mafra
  2009-07-23 16:25             ` Linus Torvalds
  1 sibling, 0 replies; 129+ messages in thread
From: Carlos R. Mafra @ 2009-07-23 14:45 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Linus Torvalds, git

On Thu 23.Jul'09 at  5:42:03 -0700, Jakub Narebski wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
> 
> > On Thu, 23 Jul 2009, Carlos R. Mafra wrote:
> > > 
> > > Is there another way to check what is going on with that anomalous lstat()?
> > 
> > I really don't think it's the lstat any more. Your directories look small 
> > and simple, and clearly the indexing made no difference.
> > 
> > See earlier email about using "strace -T" instead of "-tt". Also, I sent 
> > you a patch to try out just a minute ago, I think that may be it.
> > 
> > > [ perhaps I will try 'perf' after I read how to use it ]
> > 
> > I really like 'perf' (it does what oprofile did for me, but without the 
> > headaches), but it doesn't help with IO profiling.
> > 
> > I've actually often wanted to have a 'strace' that shows page faults as 
> > special system calls, but it's sadly nontrivial ;(
> 
> BTW. Would SystemTap help there?  Among contributed scripts there is
> iotimes, so perhaps it would be possible to have iotrace...


I played a bit with 'blktrace' and 'btrace' and had two terminals
open side by side, one with 'strace git branch' and the other with
'blktrace'.

It was pretty obvious that exactly at the point where 'git branch'
was stalling (without Linus' patch) -- which I thought had to do
with lstat() -- there was a flurry of activity going on in 'btrace' 
output.

It would be nice if 'btrace' could be somehow unified with 'strace',
if that makes any sense.

Here are some numbers from my tests with blktrace (blkparse and btrace):

[root@Pilar mafra]# grep git blkparse-patch.txt |wc -l
811
[root@Pilar mafra]# grep git blkparse-nopatch.txt |wc -l
3479

where those lines with 'git' are something like

8,5    0      677     1.787350654 18591  I   R 204488479 + 40 [git]
8,0    0      678     1.787370489 18591  A   R 204488783 + 96 <- (8,5) 137529800
8,5    0      679     1.787371886 18591  Q   R 204488783 + 96 [git]
8,5    0      680     1.787375378 18591  G   R 204488783 + 96 [git]
8,5    0      681     1.787377613 18591  I   R 204488783 + 96 [git]

And the summary lines also indicate that the non-patched git makes
the disc work much harder:

*************** Without Linus' patch ******************************************

Total (8,5):
 Reads Queued:         764,   20,008KiB  Writes Queued:           0,        0KiB
 Read Dispatches:      764,   20,008KiB  Write Dispatches:        0,        0KiB
 Reads Requeued:         0               Writes Requeued:         0
 Reads Completed:      764,   20,008KiB  Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB  Write Merges:            0,        0KiB
 IO unplugs:           299               Timer unplugs:           2

Throughput (R/W): 4,003KiB/s / 0KiB/s
Events (8,5): 5,266 entries
Skips: 0 forward (0 -   0.0%)

************** With Linus' patch **********************************************

Total (sda5):
 Reads Queued:         171,    3,128KiB	 Writes Queued:           6,       24KiB
 Read Dispatches:      171,    3,128KiB	 Write Dispatches:        2,       24KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:      171,    3,128KiB	 Writes Completed:        2,       24KiB
 Read Merges:            0,        0KiB	 Write Merges:            4,       16KiB
 IO unplugs:            80        	 Timer unplugs:           0

Throughput (R/W): 1,632KiB/s / 12KiB/s
Events (sda5): 1,226 entries
Skips: 0 forward (0 -   0.0%)

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23  4:40         ` Junio C Hamano
  2009-07-23  5:36           ` Linus Torvalds
@ 2009-07-23 16:07           ` Carlos R. Mafra
  2009-07-23 16:19             ` Linus Torvalds
  1 sibling, 1 reply; 129+ messages in thread
From: Carlos R. Mafra @ 2009-07-23 16:07 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, Git Mailing List

On Wed 22.Jul'09 at 21:40:36 -0700, Junio C Hamano wrote:
> Is the cold cache performance of "git branch" to list your 
> local branches that important?

I simply felt like something not optimal was going on, and in
some sense I still feel it even with Linus' patch applied...

Don't get me wrong, I am super happy that Linus fixed it
so quickly and I am grateful for that, but I am surely missing
some git internal reason why 'git branch' is not instantaneous
as I _naively_ expected.

Having learned about .git/packed-refs last night, today I tried
this (with cold cache),

[mafra@Pilar:linux-2.6]$ time awk '{print $2}' .git/packed-refs |grep heads| awk -F "/" '{print $3}'
0.00user 0.00system 0:00.12elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (3major+311minor)pagefaults 0swaps
27-stable
28-stable
29-stable
30-stable
dev-private
master
option
sparse
stern

and notice how that makes my pitiful harddisc look like Linus' SSD! And the
result is the same. 

[ If some branches are not inside .git/packed-refs but are listed in .git/refs/heads 
(like some of them were last night), it would require some modification to the
script, but it would still be faster ]

However, I know that I am missing something here and I would be happy to 
learn what.

Thanks in advance,
Carlos

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23 16:07           ` Carlos R. Mafra
@ 2009-07-23 16:19             ` Linus Torvalds
  2009-07-23 16:53               ` Carlos R. Mafra
  0 siblings, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2009-07-23 16:19 UTC (permalink / raw)
  To: Carlos R. Mafra; +Cc: Junio C Hamano, Git Mailing List



On Thu, 23 Jul 2009, Carlos R. Mafra wrote:
> 
> Having learned about .git/packed-refs last night, today I tried
> this (with cold cache),
> 
> [mafra@Pilar:linux-2.6]$ time awk '{print $2}' .git/packed-refs |grep heads| awk -F "/" '{print $3}'
> 0.00user 0.00system 0:00.12elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (3major+311minor)pagefaults 0swaps
> 27-stable
> 28-stable
> 29-stable
> 30-stable
> dev-private
> master
> option
> sparse
> stern
> 
> and notice how that makes my pitiful harddisc look like Linus' SSD! And the
> result is the same. 

The result is the same, yes, but it doesn't do error checking.

What "git branch" does over and beyond just looking at the heads is to 
also look at the commits those heads point to. And the reason it sucks for 
you is that the commits are pretty spread out (particularly in the index 
file, but also in the pack-file) on disk. So each "verify this head" will 
likely involve at least one seek, and possibly four or five. 

And on your disk, five seeks is a tenth of a second. You can run hdparm, 
and it will probably say that you get 30MB/s off that laptop drive - but 
when doing small random reads you'll probably get performance in the order 
of a few tens of kilobytes, not megabytes. (With read-ahead and 
read-around it's probably going to be mostly ~64kB IO's and you'll 
probably get hundreds of kB per second, but you're going to care about 
just a few kB total of those).

So we _could_ make 'git branch' not actually read and verify the commits. 
It doesn't strictly _need_ to, unless you use 'git branch -v' or 
something. That would speed it up further, but the verification is nice, 
and as long as performance isn't _horrible_ I think we're better off doing 
it.

After all, you'll see the problem only once.

			Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23 12:42           ` Jakub Narebski
  2009-07-23 14:45             ` Carlos R. Mafra
@ 2009-07-23 16:25             ` Linus Torvalds
  1 sibling, 0 replies; 129+ messages in thread
From: Linus Torvalds @ 2009-07-23 16:25 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Carlos R. Mafra, git



On Thu, 23 Jul 2009, Jakub Narebski wrote:
> 
> BTW. Would SystemTap help there?  Among contributed scripts there is
> iotimes, so perhaps it would be possible to have iotrace...

The problem I've had with all iotracers is that it's easy enough to get an 
IO trace, but it's basically almost impossible to integrate it with what 
actually _caused_ the IO.

Using 'strace -T' shows very clearly what operations are taking a long 
time. It's very useful for seeing what you should not do for good 
performance - including IO - and where it comes from. It's just that page 
faults are invisible to it.

		Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23  2:23       ` Linus Torvalds
                           ` (2 preceding siblings ...)
  2009-07-23  4:40         ` Junio C Hamano
@ 2009-07-23 16:48         ` Anders Kaseorg
  2009-07-23 19:03           ` Carlos R. Mafra
  3 siblings, 1 reply; 129+ messages in thread
From: Anders Kaseorg @ 2009-07-23 16:48 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Carlos R. Mafra, Git Mailing List, Junio C Hamano

On Wed, 22 Jul 2009, Linus Torvalds wrote:
> It uses the "raw" version of 'for_each_ref()' (which doesn't verify that 
> the ref is valid), and then does the "type verification" before it starts 
> doing any gentle commit lookup.

I submitted essentially the same patch in May:
  http://article.gmane.org/gmane.comp.version-control.git/120097
with the additional optimization that we don’t need to look up commits at 
all unless we’re using -v, --merged, --no-merged, or --contains.  In my 
tests, it makes `git branch` 5 times faster on an uncached linux-2.6 
repository.

Anders

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23 16:19             ` Linus Torvalds
@ 2009-07-23 16:53               ` Carlos R. Mafra
  2009-07-23 19:05                 ` Linus Torvalds
  0 siblings, 1 reply; 129+ messages in thread
From: Carlos R. Mafra @ 2009-07-23 16:53 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List

On Thu 23.Jul'09 at  9:19:21 -0700, Linus Torvalds wrote:
> > 
> > and notice how that makes my pitiful harddisc look like Linus' SSD! And the
> > result is the same. 
> 
> The result is the same, yes, but it doesn't do error checking.

Oh, I see.

> So we _could_ make 'git branch' not actually read and verify the commits. 
> It doesn't strictly _need_ to, unless you use 'git branch -v' or 
> something. That would speed it up further, but the verification is nice, 
> and as long as performance isn't _horrible_ I think we're better off doing 
> it.

Right, but I would definitely like to have some option like --dont-check for 
'git branch', and I think I would use it by default (unless experience
shows that errors happen often).

> After all, you'll see the problem only once.

True, but paradoxically that is also the reason why I notice it and
why it feels bad.

Every time I did the first 'git branch' those 5 seconds really hurt, because
I wondered why it couldn't be done in 0s like subsequent commands.

But sure, this was definitely not a pressing issue and your patch made
it even less so. I am happy that it takes 1s now, and I really appreciated
your patch! 

Thanks,
Carlos

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23  6:04               ` Junio C Hamano
@ 2009-07-23 17:19                 ` Linus Torvalds
  0 siblings, 0 replies; 129+ messages in thread
From: Linus Torvalds @ 2009-07-23 17:19 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Carlos R. Mafra, Git Mailing List



On Wed, 22 Jul 2009, Junio C Hamano wrote:
>
> Subject: do_one_ref(): null_sha1 check is not about broken ref

Ack. If we want to make it conditional at some point, we'd want to use a 
different flag. 

I do wonder if we should simply remove the code entirely?

		Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23  3:21           ` Linus Torvalds
@ 2009-07-23 17:47             ` Tony Finch
  2009-07-23 18:57               ` Linus Torvalds
  0 siblings, 1 reply; 129+ messages in thread
From: Tony Finch @ 2009-07-23 17:47 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

On Wed, 22 Jul 2009, Linus Torvalds wrote:
>
> I suspect the old newton-raphson patches we had (Discussions and patches
> back in April 2007 on this list) could be resurrected pretty easily.

That sounds interesting, but I can't find the thread you are referring to.
Do you have a URL or a subject I can feed to Google?

Tony.
-- 
f.anthony.n.finch  <dot@dotat.at>  http://dotat.at/
GERMAN BIGHT HUMBER: SOUTHWEST 5 TO 7. MODERATE OR ROUGH. SQUALLY SHOWERS.
MODERATE OR GOOD.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23 17:47             ` Tony Finch
@ 2009-07-23 18:57               ` Linus Torvalds
  0 siblings, 0 replies; 129+ messages in thread
From: Linus Torvalds @ 2009-07-23 18:57 UTC (permalink / raw)
  To: Tony Finch; +Cc: git



On Thu, 23 Jul 2009, Tony Finch wrote:

> On Wed, 22 Jul 2009, Linus Torvalds wrote:
> >
> > I suspect the old newton-raphson patches we had (Discussions and patches
> > back in April 2007 on this list) could be resurrected pretty easily.
> 
> That sounds interesting, but I can't find the thread you are referring to.
> Do you have a URL or a subject I can feed to Google?

Some googling found this:

	http://marc.info/?l=git&m=117537594112450&w=2

but what got merged (half a year later) was a much fancier thing by Junio. 
See sha1-lookup.c.

That original "single iteration of newton-raphson" patch was buggy, but 
it's perhaps interesting as a concept patch.

		Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23 16:48         ` Anders Kaseorg
@ 2009-07-23 19:03           ` Carlos R. Mafra
  0 siblings, 0 replies; 129+ messages in thread
From: Carlos R. Mafra @ 2009-07-23 19:03 UTC (permalink / raw)
  To: Anders Kaseorg; +Cc: Linus Torvalds, Git Mailing List, Junio C Hamano

On Thu 23.Jul'09 at 12:48:20 -0400, Anders Kaseorg wrote:
> 
> I submitted essentially the same patch in May:
>   http://article.gmane.org/gmane.comp.version-control.git/120097
> with the additional optimization that we don't need to lookup commits at
> all unless we're using -v, --merged, --no-merged, or --contains.  In my 
> tests, it makes `git branch` 5 times faster on an uncached linux-2.6 
> repository.

I also tested your patch even though you said it was "essentially the same". 

But after repeating the tests 6 times for both your patch and Linus'
(taking care to let the system rest a bit after clearing the cache), yours
is faster:

0.62 +/- 0.24 (Anders)
1.35 +/- 0.23 (Linus)

And this is the raw data for your patch,

0.00user 0.01system 0:00.54elapsed 2%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (7major+727minor)pagefaults 0swaps

0.00user 0.00system 0:00.18elapsed 5%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1major+733minor)pagefaults 0swaps

0.00user 0.00system 0:00.66elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (9major+723minor)pagefaults 0swaps

0.00user 0.01system 0:00.74elapsed 2%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (14major+720minor)pagefaults 0swaps

0.00user 0.00system 0:00.80elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (16major+718minor)pagefaults 0swaps

0.00user 0.00system 0:00.83elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (16major+718minor)pagefaults 0swaps


and for Linus'

0.00user 0.01system 0:01.56elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (43major+755minor)pagefaults 0swaps

0.00user 0.01system 0:01.09elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (24major+775minor)pagefaults 0swaps

0.00user 0.01system 0:01.33elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (32major+767minor)pagefaults 0swaps

0.00user 0.00system 0:01.53elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (39major+760minor)pagefaults 0swaps

0.00user 0.01system 0:01.06elapsed 2%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (24major+775minor)pagefaults 0swaps

0.00user 0.00system 0:01.54elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (39major+760minor)pagefaults 0swaps

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23 16:53               ` Carlos R. Mafra
@ 2009-07-23 19:05                 ` Linus Torvalds
  2009-07-23 19:13                   ` Linus Torvalds
  0 siblings, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2009-07-23 19:05 UTC (permalink / raw)
  To: Carlos R. Mafra; +Cc: Junio C Hamano, Git Mailing List



On Thu, 23 Jul 2009, Carlos R. Mafra wrote:
>
> Every time I did the first 'git branch' those 5 seconds really hurt, because
> I wondered why it couldn't be done in 0s like subsequent commands.
> 
> But sure, this was definitely not a pressing issue and your patch made
> it even less so. I am happy that it takes 1s now, and I really appreciated
> your patch! 

You could try something like this (on _top_ of the previous patch). 

Not very exhaustively tested, but it's pretty simple.

It will still do _some_ object lookups. In particular, it will do the HEAD 
lookup in 'print_ref_list()', even if it's not strictly necessary. But it 
should cut down the noise further.

		Linus

---
 builtin-branch.c |   24 ++++++++++++++----------
 1 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/builtin-branch.c b/builtin-branch.c
index 54a89ff..82c2cf0 100644
--- a/builtin-branch.c
+++ b/builtin-branch.c
@@ -191,7 +191,7 @@ struct ref_item {
 
 struct ref_list {
 	struct rev_info revs;
-	int index, alloc, maxwidth;
+	int index, alloc, maxwidth, verbose;
 	struct ref_item *list;
 	struct commit_list *with_commit;
 	int kinds;
@@ -244,17 +244,20 @@ static int append_ref(const char *refname, const unsigned char *sha1, int flags,
 	if ((kind & ref_list->kinds) == 0)
 		return 0;
 
-	commit = lookup_commit_reference_gently(sha1, 1);
-	if (!commit)
-		return error("branch '%s' does not point at a commit", refname);
+	commit = NULL;
+	if (ref_list->verbose || ref_list->with_commit || merge_filter != NO_FILTER) {
+		commit = lookup_commit_reference_gently(sha1, 1);
+		if (!commit)
+			return error("branch '%s' does not point at a commit", refname);
 
-	/* Filter with with_commit if specified */
-	if (!is_descendant_of(commit, ref_list->with_commit))
-		return 0;
+		/* Filter with with_commit if specified */
+		if (!is_descendant_of(commit, ref_list->with_commit))
+			return 0;
 
-	if (merge_filter != NO_FILTER)
-		add_pending_object(&ref_list->revs,
-				   (struct object *)commit, refname);
+		if (merge_filter != NO_FILTER)
+			add_pending_object(&ref_list->revs,
+					   (struct object *)commit, refname);
+	}
 
 	/* Resize buffer */
 	if (ref_list->index >= ref_list->alloc) {
@@ -423,6 +426,7 @@ static void print_ref_list(int kinds, int detached, int verbose, int abbrev, str
 
 	memset(&ref_list, 0, sizeof(ref_list));
 	ref_list.kinds = kinds;
+	ref_list.verbose = verbose;
 	ref_list.with_commit = with_commit;
 	if (merge_filter != NO_FILTER)
 		init_revisions(&ref_list.revs, NULL);

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23 19:05                 ` Linus Torvalds
@ 2009-07-23 19:13                   ` Linus Torvalds
  2009-07-23 19:55                     ` Carlos R. Mafra
  0 siblings, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2009-07-23 19:13 UTC (permalink / raw)
  To: Carlos R. Mafra; +Cc: Junio C Hamano, Git Mailing List



On Thu, 23 Jul 2009, Linus Torvalds wrote:
> 
> You could try something like this (on _top_ of the previous patch). 
> 
> Not very exhaustively tested, but it's pretty simple.
> 
> It will still do _some_ object lookups. In particular, it will do the HEAD 
> lookup in 'print_ref_list()', even if it's not strictly necessary. But it 
> should cut down the noise further.

And this (on top of them all) will basically avoid even that one.

In fact, I think this is a cleanup. I think I'll resubmit the whole series 
with proper commit messages etc.

		Linus

---
 builtin-branch.c |   38 +++++++++++++++++++++++---------------
 1 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/builtin-branch.c b/builtin-branch.c
index 82c2cf0..1a03d5f 100644
--- a/builtin-branch.c
+++ b/builtin-branch.c
@@ -191,7 +191,7 @@ struct ref_item {
 
 struct ref_list {
 	struct rev_info revs;
-	int index, alloc, maxwidth, verbose;
+	int index, alloc, maxwidth, verbose, abbrev;
 	struct ref_item *list;
 	struct commit_list *with_commit;
 	int kinds;
@@ -418,15 +418,34 @@ static int calc_maxwidth(struct ref_list *refs)
 	return w;
 }
 
+
+static void show_detached(struct ref_list *ref_list)
+{
+	struct commit *head_commit = lookup_commit_reference_gently(head_sha1, 1);
+
+	if (head_commit && is_descendant_of(head_commit, ref_list->with_commit)) {
+		struct ref_item item;
+		item.name = xstrdup("(no branch)");
+		item.len = strlen(item.name);
+		item.kind = REF_LOCAL_BRANCH;
+		item.dest = NULL;
+		item.commit = head_commit;
+		if (item.len > ref_list->maxwidth)
+			ref_list->maxwidth = item.len;
+		print_ref_item(&item, ref_list->maxwidth, ref_list->verbose, ref_list->abbrev, 1, "");
+		free(item.name);
+	}
+}
+
 static void print_ref_list(int kinds, int detached, int verbose, int abbrev, struct commit_list *with_commit)
 {
 	int i;
 	struct ref_list ref_list;
-	struct commit *head_commit = lookup_commit_reference_gently(head_sha1, 1);
 
 	memset(&ref_list, 0, sizeof(ref_list));
 	ref_list.kinds = kinds;
 	ref_list.verbose = verbose;
+	ref_list.abbrev = abbrev;
 	ref_list.with_commit = with_commit;
 	if (merge_filter != NO_FILTER)
 		init_revisions(&ref_list.revs, NULL);
@@ -446,19 +465,8 @@ static void print_ref_list(int kinds, int detached, int verbose, int abbrev, str
 	qsort(ref_list.list, ref_list.index, sizeof(struct ref_item), ref_cmp);
 
 	detached = (detached && (kinds & REF_LOCAL_BRANCH));
-	if (detached && head_commit &&
-	    is_descendant_of(head_commit, with_commit)) {
-		struct ref_item item;
-		item.name = xstrdup("(no branch)");
-		item.len = strlen(item.name);
-		item.kind = REF_LOCAL_BRANCH;
-		item.dest = NULL;
-		item.commit = head_commit;
-		if (item.len > ref_list.maxwidth)
-			ref_list.maxwidth = item.len;
-		print_ref_item(&item, ref_list.maxwidth, verbose, abbrev, 1, "");
-		free(item.name);
-	}
+	if (detached)
+		show_detached(&ref_list);
 
 	for (i = 0; i < ref_list.index; i++) {
 		int current = !detached &&

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23 19:13                   ` Linus Torvalds
@ 2009-07-23 19:55                     ` Carlos R. Mafra
  2009-07-24 20:36                       ` Linus Torvalds
  0 siblings, 1 reply; 129+ messages in thread
From: Carlos R. Mafra @ 2009-07-23 19:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List

On Thu 23.Jul'09 at 12:13:41 -0700, Linus Torvalds wrote:
> > It will still do _some_ object lookups. In particular, it will do the HEAD 
> > lookup in 'print_ref_list()', even if it's not strictly necessary. But it 
> > should cut down the noise further.
> 
> And this (on top of them all) will basically avoid even that one.

Ok, I applied (both) on top of the first one.

After 7 tests I got these, 

time:

      0.61 +/- 0.08

GIT_DEBUG_LOOKUP=1 git branch |wc -l
    
      9
      
which is in fact just the list of branches.

Compared to yesterday, that is a huge improvement (0.6s vs 5.7s)
and (9 vs 2200+). At least for me 0.6s is "instantaneous", so
the issue is really gone.

Thanks a lot to everyone!

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-23 19:55                     ` Carlos R. Mafra
@ 2009-07-24 20:36                       ` Linus Torvalds
  2009-07-24 20:47                         ` Linus Torvalds
  0 siblings, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2009-07-24 20:36 UTC (permalink / raw)
  To: Carlos R. Mafra; +Cc: Junio C Hamano, Git Mailing List



On Thu, 23 Jul 2009, Carlos R. Mafra wrote:
>
> After 7 tests I got these, 
> 
> time:
> 
>       0.61 +/- 0.08

Btw, I think 0.61s is still too much. Can you send me the output of 
'strace -Ttt' on your machine?

It's entirely possible that it's all the actual binary (and shared 
library) loading, of course. You do have a slow harddisk. But it takes 
0.035s for me, and I'm wondering if there is something other than just CPU 
speed and IO speed accounting for the 20x performance difference.

(That said, maybe 20x is right - my SSD latency almost certainly is 20x 
better).

			Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-24 20:36                       ` Linus Torvalds
@ 2009-07-24 20:47                         ` Linus Torvalds
  2009-07-24 21:21                           ` Linus Torvalds
  0 siblings, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2009-07-24 20:47 UTC (permalink / raw)
  To: Carlos R. Mafra; +Cc: Junio C Hamano, Git Mailing List



On Fri, 24 Jul 2009, Linus Torvalds wrote:
> 
> Btw, I think 0.61s is still too much. Can you send me the output of 
> 'strace -Ttt' on your machine?

Never mind. I'm seeing even worse behavior on a laptop I just dug up 
(another 4200 rpm harddisk).

I'll dig some more.

		Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-24 20:47                         ` Linus Torvalds
@ 2009-07-24 21:21                           ` Linus Torvalds
  2009-07-24 22:13                             ` Linus Torvalds
                                               ` (2 more replies)
  0 siblings, 3 replies; 129+ messages in thread
From: Linus Torvalds @ 2009-07-24 21:21 UTC (permalink / raw)
  To: Carlos R. Mafra; +Cc: Junio C Hamano, Git Mailing List



On Fri, 24 Jul 2009, Linus Torvalds wrote:
> 
> Never mind. I'm seeing even worse behavior on a laptop I just dug up 
> (another 4200 rpm harddisk).
> 
> I'll dig some more.

Yeah, it seems to be the loading overhead. I'm seeing a 'time git branch' 
take 1.2s in the cold-cache case, in a directory that isn't even a git 
directory.

And 80% of it comes before we even get to 'main()'. Shared library 
loading, SELinux crud etc. A lot of it seems to be 'libfreebl3' and 
'libselinux', which is some crazy sh*t.

It seems to be all from 'curl' support.

That seems _really_ sad. Lookie here:

   [torvalds@nehalem git]$ ldd git
	linux-vdso.so.1 =>  (0x00007fff61da7000)
	libcurl.so.4 => /usr/lib64/libcurl.so.4 (0x00007f2f1a498000)
	libz.so.1 => /lib64/libz.so.1 (0x0000003cdb800000)
	libcrypto.so.8 => /usr/lib64/libcrypto.so.8 (0x0000003ba7a00000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003cdb400000)
	libc.so.6 => /lib64/libc.so.6 (0x0000003cda800000)
	libidn.so.11 => /lib64/libidn.so.11 (0x0000003ceaa00000)
	libssh2.so.1 => /usr/lib64/libssh2.so.1 (0x0000003ba8e00000)
	libldap-2.4.so.2 => /usr/lib64/libldap-2.4.so.2 (0x00007f2f1a250000)
	librt.so.1 => /lib64/librt.so.1 (0x0000003cdbc00000)
	libgssapi_krb5.so.2 => /usr/lib64/libgssapi_krb5.so.2 (0x0000003ce6e00000)
	libkrb5.so.3 => /usr/lib64/libkrb5.so.3 (0x0000003ce7e00000)
	libk5crypto.so.3 => /usr/lib64/libk5crypto.so.3 (0x0000003ce7200000)
	libcom_err.so.2 => /lib64/libcom_err.so.2 (0x0000003ce6a00000)
	libssl3.so => /lib64/libssl3.so (0x0000003490200000)
	libsmime3.so => /lib64/libsmime3.so (0x000000348fe00000)
	libnss3.so => /lib64/libnss3.so (0x000000348f600000)
	libplds4.so => /lib64/libplds4.so (0x0000003cbc800000)
	libplc4.so => /lib64/libplc4.so (0x0000003cbdc00000)
	libnspr4.so => /lib64/libnspr4.so (0x0000003cbd800000)
	libdl.so.2 => /lib64/libdl.so.2 (0x0000003cdb000000)
	/lib64/ld-linux-x86-64.so.2 (0x0000003cda400000)
	libssl.so.8 => /usr/lib64/libssl.so.8 (0x0000003ba7e00000)
	liblber-2.4.so.2 => /usr/lib64/liblber-2.4.so.2 (0x0000003ceee00000)
	libresolv.so.2 => /lib64/libresolv.so.2 (0x0000003ce5600000)
	libsasl2.so.2 => /usr/lib64/libsasl2.so.2 (0x00007f2f1a030000)
	libkrb5support.so.0 => /usr/lib64/libkrb5support.so.0 (0x0000003ce7a00000)
	libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x0000003ce7600000)
	libnssutil3.so => /lib64/libnssutil3.so (0x000000348fa00000)
	libcrypt.so.1 => /lib64/libcrypt.so.1 (0x00007f2f19df8000)
	libselinux.so.1 => /lib64/libselinux.so.1 (0x0000003cdc400000)
	libfreebl3.so => /lib64/libfreebl3.so (0x00007f2f19b99000)
   [torvalds@nehalem git]$ make -j16 NO_CURL=1
   [torvalds@nehalem git]$ ldd git
	linux-vdso.so.1 =>  (0x00007fff2f960000)
	libz.so.1 => /lib64/libz.so.1 (0x0000003cdb800000)
	libcrypto.so.8 => /usr/lib64/libcrypto.so.8 (0x0000003ba7a00000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003cdb400000)
	libc.so.6 => /lib64/libc.so.6 (0x0000003cda800000)
	libdl.so.2 => /lib64/libdl.so.2 (0x0000003cdb000000)
	/lib64/ld-linux-x86-64.so.2 (0x0000003cda400000)

What a huge difference!

And the NO_CURL version really does load a lot faster in cold-cache. We're 
not talking small differences:

 - compiled with NO_CURL, five runs of "echo 3 > /proc/sys/vm/drop_caches" 
   followed by "time git branch":

	real	0m0.654s
	real	0m0.562s
	real	0m0.519s
	real	0m0.534s
	real	0m0.734s

   Total number of system calls: 194

 - compiled with curl, same thing:

	real	0m1.503s
	real	0m1.455s
	real	0m1.267s
	real	0m1.819s
	real	0m0.985s

   Total number of system calls: 407!

ie we're talking a _huge_ hit in startup times for that curl support. 
That's really really sad - especially considering how all the curl support 
is for very random occasional stuff. I never use it myself, for example, 
since I don't use http at all. And even for people who do, they only need 
it for non-local operations.

I wonder if there is some way to only load the crazy curl stuff when we 
actually want to open an http: connection.

			Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-24 21:21                           ` Linus Torvalds
@ 2009-07-24 22:13                             ` Linus Torvalds
  2009-07-24 22:18                               ` david
  2009-08-07  4:21                               ` Jeff King
  2009-07-24 22:54                             ` Theodore Tso
  2009-07-24 23:46                             ` Carlos R. Mafra
  2 siblings, 2 replies; 129+ messages in thread
From: Linus Torvalds @ 2009-07-24 22:13 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List
  Cc: Carlos R. Mafra, Daniel Barkalow, Johannes Schindelin


On Fri, 24 Jul 2009, Linus Torvalds wrote:
> 
> ie we're talking a _huge_ hit in startup times for that curl support. 
> That's really really sad - especially considering how all the curl support 
> is for very random occasional stuff. I never use it myself, for example, 
> since I don't use http at all. And even for people who do, they only need 
> it for non-local operations.
> 
> I wonder if there is some way to only load the crazy curl stuff when we 
> actually want to open an http: connection.

Here's the simple step#1: make 'git-http-fetch' be an external program 
rather than a built-in.

Sadly, I have no idea how to turn the transport.c code into an external 
walker sanely (turn the ref/object walkers into an exec of an external 
program). So we still end up linking with curl. But maybe somebody 
(Daniel? Dscho?) who knows the transport code could try to make it an 
external process?

The performance angle of http fetching is non-existent, we really should 
try very hard to make the curl-dependent parts be in a binary of their 
own.

		Linus

---
>From 3cfc50d497266dc73a414ed1460b36b712ad10de Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri, 24 Jul 2009 14:54:55 -0700
Subject: [PATCH] git-http-fetch: not a builtin

We should really try to avoid having a dependency on the curl libraries
for the core 'git' executable. It adds huge overheads, for no advantage.

This splits up git-http-fetch so that it isn't built-in.  We still do
end up linking with curl for the git binary due to the transport.c http
walker, but that's at least partially an independent issue.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 Makefile                             |    8 +++++++-
 git.c                                |    3 ---
 builtin-http-fetch.c => http-fetch.c |    5 ++++-
 3 files changed, 11 insertions(+), 5 deletions(-)
 rename builtin-http-fetch.c => http-fetch.c (95%)

diff --git a/Makefile b/Makefile
index bde27ed..8cbd863 100644
--- a/Makefile
+++ b/Makefile
@@ -978,9 +978,12 @@ else
 	else
 		CURL_LIBCURL = -lcurl
 	endif
-	BUILTIN_OBJS += builtin-http-fetch.o
+	PROGRAMS += git-http-fetch$X
+
+	# FIXME! Sadly 'transport.c' still needs these for the builtin case
 	EXTLIBS += $(CURL_LIBCURL)
 	LIB_OBJS += http.o http-walker.o
+
 	curl_check := $(shell (echo 070908; curl-config --vernum) | sort -r | sed -ne 2p)
 	ifeq "$(curl_check)" "070908"
 		ifndef NO_EXPAT
@@ -1485,6 +1488,9 @@ git-imap-send$X: imap-send.o $(GITLIBS)
 
 http.o http-walker.o http-push.o transport.o: http.h
 
+git-http-fetch$X: revision.o http.o http-push.o $(GITLIBS)
+	$(QUIET_LINK)$(CC) $(ALL_CFLAGS) -o $@ $(ALL_LDFLAGS) $(filter %.o,$^) \
+		$(LIBS) $(CURL_LIBCURL) $(EXPAT_LIBEXPAT)
 git-http-push$X: revision.o http.o http-push.o $(GITLIBS)
 	$(QUIET_LINK)$(CC) $(ALL_CFLAGS) -o $@ $(ALL_LDFLAGS) $(filter %.o,$^) \
 		$(LIBS) $(CURL_LIBCURL) $(EXPAT_LIBEXPAT)
diff --git a/git.c b/git.c
index 807d875..c1e8f05 100644
--- a/git.c
+++ b/git.c
@@ -309,9 +309,6 @@ static void handle_internal_command(int argc, const char **argv)
 		{ "get-tar-commit-id", cmd_get_tar_commit_id },
 		{ "grep", cmd_grep, RUN_SETUP | USE_PAGER },
 		{ "help", cmd_help },
-#ifndef NO_CURL
-		{ "http-fetch", cmd_http_fetch, RUN_SETUP },
-#endif
 		{ "init", cmd_init_db },
 		{ "init-db", cmd_init_db },
 		{ "log", cmd_log, RUN_SETUP | USE_PAGER },
diff --git a/builtin-http-fetch.c b/http-fetch.c
similarity index 95%
rename from builtin-http-fetch.c
rename to http-fetch.c
index f3e63d7..e8f44ba 100644
--- a/builtin-http-fetch.c
+++ b/http-fetch.c
@@ -1,8 +1,9 @@
 #include "cache.h"
 #include "walker.h"
 
-int cmd_http_fetch(int argc, const char **argv, const char *prefix)
+int main(int argc, const char **argv)
 {
+	const char *prefix;
 	struct walker *walker;
 	int commits_on_stdin = 0;
 	int commits;
@@ -18,6 +19,8 @@ int cmd_http_fetch(int argc, const char **argv, const char *prefix)
 	int get_verbosely = 0;
 	int get_recover = 0;
 
+	prefix = setup_git_directory();
+
 	git_config(git_default_config, NULL);
 
 	while (arg < argc && argv[arg][0] == '-') {
-- 
1.6.4.rc1.5.gb84f

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-24 22:13                             ` Linus Torvalds
@ 2009-07-24 22:18                               ` david
  2009-07-24 22:42                                 ` Linus Torvalds
  2009-08-07  4:21                               ` Jeff King
  1 sibling, 1 reply; 129+ messages in thread
From: david @ 2009-07-24 22:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Junio C Hamano, Git Mailing List, Carlos R. Mafra,
	Daniel Barkalow, Johannes Schindelin

On Fri, 24 Jul 2009, Linus Torvalds wrote:

> On Fri, 24 Jul 2009, Linus Torvalds wrote:
>>
>> ie we're talking a _huge_ hit in startup times for that curl support.
>> That's really really sad - especially considering how all the curl support
>> is for very random occasional stuff. I never use it myself, for example,
>> since I don't use http at all. And even for people who do, they only need
>> it for non-local operations.
>>
>> I wonder if there is some way to only load the crazy curl stuff when we
>> actually want to open an http: connection.
>
> Here's the simple step#1: make 'git-http-fetch' be an external program
> rather than a built-in.
>
> Sadly, I have no idea how to turn the transport.c code into an external
> walker sanely (turn the ref/object walkers into an exec of an external
> program). So we still end up linking with curl. But maybe somebody
> (Daniel? Dscho?) who knows the transport code could try to make it an
> external process?
>
> The performance angle of http fetching is non-existent, we really should
> try very hard to make the curl-dependent parts be in a binary of their
> own.

what does the performance look like if you just do a static compile 
instead?

David Lang

> 		Linus
>
> ---
>> From 3cfc50d497266dc73a414ed1460b36b712ad10de Mon Sep 17 00:00:00 2001
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Fri, 24 Jul 2009 14:54:55 -0700
> Subject: [PATCH] git-http-fetch: not a builtin
>
> We should really try to avoid having a dependency on the curl libraries
> for the core 'git' executable. It adds huge overheads, for no advantage.
>
> This splits up git-http-fetch so that it isn't built-in.  We still do
> end up linking with curl for the git binary due to the transport.c http
> walker, but that's at least partially an independent issue.
>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> ---
> Makefile                             |    8 +++++++-
> git.c                                |    3 ---
> builtin-http-fetch.c => http-fetch.c |    5 ++++-
> 3 files changed, 11 insertions(+), 5 deletions(-)
> rename builtin-http-fetch.c => http-fetch.c (95%)
>
> diff --git a/Makefile b/Makefile
> index bde27ed..8cbd863 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -978,9 +978,12 @@ else
> 	else
> 		CURL_LIBCURL = -lcurl
> 	endif
> -	BUILTIN_OBJS += builtin-http-fetch.o
> +	PROGRAMS += git-http-fetch$X
> +
> +	# FIXME! Sadly 'transport.c' still needs these for the builtin case
> 	EXTLIBS += $(CURL_LIBCURL)
> 	LIB_OBJS += http.o http-walker.o
> +
> 	curl_check := $(shell (echo 070908; curl-config --vernum) | sort -r | sed -ne 2p)
> 	ifeq "$(curl_check)" "070908"
> 		ifndef NO_EXPAT
> @@ -1485,6 +1488,9 @@ git-imap-send$X: imap-send.o $(GITLIBS)
>
> http.o http-walker.o http-push.o transport.o: http.h
>
> +git-http-fetch$X: revision.o http.o http-push.o $(GITLIBS)
> +	$(QUIET_LINK)$(CC) $(ALL_CFLAGS) -o $@ $(ALL_LDFLAGS) $(filter %.o,$^) \
> +		$(LIBS) $(CURL_LIBCURL) $(EXPAT_LIBEXPAT)
> git-http-push$X: revision.o http.o http-push.o $(GITLIBS)
> 	$(QUIET_LINK)$(CC) $(ALL_CFLAGS) -o $@ $(ALL_LDFLAGS) $(filter %.o,$^) \
> 		$(LIBS) $(CURL_LIBCURL) $(EXPAT_LIBEXPAT)
> diff --git a/git.c b/git.c
> index 807d875..c1e8f05 100644
> --- a/git.c
> +++ b/git.c
> @@ -309,9 +309,6 @@ static void handle_internal_command(int argc, const char **argv)
> 		{ "get-tar-commit-id", cmd_get_tar_commit_id },
> 		{ "grep", cmd_grep, RUN_SETUP | USE_PAGER },
> 		{ "help", cmd_help },
> -#ifndef NO_CURL
> -		{ "http-fetch", cmd_http_fetch, RUN_SETUP },
> -#endif
> 		{ "init", cmd_init_db },
> 		{ "init-db", cmd_init_db },
> 		{ "log", cmd_log, RUN_SETUP | USE_PAGER },
> diff --git a/builtin-http-fetch.c b/http-fetch.c
> similarity index 95%
> rename from builtin-http-fetch.c
> rename to http-fetch.c
> index f3e63d7..e8f44ba 100644
> --- a/builtin-http-fetch.c
> +++ b/http-fetch.c
> @@ -1,8 +1,9 @@
> #include "cache.h"
> #include "walker.h"
>
> -int cmd_http_fetch(int argc, const char **argv, const char *prefix)
> +int main(int argc, const char **argv)
> {
> +	const char *prefix;
> 	struct walker *walker;
> 	int commits_on_stdin = 0;
> 	int commits;
> @@ -18,6 +19,8 @@ int cmd_http_fetch(int argc, const char **argv, const char *prefix)
> 	int get_verbosely = 0;
> 	int get_recover = 0;
>
> +	prefix = setup_git_directory();
> +
> 	git_config(git_default_config, NULL);
>
> 	while (arg < argc && argv[arg][0] == '-') {
>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-24 22:18                               ` david
@ 2009-07-24 22:42                                 ` Linus Torvalds
  2009-07-24 22:46                                   ` david
  0 siblings, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2009-07-24 22:42 UTC (permalink / raw)
  To: david
  Cc: Junio C Hamano, Git Mailing List, Carlos R. Mafra,
	Daniel Barkalow, Johannes Schindelin



On Fri, 24 Jul 2009, david@lang.hm wrote:
> 
> what does the performance look like if you just do a static compile instead?

I don't even know - I don't have a static version of curl. I could install 
one, of course, but since I don't think that's the solution anyway, I'm 
not going to bother.

The real solution really is to not have curl support in the main binary.

One option might be to make _all_ the transport code be outside of the 
core binary, of course.  That's a fairly simple but somewhat sad solution 
(ie make all of push/pull/fetch/clone/ls-remote/etc be external binaries)

		Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-24 22:42                                 ` Linus Torvalds
@ 2009-07-24 22:46                                   ` david
  2009-07-25  2:39                                     ` Linus Torvalds
  0 siblings, 1 reply; 129+ messages in thread
From: david @ 2009-07-24 22:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Junio C Hamano, Git Mailing List, Carlos R. Mafra,
	Daniel Barkalow, Johannes Schindelin

On Fri, 24 Jul 2009, Linus Torvalds wrote:

> On Fri, 24 Jul 2009, david@lang.hm wrote:
>>
>> what does the performance look like if you just do a static compile instead?
>
> I don't even know - I don't have a static version of curl. I could install
> one, of course, but since I don't think that's the solution anyway, I'm
> not going to bother.

I wasn't thinking a static version of curl, I was thinking a static 
version of the git binaries, to see how fast things could be if no startup 
linking was necessary.

David Lang

> The real solution really is to not have curl support in the main binary.
>
> One option might be to make _all_ the transport code be outside of the
> core binary, of course.  That's a fairly simple but somewhat sad solution
> (ie make all of push/pull/fetch/clone/ls-remote/etc be external binaries)
>
> 		Linus
>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-24 21:21                           ` Linus Torvalds
  2009-07-24 22:13                             ` Linus Torvalds
@ 2009-07-24 22:54                             ` Theodore Tso
  2009-07-24 22:59                               ` Shawn O. Pearce
  2009-07-24 23:46                             ` Carlos R. Mafra
  2 siblings, 1 reply; 129+ messages in thread
From: Theodore Tso @ 2009-07-24 22:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Carlos R. Mafra, Junio C Hamano, Git Mailing List

On Fri, Jul 24, 2009 at 02:21:20PM -0700, Linus Torvalds wrote:
> 
> I wonder if there is some way to only load the crazy curl stuff when we 
> actually want to open an http: connection.

Well, we could use dlopen(), but I'm not sure that qualifies as a
_sane_ solution --- especially given that there are approximately 15
interfaces used by git, that we'd have to resolve using dlsym().
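
A minimal sketch of what such a dlopen()/dlsym() shim could look like, just
to show the shape of it (the dlfcn.h calls and the two libcurl symbol names
are real, but the wrapper itself is hypothetical and not git code):

	#include <dlfcn.h>
	#include <stdio.h>
	#include <stdlib.h>

	static void *curl_lib;
	static void *(*dyn_curl_easy_init)(void);
	static void (*dyn_curl_easy_cleanup)(void *);

	/* Load libcurl only when an http transport is actually used. */
	static void lazy_load_curl(void)
	{
		if (curl_lib)
			return;
		curl_lib = dlopen("libcurl.so.4", RTLD_NOW);
		if (!curl_lib) {
			fprintf(stderr, "dlopen failed: %s\n", dlerror());
			exit(1);
		}
		/* ...one dlsym() per interface git uses, ~15 in total... */
		dyn_curl_easy_init =
			(void *(*)(void))dlsym(curl_lib, "curl_easy_init");
		dyn_curl_easy_cleanup =
			(void (*)(void *))dlsym(curl_lib, "curl_easy_cleanup");
		if (!dyn_curl_easy_init || !dyn_curl_easy_cleanup) {
			fprintf(stderr, "dlsym failed: %s\n", dlerror());
			exit(1);
		}
	}

	int main(void)
	{
		void *handle;	/* really a CURL *, treated opaquely here */

		lazy_load_curl();
		handle = dyn_curl_easy_init();
		dyn_curl_easy_cleanup(handle);
		return 0;
	}

(Build with -ldl; every additional interface means another function-pointer
declaration and another dlsym() call, which is exactly the maintenance cost
objected to above.)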

	   	   	     	       - Ted

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-24 22:54                             ` Theodore Tso
@ 2009-07-24 22:59                               ` Shawn O. Pearce
  2009-07-24 23:28                                 ` Junio C Hamano
  2009-07-26 17:07                                 ` Avi Kivity
  0 siblings, 2 replies; 129+ messages in thread
From: Shawn O. Pearce @ 2009-07-24 22:59 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Linus Torvalds, Carlos R. Mafra, Junio C Hamano, Git Mailing List

Theodore Tso <tytso@mit.edu> wrote:
> On Fri, Jul 24, 2009 at 02:21:20PM -0700, Linus Torvalds wrote:
> > 
> > I wonder if there is some way to only load the crazy curl stuff when we 
> > actually want to open an http: connection.
> 
> Well, we could use dlopen(), but I'm not sure that qualifies as a
> _sane_ solution --- especially given that there are approximately 15
> interfaces used by git, that we'd have to resolve using dlsym().

Yea, that's not sane.

Probably the better approach is to have git fetch and git push be a
different binary from main git, so we only pay the libcurl loading
overheads when we hit transport.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-24 22:59                               ` Shawn O. Pearce
@ 2009-07-24 23:28                                 ` Junio C Hamano
  2009-07-26 17:07                                 ` Avi Kivity
  1 sibling, 0 replies; 129+ messages in thread
From: Junio C Hamano @ 2009-07-24 23:28 UTC (permalink / raw)
  To: Shawn O. Pearce
  Cc: Theodore Tso, Linus Torvalds, Carlos R. Mafra, Junio C Hamano,
	Git Mailing List

"Shawn O. Pearce" <spearce@spearce.org> writes:

> Theodore Tso <tytso@mit.edu> wrote:
>> On Fri, Jul 24, 2009 at 02:21:20PM -0700, Linus Torvalds wrote:
>> > 
>> > I wonder if there is some way to only load the crazy curl stuff when we 
>> > actually want to open an http: connection.
>> 
>> Well, we could use dlopen(), but I'm not sure that qualifies as a
>> _sane_ solution --- especially given that there are approximately 15
>> interfaces used by git, that we'd have to resolve using dlsym().
>
> Yea, that's not sane.
>
> Probably the better approach is to have git fetch and git push be a
> different binary from main git, so we only pay the libcurl loading
> overheads when we hit transport.

Even though that will still hurt people who do not use http, I think it
would be the right approach (in the sense that it should not be too painful
and would bring a reasonable gain for local-only operations).

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-24 21:21                           ` Linus Torvalds
  2009-07-24 22:13                             ` Linus Torvalds
  2009-07-24 22:54                             ` Theodore Tso
@ 2009-07-24 23:46                             ` Carlos R. Mafra
  2009-07-25  0:41                               ` Carlos R. Mafra
  2 siblings, 1 reply; 129+ messages in thread
From: Carlos R. Mafra @ 2009-07-24 23:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List

Sorry for the delay and for missing the "strace -ttT" request,
but today was a "Physics" day and it took me longer to 
notice your email.

On Fri 24.Jul'09 at 14:21:20 -0700, Linus Torvalds wrote:
> 
> What a huge difference!
> 
> And the NO_CURL version really does load a lot faster in cold-cache. We're 
> not talking small differences:

With NO_CURL=1 the strace log contained 242 lines (vs 404), but
the time difference was not as great as the one you got. It was still
better:

0.55 +- 0.06 (for 8 runs)

So I repeated the tests with curl enabled and this time
I got:

0.77 +- 0.03 (for 6 runs)

(yesterday I got 0.61 +- 0.08, so there is lot of noise)

So it is better, but not by the same factor as you saw.
But I may have an explanation for this.

After I clear the cache I wait a few seconds for things to stabilize,
and I do the 'time git branch' test only when I see that
there is no disk activity, by watching the 'btrace' output
in another xterm. 

I noticed that after dropping the cache and before
I do the test there is a lot of activity from something
called 'preload', with lines which look like these:

8,0  0  42881   495.067655112 17777  Q   R 51244367 + 552 [preload]
8,0  0  42882   495.067659931 17777  G   R 51244367 + 552 [preload]
8,0  0  42883   495.067664401 17777  I   R 51244367 + 552 [preload]

I hadn't noticed this before and now I checked that,

"preload is an adaptive readahead daemon that prefetches files mapped by
applications from the disk to reduce application startup time."

So I guess that my tests here of your NO_CURL=1 idea are inconclusive,
as I am not sure what preload is prefetching.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-24 23:46                             ` Carlos R. Mafra
@ 2009-07-25  0:41                               ` Carlos R. Mafra
  2009-07-25 18:04                                 ` Linus Torvalds
  0 siblings, 1 reply; 129+ messages in thread
From: Carlos R. Mafra @ 2009-07-25  0:41 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List

On Sat 25.Jul'09 at  1:46:48 +0200, Carlos R. Mafra wrote:
> 
> So I guess that my tests here of your NO_CURL=1 idea are inconclusive,
> as I am not sure what preload is prefetching.

Ok, so I killed /usr/sbin/preload and did the tests again. The 
results were much more stable, with average 0.40 vs 0.79
(NO_CURL=1 being faster). The pagefaults were pretty stable too,
(40major+654minor vs 12major+401minor). 

I will use NO_CURL=1 from now on!

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-24 22:46                                   ` david
@ 2009-07-25  2:39                                     ` Linus Torvalds
  2009-07-25  2:53                                       ` Daniel Barkalow
  0 siblings, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2009-07-25  2:39 UTC (permalink / raw)
  To: david
  Cc: Junio C Hamano, Git Mailing List, Carlos R. Mafra,
	Daniel Barkalow, Johannes Schindelin



On Fri, 24 Jul 2009, david@lang.hm wrote:

> On Fri, 24 Jul 2009, Linus Torvalds wrote:
> 
> > On Fri, 24 Jul 2009, david@lang.hm wrote:
> > > 
> > > what does the performance look like if you just do a static compile
> > > instead?
> > 
> > I don't even know - I don't have a static version of curl. I could install
> > one, of course, but since I don't think that's the solution anyway, I'm
> > not going to bother.
> 
> I wasn't thinking a static version of curl, I was thinking a static version of
> the git binaries, to see how fast things could be if no startup linking was
> necessary.

Well, that's what I meant. If I add '-static' to the link flags, I get

	/usr/bin/ld: cannot find -lcurl
	collect2: ld returned 1 exit status

because I simply don't have a static library version of curl (and if I do 
NO_CURL, I fail the link due to not having a static version of zlib).

That's what I meant by "I could install a static version of curl" - I 
could install the debug libraries, but it just isn't a normal thing to do 
on any modern distribution. The right thing to do really would be to not 
have -lcurl for the main git binary at all.

Preferably done by having http walking handled by an external process (the 
way we already do rsync), but it's probably easier to just make all the 
clone/fetch/ls-remote things be a separate binary.

Of course, I'd personally solve the problem with NO_CURL=1, but that's 
probably not acceptable in general.

			Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-25  2:39                                     ` Linus Torvalds
@ 2009-07-25  2:53                                       ` Daniel Barkalow
  0 siblings, 0 replies; 129+ messages in thread
From: Daniel Barkalow @ 2009-07-25  2:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: david, Junio C Hamano, Git Mailing List, Carlos R. Mafra,
	Johannes Schindelin

On Fri, 24 Jul 2009, Linus Torvalds wrote:

> On Fri, 24 Jul 2009, david@lang.hm wrote:
> 
> > On Fri, 24 Jul 2009, Linus Torvalds wrote:
> > 
> > > On Fri, 24 Jul 2009, david@lang.hm wrote:
> > > > 
> > > > what does the performance look like if you just do a static compile
> > > > instead?
> > > 
> > > I don't even know - I don't have a static version of curl. I could install
> > > one, of course, but since I don't think that's the solution anyway, I'm
> > > not going to bother.
> > 
> > I wasn't thinking a static version of curl, I was thinking a static version of
> > the git binaries, to see how fast things could be if no startup linking was
> > necessary.
> 
> Well, that's what I meant. If I add '-static' to the link flags, I get
> 
> 	/usr/bin/ld: cannot find -lcurl
> 	collect2: ld returned 1 exit status
> 
> because I simply don't have a static library version of curl (and if I do 
> NO_CURL, I fail the link due to not having a static version of zlib).
> 
> That's what I meant by "I could install a static version of curl" - I 
> could install the debug libraries, but it just isn't a normal thing to do 
> on any modern distribution. The right thing to do really would be to not 
> have -lcurl for the main git binary at all.
> 
> Preferably done by having http walking handled by an external process (the 
> way we already do rsync), but it's probably easier to just make all the 
> clone/fetch/ls-remote things be a separate binary.

I think it's actually easy enough to have a separate binary to handle the 
http walking, particularly since I've got code lying around to handle 
importing from a foreign VCS with a separate binary that I can just remove 
some of the features from.

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-25  0:41                               ` Carlos R. Mafra
@ 2009-07-25 18:04                                 ` Linus Torvalds
  2009-07-25 18:57                                   ` Timo Hirvonen
  0 siblings, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2009-07-25 18:04 UTC (permalink / raw)
  To: Carlos R. Mafra; +Cc: Junio C Hamano, Git Mailing List



On Sat, 25 Jul 2009, Carlos R. Mafra wrote:
> 
> Ok, so I killed /usr/sbin/preload and did the tests again. The 
> results were much more stable, with average 0.40 vs 0.79
> (NO_CURL=1 being faster). The pagefaults were pretty stable too,
> (40major+654minor vs 12major+401minor). 
> 
> I will use NO_CURL=1 from now on!

I find it interesting that this whole NO_CURL issue is actually a 
lot more noticeable for me in the hot-cache case than all the other 'git 
branch' issues were.

I went back to a version from a few days ago (before all the optimizations), 
and on my machine with a hot cache I get (for my kernel repo - I don't 
use branches there, but I have an old 'akpm' branch for taking an emailed 
patch series from Andrew):

	[torvalds@nehalem linux]$ time ~/git/git branch
	  akpm
	* master

	real	0m0.005s
	user	0m0.004s
	sys	0m0.000s

so it's five milliseconds. Big deal, fast enough, right?

Ok, so fast-forward to today, with the optimizations to builtin-branch.c:

	[torvalds@nehalem linux]$ time ~/git/git branch
	  akpm
	* master

	real	0m0.004s
	user	0m0.000s
	sys	0m0.004s

Woot! I shaved a millisecond off it by avoiding all those page faults and 
object lookups. Good, but hey, all that unnecessary lookup was just a 25% 
cost.

So let's build it with NO_CURL:

	[torvalds@nehalem linux]$ time ~/git/git branch
	  akpm
	* master

	real	0m0.002s
	user	0m0.000s
	sys	0m0.000s

Heh. The whole NO_CURL=1 thing is actually a _bigger_ optimization than 
anything else I did to git-branch. Cost of curl: 100%.

The difference in number of system calls and page faults is really quite 
staggering. System calls: 397->184, page faults: 619->293. Just from not 
doing that curl loading. No wonder performance actually doubles.

Now, I admit that 5ms vs 2ms probably doesn't really matter much, but 
dang, performance was a primary goal in git, so I'm a bit upset at how badly 
curl screwed us. Plus those things do add up when scripting, and 
those 300+ page faults are basically true for _all_ git programs.

So it's not just 'git branch': doing 'git show' shows the exact same 
thing: 6ms -> 4ms, 448->235 system calls, and 1549->1176 page faults.

So curl really must die. It may not matter for the expensive operations, 
but a lot of scripting is about running all those "cheap" things that just 
add up over time.

			Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-25 18:04                                 ` Linus Torvalds
@ 2009-07-25 18:57                                   ` Timo Hirvonen
  2009-07-25 19:06                                     ` Reece Dunn
                                                       ` (2 more replies)
  0 siblings, 3 replies; 129+ messages in thread
From: Timo Hirvonen @ 2009-07-25 18:57 UTC (permalink / raw)
  To: git; +Cc: Carlos R. Mafra, Junio C Hamano, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So curl really must die. It may not matter for the expensive operations, 
> but a lot of scripting is about running all those "cheap" things that just 
> add up over time.

SELinux is the problem, not curl.

On my Arch Linux machine:

   $ ldd bin/git
	linux-vdso.so.1 =>  (0x00007fff42306000)
	libcurl.so.4 => /usr/lib/libcurl.so.4 (0x00007f8714532000)
	libz.so.1 => /usr/lib/libz.so.1 (0x00007f871431d000)
	libcrypto.so.0.9.8 => /usr/lib/libcrypto.so.0.9.8 (0x00007f8713f8f000)
	libpthread.so.0 => /lib/libpthread.so.0 (0x00007f8713d74000)
	libc.so.6 => /lib/libc.so.6 (0x00007f8713a21000)
	librt.so.1 => /lib/librt.so.1 (0x00007f8713819000)
	libssl.so.0.9.8 => /usr/lib/libssl.so.0.9.8 (0x00007f87135ca000)
	libdl.so.2 => /lib/libdl.so.2 (0x00007f87133c6000)
	/lib/ld-linux-x86-64.so.2 (0x00007f8714778000)

Yours:

   [torvalds@nehalem git]$ ldd git
	linux-vdso.so.1 =>  (0x00007fff61da7000)
	libcurl.so.4 => /usr/lib64/libcurl.so.4 (0x00007f2f1a498000)
	libz.so.1 => /lib64/libz.so.1 (0x0000003cdb800000)
	libcrypto.so.8 => /usr/lib64/libcrypto.so.8 (0x0000003ba7a00000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003cdb400000)
	libc.so.6 => /lib64/libc.so.6 (0x0000003cda800000)
	libidn.so.11 => /lib64/libidn.so.11 (0x0000003ceaa00000)
	libssh2.so.1 => /usr/lib64/libssh2.so.1 (0x0000003ba8e00000)
	libldap-2.4.so.2 => /usr/lib64/libldap-2.4.so.2 (0x00007f2f1a250000)
	librt.so.1 => /lib64/librt.so.1 (0x0000003cdbc00000)
	libgssapi_krb5.so.2 => /usr/lib64/libgssapi_krb5.so.2 (0x0000003ce6e00000)
	libkrb5.so.3 => /usr/lib64/libkrb5.so.3 (0x0000003ce7e00000)
	libk5crypto.so.3 => /usr/lib64/libk5crypto.so.3 (0x0000003ce7200000)
	libcom_err.so.2 => /lib64/libcom_err.so.2 (0x0000003ce6a00000)
	libssl3.so => /lib64/libssl3.so (0x0000003490200000)
	libsmime3.so => /lib64/libsmime3.so (0x000000348fe00000)
	libnss3.so => /lib64/libnss3.so (0x000000348f600000)
	libplds4.so => /lib64/libplds4.so (0x0000003cbc800000)
	libplc4.so => /lib64/libplc4.so (0x0000003cbdc00000)
	libnspr4.so => /lib64/libnspr4.so (0x0000003cbd800000)
	libdl.so.2 => /lib64/libdl.so.2 (0x0000003cdb000000)
	/lib64/ld-linux-x86-64.so.2 (0x0000003cda400000)
	libssl.so.8 => /usr/lib64/libssl.so.8 (0x0000003ba7e00000)
	liblber-2.4.so.2 => /usr/lib64/liblber-2.4.so.2 (0x0000003ceee00000)
	libresolv.so.2 => /lib64/libresolv.so.2 (0x0000003ce5600000)
	libsasl2.so.2 => /usr/lib64/libsasl2.so.2 (0x00007f2f1a030000)
	libkrb5support.so.0 => /usr/lib64/libkrb5support.so.0 (0x0000003ce7a00000)
	libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x0000003ce7600000)
	libnssutil3.so => /lib64/libnssutil3.so (0x000000348fa00000)
	libcrypt.so.1 => /lib64/libcrypt.so.1 (0x00007f2f19df8000)
	libselinux.so.1 => /lib64/libselinux.so.1 (0x0000003cdc400000)
	libfreebl3.so => /lib64/libfreebl3.so (0x00007f2f19b99000)

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-25 18:57                                   ` Timo Hirvonen
@ 2009-07-25 19:06                                     ` Reece Dunn
  2009-07-25 20:31                                     ` Mike Hommey
  2009-07-25 21:04                                     ` Carlos R. Mafra
  2 siblings, 0 replies; 129+ messages in thread
From: Reece Dunn @ 2009-07-25 19:06 UTC (permalink / raw)
  To: Timo Hirvonen; +Cc: git, Carlos R. Mafra, Junio C Hamano

2009/7/25 Timo Hirvonen <tihirvon@gmail.com>:
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>> So curl really must die. It may not matter for the expensive operations,
>> but a lot of scripting is about running all those "cheap" things that just
>> add up over time.
>
> SELinux is the problem, not curl.
>
> On my Arch Linux machine:
>
>   $ ldd bin/git
>        linux-vdso.so.1 =>  (0x00007fff42306000)
>        libcurl.so.4 => /usr/lib/libcurl.so.4 (0x00007f8714532000)
>        libz.so.1 => /usr/lib/libz.so.1 (0x00007f871431d000)
>        libcrypto.so.0.9.8 => /usr/lib/libcrypto.so.0.9.8 (0x00007f8713f8f000)
>        libpthread.so.0 => /lib/libpthread.so.0 (0x00007f8713d74000)
>        libc.so.6 => /lib/libc.so.6 (0x00007f8713a21000)
>        librt.so.1 => /lib/librt.so.1 (0x00007f8713819000)
>        libssl.so.0.9.8 => /usr/lib/libssl.so.0.9.8 (0x00007f87135ca000)
>        libdl.so.2 => /lib/libdl.so.2 (0x00007f87133c6000)
>        /lib/ld-linux-x86-64.so.2 (0x00007f8714778000)

It will depend on which dependencies curl was built with. BLFS
(http://www.linuxfromscratch.org/blfs/view/stable/basicnet/curl.html)
lists the following dependencies:

    pkg-config-0.22
    OpenSSL-0.9.8g  or GnuTLS-1.6.3
    OpenLDAP-2.3.39
    libidn-0.6.14
    MIT Kerberos V5-1.6 or Heimdal-1.1
    krb4
    SPNEGO
    c-ares

and the dependencies of those packages and so forth.

On Ubuntu 9.04, I get:

$ ldd /usr/bin/git
	linux-gate.so.1 =>  (0xb80ae000)
	libcurl-gnutls.so.4 => /usr/lib/libcurl-gnutls.so.4 (0xb805b000)
	libz.so.1 => /lib/libz.so.1 (0xb8045000)
	libpthread.so.0 => /lib/tls/i686/cmov/libpthread.so.0 (0xb802b000)
	libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0xb7ec8000)
	libidn.so.11 => /usr/lib/libidn.so.11 (0xb7e95000)
	liblber-2.4.so.2 => /usr/lib/liblber-2.4.so.2 (0xb7e87000)
	libldap_r-2.4.so.2 => /usr/lib/libldap_r-2.4.so.2 (0xb7e43000)
	librt.so.1 => /lib/tls/i686/cmov/librt.so.1 (0xb7e39000)
	libgssapi_krb5.so.2 => /usr/lib/libgssapi_krb5.so.2 (0xb7e0e000)
	libgnutls.so.26 => /usr/lib/libgnutls.so.26 (0xb7d71000)
	libtasn1.so.3 => /usr/lib/libtasn1.so.3 (0xb7d5f000)
	libgcrypt.so.11 => /lib/libgcrypt.so.11 (0xb7cf6000)
	/lib/ld-linux.so.2 (0xb80af000)
	libresolv.so.2 => /lib/tls/i686/cmov/libresolv.so.2 (0xb7ce0000)
	libsasl2.so.2 => /usr/lib/libsasl2.so.2 (0xb7cc7000)
	libdl.so.2 => /lib/tls/i686/cmov/libdl.so.2 (0xb7cc3000)
	libkrb5.so.3 => /usr/lib/libkrb5.so.3 (0xb7c31000)
	libk5crypto.so.3 => /usr/lib/libk5crypto.so.3 (0xb7c0d000)
	libcom_err.so.2 => /lib/libcom_err.so.2 (0xb7c09000)
	libkrb5support.so.0 => /usr/lib/libkrb5support.so.0 (0xb7bff000)
	libkeyutils.so.1 => /lib/libkeyutils.so.1 (0xb7bfb000)
	libgpg-error.so.0 => /lib/libgpg-error.so.0 (0xb7bf7000)

- Reece

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-25 18:57                                   ` Timo Hirvonen
  2009-07-25 19:06                                     ` Reece Dunn
@ 2009-07-25 20:31                                     ` Mike Hommey
  2009-07-25 21:02                                       ` Linus Torvalds
  2009-07-25 21:04                                     ` Carlos R. Mafra
  2 siblings, 1 reply; 129+ messages in thread
From: Mike Hommey @ 2009-07-25 20:31 UTC (permalink / raw)
  To: Timo Hirvonen; +Cc: git, Carlos R. Mafra, Junio C Hamano

On Sat, Jul 25, 2009 at 09:57:39PM +0300, Timo Hirvonen wrote:
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > So curl really must die. It may not matter for the expensive operations, 
> > but a lot of scripting is about running all those "cheap" things that just 
> > add up over time.
> 
> SELinux is the problem, not curl.

I think it's NSS that's the problem, not SELinux. Linus's libcurl is built
against NSS, which is the default on Fedora.

Mike

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-25 20:31                                     ` Mike Hommey
@ 2009-07-25 21:02                                       ` Linus Torvalds
  2009-07-25 21:13                                         ` Linus Torvalds
  2009-07-26  7:54                                         ` Mike Hommey
  0 siblings, 2 replies; 129+ messages in thread
From: Linus Torvalds @ 2009-07-25 21:02 UTC (permalink / raw)
  To: Mike Hommey; +Cc: Timo Hirvonen, git, Carlos R. Mafra, Junio C Hamano



On Sat, 25 Jul 2009, Mike Hommey wrote:

> On Sat, Jul 25, 2009 at 09:57:39PM +0300, Timo Hirvonen wrote:
> > Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > 
> > > So curl really must die. It may not matter for the expensive operations, 
> > > but a lot of scripting is about running all those "cheap" things that just 
> > > add up over time.
> > 
> > SELinux is the problem, not curl.
> 
> I think it's NSS that's the problem, not SELinux. Linus's libcurl is built
> against NSS, which is the default on Fedora.

Well, it kind of doesn't matter. The fact is, libcurl is a bloated 
monster, and adds zero to 99% of what git people do.

The fact that apparently sometimes it's less bloated than other times 
doesn't really change anything fundamental, does it?

			Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-25 18:57                                   ` Timo Hirvonen
  2009-07-25 19:06                                     ` Reece Dunn
  2009-07-25 20:31                                     ` Mike Hommey
@ 2009-07-25 21:04                                     ` Carlos R. Mafra
  2 siblings, 0 replies; 129+ messages in thread
From: Carlos R. Mafra @ 2009-07-25 21:04 UTC (permalink / raw)
  To: Timo Hirvonen; +Cc: git, Junio C Hamano

On Sat 25.Jul'09 at 21:57:39 +0300, Timo Hirvonen wrote:
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > So curl really must die. It may not matter for the expensive operations, 
> > but a lot of scripting is about running all those "cheap" things that just 
> > add up over time.
> 
> SELinux is the problem, not curl.

I don't have SELinux, and without curl it takes ~50% less time (on
top of Linus' previous optimizations!).

The time to open() all the libs really adds up to a considerable 
fraction (when the total time is low, not when compared to the 
huge 6 secs from before):

Without curl:
[mafra@Pilar:linux-2.6]$ grep open strace-nocurl.log |grep lib \
> | awk -F "<" '{print $2}' | sed s/\>// | awk '{s += $1} END {print s}'
0.070104

With curl:
[mafra@Pilar:linux-2.6]$ grep open strace-curl.log |grep lib \
> | awk -F "<" '{print $2}' | sed s/\>// | awk '{s += $1} END {print s}'
0.249764

PS: It is interesting that on my laptop the time required
to open libcurl alone is 20x the total time of 'git branch' for Linus
on his supercomputer:
open("/usr/lib64/libcurl.so.4", O_RDONLY) = 3 <0.066239>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-25 21:02                                       ` Linus Torvalds
@ 2009-07-25 21:13                                         ` Linus Torvalds
  2009-07-25 23:23                                           ` Johannes Schindelin
  2009-07-26  7:54                                         ` Mike Hommey
  1 sibling, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2009-07-25 21:13 UTC (permalink / raw)
  To: Mike Hommey; +Cc: Timo Hirvonen, git, Carlos R. Mafra, Junio C Hamano



On Sat, 25 Jul 2009, Linus Torvalds wrote:
>
> The fact that apparently sometimes it's less bloated than other times 
> doesn't really change anything fundamental, does it?

Btw, does anybody know how/why libdl seems to get linked in too?

We're not doing -ldl, and I'm not seeing any need for it, but it's 
definitely there on fedora, at least.

It seems to come from libcrypto. I can get rid of it with NO_OPENSSL, and 
that cuts down on the number of system calls in my startup by 16 (getting 
rid of both libcrypto and libdl). I wonder if there is some way to get the 
optimized openssl sha1 routines _without_ that silly libdl thing.

		Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-25 21:13                                         ` Linus Torvalds
@ 2009-07-25 23:23                                           ` Johannes Schindelin
  2009-07-26  4:49                                             ` Linus Torvalds
  0 siblings, 1 reply; 129+ messages in thread
From: Johannes Schindelin @ 2009-07-25 23:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mike Hommey, Timo Hirvonen, git, Carlos R. Mafra, Junio C Hamano

Hi,

On Sat, 25 Jul 2009, Linus Torvalds wrote:

> On Sat, 25 Jul 2009, Linus Torvalds wrote:
> >
> > The fact that apparently sometimes it's less bloated than other times 
> > doesn't really change anything fundamental, does it?
> 
> Btw, does anybody know how/why libdl seems to get linked in too?
> 
> We're not doing -ldl, and I'm not seeing any need for it, but it's 
> definitely there on fedora, at least.
> 
> It seems to come from libcrypto. I can get rid of it with NO_OPENSSL, and 
> that cuts down on the number of system calls in my startup by 16 (getting 
> rid of both libcrypto and libdl). I wonder if there is some way to get the 
> optimized openssl sha1 routines _without_ that silly ldl thing.

OpenSSL allows for so-called engines implementing certain algorithms.  
These engines are dynamic libraries, loaded via dlopen().

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-25 23:23                                           ` Johannes Schindelin
@ 2009-07-26  4:49                                             ` Linus Torvalds
  2009-07-26 16:29                                               ` Theodore Tso
  0 siblings, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2009-07-26  4:49 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Mike Hommey, Timo Hirvonen, git, Carlos R. Mafra, Junio C Hamano



On Sun, 26 Jul 2009, Johannes Schindelin wrote:
> > 
> > It seems to come from libcrypto. I can get rid of it with NO_OPENSSL, and 
> > that cuts down on the number of system calls in my startup by 16 (getting 
> > rid of both libcrypto and libdl). I wonder if there is some way to get the 
> > optimized openssl sha1 routines _without_ that silly ldl thing.
> 
> OpenSSL allows for so-called engines implementing certain algorithms.  
> These engines are dynamic libraries, loaded via dlopen().

Ah. Ok, that explains it.

It's a bit sad, since the _only_ thing we load all of libcrypto for is the 
(fairly trivial) SHA1 code. 

But at the same time, last time I benchmarked the different SHA1 
libraries, the openssl one was the fastest. I think it has tuned assembly 
language for most architectures. Our regular mozilla-based C code is 
perfectly fine, but it doesn't hold a candle to assembler tuning.

Oh well. 

		Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-25 21:02                                       ` Linus Torvalds
  2009-07-25 21:13                                         ` Linus Torvalds
@ 2009-07-26  7:54                                         ` Mike Hommey
  2009-07-26 10:16                                           ` Johannes Schindelin
  1 sibling, 1 reply; 129+ messages in thread
From: Mike Hommey @ 2009-07-26  7:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Timo Hirvonen, git, Carlos R. Mafra, Junio C Hamano

On Sat, Jul 25, 2009 at 02:02:19PM -0700, Linus Torvalds wrote:
> 
> 
> On Sat, 25 Jul 2009, Mike Hommey wrote:
> 
> > On Sat, Jul 25, 2009 at 09:57:39PM +0300, Timo Hirvonen wrote:
> > > Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > > 
> > > > So curl really must die. It may not matter for the expensive operations, 
> > > > but a lot of scripting is about running all those "cheap" things that just 
> > > > add up over time.
> > > 
> > > SELinux is the problem, not curl.
> > 
> > I think it's NSS, the problem, not SELinux. Linus's libcurl is built
> > against NSS, which is the default on Fedora.
> 
> Well, it kind of doesn't matter. The fact is, libcurl is a bloated 
> monster, and adds zero to 99% of what git people do.

Especially considering the http transport fails to be useful in various
scenarios.

Mike

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-26  7:54                                         ` Mike Hommey
@ 2009-07-26 10:16                                           ` Johannes Schindelin
  2009-07-26 10:23                                             ` demerphq
  0 siblings, 1 reply; 129+ messages in thread
From: Johannes Schindelin @ 2009-07-26 10:16 UTC (permalink / raw)
  To: Mike Hommey
  Cc: Linus Torvalds, Timo Hirvonen, git, Carlos R. Mafra,
	Junio C Hamano

Hi,

On Sun, 26 Jul 2009, Mike Hommey wrote:

> On Sat, Jul 25, 2009 at 02:02:19PM -0700, Linus Torvalds wrote:
> > 
> > 
> > On Sat, 25 Jul 2009, Mike Hommey wrote:
> > 
> > > On Sat, Jul 25, 2009 at 09:57:39PM +0300, Timo Hirvonen wrote:
> > > > Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > > > 
> > > > > So curl really must die. It may not matter for the expensive operations, 
> > > > > but a lot of scripting is about running all those "cheap" things that just 
> > > > > add up over time.
> > > > 
> > > > SELinux is the problem, not curl.
> > > 
> > > I think it's NSS, the problem, not SELinux. Linus's libcurl is built
> > > against NSS, which is the default on Fedora.
> > 
> > Well, it kind of doesn't matter. The fact is, libcurl is a bloated 
> > monster, and adds zero to 99% of what git people do.
> 
> Especially consideting the http transport fails to be useful in various
> scenarios.

I beg your pardon?  Maybe "s/useful/desirable/"?

In many scenarios, http transport is the _last resort_ against overzealous 
administrators.  The fact that you might be lucky enough not to need that 
resort is a blessing, and does not give you the right to ridicule those 
who are unfortunate enough not to share your good luck.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-26 10:16                                           ` Johannes Schindelin
@ 2009-07-26 10:23                                             ` demerphq
  2009-07-26 10:27                                               ` demerphq
  0 siblings, 1 reply; 129+ messages in thread
From: demerphq @ 2009-07-26 10:23 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Mike Hommey, Linus Torvalds, Timo Hirvonen, git, Carlos R. Mafra,
	Junio C Hamano

2009/7/26 Johannes Schindelin <Johannes.Schindelin@gmx.de>:
> Hi,
>
> On Sun, 26 Jul 2009, Mike Hommey wrote:
>
>> On Sat, Jul 25, 2009 at 02:02:19PM -0700, Linus Torvalds wrote:
>> >
>> >
>> > On Sat, 25 Jul 2009, Mike Hommey wrote:
>> >
>> > > On Sat, Jul 25, 2009 at 09:57:39PM +0300, Timo Hirvonen wrote:
>> > > > Linus Torvalds <torvalds@linux-foundation.org> wrote:
>> > > >
>> > > > > So curl really must die. It may not matter for the expensive operations,
>> > > > > but a lot of scripting is about running all those "cheap" things that just
>> > > > > add up over time.
>> > > >
>> > > > SELinux is the problem, not curl.
>> > >
>> > > I think it's NSS, the problem, not SELinux. Linus's libcurl is built
>> > > against NSS, which is the default on Fedora.
>> >
>> > Well, it kind of doesn't matter. The fact is, libcurl is a bloated
>> > monster, and adds zero to 99% of what git people do.
>>
>> Especially consideting the http transport fails to be useful in various
>> scenarios.
>
> I beg your pardon?  Maybe "s/useful/desirable/"?
>
> In many scenarios, http transport is the _last resort_ against overzealous
> administrators.  The fact that you might be lucky enough not to need that
> resort is a blessing, and does not give you the right to ridicule those
> who are unfortunate enough not to share your good luck.

I think he meant that it is buggy and does not work correctly in
various scenarios.

Eg: Last I checked it couldn't handle repos where the main branch
wasn't called master, and I've seen other messages that make me think
it doesn't work correctly on edge cases.

cheers,
Yves



-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-26 10:23                                             ` demerphq
@ 2009-07-26 10:27                                               ` demerphq
  0 siblings, 0 replies; 129+ messages in thread
From: demerphq @ 2009-07-26 10:27 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Mike Hommey, Linus Torvalds, Timo Hirvonen, git, Carlos R. Mafra,
	Junio C Hamano

2009/7/26 demerphq <demerphq@gmail.com>:
> 2009/7/26 Johannes Schindelin <Johannes.Schindelin@gmx.de>:
>> Hi,
>>
>> On Sun, 26 Jul 2009, Mike Hommey wrote:
>>
>>> On Sat, Jul 25, 2009 at 02:02:19PM -0700, Linus Torvalds wrote:
>>> >
>>> >
>>> > On Sat, 25 Jul 2009, Mike Hommey wrote:
>>> >
>>> > > On Sat, Jul 25, 2009 at 09:57:39PM +0300, Timo Hirvonen wrote:
>>> > > > Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>> > > >
>>> > > > > So curl really must die. It may not matter for the expensive operations,
>>> > > > > but a lot of scripting is about running all those "cheap" things that just
>>> > > > > add up over time.
>>> > > >
>>> > > > SELinux is the problem, not curl.
>>> > >
>>> > > I think it's NSS, the problem, not SELinux. Linus's libcurl is built
>>> > > against NSS, which is the default on Fedora.
>>> >
>>> > Well, it kind of doesn't matter. The fact is, libcurl is a bloated
>>> > monster, and adds zero to 99% of what git people do.
>>>
>>> Especially consideting the http transport fails to be useful in various
>>> scenarios.
>>
>> I beg your pardon?  Maybe "s/useful/desirable/"?
>>
>> In many scenarios, http transport is the _last resort_ against overzealous
>> administrators.  The fact that you might be lucky enough not to need that
>> resort is a blessing, and does not give you the right to ridicule those
>> who are unfortunate enough not to share your good luck.
>
> I think he meant that it is buggy and does not work correctly in
> various scenarios.
>
> Eg: Last I checked it couldn't handle repos where the main branch
> wasn't called master, and I've seen other messages that make me think
> it doesn't work correctly on edge cases.

Er, I meant that to go to Johannes directly, not to spam the list or
the cc's with my hazy recollection, and I should have added: "but
perhaps I'm confusing http and rsync".

Sorry for the noise.

Yves



-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-26  4:49                                             ` Linus Torvalds
@ 2009-07-26 16:29                                               ` Theodore Tso
  0 siblings, 0 replies; 129+ messages in thread
From: Theodore Tso @ 2009-07-26 16:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Schindelin, Mike Hommey, Timo Hirvonen, git,
	Carlos R. Mafra, Junio C Hamano

On Sat, Jul 25, 2009 at 09:49:41PM -0700, Linus Torvalds wrote:
> 
> But at the same time, last time I benchmarked the different SHA1 
> libraries, the openssl one was the fastest. I think it has tuned assembly 
> language for most architectures. Our regular mozilla-based C code is 
> perfectly fine, but it doesn't hold a candle to assembler tuning.

So maybe git should import the SHA1 code into its own source base?
It's not like the SHA1 code changes often, or is likely to have
security issues (at least, not buffer overruns; if SHA1 gets thoroughly
broken we might have to change algorithms, but that's a different
kettle of fish :-).

					- Ted

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-24 22:59                               ` Shawn O. Pearce
  2009-07-24 23:28                                 ` Junio C Hamano
@ 2009-07-26 17:07                                 ` Avi Kivity
  2009-07-26 17:16                                   ` Johannes Schindelin
  1 sibling, 1 reply; 129+ messages in thread
From: Avi Kivity @ 2009-07-26 17:07 UTC (permalink / raw)
  To: Shawn O. Pearce
  Cc: Theodore Tso, Linus Torvalds, Carlos R. Mafra, Junio C Hamano,
	Git Mailing List

On 07/25/2009 01:59 AM, Shawn O. Pearce wrote:
> Theodore Tso<tytso@mit.edu>  wrote:
>    
>> On Fri, Jul 24, 2009 at 02:21:20PM -0700, Linus Torvalds wrote:
>>      
>>> I wonder if there is some way to only load the crazy curl stuff when we
>>> actually want open a http: connection.
>>>        
>> Well, we could use dlopen(), but I'm not sure that qualifies as a
>> _sane_ solution --- especially given that there are approximately 15
>> interfaces used by git, that we'd have to resolve using dlsym().
>>      
>
> Yea, that's not sane.
>
> Probably the better approach is to have git fetch and git push be a
> different binary from main git, so we only pay the libcurl loading
> overheads when we hit transport.
>    

Or make the transports shared libraries, and use dlopen() to open the 
transport and dlsym() to resolve the struct transport object exported by 
the library.
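
Roughly this sort of thing (purely a hypothetical sketch: the plugin
path, the exported symbol name and the vtable layout below are all
invented for illustration, not git's actual transport code):

#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical plugin interface, not the real git transport struct. */
struct transport_vtable {
	int (*fetch)(const char *url, const char *refspec);
	int (*push)(const char *url, const char *refspec);
};

static struct transport_vtable *load_transport(const char *name)
{
	char path[256];
	void *handle;

	/* Install path and exported symbol name are made up. */
	snprintf(path, sizeof(path),
		 "/usr/lib/git-core/transport-%s.so", name);
	handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
	if (!handle) {
		fprintf(stderr, "cannot load %s: %s\n", path, dlerror());
		return NULL;
	}
	/* The plugin exports its vtable as a plain global object. */
	return dlsym(handle, "transport_vtable");
}

Then libcurl (and the NSS/OpenSSL stack behind it) only gets mapped when
an http URL is actually used.  The flip side is that git itself would
then link against libdl and do its own symbol lookups, so it mostly
moves the cost around rather than removing it.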

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-26 17:07                                 ` Avi Kivity
@ 2009-07-26 17:16                                   ` Johannes Schindelin
  0 siblings, 0 replies; 129+ messages in thread
From: Johannes Schindelin @ 2009-07-26 17:16 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Shawn O. Pearce, Theodore Tso, Linus Torvalds, Carlos R. Mafra,
	Junio C Hamano, Git Mailing List

Hi,

On Sun, 26 Jul 2009, Avi Kivity wrote:

> On 07/25/2009 01:59 AM, Shawn O. Pearce wrote:
> > Theodore Tso<tytso@mit.edu>  wrote:
> >    
> > > On Fri, Jul 24, 2009 at 02:21:20PM -0700, Linus Torvalds wrote:
> > >      
> > > > I wonder if there is some way to only load the crazy curl stuff when we
> > > > actually want open a http: connection.
> > > >        
> > > Well, we could use dlopen(), but I'm not sure that qualifies as a
> > > _sane_ solution --- especially given that there are approximately 15
> > > interfaces used by git, that we'd have to resolve using dlsym().
> > >      
> >
> > Yea, that's not sane.
> >
> > Probably the better approach is to have git fetch and git push be a
> > different binary from main git, so we only pay the libcurl loading
> > overheads when we hit transport.
> >    
> 
> Or make the transports shared libraries, and use dlopen() to open the
> transport and dlsym() to resolve the struct transport object exported by the
> library.

... and introduce all kinds of braindamage to the Makefile so we can 
properly compile .dll files on Windows?

Umm, thanks, but no.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
@ 2009-07-26 23:21 George Spelvin
  2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin
  0 siblings, 1 reply; 129+ messages in thread
From: George Spelvin @ 2009-07-26 23:21 UTC (permalink / raw)
  To: git, torvalds; +Cc: linux

> It's a bit sad, since the _only_ thing we load all of libcrypto for is the 
> (fairly trivial) SHA1 code. 
>
> But at the same time, last time I benchmarked the different SHA1 
> libraries, the openssl one was the fastest. I think it has tuned assembly 
> language for most architectures. Our regular mozilla-based C code is 
> perfectly fine, but it doesn't hold a candle to assembler tuning.

Actually, openssl only has assembly for x86, x86_64, and ia64.
Truthfully, once you have 32 registers, SHA1 is awfully easy to
compile near-optimally.

Git currently includes some hand-tuned assembly that isn't in OpenSSL:
- ARM (only 16 registers, and the rotate+op support can be used nicely)
- PPC (3-way superscalar *without* OO execution benefits from careful
  scheduling)

Further, all of the core hand-tuned SHA1 assembly code in OpenSSL is by
Andy Polyakov and is dual-licensed GPL/3-clause BSD *in addition to*
the OpenSSL license.  So we can just import it:

See http://www.openssl.org/~appro/cryptogams/
and http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz

(Ooh, look, he has PPC code in there, too.  Does anyone with a PPC machine
want to compare it with Git's?)

It'll take some massaging because that's just the core SHA1_Transform
function and not the wrappers, but it's hardly a heroic effort.
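
For concreteness, the massaging is basically a small streaming wrapper
around the block function.  A rough, untested sketch, assuming the
cryptogams entry point declared below (the context struct and helper
names here are placeholders, not git's actual API):

#include <stdint.h>
#include <string.h>

void sha1_block_data_order(uint32_t iv[5], void const *in, unsigned num);

struct sha1_ctx {
	uint32_t h[5];
	uint64_t nbytes;
	uint8_t buf[64];
};

static void sha1_init(struct sha1_ctx *c)
{
	static const uint32_t iv[5] = {
		0x67452301, 0xefcdab89, 0x98badcfe, 0x10325476, 0xc3d2e1f0
	};
	memcpy(c->h, iv, sizeof(iv));
	c->nbytes = 0;
}

static void sha1_update(struct sha1_ctx *c, const void *data, size_t len)
{
	const uint8_t *p = data;
	size_t fill = c->nbytes % 64;

	c->nbytes += len;
	if (fill) {			/* top up a partially filled block */
		size_t n = 64 - fill;
		if (n > len)
			n = len;
		memcpy(c->buf + fill, p, n);
		p += n;
		len -= n;
		if (fill + n < 64)
			return;
		sha1_block_data_order(c->h, c->buf, 1);
	}
	if (len >= 64) {		/* whole blocks straight from the input */
		sha1_block_data_order(c->h, p, (unsigned)(len / 64));
		p += len - len % 64;
		len %= 64;
	}
	memcpy(c->buf, p, len);		/* stash the tail */
}

static void sha1_final(struct sha1_ctx *c, uint8_t out[20])
{
	uint64_t bits = c->nbytes * 8;
	size_t fill = c->nbytes % 64;
	int i;

	c->buf[fill++] = 0x80;		/* pad: 0x80, zeros, 64-bit bit count */
	if (fill > 56) {
		memset(c->buf + fill, 0, 64 - fill);
		sha1_block_data_order(c->h, c->buf, 1);
		fill = 0;
	}
	memset(c->buf + fill, 0, 56 - fill);
	for (i = 0; i < 8; i++)		/* length in bits, big-endian */
		c->buf[56 + i] = (uint8_t)(bits >> (56 - 8 * i));
	sha1_block_data_order(c->h, c->buf, 1);
	for (i = 0; i < 20; i++)	/* big-endian digest output */
		out[i] = (uint8_t)(c->h[i / 4] >> (24 - 8 * (i % 4)));
}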

I'm pretty deep in the weeds at $DAY_JOB and can't get to it for a while,
but would a patch be appreciated?

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Request for benchmarking: x86 SHA1 code
  2009-07-26 23:21 Performance issue of 'git branch' George Spelvin
@ 2009-07-31 10:46 ` George Spelvin
  2009-07-31 11:11   ` Erik Faye-Lund
                     ` (8 more replies)
  0 siblings, 9 replies; 129+ messages in thread
From: George Spelvin @ 2009-07-31 10:46 UTC (permalink / raw)
  To: git; +Cc: linux

After studying Andy Polyakov's optimized x86 SHA-1 in OpenSSL, I've
got a version that's 1.6% slower on a P4 and 15% faster on a Phenom.
I'm curious about the performance on other CPUs I don't have access to,
particularly Core 2 duo and i7.

Could someone do some benchmarking for me?  Old (486/Pentium/P2/P3)
machines are also interesting, but I'm optimizing for newer ones.

I haven't packaged this nicely, but it's not that complicated.
- Download Andy's original code from
  http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz
- Unpack and cd to the cryptogams-0/x86 directory
- "patch < this_email" to create "sha1test.c", "sha1-586.h", "Makefile",
   and "sha1-x86.pl".
- "make" 
- Run ./586test (before) and ./x86test (after) and note the timings.

Thank you!

--- /dev/null	2009-05-12 02:55:38.579106460 -0400
+++ Makefile	2009-07-31 06:22:42.000000000 -0400
@@ -0,0 +1,16 @@
+CC := gcc
+CFLAGS := -m32 -W -Wall -Os -g
+ASFLAGS := --32
+
+all: 586test x86test
+
+586test : sha1test.c sha1-586.o
+	$(CC) $(CFLAGS) -o $@ sha1test.c sha1-586.o
+
+x86test : sha1test.c sha1-x86.o
+	$(CC) $(CFLAGS) -o $@ sha1test.c sha1-x86.o
+
+586test x86test : sha1-586.h
+
+%.s : %.pl x86asm.pl x86unix.pl
+	perl $< elf > $@
--- /dev/null	2009-05-12 02:55:38.579106460 -0400
+++ sha1test.c	2009-07-28 09:24:09.000000000 -0400
@@ -0,0 +1,67 @@
+#include <stdint.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <sys/time.h>
+
+#include "sha1-586.h"
+
+#define SIZE 1000000
+
+#if SIZE % 64
+# error SIZE must be a multiple of 64!
+#endif
+
+int
+main(void)
+{
+	uint32_t iv[5] = {
+		0x67452301, 0xefcdab89, 0x98badcfe, 0x10325476, 0xc3d2e1f0
+	};
+	/* Simplest known-answer test, "abc" */
+	static uint8_t const testbuf[64] = {
+		'a','b','c', 0x80, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+		0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+		0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+		0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 24
+	};
+	/* Expected: A9993E364706816ABA3E25717850C26C9CD0D89D */
+	static uint32_t const expected[5] = {
+		0xa9993e36, 0x4706816a, 0xba3e2571, 0x7850c26c, 0x9cd0d89d };
+	unsigned i;
+	char *p = malloc(SIZE);
+	struct timeval tv0, tv1;
+
+	if (!p) {
+		perror("malloc");
+		return 1;
+	}
+
+	sha1_block_data_order(iv, testbuf, 1);
+	printf("Result:  %08x %08x %08x %08x %08x\n"
+	       "Expected:%08x %08x %08x %08x %08x\n",
+		iv[0], iv[1], iv[2], iv[3], iv[4], expected[0],
+		expected[1], expected[2], expected[3], expected[4]);
+	for (i = 0; i < 5; i++)
+		if (iv[i] != expected[i])
+			printf("MISMATCH in word %u\n", i);
+
+	if (gettimeofday(&tv0, NULL) < 0) {
+		perror("gettimeofday");
+		return 1;
+	}
+	for (i = 0; i < 500; i++)
+		sha1_block_data_order(iv, p, SIZE/64);
+	if (gettimeofday(&tv1, NULL) < 0) {
+		perror("gettimeofday");
+		return 1;
+	}
+	tv1.tv_sec -= tv0.tv_sec;
+	tv1.tv_usec -= tv0.tv_usec;
+	if (tv1.tv_usec < 0) {
+		tv1.tv_sec--;
+		tv1.tv_usec += 1000000;
+	}
+	printf("%u bytes: %u.%06u s\n", i * SIZE, (unsigned)tv1.tv_sec,
+		(unsigned)tv1.tv_usec);
+	return 0;
+}
--- /dev/null	2009-05-12 02:55:38.579106460 -0400
+++ sha1-586.h	2009-07-27 09:34:03.000000000 -0400
@@ -0,0 +1,3 @@
+#include <stdint.h>
+
+void sha1_block_data_order(uint32_t iv[5], void const *in, unsigned len);
--- /dev/null	2009-05-12 02:55:38.579106460 -0400
+++ sha1-x86.pl	2009-07-31 06:10:18.000000000 -0400
@@ -0,0 +1,398 @@
+#!/usr/bin/env perl
+
+# ====================================================================
+# [Re]written by Andy Polyakov <appro@fy.chalmers.se> for the OpenSSL
+# project. The module is, however, dual licensed under OpenSSL and
+# CRYPTOGAMS licenses depending on where you obtain it. For further
+# details see http://www.openssl.org/~appro/cryptogams/.
+# ====================================================================
+
+# "[Re]written" was achieved in two major overhauls. In 2004 BODY_*
+# functions were re-implemented to address P4 performance issue [see
+# commentary below], and in 2006 the rest was rewritten in order to
+# gain freedom to liberate licensing terms.
+
+# It was noted that Intel IA-32 C compiler generates code which
+# performs ~30% *faster* on P4 CPU than original *hand-coded*
+# SHA1 assembler implementation. To address this problem (and
+# prove that humans are still better than machines:-), the
+# original code was overhauled, which resulted in following
+# performance changes:
+#
+#		compared with original	compared with Intel cc
+#		assembler impl.		generated code
+# Pentium	-16%			+48%
+# PIII/AMD	+8%			+16%
+# P4		+85%(!)			+45%
+#
+# As you can see Pentium came out as looser:-( Yet I reckoned that
+# improvement on P4 outweights the loss and incorporate this
+# re-tuned code to 0.9.7 and later.
+# ----------------------------------------------------------------
+#					<appro@fy.chalmers.se>
+
+$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
+push(@INC,"${dir}","${dir}../../perlasm");
+require "x86asm.pl";
+
+&asm_init($ARGV[0],"sha1-586.pl",$ARGV[$#ARGV] eq "386");
+
+$A="eax";
+$B="ebx";
+$C="ecx";
+$D="edx";
+$E="ebp";
+
+# Two temporaries
+$S="esi";
+$T="edi";
+
+# The round constants
+use constant K1 => 0x5a827999;
+use constant K2 => 0x6ED9EBA1;
+use constant K3 => 0x8F1BBCDC;
+use constant K4 => 0xCA62C1D6;
+
+@V=($A,$B,$C,$D,$E);
+
+# Given unlimited registers and functional units, it would be
+# possible to compute SHA-1 at two cycles per round, using 7
+# operations per round.  Remember, each round computes a new
+# value for E, which is used as A in the following round and B
+# in the round after that.  There are two critical paths:
+# - A must be rotated and added to the output E
+# - B must go through two boolean operations before being added
+#   to the result E.  Since this latter addition can't be done
+#   in the same-cycle as the critical addition of a<<<5, this is
+#   a total of 2+1+1 = 4 cycles.
+# Additionally, if you want to avoid copying B, you have to
+# rotate it soon after use in this round so it is ready for use
+# as the following round's C.
+
+# f = (a <<< 5) + e + K + in[i] + (d^(b&(c^d)))		(0..19)
+# f = (a <<< 5) + e + K + in[i] + (b^c^d)		(20..39, 60..79)
+# f = (a <<< 5) + e + K + in[i] + (c&d) + (b&(c^d))	(40..59)
+# The hard part is doing this with only two temporary registers.
+# Let's divide it into 4 parts.  These can be executed in a 7-cycle
+# loop, assuming triple (quadruple counting xor separately) issue:
+#
+#	in[i]		F1(c,d)		F2(b,c,d)	a<<<5
+#	mov in[i],T			(addl S,A)	(movl B,S)
+#	xor in[i+1],T					(rorl 5,S)
+#	xor in[i+2],T	movl D,S			(addl S,A)
+#	xor in[i+3],T	andl C,S
+#	rotl 1,T	addl S,E	movl D,S
+#	(movl T,in[i])                  xorl C,S
+#	addl T+K,E			andl B,S	rorl 2,B	//
+#	(mov in[i],T)			addl S,E	movl A,S
+#	(xor in[i+1],T)					rorl 5,S
+#	(xor in[i+2],T)	(movl C,S)			addl S,E
+#
+# (The last 3 rounds can omit the store of T.)
+# The "addl T+K,E" line is actually implemented using the lea instruction,
+# which (on a Pentium) requires that neither T not K was modified on the
+# previous cycle.
+#
+# The other two rounds are a bit simpler, and can therefore be "pulled in"
+# one cycle, to 6.  The bit-select function (0..19):
+#	in[i]		F(b,c,d)	a<<<5
+#	mov in[i],T	(xorl E,S)			(addl T+K,A)
+#	xor in[i+1],T	(addl S,A)	(movl A,S)
+#	xor in[i+2],T	(rorl 2,C)	(roll 5,S)
+#	xor in[i+3],T	movl D,S	(addl S,A)
+#	roll 1,T	xorl C,S
+#	movl T,in[i]	andl B,S			rorl 2,B
+#	addl T+K,E	xorl D,S		//	(mov in[i],T)
+#	(xor in[i+1],T)	addl S,E	movl A,S
+#	(xor in[i+2],T)			roll 5,S
+#	(xor in[i+3],T)	(movl C,S)	addl S,E
+#
+# And the XOR function (also 6, limited by the in[i] forming) used in
+# rounds 20..39 and 60..79:
+#	in[i]		F(b,c,d)	a<<<5
+#	mov in[i],T	(xorl C,S)			(addl T+K,A)
+#	xor in[i+1],T	(addl S,A)	(movl A,S)
+#	xor in[i+2],T			(roll 5,S)	
+#	xor in[i+3],T			(addl S,A)
+#	roll 1,T	movl D,S
+#	movl T,in[i]	xorl B,S			rorl 2,B
+#	addl T+K,E	xorl C,S		//	(mov in[i],T)
+#	(xor in[i+1],T)	addl S,E	movl A,S
+#	(xor in[i+2],T)			roll 5,S
+#	(xor in[i+3],T)			addl S,E
+#
+# The first 16 rounds don't need to form the in[i] equation, letting
+# us pull it in another 2 cycles, to 4, after some reassignment of
+# temporaries:
+#	in[i]		F(b,c,d)	a<<<5
+#			movl D,S	(roll 5,T)	(addl S,A)
+#	mov in[i],T	xorl C,S	(addl T,A)
+#			andl B,S			rorl 2,B
+#	addl T+K,E	xorl D,S	movl A,T
+#			addl S,E	roll 5,T	(movl C,S)
+#	(mov in[i],T)	(xorl B,S)	addl T,E
+#
+
+# The transition between rounds 15 and 16 will be a bit tricky... the best
+# thing to do is to delay the computation of a<<<5 one cycle and move it back
+# to the S register.  That way, T is free as early as possible.
+#	in[i]		F(b,c,d)	a<<<5
+#	(addl T+K,A)	(xorl E,S)	(movl A,T)
+#			movl D,S	(roll 5,T)	(addl S,A)
+#	mov in[i],T	xorl C,S	(addl T,A)
+#			andl B,S			rorl 2,B
+#	addl T+K,E	xorl D,S			(movl in[1],T)
+#	(xor in[i+1],T)	addl S,E	movl A,S
+#	(xor in[i+2],T)	rorl 2,B	roll 5,S
+#	(xor in[i+3],T)	(movl C,S)	addl S,E
+
+
+
+
+
+# This expects the first copy of D to S to have been done already
+#			movl D,S	(roll 5,T)	(addl S,A)	//
+#	mov in[i],T	xorl C,S	(addl T,A)
+#			andl B,S			rorl 2,B
+#	addl T+K,E	xorl D,S	movl A,T
+#			addl S,E	roll 5,T	(movl C,S)	//
+#	(mov in[i],T)	(xorl B,S)	addl T,E	
+
+sub BODY_00_15
+{
+	local($n,$a,$b,$c,$d,$e)=@_;
+
+	&comment("00_15 $n");
+		&mov($S,$d) if ($n == 0);
+	&mov($T,&swtmp($n%16));		# Load Xi.
+		&xor($S,$c);		# Continue computing F() = d^(b&(c^d))
+		&and($S,$b);
+			&rotr($b,2);
+	&lea($e,&DWP(K1,$e,$T));	# Add Xi and K
+    if ($n < 15) {
+			&mov($T,$a);
+		&xor($S,$d);
+			&rotl($T,5);
+		&add($e,$S);
+		&mov($S,$c);		# Start of NEXT round's F()
+			&add($e,$T);
+    } else {
+	# This version provides the correct start for BODY_20_39
+	&mov($T,&swtmp(($n+1)%16));	# Start computing mext Xi.
+		&xor($S,$d);
+	&xor($T,&swtmp(($n+3)%16));
+		&add($e,$S);		# Add F()
+			&mov($S,$a);	# Start computing a<<<5
+	&xor($T,&swtmp(($n+9)%16));
+			&rotl($S,5);
+    }
+
+}
+
+# The transition between rounds 15 and 16 will be a bit tricky... the best
+# thing to do is to delay the computation of a<<<5 one cycle and move it back
+# to the S register.  That way, T is free as early as possible.
+#	in[i]		F(b,c,d)	a<<<5
+#	(addl T+K,A)	(xorl E,S)	(movl B,T)
+#			movl D,S	(roll 5,T)	(addl S,A)	//
+#	mov in[i],T	xorl C,S	(addl T,A)
+#			andl B,S			rorl 2,B
+#	addl T+K,E 	xorl D,S			(movl in[1],T)
+#	(xor in[i+1],T)	addl S,E	movl A,S
+#	(xor in[i+2],T)	rorl 2,B	roll 5,S	//
+#	(xor in[i+3],T)	(movl C,S)	addl S,E
+
+
+# This starts just before starting to compute F(); the Xi should have XORed
+# the first three values together.  (Break is at //)
+#
+#	in[i]		F(b,c,d)	a<<<5
+#	mov in[i],T	(xorl E,S)			(addl T+K,A)
+#	xor in[i+1],T	(addl S,A)	(movl B,S)
+#	xor in[i+2],T			(roll 5,S)	//
+#	xor in[i+3],T 	movl D,S  	(addl S,A)
+#	roll 1,T	xorl C,S
+#	movl T,in[i]	andl B,S			rorl 2,B
+#	addl T+K,E 	xorl D,S			(mov in[i],T)
+#	(xor in[i+1],T)	addl S,E	movl A,S
+#	(xor in[i+2],T)			roll 5,S	//
+#	(xor in[i+3],T)	(movl C,S)	addl S,E
+
+sub BODY_16_19
+{
+	local($n,$a,$b,$c,$d,$e)=@_;
+
+	&comment("16_20 $n");
+
+	&xor($T,&swtmp(($n+13)%16));
+			&add($a,$S);	# End of previous round
+		&mov($S,$d);		# Start current round's F()
+	&rotl($T,1);
+		&xor($S,$c);
+	&mov(&swtmp($n%16),$T);		# Store computed Xi.
+		&and($S,$b);
+		&rotr($b,2);
+	&lea($e,&DWP(K1,$e,$T));	# Add Xi and K
+	&mov($T,&swtmp(($n+1)%16));	# Start computing mext Xi.
+		&xor($S,$d);
+	&xor($T,&swtmp(($n+3)%16));
+		&add($e,$S);		# Add F()
+			&mov($S,$a);	# Start computing a<<<5
+	&xor($T,&swtmp(($n+9)%16));
+			&rotl($S,5);
+}
+
+# This is just like BODY_16_19, but computes a different F() = b^c^d
+#
+#	in[i]		F(b,c,d)	a<<<5
+#	mov in[i],T	(xorl E,S)			(addl T+K,A)
+#	xor in[i+1],T	(addl S,A)	(movl B,S)
+#	xor in[i+2],T			(roll 5,S)	//
+#	xor in[i+3],T 			(addl S,A)
+#	roll 1,T	movl C,S
+#	movl T,in[i]	xorl B,S			rorl 2,B
+#	addl T+K,E 	xorl C,S			(mov in[i],T)
+#	(xor in[i+1],T)	addl S,E	movl A,S
+#	(xor in[i+2],T)			roll 5,S	//
+#	(xor in[i+3],T)	(movl C,S)	addl S,E
+
+sub BODY_20_39	# And 61..79
+{
+	local($n,$a,$b,$c,$d,$e)=@_;
+	local $K=($n<40) ? K2 : K4;
+
+	&comment("21_30 $n");
+
+	&xor($T,&swtmp(($n+13)%16));
+			&add($a,$S);	# End of previous round
+		&mov($S,$d)
+	&rotl($T,1);
+		&mov($S,$d);		# Start current round's F()
+	&mov(&swtmp($n%16),$T) if ($n < 77);	# Store computed Xi.
+		&xor($S,$b);
+		&rotr($b,2);
+	&lea($e,&DWP($K,$e,$T));	# Add Xi and K
+	&mov($T,&swtmp(($n+1)%16)) if ($n < 79); # Start computing next Xi.
+		&xor($S,$c);
+	&xor($T,&swtmp(($n+3)%16)) if ($n < 79);
+		&add($e,$S);		# Add F1()
+			&mov($S,$a);	# Start computing a<<<5
+	&xor($T,&swtmp(($n+9)%16)) if ($n < 79);
+			&rotl($S,5);
+
+			&add($e,$S) if ($n == 79);
+}
+
+
+# This starts immediately after the LEA, and expects to need to finish
+# the previous round. (break is at //)
+#
+#	in[i]		F1(c,d)		F2(b,c,d)	a<<<5
+#	(addl T+K,E)			(andl C,S)	(rorl 2,C)
+#	mov in[i],T			(addl S,A)	(movl B,S)
+#	xor in[i+1],T					(rorl 5,S)
+#	xor in[i+2],T /	movl D,S			(addl S,A)
+#	xor in[i+3],T	andl C,S
+#	rotl 1,T	addl S,E	movl D,S
+#	(movl T,in[i])                  xorl C,S
+#	addl T+K,E			andl B,S	rorl 2,B
+#	(mov in[i],T)			addl S,E	movl A,S
+#	(xor in[i+1],T)					rorl 5,S
+#	(xor in[i+2],T)	// (movl C,S)			addl S,E
+
+sub BODY_40_59
+{
+	local($n,$a,$b,$c,$d,$e)=@_;
+
+	&comment("41_59 $n");
+
+			&add($a,$S);	# End of previous round
+		&mov($S,$d);		# Start current round's F(1)
+	&xor($T,&swtmp(($n+13)%16));
+		&and($S,$c);
+	&rotl($T,1);
+		&add($e,$S);		# Add F1()
+		&mov($S,$d);		# Start current round's F2()
+	&mov(&swtmp($n%16),$T);		# Store computed Xi.
+		&xor($S,$c);
+	&lea($e,&DWP(K3,$e,$T));
+		&and($S,$b);
+		&rotr($b,2);
+	&mov($T,&swtmp(($n+1)%16));	# Start computing next Xi.
+		&add($e,$S);		# Add F2()
+		&mov($S,$a);		# Start computing a<<<5
+	&xor($T,&swtmp(($n+3)%16));
+			&rotl($S,5);
+	&xor($T,&swtmp(($n+9)%16));
+}
+
+&function_begin("sha1_block_data_order",16);
+	&mov($S,&wparam(0));	# SHA_CTX *c
+	&mov($T,&wparam(1));	# const void *input
+	&mov($A,&wparam(2));	# size_t num
+	&stack_push(16);	# allocate X[16]
+	&shl($A,6);
+	&add($A,$T);
+	&mov(&wparam(2),$A);	# pointer beyond the end of input
+	&mov($E,&DWP(16,$S));# pre-load E
+
+	&set_label("loop",16);
+
+	# copy input chunk to X, but reversing byte order!
+	for ($i=0; $i<16; $i+=4)
+		{
+		&mov($A,&DWP(4*($i+0),$T));
+		&mov($B,&DWP(4*($i+1),$T));
+		&mov($C,&DWP(4*($i+2),$T));
+		&mov($D,&DWP(4*($i+3),$T));
+		&bswap($A);
+		&bswap($B);
+		&bswap($C);
+		&bswap($D);
+		&mov(&swtmp($i+0),$A);
+		&mov(&swtmp($i+1),$B);
+		&mov(&swtmp($i+2),$C);
+		&mov(&swtmp($i+3),$D);
+		}
+	&mov(&wparam(1),$T);	# redundant in 1st spin
+
+	&mov($A,&DWP(0,$S));	# load SHA_CTX
+	&mov($B,&DWP(4,$S));
+	&mov($C,&DWP(8,$S));
+	&mov($D,&DWP(12,$S));
+	# E is pre-loaded
+
+	for($i=0;$i<16;$i++)	{ &BODY_00_15($i,@V); unshift(@V,pop(@V)); }
+	for(;$i<16;$i++)	{ &BODY_15($i,@V);    unshift(@V,pop(@V)); }
+	for(;$i<20;$i++)	{ &BODY_16_19($i,@V); unshift(@V,pop(@V)); }
+	for(;$i<40;$i++)	{ &BODY_20_39($i,@V); unshift(@V,pop(@V)); }
+	for(;$i<60;$i++)	{ &BODY_40_59($i,@V); unshift(@V,pop(@V)); }
+	for(;$i<80;$i++)	{ &BODY_20_39($i,@V); unshift(@V,pop(@V)); }
+
+	(($V[4] eq $E) and ($V[0] eq $A)) or die;	# double-check
+
+	&comment("Loop trailer");
+
+	&mov($S,&wparam(0));	# re-load SHA_CTX*
+	&mov($T,&wparam(1));	# re-load input pointer
+
+	&add($A,&DWP(0,$S));	# E is last "A"...
+	&add($B,&DWP(4,$S));
+	&add($C,&DWP(8,$S));
+	&add($D,&DWP(12,$S));
+	&add($E,&DWP(16,$S));
+
+	&mov(&DWP(0,$S),$A);	# update SHA_CTX
+	 &add($T,64);		# advance input pointer
+	&mov(&DWP(4,$S),$B);
+	 &cmp($T,&wparam(2));	# have we reached the end yet?
+	&mov(&DWP(8,$S),$C);
+	&mov(&DWP(12,$S),$D);
+	&mov(&DWP(16,$S),$E);
+	&jb(&label("loop"));
+
+	&stack_pop(16);
+&function_end("sha1_block_data_order");
+&asciz("SHA1 block transform for x86, CRYPTOGAMS by <appro\@openssl.org>");
+
+&asm_finish();

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin
@ 2009-07-31 11:11   ` Erik Faye-Lund
  2009-07-31 11:31     ` George Spelvin
  2009-07-31 11:37     ` Michael J Gruber
  2009-07-31 11:21   ` Michael J Gruber
                     ` (7 subsequent siblings)
  8 siblings, 2 replies; 129+ messages in thread
From: Erik Faye-Lund @ 2009-07-31 11:11 UTC (permalink / raw)
  To: George Spelvin; +Cc: git

On Fri, Jul 31, 2009 at 12:46 PM, George Spelvin<linux@horizon.com> wrote:
> I haven't packaged this nicely, but it's not that complicated.
> - Download Andy's original code from
>  http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz
> - Unpack and cd to the cryptogams-0/x86 directory
> - "patch < this_email" to create "sha1test.c", "sha1-586.h", "Makefile",
>   and "sha1-x86.pl".
> - "make"

$ patch < ../sha1-opt.patch.eml
patching file `Makefile'
patching file `sha1test.c'
patching file `sha1-586.h'
patching file `sha1-x86.pl'

$ make
make: *** No rule to make target `sha1-586.o', needed by `586test'.  Stop.

What did I do wrong? :)
Would it be easier if you pushed it out somewhere?

-- 
Erik "kusma" Faye-Lund
kusmabite@gmail.com
(+47) 986 59 656

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin
  2009-07-31 11:11   ` Erik Faye-Lund
@ 2009-07-31 11:21   ` Michael J Gruber
  2009-07-31 11:26   ` Michael J Gruber
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 129+ messages in thread
From: Michael J Gruber @ 2009-07-31 11:21 UTC (permalink / raw)
  To: George Spelvin; +Cc: git

George Spelvin venit, vidit, dixit 31.07.2009 12:46:
> After studying Andy Polyakov's optimized x86 SHA-1 in OpenSSL, I've
> got a version that's 1.6% slower on a P4 and 15% faster on a Phenom.
> I'm curious about the performance on other CPUs I don't have access to,
> particularly Core 2 duo and i7.
> 
> Could someone do some benchmarking for me?  Old (486/Pentium/P2/P3)
> machines are also interesting, but I'm optimizing for newer ones.
> 
> I haven't packaged this nicely, but it's not that complicated.
> - Download Andy's original code from
>   http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz
> - Unpack and cd to the cryptogams-0/x86 directory
> - "patch < this_email" to create "sha1test.c", "sha1-586.h", "Makefile",
>    and "sha1-x86.pl".
> - "make" 
> - Run ./586test (before) and ./x86test (after) and note the timings.
> 
> Thank you!

Best of 6 runs:
./586test
Result:  a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
500000000 bytes: 1.642336 s
./x86test
Result:  a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
500000000 bytes: 1.532153 s

System:
uname -a
Linux localhost.localdomain 2.6.29.6-213.fc11.x86_64 #1 SMP Tue Jul 7
21:02:57 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 Duo CPU     T7500  @ 2.20GHz
stepping        : 11
cpu MHz         : 800.000
cache size      : 4096 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor
ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm ida tpr_shadow vnmi
flexpriority
bogomips        : 4389.20
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 Duo CPU     T7500  @ 2.20GHz
stepping        : 11
cpu MHz         : 800.000
cache size      : 4096 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor
ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm ida tpr_shadow vnmi
flexpriority
bogomips        : 4388.78
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin
  2009-07-31 11:11   ` Erik Faye-Lund
  2009-07-31 11:21   ` Michael J Gruber
@ 2009-07-31 11:26   ` Michael J Gruber
  2009-07-31 12:31   ` Carlos R. Mafra
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 129+ messages in thread
From: Michael J Gruber @ 2009-07-31 11:26 UTC (permalink / raw)
  To: George Spelvin; +Cc: git

George Spelvin venit, vidit, dixit 31.07.2009 12:46:
> After studying Andy Polyakov's optimized x86 SHA-1 in OpenSSL, I've
> got a version that's 1.6% slower on a P4 and 15% faster on a Phenom.
> I'm curious about the performance on other CPUs I don't have access to,
> particularly Core 2 duo and i7.
> 
> Could someone do some benchmarking for me?  Old (486/Pentium/P2/P3)
> machines are also interesting, but I'm optimizing for newer ones.
> 
> I haven't packaged this nicely, but it's not that complicated.
> - Download Andy's original code from
>   http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz
> - Unpack and cd to the cryptogams-0/x86 directory
> - "patch < this_email" to create "sha1test.c", "sha1-586.h", "Makefile",
>    and "sha1-x86.pl".
> - "make" 
> - Run ./586test (before) and ./x86test (after) and note the timings.
> 
> Thank you!

Best of 6 runs:
./586test
Result:  a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
500000000 bytes: 1.258031 s
./x86test
Result:  a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
500000000 bytes: 1.171770 s

System:
uname -a
Linux whatever 2.6.22-14-generic #1 SMP Tue Feb 12 07:42:25 UTC 2008
i686 GNU/Linux
cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz
stepping        : 10
cpu MHz         : 2000.000
cache size      : 6144 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm
constant_tsc pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr lahf_lm
bogomips        : 5988.92
clflush size    : 64

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz
stepping        : 10
cpu MHz         : 2000.000
cache size      : 6144 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm
constant_tsc pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr lahf_lm
bogomips        : 5984.92
clflush size    : 64

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 11:11   ` Erik Faye-Lund
@ 2009-07-31 11:31     ` George Spelvin
  2009-07-31 11:37     ` Michael J Gruber
  1 sibling, 0 replies; 129+ messages in thread
From: George Spelvin @ 2009-07-31 11:31 UTC (permalink / raw)
  To: kusmabite, linux; +Cc: git

> $ make
> make: *** No rule to make target `sha1-586.o', needed by `586test'.  Stop.
>
> What did I do wrong? :)
> Would it be easier if you pushed it out somewhere?

H'm.... It *should* do

perl sha1-586.pl elf > sha1-586.s
as --32  -o sha1-586.o sha1-586.s
gcc -m32 -W -Wall -Os -g -o 586test sha1test.c sha1-586.o
(And likewise for the "x86test" binary.)

which is what happened when I tested it.  Obviously I have something
non-portable in the Makefile.

You could try "make sha1-586.s" and "sha1-586.o" and see which rule is
f*ed up.

Um, you *are* in a directory which contains a sha1-586.pl file, right?


Thanks!

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 11:11   ` Erik Faye-Lund
  2009-07-31 11:31     ` George Spelvin
@ 2009-07-31 11:37     ` Michael J Gruber
  2009-07-31 12:24       ` Erik Faye-Lund
  1 sibling, 1 reply; 129+ messages in thread
From: Michael J Gruber @ 2009-07-31 11:37 UTC (permalink / raw)
  To: Erik Faye-Lund; +Cc: George Spelvin, git

Erik Faye-Lund venit, vidit, dixit 31.07.2009 13:11:
> On Fri, Jul 31, 2009 at 12:46 PM, George Spelvin<linux@horizon.com> wrote:
>> I haven't packaged this nicely, but it's not that complicated.
>> - Download Andy's original code from
>>  http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz
>> - Unpack and cd to the cryptogams-0/x86 directory
>> - "patch < this_email" to create "sha1test.c", "sha1-586.h", "Makefile",
>>   and "sha1-x86.pl".
>> - "make"
> 
> $ patch < ../sha1-opt.patch.eml
> patching file `Makefile'
> patching file `sha1test.c'
> patching file `sha1-586.h'
> patching file `sha1-x86.pl'
> 
> $ make
> make: *** No rule to make target `sha1-586.o', needed by `586test'.  Stop.
> 
> What did I do wrong? :)
> Would it be easier if you pushed it out somewhere?
> 

You need to go to the x86 directory, apply the patch and run make there.
(I made the same mistake.) Also, you need i586 (32-bit) glibc-devel if
you're on a 64-bit system.

Michael

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 11:37     ` Michael J Gruber
@ 2009-07-31 12:24       ` Erik Faye-Lund
  2009-07-31 12:29         ` Johannes Schindelin
  2009-07-31 12:32         ` George Spelvin
  0 siblings, 2 replies; 129+ messages in thread
From: Erik Faye-Lund @ 2009-07-31 12:24 UTC (permalink / raw)
  To: Michael J Gruber; +Cc: George Spelvin, git

On Fri, Jul 31, 2009 at 1:37 PM, Michael J
Gruber<git@drmicha.warpmail.net> wrote:
>> What did I do wrong? :)
>
> You need to go to the x86 directory, apply the patch and run make there.
> (I made the same mistake.) Also, you i586 (32bit) glibc-devel if you're
> on a 64 bit system.

Aha, thanks :)

Now I'm getting a different error:
$ make
as   -o sha1-586.o sha1-586.s
sha1-586.s: Assembler messages:
sha1-586.s:4: Warning: .type pseudo-op used outside of .def/.endef ignored.
sha1-586.s:4: Error: junk at end of line, first unrecognized character is `s'
sha1-586.s:1438: Warning: .size pseudo-op used outside of .def/.endef ignored.
sha1-586.s:1438: Error: junk at end of line, first unrecognized character is `s'

make: *** [sha1-586.o] Error 1

What might be relevant is that I'm trying this on Windows (Vista
64-bit). I'd still think GNU as should be able to assemble the source,
though. I've got an i7, so I thought the result might be interesting.

-- 
Erik "kusma" Faye-Lund
kusmabite@gmail.com
(+47) 986 59 656

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 12:24       ` Erik Faye-Lund
@ 2009-07-31 12:29         ` Johannes Schindelin
  2009-07-31 12:32         ` George Spelvin
  1 sibling, 0 replies; 129+ messages in thread
From: Johannes Schindelin @ 2009-07-31 12:29 UTC (permalink / raw)
  To: Erik Faye-Lund; +Cc: Michael J Gruber, George Spelvin, git

Hi,

On Fri, 31 Jul 2009, Erik Faye-Lund wrote:

> On Fri, Jul 31, 2009 at 1:37 PM, Michael J
> Gruber<git@drmicha.warpmail.net> wrote:
> >> What did I do wrong? :)
> >
> > You need to go to the x86 directory, apply the patch and run make there.
> > (I made the same mistake.) Also, you i586 (32bit) glibc-devel if you're
> > on a 64 bit system.
> 
> Aha, thanks :)
> 
> Now I'm getting a different error:
> $ make
> as   -o sha1-586.o sha1-586.s
> sha1-586.s: Assembler messages:
> sha1-586.s:4: Warning: .type pseudo-op used outside of .def/.endef ignored.
> sha1-586.s:4: Error: junk at end of line, first unrecognized character is `s'
> sha1-586.s:1438: Warning: .size pseudo-op used outside of .def/.endef ignored.
> sha1-586.s:1438: Error: junk at end of line, first unrecognized character is `s'
> 
> make: *** [sha1-586.o] Error 1
> 
> What might be relevant, is that I'm trying this on Windows (Vista
> 64bit).

Probably using msysGit?  Then you're still using the 32-bit environment, 
as MSys is 32-bit only for now.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin
                     ` (2 preceding siblings ...)
  2009-07-31 11:26   ` Michael J Gruber
@ 2009-07-31 12:31   ` Carlos R. Mafra
  2009-07-31 13:27   ` Brian Ristuccia
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 129+ messages in thread
From: Carlos R. Mafra @ 2009-07-31 12:31 UTC (permalink / raw)
  To: George Spelvin; +Cc: git

On Fri 31.Jul'09 at  6:46:02 -0400, George Spelvin wrote:
> - Run ./586test (before) and ./x86test (after) and note the timings.

For 8 runs in a Intel(R) Core(TM)2 Duo CPU T7250 @ 2.00GHz,

before: 1.75 +- 0.02
after: 1.62 +- 0.02

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 12:24       ` Erik Faye-Lund
  2009-07-31 12:29         ` Johannes Schindelin
@ 2009-07-31 12:32         ` George Spelvin
  2009-07-31 12:45           ` Erik Faye-Lund
  1 sibling, 1 reply; 129+ messages in thread
From: George Spelvin @ 2009-07-31 12:32 UTC (permalink / raw)
  To: git, kusmabite; +Cc: git, linux

> Now I'm getting a different error:
> $ make
> as   -o sha1-586.o sha1-586.s
> sha1-586.s: Assembler messages:
> sha1-586.s:4: Warning: .type pseudo-op used outside of .def/.endef ignored.
> sha1-586.s:4: Error: junk at end of line, first unrecognized character is `s'
> sha1-586.s:1438: Warning: .size pseudo-op used outside of .def/.endef ignored.
> sha1-586.s:1438: Error: junk at end of line, first unrecognized character is `s'
> 
> make: *** [sha1-586.o] Error 1

> What might be relevant, is that I'm trying this on Windows (Vista
> 64bit). I'd still think GNU as should be able to assemble the source,
> though. I've got an i7, so I thought the result might be interresting.

Ah... what assembler?  The perl preprocessor supports multiple
assemblers:
	elf     - Linux, FreeBSD, Solaris x86, etc.
	a.out   - DJGPP, elder OpenBSD, etc.
	coff    - GAS/COFF such as Win32 targets
	win32n  - Windows 95/Windows NT NASM format
	nw-nasm - NetWare NASM format
	nw-mwasm- NetWare Metrowerks Assembler

Maybe you need to replace "elf" with "coff"?
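
I.e. in the Makefile from my patch, the pattern rule would become
something like (untested on Windows, so no promises the rest of the
toolchain cooperates):

%.s : %.pl x86asm.pl x86unix.pl
	perl $< coff > $@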

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 12:32         ` George Spelvin
@ 2009-07-31 12:45           ` Erik Faye-Lund
  2009-07-31 13:02             ` George Spelvin
  0 siblings, 1 reply; 129+ messages in thread
From: Erik Faye-Lund @ 2009-07-31 12:45 UTC (permalink / raw)
  To: George Spelvin; +Cc: git, git

On Fri, Jul 31, 2009 at 2:32 PM, George Spelvin<linux@horizon.com> wrote:
> Maybe you need to replace "elf" with "coff"?

That did the trick, thanks!

Best of 6 runs on an Intel Core i7 920 @ 2.67GHz:

before (586test): 1.415
after (x86test): 1.470

-- 
Erik "kusma" Faye-Lund
kusmabite@gmail.com
(+47) 986 59 656

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 12:45           ` Erik Faye-Lund
@ 2009-07-31 13:02             ` George Spelvin
  0 siblings, 0 replies; 129+ messages in thread
From: George Spelvin @ 2009-07-31 13:02 UTC (permalink / raw)
  To: kusmabite, linux; +Cc: git, git

> That did the trick, thanks!
>
> Best of 6 runs on an Intel Core i7 920 @ 2.67GHz:
> 
> before (586test): 1.415
> after (x86test): 1.470

So it's slower.  Bummer. :-(  Obviously I have some work to do.

But thank you very much for the result!

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin
                     ` (3 preceding siblings ...)
  2009-07-31 12:31   ` Carlos R. Mafra
@ 2009-07-31 13:27   ` Brian Ristuccia
  2009-07-31 14:05     ` George Spelvin
  2009-07-31 13:27   ` Jakub Narebski
                     ` (3 subsequent siblings)
  8 siblings, 1 reply; 129+ messages in thread
From: Brian Ristuccia @ 2009-07-31 13:27 UTC (permalink / raw)
  To: git; +Cc: George Spelvin

The revised code is faster on Intel Atom N270 by around 15% (results below
typical of several runs):

$ ./586test ; ./x86test 
Result:  a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
500000000 bytes: 4.981760 s
Result:  a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
500000000 bytes: 4.323324 s

$ cat /proc/cpuinfo

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 28
model name	: Intel(R) Atom(TM) CPU N270   @ 1.60GHz
stepping	: 2
cpu MHz		: 800.000
cache size	: 512 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 1
apicid		: 0
initial apicid	: 0
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx constant_tsc
arch_perfmon pebs bts pni dtes64 monitor ds_cpl est tm2 ssse3 xtpr pdcm
lahf_lm
bogomips	: 3191.59
clflush size	: 64
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 28
model name	: Intel(R) Atom(TM) CPU N270   @ 1.60GHz
stepping	: 2
cpu MHz		: 800.000
cache size	: 512 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 1
apicid		: 1
initial apicid	: 1
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx constant_tsc
arch_perfmon pebs bts pni dtes64 monitor ds_cpl est tm2 ssse3 xtpr pdcm
lahf_lm
bogomips	: 3191.91
clflush size	: 64
power management:

-- 
Brian Ristuccia
brian@ristuccia.com

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin
                     ` (4 preceding siblings ...)
  2009-07-31 13:27   ` Brian Ristuccia
@ 2009-07-31 13:27   ` Jakub Narebski
  2009-07-31 15:05   ` Peter Harris
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 129+ messages in thread
From: Jakub Narebski @ 2009-07-31 13:27 UTC (permalink / raw)
  To: George Spelvin; +Cc: git

"George Spelvin" <linux@horizon.com> writes:
> After studying Andy Polyakov's optimized x86 SHA-1 in OpenSSL, I've
> got a version that's 1.6% slower on a P4 and 15% faster on a Phenom.
> I'm curious about the performance on other CPUs I don't have access to,
> particularly Core 2 duo and i7.
> 
> Could someone do some benchmarking for me?  Old (486/Pentium/P2/P3)
> machines are also interesting, but I'm optimizing for newer ones.

----------
$ [time] ./586test 
Result:  a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
500000000 bytes: 5.376325 s

real    0m5.384s
user    0m5.108s
sys     0m0.008s

500000000 bytes: 5.367261 s

5.09user 0.00system 0:05.38elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+378minor)pagefaults 0swaps

----------
$ [time] ./x86test 
Result:  a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
500000000 bytes: 5.312238 s

real    0m5.325s
user    0m5.060s
sys     0m0.008s

500000000 bytes: 5.323783 s

5.06user 0.00system 0:05.34elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+377minor)pagefaults 0swaps

==========
System:
$ uname -a
Linux roke 2.6.14-11.1.aur.2 #1 Tue Jan 31 16:05:05 CET 2006 \
 i686 athlon i386 GNU/Linux

$ cat /proc/cpuinfo
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 6
model		: 4
model name	: AMD Athlon(tm) processor
stepping	: 2
cpu MHz		: 1000.188
cache size	: 256 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 mtrr pge \
 mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips	: 2002.43

$ free
             total       used       free     shared    buffers     cached
Mem:        515616     495812      19804          0       6004     103160
-/+ buffers/cache:     386648     128968
Swap:      1052248     279544     772704

-- 
Jakub Narebski
Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 13:27   ` Brian Ristuccia
@ 2009-07-31 14:05     ` George Spelvin
  0 siblings, 0 replies; 129+ messages in thread
From: George Spelvin @ 2009-07-31 14:05 UTC (permalink / raw)
  To: brian, git; +Cc: linux

> The revised code is faster on Intel Atom N270 by around 15% (results below
> typical of several runs):
> 
> $ ./586test ; ./x86test 
> Result:  a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
> Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
> 500000000 bytes: 4.981760 s
> Result:  a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
> Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
> 500000000 bytes: 4.323324 s

Cool, thanks!  I hadn't optimized it at all for Atom's
in-order pipe, so I'm pleasantly surprised.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin
                     ` (5 preceding siblings ...)
  2009-07-31 13:27   ` Jakub Narebski
@ 2009-07-31 15:05   ` Peter Harris
  2009-07-31 15:22   ` Peter Harris
  2009-08-03  3:47   ` x86 SHA1: Faster than OpenSSL George Spelvin
  8 siblings, 0 replies; 129+ messages in thread
From: Peter Harris @ 2009-07-31 15:05 UTC (permalink / raw)
  To: George Spelvin; +Cc: git

On Fri, Jul 31, 2009 at 6:46 AM, George Spelvin wrote:
> Could someone do some benchmarking for me?  Old (486/Pentium/P2/P3)
> machines are also interesting, but I'm optimizing for newer ones.

The new code appears to be marginally faster on a Pentium 3 Xeon:

Best of five runs:
586test: 11.658661 s
x86test: 11.209024 s

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 7
model name      : Pentium III (Katmai)
stepping        : 3
cpu MHz         : 547.630
cache size      : 1024 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 mmx fxsr sse
bogomips        : 1097.12
clflush size    : 32
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 7
model name      : Pentium III (Katmai)
stepping        : 3
cpu MHz         : 547.630
cache size      : 1024 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 mmx fxsr sse
bogomips        : 1095.37
clflush size    : 32
power management:

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin
                     ` (6 preceding siblings ...)
  2009-07-31 15:05   ` Peter Harris
@ 2009-07-31 15:22   ` Peter Harris
  2009-08-03  3:47   ` x86 SHA1: Faster than OpenSSL George Spelvin
  8 siblings, 0 replies; 129+ messages in thread
From: Peter Harris @ 2009-07-31 15:22 UTC (permalink / raw)
  To: George Spelvin; +Cc: git

On Fri, Jul 31, 2009 at 6:46 AM, George Spelvin wrote:
> Could someone do some benchmarking for me?  Old (486/Pentium/P2/P3)
> machines are also interesting, but I'm optimizing for newer ones.

My Geode isn't old in age, but I admit it's old in design (and the
vendor switched to Atom right after I bought it...)

Best of three runs:

586test: 26.536597 s
x86test: 26.111148 s

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 5
model           : 10
model name      : Geode(TM) Integrated Processor by AMD PCS
stepping        : 2
cpu MHz         : 499.927
cache size      : 128 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu de pse tsc msr cx8 sep pge cmov clflush mmx
mmxext 3dnowext 3dnow
bogomips        : 1001.72
clflush size    : 32
power management:

Peter Harris

^ permalink raw reply	[flat|nested] 129+ messages in thread

* x86 SHA1: Faster than OpenSSL
  2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin
                     ` (7 preceding siblings ...)
  2009-07-31 15:22   ` Peter Harris
@ 2009-08-03  3:47   ` George Spelvin
  2009-08-03  7:36     ` Jonathan del Strother
                       ` (3 more replies)
  8 siblings, 4 replies; 129+ messages in thread
From: George Spelvin @ 2009-08-03  3:47 UTC (permalink / raw)
  To: git; +Cc: appro, appro, linux

(Work in progress, state dump to mailing list archives.)

This started when discussing git startup overhead due to the dynamic
linker.  One big contributor is the openssl library, which is used only
for its optimized x86 SHA-1 implementation.  So I took a look at it,
with an eye to importing the code directly into the git source tree,
and decided that I felt like trying to do better.

The original code was excellent, but it was optimized when the P4 was new.
After a bit of tweaking, I've inflicted a slight (1.4%) slowdown on the
P4, but a small-but-noticeable speedup on a variety of other processors.

Before      After       Gain    Processor
1.585248    1.353314	+17%	2500 MHz Phenom
3.249614    3.295619	-1.4%	1594 MHz P4
1.414512    1.352843	+4.5%	2.66 GHz i7
3.460635    3.284221	+5.4%	1596 MHz Athlon XP
4.077993    3.891826	+4.8%	1144 MHz Athlon
1.912161    1.623212	+17%	2100 MHz Athlon 64 X2
2.956432    2.940210	+0.55%	1794 MHz Mobile Celeron (fam 15 model 2)

(Seconds to hash 500x 1 MB, best of 10 runs in all cases.)

This is based on Andy Polyakov's GPL/BSD licensed cryptogams code, and
(for now) uses the same perl preprocessor.   To test it, do the following:
- Download Andy's original code from
  http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz
- "tar xz cryptogams-0.tar.gz"
- "cd cryptogams-0/x86"
- "patch < this_email" to create "sha1test.c", "sha1-586.h", "Makefile",
   and "sha1-x86.pl".
- "make" 
- Run ./586test (before) and ./x86test (after) and note the timings.

The code is currently only the core SHA transform.  Adding the appropriate
init/update/finish wrappers is straightforward.

An open question is how to add appropriate CPU detection to the git
build scripts.  (Note that `uname -m`, which it currently uses to select
the ARM code, does NOT produce the right answer if you're using a 32-bit
compiler on a 64-bit platform.)
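
(One possible sketch, and nothing more than that: instead of asking the
running kernel via `uname -m`, ask the compiler itself, since its predefined
target macros describe the code it will actually emit.  The probe below uses
only GCC/Clang's standard target macros; the strings it prints are just
hypothetical tags a Makefile could switch on, not anything git defines today.)

/* Build this with the same CC/CFLAGS as the rest of git and read its
 * output to pick the matching SHA-1 implementation; an -m32 build on an
 * x86-64 box correctly reports "x86" here, unlike `uname -m`. */
#include <stdio.h>

int main(void)
{
#if defined(__i386__)
	puts("x86");
#elif defined(__x86_64__)
	puts("x86_64");
#elif defined(__arm__)
	puts("arm");
#else
	puts("generic");
#endif
	return 0;
}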

I try to explain it in the comments, but with all the software pipelining
that makes the rounds overlap (and there are at least 4 different kinds
of rounds, which overlap with each other), it's a bit intricate.  If you
feel really masochistic, make commenting suggestions...
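
(For reference, stripped of all the pipelining, each of the 80 rounds is
just the plain C below.  This is only an illustrative sketch of the textbook
round, not code from the patch; the BODY_* subs further down interleave
consecutive iterations of exactly this body, which is where the intricacy
comes from.)

#include <stdint.h>

#define ROTL(x, n)	(((x) << (n)) | ((x) >> (32 - (n))))

/* One SHA-1 round: v[] holds a..e, x[] is the 16-word message window. */
void sha1_round(uint32_t v[5], uint32_t x[16], int i)
{
	uint32_t a = v[0], b = v[1], c = v[2], d = v[3], e = v[4], f, k;

	if (i >= 16)	/* W[i] = rotl1(W[i-3]^W[i-8]^W[i-14]^W[i-16]) */
		x[i % 16] = ROTL(x[(i+13) % 16] ^ x[(i+8) % 16] ^
				 x[(i+2) % 16] ^ x[i % 16], 1);

	if (i < 20)      { f = d ^ (b & (c ^ d));       k = 0x5a827999; }
	else if (i < 40) { f = b ^ c ^ d;               k = 0x6ed9eba1; }
	else if (i < 60) { f = (c & d) + (b & (c ^ d)); k = 0x8f1bbcdc; }
	else             { f = b ^ c ^ d;               k = 0xca62c1d6; }

	e += ROTL(a, 5) + f + k + x[i % 16];
	b = ROTL(b, 30);
	/* the next round sees (a,b,c,d,e) = (e, a, b, c, d) */
	v[0] = e; v[1] = a; v[2] = b; v[3] = c; v[4] = d;
}

(In the 40..59 F() the two terms are bit-disjoint, so the + is the same as
an OR; that is what lets the BODY_40_59 rounds split F() across two separate
additions into e.)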

--- /dev/null	2009-05-12 02:55:38.579106460 -0400
+++ Makefile	2009-08-02 06:44:44.000000000 -0400
@@ -0,0 +1,16 @@
+CC := gcc
+CFLAGS := -m32 -W -Wall -Os -g
+ASFLAGS := --32
+
+all: 586test x86test
+
+586test : sha1test.c sha1-586.o
+	$(CC) $(CFLAGS) -o $@ sha1test.c sha1-586.o
+
+x86test : sha1test.c sha1-x86.o
+	$(CC) $(CFLAGS) -o $@ sha1test.c sha1-x86.o
+
+586test x86test : sha1-586.h
+
+%.s : %.pl x86asm.pl x86unix.pl
+	perl $< elf > $@
--- /dev/null	2009-05-12 02:55:38.579106460 -0400
+++ sha1-586.h	2009-08-02 06:44:17.000000000 -0400
@@ -0,0 +1,3 @@
+#include <stdint.h>
+
+void sha1_block_data_order(uint32_t iv[5], void const *in, unsigned len);
--- /dev/null	2009-05-12 02:55:38.579106460 -0400
+++ sha1test.c	2009-08-02 08:27:48.449609504 -0400
@@ -0,0 +1,85 @@
+#include <stdint.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <sys/time.h>
+
+#include "sha1-586.h"
+
+#define SIZE 1000000
+
+#if SIZE % 64
+# error SIZE must be a multiple of 64!
+#endif
+
+static unsigned long
+timing_test(uint32_t iv[5], unsigned iter)
+{
+	unsigned i;
+	struct timeval tv0, tv1;
+	static char *p;	/* Very large buffer */
+
+	if (!p) {
+		p = malloc(SIZE);
+		if (!p) {
+			perror("malloc");
+			exit(1);
+		}
+	}
+
+	if (gettimeofday(&tv0, NULL) < 0) {
+		perror("gettimeofday");
+		exit(1);
+	}
+	for (i = 0; i < iter; i++)
+		sha1_block_data_order(iv, p, SIZE/64);
+	if (gettimeofday(&tv1, NULL) < 0) {
+		perror("gettimeofday");
+		exit(1);
+	}
+	return 1000000ul * (tv1.tv_sec-tv0.tv_sec) + tv1.tv_usec-tv0.tv_usec;
+}
+
+int
+main(void)
+{
+	uint32_t iv[5] = {
+		0x67452301, 0xefcdab89, 0x98badcfe, 0x10325476, 0xc3d2e1f0
+	};
+	/* Simplest known-answer test, "abc" */
+	static uint8_t const testbuf[64] = {
+		'a','b','c', 0x80, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+		0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+		0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+		0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 24
+	};
+	/* Expected: A9993E364706816ABA3E25717850C26C9CD0D89D */
+	static uint32_t const expected[5] = {
+		0xa9993e36, 0x4706816a, 0xba3e2571, 0x7850c26c, 0x9cd0d89d };
+	unsigned i;
+	unsigned long min_usec = -1ul;
+
+	/* quick correct-answer test.  silent unless successful. */
+	sha1_block_data_order(iv, testbuf, 1);
+	for (i = 0; i < 5; i++) {
+		if (iv[i] != expected[i]) {
+			printf("Result:  %08x %08x %08x %08x %08x\n"
+			       "Expected:%08x %08x %08x %08x %08x\n",
+				iv[0], iv[1], iv[2], iv[3], iv[4], expected[0],
+				expected[1], expected[2], expected[3],
+				expected[4]);
+			break;
+		}
+	}
+
+	for (i = 0; i < 10; i++) {
+		unsigned long usec = timing_test(iv, 500);
+		printf("%2u/10: %u.%06u s\n", i+1, (unsigned)(usec/1000000),
+			(unsigned)(usec % 1000000));
+		if (usec < min_usec)
+			min_usec = usec;
+	}
+	printf("Minimum time to hash %u bytes: %u.%06u\n", 
+		500 * SIZE, (unsigned)(min_usec/1000000),
+		(unsigned)(min_usec % 1000000));
+	return 0;
+}
--- /dev/null	2009-05-12 02:55:38.579106460 -0400
+++ sha1-x86.pl	2009-08-02 08:51:01.069614130 -0400
@@ -0,0 +1,389 @@
+#!/usr/bin/env perl
+
+# ====================================================================
+# [Re]written by Andy Polyakov <appro@fy.chalmers.se> for the OpenSSL
+# project. The module is, however, dual licensed under OpenSSL and
+# CRYPTOGAMS licenses depending on where you obtain it. For further
+# details see http://www.openssl.org/~appro/cryptogams/.
+# ====================================================================
+
+# "[Re]written" was achieved in two major overhauls. In 2004 BODY_*
+# functions were re-implemented to address P4 performance issue [see
+# commentary below], and in 2006 the rest was rewritten in order to
+# gain freedom to liberate licensing terms.
+
+# It was noted that Intel IA-32 C compiler generates code which
+# performs ~30% *faster* on P4 CPU than original *hand-coded*
+# SHA1 assembler implementation. To address this problem (and
+# prove that humans are still better than machines:-), the
+# original code was overhauled, which resulted in following
+# performance changes:
+#
+#		compared with original	compared with Intel cc
+#		assembler impl.		generated code
+# Pentium	-16%			+48%
+# PIII/AMD	+8%			+16%
+# P4		+85%(!)			+45%
+#
+# As you can see, the Pentium came out as the loser:-( Yet I reckoned that
+# the improvement on P4 outweighs the loss and incorporated this
+# re-tuned code into 0.9.7 and later.
+# ----------------------------------------------------------------
+#					<appro@fy.chalmers.se>
+
+$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
+push(@INC,"${dir}","${dir}../../perlasm");
+require "x86asm.pl";
+
+&asm_init($ARGV[0],"sha1-586.pl",$ARGV[$#ARGV] eq "386");
+
+$A="eax";
+$B="ebx";
+$C="ecx";
+$D="edx";
+$E="ebp";
+
+# Two temporaries
+$S="edi";
+$T="esi";
+
+# The round constants
+use constant K1 => 0x5a827999;
+use constant K2 => 0x6ED9EBA1;
+use constant K3 => 0x8F1BBCDC;
+use constant K4 => 0xCA62C1D6;
+
+# Given unlimited registers and functional units, it would be possible to
+# compute SHA-1 at two cycles per round, using 7 operations per round.
+# Remember, each round computes a new value for E, which is used as A
+# in the following round and B in the round after that.  There are two
+# critical paths:
+# - A must be rotated and added to the output E (2 cycles between rounds)
+# - B must go through two boolean operations before being added to
+#   the result E.  Since this latter addition can't be done in the
+#   same-cycle as the critical addition of a<<<5, this is a total of
+#   2+1+1 = 4 cycles per 2 rounds.
+# Additionally, if you want to avoid copying B, you have to rotate it
+# soon after use in this round so it is ready for use as the following
+# round's C.
+
+# e += (a <<< 5) + K + in[i] + (d^(b&(c^d)))		(0..19)
+# e += (a <<< 5) + K + in[i] + (b^c^d)			(20..39, 60..79)
+# e += (a <<< 5) + K + in[i] + (c&d) + (b&(c^d))	(40..59)
+#
+# The hard part is doing this with only two temporary registers.
+# Taking the most complex F(b,c,d) function, writing it as above
+# breaks it into two parts which can be accumulated into e separately.
+# Let's divide it into 4 parts.  These can be executed in a 7-cycle
+# loop, assuming an in-order triple issue machine
+# (quadruple counting xor-from-memory as 2):
+#
+#	in[i]		F1(c,d)		F2(b,c,d)	a<<<5
+#	mov in[i],T			(addl S,A)	(movl B,S)
+#	xor in[i+1],T					(rorl 5,S)
+#	xor in[i+2],T	movl D,S			(addl S,A)
+#	xor in[i+3],T	andl C,S
+#	rotl 1,T	addl S,E	movl D,S
+#	movl T,in[i]			xorl C,S
+#	addl T+K,E			andl B,S	rorl 2,B	//
+#	(mov in[i],T)			addl S,E	movl A,S
+#	(xor in[i+1],T)					rorl 5,S
+#	(xor in[i+2],T)	(movl C,S)			addl S,E
+#
+# In the above, we routinely read and write a register on the same cycle,
+# overlapping the beginning of one computation with the end of another.
+# I've tried to place the reads to the left of the writes, but some of the
+# overlapping operations from adjacent rounds (given in parentheses)
+# violate that.
+#
+# The "addl T+K,E" line is actually implemented using the lea instruction,
+# which (on a Pentium) requires that neither T nor K was modified on the
+# previous cycle.
+#
+# As you can see, in the absence of out-of-order execution, the first
+# column takes a minimum of 6 cycles (fetch, 3 XORs, rotate, add to E),
+# and I reserve a seventh cycle before the add to E so that I can use a
+# Pentium's lea instruction.
+#
+# The other three columns take 3, 4 and 3 cycles, respectively.
+# These can all be overlapped by 1 cycle assuming a superscalar
+# processor, for a total of 2+2+3 = 7 cycles.
+#
+# The other F() functions require 5 and 4 cycles, respectively.
+# Overlapped with the 3-cycle a<<<5 computation, that makes a total of 6
+# and 5 cycles, respectively.  If we overlap the beginning and end of the
+# Xi computation, we can get it down to 6 cycles, but below that, we'll
+# just have to waste a cycle.
+#
+# For the first 16 rounds, forming Xi is just a fetch, and the F()
+# function only requires 5 cycles, so the whole round can be pulled in
+# to 4 cycles.
+
+
+# Atom pairing rules (not yet optimized):
+# The Atom has a dual-issue in-order pipeline, similar to the
+# original Pentium.  However, the issue restrictions are different.
+# In particular, all memory-source operations must use port 0,
+# as must all rotates.
+#
+# Given that a round uses 4 fetches and 3 rotates, that's going to
+# require significant care to pair well.  It may take a completely
+# different implementation.
+#
+# LEA must use port 1, but apparently it has even worse address generation
+# interlock latency than the Pentium.  Oh well, it's still the best way
+# to do a 3-way add with a 32-bit immediate.
+
+
+# The first 16 rounds use the simplest F(b,c,d) = d^(b&(c^d)), and
+# don't need to form the in[i] equation, letting us reduce the round to
+# 4 cycles, after some reassignment of temporaries:
+
+#			movl D,S	(roll 5,T)	(addl S,A)	//
+#	mov in[i],T	xorl C,S	(addl T,A)
+#			andl B,S			rorl 2,B
+#	addl T+K,E	xorl D,S	movl A,T
+#			addl S,E	roll 5,T	(movl C,S)	//
+#	(mov in[i],T)	(xorl B,S)	addl T,E	
+
+# The // marks where the round function starts.  Each round expects the
+# first copy of D to S to have been done already.
+
+# The transition between rounds 15 and 16 is a bit tricky... the best
+# thing to do is to delay the computation of a<<<5 one cycle and move it back
+# to the S register.  That way, T is free as early as possible.
+#	in[i]		F(b,c,d)	a<<<5
+#	(addl T+K,A)	(xorl E,S)	(movl B,T)
+#			movl D,S	(roll 5,T)	(addl S,A)	//
+#	mov in[i],T	xorl C,S	(addl T,A)
+#			andl B,S			rorl 2,B
+#	addl T+K,E 	xorl D,S			(movl in[1],T)
+#	(xor in[i+1],T)	addl S,E	movl A,S
+#	(xor in[i+2],T)	rorl 2,B	roll 5,S	//
+#	(xor in[i+3],T)	(movl C,S)	addl S,E
+
+sub BODY_00_15
+{
+	local($n,$a,$b,$c,$d,$e)=@_;
+
+	&comment("00_15 $n");
+		&mov($S,$d) if ($n == 0);
+	&mov($T,&swtmp($n%16));		#  V Load Xi.
+		&xor($S,$c);		# U  Continue F() = d^(b&(c^d))
+		&and($S,$b);		#  V
+			&rotr($b,2);	# NP
+	&lea($e,&DWP(K1,$e,$T));	# U  Add Xi and K
+    if ($n < 15) {
+			&mov($T,$a);	#  V
+		&xor($S,$d);		# U 
+			&rotl($T,5);	# NP
+		&add($e,$S);		# U 
+		&mov($S,$c);		#  V Start of NEXT round's F()
+			&add($e,$T);	# U 
+    } else {
+	# This version provides the correct start for BODY_20_39
+		&xor($S,$d);		#  V
+	&mov($T,&swtmp(($n+1)%16));	# U  Start computing next Xi.
+		&add($e,$S);		#  V Add F()
+			&mov($S,$a);	# U  Start computing a<<<5
+	&xor($T,&swtmp(($n+3)%16));	#  V
+			&rotl($S,5);	# U 
+	&xor($T,&swtmp(($n+9)%16));	#  V
+    }
+}
+
+
+# A full round using F(b,c,d) = d^(b&(c^d)).  6 cycles of dependency chain.
+# This starts just before starting to compute F(); the Xi should have XORed
+# the first three values together.  (Break is at //)
+#
+#	in[i]		F(b,c,d)	a<<<5
+#	mov in[i],T	(xorl E,S)			(addl T+K,A)
+#	xor in[i+1],T	(addl S,A)	(movl B,S)
+#	xor in[i+2],T			(roll 5,S)	//
+#	xor in[i+3],T 	movl D,S  	(addl S,A)
+#	roll 1,T	xorl C,S
+#	movl T,in[i]	andl B,S			rorl 2,B
+#	addl T+K,E 	xorl D,S			(mov in[i],T)
+#	(xor in[i+1],T)	addl S,E	movl A,S
+#	(xor in[i+2],T)			roll 5,S	//
+#	(xor in[i+3],T)	(movl C,S)	addl S,E
+
+sub BODY_16_19
+{
+	local($n,$a,$b,$c,$d,$e)=@_;
+
+	&comment("16_19 $n");
+
+	&xor($T,&swtmp(($n+13)%16));	# U 
+			&add($a,$S);	#  V End of previous round
+		&mov($S,$d);		# U  Start current round's F()
+	&rotl($T,1);			#  V
+		&xor($S,$c);		# U 
+	&mov(&swtmp($n%16),$T);		# U Store computed Xi.
+		&and($S,$b);		#  V
+		&rotr($b,2);		# NP
+	&lea($e,&DWP(K1,$e,$T));	# U  Add Xi and K
+	&mov($T,&swtmp(($n+1)%16));	#  V Start computing next Xi.
+		&xor($S,$d);		# U
+	&xor($T,&swtmp(($n+3)%16));	#  V
+		&add($e,$S);		# U  Add F()
+			&mov($S,$a);	#  V Start computing a<<<5
+	&xor($T,&swtmp(($n+9)%16));	# U
+			&rotl($S,5);	# NP
+}
+
+# This is just like BODY_16_19, but computes a different F() = b^c^d
+#
+#	in[i]		F(b,c,d)	a<<<5
+#	mov in[i],T	(xorl E,S)			(addl T+K,A)
+#	xor in[i+1],T	(addl S,A)	(movl B,S)
+#	xor in[i+2],T			(roll 5,S)	//
+#	xor in[i+3],T 			(addl S,A)
+#	roll 1,T	movl D,S
+#	movl T,in[i]	xorl B,S			rorl 2,B
+#	addl T+K,E 	xorl C,S			(mov in[i],T)
+#	(xor in[i+1],T)	addl S,E	movl A,S
+#	(xor in[i+2],T)			roll 5,S	//
+#	(xor in[i+3],T)	(movl C,S)	addl S,E
+
+sub BODY_20_39	# And 60..79
+{
+	local($n,$a,$b,$c,$d,$e)=@_;
+	local $K=($n<40) ? K2 : K4;
+
+	&comment("20_39 $n");
+
+	&xor($T,&swtmp(($n+13)%16));	# U 
+			&add($a,$S);	#  V End of previous round
+	&rotl($T,1);			# U
+		&mov($S,$d);		#  V Start current round's F()
+	&mov(&swtmp($n%16),$T) if ($n < 77);	# Store computed Xi.
+		&xor($S,$b);		#  V
+		&rotr($b,2);		# NP
+	&lea($e,&DWP($K,$e,$T));	# U  Add Xi and K
+	&mov($T,&swtmp(($n+1)%16)) if ($n < 79); # Start computing next Xi.
+		&xor($S,$c);		# U
+	&xor($T,&swtmp(($n+3)%16)) if ($n < 79);
+		&add($e,$S);		# U  Add F1()
+			&mov($S,$a);	#  V Start computing a<<<5
+	&xor($T,&swtmp(($n+9)%16)) if ($n < 79);
+			&rotl($S,5);	# NP
+
+			&add($e,$S) if ($n == 79);
+}
+
+
+# This starts immediately after the LEA, and expects to need to finish
+# the previous round. (break is at //)
+#
+#	in[i]		F1(c,d)		F2(b,c,d)	a<<<5
+#	(addl T+K,E)			(andl C,S)	(rorl 2,C)
+#	mov in[i],T			(addl S,A)	(movl B,S)
+#	xor in[i+1],T					(rorl 5,S)
+#	xor in[i+2],T /	movl D,S			(addl S,A)
+#	xor in[i+3],T	andl C,S
+#	rotl 1,T	addl S,E	movl D,S
+#	(movl T,in[i])			xorl C,S
+#	addl T+K,E			andl B,S	rorl 2,B
+#	(mov in[i],T)			addl S,E	movl A,S
+#	(xor in[i+1],T)					rorl 5,S
+#	(xor in[i+2],T)	// (movl C,S)			addl S,E
+
+sub BODY_40_59
+{
+	local($n,$a,$b,$c,$d,$e)=@_;
+
+	&comment("40_59 $n");
+
+			&add($a,$S);	#  V End of previous round
+		&mov($S,$d);		# U  Start current round's F(1)
+	&xor($T,&swtmp(($n+13)%16));	#  V
+		&and($S,$c);		# U
+	&rotl($T,1);			# U	XXX Missed pairing
+		&add($e,$S);		#  V Add F1()
+		&mov($S,$d);		# U  Start current round's F2()
+	&mov(&swtmp($n%16),$T);		#  V Store computed Xi.
+		&xor($S,$c);		# U
+	&lea($e,&DWP(K3,$e,$T));	#  V
+		&and($S,$b);		# U	XXX Missed pairing
+		&rotr($b,2);		# NP
+	&mov($T,&swtmp(($n+1)%16));	# U  Start computing next Xi.
+		&add($e,$S);		#  V Add F2()
+		&mov($S,$a);		# U  Start computing a<<<5
+	&xor($T,&swtmp(($n+3)%16));	#  V
+			&rotl($S,5);	# NP
+	&xor($T,&swtmp(($n+9)%16));	# U
+}
+# The above code is NOT optimally paired for the Pentium.  (And thus,
+# presumably, Atom, which has a very similar dual-issue in-order pipeline.)
+# However, attempts to improve it make it slower on Phenom & i7.
+
+&function_begin("sha1_block_data_order",16);
+
+	local @V = ($A,$B,$C,$D,$E);
+	local @W = ($A,$B,$C);
+
+	&mov($S,&wparam(0));	# SHA_CTX *c
+	&mov($T,&wparam(1));	# const void *input
+	&mov($A,&wparam(2));	# size_t num
+	&stack_push(16);	# allocate X[16]
+	&shl($A,6);
+	&add($A,$T);
+	&mov(&wparam(2),$A);	# pointer beyond the end of input
+	&mov($E,&DWP(16,$S));# pre-load E
+	&mov($D,&DWP(12,$S));# pre-load D
+
+	&set_label("loop",16);
+
+	# copy input chunk to X, but reversing byte order!
+	&mov($W[2],&DWP(4*(0),$T));
+	&mov($W[1],&DWP(4*(1),$T));
+	&bswap($W[2]);
+	for ($i=0; $i<14; $i++) {
+		&mov($W[0],&DWP(4*($i+2),$T));
+		&bswap($W[1]);
+		&mov(&swtmp($i+0),$W[2]);
+		unshift(@W,pop(@W));
+	}
+	&bswap($W[1]);
+	&mov(&swtmp($i+0),$W[2]);
+	&mov(&swtmp($i+1),$W[1]);
+
+	&mov(&wparam(1),$T);	# redundant in 1st spin
+
+	# Reload A, B and C, which we use as temporaries in the copying
+	&mov($C,&DWP(8,$S));
+	&mov($B,&DWP(4,$S));
+	&mov($A,&DWP(0,$S));
+
+	for($i=0;$i<16;$i++)	{ &BODY_00_15($i,@V); unshift(@V,pop(@V)); }
+	for(;$i<20;$i++)	{ &BODY_16_19($i,@V); unshift(@V,pop(@V)); }
+	for(;$i<40;$i++)	{ &BODY_20_39($i,@V); unshift(@V,pop(@V)); }
+	for(;$i<60;$i++)	{ &BODY_40_59($i,@V); unshift(@V,pop(@V)); }
+	for(;$i<80;$i++)	{ &BODY_20_39($i,@V); unshift(@V,pop(@V)); }
+
+	(($V[4] eq $E) and ($V[0] eq $A)) or die;	# double-check
+
+	&comment("Loop trailer");
+
+	&mov($S,&wparam(0));	# re-load SHA_CTX*
+	&mov($T,&wparam(1));	# re-load input pointer
+
+	&add($E,&DWP(16,$S));
+	&add($D,&DWP(12,$S));
+	&add(&DWP(8,$S),$C);
+	&add(&DWP(4,$S),$B);
+	 &add($T,64);		# advance input pointer
+	&add(&DWP(0,$S),$A);
+	&mov(&DWP(12,$S),$D);
+	&mov(&DWP(16,$S),$E);
+
+	&cmp($T,&wparam(2));	# have we reached the end yet?
+	&jb(&label("loop"));
+
+	&stack_pop(16);
+&function_end("sha1_block_data_order");
+&asciz("SHA1 block transform for x86, CRYPTOGAMS by <appro\@openssl.org>");
+
+&asm_finish();

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-03  3:47   ` x86 SHA1: Faster than OpenSSL George Spelvin
@ 2009-08-03  7:36     ` Jonathan del Strother
  2009-08-04  1:40     ` Mark Lodato
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 129+ messages in thread
From: Jonathan del Strother @ 2009-08-03  7:36 UTC (permalink / raw)
  To: George Spelvin; +Cc: git, appro, appro

On Mon, Aug 3, 2009 at 4:47 AM, George Spelvin<linux@horizon.com> wrote:
> (Work in progress, state dump to mailing list archives.)
>
> This started when discussing git startup overhead due to the dynamic
> linker.  One big contributor is the openssl library, which is used only
> for its optimized x86 SHA-1 implementation.  So I took a look at it,
> with an eye to importing the code directly into the git source tree,
> and decided that I felt like trying to do better.
>

FWIW, this doesn't work on OS X / Darwin.  'as' doesn't take a --32
flag, it takes an -arch i386 flag.  After changing that, I get:

as -arch i386  -o sha1-586.o sha1-586.s
sha1-586.s:4:Unknown pseudo-op: .type
sha1-586.s:4:Rest of line ignored. 1st junk character valued 115 (s).
sha1-586.s:5:Alignment too large: 15. assumed.
sha1-586.s:19:Alignment too large: 15. assumed.
sha1-586.s:1438:Unknown pseudo-op: .size
sha1-586.s:1438:Rest of line ignored. 1st junk character valued 115 (s).
make: *** [sha1-586.o] Error 1

- at which point I have no idea how to fix it.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-03  3:47   ` x86 SHA1: Faster than OpenSSL George Spelvin
  2009-08-03  7:36     ` Jonathan del Strother
@ 2009-08-04  1:40     ` Mark Lodato
  2009-08-04  2:30     ` Linus Torvalds
  2009-08-18 21:26     ` Andy Polyakov
  3 siblings, 0 replies; 129+ messages in thread
From: Mark Lodato @ 2009-08-04  1:40 UTC (permalink / raw)
  To: George Spelvin; +Cc: git, appro, appro

On Sun, Aug 2, 2009 at 11:47 PM, George Spelvin<linux@horizon.com> wrote:
> Before      After       Gain    Processor
> 1.585248    1.353314    +17%    2500 MHz Phenom
> 3.249614    3.295619    -1.4%   1594 MHz P4
> 1.414512    1.352843    +4.5%   2.66 GHz i7
> 3.460635    3.284221    +5.4%   1596 MHz Athlon XP
> 4.077993    3.891826    +4.8%   1144 MHz Athlon
> 1.912161    1.623212    +17%    2100 MHz Athlon 64 X2
> 2.956432    2.940210    +0.55%  1794 MHz Mobile Celeron (fam 15 model 2)
>
> (Seconds to hash 500x 1 MB, best of 10 runs in all cases.)
>
> This is based on Andy Polyakov's GPL/BSD licensed cryptogams code, and
> (for now) uses the same perl preprocessor.   To test it, do the following:
> - Download Andy's original code from
>  http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz
> - "tar xz cryptogams-0.tar.gz"
> - "cd cryptogams-0/x86"
> - "patch < this_email" to create "sha1test.c", "sha1-586.h", "Makefile",
>   and "sha1-x86.pl".
> - "make"
> - Run ./586test (before) and ./x86test (after) and note the timings.

Note, to compile this on Ubuntu x86-64, I had to:
$ sudo apt-get install libc6-dev-i386

$ ./586test
 1/10: 2.016621 s
 2/10: 2.030742 s
 3/10: 2.027333 s
 4/10: 2.024018 s
 5/10: 2.022306 s
 6/10: 2.022418 s
 7/10: 2.047103 s
 8/10: 2.035467 s
 9/10: 2.032237 s
10/10: 2.029231 s
Minimum time to hash 500000000 bytes: 2.016621
$ ./x86test
 1/10: 1.818661 s
 2/10: 1.814856 s
 3/10: 1.816232 s
 4/10: 1.815208 s
 5/10: 1.834047 s
 6/10: 1.843020 s
 7/10: 1.819564 s
 8/10: 1.815560 s
 9/10: 1.824232 s
10/10: 1.820943 s
Minimum time to hash 500000000 bytes: 1.814856
$ python -c 'print 2.016621 / 1.814856'
1.11117410968
$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 CPU          6300  @ 1.86GHz
stepping        : 2
cpu MHz         : 1861.825
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant
_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3
cx16 xtpr pdcm lahf_lm tpr_shadow
bogomips        : 3723.65
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 CPU          6300  @ 1.86GHz
stepping        : 2
cpu MHz         : 1861.825
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant
_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3
cx16 xtpr pdcm lahf_lm tpr_shadow
bogomips        : 3724.01
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:



I imagine that you can get a bigger speedup by making a 64-bit version
(but maybe not).  Either way, it would be nice if x86-64 users did not
have to install an additional package to compile.

Cheers,
Mark

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-03  3:47   ` x86 SHA1: Faster than OpenSSL George Spelvin
  2009-08-03  7:36     ` Jonathan del Strother
  2009-08-04  1:40     ` Mark Lodato
@ 2009-08-04  2:30     ` Linus Torvalds
  2009-08-04  2:51       ` Linus Torvalds
  2009-08-04  4:48       ` George Spelvin
  2009-08-18 21:26     ` Andy Polyakov
  3 siblings, 2 replies; 129+ messages in thread
From: Linus Torvalds @ 2009-08-04  2:30 UTC (permalink / raw)
  To: George Spelvin; +Cc: git, appro, appro



On Sun, 2 Aug 2009, George Spelvin wrote:
> 
> The original code was excellent, but it was optimized when the P4 was new.
> After a bit of tweaking, I've inflicted a slight (1.4%) slowdown on the
> P4, but a small-but-noticeable speedup on a variety of other processors.
> 
> Before      After       Gain    Processor
> 1.585248    1.353314	+17%	2500 MHz Phenom
> 3.249614    3.295619	-1.4%	1594 MHz P4
> 1.414512    1.352843	+4.5%	2.66 GHz i7
> 3.460635    3.284221	+5.4%	1596 MHz Athlon XP
> 4.077993    3.891826	+4.8%	1144 MHz Athlon
> 1.912161    1.623212	+17%	2100 MHz Athlon 64 X2
> 2.956432    2.940210	+0.55%	1794 MHz Mobile Celeron (fam 15 model 2)

It would be better to have a more git-centric benchmark that actually 
shows some real git load, rather than a sha1-only microbenchmark.

The thing that I'd prefer is simply

	git fsck --full

on the Linux kernel archive. For me (with a fast machine), it takes about 
4m30s with the OpenSSL SHA1, and takes 6m40s with the Mozilla SHA1 (ie 
using a NO_OPENSSL=1 build).

So that's an example of a load that is actually very sensitive to SHA1 
performance (more so than _most_ git loads, I suspect), and at the same 
time is a real git load rather than some SHA1-only microbenchmark. It also 
shows very clearly why we default to the OpenSSL version over the Mozilla 
one.

NOTE! I didn't do multiple runs to see how stable the numbers are, and 
so it's possible that I exaggerated the OpenSSL advantage over the 
Mozilla-SHA1 code. Or vice versa. My point is really only that I don't 
know how meaningful a "500 x 1M SHA1" benchmark is, while I know that a 
"git fsck" benchmark has at least _some_ real life value.

		Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04  2:30     ` Linus Torvalds
@ 2009-08-04  2:51       ` Linus Torvalds
  2009-08-04  3:07         ` Jon Smirl
  2009-08-18 21:50         ` Andy Polyakov
  2009-08-04  4:48       ` George Spelvin
  1 sibling, 2 replies; 129+ messages in thread
From: Linus Torvalds @ 2009-08-04  2:51 UTC (permalink / raw)
  To: George Spelvin; +Cc: git, appro, appro



On Mon, 3 Aug 2009, Linus Torvalds wrote:
> 
> The thing that I'd prefer is simply
> 
> 	git fsck --full
> 
> on the Linux kernel archive. For me (with a fast machine), it takes about 
> 4m30s with the OpenSSL SHA1, and takes 6m40s with the Mozilla SHA1 (ie 
> using a NO_OPENSSL=1 build).
> 
> So that's an example of a load that is actually very sensitive to SHA1 
> performance (more so than _most_ git loads, I suspect), and at the same 
> time is a real git load rather than some SHA1-only microbenchmark. It also 
> shows very clearly why we default to the OpenSSL version over the Mozilla 
> one.

"perf report --sort comm,dso,symbol" profiling shows the following for 
'git fsck --full' on the kernel repo, using the Mozilla SHA1:

    47.69%               git  /home/torvalds/git/git     [.] moz_SHA1_Update
    22.98%               git  /lib64/libz.so.1.2.3       [.] inflate_fast
     7.32%               git  /lib64/libc-2.10.1.so      [.] __GI_memcpy
     4.66%               git  /lib64/libz.so.1.2.3       [.] inflate
     3.76%               git  /lib64/libz.so.1.2.3       [.] adler32
     2.86%               git  /lib64/libz.so.1.2.3       [.] inflate_table
     2.41%               git  /home/torvalds/git/git     [.] lookup_object
     1.31%               git  /lib64/libc-2.10.1.so      [.] _int_malloc
     0.84%               git  /home/torvalds/git/git     [.] patch_delta
     0.78%               git  [kernel]                   [k] hpet_next_event

so yeah, SHA1 performance matters. Judging by the OpenSSL numbers, the 
OpenSSL SHA1 implementation must be about twice as fast as the C version 
we use.

That said, under "normal" git usage models, the SHA1 costs are almost 
invisible. So git-fsck is definitely a fairly unusual case that stresses 
the SHA1 performance more than most git loads.

		Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04  2:51       ` Linus Torvalds
@ 2009-08-04  3:07         ` Jon Smirl
  2009-08-04  5:01           ` George Spelvin
  2009-08-18 21:50         ` Andy Polyakov
  1 sibling, 1 reply; 129+ messages in thread
From: Jon Smirl @ 2009-08-04  3:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: George Spelvin, git, appro, appro

On Mon, Aug 3, 2009 at 10:51 PM, Linus
Torvalds<torvalds@linux-foundation.org> wrote:
>
>
> On Mon, 3 Aug 2009, Linus Torvalds wrote:
>>
>> The thing that I'd prefer is simply
>>
>>       git fsck --full
>>
>> on the Linux kernel archive. For me (with a fast machine), it takes about
>> 4m30s with the OpenSSL SHA1, and takes 6m40s with the Mozilla SHA1 (ie
>> using a NO_OPENSSL=1 build).
>>
>> So that's an example of a load that is actually very sensitive to SHA1
>> performance (more so than _most_ git loads, I suspect), and at the same
>> time is a real git load rather than some SHA1-only microbenchmark. It also
>> shows very clearly why we default to the OpenSSL version over the Mozilla
>> one.
>
> "perf report --sort comm,dso,symbol" profiling shows the following for
> 'git fsck --full' on the kernel repo, using the Mozilla SHA1:
>
>    47.69%               git  /home/torvalds/git/git     [.] moz_SHA1_Update
>    22.98%               git  /lib64/libz.so.1.2.3       [.] inflate_fast
>     7.32%               git  /lib64/libc-2.10.1.so      [.] __GI_memcpy
>     4.66%               git  /lib64/libz.so.1.2.3       [.] inflate
>     3.76%               git  /lib64/libz.so.1.2.3       [.] adler32
>     2.86%               git  /lib64/libz.so.1.2.3       [.] inflate_table
>     2.41%               git  /home/torvalds/git/git     [.] lookup_object
>     1.31%               git  /lib64/libc-2.10.1.so      [.] _int_malloc
>     0.84%               git  /home/torvalds/git/git     [.] patch_delta
>     0.78%               git  [kernel]                   [k] hpet_next_event
>
> so yeah, SHA1 performance matters. Judging by the OpenSSL numbers, the
> OpenSSL SHA1 implementation must be about twice as fast as the C version
> we use.

Would there happen to be a SHA1 implementation around that can compute
the SHA1 without first decompressing the data? Databases gain a lot of
speed by using special algorithms that can directly operate on the
compressed data.

>
> That said, under "normal" git usage models, the SHA1 costs are almost
> invisible. So git-fsck is definitely a fairly unusual case that stresses
> the SHA1 performance more than most git loads.
>
>                Linus
> --
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04  2:30     ` Linus Torvalds
  2009-08-04  2:51       ` Linus Torvalds
@ 2009-08-04  4:48       ` George Spelvin
  2009-08-04  6:30         ` Linus Torvalds
  2009-08-04  6:40         ` Linus Torvalds
  1 sibling, 2 replies; 129+ messages in thread
From: George Spelvin @ 2009-08-04  4:48 UTC (permalink / raw)
  To: torvalds; +Cc: git, linux

> It would be better to have a more git-centric benchmark that actually 
> shows some real git load, rather than a sha1-only microbenchmark.
> 
> The thing that I'd prefer is simply
>
>	git fsck --full
>
> on the Linux kernel archive. For me (with a fast machine), it takes about 
> 4m30s with the OpenSSL SHA1, and takes 6m40s with the Mozilla SHA1 (ie 
> using a NO_OPENSSL=1 build).

The actual goal of this effort is to address the dynamic linker startup
time issues by removing the second-largest contributor after libcurl,
namely openssl.  Optimizing the assembly code is just the fun part. ;-)

Anyway, on the git repository:

[1273]$ time x/git-fsck --full			(New SHA1 code)
dangling tree 524973049a7e4593df4af41e0564912f678a41ac
dangling tree 7da7d73185a1df5c2a477d2ee5599ac8a58cad56

real    0m59.306s
user    0m58.760s
sys     0m0.550s
[1274]$ time ./git-fsck --full			(OpenSSL)
dangling tree 524973049a7e4593df4af41e0564912f678a41ac
dangling tree 7da7d73185a1df5c2a477d2ee5599ac8a58cad56

real    1m0.364s
user    0m59.970s
sys     0m0.400s

1.6% is a pretty minor difference, especially as the machine was running
a backup at the time (but it's a quad-core with near-zero CPU usage;
the backup's load is all I/O).

On the full Linux repository, I repacked it first to make sure that
everything was in RAM, and I have the first result:

[517]$ time ~/git/x/git-fsck --full		(New SHA1 code)

real    10m12.702s
user    9m48.410s
sys     0m23.350s
[518]$ time ~/git/git-fsck --full		(OpenSSL)

real    10m26.083s
user    10m2.800s
sys     0m22.000s

Again, 2.2% is not a huge improvement.  But my only goal was not to be worse.

> So that's an example of a load that is actually very sensitive to SHA1 
> performance (more so than _most_ git loads, I suspect), and at the same 
> time is a real git load rather than some SHA1-only microbenchmark. It also 
> shows very clearly why we default to the OpenSSL version over the Mozilla 
> one.

I wasn't questioning *that*.  As I said, I was just doing the fun part
of importing a heavily-optimized OpenSSL-like SHA1 implementation into
the git source tree.

(The un-fun part is modifying the build process to detect the target
processor and include the right asm automatically.)

Anyway, if you want to test it, here's a crude x86_32-only patch to the
git tree.  "make NO_OPENSSL=1" to use the new code.

diff --git a/Makefile b/Makefile
index daf4296..8531c39 100644
--- a/Makefile
+++ b/Makefile
@@ -1176,8 +1176,10 @@ ifdef ARM_SHA1
 	LIB_OBJS += arm/sha1.o arm/sha1_arm.o
 else
 ifdef MOZILLA_SHA1
-	SHA1_HEADER = "mozilla-sha1/sha1.h"
-	LIB_OBJS += mozilla-sha1/sha1.o
+#	SHA1_HEADER = "mozilla-sha1/sha1.h"
+#	LIB_OBJS += mozilla-sha1/sha1.o
+	SHA1_HEADER = "x86/sha1.h"
+	LIB_OBJS += x86/sha1.o x86/sha1-x86.o
 else
 	SHA1_HEADER = <openssl/sha.h>
 	EXTLIBS += $(LIB_4_CRYPTO)
diff --git a/x86/sha1-x86.s b/x86/sha1-x86.s
new file mode 100644
index 0000000..96796d4
--- /dev/null
+++ b/x86/sha1-x86.s
@@ -0,0 +1,1372 @@
+.file	"sha1-586.s"
+.text
+.globl	sha1_block_data_order
+.type	sha1_block_data_order,@function
+.align	16
+sha1_block_data_order:
+	pushl	%ebp
+	pushl	%ebx
+	pushl	%esi
+	pushl	%edi
+	movl	20(%esp),%edi
+	movl	24(%esp),%esi
+	movl	28(%esp),%eax
+	subl	$64,%esp
+	shll	$6,%eax
+	addl	%esi,%eax
+	movl	%eax,92(%esp)
+	movl	16(%edi),%ebp
+	movl	12(%edi),%edx
+.align	16
+.L000loop:
+	movl	(%esi),%ecx
+	movl	4(%esi),%ebx
+	bswap	%ecx
+	movl	8(%esi),%eax
+	bswap	%ebx
+	movl	%ecx,(%esp)
+	movl	12(%esi),%ecx
+	bswap	%eax
+	movl	%ebx,4(%esp)
+	movl	16(%esi),%ebx
+	bswap	%ecx
+	movl	%eax,8(%esp)
+	movl	20(%esi),%eax
+	bswap	%ebx
+	movl	%ecx,12(%esp)
+	movl	24(%esi),%ecx
+	bswap	%eax
+	movl	%ebx,16(%esp)
+	movl	28(%esi),%ebx
+	bswap	%ecx
+	movl	%eax,20(%esp)
+	movl	32(%esi),%eax
+	bswap	%ebx
+	movl	%ecx,24(%esp)
+	movl	36(%esi),%ecx
+	bswap	%eax
+	movl	%ebx,28(%esp)
+	movl	40(%esi),%ebx
+	bswap	%ecx
+	movl	%eax,32(%esp)
+	movl	44(%esi),%eax
+	bswap	%ebx
+	movl	%ecx,36(%esp)
+	movl	48(%esi),%ecx
+	bswap	%eax
+	movl	%ebx,40(%esp)
+	movl	52(%esi),%ebx
+	bswap	%ecx
+	movl	%eax,44(%esp)
+	movl	56(%esi),%eax
+	bswap	%ebx
+	movl	%ecx,48(%esp)
+	movl	60(%esi),%ecx
+	bswap	%eax
+	movl	%ebx,52(%esp)
+	bswap	%ecx
+	movl	%eax,56(%esp)
+	movl	%ecx,60(%esp)
+	movl	%esi,88(%esp)
+	movl	8(%edi),%ecx
+	movl	4(%edi),%ebx
+	movl	(%edi),%eax
+	/* 00_15 0 */
+	movl	%edx,%edi
+	movl	(%esp),%esi
+	xorl	%ecx,%edi
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	1518500249(%ebp,%esi,1),%ebp
+	movl	%eax,%esi
+	xorl	%edx,%edi
+	roll	$5,%esi
+	addl	%edi,%ebp
+	movl	%ecx,%edi
+	addl	%esi,%ebp
+	/* 00_15 1 */
+	movl	4(%esp),%esi
+	xorl	%ebx,%edi
+	andl	%eax,%edi
+	rorl	$2,%eax
+	leal	1518500249(%edx,%esi,1),%edx
+	movl	%ebp,%esi
+	xorl	%ecx,%edi
+	roll	$5,%esi
+	addl	%edi,%edx
+	movl	%ebx,%edi
+	addl	%esi,%edx
+	/* 00_15 2 */
+	movl	8(%esp),%esi
+	xorl	%eax,%edi
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	1518500249(%ecx,%esi,1),%ecx
+	movl	%edx,%esi
+	xorl	%ebx,%edi
+	roll	$5,%esi
+	addl	%edi,%ecx
+	movl	%eax,%edi
+	addl	%esi,%ecx
+	/* 00_15 3 */
+	movl	12(%esp),%esi
+	xorl	%ebp,%edi
+	andl	%edx,%edi
+	rorl	$2,%edx
+	leal	1518500249(%ebx,%esi,1),%ebx
+	movl	%ecx,%esi
+	xorl	%eax,%edi
+	roll	$5,%esi
+	addl	%edi,%ebx
+	movl	%ebp,%edi
+	addl	%esi,%ebx
+	/* 00_15 4 */
+	movl	16(%esp),%esi
+	xorl	%edx,%edi
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	1518500249(%eax,%esi,1),%eax
+	movl	%ebx,%esi
+	xorl	%ebp,%edi
+	roll	$5,%esi
+	addl	%edi,%eax
+	movl	%edx,%edi
+	addl	%esi,%eax
+	/* 00_15 5 */
+	movl	20(%esp),%esi
+	xorl	%ecx,%edi
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	1518500249(%ebp,%esi,1),%ebp
+	movl	%eax,%esi
+	xorl	%edx,%edi
+	roll	$5,%esi
+	addl	%edi,%ebp
+	movl	%ecx,%edi
+	addl	%esi,%ebp
+	/* 00_15 6 */
+	movl	24(%esp),%esi
+	xorl	%ebx,%edi
+	andl	%eax,%edi
+	rorl	$2,%eax
+	leal	1518500249(%edx,%esi,1),%edx
+	movl	%ebp,%esi
+	xorl	%ecx,%edi
+	roll	$5,%esi
+	addl	%edi,%edx
+	movl	%ebx,%edi
+	addl	%esi,%edx
+	/* 00_15 7 */
+	movl	28(%esp),%esi
+	xorl	%eax,%edi
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	1518500249(%ecx,%esi,1),%ecx
+	movl	%edx,%esi
+	xorl	%ebx,%edi
+	roll	$5,%esi
+	addl	%edi,%ecx
+	movl	%eax,%edi
+	addl	%esi,%ecx
+	/* 00_15 8 */
+	movl	32(%esp),%esi
+	xorl	%ebp,%edi
+	andl	%edx,%edi
+	rorl	$2,%edx
+	leal	1518500249(%ebx,%esi,1),%ebx
+	movl	%ecx,%esi
+	xorl	%eax,%edi
+	roll	$5,%esi
+	addl	%edi,%ebx
+	movl	%ebp,%edi
+	addl	%esi,%ebx
+	/* 00_15 9 */
+	movl	36(%esp),%esi
+	xorl	%edx,%edi
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	1518500249(%eax,%esi,1),%eax
+	movl	%ebx,%esi
+	xorl	%ebp,%edi
+	roll	$5,%esi
+	addl	%edi,%eax
+	movl	%edx,%edi
+	addl	%esi,%eax
+	/* 00_15 10 */
+	movl	40(%esp),%esi
+	xorl	%ecx,%edi
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	1518500249(%ebp,%esi,1),%ebp
+	movl	%eax,%esi
+	xorl	%edx,%edi
+	roll	$5,%esi
+	addl	%edi,%ebp
+	movl	%ecx,%edi
+	addl	%esi,%ebp
+	/* 00_15 11 */
+	movl	44(%esp),%esi
+	xorl	%ebx,%edi
+	andl	%eax,%edi
+	rorl	$2,%eax
+	leal	1518500249(%edx,%esi,1),%edx
+	movl	%ebp,%esi
+	xorl	%ecx,%edi
+	roll	$5,%esi
+	addl	%edi,%edx
+	movl	%ebx,%edi
+	addl	%esi,%edx
+	/* 00_15 12 */
+	movl	48(%esp),%esi
+	xorl	%eax,%edi
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	1518500249(%ecx,%esi,1),%ecx
+	movl	%edx,%esi
+	xorl	%ebx,%edi
+	roll	$5,%esi
+	addl	%edi,%ecx
+	movl	%eax,%edi
+	addl	%esi,%ecx
+	/* 00_15 13 */
+	movl	52(%esp),%esi
+	xorl	%ebp,%edi
+	andl	%edx,%edi
+	rorl	$2,%edx
+	leal	1518500249(%ebx,%esi,1),%ebx
+	movl	%ecx,%esi
+	xorl	%eax,%edi
+	roll	$5,%esi
+	addl	%edi,%ebx
+	movl	%ebp,%edi
+	addl	%esi,%ebx
+	/* 00_15 14 */
+	movl	56(%esp),%esi
+	xorl	%edx,%edi
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	1518500249(%eax,%esi,1),%eax
+	movl	%ebx,%esi
+	xorl	%ebp,%edi
+	roll	$5,%esi
+	addl	%edi,%eax
+	movl	%edx,%edi
+	addl	%esi,%eax
+	/* 00_15 15 */
+	movl	60(%esp),%esi
+	xorl	%ecx,%edi
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	1518500249(%ebp,%esi,1),%ebp
+	xorl	%edx,%edi
+	movl	(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	8(%esp),%esi
+	roll	$5,%edi
+	xorl	32(%esp),%esi
+	/* 16_19 16 */
+	xorl	52(%esp),%esi
+	addl	%edi,%ebp
+	movl	%ecx,%edi
+	roll	$1,%esi
+	xorl	%ebx,%edi
+	movl	%esi,(%esp)
+	andl	%eax,%edi
+	rorl	$2,%eax
+	leal	1518500249(%edx,%esi,1),%edx
+	movl	4(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	12(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	36(%esp),%esi
+	roll	$5,%edi
+	/* 16_19 17 */
+	xorl	56(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebx,%edi
+	roll	$1,%esi
+	xorl	%eax,%edi
+	movl	%esi,4(%esp)
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	1518500249(%ecx,%esi,1),%ecx
+	movl	8(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	16(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	40(%esp),%esi
+	roll	$5,%edi
+	/* 16_19 18 */
+	xorl	60(%esp),%esi
+	addl	%edi,%ecx
+	movl	%eax,%edi
+	roll	$1,%esi
+	xorl	%ebp,%edi
+	movl	%esi,8(%esp)
+	andl	%edx,%edi
+	rorl	$2,%edx
+	leal	1518500249(%ebx,%esi,1),%ebx
+	movl	12(%esp),%esi
+	xorl	%eax,%edi
+	xorl	20(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	44(%esp),%esi
+	roll	$5,%edi
+	/* 16_19 19 */
+	xorl	(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ebp,%edi
+	roll	$1,%esi
+	xorl	%edx,%edi
+	movl	%esi,12(%esp)
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	1518500249(%eax,%esi,1),%eax
+	movl	16(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	24(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	48(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 20 */
+	xorl	4(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,16(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	1859775393(%ebp,%esi,1),%ebp
+	movl	20(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	28(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	52(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 21 */
+	xorl	8(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,20(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	1859775393(%edx,%esi,1),%edx
+	movl	24(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	32(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	56(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 22 */
+	xorl	12(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	movl	%esi,24(%esp)
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	1859775393(%ecx,%esi,1),%ecx
+	movl	28(%esp),%esi
+	xorl	%eax,%edi
+	xorl	36(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	60(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 23 */
+	xorl	16(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	movl	%esi,28(%esp)
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	1859775393(%ebx,%esi,1),%ebx
+	movl	32(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	40(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 24 */
+	xorl	20(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	movl	%esi,32(%esp)
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	1859775393(%eax,%esi,1),%eax
+	movl	36(%esp),%esi
+	xorl	%edx,%edi
+	xorl	44(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	4(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 25 */
+	xorl	24(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,36(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	1859775393(%ebp,%esi,1),%ebp
+	movl	40(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	48(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	8(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 26 */
+	xorl	28(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,40(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	1859775393(%edx,%esi,1),%edx
+	movl	44(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	52(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	12(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 27 */
+	xorl	32(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	movl	%esi,44(%esp)
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	1859775393(%ecx,%esi,1),%ecx
+	movl	48(%esp),%esi
+	xorl	%eax,%edi
+	xorl	56(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	16(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 28 */
+	xorl	36(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	movl	%esi,48(%esp)
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	1859775393(%ebx,%esi,1),%ebx
+	movl	52(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	60(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	20(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 29 */
+	xorl	40(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	movl	%esi,52(%esp)
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	1859775393(%eax,%esi,1),%eax
+	movl	56(%esp),%esi
+	xorl	%edx,%edi
+	xorl	(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	24(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 30 */
+	xorl	44(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,56(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	1859775393(%ebp,%esi,1),%ebp
+	movl	60(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	4(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	28(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 31 */
+	xorl	48(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,60(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	1859775393(%edx,%esi,1),%edx
+	movl	(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	8(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	32(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 32 */
+	xorl	52(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	movl	%esi,(%esp)
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	1859775393(%ecx,%esi,1),%ecx
+	movl	4(%esp),%esi
+	xorl	%eax,%edi
+	xorl	12(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	36(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 33 */
+	xorl	56(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	movl	%esi,4(%esp)
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	1859775393(%ebx,%esi,1),%ebx
+	movl	8(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	16(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	40(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 34 */
+	xorl	60(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	movl	%esi,8(%esp)
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	1859775393(%eax,%esi,1),%eax
+	movl	12(%esp),%esi
+	xorl	%edx,%edi
+	xorl	20(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	44(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 35 */
+	xorl	(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,12(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	1859775393(%ebp,%esi,1),%ebp
+	movl	16(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	24(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	48(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 36 */
+	xorl	4(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,16(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	1859775393(%edx,%esi,1),%edx
+	movl	20(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	28(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	52(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 37 */
+	xorl	8(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	movl	%esi,20(%esp)
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	1859775393(%ecx,%esi,1),%ecx
+	movl	24(%esp),%esi
+	xorl	%eax,%edi
+	xorl	32(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	56(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 38 */
+	xorl	12(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	movl	%esi,24(%esp)
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	1859775393(%ebx,%esi,1),%ebx
+	movl	28(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	36(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	60(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 39 */
+	xorl	16(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	movl	%esi,28(%esp)
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	1859775393(%eax,%esi,1),%eax
+	movl	32(%esp),%esi
+	xorl	%edx,%edi
+	xorl	40(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	(%esp),%esi
+	roll	$5,%edi
+	/* 40_59 40 */
+	addl	%edi,%eax
+	movl	%edx,%edi
+	xorl	20(%esp),%esi
+	andl	%ecx,%edi
+	roll	$1,%esi
+	addl	%edi,%ebp
+	movl	%edx,%edi
+	movl	%esi,32(%esp)
+	xorl	%ecx,%edi
+	leal	2400959708(%ebp,%esi,1),%ebp
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	movl	36(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	44(%esp),%esi
+	roll	$5,%edi
+	xorl	4(%esp),%esi
+	/* 40_59 41 */
+	addl	%edi,%ebp
+	movl	%ecx,%edi
+	xorl	24(%esp),%esi
+	andl	%ebx,%edi
+	roll	$1,%esi
+	addl	%edi,%edx
+	movl	%ecx,%edi
+	movl	%esi,36(%esp)
+	xorl	%ebx,%edi
+	leal	2400959708(%edx,%esi,1),%edx
+	andl	%eax,%edi
+	rorl	$2,%eax
+	movl	40(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	48(%esp),%esi
+	roll	$5,%edi
+	xorl	8(%esp),%esi
+	/* 40_59 42 */
+	addl	%edi,%edx
+	movl	%ebx,%edi
+	xorl	28(%esp),%esi
+	andl	%eax,%edi
+	roll	$1,%esi
+	addl	%edi,%ecx
+	movl	%ebx,%edi
+	movl	%esi,40(%esp)
+	xorl	%eax,%edi
+	leal	2400959708(%ecx,%esi,1),%ecx
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	movl	44(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	52(%esp),%esi
+	roll	$5,%edi
+	xorl	12(%esp),%esi
+	/* 40_59 43 */
+	addl	%edi,%ecx
+	movl	%eax,%edi
+	xorl	32(%esp),%esi
+	andl	%ebp,%edi
+	roll	$1,%esi
+	addl	%edi,%ebx
+	movl	%eax,%edi
+	movl	%esi,44(%esp)
+	xorl	%ebp,%edi
+	leal	2400959708(%ebx,%esi,1),%ebx
+	andl	%edx,%edi
+	rorl	$2,%edx
+	movl	48(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	56(%esp),%esi
+	roll	$5,%edi
+	xorl	16(%esp),%esi
+	/* 40_59 44 */
+	addl	%edi,%ebx
+	movl	%ebp,%edi
+	xorl	36(%esp),%esi
+	andl	%edx,%edi
+	roll	$1,%esi
+	addl	%edi,%eax
+	movl	%ebp,%edi
+	movl	%esi,48(%esp)
+	xorl	%edx,%edi
+	leal	2400959708(%eax,%esi,1),%eax
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	movl	52(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	60(%esp),%esi
+	roll	$5,%edi
+	xorl	20(%esp),%esi
+	/* 40_59 45 */
+	addl	%edi,%eax
+	movl	%edx,%edi
+	xorl	40(%esp),%esi
+	andl	%ecx,%edi
+	roll	$1,%esi
+	addl	%edi,%ebp
+	movl	%edx,%edi
+	movl	%esi,52(%esp)
+	xorl	%ecx,%edi
+	leal	2400959708(%ebp,%esi,1),%ebp
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	movl	56(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	(%esp),%esi
+	roll	$5,%edi
+	xorl	24(%esp),%esi
+	/* 40_59 46 */
+	addl	%edi,%ebp
+	movl	%ecx,%edi
+	xorl	44(%esp),%esi
+	andl	%ebx,%edi
+	roll	$1,%esi
+	addl	%edi,%edx
+	movl	%ecx,%edi
+	movl	%esi,56(%esp)
+	xorl	%ebx,%edi
+	leal	2400959708(%edx,%esi,1),%edx
+	andl	%eax,%edi
+	rorl	$2,%eax
+	movl	60(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	4(%esp),%esi
+	roll	$5,%edi
+	xorl	28(%esp),%esi
+	/* 40_59 47 */
+	addl	%edi,%edx
+	movl	%ebx,%edi
+	xorl	48(%esp),%esi
+	andl	%eax,%edi
+	roll	$1,%esi
+	addl	%edi,%ecx
+	movl	%ebx,%edi
+	movl	%esi,60(%esp)
+	xorl	%eax,%edi
+	leal	2400959708(%ecx,%esi,1),%ecx
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	movl	(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	8(%esp),%esi
+	roll	$5,%edi
+	xorl	32(%esp),%esi
+	/* 40_59 48 */
+	addl	%edi,%ecx
+	movl	%eax,%edi
+	xorl	52(%esp),%esi
+	andl	%ebp,%edi
+	roll	$1,%esi
+	addl	%edi,%ebx
+	movl	%eax,%edi
+	movl	%esi,(%esp)
+	xorl	%ebp,%edi
+	leal	2400959708(%ebx,%esi,1),%ebx
+	andl	%edx,%edi
+	rorl	$2,%edx
+	movl	4(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	12(%esp),%esi
+	roll	$5,%edi
+	xorl	36(%esp),%esi
+	/* 40_59 49 */
+	addl	%edi,%ebx
+	movl	%ebp,%edi
+	xorl	56(%esp),%esi
+	andl	%edx,%edi
+	roll	$1,%esi
+	addl	%edi,%eax
+	movl	%ebp,%edi
+	movl	%esi,4(%esp)
+	xorl	%edx,%edi
+	leal	2400959708(%eax,%esi,1),%eax
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	movl	8(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	16(%esp),%esi
+	roll	$5,%edi
+	xorl	40(%esp),%esi
+	/* 40_59 50 */
+	addl	%edi,%eax
+	movl	%edx,%edi
+	xorl	60(%esp),%esi
+	andl	%ecx,%edi
+	roll	$1,%esi
+	addl	%edi,%ebp
+	movl	%edx,%edi
+	movl	%esi,8(%esp)
+	xorl	%ecx,%edi
+	leal	2400959708(%ebp,%esi,1),%ebp
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	movl	12(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	20(%esp),%esi
+	roll	$5,%edi
+	xorl	44(%esp),%esi
+	/* 40_59 51 */
+	addl	%edi,%ebp
+	movl	%ecx,%edi
+	xorl	(%esp),%esi
+	andl	%ebx,%edi
+	roll	$1,%esi
+	addl	%edi,%edx
+	movl	%ecx,%edi
+	movl	%esi,12(%esp)
+	xorl	%ebx,%edi
+	leal	2400959708(%edx,%esi,1),%edx
+	andl	%eax,%edi
+	rorl	$2,%eax
+	movl	16(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	24(%esp),%esi
+	roll	$5,%edi
+	xorl	48(%esp),%esi
+	/* 40_59 52 */
+	addl	%edi,%edx
+	movl	%ebx,%edi
+	xorl	4(%esp),%esi
+	andl	%eax,%edi
+	roll	$1,%esi
+	addl	%edi,%ecx
+	movl	%ebx,%edi
+	movl	%esi,16(%esp)
+	xorl	%eax,%edi
+	leal	2400959708(%ecx,%esi,1),%ecx
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	movl	20(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	28(%esp),%esi
+	roll	$5,%edi
+	xorl	52(%esp),%esi
+	/* 40_59 53 */
+	addl	%edi,%ecx
+	movl	%eax,%edi
+	xorl	8(%esp),%esi
+	andl	%ebp,%edi
+	roll	$1,%esi
+	addl	%edi,%ebx
+	movl	%eax,%edi
+	movl	%esi,20(%esp)
+	xorl	%ebp,%edi
+	leal	2400959708(%ebx,%esi,1),%ebx
+	andl	%edx,%edi
+	rorl	$2,%edx
+	movl	24(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	32(%esp),%esi
+	roll	$5,%edi
+	xorl	56(%esp),%esi
+	/* 40_59 54 */
+	addl	%edi,%ebx
+	movl	%ebp,%edi
+	xorl	12(%esp),%esi
+	andl	%edx,%edi
+	roll	$1,%esi
+	addl	%edi,%eax
+	movl	%ebp,%edi
+	movl	%esi,24(%esp)
+	xorl	%edx,%edi
+	leal	2400959708(%eax,%esi,1),%eax
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	movl	28(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	36(%esp),%esi
+	roll	$5,%edi
+	xorl	60(%esp),%esi
+	/* 40_59 55 */
+	addl	%edi,%eax
+	movl	%edx,%edi
+	xorl	16(%esp),%esi
+	andl	%ecx,%edi
+	roll	$1,%esi
+	addl	%edi,%ebp
+	movl	%edx,%edi
+	movl	%esi,28(%esp)
+	xorl	%ecx,%edi
+	leal	2400959708(%ebp,%esi,1),%ebp
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	movl	32(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	40(%esp),%esi
+	roll	$5,%edi
+	xorl	(%esp),%esi
+	/* 40_59 56 */
+	addl	%edi,%ebp
+	movl	%ecx,%edi
+	xorl	20(%esp),%esi
+	andl	%ebx,%edi
+	roll	$1,%esi
+	addl	%edi,%edx
+	movl	%ecx,%edi
+	movl	%esi,32(%esp)
+	xorl	%ebx,%edi
+	leal	2400959708(%edx,%esi,1),%edx
+	andl	%eax,%edi
+	rorl	$2,%eax
+	movl	36(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	44(%esp),%esi
+	roll	$5,%edi
+	xorl	4(%esp),%esi
+	/* 40_59 57 */
+	addl	%edi,%edx
+	movl	%ebx,%edi
+	xorl	24(%esp),%esi
+	andl	%eax,%edi
+	roll	$1,%esi
+	addl	%edi,%ecx
+	movl	%ebx,%edi
+	movl	%esi,36(%esp)
+	xorl	%eax,%edi
+	leal	2400959708(%ecx,%esi,1),%ecx
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	movl	40(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	48(%esp),%esi
+	roll	$5,%edi
+	xorl	8(%esp),%esi
+	/* 40_59 58 */
+	addl	%edi,%ecx
+	movl	%eax,%edi
+	xorl	28(%esp),%esi
+	andl	%ebp,%edi
+	roll	$1,%esi
+	addl	%edi,%ebx
+	movl	%eax,%edi
+	movl	%esi,40(%esp)
+	xorl	%ebp,%edi
+	leal	2400959708(%ebx,%esi,1),%ebx
+	andl	%edx,%edi
+	rorl	$2,%edx
+	movl	44(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	52(%esp),%esi
+	roll	$5,%edi
+	xorl	12(%esp),%esi
+	/* 40_59 59 */
+	addl	%edi,%ebx
+	movl	%ebp,%edi
+	xorl	32(%esp),%esi
+	andl	%edx,%edi
+	roll	$1,%esi
+	addl	%edi,%eax
+	movl	%ebp,%edi
+	movl	%esi,44(%esp)
+	xorl	%edx,%edi
+	leal	2400959708(%eax,%esi,1),%eax
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	movl	48(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	56(%esp),%esi
+	roll	$5,%edi
+	xorl	16(%esp),%esi
+	/* 20_39 60 */
+	xorl	36(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,48(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	3395469782(%ebp,%esi,1),%ebp
+	movl	52(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	60(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	20(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 61 */
+	xorl	40(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,52(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	3395469782(%edx,%esi,1),%edx
+	movl	56(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	24(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 62 */
+	xorl	44(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	movl	%esi,56(%esp)
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	3395469782(%ecx,%esi,1),%ecx
+	movl	60(%esp),%esi
+	xorl	%eax,%edi
+	xorl	4(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	28(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 63 */
+	xorl	48(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	movl	%esi,60(%esp)
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	3395469782(%ebx,%esi,1),%ebx
+	movl	(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	8(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	32(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 64 */
+	xorl	52(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	movl	%esi,(%esp)
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	3395469782(%eax,%esi,1),%eax
+	movl	4(%esp),%esi
+	xorl	%edx,%edi
+	xorl	12(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	36(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 65 */
+	xorl	56(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,4(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	3395469782(%ebp,%esi,1),%ebp
+	movl	8(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	16(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	40(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 66 */
+	xorl	60(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,8(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	3395469782(%edx,%esi,1),%edx
+	movl	12(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	20(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	44(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 67 */
+	xorl	(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	movl	%esi,12(%esp)
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	3395469782(%ecx,%esi,1),%ecx
+	movl	16(%esp),%esi
+	xorl	%eax,%edi
+	xorl	24(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	48(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 68 */
+	xorl	4(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	movl	%esi,16(%esp)
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	3395469782(%ebx,%esi,1),%ebx
+	movl	20(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	28(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	52(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 69 */
+	xorl	8(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	movl	%esi,20(%esp)
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	3395469782(%eax,%esi,1),%eax
+	movl	24(%esp),%esi
+	xorl	%edx,%edi
+	xorl	32(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	56(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 70 */
+	xorl	12(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,24(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	3395469782(%ebp,%esi,1),%ebp
+	movl	28(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	36(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	60(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 71 */
+	xorl	16(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,28(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	3395469782(%edx,%esi,1),%edx
+	movl	32(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	40(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 72 */
+	xorl	20(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	movl	%esi,32(%esp)
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	3395469782(%ecx,%esi,1),%ecx
+	movl	36(%esp),%esi
+	xorl	%eax,%edi
+	xorl	44(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	4(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 73 */
+	xorl	24(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	movl	%esi,36(%esp)
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	3395469782(%ebx,%esi,1),%ebx
+	movl	40(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	48(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	8(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 74 */
+	xorl	28(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	movl	%esi,40(%esp)
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	3395469782(%eax,%esi,1),%eax
+	movl	44(%esp),%esi
+	xorl	%edx,%edi
+	xorl	52(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	12(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 75 */
+	xorl	32(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,44(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	3395469782(%ebp,%esi,1),%ebp
+	movl	48(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	56(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	16(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 76 */
+	xorl	36(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,48(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	3395469782(%edx,%esi,1),%edx
+	movl	52(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	60(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	20(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 77 */
+	xorl	40(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	3395469782(%ecx,%esi,1),%ecx
+	movl	56(%esp),%esi
+	xorl	%eax,%edi
+	xorl	(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	24(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 78 */
+	xorl	44(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	3395469782(%ebx,%esi,1),%ebx
+	movl	60(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	4(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	28(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 79 */
+	xorl	48(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	3395469782(%eax,%esi,1),%eax
+	xorl	%edx,%edi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	roll	$5,%edi
+	addl	%edi,%eax
+	/* Loop trailer */
+	movl	84(%esp),%edi
+	movl	88(%esp),%esi
+	addl	16(%edi),%ebp
+	addl	12(%edi),%edx
+	addl	%ecx,8(%edi)
+	addl	%ebx,4(%edi)
+	addl	$64,%esi
+	addl	%eax,(%edi)
+	movl	%edx,12(%edi)
+	movl	%ebp,16(%edi)
+	cmpl	92(%esp),%esi
+	jb	.L000loop
+	addl	$64,%esp
+	popl	%edi
+	popl	%esi
+	popl	%ebx
+	popl	%ebp
+	ret
+.L_sha1_block_data_order_end:
+.size	sha1_block_data_order,.L_sha1_block_data_order_end-sha1_block_data_order
+.byte	83,72,65,49,32,98,108,111,99,107,32,116,114,97,110,115,102,111,114,109,32,102,111,114,32,120,56,54,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0
diff --git a/x86/sha1.c b/x86/sha1.c
new file mode 100644
index 0000000..4c1a569
--- /dev/null
+++ b/x86/sha1.c
@@ -0,0 +1,81 @@
+/*
+ * SHA-1 implementation.
+ *
+ * Copyright (C) 2005 Paul Mackerras <paulus@samba.org>
+ *
+ * This version assumes we are running on a big-endian machine.
+ * It calls an external sha1_core() to process blocks of 64 bytes.
+ */
+#include <stdio.h>
+#include <string.h>
+#include <arpa/inet.h>	/* For htonl */
+#include "sha1.h"
+
+#define x86_sha1_core sha1_block_data_order
+extern void x86_sha1_core(uint32_t hash[5], const unsigned char *p,
+			  unsigned int nblocks);
+
+void x86_SHA1_Init(x86_SHA_CTX *c)
+{
+	/* Matches prefix of scontext structure */
+	static struct {
+		uint32_t hash[5];
+		uint64_t len;
+	} const iv = {
+		{ 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0 },
+		0
+	};
+
+	memcpy(c, &iv, sizeof iv);
+}
+
+void x86_SHA1_Update(x86_SHA_CTX *c, const void *p, unsigned long n)
+{
+	unsigned pos = (unsigned)c->len & 63;
+	unsigned long nb;
+
+	c->len += n;
+
+	/* Initial partial block */
+	if (pos) {
+		unsigned space = 64 - pos;
+		if (space > n)
+			goto end;
+		memcpy(c->buf + pos, p, space);
+		p += space;
+		n -= space;
+		x86_sha1_core(c->hash, c->buf, 1);
+	}
+
+	/* The big impressive middle */
+	nb = n >> 6;
+	if (nb) {
+		x86_sha1_core(c->hash, p, nb);
+		p += nb << 6;
+		n &= 63;
+	}
+	pos = 0;
+end:
+	/* Final partial block */
+	memcpy(c->buf + pos, p, n);
+}
+
+void x86_SHA1_Final(unsigned char *hash, x86_SHA_CTX *c)
+{
+	unsigned pos = (unsigned)c->len & 63;
+
+	c->buf[pos++] = 0x80;
+	if (pos > 56) {
+		memset(c->buf + pos, 0, 64 - pos);
+		x86_sha1_core(c->hash, c->buf, 1);
+		pos = 0;
+	}
+	memset(c->buf + pos, 0, 56 - pos);
+	/* Last two words are 64-bit *bit* count */
+	*(uint32_t *)(c->buf + 56) = htonl((uint32_t)(c->len >> 29));
+	*(uint32_t *)(c->buf + 60) = htonl((uint32_t)c->len << 3);
+	x86_sha1_core(c->hash, c->buf, 1);
+
+	for (pos = 0; pos < 5; pos++)
+		((uint32_t *)hash)[pos] = htonl(c->hash[pos]);
+}
diff --git a/x86/sha1.h b/x86/sha1.h
new file mode 100644
index 0000000..8988da9
--- /dev/null
+++ b/x86/sha1.h
@@ -0,0 +1,21 @@
+/*
+ * SHA-1 implementation.
+ *
+ * Copyright (C) 2005 Paul Mackerras <paulus@samba.org>
+ */
+#include <stdint.h>
+
+typedef struct {
+	uint32_t hash[5];
+	uint64_t len;
+	unsigned char buf[64];	/* Keep this aligned */
+} x86_SHA_CTX;
+
+void x86_SHA1_Init(x86_SHA_CTX *c);
+void x86_SHA1_Update(x86_SHA_CTX *c, const void *p, unsigned long n);
+void x86_SHA1_Final(unsigned char *hash, x86_SHA_CTX *c);
+
+#define git_SHA_CTX	x86_SHA_CTX
+#define git_SHA1_Init	x86_SHA1_Init
+#define git_SHA1_Update	x86_SHA1_Update
+#define git_SHA1_Final	x86_SHA1_Final

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04  3:07         ` Jon Smirl
@ 2009-08-04  5:01           ` George Spelvin
  2009-08-04 12:56             ` Jon Smirl
  0 siblings, 1 reply; 129+ messages in thread
From: George Spelvin @ 2009-08-04  5:01 UTC (permalink / raw)
  To: jonsmirl; +Cc: git, linux

> Would there happen to be a SHA1 implementation around that can compute
> the SHA1 without first decompressing the data? Databases gain a lot of
> speed by using special algorithms that can directly operate on the
> compressed data.

I can't imagine how.  In general, this requires that the compression
be carefully designed to be compatible with the algorithms, and SHA1
is specifically designed to depend on every bit of the input in
an un-analyzable way.

Also, git normally avoids hashing objects that it doesn't need
uncompressed for some other reason.  git-fsck is a notable exception,
but I think the idea of creating special optimized code paths for that
interferes with its reliability and robustness goals.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04  4:48       ` George Spelvin
@ 2009-08-04  6:30         ` Linus Torvalds
  2009-08-04  8:01           ` George Spelvin
  2009-08-04  6:40         ` Linus Torvalds
  1 sibling, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2009-08-04  6:30 UTC (permalink / raw)
  To: George Spelvin; +Cc: Git Mailing List



On Mon, 4 Aug 2009, George Spelvin wrote:
> 
> The actual goal of this effort is to address the dynamic linker startup
> time issues by removing the second-largest contributor after libcurl,
> namely openssl.  Optimizing the assembly code is just the fun part. ;-)

Now, I agree that it would be wonderful to get rid of the linker startup, 
but the startup costs of openssl are very low compared to the equivalent 
curl ones. So we can't lose _too_ much performance - especially for 
long-running jobs where startup costs really don't even matter - in the 
quest to get rid of those.

That said, your numbers are impressive. Improving fsck by 1.1-2.2% is very 
good. That means that you not only avoided the startup costs, you actually 
improved on the openssl code. So it's a win-win situation.

That said, it would be even better if the SHA1 code was also somewhat 
portable to other environments (it looks like your current patch is very 
GNU as specific), and if you had a solution for x86-64 too ;)

Yeah, I'm a whiny little b*tch, aren't I?

		Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04  4:48       ` George Spelvin
  2009-08-04  6:30         ` Linus Torvalds
@ 2009-08-04  6:40         ` Linus Torvalds
  1 sibling, 0 replies; 129+ messages in thread
From: Linus Torvalds @ 2009-08-04  6:40 UTC (permalink / raw)
  To: George Spelvin; +Cc: Git Mailing List



On Mon, 4 Aug 2009, George Spelvin wrote:
> +sha1_block_data_order:
> +	pushl	%ebp
> +	pushl	%ebx
> +	pushl	%esi
> +	pushl	%edi
> +	movl	20(%esp),%edi
> +	movl	24(%esp),%esi
> +	movl	28(%esp),%eax
> +	subl	$64,%esp
> +	shll	$6,%eax
> +	addl	%esi,%eax
> +	movl	%eax,92(%esp)
> +	movl	16(%edi),%ebp
> +	movl	12(%edi),%edx
> +.align	16
> +.L000loop:
> +	movl	(%esi),%ecx
> +	movl	4(%esi),%ebx
> +	bswap	%ecx
> +	movl	8(%esi),%eax
> +	bswap	%ebx
> +	movl	%ecx,(%esp)

...

Hmm. Does it really help to do the bswap as a separate initial phase?

As far as I can tell, you load the result of the bswap just a single time 
for each value. So the initial "bswap all 64 bytes" seems pointless.

> +	/* 00_15 0 */
> +	movl	%edx,%edi
> +	movl	(%esp),%esi

Why not do the bswap here instead?

Is it because you're running out of registers for scheduling, and want to 
use the stack pointer rather than the original source?

Or does the data dependency end up being so much better that you're better 
off doing a separate bswap loop?

Or is it just because the code was written that way?

Intriguing, either way.

		Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04  6:30         ` Linus Torvalds
@ 2009-08-04  8:01           ` George Spelvin
  2009-08-04 20:41             ` Junio C Hamano
  0 siblings, 1 reply; 129+ messages in thread
From: George Spelvin @ 2009-08-04  8:01 UTC (permalink / raw)
  To: linux, torvalds; +Cc: git

> Now, I agree that it would be wonderful to get rid of the linker startup, 
> but the startup costs of openssl are very low compared to the equivalent 
> curl ones. So we can't lose _too_ much performance - especially for 
> long-running jobs where startup costs really don't even matter - in the 
> quest to get rid of those.
>
> That said, your numbers are impressive. Improving fsck by 1.1-2.2% is very 
> good. That means that you not only avoided the startup costs, you actually 
> improved on the openssl code. So it's a win-win situation.

Er, yes, that *is* what the subject line is advertising.  I started
with the OpenSSL core SHA1 code (which is BSD/GPL dual-licensed by its
author) and tweaked it some more for more recent processors.

> That said, it would be even better if the SHA1 code was also somewhat 
> portable to other environments (it looks like your current patch is very 
> GNU as specific), and if you had a solution for x86-64 too ;)

Done and will be done.

The code is *actually* written (see the first e-mail in this thread)
in the perl-preprocessor that OpenSSL uses, which can generate quite a
few output syntaxes (including Intel).  I just included the preprocessed
version to reduce the complexity of the rough-draft patch.

The one question I have is that currently perl is not a critical
compile-time dependency; it's needed for some extra stuff, but AFAIK you
can get most of git working without it.  Whether to add that dependency
or what is a Junio question.

As for x86-64, I haven't actually *written* it yet, but it'll be a very
simple adaptation.  Mostly it's just a matter of using the additional
registers effectively.

> Yeah, I'm a whiny little b*tch, aren't I?

Not at all; I expected all of that.  Getting rid of OpenSSL kind of
requires those things.

> Hmm. Does it really help to do the bswap as a separate initial phase?
> 
> As far as I can tell, you load the result of the bswap just a single time 
> for each value. So the initial "bswap all 64 bytes" seems pointless.

>> +	/* 00_15 0 */
>> +	movl	%edx,%edi
>> +	movl	(%esp),%esi

> Why not do the bswap here instead?
>
> Is it because you're running out of registers for scheduling, and want to 
> use the stack pointer rather than the original source?

Exactly.  I looked hard at it, but that means that I'd have to write the
first 16 rounds with only one temp register, because the other is being
used as an input pointer.

Here's the pipelined loop for the first 16 rounds (when in[i] is the
stack buffer), showing parallel operations on the same line.
(Operations in parens belong to adjacent rounds.)
#                       movl D,S        (roll 5,T)      (addl S,A)      //
#       mov in[i],T     xorl C,S        (addl T,A)
#                       andl B,S                        rorl 2,B
#       addl T+K,E      xorl D,S        movl A,T
#                       addl S,E        roll 5,T        (movl C,S)      //
#       (mov in[i],T)   (xorl B,S)      addl T,E        

which translates in perl code to:

sub BODY_00_15
{
        local($n,$a,$b,$c,$d,$e)=@_;

        &comment("00_15 $n");
                &mov($S,$d) if ($n == 0);
        &mov($T,&swtmp($n%16));         #  V Load Xi.
                &xor($S,$c);            # U  Continue F() = d^(b&(c^d))
                &and($S,$b);            #  V
                        &rotr($b,2);    # NP
        &lea($e,&DWP(K1,$e,$T));        # U  Add Xi and K
    if ($n < 15) {
                        &mov($T,$a);    #  V
                &xor($S,$d);            # U 
                        &rotl($T,5);    # NP
                &add($e,$S);            # U 
                &mov($S,$c);            #  V Start of NEXT round's F()
                        &add($e,$T);    # U 
    } else {
        # This version provides the correct start for BODY_20_39
                &xor($S,$d);            #  V
        &mov($T,&swtmp(($n+1)%16));     # U  Start computing next Xi.
                &add($e,$S);            #  V Add F()
                        &mov($S,$a);    # U  Start computing a<<<5
        &xor($T,&swtmp(($n+3)%16));     #  V
                        &rotl($S,5);    # U 
        &xor($T,&swtmp(($n+9)%16));     #  V
    }
}

Anyway, the round is:

#define K1 0x5a827999
e += bswap(in[i]) + K1 + (d^(b&(c^d))) + ROTL(a,5).
b = ROTR(b,2);

Notice how I use one temp (T) for in[i] and ROTL(a,5), and the other (S)
for F1(b,c,d) = d^(b&(c^d)).
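
(For illustration, the same 00_15 round in C, with the two temporaries
made explicit.  This is only a sketch of the structure described above,
not code from git or OpenSSL; the helper macros are ad hoc, and the
caller is assumed to rotate the roles of a..e between rounds, as the
unrolled code does.)

#include <stdint.h>
#include <arpa/inet.h>	/* ntohl() is the bswap on little-endian x86 */

#define ROTL(x, n) (((x) << (n)) | ((x) >> (32 - (n))))
#define ROTR(x, n) (((x) >> (n)) | ((x) << (32 - (n))))
#define K1 0x5a827999

/* One 00_15 round; 'word' is the raw big-endian message word in[i]. */
static void round_00_15(uint32_t a, uint32_t *b, uint32_t c, uint32_t d,
			uint32_t *e, uint32_t word)
{
	uint32_t T, S;

	S = d ^ (*b & (c ^ d));	/* F1(b,c,d), built up in S */
	T = ntohl(word);	/* byte-swapped in[i], in T */
	*e += T + K1 + S;
	T = ROTL(a, 5);		/* T reused for a<<<5 */
	*e += T;
	*b = ROTR(*b, 2);	/* same as ROTL(b,30) */
}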

If I only had one temporary, I'd have to seriously un-overlap it:
	mov	S[i],T
	bswap	T
	mov	T,in[i]
	lea	K1(T,e),e
	  mov	  d,T
	  xor	  c,T
	  and	  b,T
	  xor	  d,T
	  add	  T,e
	mov	a,T
	roll	5,T
	add	T,e

Current processors probably have enough out-of-order scheduling resources to
find the parallelism there, but something like an Atom would be doomed.

I just cobbled together a test implementation, and it looks pretty similar
on my Phenom here (minimum of 30 runs):

Separate copy loop: 1.355603
In-line:            1.350444 (+0.4% faster)

A hint of being faster, but not much.

It is a couple of percent faster on a P4:
Separate copy loop: 3.297174
In-line:            3.237354 (+1.8% faster)

And on an i7:
Separate copy loop: 1.353641
In-line:            1.336766 (+1.2% faster)

but I worry about in-order machines.  An Athlon XP:
Separate copy loop: 3.252682
In-line:            3.313870 (-1.8% slower)

H'm... it's not bad.  And the code is smaller.  Maybe I'll work on
it a bit.

If you want to try it, the modified sha1-x86.s file is appended.

--- /dev/null	2009-05-12 02:55:38.579106460 -0400
+++ sha1-x86.s	2009-08-04 03:42:31.073284734 -0400
@@ -0,0 +1,1359 @@
+.file	"sha1-586.s"
+.text
+.globl	sha1_block_data_order
+.type	sha1_block_data_order,@function
+.align	16
+sha1_block_data_order:
+	pushl	%ebp
+	pushl	%ebx
+	pushl	%esi
+	pushl	%edi
+	movl	20(%esp),%edi
+	movl	24(%esp),%esi
+	movl	28(%esp),%eax
+	subl	$64,%esp
+	shll	$6,%eax
+	addl	%esi,%eax
+	movl	%eax,92(%esp)
+	movl	16(%edi),%ebp
+	movl	12(%edi),%edx
+	movl	8(%edi),%ecx
+	movl	4(%edi),%ebx
+	movl	(%edi),%eax
+.align	16
+.L000loop:
+	movl	%esi,88(%esp)
+	/* 00_15 0 */
+	movl	(%esi),%edi
+	bswap	%edi
+	movl	%edi,(%esp)
+	leal	1518500249(%ebp,%edi,1),%ebp
+	movl	%edx,%edi
+	xorl	%ecx,%edi
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	xorl	%edx,%edi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	roll	$5,%edi
+	addl	%edi,%ebp
+	/* 00_15 1 */
+	movl	4(%esi),%edi
+	bswap	%edi
+	movl	%edi,4(%esp)
+	leal	1518500249(%edx,%edi,1),%edx
+	movl	%ecx,%edi
+	xorl	%ebx,%edi
+	andl	%eax,%edi
+	rorl	$2,%eax
+	xorl	%ecx,%edi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	roll	$5,%edi
+	addl	%edi,%edx
+	/* 00_15 2 */
+	movl	8(%esi),%edi
+	bswap	%edi
+	movl	%edi,8(%esp)
+	leal	1518500249(%ecx,%edi,1),%ecx
+	movl	%ebx,%edi
+	xorl	%eax,%edi
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	xorl	%ebx,%edi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	roll	$5,%edi
+	addl	%edi,%ecx
+	/* 00_15 3 */
+	movl	12(%esi),%edi
+	bswap	%edi
+	movl	%edi,12(%esp)
+	leal	1518500249(%ebx,%edi,1),%ebx
+	movl	%eax,%edi
+	xorl	%ebp,%edi
+	andl	%edx,%edi
+	rorl	$2,%edx
+	xorl	%eax,%edi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	roll	$5,%edi
+	addl	%edi,%ebx
+	/* 00_15 4 */
+	movl	16(%esi),%edi
+	bswap	%edi
+	movl	%edi,16(%esp)
+	leal	1518500249(%eax,%edi,1),%eax
+	movl	%ebp,%edi
+	xorl	%edx,%edi
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	xorl	%ebp,%edi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	roll	$5,%edi
+	addl	%edi,%eax
+	/* 00_15 5 */
+	movl	20(%esi),%edi
+	bswap	%edi
+	movl	%edi,20(%esp)
+	leal	1518500249(%ebp,%edi,1),%ebp
+	movl	%edx,%edi
+	xorl	%ecx,%edi
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	xorl	%edx,%edi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	roll	$5,%edi
+	addl	%edi,%ebp
+	/* 00_15 6 */
+	movl	24(%esi),%edi
+	bswap	%edi
+	movl	%edi,24(%esp)
+	leal	1518500249(%edx,%edi,1),%edx
+	movl	%ecx,%edi
+	xorl	%ebx,%edi
+	andl	%eax,%edi
+	rorl	$2,%eax
+	xorl	%ecx,%edi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	roll	$5,%edi
+	addl	%edi,%edx
+	/* 00_15 7 */
+	movl	28(%esi),%edi
+	bswap	%edi
+	movl	%edi,28(%esp)
+	leal	1518500249(%ecx,%edi,1),%ecx
+	movl	%ebx,%edi
+	xorl	%eax,%edi
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	xorl	%ebx,%edi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	roll	$5,%edi
+	addl	%edi,%ecx
+	/* 00_15 8 */
+	movl	32(%esi),%edi
+	bswap	%edi
+	movl	%edi,32(%esp)
+	leal	1518500249(%ebx,%edi,1),%ebx
+	movl	%eax,%edi
+	xorl	%ebp,%edi
+	andl	%edx,%edi
+	rorl	$2,%edx
+	xorl	%eax,%edi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	roll	$5,%edi
+	addl	%edi,%ebx
+	/* 00_15 9 */
+	movl	36(%esi),%edi
+	bswap	%edi
+	movl	%edi,36(%esp)
+	leal	1518500249(%eax,%edi,1),%eax
+	movl	%ebp,%edi
+	xorl	%edx,%edi
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	xorl	%ebp,%edi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	roll	$5,%edi
+	addl	%edi,%eax
+	/* 00_15 10 */
+	movl	40(%esi),%edi
+	bswap	%edi
+	movl	%edi,40(%esp)
+	leal	1518500249(%ebp,%edi,1),%ebp
+	movl	%edx,%edi
+	xorl	%ecx,%edi
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	xorl	%edx,%edi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	roll	$5,%edi
+	addl	%edi,%ebp
+	/* 00_15 11 */
+	movl	44(%esi),%edi
+	bswap	%edi
+	movl	%edi,44(%esp)
+	leal	1518500249(%edx,%edi,1),%edx
+	movl	%ecx,%edi
+	xorl	%ebx,%edi
+	andl	%eax,%edi
+	rorl	$2,%eax
+	xorl	%ecx,%edi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	roll	$5,%edi
+	addl	%edi,%edx
+	/* 00_15 12 */
+	movl	48(%esi),%edi
+	bswap	%edi
+	movl	%edi,48(%esp)
+	leal	1518500249(%ecx,%edi,1),%ecx
+	movl	%ebx,%edi
+	xorl	%eax,%edi
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	xorl	%ebx,%edi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	roll	$5,%edi
+	addl	%edi,%ecx
+	/* 00_15 13 */
+	movl	52(%esi),%edi
+	bswap	%edi
+	movl	%edi,52(%esp)
+	leal	1518500249(%ebx,%edi,1),%ebx
+	movl	%eax,%edi
+	xorl	%ebp,%edi
+	andl	%edx,%edi
+	rorl	$2,%edx
+	xorl	%eax,%edi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	roll	$5,%edi
+	addl	%edi,%ebx
+	/* 00_15 14 */
+	movl	56(%esi),%edi
+	movl	60(%esi),%esi
+	bswap	%edi
+	movl	%edi,56(%esp)
+	leal	1518500249(%eax,%edi,1),%eax
+	movl	%ebp,%edi
+	xorl	%edx,%edi
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	xorl	%ebp,%edi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	roll	$5,%edi
+	addl	%edi,%eax
+	/* 00_15 15 */
+	movl	%edx,%edi
+	bswap	%esi
+	xorl	%ecx,%edi
+	movl	%esi,60(%esp)
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	xorl	%edx,%edi
+	leal	1518500249(%ebp,%esi,1),%ebp
+	movl	(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	8(%esp),%esi
+	roll	$5,%edi
+	xorl	32(%esp),%esi
+	/* 16_19 16 */
+	xorl	52(%esp),%esi
+	addl	%edi,%ebp
+	movl	%ecx,%edi
+	roll	$1,%esi
+	xorl	%ebx,%edi
+	movl	%esi,(%esp)
+	andl	%eax,%edi
+	rorl	$2,%eax
+	leal	1518500249(%edx,%esi,1),%edx
+	movl	4(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	12(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	36(%esp),%esi
+	roll	$5,%edi
+	/* 16_19 17 */
+	xorl	56(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebx,%edi
+	roll	$1,%esi
+	xorl	%eax,%edi
+	movl	%esi,4(%esp)
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	1518500249(%ecx,%esi,1),%ecx
+	movl	8(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	16(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	40(%esp),%esi
+	roll	$5,%edi
+	/* 16_19 18 */
+	xorl	60(%esp),%esi
+	addl	%edi,%ecx
+	movl	%eax,%edi
+	roll	$1,%esi
+	xorl	%ebp,%edi
+	movl	%esi,8(%esp)
+	andl	%edx,%edi
+	rorl	$2,%edx
+	leal	1518500249(%ebx,%esi,1),%ebx
+	movl	12(%esp),%esi
+	xorl	%eax,%edi
+	xorl	20(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	44(%esp),%esi
+	roll	$5,%edi
+	/* 16_19 19 */
+	xorl	(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ebp,%edi
+	roll	$1,%esi
+	xorl	%edx,%edi
+	movl	%esi,12(%esp)
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	1518500249(%eax,%esi,1),%eax
+	movl	16(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	24(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	48(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 20 */
+	xorl	4(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,16(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	1859775393(%ebp,%esi,1),%ebp
+	movl	20(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	28(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	52(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 21 */
+	xorl	8(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,20(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	1859775393(%edx,%esi,1),%edx
+	movl	24(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	32(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	56(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 22 */
+	xorl	12(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	movl	%esi,24(%esp)
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	1859775393(%ecx,%esi,1),%ecx
+	movl	28(%esp),%esi
+	xorl	%eax,%edi
+	xorl	36(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	60(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 23 */
+	xorl	16(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	movl	%esi,28(%esp)
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	1859775393(%ebx,%esi,1),%ebx
+	movl	32(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	40(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 24 */
+	xorl	20(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	movl	%esi,32(%esp)
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	1859775393(%eax,%esi,1),%eax
+	movl	36(%esp),%esi
+	xorl	%edx,%edi
+	xorl	44(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	4(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 25 */
+	xorl	24(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,36(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	1859775393(%ebp,%esi,1),%ebp
+	movl	40(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	48(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	8(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 26 */
+	xorl	28(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,40(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	1859775393(%edx,%esi,1),%edx
+	movl	44(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	52(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	12(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 27 */
+	xorl	32(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	movl	%esi,44(%esp)
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	1859775393(%ecx,%esi,1),%ecx
+	movl	48(%esp),%esi
+	xorl	%eax,%edi
+	xorl	56(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	16(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 28 */
+	xorl	36(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	movl	%esi,48(%esp)
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	1859775393(%ebx,%esi,1),%ebx
+	movl	52(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	60(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	20(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 29 */
+	xorl	40(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	movl	%esi,52(%esp)
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	1859775393(%eax,%esi,1),%eax
+	movl	56(%esp),%esi
+	xorl	%edx,%edi
+	xorl	(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	24(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 30 */
+	xorl	44(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,56(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	1859775393(%ebp,%esi,1),%ebp
+	movl	60(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	4(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	28(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 31 */
+	xorl	48(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,60(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	1859775393(%edx,%esi,1),%edx
+	movl	(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	8(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	32(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 32 */
+	xorl	52(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	movl	%esi,(%esp)
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	1859775393(%ecx,%esi,1),%ecx
+	movl	4(%esp),%esi
+	xorl	%eax,%edi
+	xorl	12(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	36(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 33 */
+	xorl	56(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	movl	%esi,4(%esp)
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	1859775393(%ebx,%esi,1),%ebx
+	movl	8(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	16(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	40(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 34 */
+	xorl	60(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	movl	%esi,8(%esp)
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	1859775393(%eax,%esi,1),%eax
+	movl	12(%esp),%esi
+	xorl	%edx,%edi
+	xorl	20(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	44(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 35 */
+	xorl	(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,12(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	1859775393(%ebp,%esi,1),%ebp
+	movl	16(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	24(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	48(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 36 */
+	xorl	4(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,16(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	1859775393(%edx,%esi,1),%edx
+	movl	20(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	28(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	52(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 37 */
+	xorl	8(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	movl	%esi,20(%esp)
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	1859775393(%ecx,%esi,1),%ecx
+	movl	24(%esp),%esi
+	xorl	%eax,%edi
+	xorl	32(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	56(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 38 */
+	xorl	12(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	movl	%esi,24(%esp)
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	1859775393(%ebx,%esi,1),%ebx
+	movl	28(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	36(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	60(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 39 */
+	xorl	16(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	movl	%esi,28(%esp)
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	1859775393(%eax,%esi,1),%eax
+	movl	32(%esp),%esi
+	xorl	%edx,%edi
+	xorl	40(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	(%esp),%esi
+	roll	$5,%edi
+	/* 40_59 40 */
+	addl	%edi,%eax
+	movl	%edx,%edi
+	xorl	20(%esp),%esi
+	andl	%ecx,%edi
+	roll	$1,%esi
+	addl	%edi,%ebp
+	movl	%edx,%edi
+	movl	%esi,32(%esp)
+	xorl	%ecx,%edi
+	leal	2400959708(%ebp,%esi,1),%ebp
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	movl	36(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	44(%esp),%esi
+	roll	$5,%edi
+	xorl	4(%esp),%esi
+	/* 40_59 41 */
+	addl	%edi,%ebp
+	movl	%ecx,%edi
+	xorl	24(%esp),%esi
+	andl	%ebx,%edi
+	roll	$1,%esi
+	addl	%edi,%edx
+	movl	%ecx,%edi
+	movl	%esi,36(%esp)
+	xorl	%ebx,%edi
+	leal	2400959708(%edx,%esi,1),%edx
+	andl	%eax,%edi
+	rorl	$2,%eax
+	movl	40(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	48(%esp),%esi
+	roll	$5,%edi
+	xorl	8(%esp),%esi
+	/* 40_59 42 */
+	addl	%edi,%edx
+	movl	%ebx,%edi
+	xorl	28(%esp),%esi
+	andl	%eax,%edi
+	roll	$1,%esi
+	addl	%edi,%ecx
+	movl	%ebx,%edi
+	movl	%esi,40(%esp)
+	xorl	%eax,%edi
+	leal	2400959708(%ecx,%esi,1),%ecx
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	movl	44(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	52(%esp),%esi
+	roll	$5,%edi
+	xorl	12(%esp),%esi
+	/* 40_59 43 */
+	addl	%edi,%ecx
+	movl	%eax,%edi
+	xorl	32(%esp),%esi
+	andl	%ebp,%edi
+	roll	$1,%esi
+	addl	%edi,%ebx
+	movl	%eax,%edi
+	movl	%esi,44(%esp)
+	xorl	%ebp,%edi
+	leal	2400959708(%ebx,%esi,1),%ebx
+	andl	%edx,%edi
+	rorl	$2,%edx
+	movl	48(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	56(%esp),%esi
+	roll	$5,%edi
+	xorl	16(%esp),%esi
+	/* 40_59 44 */
+	addl	%edi,%ebx
+	movl	%ebp,%edi
+	xorl	36(%esp),%esi
+	andl	%edx,%edi
+	roll	$1,%esi
+	addl	%edi,%eax
+	movl	%ebp,%edi
+	movl	%esi,48(%esp)
+	xorl	%edx,%edi
+	leal	2400959708(%eax,%esi,1),%eax
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	movl	52(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	60(%esp),%esi
+	roll	$5,%edi
+	xorl	20(%esp),%esi
+	/* 40_59 45 */
+	addl	%edi,%eax
+	movl	%edx,%edi
+	xorl	40(%esp),%esi
+	andl	%ecx,%edi
+	roll	$1,%esi
+	addl	%edi,%ebp
+	movl	%edx,%edi
+	movl	%esi,52(%esp)
+	xorl	%ecx,%edi
+	leal	2400959708(%ebp,%esi,1),%ebp
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	movl	56(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	(%esp),%esi
+	roll	$5,%edi
+	xorl	24(%esp),%esi
+	/* 40_59 46 */
+	addl	%edi,%ebp
+	movl	%ecx,%edi
+	xorl	44(%esp),%esi
+	andl	%ebx,%edi
+	roll	$1,%esi
+	addl	%edi,%edx
+	movl	%ecx,%edi
+	movl	%esi,56(%esp)
+	xorl	%ebx,%edi
+	leal	2400959708(%edx,%esi,1),%edx
+	andl	%eax,%edi
+	rorl	$2,%eax
+	movl	60(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	4(%esp),%esi
+	roll	$5,%edi
+	xorl	28(%esp),%esi
+	/* 40_59 47 */
+	addl	%edi,%edx
+	movl	%ebx,%edi
+	xorl	48(%esp),%esi
+	andl	%eax,%edi
+	roll	$1,%esi
+	addl	%edi,%ecx
+	movl	%ebx,%edi
+	movl	%esi,60(%esp)
+	xorl	%eax,%edi
+	leal	2400959708(%ecx,%esi,1),%ecx
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	movl	(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	8(%esp),%esi
+	roll	$5,%edi
+	xorl	32(%esp),%esi
+	/* 40_59 48 */
+	addl	%edi,%ecx
+	movl	%eax,%edi
+	xorl	52(%esp),%esi
+	andl	%ebp,%edi
+	roll	$1,%esi
+	addl	%edi,%ebx
+	movl	%eax,%edi
+	movl	%esi,(%esp)
+	xorl	%ebp,%edi
+	leal	2400959708(%ebx,%esi,1),%ebx
+	andl	%edx,%edi
+	rorl	$2,%edx
+	movl	4(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	12(%esp),%esi
+	roll	$5,%edi
+	xorl	36(%esp),%esi
+	/* 40_59 49 */
+	addl	%edi,%ebx
+	movl	%ebp,%edi
+	xorl	56(%esp),%esi
+	andl	%edx,%edi
+	roll	$1,%esi
+	addl	%edi,%eax
+	movl	%ebp,%edi
+	movl	%esi,4(%esp)
+	xorl	%edx,%edi
+	leal	2400959708(%eax,%esi,1),%eax
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	movl	8(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	16(%esp),%esi
+	roll	$5,%edi
+	xorl	40(%esp),%esi
+	/* 40_59 50 */
+	addl	%edi,%eax
+	movl	%edx,%edi
+	xorl	60(%esp),%esi
+	andl	%ecx,%edi
+	roll	$1,%esi
+	addl	%edi,%ebp
+	movl	%edx,%edi
+	movl	%esi,8(%esp)
+	xorl	%ecx,%edi
+	leal	2400959708(%ebp,%esi,1),%ebp
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	movl	12(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	20(%esp),%esi
+	roll	$5,%edi
+	xorl	44(%esp),%esi
+	/* 40_59 51 */
+	addl	%edi,%ebp
+	movl	%ecx,%edi
+	xorl	(%esp),%esi
+	andl	%ebx,%edi
+	roll	$1,%esi
+	addl	%edi,%edx
+	movl	%ecx,%edi
+	movl	%esi,12(%esp)
+	xorl	%ebx,%edi
+	leal	2400959708(%edx,%esi,1),%edx
+	andl	%eax,%edi
+	rorl	$2,%eax
+	movl	16(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	24(%esp),%esi
+	roll	$5,%edi
+	xorl	48(%esp),%esi
+	/* 40_59 52 */
+	addl	%edi,%edx
+	movl	%ebx,%edi
+	xorl	4(%esp),%esi
+	andl	%eax,%edi
+	roll	$1,%esi
+	addl	%edi,%ecx
+	movl	%ebx,%edi
+	movl	%esi,16(%esp)
+	xorl	%eax,%edi
+	leal	2400959708(%ecx,%esi,1),%ecx
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	movl	20(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	28(%esp),%esi
+	roll	$5,%edi
+	xorl	52(%esp),%esi
+	/* 40_59 53 */
+	addl	%edi,%ecx
+	movl	%eax,%edi
+	xorl	8(%esp),%esi
+	andl	%ebp,%edi
+	roll	$1,%esi
+	addl	%edi,%ebx
+	movl	%eax,%edi
+	movl	%esi,20(%esp)
+	xorl	%ebp,%edi
+	leal	2400959708(%ebx,%esi,1),%ebx
+	andl	%edx,%edi
+	rorl	$2,%edx
+	movl	24(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	32(%esp),%esi
+	roll	$5,%edi
+	xorl	56(%esp),%esi
+	/* 40_59 54 */
+	addl	%edi,%ebx
+	movl	%ebp,%edi
+	xorl	12(%esp),%esi
+	andl	%edx,%edi
+	roll	$1,%esi
+	addl	%edi,%eax
+	movl	%ebp,%edi
+	movl	%esi,24(%esp)
+	xorl	%edx,%edi
+	leal	2400959708(%eax,%esi,1),%eax
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	movl	28(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	36(%esp),%esi
+	roll	$5,%edi
+	xorl	60(%esp),%esi
+	/* 40_59 55 */
+	addl	%edi,%eax
+	movl	%edx,%edi
+	xorl	16(%esp),%esi
+	andl	%ecx,%edi
+	roll	$1,%esi
+	addl	%edi,%ebp
+	movl	%edx,%edi
+	movl	%esi,28(%esp)
+	xorl	%ecx,%edi
+	leal	2400959708(%ebp,%esi,1),%ebp
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	movl	32(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	40(%esp),%esi
+	roll	$5,%edi
+	xorl	(%esp),%esi
+	/* 40_59 56 */
+	addl	%edi,%ebp
+	movl	%ecx,%edi
+	xorl	20(%esp),%esi
+	andl	%ebx,%edi
+	roll	$1,%esi
+	addl	%edi,%edx
+	movl	%ecx,%edi
+	movl	%esi,32(%esp)
+	xorl	%ebx,%edi
+	leal	2400959708(%edx,%esi,1),%edx
+	andl	%eax,%edi
+	rorl	$2,%eax
+	movl	36(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	44(%esp),%esi
+	roll	$5,%edi
+	xorl	4(%esp),%esi
+	/* 40_59 57 */
+	addl	%edi,%edx
+	movl	%ebx,%edi
+	xorl	24(%esp),%esi
+	andl	%eax,%edi
+	roll	$1,%esi
+	addl	%edi,%ecx
+	movl	%ebx,%edi
+	movl	%esi,36(%esp)
+	xorl	%eax,%edi
+	leal	2400959708(%ecx,%esi,1),%ecx
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	movl	40(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	48(%esp),%esi
+	roll	$5,%edi
+	xorl	8(%esp),%esi
+	/* 40_59 58 */
+	addl	%edi,%ecx
+	movl	%eax,%edi
+	xorl	28(%esp),%esi
+	andl	%ebp,%edi
+	roll	$1,%esi
+	addl	%edi,%ebx
+	movl	%eax,%edi
+	movl	%esi,40(%esp)
+	xorl	%ebp,%edi
+	leal	2400959708(%ebx,%esi,1),%ebx
+	andl	%edx,%edi
+	rorl	$2,%edx
+	movl	44(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	52(%esp),%esi
+	roll	$5,%edi
+	xorl	12(%esp),%esi
+	/* 40_59 59 */
+	addl	%edi,%ebx
+	movl	%ebp,%edi
+	xorl	32(%esp),%esi
+	andl	%edx,%edi
+	roll	$1,%esi
+	addl	%edi,%eax
+	movl	%ebp,%edi
+	movl	%esi,44(%esp)
+	xorl	%edx,%edi
+	leal	2400959708(%eax,%esi,1),%eax
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	movl	48(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	56(%esp),%esi
+	roll	$5,%edi
+	xorl	16(%esp),%esi
+	/* 20_39 60 */
+	xorl	36(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,48(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	3395469782(%ebp,%esi,1),%ebp
+	movl	52(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	60(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	20(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 61 */
+	xorl	40(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,52(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	3395469782(%edx,%esi,1),%edx
+	movl	56(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	24(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 62 */
+	xorl	44(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	movl	%esi,56(%esp)
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	3395469782(%ecx,%esi,1),%ecx
+	movl	60(%esp),%esi
+	xorl	%eax,%edi
+	xorl	4(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	28(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 63 */
+	xorl	48(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	movl	%esi,60(%esp)
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	3395469782(%ebx,%esi,1),%ebx
+	movl	(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	8(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	32(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 64 */
+	xorl	52(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	movl	%esi,(%esp)
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	3395469782(%eax,%esi,1),%eax
+	movl	4(%esp),%esi
+	xorl	%edx,%edi
+	xorl	12(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	36(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 65 */
+	xorl	56(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,4(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	3395469782(%ebp,%esi,1),%ebp
+	movl	8(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	16(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	40(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 66 */
+	xorl	60(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,8(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	3395469782(%edx,%esi,1),%edx
+	movl	12(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	20(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	44(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 67 */
+	xorl	(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	movl	%esi,12(%esp)
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	3395469782(%ecx,%esi,1),%ecx
+	movl	16(%esp),%esi
+	xorl	%eax,%edi
+	xorl	24(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	48(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 68 */
+	xorl	4(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	movl	%esi,16(%esp)
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	3395469782(%ebx,%esi,1),%ebx
+	movl	20(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	28(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	52(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 69 */
+	xorl	8(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	movl	%esi,20(%esp)
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	3395469782(%eax,%esi,1),%eax
+	movl	24(%esp),%esi
+	xorl	%edx,%edi
+	xorl	32(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	56(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 70 */
+	xorl	12(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,24(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	3395469782(%ebp,%esi,1),%ebp
+	movl	28(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	36(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	60(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 71 */
+	xorl	16(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,28(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	3395469782(%edx,%esi,1),%edx
+	movl	32(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	40(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 72 */
+	xorl	20(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	movl	%esi,32(%esp)
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	3395469782(%ecx,%esi,1),%ecx
+	movl	36(%esp),%esi
+	xorl	%eax,%edi
+	xorl	44(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	4(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 73 */
+	xorl	24(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	movl	%esi,36(%esp)
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	3395469782(%ebx,%esi,1),%ebx
+	movl	40(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	48(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	8(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 74 */
+	xorl	28(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	movl	%esi,40(%esp)
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	3395469782(%eax,%esi,1),%eax
+	movl	44(%esp),%esi
+	xorl	%edx,%edi
+	xorl	52(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	12(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 75 */
+	xorl	32(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,44(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	3395469782(%ebp,%esi,1),%ebp
+	movl	48(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	56(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	16(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 76 */
+	xorl	36(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,48(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	3395469782(%edx,%esi,1),%edx
+	movl	52(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	60(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	20(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 77 */
+	xorl	40(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	3395469782(%ecx,%esi,1),%ecx
+	movl	56(%esp),%esi
+	xorl	%eax,%edi
+	xorl	(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	24(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 78 */
+	xorl	44(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	3395469782(%ebx,%esi,1),%ebx
+	movl	60(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	4(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	28(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 79 */
+	xorl	48(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	3395469782(%eax,%esi,1),%eax
+	xorl	%edx,%edi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	roll	$5,%edi
+	addl	%edi,%eax
+	/* Loop trailer */
+	movl	84(%esp),%edi
+	movl	88(%esp),%esi
+	addl	16(%edi),%ebp
+	addl	12(%edi),%edx
+	addl	8(%edi),%ecx
+	addl	4(%edi),%ebx
+	addl	(%edi),%eax
+	addl	$64,%esi
+	movl	%ebp,16(%edi)
+	movl	%edx,12(%edi)
+	cmpl	92(%esp),%esi
+	movl	%ecx,8(%edi)
+	movl	%ebx,4(%edi)
+	movl	%eax,(%edi)
+	jb	.L000loop
+	addl	$64,%esp
+	popl	%edi
+	popl	%esi
+	popl	%ebx
+	popl	%ebp
+	ret
+.L_sha1_block_data_order_end:
+.size	sha1_block_data_order,.L_sha1_block_data_order_end-sha1_block_data_order
+.byte	83,72,65,49,32,98,108,111,99,107,32,116,114,97,110,115,102,111,114,109,32,102,111,114,32,120,56,54,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04  5:01           ` George Spelvin
@ 2009-08-04 12:56             ` Jon Smirl
  2009-08-04 14:29               ` Dmitry Potapov
  0 siblings, 1 reply; 129+ messages in thread
From: Jon Smirl @ 2009-08-04 12:56 UTC (permalink / raw)
  To: George Spelvin; +Cc: git

On Tue, Aug 4, 2009 at 1:01 AM, George Spelvin<linux@horizon.com> wrote:
>> Would there happen to be a SHA1 implementation around that can compute
>> the SHA1 without first decompressing the data? Databases gain a lot of
>> speed by using special algorithms that can directly operate on the
>> compressed data.
>
> I can't imagine how.  In general, this requires that the compression
> be carefully designed to be compatible with the algorithms, and SHA1
> is specifically designed to depend on every bit of the input in
> an un-analyzable way.

A simple start would be to feed each byte as it is decompressed
directly into the sha code and avoid the intermediate buffer. Removing
the buffer reduces cache pressure.
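
(A rough sketch of that idea -- hypothetical glue code, not anything in
git.  It streams the inflate output into the SHA-1 update in small
chunks, so only a cache-sized scratch buffer is ever touched; the
git_SHA1_* names are borrowed from the patch earlier in the thread and
the rest is plain zlib.)

#include <string.h>
#include <zlib.h>
#include "sha1.h"	/* git_SHA_CTX / git_SHA1_* as in the patch above */

/* Hash the inflated contents of (zdata, zlen) without ever building
 * the whole uncompressed object in memory.  Sketch only.
 */
static int sha1_of_deflated(unsigned char *sha1_out,
			    const unsigned char *zdata, unsigned long zlen)
{
	unsigned char chunk[4096];
	git_SHA_CTX ctx;
	z_stream s;
	int ret;

	memset(&s, 0, sizeof(s));
	if (inflateInit(&s) != Z_OK)
		return -1;
	s.next_in = (unsigned char *)zdata;
	s.avail_in = zlen;

	git_SHA1_Init(&ctx);
	do {
		s.next_out = chunk;
		s.avail_out = sizeof(chunk);
		ret = inflate(&s, Z_NO_FLUSH);
		if (ret != Z_OK && ret != Z_STREAM_END) {
			inflateEnd(&s);
			return -1;
		}
		/* Feed whatever came out of this inflate step */
		git_SHA1_Update(&ctx, chunk, sizeof(chunk) - s.avail_out);
	} while (ret != Z_STREAM_END);

	inflateEnd(&s);
	git_SHA1_Final(sha1_out, &ctx);
	return 0;
}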

> Also, git normally avoids hashing objects that it doesn't need
> uncompressed for some other reason.  git-fsck is a notable exception,
> but I think the idea of creating special optimized code paths for that
> interferes with its reliability and robustness goals.

Agreed that there is no real need for this, just something to play
with if you are trying for a speed record.

I'd much rather have a solution for the rebase problem where one side
of the diff has moved to a different file and rebase can't figure it
out.

>



-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04 12:56             ` Jon Smirl
@ 2009-08-04 14:29               ` Dmitry Potapov
  0 siblings, 0 replies; 129+ messages in thread
From: Dmitry Potapov @ 2009-08-04 14:29 UTC (permalink / raw)
  To: Jon Smirl; +Cc: George Spelvin, git

On Tue, Aug 04, 2009 at 08:56:48AM -0400, Jon Smirl wrote:
> 
> A simple start would be to feed each byte as it is decompressed
> directly into the sha code and avoid the intermediate buffer. Removing
> the buffer reduces cache pressure.

First, you still have to preserve every decoded byte in the compression
window, which is 32Kb by default. Typical files in Git repositories are
not that big: many are under 32Kb, and practically all of them fit in
the L2 cache of modern processors. Second, the complication of coupling
the two algorithms in assembler code would be enormous; there are not
enough registers on x86 for SHA-1 alone. Third, SHA-1 is very
computationally intensive and has a predictable (linear) access pattern,
so you do not wait for L2, because the data will already be in L1. So I
don't see where you can gain significantly. Perhaps you could win a
little by rewriting inflate in assembler, but I do not expect any
significant gains beyond that. And coupling has obvious disadvantages
when it comes to maintenance...


Dmitry

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04  8:01           ` George Spelvin
@ 2009-08-04 20:41             ` Junio C Hamano
  2009-08-05 18:17               ` George Spelvin
  0 siblings, 1 reply; 129+ messages in thread
From: Junio C Hamano @ 2009-08-04 20:41 UTC (permalink / raw)
  To: George Spelvin; +Cc: torvalds, git

"George Spelvin" <linux@horizon.com> writes:

> The one question I have is that currently perl is not a critical
> compile-time dependency; it's needed for some extra stuff, but AFAIK you
> can get most of git working without it.  Whether to add that dependency
> or what is a Junio question.

I actually feel a lot more uneasy about applying a patch signed off by
somebody who calls himself George Spelvin, though.

Three classes of people compile git from the source:

 * People who want to be on the bleeding edge and compile git for
   themselves, even though they are on mainstream platforms where they
   could choose distro-packaged one;

 * People who produce binary packages for distribution.

 * People who are on minority platforms and have no other way to get git
   than compiling for themselves;

We do not have to worry about the first two groups of people.  It won't
be too involved for them to install Perl on their system; after all they
are already coping with asciidoc and xmlto ;-)

We can continue shipping mozilla one to help the last group.

In the Makefile, we say:

    # Define NO_OPENSSL environment variable if you do not have OpenSSL.
    # This also implies MOZILLA_SHA1.

and with your change, we would start implying STANDALONE_OPENSSL_SHA1
instead.  But if MOZILLA_SHA1 was given explicitly, we could use that.

If they really really really want the extra performance out of a statically
linked OpenSSL derivative, they could prepare a preprocessed assembly file on
some other machine and use it as the last resort if they do not have/want
Perl.  The situation is exactly the same as the documentation set.  They
are using HTML/man prepared on another machine (namely, mine) as the last
resort if they do not have/want the AsciiDoc toolchain.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04 20:41             ` Junio C Hamano
@ 2009-08-05 18:17               ` George Spelvin
  2009-08-05 20:36                 ` Johannes Schindelin
                                   ` (2 more replies)
  0 siblings, 3 replies; 129+ messages in thread
From: George Spelvin @ 2009-08-05 18:17 UTC (permalink / raw)
  To: gitster; +Cc: git, linux, torvalds

> Three classes of people compile git from the source:
>
> * People who want to be on the bleeding edge and compile git for
>   themselves, even though they are on mainstream platforms where they
>   could choose distro-packaged one;
>
> * People who produce binary packages for distribution.
>
> * People who are on minority platforms and have no other way to get git
>   than compiling for themselves;
>
> We do not have to worry about the first two groups of people.  It won't
> be too involved for them to install Perl on their system; after all they
> are already coping with asciidoc and xmlto ;-)

Actually, I'd get rid of the perl entirely, but I'm not sure how much
the other-assembler-syntax features are needed by the folks on MacOS X
and Windows (msysgit).

> We can continue shipping mozilla one to help the last group.

Of course, we always need a C fallback.  Would you like a faster one?

> In the Makefile, we say:
>
>    # Define NO_OPENSSL environment variable if you do not have OpenSSL.
>    # This also implies MOZILLA_SHA1.
>
> and with your change, we would start implying STANDALONE_OPENSSL_SHA1
> instead.  But if MOZILLA_SHA1 was given explicitly, we could use that.

Well, I'd really like to auto-detect the processor.  Current gcc's
"gcc -v" output includes a "Target: " line that will do nicely.  I can,
of course, fall back to C if it fails, but is there a significant user
base using a non-GCC compiler?

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-05 18:17               ` George Spelvin
@ 2009-08-05 20:36                 ` Johannes Schindelin
  2009-08-05 20:44                 ` Junio C Hamano
  2009-08-05 20:55                 ` Linus Torvalds
  2 siblings, 0 replies; 129+ messages in thread
From: Johannes Schindelin @ 2009-08-05 20:36 UTC (permalink / raw)
  To: George Spelvin; +Cc: gitster, git, torvalds

Hi,

On Wed, 5 Aug 2009, George Spelvin wrote:

> > Three classes of people compile git from the source:
> >
> > * People who want to be on the bleeding edge and compile git for
> >   themselves, even though they are on mainstream platforms where they
> >   could choose distro-packaged one;
> >
> > * People who produce binary packages for distribution.
> >
> > * People who are on minority platforms and have no other way to get git
> >   than compiling for themselves;
> >
> > We do not have to worry about the first two groups of people.  It won't
> > be too involved for them to install Perl on their system; after all they
> > are already coping with asciidoc and xmlto ;-)
> 
> Actually, I'd get rid of the perl entirely, but I'm not sure how much
> the other-assembler-syntax features are needed by the folks on MacOS X
> and Windows (msysgit).

Don't worry for MacOSX and msysGit (or Cygwin, for that matter): all of 
them use GCC.

> > We can continue shipping mozilla one to help the last group.
> 
> Of course, we always need a C fallback.  Would you like a faster one?

Is that a trick question?

:-)

> > In the Makefile, we say:
> >
> >    # Define NO_OPENSSL environment variable if you do not have OpenSSL.
> >    # This also implies MOZILLA_SHA1.
> >
> > and with your change, we would start implying STANDALONE_OPENSSL_SHA1
> > instead.  But if MOZILLA_SHA1 was given explicitly, we could use that.
> 
> Well, I'd really like to auto-detect the processor.  Current gcc's
> "gcc -v" output includes a "Target: " line that will do nicely.  I can,
> of course, fall back to C if it fails, but is there a significant user
> base using a non-GCC compiler?

Do you really want to determine which processor to optimize for at compile 
time?  Build system and target system are often different...

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-05 18:17               ` George Spelvin
  2009-08-05 20:36                 ` Johannes Schindelin
@ 2009-08-05 20:44                 ` Junio C Hamano
  2009-08-05 20:55                 ` Linus Torvalds
  2 siblings, 0 replies; 129+ messages in thread
From: Junio C Hamano @ 2009-08-05 20:44 UTC (permalink / raw)
  To: George Spelvin; +Cc: git, torvalds

"George Spelvin" <linux@horizon.com> writes:

>> We can continue shipping mozilla one to help the last group.
>
> Of course, we always need a C fallback.  Would you like a faster one?

No.  I'd rather keep the tried-and-tested one while a better alternative is in
work-in-progress state.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-05 18:17               ` George Spelvin
  2009-08-05 20:36                 ` Johannes Schindelin
  2009-08-05 20:44                 ` Junio C Hamano
@ 2009-08-05 20:55                 ` Linus Torvalds
  2009-08-05 23:13                   ` Linus Torvalds
  2 siblings, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2009-08-05 20:55 UTC (permalink / raw)
  To: George Spelvin; +Cc: gitster, git



On Wed, 5 Aug 2009, George Spelvin wrote:
> 
> > We can continue shipping mozilla one to help the last group.
> 
> Of course, we always need a C fallback.  Would you like a faster one?

I actually looked at code generation (on x86-64) for the C fallback, and 
it should be quite doable to re-write the C one to generate good code on 
x86-64.

On 32-bit x86, I suspect the register pressures are so intense that it's 
unrealistic to expect gcc to do a good job, but the Mozilla SHA1 C code 
really seems _designed_ to be slow in stupid ways (that whole "byte at a 
time into a word buffer with shifts" is a really really sucky way to 
handle the endianness issues).
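
To make the difference concrete, here is a minimal sketch (mine, not code
lifted from either implementation; the helper names are made up) of the two
ways of picking up one big-endian message word:

	#include <arpa/inet.h>	/* htonl() */

	/* Mozilla-style: four byte loads shifted into a word, one at a time */
	static unsigned int load_be32_slow(const unsigned char *p)
	{
		return ((unsigned int)p[0] << 24) |
		       ((unsigned int)p[1] << 16) |
		       ((unsigned int)p[2] <<  8) |
		        (unsigned int)p[3];
	}

	/* word-at-a-time: one (possibly unaligned) load plus a byte swap */
	static unsigned int load_be32_fast(const unsigned char *p)
	{
		return htonl(*(const unsigned int *)p);
	}

Both return the same value; the first one just makes the compiler do four
loads, three shifts and three ORs for every word of input.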

So if you'd like to look at the C version, that's definitely worth it. 
Much bigger bang for the buck than trying to schedule asm language and 
having to deal with different assemblers/linkers/whatnot.

		Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-05 20:55                 ` Linus Torvalds
@ 2009-08-05 23:13                   ` Linus Torvalds
  2009-08-06  1:18                     ` Linus Torvalds
  0 siblings, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2009-08-05 23:13 UTC (permalink / raw)
  To: George Spelvin; +Cc: gitster, git



On Wed, 5 Aug 2009, Linus Torvalds wrote:
> 
> I actually looked at code generation (on x86-64) for the C fallback, and 
> it should be quite doable to re-write the C one to generate good code on 
> x86-64.

Ok, here's a try.

It's based on the mozilla SHA1 code, but with quite a bit of surgery. 
Enable with "make BLK_SHA1=1".

Timings for "git fsck --full" on the git directory:

 - Mozilla SHA1 portable C-code (sucky sucky): MOZILLA_SHA1=1

	real	0m38.194s
	user	0m37.838s
	sys	0m0.356s

 - This code ("half-portable C code"): BLK_SHA1=1

	real	0m28.120s
	user	0m27.930s
	sys	0m0.192s

 - OpenSSL assembler code:

	real	0m26.327s
	user	0m26.194s
	sys	0m0.136s

ie this is slightly slower than the OpenSSL SHA1 routines, but that's only 
true on something very SHA1-intensive like "git fsck", and this is 
_almost_ portable code. I say "almost" because it really does require that 
we can do unaligned word loads, and do a good job of 'htonl()', and it 
assumes that 'unsigned int' is 32-bit (the latter would be easy to change 
by using 'uint32_t', but since it's not the relevant portability issue, I 
don't think it matters).
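
(For completeness: if you wanted to drop the unaligned-load and int-size
assumptions while staying word-at-a-time, the usual trick would be a helper
along these lines -- a sketch, not something this patch does; the name is
made up -- relying on the compiler folding the memcpy back into a single
load where the platform allows it:

	#include <stdint.h>
	#include <string.h>
	#include <arpa/inet.h>

	static uint32_t get_be32(const void *ptr)
	{
		uint32_t v;
		memcpy(&v, ptr, 4);	/* no unaligned dereference */
		return ntohl(v);	/* still wants a reasonable ntohl() */
	}

but for now the simple cast is good enough.)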

In other words, unlike the Mozilla SHA1, this one doesn't suck. It's 
certainly not great either, but it's probably good enough in practice, 
without the headaches of actually making people use an assembler version.

And maybe somebody can see how to improve it further?

		Linus
---
From: Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH] Add new optimized C 'block-sha1' routines

Based on the mozilla SHA1 routine, but doing the input data accesses a
word at a time and with 'htonl()' instead of loading bytes and shifting.

It requires an architecture that is ok with unaligned 32-bit loads and a
fast htonl().

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 Makefile          |    9 +++
 block-sha1/sha1.c |  145 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 block-sha1/sha1.h |   21 ++++++++
 3 files changed, 175 insertions(+), 0 deletions(-)

diff --git a/Makefile b/Makefile
index d7669b1..f12024c 100644
--- a/Makefile
+++ b/Makefile
@@ -84,6 +84,10 @@ all::
 # specify your own (or DarwinPort's) include directories and
 # library directories by defining CFLAGS and LDFLAGS appropriately.
 #
+# Define BLK_SHA1 environment variable if you want the C version
+# of the SHA1 that assumes you can do unaligned 32-bit loads and
+# have a fast htonl() function.
+#
 # Define PPC_SHA1 environment variable when running make to make use of
 # a bundled SHA1 routine optimized for PowerPC.
 #
@@ -1166,6 +1170,10 @@ ifdef NO_DEFLATE_BOUND
 	BASIC_CFLAGS += -DNO_DEFLATE_BOUND
 endif
 
+ifdef BLK_SHA1
+	SHA1_HEADER = "block-sha1/sha1.h"
+	LIB_OBJS += block-sha1/sha1.o
+else
 ifdef PPC_SHA1
 	SHA1_HEADER = "ppc/sha1.h"
 	LIB_OBJS += ppc/sha1.o ppc/sha1ppc.o
@@ -1183,6 +1191,7 @@ else
 endif
 endif
 endif
+endif
 ifdef NO_PERL_MAKEMAKER
 	export NO_PERL_MAKEMAKER
 endif
diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
new file mode 100644
index 0000000..8fd90b0
--- /dev/null
+++ b/block-sha1/sha1.c
@@ -0,0 +1,145 @@
+/*
+ * Based on the Mozilla SHA1 (see mozilla-sha1/sha1.c),
+ * optimized to do word accesses rather than byte accesses,
+ * and to avoid unnecessary copies into the context array.
+ */
+
+#include <string.h>
+#include <arpa/inet.h>
+
+#include "sha1.h"
+
+/* Hash one 64-byte block of data */
+static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data);
+
+void blk_SHA1_Init(blk_SHA_CTX *ctx)
+{
+	ctx->lenW = 0;
+	ctx->size = 0;
+
+	/* Initialize H with the magic constants (see FIPS180 for constants)
+	 */
+	ctx->H[0] = 0x67452301;
+	ctx->H[1] = 0xefcdab89;
+	ctx->H[2] = 0x98badcfe;
+	ctx->H[3] = 0x10325476;
+	ctx->H[4] = 0xc3d2e1f0;
+}
+
+
+void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, int len)
+{
+	int lenW = ctx->lenW;
+
+	ctx->size += len << 3;
+
+	/* Read the data into W and process blocks as they get full
+	 */
+	if (lenW) {
+		int left = 64 - lenW;
+		if (len < left)
+			left = len;
+		memcpy(lenW + (char *)ctx->W, data, left);
+		lenW = (lenW + left) & 63;
+		len -= left;
+		data += left;
+		ctx->lenW = lenW;
+		if (lenW)
+			return;
+		blk_SHA1Block(ctx, ctx->W);
+	}
+	while (len >= 64) {
+		blk_SHA1Block(ctx, data);
+		data += 64;
+		len -= 64;
+	}
+	if (len) {
+		memcpy(ctx->W, data, len);
+		ctx->lenW = len;
+	}
+}
+
+
+void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
+{
+	static const unsigned char pad[64] = { 0x80 };
+	unsigned int padlen[2];
+	int i;
+
+	/* Pad with a binary 1 (ie 0x80), then zeroes, then length
+	 */
+	padlen[0] = htonl(ctx->size >> 32);
+	padlen[1] = htonl(ctx->size);
+
+	blk_SHA1_Update(ctx, pad, 1+ (63 & (55 - ctx->lenW)));
+	blk_SHA1_Update(ctx, padlen, 8);
+
+	/* Output hash
+	 */
+	for (i = 0; i < 5; i++)
+		((unsigned int *)hashout)[i] = htonl(ctx->H[i]);
+}
+
+#define SHA_ROT(X,n) (((X) << (n)) | ((X) >> (32-(n))))
+
+static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
+{
+	int t;
+	unsigned int A,B,C,D,E,TEMP;
+	unsigned int W[80];
+
+	for (t = 0; t < 16; t++)
+		W[t] = htonl(data[t]);
+
+	/* Unroll it? */
+	for (t = 16; t <= 79; t++)
+		W[t] = SHA_ROT(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
+
+	A = ctx->H[0];
+	B = ctx->H[1];
+	C = ctx->H[2];
+	D = ctx->H[3];
+	E = ctx->H[4];
+
+#define T_0_19(t) \
+	TEMP = SHA_ROT(A,5) + (((C^D)&B)^D)     + E + W[t] + 0x5a827999; \
+	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+
+	T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4);
+	T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9);
+	T_0_19(10); T_0_19(11); T_0_19(12); T_0_19(13); T_0_19(14);
+	T_0_19(15); T_0_19(16); T_0_19(17); T_0_19(18); T_0_19(19);
+
+#define T_20_39(t) \
+	TEMP = SHA_ROT(A,5) + (B^C^D)           + E + W[t] + 0x6ed9eba1; \
+	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+
+	T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24);
+	T_20_39(25); T_20_39(26); T_20_39(27); T_20_39(28); T_20_39(29);
+	T_20_39(30); T_20_39(31); T_20_39(32); T_20_39(33); T_20_39(34);
+	T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39);
+
+#define T_40_59(t) \
+	TEMP = SHA_ROT(A,5) + ((B&C)|(D&(B|C))) + E + W[t] + 0x8f1bbcdc; \
+	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+
+	T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44);
+	T_40_59(45); T_40_59(46); T_40_59(47); T_40_59(48); T_40_59(49);
+	T_40_59(50); T_40_59(51); T_40_59(52); T_40_59(53); T_40_59(54);
+	T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59);
+
+#define T_60_79(t) \
+	TEMP = SHA_ROT(A,5) + (B^C^D)           + E + W[t] + 0xca62c1d6; \
+	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+
+	T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64);
+	T_60_79(65); T_60_79(66); T_60_79(67); T_60_79(68); T_60_79(69);
+	T_60_79(70); T_60_79(71); T_60_79(72); T_60_79(73); T_60_79(74);
+	T_60_79(75); T_60_79(76); T_60_79(77); T_60_79(78); T_60_79(79);
+
+	ctx->H[0] += A;
+	ctx->H[1] += B;
+	ctx->H[2] += C;
+	ctx->H[3] += D;
+	ctx->H[4] += E;
+}
diff --git a/block-sha1/sha1.h b/block-sha1/sha1.h
new file mode 100644
index 0000000..dbc719f
--- /dev/null
+++ b/block-sha1/sha1.h
@@ -0,0 +1,21 @@
+/*
+ * Based on the Mozilla SHA1 (see mozilla-sha1/sha1.h),
+ * optimized to do word accesses rather than byte accesses,
+ * and to avoid unnecessary copies into the context array.
+ */
+
+typedef struct {
+	unsigned int H[5];
+	unsigned int W[16];
+	int lenW;
+	unsigned long long size;
+} blk_SHA_CTX;
+
+void blk_SHA1_Init(blk_SHA_CTX *ctx);
+void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *dataIn, int len);
+void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx);
+
+#define git_SHA_CTX	blk_SHA_CTX
+#define git_SHA1_Init	blk_SHA1_Init
+#define git_SHA1_Update	blk_SHA1_Update
+#define git_SHA1_Final	blk_SHA1_Final

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-05 23:13                   ` Linus Torvalds
@ 2009-08-06  1:18                     ` Linus Torvalds
  2009-08-06  1:52                       ` Nicolas Pitre
  2009-08-06 18:49                       ` Erik Faye-Lund
  0 siblings, 2 replies; 129+ messages in thread
From: Linus Torvalds @ 2009-08-06  1:18 UTC (permalink / raw)
  To: George Spelvin; +Cc: gitster, git



On Wed, 5 Aug 2009, Linus Torvalds wrote:
> 
> Timings for "git fsck --full" on the git directory:
> 
>  - Mozilla SHA1 portable C-code (sucky sucky): MOZILLA_SHA1=1
> 
> 	real	0m38.194s
> 	user	0m37.838s
> 	sys	0m0.356s
> 
>  - This code ("half-portable C code"): BLK_SHA1=1
> 
> 	real	0m28.120s
> 	user	0m27.930s
> 	sys	0m0.192s
> 
>  - OpenSSL assembler code:
> 
> 	real	0m26.327s
> 	user	0m26.194s
> 	sys	0m0.136s

Ok, I installed the 32-bit libraries too, to see what it looks like for 
that case. As expected, the compiler is not able to do a great job due to 
it being somewhat register starved, but on the other hand, the old Mozilla 
code did even worse, so..

 - Mozilla SHA:

	real	0m47.063s
	user	0m46.815s
	sys	0m0.252s

 - BLK_SHA1=1

	real	0m34.705s
	user	0m34.394s
	sys	0m0.312s

 - OPENSSL:

	real	0m29.754s
	user	0m29.446s
	sys	0m0.288s

so the tuned asm from OpenSSL does kick ass, but the C code version isn't 
_that_ far away. It's quite a reasonable alternative if you don't have the 
OpenSSL libraries installed, for example.

I note that MINGW does NO_OPENSSL by default, for example, and maybe the 
MINGW people want to test the patch out and enable BLK_SHA1 rather than 
the original Mozilla one.

But while looking at 32-bit issues, I noticed that I really should also 
cast 'len' when shifting it. Otherwise the thing is limited to fairly 
small areas (28 bits - 256MB). This is not just a 32-bit problem ("int" is 
a signed 32-bit thing even in a 64-bit build), but I only noticed it when 
looking at 32-bit issues.
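
To spell out where that limit comes from: 2^28 bytes is 2^31 bits, and 2^31
does not fit in a signed 32-bit 'int'. A throwaway illustration (not part of
the patch):

	int len = 1 << 28;				/* a 256MB update */
	unsigned long long bad, good;

	bad  = len << 3;				/* shifted as 32-bit int: overflows */
	good = (unsigned long long) len << 3;		/* widened first: 2^31, as intended */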

So here's an incremental patch to fix that. 

		Linus

---
 block-sha1/sha1.c |    4 ++--
 block-sha1/sha1.h |    2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index 8fd90b0..eef32f7 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -27,11 +27,11 @@ void blk_SHA1_Init(blk_SHA_CTX *ctx)
 }
 
 
-void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, int len)
+void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len)
 {
 	int lenW = ctx->lenW;
 
-	ctx->size += len << 3;
+	ctx->size += (unsigned long long) len << 3;
 
 	/* Read the data into W and process blocks as they get full
 	 */
diff --git a/block-sha1/sha1.h b/block-sha1/sha1.h
index dbc719f..7be2d93 100644
--- a/block-sha1/sha1.h
+++ b/block-sha1/sha1.h
@@ -12,7 +12,7 @@ typedef struct {
 } blk_SHA_CTX;
 
 void blk_SHA1_Init(blk_SHA_CTX *ctx);
-void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *dataIn, int len);
+void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *dataIn, unsigned long len);
 void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx);
 
 #define git_SHA_CTX	blk_SHA_CTX

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  1:18                     ` Linus Torvalds
@ 2009-08-06  1:52                       ` Nicolas Pitre
  2009-08-06  2:04                         ` Junio C Hamano
  2009-08-06  2:08                         ` Linus Torvalds
  2009-08-06 18:49                       ` Erik Faye-Lund
  1 sibling, 2 replies; 129+ messages in thread
From: Nicolas Pitre @ 2009-08-06  1:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: George Spelvin, Junio C Hamano, git

On Wed, 5 Aug 2009, Linus Torvalds wrote:

> But while looking at 32-bit issues, I noticed that I really should also 
> cast 'len' when shifting it. Otherwise the thing is limited to fairly 
> small areas (28 bits - 256MB). This is not just a 32-bit problem ("int" is 
> a signed 32-bit thing even in a 64-bit build), but I only noticed it when 
> looking at 32-bit issues.

Even better is to not shift len at all in SHA_update() but shift 
ctx->size only at the end in SHA_final().  It is not as if 
SHA_update() could operate on partial bytes, so counting total bytes 
instead of total bits is all you need.  This way you need no cast there 
and make the code slightly faster.


Nicolas

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  1:52                       ` Nicolas Pitre
@ 2009-08-06  2:04                         ` Junio C Hamano
  2009-08-06  2:10                           ` Linus Torvalds
  2009-08-06  2:20                           ` Nicolas Pitre
  2009-08-06  2:08                         ` Linus Torvalds
  1 sibling, 2 replies; 129+ messages in thread
From: Junio C Hamano @ 2009-08-06  2:04 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Linus Torvalds, George Spelvin, git

Nicolas Pitre <nico@cam.org> writes:

> On Wed, 5 Aug 2009, Linus Torvalds wrote:
>
>> But while looking at 32-bit issues, I noticed that I really should also 
>> cast 'len' when shifting it. Otherwise the thing is limited to fairly 
>> small areas (28 bits - 256MB). This is not just a 32-bit problem ("int" is 
>> a signed 32-bit thing even in a 64-bit build), but I only noticed it when 
>> looking at 32-bit issues.
>
> Even better is to not shift len at all in SHA_update() but shift 
> ctx->size only at the end in SHA_final().  It is not as if 
> SHA_update() could operate on partial bytes, so counting total bytes 
> instead of total bits is all you need.  This way you need no cast there 
> and make the code slightly faster.

Like this?

By the way, the Mozilla one calls Init at the end of Final but block-sha1
doesn't.  I do not think it matters for our callers, but on the other hand
Final is not a performance-critical part, nor is Init heavy, so it may not be
a bad idea to imitate them as well.  Or am I missing something?

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index eef32f7..8293f7b 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -31,7 +31,7 @@ void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len)
 {
 	int lenW = ctx->lenW;
 
-	ctx->size += (unsigned long long) len << 3;
+	ctx->size += (unsigned long long) len;
 
 	/* Read the data into W and process blocks as they get full
 	 */
@@ -68,6 +68,7 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
 
 	/* Pad with a binary 1 (ie 0x80), then zeroes, then length
 	 */
+	ctx->size <<= 3; /* bytes to bits */
 	padlen[0] = htonl(ctx->size >> 32);
 	padlen[1] = htonl(ctx->size);
 

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  1:52                       ` Nicolas Pitre
  2009-08-06  2:04                         ` Junio C Hamano
@ 2009-08-06  2:08                         ` Linus Torvalds
  2009-08-06  3:19                           ` Artur Skawina
  1 sibling, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2009-08-06  2:08 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: George Spelvin, Junio C Hamano, git



On Wed, 5 Aug 2009, Nicolas Pitre wrote:
> 
> Even better is to not shift len at all in SHA_update() but shift 
> ctx->size only at the end in SHA_final().  It is not as if 
> SHA_update() could operate on partial bytes, so counting total bytes 
> instead of total bits is all you need.  This way you need no cast there 
> and make the code slightly faster.

Yeah, I tried it, but it's not noticeable.

The bigger issue seems to be that it's shifter-limited, or that's what I 
take away from my profiles. I suspect it's even _more_ shifter-limited on 
some other micro-architectures, because gcc is being stupid, and generates

	ror $31,%eax

from the "left shift + right shift" combination. It seems to -always- 
generate a "ror", rather than trying to generate 'rot' if the shift count 
would be smaller that way.

And I know _some_ old micro-architectures will literally internally loop 
on the rol/ror counts, so "ror $31" can be _much_ more expensive than "rol 
$1".

That isn't the case on my Nehalem, though. But I can't seem to get gcc to 
generate better code without actually using inline asm..

(So to clarify: this patch makes no difference that I can see to 
performance, but I suspect it could matter on other CPU's like an old 
Pentium or maybe an Atom).

		Linus

---
 block-sha1/sha1.c |   36 ++++++++++++++++++++++++------------
 block-sha1/sha1.h |    2 +-
 2 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index 8fd90b0..a45a3de 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -80,7 +80,19 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
 		((unsigned int *)hashout)[i] = htonl(ctx->H[i]);
 }
 
-#define SHA_ROT(X,n) (((X) << (n)) | ((X) >> (32-(n))))
+#if defined(__i386__) || defined(__x86_64__)
+
+#define SHA_ASM(op, x, n) ({ unsigned int __res; asm(op " %1,%0":"=r" (__res):"i" (n), "0" (x)); __res; })
+#define SHA_ROL(x,n)	SHA_ASM("rol", x, n)
+#define SHA_ROR(x,n)	SHA_ASM("ror", x, n)
+
+#else
+
+#define SHA_ROT(X,l,r)	(((X) << (l)) | ((X) >> (r)))
+#define SHA_ROL(X,n)	SHA_ROT(X,n,32-(n))
+#define SHA_ROR(X,n)	SHA_ROT(X,32-(n),n)
+
+#endif
 
 static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 {
@@ -93,7 +105,7 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 
 	/* Unroll it? */
 	for (t = 16; t <= 79; t++)
-		W[t] = SHA_ROT(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
+		W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
 
 	A = ctx->H[0];
 	B = ctx->H[1];
@@ -102,8 +114,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 	E = ctx->H[4];
 
 #define T_0_19(t) \
-	TEMP = SHA_ROT(A,5) + (((C^D)&B)^D)     + E + W[t] + 0x5a827999; \
-	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+	TEMP = SHA_ROL(A,5) + (((C^D)&B)^D)     + E + W[t] + 0x5a827999; \
+	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
 
 	T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4);
 	T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9);
@@ -111,8 +123,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 	T_0_19(15); T_0_19(16); T_0_19(17); T_0_19(18); T_0_19(19);
 
 #define T_20_39(t) \
-	TEMP = SHA_ROT(A,5) + (B^C^D)           + E + W[t] + 0x6ed9eba1; \
-	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+	TEMP = SHA_ROL(A,5) + (B^C^D)           + E + W[t] + 0x6ed9eba1; \
+	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
 
 	T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24);
 	T_20_39(25); T_20_39(26); T_20_39(27); T_20_39(28); T_20_39(29);
@@ -120,8 +132,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 	T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39);
 
 #define T_40_59(t) \
-	TEMP = SHA_ROT(A,5) + ((B&C)|(D&(B|C))) + E + W[t] + 0x8f1bbcdc; \
-	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+	TEMP = SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + W[t] + 0x8f1bbcdc; \
+	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
 
 	T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44);
 	T_40_59(45); T_40_59(46); T_40_59(47); T_40_59(48); T_40_59(49);
@@ -129,8 +141,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 	T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59);
 
 #define T_60_79(t) \
-	TEMP = SHA_ROT(A,5) + (B^C^D)           + E + W[t] + 0xca62c1d6; \
-	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+	TEMP = SHA_ROL(A,5) + (B^C^D)           + E + W[t] + 0xca62c1d6; \
+	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
 
 	T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64);
 	T_60_79(65); T_60_79(66); T_60_79(67); T_60_79(68); T_60_79(69);

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  2:04                         ` Junio C Hamano
@ 2009-08-06  2:10                           ` Linus Torvalds
  2009-08-06  2:20                           ` Nicolas Pitre
  1 sibling, 0 replies; 129+ messages in thread
From: Linus Torvalds @ 2009-08-06  2:10 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Nicolas Pitre, George Spelvin, git



On Wed, 5 Aug 2009, Junio C Hamano wrote:
> 
> Like this?

No, combine it with the other shifts:

Yes:

> -	ctx->size += (unsigned long long) len << 3;
> +	ctx->size += (unsigned long long) len;

No:

> +	ctx->size <<= 3; /* bytes to bits */
>  	padlen[0] = htonl(ctx->size >> 32);
>  	padlen[1] = htonl(ctx->size);

Do

	padlen[0] = htonl(ctx->size >> 29);
	padlen[1] = htonl(ctx->size << 3);

instead. Or whatever.

		Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  2:04                         ` Junio C Hamano
  2009-08-06  2:10                           ` Linus Torvalds
@ 2009-08-06  2:20                           ` Nicolas Pitre
  1 sibling, 0 replies; 129+ messages in thread
From: Nicolas Pitre @ 2009-08-06  2:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, George Spelvin, git

On Wed, 5 Aug 2009, Junio C Hamano wrote:

> Nicolas Pitre <nico@cam.org> writes:
> 
> > On Wed, 5 Aug 2009, Linus Torvalds wrote:
> >
> >> But while looking at 32-bit issues, I noticed that I really should also 
> >> cast 'len' when shifting it. Otherwise the thing is limited to fairly 
> >> small areas (28 bits - 256MB). This is not just a 32-bit problem ("int" is 
> >> a signed 32-bit thing even in a 64-bit build), but I only noticed it when 
> >> looking at 32-bit issues.
> >
> > Even better is to not shift len at all in SHA_update() but shift 
> > ctx->size only at the end in SHA_final().  It is not as if 
> > SHA_update() could operate on partial bytes, so counting total bytes 
> > instead of total bits is all you need.  This way you need no cast there 
> > and make the code slightly faster.
> 
> Like this?

Almost (see below).

> By the way, the Mozilla one calls Init at the end of Final but block-sha1
> doesn't.  I do not think it matters for our callers, but on the other hand
> Final is not a performance-critical part, nor is Init heavy, so it may not be
> a bad idea to imitate them as well.  Or am I missing something?

It is done only to make sure potentially crypto sensitive information is 
wiped out of the ctx structure instance.  In our case we have no such 
concerns.
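
(If we ever did care, that paranoid ending would amount to nothing more than
something like this at the end of blk_SHA1_Final() -- shown only as an
illustration, not a proposal:

	memset(ctx, 0, sizeof(*ctx));	/* wipe any key-dependent state */
	blk_SHA1_Init(ctx);		/* and leave the context reusable */

and git simply has nothing secret to wipe.)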

> diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
> index eef32f7..8293f7b 100644
> --- a/block-sha1/sha1.c
> +++ b/block-sha1/sha1.c
> @@ -31,7 +31,7 @@ void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len)
>  {
>  	int lenW = ctx->lenW;
>  
> -	ctx->size += (unsigned long long) len << 3;
> +	ctx->size += (unsigned long long) len;

You can get rid of the cast as well now.

>  	/* Read the data into W and process blocks as they get full
>  	 */
> @@ -68,6 +68,7 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
>  
>  	/* Pad with a binary 1 (ie 0x80), then zeroes, then length
>  	 */
> +	ctx->size <<= 3; /* bytes to bits */
>  	padlen[0] = htonl(ctx->size >> 32);
>  	padlen[1] = htonl(ctx->size);

Instead, I'd do:

	padlen[0] = htonl(ctx->size >> (32 - 3));
	padlen[1] = htonl(ctx->size << 3);

That would eliminate a redundant write back of ctx->size.


Nicolas

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  2:08                         ` Linus Torvalds
@ 2009-08-06  3:19                           ` Artur Skawina
  2009-08-06  3:31                             ` Linus Torvalds
  0 siblings, 1 reply; 129+ messages in thread
From: Artur Skawina @ 2009-08-06  3:19 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git

Linus Torvalds wrote:
> 
> The bigger issue seems to be that it's shifter-limited, or that's what I 
> take away from my profiles. I suspect it's even _more_ shifter-limited on 
> some other micro-architectures, because gcc is being stupid, and generates
> 
> 	ror $31,%eax
> 
> from the "left shift + right shift" combination. It seems to -always- 
> generate a "ror", rather than trying to generate 'rot' if the shift count 
> would be smaller that way.
> 
> And I know _some_ old micro-architectures will literally internally loop 
> on the rol/ror counts, so "ror $31" can be _much_ more expensive than "rol 
> $1".
> 
> That isn't the case on my Nehalem, though. But I can't seem to get gcc to 
> generate better code without actually using inline asm..

The compiler does the right thing w/ something like this:

+#if __GNUC__>1 && defined(__i386)
+#define SHA_ROT(data,bits) ({ \
+  unsigned d = (data); \
+  if (bits<16) \
+    __asm__ ("roll %1,%0" : "=r" (d) : "I" (bits), "0" (d)); \
+  else \
+    __asm__ ("rorl %1,%0" : "=r" (d) : "I" (32-bits), "0" (d)); \
+  d; \
+  })
+#else
 #define SHA_ROT(X,n) (((X) << (n)) | ((X) >> (32-(n))))
+#endif
 
which doesn't obfuscate the code as much.
(I needed the asm on p4 anyway, as w/o it the mozilla version is even
 slower than an rfc3174 one. rol vs ror makes no measurable difference)

>  static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
>  {
> @@ -93,7 +105,7 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
>  
>  	/* Unroll it? */
>  	for (t = 16; t <= 79; t++)
> -		W[t] = SHA_ROT(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
> +		W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);

unrolling this once (but not more) is a win, at least on p4.

>  #define T_0_19(t) \
> -	TEMP = SHA_ROT(A,5) + (((C^D)&B)^D)     + E + W[t] + 0x5a827999; \
> -	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
> +	TEMP = SHA_ROL(A,5) + (((C^D)&B)^D)     + E + W[t] + 0x5a827999; \
> +	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
>  
>  	T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4);
>  	T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9);

unrolling these otoh is a clear loss (iirc ~10%). 

artur

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  3:19                           ` Artur Skawina
@ 2009-08-06  3:31                             ` Linus Torvalds
  2009-08-06  3:48                               ` Linus Torvalds
  2009-08-06  4:08                               ` Artur Skawina
  0 siblings, 2 replies; 129+ messages in thread
From: Linus Torvalds @ 2009-08-06  3:31 UTC (permalink / raw)
  To: Artur Skawina; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git



On Thu, 6 Aug 2009, Artur Skawina wrote:
> 
> >  #define T_0_19(t) \
> > -	TEMP = SHA_ROT(A,5) + (((C^D)&B)^D)     + E + W[t] + 0x5a827999; \
> > -	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
> > +	TEMP = SHA_ROL(A,5) + (((C^D)&B)^D)     + E + W[t] + 0x5a827999; \
> > +	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
> >  
> >  	T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4);
> >  	T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9);
> 
> unrolling these otoh is a clear loss (iirc ~10%). 

I can well imagine. The P4 decode bandwidth is abysmal unless you get 
things into the trace cache, and the trace cache is of a very limited 
size.

However, on at least Nehalem, unrolling it all is quite a noticeable win.

The way it's written, I can easily make it do one or the other by just 
turning the macro inside a loop (and we can have a preprocessor flag to 
choose one or the other), but let me work on it a bit more first.

I'm trying to move the htonl() inside the loops (the same way I suggested 
George do with his assembly), and it seems to help a tiny bit. But I may 
be measuring noise.

However, right now my biggest profile hit is on this irritating loop:

	/* Unroll it? */
	for (t = 16; t <= 79; t++)
		W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);

and I haven't been able to move _that_ into the other iterations yet.

Here's my micro-optimization update. It does the first 16 rounds (of the 
first 20-round thing) specially, and takes the data directly from the 
input array. I'm _this_ close to breaking the 28-second barrier on 
git-fsck, but not quite yet.

			Linus

---
From: Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH] block-sha1: make the 'ntohl()' part of the first SHA1 loop

This helps a teeny bit.  But what I -really- want to do is to avoid the
whole 80-array loop, and do the xor updates as I go along..

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 block-sha1/sha1.c |   28 ++++++++++++++++------------
 1 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index a45a3de..39a5bbb 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -100,27 +100,31 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 	unsigned int A,B,C,D,E,TEMP;
 	unsigned int W[80];
 
-	for (t = 0; t < 16; t++)
-		W[t] = htonl(data[t]);
-
-	/* Unroll it? */
-	for (t = 16; t <= 79; t++)
-		W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
-
 	A = ctx->H[0];
 	B = ctx->H[1];
 	C = ctx->H[2];
 	D = ctx->H[3];
 	E = ctx->H[4];
 
-#define T_0_19(t) \
+#define T_0_15(t) \
+	TEMP = htonl(data[t]); W[t] = TEMP; \
+	TEMP += SHA_ROL(A,5) + (((C^D)&B)^D)     + E + 0x5a827999; \
+	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \
+
+	T_0_15( 0); T_0_15( 1); T_0_15( 2); T_0_15( 3); T_0_15( 4);
+	T_0_15( 5); T_0_15( 6); T_0_15( 7); T_0_15( 8); T_0_15( 9);
+	T_0_15(10); T_0_15(11); T_0_15(12); T_0_15(13); T_0_15(14);
+	T_0_15(15);
+
+	/* Unroll it? */
+	for (t = 16; t <= 79; t++)
+		W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
+
+#define T_16_19(t) \
 	TEMP = SHA_ROL(A,5) + (((C^D)&B)^D)     + E + W[t] + 0x5a827999; \
 	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
 
-	T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4);
-	T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9);
-	T_0_19(10); T_0_19(11); T_0_19(12); T_0_19(13); T_0_19(14);
-	T_0_19(15); T_0_19(16); T_0_19(17); T_0_19(18); T_0_19(19);
+	T_16_19(16); T_16_19(17); T_16_19(18); T_16_19(19);
 
 #define T_20_39(t) \
 	TEMP = SHA_ROL(A,5) + (B^C^D)           + E + W[t] + 0x6ed9eba1; \

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  3:31                             ` Linus Torvalds
@ 2009-08-06  3:48                               ` Linus Torvalds
  2009-08-06  4:01                                 ` Linus Torvalds
  2009-08-06  4:52                                 ` George Spelvin
  2009-08-06  4:08                               ` Artur Skawina
  1 sibling, 2 replies; 129+ messages in thread
From: Linus Torvalds @ 2009-08-06  3:48 UTC (permalink / raw)
  To: Artur Skawina; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git



On Wed, 5 Aug 2009, Linus Torvalds wrote:
> 
> However, right now my biggest profile hit is on this irritating loop:
> 
> 	/* Unroll it? */
> 	for (t = 16; t <= 79; t++)
> 		W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
> 
> and I haven't been able to move _that_ into the other iterations yet.

Oh yes I have.

Here's the patch that gets me sub-28s git-fsck times. In fact, it gives me 
sub-27s times. In fact, it's really close to the OpenSSL times.

And all using plain C.

Again - this is all on x86-64. I suspect 32-bit code ends up having 
spills due to register pressure. That said, I did get rid of that big 
temporary array, and it now basically only uses that 512-bit array as one 
circular queue.

		Linus

PS. Ok, so my definition of "plain C" is a bit odd. There's nothing plain 
about it. It's disgusting C preprocessor misuse. But dang, it's kind of 
fun to abuse the compiler this way.

---
 block-sha1/sha1.c |   28 ++++++++++++++++------------
 1 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index 39a5bbb..80193d4 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -96,9 +96,8 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
 
 static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 {
-	int t;
 	unsigned int A,B,C,D,E,TEMP;
-	unsigned int W[80];
+	unsigned int array[16];
 
 	A = ctx->H[0];
 	B = ctx->H[1];
@@ -107,8 +106,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 	E = ctx->H[4];
 
 #define T_0_15(t) \
-	TEMP = htonl(data[t]); W[t] = TEMP; \
-	TEMP += SHA_ROL(A,5) + (((C^D)&B)^D)     + E + 0x5a827999; \
+	TEMP = htonl(data[t]); array[t] = TEMP; \
+	TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \
 	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \
 
 	T_0_15( 0); T_0_15( 1); T_0_15( 2); T_0_15( 3); T_0_15( 4);
@@ -116,18 +115,21 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 	T_0_15(10); T_0_15(11); T_0_15(12); T_0_15(13); T_0_15(14);
 	T_0_15(15);
 
-	/* Unroll it? */
-	for (t = 16; t <= 79; t++)
-		W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
+/* This "rolls" over the 512-bit array */
+#define W(x) (array[(x)&15])
+#define SHA_XOR(t) \
+	TEMP = SHA_ROL(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1); W(t) = TEMP;
 
 #define T_16_19(t) \
-	TEMP = SHA_ROL(A,5) + (((C^D)&B)^D)     + E + W[t] + 0x5a827999; \
-	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
+	SHA_XOR(t); \
+	TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \
+	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \
 
 	T_16_19(16); T_16_19(17); T_16_19(18); T_16_19(19);
 
 #define T_20_39(t) \
-	TEMP = SHA_ROL(A,5) + (B^C^D)           + E + W[t] + 0x6ed9eba1; \
+	SHA_XOR(t); \
+	TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0x6ed9eba1; \
 	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
 
 	T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24);
@@ -136,7 +138,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 	T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39);
 
 #define T_40_59(t) \
-	TEMP = SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + W[t] + 0x8f1bbcdc; \
+	SHA_XOR(t); \
+	TEMP += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + 0x8f1bbcdc; \
 	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
 
 	T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44);
@@ -145,7 +148,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 	T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59);
 
 #define T_60_79(t) \
-	TEMP = SHA_ROL(A,5) + (B^C^D)           + E + W[t] + 0xca62c1d6; \
+	SHA_XOR(t); \
+	TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0xca62c1d6; \
 	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
 
 	T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64);

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  3:48                               ` Linus Torvalds
@ 2009-08-06  4:01                                 ` Linus Torvalds
  2009-08-06  4:28                                   ` Artur Skawina
  2009-08-06  4:52                                 ` George Spelvin
  1 sibling, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2009-08-06  4:01 UTC (permalink / raw)
  To: Artur Skawina; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git



On Wed, 5 Aug 2009, Linus Torvalds wrote:
> 
> Here's the patch that gets me sub-28s git-fsck times. In fact, it gives me 
> sub-27s times. In fact, it's really close to the OpenSSL times.

Just to back that up:

 - OpenSSL:

	real	0m26.363s
	user	0m26.174s
	sys	0m0.188s

 - This C implementation:

	real	0m26.594s
	user	0m26.310s
	sys	0m0.256s

so I'm still slower, but now you really have to look closely to see the 
difference. In fact, you have to do multiple runs to make sure, because 
the error bars are bigger than the difference - but openssl definitely 
edges my C code out by a small amount, and the above numbers are fairly 
normal.

		Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  3:31                             ` Linus Torvalds
  2009-08-06  3:48                               ` Linus Torvalds
@ 2009-08-06  4:08                               ` Artur Skawina
  2009-08-06  4:27                                 ` Linus Torvalds
  1 sibling, 1 reply; 129+ messages in thread
From: Artur Skawina @ 2009-08-06  4:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git

Linus Torvalds wrote:
> 
> On Thu, 6 Aug 2009, Artur Skawina wrote:
>>>  #define T_0_19(t) \
>>> -	TEMP = SHA_ROT(A,5) + (((C^D)&B)^D)     + E + W[t] + 0x5a827999; \
>>> -	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
>>> +	TEMP = SHA_ROL(A,5) + (((C^D)&B)^D)     + E + W[t] + 0x5a827999; \
>>> +	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
>>>  
>>>  	T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4);
>>>  	T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9);
>> unrolling these otoh is a clear loss (iirc ~10%). 
> 
> I can well imagine. The P4 decode bandwidth is abysmal unless you get 
> things into the trace cache, and the trace cache is of a very limited 
> size.
> 
> However, on at least Nehalem, unrolling it all is quite a noticeable win.
> 
> The way it's written, I can easily make it do one or the other by just 
> turning the macro inside a loop (and we can have a preprocessor flag to 
> choose one or the other), but let me work on it a bit more first.

that's of course how i measured it.. :)

> I'm trying to move the htonl() inside the loops (the same way I suggested 
> George do with his assembly), and it seems to help a tiny bit. But I may 
> be measuring noise.

i haven't tried your version at all yet (just applied the rol/ror and
unrolling changes, but neither was a win on p4)

> However, right now my biggest profile hit is on this irritating loop:
> 
> 	/* Unroll it? */
> 	for (t = 16; t <= 79; t++)
> 		W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
> 
> and I haven't been able to move _that_ into the other iterations yet.

i've done that before -- was a small loss -- maybe because of the small
trace cache. deleted that attempt while cleaning up the #if mess, so don't
have the patch, but it was basically

#define newW(t) (W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1))

and than s/W[t]/newW(t)/ in rounds 16..79.

I've only tested on p4 and there the winner so far is still:

-  for (t = 16; t <= 79; t++)
+  for (t = 16; t <= 79; t+=2) {
     ctx->W[t] =
-      SHA_ROT(ctx->W[t-3] ^ ctx->W[t-8] ^ ctx->W[t-14] ^ ctx->W[t-16], 1);
+      SHA_ROT(ctx->W[t-16] ^ ctx->W[t-14] ^ ctx->W[t-8] ^ ctx->W[t-3], 1);
+    ctx->W[t+1] =
+      SHA_ROT(ctx->W[t-15] ^ ctx->W[t-13] ^ ctx->W[t-7] ^ ctx->W[t-2], 1);
+  }

> Here's my micro-optimization update. It does the first 16 rounds (of the 
> first 20-round thing) specially, and takes the data directly from the 
> input array. I'm _this_ close to breaking the 28-second barrier on 
> git-fsck, but not quite yet.

tried this before too -- doesn't help. Not much of a surprise --
if unrolling didn't help adding another loop (for rounds 17..20) won't.

artur

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  4:08                               ` Artur Skawina
@ 2009-08-06  4:27                                 ` Linus Torvalds
  2009-08-06  5:44                                   ` Artur Skawina
  0 siblings, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2009-08-06  4:27 UTC (permalink / raw)
  To: Artur Skawina; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git



On Thu, 6 Aug 2009, Artur Skawina wrote:
> > 
> > The way it's written, I can easily make it do one or the other by just 
> > turning the macro inside a loop (and we can have a preprocessor flag to 
> > choose one or the other), but let me work on it a bit more first.
> 
> that's of course how i measured it.. :)

Well, with my "rolling 512-bit array" I can't do that easily any more.

Now it actually depends on the compiler being able to statically do that 
circular list calculation. If I were to turn it back into the chunks of 
loops, my new code would suck, because it would have all those nasty 
dynamic address calculations.
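
To be explicit about what I am relying on (same W() macro as in the patch,
't' being the round number):

	#define W(x) (array[(x)&15])

	/* unrolled: the index is a compile-time constant, so W(21) is just
	 * array[5] -- a fixed offset, no masking or address math at runtime */
	W(21) = TEMP;

	/* looped: 't' is a runtime value, so every iteration pays for the
	 * 'and $15' plus an indexed address calculation before the store */
	for (t = 20; t <= 39; t++)
		W(t) = TEMP;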

> I've only tested on p4 and there the winner so far is still:

Yeah, well, I refuse to touch that crappy micro-architecture any more. I 
complained to Intel people for years that their best CPU was only 
available as a laptop chip (Pentium-M), and I'm really happy to have 
gotten rid of all my horrid P4's.

(Ok, so it was great when the P4 ran at 2x the frequency of the 
competition, and then it smoked them all. Except on OS loads, where the P4 
exception handling took ten times longer than anything else).

So I'm a bit biased against P4.

I'll try it on my Atom's, though. They're pretty crappy CPU's, but they 
have a fairly good _reason_ to be crappy.

			Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  4:01                                 ` Linus Torvalds
@ 2009-08-06  4:28                                   ` Artur Skawina
  2009-08-06  4:50                                     ` Linus Torvalds
  0 siblings, 1 reply; 129+ messages in thread
From: Artur Skawina @ 2009-08-06  4:28 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git

Linus Torvalds wrote:
> 
> On Wed, 5 Aug 2009, Linus Torvalds wrote:
>> Here's the patch that gets me sub-28s git-fsck times. In fact, it gives me 
>> sub-27s times. In fact, it's really close to the OpenSSL times.
> 
> Just to back that up:
> 
>  - OpenSSL:
> 
> 	real	0m26.363s
> 	user	0m26.174s
> 	sys	0m0.188s
> 
>  - This C implementation:
> 
> 	real	0m26.594s
> 	user	0m26.310s
> 	sys	0m0.256s
> 
> so I'm still slower, but now you really have to look closely to see the 
> difference. In fact, you have to do multiple runs to make sure, because 
> the error bars are bigger than the difference - but openssl definitely 
> edges my C code out by a small amount, and the above numbers are fairly 
> normal.

nice, the p4 microbenchmark #s:

#             TIME[s] SPEED[MB/s]
rfc3174         1.357       44.99
rfc3174         1.352       45.13
mozilla         1.509       40.44
mozillaas       1.133       53.87
linus          0.5818       104.9

so it's more than twice as fast as the mozilla implementation.

artur

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  4:28                                   ` Artur Skawina
@ 2009-08-06  4:50                                     ` Linus Torvalds
  2009-08-06  5:19                                       ` Artur Skawina
  0 siblings, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2009-08-06  4:50 UTC (permalink / raw)
  To: Artur Skawina; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git



On Thu, 6 Aug 2009, Artur Skawina wrote:
> 
> #             TIME[s] SPEED[MB/s]
> rfc3174         1.357       44.99
> rfc3174         1.352       45.13
> mozilla         1.509       40.44
> mozillaas       1.133       53.87
> linus          0.5818       104.9
> 
> so it's more than twice as fast as the mozilla implementation.

So that's some general SHA1 benchmark you have?

I hope it tests correctness too. 

Although I can't imagine it being wrong - I've made mistakes (oh, yes, 
many mistakes) when trying to convert the code to something efficient, and 
even the smallest mistake results in 'git fsck' immediately complaining 
about every single object.

But still. I literally haven't tested it any other way (well, the git 
test-suite ends up doing a fair amount of testing too, and I _have_ run 
that).

As to my atom testing: my poor little atom is a sad little thing, and 
it's almost painful to benchmark that thing. But it's worth it to look at 
how the 32-bit code compares to the openssl asm code too:

 - BLK_SHA1:

	real	2m27.160s
	user	2m23.651s
	sys	0m2.392s

 - OpenSSL:

	real	2m12.580s
	user	2m9.998s
	sys	0m1.811s

 - Mozilla-SHA1:

	real	3m21.836s
	user	3m18.369s
	sys	0m2.862s

As expected, the hand-tuned assembly does better (and by a bigger margin). 
Probably partly because scheduling is important when in-order, and partly 
because gcc will have a harder time with the small register set.

But it's still a big improvement over mozilla one.

(This is, as always, 'git fsck --full'. It spends about 50% on that SHA1 
calculation, so the SHA1 speedup is larger than you see from just the
numbers)

		Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  3:48                               ` Linus Torvalds
  2009-08-06  4:01                                 ` Linus Torvalds
@ 2009-08-06  4:52                                 ` George Spelvin
  1 sibling, 0 replies; 129+ messages in thread
From: George Spelvin @ 2009-08-06  4:52 UTC (permalink / raw)
  To: art.08.09, torvalds; +Cc: git, gitster, linux, nico

On Wed, 5 Aug 2009, Linus Torvalds wrote:
> Oh yes I have.
> 
> Here's the patch that gets me sub-28s git-fsck times. In fact, it gives me 
> sub-27s times. In fact, it's really close to the OpenSSL times.
> 
> And all using plain C.
> 
> Again - this is all on x86-64. I suspect 32-bit code ends up having 
> spills due to register pressure. That said, I did get rid of that big 
> temporary array, and it now basically only uses that 512-bit array as one 
> circular queue.
> 
> 		Linus
> 
> PS. Ok, so my definition of "plain C" is a bit odd. There's nothing plain 
> about it. It's disgusting C preprocessor misuse. But dang, it's kind of 
> fun to abuse the compiler this way.

You're still missing three tricks, which give a slight speedup 
on my machine:

1) (major)
	Instead of reassigning all those variables all the time,
	make the round function
		E += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + W[t] + 0x8f1bbcdc; \
		B = SHA_ROR(B, 2);
	and rename the variables between rounds.

2) (minor)
	One of the round functions ((B&C)|(D&(B|C))) can be rewritten
		  (B&C) | (C&D) | (D&B)
		= (B&C) | (D&(B|C))
		= (B&C) | (D&(B^C))
		= (B&C) ^ (D&(B^C))
		= (B&C) + (D&(B^C))
	to expose more associativity (and thus scheduling flexibility)
	to the compiler.

3) (minor)
	ctx->lenW is always simply a copy of the low 6 bits of ctx->size,
	so there's no need to bother with it.

Actually, looking at the code, GCC manages to figure out the first
(major) one by itself.  Way to go, GCC authors!

But getting avoiding the extra temporary in trick 2 also gets rid of
some extra REX prefixes, saving 240 bytes in blk_SHA1Block, which is
kind of nice in an inner loop.
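
If you want to convince yourself of the rewrite in trick 2, a throwaway
check like this (just an illustration) does it; per bit, (B&C) and (D&(B^C))
are never both set, so OR, XOR and + all give the same answer and no carry
can ever propagate:

	#include <assert.h>

	int main(void)
	{
		unsigned b, c, d;

		/* exhaustive over single bits; the operators act bitwise,
		 * so the identity then holds for full 32-bit words too */
		for (b = 0; b < 2; b++)
			for (c = 0; c < 2; c++)
				for (d = 0; d < 2; d++)
					assert(((b & c) | (d & (b | c))) ==
					       ((b & c) + (d & (b ^ c))));
		return 0;
	}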

Here's my modified version of your earlier code.  I haven't
incorporated the W[] formation into the round functions as in
your latest version.

I'm sure you can bash the two together in very little time.  Or I'll
get to it later; I really should attend to $DAY_JOB at the moment.

diff --git a/Makefile b/Makefile
index daf4296..e6df8ec 100644
--- a/Makefile
+++ b/Makefile
@@ -84,6 +84,10 @@ all::
 # specify your own (or DarwinPort's) include directories and
 # library directories by defining CFLAGS and LDFLAGS appropriately.
 #
+# Define BLK_SHA1 environment variable if you want the C version
+# of the SHA1 that assumes you can do unaligned 32-bit loads and
+# have a fast htonl() function.
+#
 # Define PPC_SHA1 environment variable when running make to make use of
 # a bundled SHA1 routine optimized for PowerPC.
 #
@@ -1167,6 +1171,10 @@ ifdef NO_DEFLATE_BOUND
 	BASIC_CFLAGS += -DNO_DEFLATE_BOUND
 endif
 
+ifdef BLK_SHA1
+	SHA1_HEADER = "block-sha1/sha1.h"
+	LIB_OBJS += block-sha1/sha1.o
+else
 ifdef PPC_SHA1
 	SHA1_HEADER = "ppc/sha1.h"
 	LIB_OBJS += ppc/sha1.o ppc/sha1ppc.o
@@ -1184,6 +1192,7 @@ else
 endif
 endif
 endif
+endif
 ifdef NO_PERL_MAKEMAKER
 	export NO_PERL_MAKEMAKER
 endif
diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
new file mode 100644
index 0000000..261eae7
--- /dev/null
+++ b/block-sha1/sha1.c
@@ -0,0 +1,141 @@
+/*
+ * Based on the Mozilla SHA1 (see mozilla-sha1/sha1.c),
+ * optimized to do word accesses rather than byte accesses,
+ * and to avoid unnecessary copies into the context array.
+ */
+
+#include <string.h>
+#include <arpa/inet.h>
+
+#include "sha1.h"
+
+/* Hash one 64-byte block of data */
+static void blk_SHA1Block(blk_SHA_CTX *ctx, const uint32_t *data);
+
+void blk_SHA1_Init(blk_SHA_CTX *ctx)
+{
+	/* Initialize H with the magic constants (see FIPS180 for constants)
+	 */
+	ctx->H[0] = 0x67452301;
+	ctx->H[1] = 0xefcdab89;
+	ctx->H[2] = 0x98badcfe;
+	ctx->H[3] = 0x10325476;
+	ctx->H[4] = 0xc3d2e1f0;
+	ctx->size = 0;
+}
+
+
+void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, int len)
+{
+	int lenW = (int)ctx->size & 63;
+
+	ctx->size += len;
+
+	/* Read the data into W and process blocks as they get full
+	 */
+	if (lenW) {
+		int left = 64 - lenW;
+		if (len < left)
+			left = len;
+		memcpy(lenW + (char *)ctx->W, data, left);
+		if (left + lenW != 64)
+			return;
+		len -= left;
+		data += left;
+		blk_SHA1Block(ctx, ctx->W);
+	}
+	while (len >= 64) {
+		blk_SHA1Block(ctx, data);
+		data += 64;
+		len -= 64;
+	}
+	memcpy(ctx->W, data, len);
+}
+
+
+void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
+{
+	int i, lenW = (int)ctx->size & 63;
+
+	/* Pad with a binary 1 (ie 0x80), then zeroes, then length
+	 */
+	((char *)ctx->W)[lenW++] = 0x80;
+	if (lenW > 56) {
+		memset((char *)ctx->W + lenW, 0, 64 - lenW);
+		blk_SHA1Block(ctx, ctx->W);
+		lenW = 0;
+	}
+	memset((char *)ctx->W + lenW, 0, 56 - lenW);
+	ctx->W[14] = htonl(ctx->size >> 29);
+	ctx->W[15] = htonl((uint32_t)ctx->size << 3);
+	blk_SHA1Block(ctx, ctx->W);
+
+	/* Output hash
+	 */
+	for (i = 0; i < 5; i++)
+		((unsigned int *)hashout)[i] = htonl(ctx->H[i]);
+}
+
+/* SHA-1 helper macros */
+#define SHA_ROT(X,n) (((X) << (n)) | ((X) >> (32-(n))))
+#define F1(b,c,d) (((d^c)&b)^d)
+#define F2(b,c,d) (b^c^d)
+/* This version lets the compiler use the fact that + is associative. */
+#define F3(b,c,d) (c&d) + (b & (c^d))
+
+/* The basic SHA-1 round */
+#define ROUND(a, b, c, d, e, f, k, t) \
+	e += SHA_ROT(a,5) + f(b,c,d) + W[t] + k;  b = SHA_ROT(b, 30)
+/* Five SHA-1 rounds */
+#define FIVE(f, k, t) \
+	ROUND(A, B, C, D, E, f, k, t  ); \
+	ROUND(E, A, B, C, D, f, k, t+1); \
+	ROUND(D, E, A, B, C, f, k, t+2); \
+	ROUND(C, D, E, A, B, f, k, t+3); \
+	ROUND(B, C, D, E, A, f, k, t+4)
+
+static void blk_SHA1Block(blk_SHA_CTX *ctx, const uint32_t *data)
+{
+	int t;
+	uint32_t A,B,C,D,E;
+	uint32_t W[80];
+
+	for (t = 0; t < 16; t++)
+		W[t] = htonl(data[t]);
+
+	/* Unroll it? */
+	for (t = 16; t <= 79; t++)
+		W[t] = SHA_ROT(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
+
+	A = ctx->H[0];
+	B = ctx->H[1];
+	C = ctx->H[2];
+	D = ctx->H[3];
+	E = ctx->H[4];
+
+	FIVE(F1, 0x5a827999,  0);
+	FIVE(F1, 0x5a827999,  5);
+	FIVE(F1, 0x5a827999, 10);
+	FIVE(F1, 0x5a827999, 15);
+
+	FIVE(F2, 0x6ed9eba1, 20);
+	FIVE(F2, 0x6ed9eba1, 25);
+	FIVE(F2, 0x6ed9eba1, 30);
+	FIVE(F2, 0x6ed9eba1, 35);
+
+	FIVE(F3, 0x8f1bbcdc, 40);
+	FIVE(F3, 0x8f1bbcdc, 45);
+	FIVE(F3, 0x8f1bbcdc, 50);
+	FIVE(F3, 0x8f1bbcdc, 55);
+
+	FIVE(F2, 0xca62c1d6, 60);
+	FIVE(F2, 0xca62c1d6, 65);
+	FIVE(F2, 0xca62c1d6, 70);
+	FIVE(F2, 0xca62c1d6, 75);
+
+	ctx->H[0] += A;
+	ctx->H[1] += B;
+	ctx->H[2] += C;
+	ctx->H[3] += D;
+	ctx->H[4] += E;
+}
diff --git a/block-sha1/sha1.h b/block-sha1/sha1.h
new file mode 100644
index 0000000..c9dc156
--- /dev/null
+++ b/block-sha1/sha1.h
@@ -0,0 +1,21 @@
+/*
+ * Based on the Mozilla SHA1 (see mozilla-sha1/sha1.h),
+ * optimized to do word accesses rather than byte accesses,
+ * and to avoid unnecessary copies into the context array.
+ */
+#include <stdint.h>
+
+typedef struct {
+	uint32_t H[5];
+	uint64_t size;
+	uint32_t W[16];
+} blk_SHA_CTX;
+
+void blk_SHA1_Init(blk_SHA_CTX *ctx);
+void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *dataIn, int len);
+void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx);
+
+#define git_SHA_CTX	blk_SHA_CTX
+#define git_SHA1_Init	blk_SHA1_Init
+#define git_SHA1_Update	blk_SHA1_Update
+#define git_SHA1_Final	blk_SHA1_Final

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  4:50                                     ` Linus Torvalds
@ 2009-08-06  5:19                                       ` Artur Skawina
  2009-08-06  7:03                                         ` George Spelvin
  0 siblings, 1 reply; 129+ messages in thread
From: Artur Skawina @ 2009-08-06  5:19 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git

Linus Torvalds wrote:
> 
> On Thu, 6 Aug 2009, Artur Skawina wrote:
>> #             TIME[s] SPEED[MB/s]
>> rfc3174         1.357       44.99
>> rfc3174         1.352       45.13
>> mozilla         1.509       40.44
>> mozillaas       1.133       53.87
>> linus          0.5818       104.9
>>
>> so it's more than twice as fast as the mozilla implementation.
> 
> So that's some general SHA1 benchmark you have?
> 
> I hope it tests correctness too. 

yep, sort of, i just check that all versions return the same result
when hashing some pseudorandom data.

> As to my atom testing: my poor little atom is a sad little thing, and 
> it's almost painful to benchmark that thing. But it's worth it to look at 
> how the 32-bit code compares to the openssl asm code too:
> 
>  - BLK_SHA1:
> 	real	2m27.160s
>  - OpenSSL:
> 	real	2m12.580s
>  - Mozilla-SHA1:
> 	real	3m21.836s
> 
> As expected, the hand-tuned assembly does better (and by a bigger margin). 
> Probably partly because scheduling is important when in-order, and partly 
> because gcc will have a harder time with the small register set.
> 
> But it's still a big improvement over mozilla one.
> 
> (This is, as always, 'git fsck --full'. It spends about 50% on that SHA1 
> calculation, so the SHA1 speedup is larger than you see from just the
> numbers)

I'll start looking at other cpus once i integrate the asm versions into
my benchmark. 

P4s really are "special". Even something as simple as this on top of your
version:

@@ -129,8 +133,8 @@
 
 #define T_20_39(t) \
        SHA_XOR(t); \
-       TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0x6ed9eba1; \
-       E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
+       TEMP += SHA_ROL(A,5) + (B^C^D) + E; \
+       E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP + 0x6ed9eba1;
 
        T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24);
        T_20_39(25); T_20_39(26); T_20_39(27); T_20_39(28); T_20_39(29);
@@ -139,8 +143,8 @@
 
 #define T_40_59(t) \
        SHA_XOR(t); \
-       TEMP += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + 0x8f1bbcdc; \
-       E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
+       TEMP += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E; \
+       E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP + 0x8f1bbcdc;
 
        T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44);
        T_40_59(45); T_40_59(46); T_40_59(47); T_40_59(48); T_40_59(49);

saves another 10% or so:

#Initializing... Rounds: 1000000, size: 62500K, time: 1.421s, speed: 42.97MB/s
#             TIME[s] SPEED[MB/s]
rfc3174         1.403        43.5
# New hash result: b747042d9f4f1fdabd2ac53076f8f830dea7fe0f
rfc3174         1.403       43.51
linus          0.5891       103.6
linusas        0.5337       114.4
mozilla         1.535       39.76
mozillaas       1.128       54.13
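
For clarity, the only thing the change does is fold the round constant
into the assignment of the new A instead of summing it in with everything
else.  A self-contained sketch with my own helper names (not the macros
from the patch):

#include <stdint.h>

static inline uint32_t rol32(uint32_t v, int n)
{
	return (v << n) | (v >> (32 - n));
}

/* original formulation: the constant K is summed in with the rest */
static void round_before(uint32_t *a, uint32_t *b, uint32_t *c,
			 uint32_t *d, uint32_t *e, uint32_t w, uint32_t k)
{
	uint32_t temp = w + rol32(*a, 5) + (*b ^ *c ^ *d) + *e + k;
	*e = *d; *d = *c; *c = rol32(*b, 30); *b = *a; *a = temp;
}

/* tweaked formulation: K is deferred into the assignment of the new A */
static void round_after(uint32_t *a, uint32_t *b, uint32_t *c,
			uint32_t *d, uint32_t *e, uint32_t w, uint32_t k)
{
	uint32_t temp = w + rol32(*a, 5) + (*b ^ *c ^ *d) + *e;
	*e = *d; *d = *c; *c = rol32(*b, 30); *b = *a; *a = temp + k;
}

Both compute the same result; the difference only matters for how the
adds can be scheduled.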


artur

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  4:27                                 ` Linus Torvalds
@ 2009-08-06  5:44                                   ` Artur Skawina
  2009-08-06  5:56                                     ` Artur Skawina
  0 siblings, 1 reply; 129+ messages in thread
From: Artur Skawina @ 2009-08-06  5:44 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git

Linus Torvalds wrote:
> 
> On Thu, 6 Aug 2009, Artur Skawina wrote:
>>> The way it's written, I can easily make it do one or the other by just 
>>> turning the macro inside a loop (and we can have a preprocessor flag to 
>>> choose one or the other), but let me work on it a bit more first.
>> that's of course how i measured it.. :)
> 
> Well, with my "rolling 512-bit array" I can't do that easily any more.
> 
> Now it actually depends on the compiler being able to statically do that 
> circular list calculation. If I were to turn it back into the chunks of 
> loops, my new code would suck, because it would have all those nasty 
> dynamic address calculations.

I did try (obvious patch below), and in fact the loops still win on P4:

#Initializing... Rounds: 1000000, size: 62500K, time: 1.428s, speed: 42.76MB/s
#             TIME[s] SPEED[MB/s]
rfc3174         1.437       42.47
rfc3174         1.438       42.45
linus          0.5791       105.4
linusas        0.5052       120.8
mozilla         1.525       40.01
mozillaas       1.192       51.19

artur

--- block-sha1/sha1.c	2009-08-06 06:45:03.407322970 +0200
+++ block-sha1/sha1as.c	2009-08-06 07:36:41.332318683 +0200
@@ -107,13 +107,17 @@
 
 #define T_0_15(t) \
 	TEMP = htonl(data[t]); array[t] = TEMP; \
-	TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \
-	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \
+	TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E; \
+	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP + 0x5a827999; \
 
+#if UNROLL
 	T_0_15( 0); T_0_15( 1); T_0_15( 2); T_0_15( 3); T_0_15( 4);
 	T_0_15( 5); T_0_15( 6); T_0_15( 7); T_0_15( 8); T_0_15( 9);
 	T_0_15(10); T_0_15(11); T_0_15(12); T_0_15(13); T_0_15(14);
 	T_0_15(15);
+#else
+	for (int t = 0; t <= 15; t++) { T_0_15(t); }
+#endif
 
 /* This "rolls" over the 512-bit array */
 #define W(x) (array[(x)&15])
@@ -125,37 +129,53 @@
 	TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \
 	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \
 
+#if UNROLL
 	T_16_19(16); T_16_19(17); T_16_19(18); T_16_19(19);
+#else
+	for (int t = 16; t <= 19; t++) { T_16_19(t); }
+#endif
 
 #define T_20_39(t) \
 	SHA_XOR(t); \
-	TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0x6ed9eba1; \
-	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
+	TEMP += SHA_ROL(A,5) + (B^C^D) + E; \
+	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP + 0x6ed9eba1;
 
+#if UNROLL
 	T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24);
 	T_20_39(25); T_20_39(26); T_20_39(27); T_20_39(28); T_20_39(29);
 	T_20_39(30); T_20_39(31); T_20_39(32); T_20_39(33); T_20_39(34);
 	T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39);
+#else
+	for (int t = 20; t <= 39; t++) { T_20_39(t); }
+#endif
 
 #define T_40_59(t) \
 	SHA_XOR(t); \
-	TEMP += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + 0x8f1bbcdc; \
-	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
+	TEMP += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E; \
+	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP + 0x8f1bbcdc;
 
+#if UNROLL
 	T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44);
 	T_40_59(45); T_40_59(46); T_40_59(47); T_40_59(48); T_40_59(49);
 	T_40_59(50); T_40_59(51); T_40_59(52); T_40_59(53); T_40_59(54);
 	T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59);
+#else
+	for (int t = 40; t <= 59; t++) { T_40_59(t); }
+#endif
 
 #define T_60_79(t) \
 	SHA_XOR(t); \
 	TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0xca62c1d6; \
 	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
 
+#if UNROLL
 	T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64);
 	T_60_79(65); T_60_79(66); T_60_79(67); T_60_79(68); T_60_79(69);
 	T_60_79(70); T_60_79(71); T_60_79(72); T_60_79(73); T_60_79(74);
 	T_60_79(75); T_60_79(76); T_60_79(77); T_60_79(78); T_60_79(79);
+#else
+	for (int t = 60; t <= 79; t++) { T_60_79(t); }
+#endif
 
 	ctx->H[0] += A;
 	ctx->H[1] += B;

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  5:44                                   ` Artur Skawina
@ 2009-08-06  5:56                                     ` Artur Skawina
  2009-08-06  7:45                                       ` Artur Skawina
  0 siblings, 1 reply; 129+ messages in thread
From: Artur Skawina @ 2009-08-06  5:56 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git

Artur Skawina wrote:
> I did try (obvious patch below), and in fact the loops still win on P4:
> 
> #Initializing... Rounds: 1000000, size: 62500K, time: 1.428s, speed: 42.76MB/s
> #             TIME[s] SPEED[MB/s]
> rfc3174         1.437       42.47
> rfc3174         1.438       42.45
> linus          0.5791       105.4
> linusas        0.5052       120.8
> mozilla         1.525       40.01
> mozillaas       1.192       51.19

and my atom seems to like the compact loops too: 

#Initializing... Rounds: 1000000, size: 62500K, time: 4.379s, speed: 13.94MB/s
#             TIME[s] SPEED[MB/s]
rfc3174         4.429       13.78
rfc3174         4.414       13.83
linus           1.733       35.22
linusas           1.5        40.7
mozilla         2.818       21.66
mozillaas       2.539       24.04

artur

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  5:19                                       ` Artur Skawina
@ 2009-08-06  7:03                                         ` George Spelvin
  0 siblings, 0 replies; 129+ messages in thread
From: George Spelvin @ 2009-08-06  7:03 UTC (permalink / raw)
  To: art.08.09, torvalds; +Cc: git, gitster, linux, nico

> On Thu, 6 Aug 2009, Artur Skawina wrote:
>> #             TIME[s] SPEED[MB/s]
>> rfc3174         1.357       44.99
>> rfc3174         1.352       45.13
>> mozilla         1.509       40.44
>> mozillaas       1.133       53.87
>> linus          0.5818       104.9

> #Initializing... Rounds: 1000000, size: 62500K, time: 1.421s, speed: 42.97MB/s
> #             TIME[s] SPEED[MB/s]
> rfc3174         1.403        43.5
> # New hash result: b747042d9f4f1fdabd2ac53076f8f830dea7fe0f
> rfc3174         1.403       43.51
> linus          0.5891       103.6
> linusas        0.5337       114.4
> mozilla         1.535       39.76
> mozillaas       1.128       54.13

I'm trying to absorb what you're learning about P4 performance, but
I'm getting confused... what is what in these benchmarks?

The major architectural decisions I see are:

1) Three possible ways to compute the W[] array for rounds 16..79:
	1a) Compute W[16..79] in a loop beforehand (you noted that unrolling
	    two copies helped significantly.)
	1b) Compute W[16..79] as part of hash rounds 16..79.
	1c) Compute W[0..15] in-place as part of hash rounds 16..79

2) The main hashing can be rolled up or unrolled:
	2a) Four 20-round loops.  (In case of options 1b and 1c, the
	    first one might be split into a 16 and a 4.)
	2b) Four 4-round loops, each unrolled 5x.  (See the ARM assembly.)
	2c) all 80 rounds unrolled.

As Linus noted, 1c is not friends with options 2a and 2b, because the
W() indexing math is no longer a compile-time constant.
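
For concreteness, here is roughly what 1a and 1c look like in C; a
minimal self-contained sketch with my own names, not the code from
either posted patch:

#include <stdint.h>

static inline uint32_t rol32(uint32_t v, int n)
{
	return (v << n) | (v >> (32 - n));
}

/* 1a: W[0..15] already hold the message block; fill in W[16..79] */
static void expand_schedule(uint32_t W[80])
{
	for (int t = 16; t < 80; t++)
		W[t] = rol32(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
}

/* 1c: 16-word circular window, indexed with a &15 mask; the slot
 * address is only a compile-time constant when t itself is constant,
 * which is why this variant wants the fully unrolled 2c */
#define W16(x) (w[(x) & 15])
static uint32_t schedule_in_place(uint32_t w[16], int t)
{
	W16(t) = rol32(W16(t-3) ^ W16(t-8) ^ W16(t-14) ^ W16(t-16), 1);
	return W16(t);
}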

Linus has posted 1a+2c and 1c+2c.  You posted some code that could be
2a or 2c depending on an UNROLL preprocessor #define.  Which combinations
are your "linus" and "linusas" code?

You talk about "and my atom seems to like the compact loops too", but
I'm not sure which loops those are.

Thanks.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  5:56                                     ` Artur Skawina
@ 2009-08-06  7:45                                       ` Artur Skawina
  0 siblings, 0 replies; 129+ messages in thread
From: Artur Skawina @ 2009-08-06  7:45 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git

Artur Skawina wrote:
> 
> and my atom seems to like the compact loops too: 

No, that was wrong; I forgot to turn off the ondemand governor...

The unrolled loops are in fact much faster, and the numbers
look more reasonable, after a few tweaks even on a P4.
Now I just need to check how well it does compared to the
asm implementations...

artur

#             TIME[s] SPEED[MB/s]
# ATOM
rfc3174         2.199       27.75
linus          0.8642       70.62
linusas         1.606       38.01
linusas2       0.8763       69.65
mozilla         2.813        21.7
mozillaas       2.539       24.04
# P4
rfc3174         1.402       43.53
linus          0.5835       104.6
linusas        0.4625         132
linusas2       0.4456         137
mozilla         1.529       39.91
mozillaas       1.131       53.96
# P3
rfc3174         5.019       12.16
linus            1.86       32.81
linusas         3.108       19.64
linusas2        1.812       33.68
mozilla         6.431        9.49
mozillaas       5.868        10.4

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  1:18                     ` Linus Torvalds
  2009-08-06  1:52                       ` Nicolas Pitre
@ 2009-08-06 18:49                       ` Erik Faye-Lund
  1 sibling, 0 replies; 129+ messages in thread
From: Erik Faye-Lund @ 2009-08-06 18:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: George Spelvin, gitster, git

On Thu, Aug 6, 2009 at 3:18 AM, Linus
Torvalds<torvalds@linux-foundation.org> wrote:
> I note that MINGW does NO_OPENSSL by default, for example, and maybe the
> MINGW people want to test the patch out and enable BLK_SHA1 rather than
> the original Mozilla one.

We recently got OpenSSL in msysgit. The NO_OPENSSL switch hasn't been
flipped yet, though. (We added OpenSSL to get https support in cURL...)

-- 
Erik "kusma" Faye-Lund
kusmabite@gmail.com
(+47) 986 59 656

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Performance issue of 'git branch'
  2009-07-24 22:13                             ` Linus Torvalds
  2009-07-24 22:18                               ` david
@ 2009-08-07  4:21                               ` Jeff King
  1 sibling, 0 replies; 129+ messages in thread
From: Jeff King @ 2009-08-07  4:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Junio C Hamano, Git Mailing List, Carlos R. Mafra,
	Daniel Barkalow, Johannes Schindelin

On Fri, Jul 24, 2009 at 03:13:07PM -0700, Linus Torvalds wrote:

> Subject: [PATCH] git-http-fetch: not a builtin
> 
> We should really try to avoid having a dependency on the curl libraries
> for the core 'git' executable. It adds huge overheads, for no advantage.
> 
> This splits up git-http-fetch so that it isn't built-in.  We still do
> end up linking with curl for the git binary due to the transport.c http
> walker, but that's at least partially an independent issue.
>
> [...]
>
> +git-http-fetch$X: revision.o http.o http-push.o $(GITLIBS)
> +	$(QUIET_LINK)$(CC) $(ALL_CFLAGS) -o $@ $(ALL_LDFLAGS) $(filter %.o,$^) \
> +		$(LIBS) $(CURL_LIBCURL) $(EXPAT_LIBEXPAT)

Err, this seems to horribly break git-http-fetch (see if you can spot
the logic error in dependencies). Patch is below.

Nobody noticed, I expect, because nothing in git _uses_ http-fetch
anymore, now that git-clone is no longer a shell script. I only noticed
because it tried to build http-push on one of my NO_EXPAT machines.

It might be an interesting exercise to dust off the old shell scripts
once in a while and see if they still pass their original tests while
running on top of a more modern git. It would test that we haven't
broken the plumbing interfaces.

-- >8 --
Subject: [PATCH] Makefile: build http-fetch against http-fetch.o

As opposed to http-push.o. We can also drop EXPAT_LIBEXPAT,
since fetch does not need it.

This appears to be a bad cut-and-paste in commit 1088261f.

Signed-off-by: Jeff King <peff@peff.net>
---
 Makefile |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Makefile b/Makefile
index 97d904b..d6362d3 100644
--- a/Makefile
+++ b/Makefile
@@ -1502,9 +1502,9 @@ http.o http-walker.o http-push.o: http.h
 
 http.o http-walker.o: $(LIB_H)
 
-git-http-fetch$X: revision.o http.o http-push.o $(GITLIBS)
+git-http-fetch$X: revision.o http.o http-fetch.o http-walker.o $(GITLIBS)
 	$(QUIET_LINK)$(CC) $(ALL_CFLAGS) -o $@ $(ALL_LDFLAGS) $(filter %.o,$^) \
-		$(LIBS) $(CURL_LIBCURL) $(EXPAT_LIBEXPAT)
+		$(LIBS) $(CURL_LIBCURL)
 git-http-push$X: revision.o http.o http-push.o $(GITLIBS)
 	$(QUIET_LINK)$(CC) $(ALL_CFLAGS) -o $@ $(ALL_LDFLAGS) $(filter %.o,$^) \
 		$(LIBS) $(CURL_LIBCURL) $(EXPAT_LIBEXPAT)
-- 
1.6.4.117.g6056d.dirty

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-03  3:47   ` x86 SHA1: Faster than OpenSSL George Spelvin
                       ` (2 preceding siblings ...)
  2009-08-04  2:30     ` Linus Torvalds
@ 2009-08-18 21:26     ` Andy Polyakov
  3 siblings, 0 replies; 129+ messages in thread
From: Andy Polyakov @ 2009-08-18 21:26 UTC (permalink / raw)
  To: George Spelvin; +Cc: git

George Spelvin wrote:
> (Work in progress, state dump to mailing list archives.)
> 
> This started when discussing git startup overhead due to the dynamic
> linker.  One big contributor is the openssl library, which is used only
> for its optimized x86 SHA-1 implementation.  So I took a look at it,
> with an eye to importing the code directly into the git source tree,
> and decided that I felt like trying to do better.
> 
> The original code was excellent, but it was optimized when the P4 was new.

Even though the last revision took place when "the P4 was new" and was even
triggered by its appearance, *all-round* performance was and always will
be the prime goal. This means that an improvement on some particular
micro-architecture is always weighed against losses on others [and a
compromise is considered if so required]. Please note that I'm *not*
trying to diminish George's effort by saying that the proposed code is
inappropriate; on the contrary, I'm nothing but grateful! Thanks, George!
I'm only saying that it will be given thorough consideration. Well, I've
actually given it that consideration and the outcome is already committed :-)
See http://cvs.openssl.org/chngview?cn=18513. I don't deliver +17%, only
+12%, but at the cost of Intel Atom-specific optimizations. I used this
opportunity to optimize even for the Intel Atom core, something I was
planning to do at some point anyway...

>   http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz
> - "tar xz cryptogams-0.tar.gz"

If there is interest, I can pack a new tarball with the updated modules.

> An open question is how to add appropriate CPU detection to the git
> build scripts. (Note that `uname -m`, which it currently uses to select
> the ARM code, does NOT produce the right answer if you're using a 32-bit
> compiler on a 64-bit platform.)

It's not only that. As the next subscriber noted, there is a problem on
MacOS X: it uses a slightly different assembler convention, so ELF
modules can't be assembled on MacOS X. The OpenSSL perlasm framework
takes care of several assembler flavors and executable formats,
including MacOS X. I'm talking about

> +++ Makefile	2009-08-02 06:44:44.000000000 -0400
> +%.s : %.pl x86asm.pl x86unix.pl
> +	perl $< elf > $@
                ^^^ this argument.

Cheers. A.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04  2:51       ` Linus Torvalds
  2009-08-04  3:07         ` Jon Smirl
@ 2009-08-18 21:50         ` Andy Polyakov
  1 sibling, 0 replies; 129+ messages in thread
From: Andy Polyakov @ 2009-08-18 21:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: George Spelvin, git

> On Mon, 3 Aug 2009, Linus Torvalds wrote:
>> The thing that I'd prefer is simply
>>
>> 	git fsck --full
>>
>> on the Linux kernel archive. For me (with a fast machine), it takes about 
>> 4m30s with the OpenSSL SHA1, and takes 6m40s with the Mozilla SHA1 (ie 
>> using a NO_OPENSSL=1 build).
>>
>> So that's an example of a load that is actually very sensitive to SHA1 
>> performance (more so than _most_ git loads, I suspect), and at the same 
>> time is a real git load rather than some SHA1-only microbenchmark.

I couldn't agree more that real-life benchmarks are of greater value
than an algorithm-specific micro-benchmark. And given the provided
profiling data, one can argue that a +17% (or my +12%) improvement on a
micro-benchmark isn't really worth bothering about. But it's kind of a
sport [at least for me], so don't judge too harshly :-)

>> It also 
>> shows very clearly why we default to the OpenSSL version over the Mozilla 
>> one.

As George implicitly mentioned, most OpenSSL assembler modules are
available under a more permissive license, and if there is interest I'm
ready to assist...

> "perf report --sort comm,dso,symbol" profiling shows the following for 
> 'git fsck --full' on the kernel repo, using the Mozilla SHA1:
> 
>     47.69%               git  /home/torvalds/git/git     [.] moz_SHA1_Update
>     22.98%               git  /lib64/libz.so.1.2.3       [.] inflate_fast
>      7.32%               git  /lib64/libc-2.10.1.so      [.] __GI_memcpy
>      4.66%               git  /lib64/libz.so.1.2.3       [.] inflate
>      3.76%               git  /lib64/libz.so.1.2.3       [.] adler32
>      2.86%               git  /lib64/libz.so.1.2.3       [.] inflate_table
>      2.41%               git  /home/torvalds/git/git     [.] lookup_object
>      1.31%               git  /lib64/libc-2.10.1.so      [.] _int_malloc
>      0.84%               git  /home/torvalds/git/git     [.] patch_delta
>      0.78%               git  [kernel]                   [k] hpet_next_event
> 
> so yeah, SHA1 performance matters. Judging by the OpenSSL numbers, the 
> OpenSSL SHA1 implementation must be about twice as fast as the C version 
> we use.

And given the /lib64 path, is this 64-bit C compiler-generated code
compared to 32-bit assembler? Either way, in this context I have an
extra comment addressing a previous subscriber, Mark Lodato, who
effectively wondered how 64-bit assembler would compare to the 32-bit
one. First of all, there *is* even a 64-bit assembler version. But as
SHA1 is essentially a 32-bit algorithm, the 64-bit implementation is
only nominally faster, +20% at most. It is faster thanks to the larger
register bank facilitating more efficient instruction scheduling.

Cheers. A.

^ permalink raw reply	[flat|nested] 129+ messages in thread

end of thread, other threads:[~2009-08-18 21:50 UTC | newest]

Thread overview: 129+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-07-26 23:21 Performance issue of 'git branch' George Spelvin
2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin
2009-07-31 11:11   ` Erik Faye-Lund
2009-07-31 11:31     ` George Spelvin
2009-07-31 11:37     ` Michael J Gruber
2009-07-31 12:24       ` Erik Faye-Lund
2009-07-31 12:29         ` Johannes Schindelin
2009-07-31 12:32         ` George Spelvin
2009-07-31 12:45           ` Erik Faye-Lund
2009-07-31 13:02             ` George Spelvin
2009-07-31 11:21   ` Michael J Gruber
2009-07-31 11:26   ` Michael J Gruber
2009-07-31 12:31   ` Carlos R. Mafra
2009-07-31 13:27   ` Brian Ristuccia
2009-07-31 14:05     ` George Spelvin
2009-07-31 13:27   ` Jakub Narebski
2009-07-31 15:05   ` Peter Harris
2009-07-31 15:22   ` Peter Harris
2009-08-03  3:47   ` x86 SHA1: Faster than OpenSSL George Spelvin
2009-08-03  7:36     ` Jonathan del Strother
2009-08-04  1:40     ` Mark Lodato
2009-08-04  2:30     ` Linus Torvalds
2009-08-04  2:51       ` Linus Torvalds
2009-08-04  3:07         ` Jon Smirl
2009-08-04  5:01           ` George Spelvin
2009-08-04 12:56             ` Jon Smirl
2009-08-04 14:29               ` Dmitry Potapov
2009-08-18 21:50         ` Andy Polyakov
2009-08-04  4:48       ` George Spelvin
2009-08-04  6:30         ` Linus Torvalds
2009-08-04  8:01           ` George Spelvin
2009-08-04 20:41             ` Junio C Hamano
2009-08-05 18:17               ` George Spelvin
2009-08-05 20:36                 ` Johannes Schindelin
2009-08-05 20:44                 ` Junio C Hamano
2009-08-05 20:55                 ` Linus Torvalds
2009-08-05 23:13                   ` Linus Torvalds
2009-08-06  1:18                     ` Linus Torvalds
2009-08-06  1:52                       ` Nicolas Pitre
2009-08-06  2:04                         ` Junio C Hamano
2009-08-06  2:10                           ` Linus Torvalds
2009-08-06  2:20                           ` Nicolas Pitre
2009-08-06  2:08                         ` Linus Torvalds
2009-08-06  3:19                           ` Artur Skawina
2009-08-06  3:31                             ` Linus Torvalds
2009-08-06  3:48                               ` Linus Torvalds
2009-08-06  4:01                                 ` Linus Torvalds
2009-08-06  4:28                                   ` Artur Skawina
2009-08-06  4:50                                     ` Linus Torvalds
2009-08-06  5:19                                       ` Artur Skawina
2009-08-06  7:03                                         ` George Spelvin
2009-08-06  4:52                                 ` George Spelvin
2009-08-06  4:08                               ` Artur Skawina
2009-08-06  4:27                                 ` Linus Torvalds
2009-08-06  5:44                                   ` Artur Skawina
2009-08-06  5:56                                     ` Artur Skawina
2009-08-06  7:45                                       ` Artur Skawina
2009-08-06 18:49                       ` Erik Faye-Lund
2009-08-04  6:40         ` Linus Torvalds
2009-08-18 21:26     ` Andy Polyakov
  -- strict thread matches above, loose matches on Subject: below --
2009-07-22 23:59 Performance issue of 'git branch' Carlos R. Mafra
2009-07-23  0:21 ` Linus Torvalds
2009-07-23  0:51   ` Linus Torvalds
2009-07-23  0:55     ` Linus Torvalds
2009-07-23  2:02       ` Carlos R. Mafra
2009-07-23  2:28         ` Linus Torvalds
2009-07-23 12:42           ` Jakub Narebski
2009-07-23 14:45             ` Carlos R. Mafra
2009-07-23 16:25             ` Linus Torvalds
2009-07-23  1:22   ` Carlos R. Mafra
2009-07-23  2:20     ` Linus Torvalds
2009-07-23  2:23       ` Linus Torvalds
2009-07-23  3:08         ` Linus Torvalds
2009-07-23  3:21           ` Linus Torvalds
2009-07-23 17:47             ` Tony Finch
2009-07-23 18:57               ` Linus Torvalds
2009-07-23  3:18         ` Carlos R. Mafra
2009-07-23  3:27           ` Carlos R. Mafra
2009-07-23  3:40           ` Carlos R. Mafra
2009-07-23  3:47           ` Linus Torvalds
2009-07-23  4:10             ` Linus Torvalds
2009-07-23  5:13               ` Junio C Hamano
2009-07-23  5:17               ` Carlos R. Mafra
2009-07-23  4:40         ` Junio C Hamano
2009-07-23  5:36           ` Linus Torvalds
2009-07-23  5:52             ` Junio C Hamano
2009-07-23  6:04               ` Junio C Hamano
2009-07-23 17:19                 ` Linus Torvalds
2009-07-23 16:07           ` Carlos R. Mafra
2009-07-23 16:19             ` Linus Torvalds
2009-07-23 16:53               ` Carlos R. Mafra
2009-07-23 19:05                 ` Linus Torvalds
2009-07-23 19:13                   ` Linus Torvalds
2009-07-23 19:55                     ` Carlos R. Mafra
2009-07-24 20:36                       ` Linus Torvalds
2009-07-24 20:47                         ` Linus Torvalds
2009-07-24 21:21                           ` Linus Torvalds
2009-07-24 22:13                             ` Linus Torvalds
2009-07-24 22:18                               ` david
2009-07-24 22:42                                 ` Linus Torvalds
2009-07-24 22:46                                   ` david
2009-07-25  2:39                                     ` Linus Torvalds
2009-07-25  2:53                                       ` Daniel Barkalow
2009-08-07  4:21                               ` Jeff King
2009-07-24 22:54                             ` Theodore Tso
2009-07-24 22:59                               ` Shawn O. Pearce
2009-07-24 23:28                                 ` Junio C Hamano
2009-07-26 17:07                                 ` Avi Kivity
2009-07-26 17:16                                   ` Johannes Schindelin
2009-07-24 23:46                             ` Carlos R. Mafra
2009-07-25  0:41                               ` Carlos R. Mafra
2009-07-25 18:04                                 ` Linus Torvalds
2009-07-25 18:57                                   ` Timo Hirvonen
2009-07-25 19:06                                     ` Reece Dunn
2009-07-25 20:31                                     ` Mike Hommey
2009-07-25 21:02                                       ` Linus Torvalds
2009-07-25 21:13                                         ` Linus Torvalds
2009-07-25 23:23                                           ` Johannes Schindelin
2009-07-26  4:49                                             ` Linus Torvalds
2009-07-26 16:29                                               ` Theodore Tso
2009-07-26  7:54                                         ` Mike Hommey
2009-07-26 10:16                                           ` Johannes Schindelin
2009-07-26 10:23                                             ` demerphq
2009-07-26 10:27                                               ` demerphq
2009-07-25 21:04                                     ` Carlos R. Mafra
2009-07-23 16:48         ` Anders Kaseorg
2009-07-23 19:03           ` Carlos R. Mafra
2009-07-23  0:23 ` SZEDER Gábor
2009-07-23  2:25   ` Carlos R. Mafra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).