* git-fetch per-repository speed issues
@ 2006-07-03 18:02 Keith Packard
2006-07-03 23:14 ` Linus Torvalds
` (2 more replies)
0 siblings, 3 replies; 30+ messages in thread
From: Keith Packard @ 2006-07-03 18:02 UTC (permalink / raw)
To: Git Mailing List; +Cc: keithp
Ok, so maybe X.org is using git in an unexpected (or even wrong)
fashion. Our environment has development split across dozens of separate
repositories which match ABI interfaces. With CVS, we were able to keep
all of this in one giant repository with separate modules, so we could
use cvsup or rsync to update the entire collection of modules in one go;
git doesn't have that notion (which is mostly good).
With git, we'd prefer to use the git protocol instead of rsync for the
usual pack-related reasons, but that is limited to a single repository
at a time. And, it's painfully slow, even when the repository is up to
date:
$ cd lib/libXrandr
$ time git-fetch origin
...
real 0m17.035s
user 0m2.584s
sys 0m0.576s
This is a repository with 24 files and perhaps 50 revisions. Given
X.org's 307 git repositories, I'll clearly need to find a faster way
than running git-fetch on every one.
One thing I noticed was that the git+ssh syntax found in remotes files
doesn't do what I thought it did -- I assumed this would use 'git' for
fetch and 'ssh' for push, when in fact it just uses ssh for everything.
This slows down the connection process by several seconds.
--
keith.packard@intel.com
^ permalink raw reply [flat|nested] 30+ messages in thread

* Re: git-fetch per-repository speed issues
From: Linus Torvalds @ 2006-07-03 23:14 UTC (permalink / raw)
To: Keith Packard; +Cc: Git Mailing List

On Mon, 3 Jul 2006, Keith Packard wrote:
>
> With git, we'd prefer to use the git protocol instead of rsync for the
> usual pack-related reasons, but that is limited to a single repository
> at a time.

Well, you could use multiple branches in the same repository, even if
they are totally unrelated. That would allow you to fetch them all in
one go.

One way to do that is to name the branches hierarchically: have one
repo, but call the branches something like

	libXrandr/master
	libXrandr/develop
	Xorg/master
	Xorg/develop
	...

> And, it's painfully slow, even when the repository is up to
> date:
>
> $ cd lib/libXrandr
> $ time git-fetch origin
> ...
>
> real    0m17.035s
> user    0m2.584s
> sys     0m0.576s

That's _seriously_ wrong. If everything is up-to-date, a fetch should
be basically zero-cost. That's especially true with the anonymous git
protocol, which doesn't have any connection validation overhead (for
the ssh protocol, the cost is usually the ssh login). But there may
well be some bug there.

Look at this:

	[torvalds@g5 git]$ time git fetch git://git.kernel.org/pub/scm/git/git.git

	real    0m0.431s
	user    0m0.036s
	sys     0m0.024s

and that's over my DSL line, not some studly network thing.

Basically, a repo that is up-to-date should do a "git fetch" about as
quickly as it does a "git ls-remote".
Which in turn really shouldn't be doing much of anything at all, apart
from the connect itself:

	[torvalds@g5 git]$ time git ls-remote master.kernel.org:/pub/scm/git/git.git > /dev/null

	real    0m1.758s
	user    0m0.188s
	sys     0m0.024s

	[torvalds@g5 git]$ time git ls-remote git://git.kernel.org/pub/scm/git/git.git > /dev/null

	real    0m0.431s
	user    0m0.056s
	sys     0m0.016s

(note how the ssh connection is much slower - it actually ends up doing
all the ssh back-and-forth).

Can you try from different hosts? One problem may be the remote end
trying to do reverse DNS lookups for xinetd or whatever.

Also, one thing to try is to just do

	strace -Ttt git-peek-remote ...

which shows where the time is going (I selected "git-peek-remote"
because that's a simple program).

		Linus
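[Editorial note: the hierarchical-branch setup Linus suggests can be sketched with the old-style remotes file; the URL and branch names below are illustrative, not taken from the thread.]

```
# Hypothetical $GIT_DIR/remotes/origin for a single umbrella repository
# (2006-era remotes-file format; repo URL and branch names are made up):
URL: git://anongit.example.org/xorg-all.git
Pull: refs/heads/libXrandr/master:refs/heads/libXrandr/master
Pull: refs/heads/Xorg/master:refs/heads/Xorg/master
```

A single "git fetch origin" would then update every tracked project over
one connection, instead of one connection (and one ssh handshake) per
repository.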
* Re: git-fetch per-repository speed issues
From: Jeff King @ 2006-07-04 0:21 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Keith Packard, Git Mailing List

On Mon, Jul 03, 2006 at 04:14:10PM -0700, Linus Torvalds wrote:

> Well, you could use multiple branches in the same repository, even if
> they are totally unrelated. That would allow you to fetch them all in
> one go.

One annoying thing about this is that you may want to have several of
the branches checked out at a time (i.e., you want the actual directory
structure of libXrandr/, Xorg/, etc.). You could pull everything down
into one repo and point small pseudo-repos at it with alternates, but I
would think that would become a mess with pushes. You can do some magic
with read-tree --prefix, but again, I'm not sure how you'd make commits
on the correct branch. Is there an easier way to do this?

> Basically, a repo that is up-to-date should do a "git fetch" about as
> quickly as it does a "git ls-remote". Which in turn really shouldn't
> be doing much of anything at all, apart from the connect itself:

Fetching by ssh actually makes two ssh connections (the second is to
grab tags).

-Peff
* Re: git-fetch per-repository speed issues
From: Ryan Anderson @ 2006-07-04 1:22 UTC (permalink / raw)
To: Jeff King; +Cc: Linus Torvalds, Keith Packard, Git Mailing List

Jeff King wrote:
> One annoying thing about this is that you may want to have several of
> the branches checked out at a time (i.e., you want the actual
> directory structure of libXrandr/, Xorg/, etc.). [...] Is there an
> easier way to do this?

You can have multiple source trees, one per 'branch' (which is a bit of
a bad term here), and have completely unrelated things in the branches.

See, for an example, the main git repo, which has the "man", "html",
and "todo" branches, logically distinct and (somewhat) unrelated to the
main branch tucked away in "master".

--
Ryan Anderson
  sometimes Pug Majere
* Re: git-fetch per-repository speed issues
From: Jeff King @ 2006-07-04 1:44 UTC (permalink / raw)
To: Ryan Anderson; +Cc: Linus Torvalds, Keith Packard, Git Mailing List

On Mon, Jul 03, 2006 at 06:22:26PM -0700, Ryan Anderson wrote:

> You can have multiple source trees, one per 'branch' (which is a bit
> of a bad term here), and have completely unrelated things in the
> branches.
>
> See, for an example, the main git repo, which has the "man", "html",
> and "todo" branches, logically distinct and (somewhat) unrelated to
> the main branch tucked away in "master".

Right, I know, but my complaint is that I can't then turn that into a
directory hierarchy of .../man, .../html, .../todo that are all checked
out at the same time (there are obviously ways of playing with it, say
by setting GIT_DIR and doing a checkout in those directories, but then
I can't use git in the normal way).

The best I can come up with is having man, html, and todo repos
pointing at the one (now local) repo which contains everything. But
then pushing is a two-step process.

-Peff
* Re: git-fetch per-repository speed issues
From: Ryan Anderson @ 2006-07-04 1:55 UTC (permalink / raw)
To: Jeff King; +Cc: Linus Torvalds, Keith Packard, Git Mailing List

Jeff King wrote:
> Right, I know, but my complaint is that I can't then turn that into a
> directory hierarchy of .../man, .../html, .../todo that are all
> checked out at the same time [...]
>
> The best I can come up with is having man, html, and todo repos
> pointing at the one (now local) repo which contains everything. But
> then pushing is a two-step process.

Hrm, if I understand CVS at all, the old workflow was "cvsup a copy of
the repository, update a working tree against that", which is, I think,
actually even worse than the git equivalent, since you can't reliably
even commit to that local clone of the CVS repository.

What am I missing? You can still push directly upstream, I suppose, and
just do 2-stage pulls down.

--
Ryan Anderson
  sometimes Pug Majere
* Re: git-fetch per-repository speed issues
From: Linus Torvalds @ 2006-07-04 3:07 UTC (permalink / raw)
To: Jeff King; +Cc: Keith Packard, Git Mailing List

On Mon, 3 Jul 2006, Jeff King wrote:
>
> Fetching by ssh actually makes two ssh connections (the second is to
> grab tags).

True. Although that should happen only if there are any new tags.

		Linus
* Re: git-fetch per-repository speed issues
From: Jeff King @ 2006-07-05 6:47 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List

On Mon, Jul 03, 2006 at 08:07:49PM -0700, Linus Torvalds wrote:

> > Fetching by ssh actually makes two ssh connections (the second is
> > to grab tags).
>
> True. Although that should happen only if there are any new tags.

Either you're wrong or there's a bug in git-fetch. I think you're
missing the call to git-ls-remote --tags to get the list of tags (which
we will then auto-follow if necessary). So in that case, there would
actually be 3 ssh connections. If everything is up to date, we still
make 2 connections (one to check refs from the remotes file, and one to
check the remote tag list).

-Peff
* Re: git-fetch per-repository speed issues
From: Linus Torvalds @ 2006-07-05 16:40 UTC (permalink / raw)
To: Jeff King; +Cc: Git Mailing List

On Wed, 5 Jul 2006, Jeff King wrote:
>
> Either you're wrong or there's a bug in git-fetch.

I was wrong - I forgot the git-ls-remote (which really should be
unnecessary, but the way git-fetch-pack works, we end up
re-connecting).

		Linus
* Re: git-fetch per-repository speed issues
From: Jakub Narebski @ 2006-07-04 6:44 UTC (permalink / raw)
To: git

Jeff King wrote:
> On Mon, Jul 03, 2006 at 04:14:10PM -0700, Linus Torvalds wrote:
>
> > Well, you could use multiple branches in the same repository, even
> > if they are totally unrelated. That would allow you to fetch them
> > all in one go.
>
> One annoying thing about this is that you may want to have several of
> the branches checked out at a time [...] Is there an easier way to do
> this?

Write proper subproject support for git, or pester someone to write it
(finally). See Subpro.txt in the todo branch.

--
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git
* Re: git-fetch per-repository speed issues
From: Linus Torvalds @ 2006-07-04 3:21 UTC (permalink / raw)
To: Keith Packard; +Cc: Git Mailing List, Junio C Hamano

On Mon, 3 Jul 2006, Keith Packard wrote:
> On Mon, 2006-07-03 at 16:14 -0700, Linus Torvalds wrote:
> >
> > Well, you could use multiple branches in the same repository, even
> > if they are totally unrelated. That would allow you to fetch them
> > all in one go.
>
> I'd like to avoid this; the hope is that most people won't ever need
> to look at most repositories; it would be somewhat like having glibc
> in the same repo as the kernel...

Sure, understood. I'm just saying that if you want to fetch in one go,
it's one possibility. However, your setup has something else seriously
wrong.

> Yeah, I tried with the git protocol and it's a few seconds faster
> (about 14 seconds instead of 17).

Ick. That's -still- about 13 seconds too much.

> I think it might have something to do with the number of heads we're
> tracking.

It really shouldn't matter. You get all the heads in one go with a
single connection, so if 32 heads takes 32 times longer, there's
something wrong.

> > Also, one thing to try is to just do
> >
> > 	strace -Ttt git-peek-remote ...
>
> That's plenty fast, 0.410 seconds, with nothing ugly in the strace.

Ok, a "git fetch" really shouldn't take any longer than a single
connection. However, the fact that you have 32 heads, and it takes
pretty close to _exactly_ 32 times 0.410 seconds (32*0.410s = 13.1s)
makes me suspect that "git fetch" is just broken and fetches one branch
at a time.

Which would be just stupid.

But look as I might, I see only that one "git-fetch-pack" in
git-fetch.sh that should trigger. Once. Not 32 times. But your timings
sure sound like it's doing a _lot_ more than it should.
Junio, any ideas?

Keithp, can you try this trivial patch? It _should_ say something like

	Fetching refs/heads/master refs/heads/... refs/heads/... ... refs/heads/... from git://..../...

and more importantly, it should say so only once. And then it should
leave a "fetch.trace" file in your working directory, which should show
where that _one_ thing spends its time.

		Linus

----
diff --git a/git-fetch.sh b/git-fetch.sh
index 48818f8..4739202 100755
--- a/git-fetch.sh
+++ b/git-fetch.sh
@@ -339,6 +339,8 @@ fetch_main () {
       ( : subshell because we muck with IFS
 	IFS=" 	$LF"
 	(
+	  echo "Fetching $rref from $remote" >&2
+	  strace -o fetch.trace -Ttt \
 	  git-fetch-pack $exec $keep --thin "$remote" $rref ||
 	  echo failed "$remote"
 	) |
       while read sha1 remote_name
* Re: git-fetch per-repository speed issues
From: Junio C Hamano @ 2006-07-04 3:30 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> Ok, a "git fetch" really shouldn't take any longer than a single
> connection. However, the fact that you have 32 heads, and it takes
> pretty close to _exactly_ 32 times 0.410 seconds (32*0.410s = 13.1s)
> makes me suspect that "git fetch" is just broken and fetches one
> branch at a time.
>
> Which would be just stupid.
>
> But look as I might, I see only that one "git-fetch-pack" in
> git-fetch.sh that should trigger. Once. Not 32 times. But your
> timings sure sound like it's doing a _lot_ more than it should.
>
> Junio, any ideas?

Isn't that because the repository has 32 subprojects, totally unrelated
content-wise? If you have real stuff to pull from there, your pack
generation needs to do 32 times as much work as it would for a single
head in that case.

If you are discussing the "peek-remote runs, finds out the 32 heads are
all up to date, and no pack is generated" case, then you are right.
There is one single fetch-pack to grab the specified heads, and after
that, an optional single ls-remote and fetch-pack runs only once to
follow all new tags.
* Re: git-fetch per-repository speed issues
From: Linus Torvalds @ 2006-07-04 3:40 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git

On Mon, 3 Jul 2006, Junio C Hamano wrote:
>
> Isn't that because the repository has 32 subprojects, totally
> unrelated content-wise? If you have real stuff to pull from there,
> your pack generation needs to do 32 times as much work as it would
> for a single head in that case.

No, Keith said this was for the case where the fetching repository is
already totally up-to-date:

	"And, it's painfully slow, even when the repository is up to date"

and gave a 17-second time.

		Linus
* Re: git-fetch per-repository speed issues
From: Keith Packard @ 2006-07-04 4:30 UTC (permalink / raw)
To: Linus Torvalds; +Cc: keithp, Junio C Hamano, git

On Mon, 2006-07-03 at 20:40 -0700, Linus Torvalds wrote:

> 	"And, it's painfully slow, even when the repository is up to date"
>
> and gave a 17-second time.

It's faster this evening, down to 8 seconds using ssh and 4 seconds
using git. I clearly need to force use of the git protocol. Anyone else
like the attached patch?

---
 connect.c |   18 ++++++++++++++----
 1 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/connect.c b/connect.c
index 9a87bd9..e74eddc 100644
--- a/connect.c
+++ b/connect.c
@@ -303,6 +303,7 @@ enum protocol {
 	PROTO_LOCAL = 1,
 	PROTO_SSH,
 	PROTO_GIT,
+	PROTO_GIT_SSH,
 };

 static enum protocol get_protocol(const char *name)
@@ -312,9 +313,9 @@ static enum protocol get_protocol(const
 	if (!strcmp(name, "git"))
 		return PROTO_GIT;
 	if (!strcmp(name, "git+ssh"))
-		return PROTO_SSH;
+		return PROTO_GIT_SSH;
 	if (!strcmp(name, "ssh+git"))
-		return PROTO_SSH;
+		return PROTO_GIT_SSH;
 	die("I don't handle protocol '%s'", name);
 }

@@ -572,6 +573,14 @@ static void git_proxy_connect(int fd[2],
 	close(pipefd[1][0]);
 }

+/* returns whether the specified command can be interpreted by the daemon */
+int git_is_daemon_command(const char *prog)
+{
+	if (!strcmp("git-upload-pack", prog))
+		return 1;
+	return 0;
+}
+
 /*
  * Yeah, yeah, fixme. Need to pass in the heads etc.
  */
@@ -641,7 +650,8 @@ int git_connect(int fd[2], char *url, co
 		*ptr = '\0';
 	}

-	if (protocol == PROTO_GIT) {
+	if (protocol == PROTO_GIT ||
+	    (protocol == PROTO_GIT_SSH && git_is_daemon_command(prog))) {
 		/* These underlying connection commands die() if they
 		 * cannot connect.
 		 */
@@ -678,7 +688,7 @@ int git_connect(int fd[2], char *url, co
 	close(pipefd[0][1]);
 	close(pipefd[1][0]);
 	close(pipefd[1][1]);
-	if (protocol == PROTO_SSH) {
+	if (protocol == PROTO_SSH || protocol == PROTO_GIT_SSH) {
 		const char *ssh, *ssh_basename;
 		ssh = getenv("GIT_SSH");
 		if (!ssh) ssh = "ssh";
--
1.4.1.g8fced-dirty

--
keith.packard@intel.com
* Re: git-fetch per-repository speed issues
From: Andreas Ericsson @ 2006-07-04 11:10 UTC (permalink / raw)
To: Keith Packard; +Cc: Linus Torvalds, Junio C Hamano, git

Keith Packard wrote:
> It's faster this evening, down to 8 seconds using ssh and 4 seconds
> using git. I clearly need to force use of the git protocol. Anyone
> else like the attached patch?

Since it changes the current meaning of ssh+git, I'm not exactly
thrilled. However, "git/ssh" or "ssh/git" would work fine for me. The
slash separator could be used to say "fetch over this, push over that",
so we can end up with any valid protocol to use for fetches and another
one to push over.

--
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
* Re: git-fetch per-repository speed issues
From: Matthias Kestenholz @ 2006-07-04 11:18 UTC (permalink / raw)
To: Andreas Ericsson; +Cc: git

* Andreas Ericsson (ae@op5.se) wrote:
> Since it changes the current meaning of ssh+git, I'm not exactly
> thrilled. However, "git/ssh" or "ssh/git" would work fine for me. The
> slash separator could be used to say "fetch over this, push over
> that", so we can end up with any valid protocol to use for fetches
> and another one to push over.

If we did such a thing, we would probably be better off allowing
different URLs for pushing and pulling, because the git and ssh URLs
will only be the same if the git repositories are located in the root
folder, and I suspect that's almost never the case.

Matthias
* Re: git-fetch per-repository speed issues
From: Andreas Ericsson @ 2006-07-04 12:05 UTC (permalink / raw)
To: Matthias Kestenholz; +Cc: git

Matthias Kestenholz wrote:
> If we did such a thing, we would probably be better off allowing
> different URLs for pushing and pulling, because the git and ssh URLs
> will only be the same if the git repositories are located in the root
> folder, and I suspect that's almost never the case.

True. We use relative paths where I work, so for us either way would
work. Your way is better though.

--
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
* Re: git-fetch per-repository speed issues
From: Keith Packard @ 2006-07-04 4:02 UTC (permalink / raw)
To: Linus Torvalds, Git Mailing List; +Cc: keithp

On Mon, 2006-07-03 at 20:21 -0700, Linus Torvalds wrote:

> Keithp, can you try this trivial patch? It _should_ say something like

Yeah, it says that only once. And, it runs the fetch-pack in about .5
seconds. And, now the whole process completes in 4.7 seconds; perhaps
the remote server is less loaded than earlier this afternoon? It's also
possible that I was running old git bits here, but I don't think so.

> And then it should leave a "fetch.trace" file in your working
> directory, which should show where that _one_ thing spends its time.

It looks boring to me and spent 0.55 seconds from start to finish. I
can send along the whole trace if you have an acute desire to peer at
it.

--
keith.packard@intel.com
* Re: git-fetch per-repository speed issues
From: Linus Torvalds @ 2006-07-04 4:19 UTC (permalink / raw)
To: Keith Packard; +Cc: Git Mailing List

On Mon, 3 Jul 2006, Keith Packard wrote:
>
> Yeah, it says that only once. And, it runs the fetch-pack in about .5
> seconds. And, now the whole process completes in 4.7 seconds; perhaps
> the remote server is less loaded than earlier this afternoon?

Well, that's still strange. What takes 4.2 seconds then?

> > And then it should leave a "fetch.trace" file in your working
> > directory, which should show where that _one_ thing spends its time.
>
> It looks boring to me and spent 0.55 seconds from start to finish. I
> can send along the whole trace if you have an acute desire to peer at
> it.

No, the 0.5 seconds is what I _expected_. There's something strange
going on in your git fetch that makes it take any longer than that.

Can you instrument your "git-fetch.sh" script (just add random

	(echo $LINENO ; date) >&2

lines all over) to see what is so expensive? That fetch-pack really
should be the most expensive part by far (and half a second sounds
right), but it clearly isn't. At 4.7s, your fetch is still taking about
ten times longer than it _should_.

		Linus
* Re: git-fetch per-repository speed issues
From: Keith Packard @ 2006-07-04 5:05 UTC (permalink / raw)
To: Linus Torvalds; +Cc: keithp, Git Mailing List

On Mon, 2006-07-03 at 21:19 -0700, Linus Torvalds wrote:

> Can you instrument your "git-fetch.sh" script (just add random
>
> 	(echo $LINENO ; date) >&2
>
> lines all over) to see what is so expensive?

	  5  Start:                           21:59:01.584648000
	 66  After args:                      21:59:01.605987000
	248  fetch_main() start:              21:59:02.408559000
	339  fetch_main() before fetch-pack:  21:59:03.293228000
	387  fetch_main() done:               21:59:04.784388000
	422  After tag following:             21:59:05.311439000
	438  All done:                        21:59:05.315338000

fetch-pack itself took 0.421 seconds (measured with time(1)).

Looks like the bulk of the time here is caused by simple shell
processing overhead, some of which scales with the number of heads and
tags to track.

--
keith.packard@intel.com
* Re: git-fetch per-repository speed issues
From: Linus Torvalds @ 2006-07-04 5:36 UTC (permalink / raw)
To: Keith Packard; +Cc: Git Mailing List

On Mon, 3 Jul 2006, Keith Packard wrote:
>
> 	  5  Start:                           21:59:01.584648000
> 	 66  After args:                      21:59:01.605987000
> 	248  fetch_main() start:              21:59:02.408559000
> 	339  fetch_main() before fetch-pack:  21:59:03.293228000
> 	387  fetch_main() done:               21:59:04.784388000
> 	422  After tag following:             21:59:05.311439000
> 	438  All done:                        21:59:05.315338000
>
> fetch-pack itself took 0.421 seconds (measured with time(1)).
>
> Looks like the bulk of the time here is caused by simple shell
> processing overhead, some of which scales with the number of heads
> and tags to track.

Ahh.. Do you have tons of tags at the other end?

Looking closer, I suspect a big part of it is the

	git-ls-remote $upload_pack --tags "$remote" |
	sed -ne 's|^\([0-9a-f]*\)[ 	]\(refs/tags/.*\)^{}$|\1 \2|p' |
	while read sha1 name
	do
		..
	done

loop. With a lot of tags, the shell overhead there can indeed be pretty
disgusting.

And I was wrong - I thought it would do that git-ls-remote only if the
first time around we noticed that we would need to, but we actually do
it every time we're fetching any new branches.

The sad part is that we really already got the list once, we just never
saved it away (ie "git-fetch-pack" actually _knows_ what the tags at
the other end are, and also knows which tags we already have, so if we
made git-fetch-pack just create that list and save it off, all the
overhead would just go away).

And yes, the shell script loops are really really simple, but some of
them are actually quadratic in the number of refs (O(local*remote)).
If this were a C program, we'd never even care, but with shell, the
thing is slow enough that having even a modest number of tags and refs
is going to make it waste a lot of time in shell scripting.

We already do a lot of the infrastructure for "git fetch" in C - the
remotes parsing etc. is all stuff that "git fetch" used to share with
"git push", but "git push" has been a builtin C program for a while
now. I suspect we should just do the same to "git fetch", which would
make all these issues just totally go away.

		Linus
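[Editorial note: the quadratic cost is easy to see in miniature. Below is a toy sketch, not git's actual loop, of the O(local*remote) pattern Linus describes: for every remote ref, the shell rescans the entire local list. All ref names are made up for illustration.]

```shell
#!/bin/sh
# Toy model of O(local*remote) ref matching in shell (illustrative only).
remote_refs="refs/tags/v1 refs/tags/v2 refs/tags/v3"
local_refs="refs/tags/v2"

for r in $remote_refs; do
	have=no
	# Inner scan over every local ref, repeated for every remote ref.
	for l in $local_refs; do
		[ "$r" = "$l" ] && have=yes
	done
	# Only refs we don't have yet would need fetching.
	[ "$have" = no ] && echo "fetch $r"
done
# Prints: fetch refs/tags/v1
#         fetch refs/tags/v3
```

With hundreds of tags on each side, the same pattern performs tens of thousands of comparisons; a single sorted merge pass (or doing it in C, as suggested) removes the quadratic factor.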
* Re: git-fetch per-repository speed issues
From: Junio C Hamano @ 2006-07-04 6:21 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> Looking closer, I suspect a big part of it is the
>
> 	git-ls-remote $upload_pack --tags "$remote" |
> 	sed -ne 's|^\([0-9a-f]*\)[ 	]\(refs/tags/.*\)^{}$|\1 \2|p' |
> 	while read sha1 name
> 	do
> 		..
> 	done
>
> loop.

Yes indeed. Maybe we can do this loop in Perl.

Doing the whole thing in C is another option, but it would be somewhat
painful unless we can deprecate all transports but the git native
protocols. On the other hand, 5 seconds may not matter that much in
practice.
* Re: git-fetch per-repository speed issues
  2006-07-04  4:19       ` Linus Torvalds
  2006-07-04  5:05         ` Keith Packard
@ 2006-07-04  5:29         ` Keith Packard
  2006-07-04  5:53           ` Linus Torvalds
  1 sibling, 1 reply; 30+ messages in thread
From: Keith Packard @ 2006-07-04  5:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: keithp, Git Mailing List

[-- Attachment #1: Type: text/plain, Size: 837 bytes --]

On Mon, 2006-07-03 at 21:19 -0700, Linus Torvalds wrote:

> Well, that's still strange. What takes 4.2 seconds then?

$ strace -e trace=execve -f git-fetch 2>&1 |
  grep execve | sed -e 's/^.*execve("//' -e 's/".*$//' |
  sort | uniq -c | sort -n
      1 /bin/rm
      1 /home/keithp/bin/git
      1 /home/keithp/bin/git-fetch
      1 /home/keithp/bin/git-fetch-pack
      1 /home/keithp/bin/git-ls-remote
      1 /home/keithp/bin/git-peek-remote
      1 /usr/bin/sort
      3 /bin/sed
      4 /home/keithp/bin/git-repo-config
     30 /bin/mkdir
     30 /home/keithp/bin/git-cat-file
     30 /home/keithp/bin/git-check-ref-format
     30 /home/keithp/bin/git-merge-base
     30 /usr/bin/dirname
     64 /home/keithp/bin/git-rev-parse
    361 /usr/bin/expr

someone sure likes 'expr'...

-- 
keith.packard@intel.com

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: git-fetch per-repository speed issues
  2006-07-04  5:29         ` Keith Packard
@ 2006-07-04  5:53           ` Linus Torvalds
  0 siblings, 0 replies; 30+ messages in thread
From: Linus Torvalds @ 2006-07-04  5:53 UTC (permalink / raw)
  To: Keith Packard; +Cc: Git Mailing List

On Mon, 3 Jul 2006, Keith Packard wrote:
>
>     361 /usr/bin/expr
>
> someone sure likes 'expr'...

Heh. That's a very Junio thing to do. Junio seems to like

	if expr "z$string" : "z<regexp>" >/dev/null
	then
		..

and I think he explained it as being the way old-fashioned users do it.

		Linus

^ permalink raw reply	[flat|nested] 30+ messages in thread
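[Editorial note: the cost of that idiom is a fork+exec per test — the 361 execve calls in the strace output above. The shell's `case` builtin does the same anchored prefix match without spawning anything; a minimal sketch, not a claim about which patch git actually applied:]

```shell
string="refs/tags/v1.0"

# The idiom quoted above: expr(1) is a separate program, so every
# match test costs a fork+exec.
if expr "z$string" : "zrefs/tags/" >/dev/null
then
	echo "expr says: tag"
fi

# The fork-free equivalent: case is a shell builtin, so hundreds of
# these tests spawn no extra processes.
case "$string" in
refs/tags/*)
	echo "case says: tag" ;;
esac
```

Running the block prints both "expr says: tag" and "case says: tag"; the difference only shows up in process accounting, which is exactly what the strace histogram measured.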
* Re: git-fetch per-repository speed issues
  2006-07-03 18:02 git-fetch per-repository speed issues Keith Packard
  2006-07-03 23:14 ` Linus Torvalds
@ 2006-07-04 15:42 ` Jakub Narebski
  2006-07-04 16:30   ` Thomas Glanzmann
  2006-07-04 17:45   ` Junio C Hamano
  2006-07-06 23:36 ` David Woodhouse
  2 siblings, 2 replies; 30+ messages in thread
From: Jakub Narebski @ 2006-07-04 15:42 UTC (permalink / raw)
  To: git

I wonder if the problem detected here is also responsible for the
results of Jeremy Blosser's benchmark comparing git with Mercurial
  http://lists.ibiblio.org/pipermail/sm-discuss/2006-May/014586.html
where git wins for clone, status and log, but is slower for pull.

See the summary at
  http://git.or.cz/gitwiki/GitBenchmarks#head-85df1bb7f019c4c504e34cde43450ef69349882f

-- 
Jakub Narebski

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: git-fetch per-repository speed issues
  2006-07-04 15:42 ` Jakub Narebski
@ 2006-07-04 16:30   ` Thomas Glanzmann
  2006-07-04 17:45   ` Junio C Hamano
  1 sibling, 0 replies; 30+ messages in thread
From: Thomas Glanzmann @ 2006-07-04 16:30 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

Hello,

> See summary at
> http://git.or.cz/gitwiki/GitBenchmarks#head-85df1bb7f019c4c504e34cde43450ef69349882f

thank you for clarifying! I finally understand why the Solaris folks
prefer hg over git: it is dog slow - so it fits the general philosophy
behind Solaris.

		Thomas

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: git-fetch per-repository speed issues
  2006-07-04 15:42 ` Jakub Narebski
  2006-07-04 16:30   ` Thomas Glanzmann
@ 2006-07-04 17:45   ` Junio C Hamano
  2006-07-04 19:22     ` Linus Torvalds
  1 sibling, 1 reply; 30+ messages in thread
From: Junio C Hamano @ 2006-07-04 17:45 UTC (permalink / raw)
  To: git; +Cc: jnareb

Jakub Narebski <jnareb@gmail.com> writes:

> I wonder if the problem detected here is also responsible with results
> of Jeremy Blosser benchmark comparing git with Mercurial
> http://lists.ibiblio.org/pipermail/sm-discuss/2006-May/014586.html
> where git wins for clone, status and log, but is slower for pull.

I had an impression, though the report does not talk about this
specific detail, that the extra time we are paying is because
the "git pull" test is done without suppressing the final
diffstat phase.

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: git-fetch per-repository speed issues
  2006-07-04 17:45   ` Junio C Hamano
@ 2006-07-04 19:22     ` Linus Torvalds
  2006-07-04 21:05       ` Junio C Hamano
  0 siblings, 1 reply; 30+ messages in thread
From: Linus Torvalds @ 2006-07-04 19:22 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, jnareb

On Tue, 4 Jul 2006, Junio C Hamano wrote:
>
> I had an impression, though the report does not talk about this
> specific detail, that the extra time we are paying is because
> the "git pull" test is done without suppressing the final
> diffstat phase.

I'm pretty sure that was the reason for the particular hg issue.
Looking at the "clone" times, the problem is almost certainly not the
actual pulling.

The diffstat generation is often the largest part of a git merge. It's
gotten cheaper since the hg benchmarks were done (I think they were
done back before the integrated diff generation, so they also had the
overhead of executing a lot of external GNU diff processes), but it's
still not "cheap".

But I have to say that the diffstat at least for me is absolutely
invaluable.

		Linus

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: git-fetch per-repository speed issues
  2006-07-04 19:22     ` Linus Torvalds
@ 2006-07-04 21:05       ` Junio C Hamano
  0 siblings, 0 replies; 30+ messages in thread
From: Junio C Hamano @ 2006-07-04 21:05 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> But I have to say that the diffstat at least for me is absolutely
> invaluable.

Oh, I absolutely agree with that, and anybody who suggests turning it
off by default needs a very good argument to convince me.

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: git-fetch per-repository speed issues
  2006-07-03 18:02 git-fetch per-repository speed issues Keith Packard
  2006-07-03 23:14 ` Linus Torvalds
  2006-07-04 15:42 ` Jakub Narebski
@ 2006-07-06 23:36 ` David Woodhouse
  2 siblings, 0 replies; 30+ messages in thread
From: David Woodhouse @ 2006-07-06 23:36 UTC (permalink / raw)
  To: Keith Packard; +Cc: Git Mailing List

On Mon, 2006-07-03 at 11:02 -0700, Keith Packard wrote:
> just uses ssh for everything. This slows down the connection process
> by several seconds.

Only if you forgot to use the 'control socket' support, which lets you
make a _single_ authenticated connection and re-use it for multiple
sessions.

http://david.woodhou.se/openssh-control.html has a couple of
improvements, but the basics are usable in upstream openssh.

-- 
dwmw2

^ permalink raw reply	[flat|nested] 30+ messages in thread
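[Editorial note: the control-socket support David refers to is configured in `~/.ssh/config` with the stock OpenSSH `ControlMaster`/`ControlPath` options; the host name below is a placeholder, not an X.org server:]

```
Host git.example.org
	ControlMaster auto
	ControlPath ~/.ssh/control-%r@%h:%p
```

With `ControlMaster auto`, the first ssh connection to that host becomes the master and authenticates normally; every later connection — including each of the 307 per-repository git-fetch runs — multiplexes over the existing socket and skips the multi-second login handshake entirely.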
end of thread, other threads:[~2006-07-06 23:36 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-07-03 18:02 git-fetch per-repository speed issues Keith Packard
2006-07-03 23:14 ` Linus Torvalds
2006-07-04 0:21 ` Jeff King
2006-07-04 1:22 ` Ryan Anderson
2006-07-04 1:44 ` Jeff King
2006-07-04 1:55 ` Ryan Anderson
2006-07-04 3:07 ` Linus Torvalds
2006-07-05 6:47 ` Jeff King
2006-07-05 16:40 ` Linus Torvalds
2006-07-04 6:44 ` Jakub Narebski
[not found] ` <1151973438.4723.70.camel@neko.keithp.com>
2006-07-04 3:21 ` Linus Torvalds
2006-07-04 3:30 ` Junio C Hamano
2006-07-04 3:40 ` Linus Torvalds
2006-07-04 4:30 ` Keith Packard
2006-07-04 11:10 ` Andreas Ericsson
2006-07-04 11:18 ` Matthias Kestenholz
2006-07-04 12:05 ` Andreas Ericsson
2006-07-04 4:02 ` Keith Packard
2006-07-04 4:19 ` Linus Torvalds
2006-07-04 5:05 ` Keith Packard
2006-07-04 5:36 ` Linus Torvalds
2006-07-04 6:21 ` Junio C Hamano
2006-07-04 5:29 ` Keith Packard
2006-07-04 5:53 ` Linus Torvalds
2006-07-04 15:42 ` Jakub Narebski
2006-07-04 16:30 ` Thomas Glanzmann
2006-07-04 17:45 ` Junio C Hamano
2006-07-04 19:22 ` Linus Torvalds
2006-07-04 21:05 ` Junio C Hamano
2006-07-06 23:36 ` David Woodhouse
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).