git.vger.kernel.org archive mirror
* Cloning speed comparison, round II
@ 2005-11-12 15:53 Petr Baudis
  2005-11-12 19:40 ` Linus Torvalds
  0 siblings, 1 reply; 8+ messages in thread
From: Petr Baudis @ 2005-11-12 15:53 UTC
  To: git

  Hello,

  here comes another round of the cloning speed comparison. For the
previous round, see

	6395   F Aug 13 Petr Baudis     ( 2.6K) Cloning speed comparison
	Message-ID: <20050813015402.GC20812@pasky.ji.cz>

  What do we have here:

	git.git: 10267 objects, mostly packed
	cogito.git: 99903903 objects, no packs

  (But please note that compared to the previous round, there are
probably more unpacked objects lying around in the git.git repository
- actually roughly 1000.)

  Latest GIT, latest Cogito, performed by cg-clone to the official
kernel.org hostnames for each transport. 'real' medians from several
tries (number of tries usually at least three, based on their variance):


             rsync   git+ssh(*)   git(**)   http

git.git      0m45s   0m34s        5m30s     4m01s (++)

cogito.git   2m09s   1m54s (+)    4m30s     15m11s (only single run)


(*) git+ssh was to master.kernel.org, which was under significant load
    from some seemingly runaway gzip process, so that slowed things
    down.

(**) The unpacking was slooooooooow yet the load was quite low. This
     should be investigated, the native git fetching is much slower
     than even HTTP.

(+) There was unusually high variance here - the results range from
    0m27s to 2m30s. In all other cases the results for all runs were
    very similar.

(++) Messy - it stalls at 72 objects, then does seemingly nothing for
     a long while, then requests a pack. Huh? Also, I saw the

	Getting alternates list for http://www.kernel.org/pub/scm/git/git.git/

     line about 100 times for the 72 objects (but never after the pack
     is fetched, at least).
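
A minimal sketch of how such 'real' medians could be collected, in
Python. This is not the script used for the table above; the function
names and the default of three tries are assumptions for illustration,
and the command passed in would have to clone into a fresh directory on
every run.

```python
import statistics
import subprocess
import time


def time_command(argv):
    """Run argv once and return its wall-clock ('real') duration in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    return time.perf_counter() - start


def median_real(argv, tries=3):
    """Median wall-clock time over several runs of the same command."""
    return statistics.median(time_command(argv) for _ in range(tries))


def format_real(seconds):
    """Format seconds the way the table does, e.g. 330 -> '5m30s'."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes}m{secs:02d}s"
```

The median is used rather than the mean so that a single outlier run
(such as the 0m27s to 2m30s spread noted in (+)) does not dominate the
reported figure.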


  Conclusion: The GIT protocol has some major problem which is dragging
it down to ridiculous slowness (isn't it supposed to be as fast as git+ssh
or even slightly faster?). rsync unfortunately still provides the fastest
anonymous access. HTTP is better than it was, but especially when fetching
packs, there is room for improvement - the mysterious stalls should be
eliminated, and it shouldn't get bogged down by repeated silly requests
for alternates lists etc.

  Compared to the last round, HTTP is much faster for unpacked object
stores (thanks to parallel fetching), but much slower for packed object
stores - however that can be caused by more objects lying unpacked now
in the git.git repository. git+ssh (back then clone-pack:ssh) is as
fast as usual - the 0m27s case suggests that it would be much faster
for Cogito if master.kernel.org weren't under load.

  GIT protocol is the great disappointment. And I kind of hoped that I could
push the rsync deprecation further because of the GIT protocol.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
VI has two modes: the one in which it beeps and the one in which
it doesn't.


* Re: Cloning speed comparison, round II
  2005-11-12 15:53 Cloning speed comparison, round II Petr Baudis
@ 2005-11-12 19:40 ` Linus Torvalds
  2005-11-12 19:46   ` Petr Baudis
  0 siblings, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2005-11-12 19:40 UTC
  To: Petr Baudis; +Cc: git



On Sat, 12 Nov 2005, Petr Baudis wrote:
>
>              rsync   git+ssh(*)   git(**)   http
> 
> git.git      0m45s   0m34s        5m30s     4m01s (++)
> 
> cogito.git   2m09s   1m54s (+)    4m30s     15m11s (only single run)
> 
> 
> (*) git+ssh was to master.kernel.org, which was under significant load
>     from some seemingly runaway gzip process, so that slowed things
>     down.
> 
> (**) The unpacking was slooooooooow yet the load was quite low. This
>      should be investigated, the native git fetching is much slower
>      than even HTTP.

git:// and git+ssh:// should be the exact same protocol, the main 
difference in this case being the server they go to.

In the case of (**), the unpacking itself is fast, but it's done as the 
stream of data comes in, so it will appear slow if the server at the other 
end is slow (or the network to that server is slow).

So I think the difference between your git+ssh and git tests is purely 
due to the fact that master.kernel.org sees a lot less load (both in CPU 
and in networking) than the public sites, and the time differences have 
nothing to do with the protocol per se.

I suspect master.kernel.org also has a beefier machine with more memory 
(but even if that's not the case, it's simply true that the public 
machines obviously do mirroring to a lot of other machines, and run things 
like webgit and just basic serving too).

If anything, git:// as a protocol is theoretically a bit faster, since the 
login procedure is faster and there's no encryption overhead.

			Linus


* Re: Cloning speed comparison, round II
  2005-11-12 19:40 ` Linus Torvalds
@ 2005-11-12 19:46   ` Petr Baudis
  2005-11-12 20:13     ` Linus Torvalds
  0 siblings, 1 reply; 8+ messages in thread
From: Petr Baudis @ 2005-11-12 19:46 UTC
  To: Linus Torvalds; +Cc: git

Dear diary, on Sat, Nov 12, 2005 at 08:40:11PM CET, I got a letter
where Linus Torvalds <torvalds@osdl.org> said that...
> 
> 
> On Sat, 12 Nov 2005, Petr Baudis wrote:
> >
> >              rsync   git+ssh(*)   git(**)   http
> > 
> > git.git      0m45s   0m34s        5m30s     4m01s (++)
> > 
> > cogito.git   2m09s   1m54s (+)    4m30s     15m11s (only single run)
> > 
> > 
> > (*) git+ssh was to master.kernel.org, which was under significant load
> >     from some seemingly runaway gzip process, so that slowed things
> >     down.
> > 
> > (**) The unpacking was slooooooooow yet the load was quite low. This
> >      should be investigated, the native git fetching is much slower
> >      than even HTTP.

(BTW, it's obviously not much slower than HTTP - I wrote that after the
first git run, but I screwed up myself at that one.)

> So I think the difference between your git+ssh and git tests is purely 
> due to the fact that master.kernel.org sees a lot less load (both in CPU 
> and in networking) than the public sites, and the time differences have 
> nothing to do with the protocol per se.

Well, at the time of fetching, master.kernel.org with git+ssh had load
about ~3.5 and some wild gzip was eating most of the CPU there. So if
the git protocol still manages to be TEN times slower while rsync goes
full speed from that machine, I would say that this means the git server
requires way too much CPU.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
VI has two modes: the one in which it beeps and the one in which
it doesn't.


* Re: Cloning speed comparison, round II
  2005-11-12 19:46   ` Petr Baudis
@ 2005-11-12 20:13     ` Linus Torvalds
  2005-11-12 20:31       ` Linus Torvalds
  2005-11-12 21:20       ` Petr Baudis
  0 siblings, 2 replies; 8+ messages in thread
From: Linus Torvalds @ 2005-11-12 20:13 UTC
  To: Petr Baudis; +Cc: git



On Sat, 12 Nov 2005, Petr Baudis wrote:

> Dear diary, on Sat, Nov 12, 2005 at 08:40:11PM CET, I got a letter
> where Linus Torvalds <torvalds@osdl.org> said that...
> > 
> > 
> > On Sat, 12 Nov 2005, Petr Baudis wrote:
> > >
> > >              rsync   git+ssh(*)   git(**)   http
> > > 
> > > git.git      0m45s   0m34s        5m30s     4m01s (++)
> > > 
> > > cogito.git   2m09s   1m54s (+)    4m30s     15m11s (only single run)
> 
> Well, at the time of fetching, master.kernel.org with git+ssh had load
> about ~3.5 and some wild gzip was eating most of the CPU there. So if
> the git protocol still manages to be TEN times slower while rsync goes
> full speed from that machine, I would say that this means the git server
> requires way too much CPU.

Look again.

master.kernel.org was _faster_ than rsync using the native git protocol, 
despite being under a load of 3.5.

Look at the numbers: 45 secs for rsync, 34 secs for git protocol to 
master.

Now, I don't know which rsync machine you used (you can rsync both from 
master and from rsync.kernel.org), since you don't say. 

Now, it's unquestionably true that rsync can be faster under many 
circumstances. Most notably when disk IO is really slow, since the native 
git protocol will do a lot more synchronous operations, since it actually 
tests what it is doing.

But I _guarantee_ you that rsync is at least ten times slower than the 
native git protocol in many circumstances. It can't handle repacking 
(which is critical for good server performance).

And in fact it can't handle totally unpacked directories and small updates 
well either (the reason I totally stopped doing rsync was because it took 
minutes to go through the whole list of unpacked objects for a small 
update, while the native protocol would just fetch the needed objects and 
be done with it).
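
As a toy model of the difference described above (Python, not the actual
wire protocol; the graph layout and object names are hypothetical): the
native protocol only has to send what is reachable from the wanted tips
and not already reachable from what the client has, so the work follows
the size of the update rather than the size of the object store.

```python
def reachable(graph, tips):
    """Return every object reachable from the given tips.

    graph maps an object name to the names it refers to
    (a commit to its parents and tree, for example).
    """
    seen = set()
    stack = list(tips)
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        stack.extend(graph.get(name, ()))
    return seen


def objects_to_send(graph, wants, haves):
    """Objects reachable from what the client wants but not from what it has."""
    return reachable(graph, wants) - reachable(graph, haves)
```

With a hypothetical history c1 <- c2 <- c3, a client that already has c2
and asks for c3 is sent only c3 and whatever new objects c3 refers to,
no matter how many older objects exist on the server.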

So sometimes rsync is faster, sometimes the git protocol is faster. But 
the git protocol is _always_ better from a sanity standpoint.

The things you get with the native git protocol:

 - you don't have to trust the other end. If the other end lies about the 
   SHA1's of its objects, rsync will never know. It will just download the 
   thing, and you may have a corrupt database.

   With the git protocol, we just get the objects, and recompute their 
   names. The other end can't lie about what their SHA is.

 - the rsync protocol totally breaks down with multiple branches. It 
   fetches stuff it shouldn't because it doesn't know better.

 - the rsync protocol scales with project size, not with change size. This 
   works well for small projects, where the changes are usually not all 
   that hugely different from the total size of the project, but it really 
   sucks for big projects.

 - the rsync protocol fundamentally cannot handle two differently packed 
   trees well. That doesn't matter if you only track one tree, but it 
   matters _hugely_ for people (like me) who pull from tens of different 
   trees.
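
The first point can be sketched in Python. Git names an object by the
SHA-1 of a short header (type, a space, the payload length in decimal, a
NUL byte) followed by the payload, so a receiver can recompute the name
instead of trusting the sender. The function names here are hypothetical,
and the sketch shows only the naming rule, not pack transfer or
unpacking.

```python
import hashlib


def git_object_name(obj_type, payload):
    """Return the hex SHA-1 name git assigns to an object of this type and content."""
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def verify_received(claimed_name, obj_type, payload):
    """Accept an object only if its recomputed name matches the claimed one."""
    return git_object_name(obj_type, payload) == claimed_name
```

For example, git_object_name("blob", b"") gives
e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, the name of the empty blob. A
plain file copy such as rsync performs no such check, which is the point
being made above.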

So the fact is: rsync is often slower, and _always_ less capable. 

			Linus


* Re: Cloning speed comparison, round II
  2005-11-12 20:13     ` Linus Torvalds
@ 2005-11-12 20:31       ` Linus Torvalds
  2005-11-12 21:30         ` Petr Baudis
  2005-11-12 21:20       ` Petr Baudis
  1 sibling, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2005-11-12 20:31 UTC
  To: Petr Baudis; +Cc: git



On Sat, 12 Nov 2005, Linus Torvalds wrote:
> 
> So the fact is: rsync is often slower, and _always_ less capable. 

Side note: a lot of the rsync problems are non-issues for the initial 
clone. 

That initial clone is in fact the only time I think rsync can be a good 
idea, especially if the server end is repacking regularly.

Then the only downside of cloning using rsync is that it will obviously 
also set up the "origin" branch to be updated using rsync, which is sad if 
there are alternatives.

But it may be a good idea to first clone using rsync, and then edit the 
origin file to change the rsync into a git-native protocol if one is 
available.

		Linus


* Re: Cloning speed comparison, round II
  2005-11-12 20:13     ` Linus Torvalds
  2005-11-12 20:31       ` Linus Torvalds
@ 2005-11-12 21:20       ` Petr Baudis
  2005-11-12 22:03         ` Linus Torvalds
  1 sibling, 1 reply; 8+ messages in thread
From: Petr Baudis @ 2005-11-12 21:20 UTC
  To: Linus Torvalds; +Cc: git

Dear diary, on Sat, Nov 12, 2005 at 09:13:20PM CET, I got a letter
where Linus Torvalds <torvalds@osdl.org> said that...
> On Sat, 12 Nov 2005, Petr Baudis wrote:
> 
> > Dear diary, on Sat, Nov 12, 2005 at 08:40:11PM CET, I got a letter
> > where Linus Torvalds <torvalds@osdl.org> said that...
> > > 
> > > 
> > > On Sat, 12 Nov 2005, Petr Baudis wrote:
> > > >
> > > >              rsync   git+ssh(*)   git(**)   http
> > > > 
> > > > git.git      0m45s   0m34s        5m30s     4m01s (++)
> > > > 
> > > > cogito.git   2m09s   1m54s (+)    4m30s     15m11s (only single run)
> > 
> > Well, at the time of fetching, master.kernel.org with git+ssh had load
> > about ~3.5 and some wild gzip was eating most of the CPU there. So if
> > the git protocol still manages to be TEN times slower while rsync goes
> > full speed from that machine, I would say that this means the git server
> > requires way too much CPU.
> 
> Look again.
> 
> master.kernel.org was _faster_ than rsync using the native git protocol, 
> despite being under a load of 3.5.
> 
> Look at the numbers: 45 secs for rsync, 34 secs for git protocol to 
> master.

That's git+ssh (not git transport protocol - this DID make a difference
at first, I know it's hard to believe) on machine A vs. rsync on
machine B.

> Now, I don't know which rsync machine you used (you can rsync both from 
> master and from rsync.kernel.org), since you don't say. 

rsync.kernel.org ("... official kernel.org hostnames for each transport.")


So I had this lengthy, more detailed benchmark which re-examined git+ssh
vs. git on master.kernel.org (good that git-daemon binds to a
non-privileged port). However, an embarrassing thing turned out - this
seems to be a network problem, actually.

From kernel.org to machine X, it takes 20s, and from machine X to my
home machine, it takes 20s. From kernel.org to my home machine, it takes
5 minutes. This concerns only the git protocols; HTTP and SSH from my
home machine to kernel.org are as fast as usual.  Even stranger, this is
the same for both hera.kernel.org and zeus.kernel.org, while each is on
a totally different network.  Tried many times.  Ugh.

Well, so sorry for the false alarm.

Your totally puzzled,

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
VI has two modes: the one in which it beeps and the one in which
it doesn't.


* Re: Cloning speed comparison, round II
  2005-11-12 20:31       ` Linus Torvalds
@ 2005-11-12 21:30         ` Petr Baudis
  0 siblings, 0 replies; 8+ messages in thread
From: Petr Baudis @ 2005-11-12 21:30 UTC
  To: Linus Torvalds; +Cc: git

Dear diary, on Sat, Nov 12, 2005 at 09:31:32PM CET, I got a letter
where Linus Torvalds <torvalds@osdl.org> said that...
> On Sat, 12 Nov 2005, Linus Torvalds wrote:
> > 
> > So the fact is: rsync is often slower, and _always_ less capable. 
> 
> Side note: a lot of the rsync problems are non-issues for the initial 
> clone. 

But based on my "machine X" tests, git is as fast as git+ssh (which is
as it should be), which means it is even slightly faster than rsync,
even for the initial clone.

> That initial clone is in fact the only time I think rsync can be a good 
> idea, especially if the server end is repacking regularly.

Yes, the only advantage of rsync I can see is that you get it all
packed. Well, I think this only means we are doing something wrong,
and perhaps we should automatically pack as well.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
VI has two modes: the one in which it beeps and the one in which
it doesn't.


* Re: Cloning speed comparison, round II
  2005-11-12 21:20       ` Petr Baudis
@ 2005-11-12 22:03         ` Linus Torvalds
  0 siblings, 0 replies; 8+ messages in thread
From: Linus Torvalds @ 2005-11-12 22:03 UTC
  To: Petr Baudis; +Cc: git



On Sat, 12 Nov 2005, Petr Baudis wrote:
> 
>  From kernel.org to machine X, it takes 20s, and from machine X to my
> home machine, it takes 20s. From kernel.org to my home machine, it takes
> 5 minutes.

That's a twilight zone moment ;)

The pack-file writer is even trying to be good about not doing lots of 
small writes (it should chunk things up into 8kB chunks) so it should 
actually be a nice network app and send full-sized TCP frames from the 
very beginning (and Nagle should mean that it continues to do so even 
around the chunk boundaries, assuming the server is fast enough at 
generating the data).
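
A rough Python sketch of that chunking idea, under assumptions: the
class name and the `raw` sink are hypothetical, and this is not the
pack-file writer's actual code. Small writes are collected in a buffer
and handed on in 8kB pieces, with any remainder sent on flush.

```python
CHUNK_SIZE = 8192  # 8kB, the chunk size mentioned above


class ChunkedWriter:
    """Collect small writes and pass them on in CHUNK_SIZE pieces."""

    def __init__(self, raw):
        # raw is any object with a write(bytes) method, e.g. a socket file
        self.raw = raw
        self.buf = bytearray()

    def write(self, data):
        """Buffer data; emit full chunks as soon as they are available."""
        self.buf += data
        while len(self.buf) >= CHUNK_SIZE:
            self.raw.write(bytes(self.buf[:CHUNK_SIZE]))
            del self.buf[:CHUNK_SIZE]

    def flush(self):
        """Send whatever is left, even if it is shorter than a full chunk."""
        if self.buf:
            self.raw.write(bytes(self.buf))
            self.buf.clear()
```

A thousand 20-byte writes through this wrapper reach the sink as two
8192-byte writes plus one 3616-byte remainder, instead of a thousand
tiny ones.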

But it's entirely possible that one of the paths between the machines has 
some logic that prioritizes known stream types - gives higher priority to 
http/ssh over "unknown" protocols. That's a bad thing to do (the whole 
point of the internet is that the smarts are in the end-points, not in the 
network), but with so many of the packets whizzing by being 
virus-generated crap, some places apparently do things like that.

			Linus
