From: Jan Hudec
Subject: Re: git over webdav: what can I do for improving http-push ?
Date: Thu, 3 Jan 2008 22:15:21 +0100
Message-ID: <20080103211521.GA4225@efreet.light.src>
References: <477822C3.9060002@gbarbier.org> <47791F90.8030302@pobox.com>
 <20080101113301.GC9214@efreet.light.src> <477A26FD.7020408@gbarbier.org>
 <20080101202352.GA4295@efreet.light.src> <477D3401.2010005@gbarbier.org>
In-Reply-To: <477D3401.2010005@gbarbier.org>
To: Grégoire Barbier
Cc: Jakub Narebski, git@vger.kernel.org

On Thu, Jan 03, 2008 at 20:14:09 +0100, Grégoire Barbier wrote:
> Jan Hudec wrote:
[...]
>> It is what bzr and mercurial do and I think it would be quite a good way
>> to go for cases like this.
> Ok, I will have to look at bzr and mercurial...

Bzr is quite far from git, design-wise, I fear. Mercurial might be a little
more interesting to study, but being written in Python and internally
somewhat file-oriented, I don't think it would be of much use either.

You should start with upload, leaving the download direction to the dumb
machinery git currently uses.

[...]
>> I have also thought about optimizing download using CGI, but then I
>> thought that maybe there is a way to statically generate packs so that if
>> the client wants n revisions, the number of revisions it downloads is
>> O(n) and the number of packs it gets them from (and thus the number of
>> round-trips) is O(log(n)), assuming the client always wants everything up
>> to the tip, of course. Now this is trivial with linear history (pack the
>> first half, then half of what's left, etc., which gives a logarithmic
>> number of packs while you always download at most twice as much as you
>> need), but it would be nice if somebody found a way (even one that
>> satisfies the conditions on average only) to do this with non-linear
>> history. It would be a very nice improvement to the http download -- the
>> native git server optimizes the amount of data transferred very well, but
>> at the cost of quite heavy CPU load on the server.
> Well... frankly I don't think I'm capable of such things.
> Writing a walker over webdav or a simple cgi is a thing I can do (I
> think), but I'm not tough enough (or not ready to take the time needed)
> to have a look at the internals of packing revisions (whereas I can
> imagine it would mean that "my" walker would be suitable only for small
> projects in terms of code amount and commit frequency).

Well, it does not depend on the walker -- the walker is quite simple and
already written anyway.

> I had a quick look at bzr and hg, and it seems that bzr uses the easy way
> (walker, no optimizations)

That's not quite true -- bzr has both a dumb method (a walker over plain
HTTP) and a smart one (CGI). But their CGI is really just tunnelling their
custom protocol over HTTP, and that protocol will not be anywhere near what
we want for git because of the vastly different design of the storage.

> and hg a cgi (therefore, maybe optimizations).
> By quick look I mean that I sniffed the HTTP queries on the network during
> a clone. I need to look harder...

Yes, mercurial uses a CGI. But I am not sure how similar their approach is
to anything that would make sense for git, so looking at the details might
or might not be useful.

> BTW I never looked at the git:// protocol. Do you think that by tunneling
> the git protocol in a cgi (hg uses URLs of the form
> "/mycgi?cmd=mycommand&...", therefore I think "tunnel" is not a bad
> word...) the performance would be good?

It would be pretty hard to tunnel it, and it would lose all its nice
properties. The git protocol, for pull, basically works like this:

- the server sends a list of its refs
- the client tells the server which ones it wants
- the client starts listing the revisions it has, newest to oldest
- the server tells the client whenever it finds a common ancestor with one
  of the desired heads
- the client restarts the listing from the next ref
- the server starts sending the data when the client runs out of refs to
  list

The main point about the protocol is that the client lists its revisions as
fast as it can and the server stops it when it sees a revision it knows.
Therefore there will only be one round-trip to discover each common
ancestor.
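To make the shape of that exchange concrete, here is a toy, in-memory model
in Python. All the names and the linear, single-ref history are my own
simplifications -- this is nothing like the real wire format, and it also
ignores the pipelining that lets the real client stream its "have" lines
without waiting for an answer:

    # Toy model of the fetch negotiation: the client walks its history
    # newest to oldest until the server recognizes a common ancestor.
    class Server:
        def __init__(self, history):
            self.history = history          # commit ids, oldest first

        def advertise(self):
            return self.history[-1]         # tip of the single "ref"

        def ack(self, have):
            return have in self.history     # common ancestor found?

        def pack(self, want, common):
            # everything after the common ancestor, up to the wanted tip
            start = self.history.index(common) + 1 if common else 0
            return self.history[start:self.history.index(want) + 1]

    def fetch(server, client_history):
        want = server.advertise()           # server lists its refs
        common = None
        for have in reversed(client_history):
            if server.ack(have):            # server interrupts the listing
                common = have
                break
        return server.pack(want, common)

    server = Server(["a", "b", "c", "d", "e"])
    print(fetch(server, ["a", "b", "c"]))   # ['d', 'e']

In the real protocol the ack arrives while the client is still listing,
which is exactly the kind of statefulness that a plain HTTP request/response
cycle cannot give you.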
However, you can't do this over HTTP, because the response won't be started
until the whole request is received. You could send a lot of smallish
requests and get quick, often empty, responses to them, but that would waste
a lot of bandwidth (because of the HTTP overhead) and lose much of the speed
anyway. Also, HTTP is stateless while this negotiation is inherently
stateful, so you would have to work around that somehow too. Therefore a
different approach is preferable on HTTP.

Now, to keep it stateless, I thought that:

- the client would first ask for the list of refs
- the client would then ask for a pack containing the first ref
- the server would respond with a pack containing just that commit plus all
  objects that are not referenced by any of its parents
- if the client does not have its parent, it would ask for a pack containing
  that
- since it is the second request, the server would pack 2 revisions (with
  the necessary objects) this time
- if the client still does not have all parents, it would ask for a pack
  again, receiving 4 revisions this time (then 8, 16, etc.)

This would guarantee that when you want n revisions, you make at most
log2(n) requests and get at most 2*n revisions (well, the requests are made
for each ref separately, so it's more like m*log2(n) where m is the number
of refs, but still). Additionally, it would be stateless, because the client
would simply say "I want commit abcdef01 and this is my 3rd request" and the
server would provide that commit and 3 of its parents (i.e. 2^2 commits).
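A minimal sketch of that doubling scheme, under the same toy assumptions as
above (linear history, a single ref, invented names):

    # Each request for the same walk doubles the pack size:
    # request 1 -> 1 commit, request 2 -> 2, request 3 -> 4, ...
    def serve_pack(history, tip, request_no):
        i = history.index(tip)
        count = 2 ** (request_no - 1)
        return history[max(0, i - count + 1):i + 1]  # tip plus parents

    def fetch_all(history, have):
        tip, got, requests = history[-1], [], 0
        while tip not in have:
            requests += 1
            pack = serve_pack(history, tip, requests)
            got = pack + got
            if pack[0] == history[0]:       # reached the root commit
                break
            # next request starts at the oldest fetched commit's parent
            tip = history[history.index(pack[0]) - 1]
        return got, requests

    history = ["c%d" % i for i in range(100)]
    have = set(history[:50])                # client already has half
    got, requests = fetch_all(history, have)
    print(len(got), requests)               # 63 commits in 6 requests

So for the 50 missing revisions the client makes 6 requests (about log2(50))
and downloads 63 revisions, within the promised factor of two.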
Now, generating the packs costs CPU, and servers like git.kernel.org already
have quite a high CPU load. But in this scheme, all clients would get the
same packs most of the time (unlike with the native git protocol, where the
client always gets a single pack with exactly what it needs). So the idea
struck me that the packs could simply be generated statically and fetched
via the existing dumb protocol. That would keep the efficiency while saving
a lot of CPU power, which allows one to serve a quite busy git repository
from a limited (and therefore cheap) virtual machine or even (yes, I saw
such an idea on the list) to serve a repository from an NSLU2.

To implement a packing policy that creates such packs, you don't actually
need to touch any C -- because git-repack is still in shell -- and you don't
really need to touch any internals, because you only need to change how the
pack-objects command is called and leave all the dirty details to that
command.

I would personally not re-split the already generated packs, only find some
algorithm for choosing when packs are deep enough in history that they
should be merged together. It also might not make sense to ever pack
unpacked objects into more than one pack -- a dumb-HTTP server might have a
requirement of running this kind of repack after every push, and clients
will rarely have only part of a single push to the server.

> Maybe it's not that hard to write a performant HTTP/CGI protocol for Git
> if it's based upon existing code such as the git protocol.

For push it might, or might not, be easy. But in the worst case you should
be able to calculate a pack to upload locally (fetching from the server
beforehand if necessary), upload that pack (or bundle), and update all the
refs.

For pull it certainly won't be. You might be able to reimplement the
common-ancestor discovery using some kind of gradually increasing ref list
and then generate a bundle on the server, but optimizing the dumb protocol
seems more useful to me. As I said, generating the packs only requires
devising a way of selecting which objects should go together; git
pack-objects will take care of the dirty details of generating the packs,
and git update-server-info will take care of the dirty details of presenting
the list of packs to the client.

-- 
Jan 'Bulb' Hudec