* Smart fetch via HTTP?
@ 2007-05-15 20:10 Jan Hudec
2007-05-15 22:30 ` A Large Angry SCM
` (3 more replies)
0 siblings, 4 replies; 47+ messages in thread
From: Jan Hudec @ 2007-05-15 20:10 UTC (permalink / raw)
To: git
[-- Attachment #1: Type: text/plain, Size: 1635 bytes --]
Hello,
Did anyone already think about fetching over HTTP working similarly to the
native git protocol?
That is, rather than reading the raw content of the repository, there would be
a CGI script (it could be integrated into gitweb) that would negotiate what
the client needs and then generate and send a single pack containing it.
Mercurial and bzr both have this option. It would IMO have three benefits:
- Fast access for people behind paranoid firewalls that only let http and
https through (you can tunnel anything, but only to port 443).
- Can be run on a shared machine. If you have web space on a machine shared
by many people, you can set up your own gitweb, but you cannot/are not
allowed to start your own network server for the native git protocol.
- Fewer things to set up. If you are setting up gitweb anyway, you would not
need to set up an additional service to provide fetch access.
Then the question is how to implement it. The current protocol is stateful on
both sides, but the stateless nature of HTTP more or less requires the
protocol to be stateless on the server.
I think it would be possible to use basically the same protocol as now, but
make it stateless for the server. That is, the server first sends its heads,
and then the client repeatedly sends all its wants and some haves until the
server acks all of them and sends the pack.
Alternatively, I am thinking about using Bloom filters (somebody came up with
such an idea on the bzr list when I still followed it). They might be useful,
as over HTTP we need to send as many haves as possible in one go.
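To illustrate the Bloom filter idea, here is a toy sketch (purely hypothetical,
nothing like this exists in git): the client hashes every commit it has into a
fixed-size bit array and POSTs that array instead of thousands of individual
have lines. Note that a false positive makes the server believe the client has
an object it does not, so a real design would need a verification round or
would treat the filter only as a hint.

# Toy sketch of the Bloom-filter idea above -- hypothetical, not part of git.
# The client squeezes all its "haves" into a small bit array so one HTTP
# request can describe thousands of commits.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha1(b"%d:" % i + item).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

haves = BloomFilter()
haves.add(b"3172a8f1096a0c1391575af999aa0d3d793a4e75")  # a made-up commit id
# the client would then POST bytes(haves.bits) as its compressed have-list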
--
Jan 'Bulb' Hudec <bulb@ucw.cz>
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: Smart fetch via HTTP? 2007-05-15 20:10 Smart fetch via HTTP? Jan Hudec @ 2007-05-15 22:30 ` A Large Angry SCM 2007-05-15 23:29 ` Shawn O. Pearce ` (2 subsequent siblings) 3 siblings, 0 replies; 47+ messages in thread From: A Large Angry SCM @ 2007-05-15 22:30 UTC (permalink / raw) To: Jan Hudec; +Cc: git Jan Hudec wrote: > Hello, > > Did anyone already think about fetching over HTTP working similarly to the > native git protocol? > > That is rather than reading the raw content of the repository, there would be > a CGI script (could be integrated to gitweb), that would negotiate what the > client needs and then generate and send a single pack with it. > > Mercurial and bzr both have this option. It would IMO have three benefits: > - Fast access for people behind paranoid firewalls, that only let http and > https (you can tunel anything through, but only to port 443) through. > - Can be run on shared machine. If you have web space on machine shared > by many people, you can set up your own gitweb, but cannot/are not allowed > to start your own network server for git native protocol. > - Less things to set up. If you are setting up gitweb anyway, you'd not need > to set up additional thing for providing fetch access. > > Than a question is how to implement it. The current protocol is stateful on > both sides, but the stateless nature of HTTP more or less requires the > protocol to be stateless on the server. > > I think it would be possible to use basically the same protocol as now, but > make it stateless for server. That is server first sends it's heads and than > client repeatedly sends all it's wants and some haves until the server acks > all of them and sends the pack. > > Alternatively I am thinking about using Bloom filters (somebody came with > such idea on the bzr list when I still followed it). It might be useful, as > over HTTP we need to send as many haves as possible in one go. > Bundles? Client POSTs it's ref set; server uses the ref set to generate and return the bundle. Push over http(s) could work the same... ^ permalink raw reply [flat|nested] 47+ messages in thread
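A minimal sketch of what such a bundle-serving CGI could look like
(illustrative only; the "want"/"have" request lines and the repository path
are assumptions made up here, not an interface git defines):

#!/usr/bin/env python
# Hypothetical CGI sketch: read the client's refs from the POST body, build a
# bundle that excludes what the client already has, and stream it back.
import os, shutil, subprocess, sys, tempfile

GIT_DIR = "/srv/git/project.git"            # example repository location

def main():
    length = int(os.environ.get("CONTENT_LENGTH") or 0)
    wants, haves = [], []
    for line in sys.stdin.read(length).splitlines():
        kind, _, value = line.partition(" ")
        if kind == "want":
            wants.append(value)             # e.g. "refs/heads/master"
        elif kind == "have":
            haves.append(value)             # a commit id the client has
    tmpdir = tempfile.mkdtemp()
    try:
        bundle = os.path.join(tmpdir, "out.bundle")
        revs = wants + (["--not"] + haves if haves else [])
        subprocess.check_call(["git", "--git-dir", GIT_DIR,
                               "bundle", "create", bundle] + revs)
        sys.stdout.write("Content-Type: application/octet-stream\r\n\r\n")
        sys.stdout.flush()
        with open(bundle, "rb") as f:
            shutil.copyfileobj(f, sys.stdout.buffer)
    finally:
        shutil.rmtree(tmpdir)

if __name__ == "__main__":
    main()

The client would save the response and run git fetch against the bundle file
to import the pack and update its tracking refs.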
* Re: Smart fetch via HTTP? 2007-05-15 20:10 Smart fetch via HTTP? Jan Hudec 2007-05-15 22:30 ` A Large Angry SCM @ 2007-05-15 23:29 ` Shawn O. Pearce 2007-05-16 0:38 ` Junio C Hamano 2007-05-16 5:25 ` Martin Langhoff 2007-05-17 12:40 ` Petr Baudis 3 siblings, 1 reply; 47+ messages in thread From: Shawn O. Pearce @ 2007-05-15 23:29 UTC (permalink / raw) To: Jan Hudec; +Cc: git Jan Hudec <bulb@ucw.cz> wrote: > Did anyone already think about fetching over HTTP working similarly to the > native git protocol? No work has been done on this (that I know of) but I've discussed it to some extent with Simon 'corecode' Schubert on #git, and I think he also brought it up on the mailing list not too long after. I've certainly thought about adding some sort of pack-objects frontend into gitweb.cgi for this exact purpose. It is really quite easy, except for the negotation of what the client has. ;-) > Than a question is how to implement it. The current protocol is stateful on > both sides, but the stateless nature of HTTP more or less requires the > protocol to be stateless on the server. > > I think it would be possible to use basically the same protocol as now, but > make it stateless for server. That is server first sends it's heads and than > client repeatedly sends all it's wants and some haves until the server acks > all of them and sends the pack. I think Simon was talking about doubling the number of haves the client sends in each request. So the client POSTs initially all of its current refs; then current refs and their parents; then 4 commits back, then 8, etc. The server replies to each POST request with either a "send more please" or the packfile. -- Shawn. ^ permalink raw reply [flat|nested] 47+ messages in thread
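Sketched from the client side, that widening negotiation could look roughly
like this (the URL, request body and "more"/pack reply format are invented for
the example; git defines no such HTTP interface):

# Client-side sketch of the "double the haves each round" idea above.
import subprocess, urllib.request

def resolve(repo, spec):
    """Return the commit id for e.g. 'master~8', or None past the root."""
    out = subprocess.run(["git", "--git-dir", repo, "rev-parse",
                          "--verify", "--quiet", spec],
                         capture_output=True, text=True).stdout.strip()
    return out or None

def negotiate(repo, url, refs):
    rounds = 0
    while True:
        haves = set()
        for ref in refs:
            steps = [0] + [2 ** i for i in range(rounds)]   # 0,1,2,4,8,...
            for back in steps:
                commit = resolve(repo, "%s~%d" % (ref, back))
                if commit:
                    haves.add(commit)
        body = "\n".join(sorted(haves)).encode()
        with urllib.request.urlopen(url, data=body) as resp:
            reply = resp.read()
        if not reply.startswith(b"more"):    # server decided it has enough
            return reply                     # ...and sent the packfile
        rounds += 1                          # widen the window, try again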
* Re: Smart fetch via HTTP? 2007-05-15 23:29 ` Shawn O. Pearce @ 2007-05-16 0:38 ` Junio C Hamano 0 siblings, 0 replies; 47+ messages in thread From: Junio C Hamano @ 2007-05-16 0:38 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Jan Hudec, git "Shawn O. Pearce" <spearce@spearce.org> writes: > Jan Hudec <bulb@ucw.cz> wrote: >> Did anyone already think about fetching over HTTP working similarly to the >> native git protocol? > > No work has been done on this (that I know of) but I've discussed > it to some extent with Simon 'corecode' Schubert on #git, and I > think he also brought it up on the mailing list not too long after. > > I've certainly thought about adding some sort of pack-objects > frontend into gitweb.cgi for this exact purpose. It is really > quite easy, except for the negotation of what the client has. ;-) > >> Than a question is how to implement it. The current protocol is stateful on >> both sides, but the stateless nature of HTTP more or less requires the >> protocol to be stateless on the server. >> >> I think it would be possible to use basically the same protocol as now, but >> make it stateless for server. That is server first sends it's heads and than >> client repeatedly sends all it's wants and some haves until the server acks >> all of them and sends the pack. > > I think Simon was talking about doubling the number of haves the > client sends in each request. So the client POSTs initially all > of its current refs; then current refs and their parents; then 4 > commits back, then 8, etc. The server replies to each POST request > with either a "send more please" or the packfile. I kinda' like the bundle suggestion ;-) ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP? 2007-05-15 20:10 Smart fetch via HTTP? Jan Hudec 2007-05-15 22:30 ` A Large Angry SCM 2007-05-15 23:29 ` Shawn O. Pearce @ 2007-05-16 5:25 ` Martin Langhoff 2007-05-16 11:33 ` Johannes Schindelin 2007-05-17 12:40 ` Petr Baudis 3 siblings, 1 reply; 47+ messages in thread From: Martin Langhoff @ 2007-05-16 5:25 UTC (permalink / raw) To: Jan Hudec; +Cc: git On 5/16/07, Jan Hudec <bulb@ucw.cz> wrote: > Did anyone already think about fetching over HTTP working similarly to the > native git protocol? Do the indexes have enough info to use them with http ranges? It'd be chunkier than a smart protocol, but it'd still work with dumb servers. cheers, m ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP? 2007-05-16 5:25 ` Martin Langhoff @ 2007-05-16 11:33 ` Johannes Schindelin 2007-05-16 21:26 ` Martin Langhoff 0 siblings, 1 reply; 47+ messages in thread From: Johannes Schindelin @ 2007-05-16 11:33 UTC (permalink / raw) To: Martin Langhoff; +Cc: Jan Hudec, git Hi, On Wed, 16 May 2007, Martin Langhoff wrote: > On 5/16/07, Jan Hudec <bulb@ucw.cz> wrote: > > Did anyone already think about fetching over HTTP working similarly to the > > native git protocol? > > Do the indexes have enough info to use them with http ranges? It'd be > chunkier than a smart protocol, but it'd still work with dumb servers. It would not be really performant, would it? Besides, not all Web servers speak HTTP/1.1... Ciao, Dscho ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP? 2007-05-16 11:33 ` Johannes Schindelin @ 2007-05-16 21:26 ` Martin Langhoff 2007-05-16 21:54 ` Jakub Narebski 2007-05-17 0:52 ` Johannes Schindelin 0 siblings, 2 replies; 47+ messages in thread From: Martin Langhoff @ 2007-05-16 21:26 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Jan Hudec, git On 5/16/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > On Wed, 16 May 2007, Martin Langhoff wrote: > > Do the indexes have enough info to use them with http ranges? It'd be > > chunkier than a smart protocol, but it'd still work with dumb servers. > It would not be really performant, would it? Besides, not all Web servers > speak HTTP/1.1... Performant compared to downloading a huge packfile to get 10% of it? Sure! It'd probably take a few trips, and you'd end up fetching 20% of the file, still better than 100%. > Besides, not all Web servers speak HTTP/1.1... Are there any interesting webservers out there that don't? Hand-rolled purpose-built webservers often don't but those don't serve files, they serve web apps. When it comes to serving files, any webserver that is supported (security-wise) these days is HTTP/1.1. And for services like SF.net it'd be a safe low-cpu way of serving git files. 'cause the git protocol is quite expensive server-side (io+cpu) as we've seen with kernel.org. Being really smart with a cgi is probably going to be expensive too. cheers, m ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP? 2007-05-16 21:26 ` Martin Langhoff @ 2007-05-16 21:54 ` Jakub Narebski 2007-05-17 0:52 ` Johannes Schindelin 1 sibling, 0 replies; 47+ messages in thread From: Jakub Narebski @ 2007-05-16 21:54 UTC (permalink / raw) To: git Martin Langhoff wrote: > On 5/16/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: >> On Wed, 16 May 2007, Martin Langhoff wrote: >> > Do the indexes have enough info to use them with http ranges? It'd be >> > chunkier than a smart protocol, but it'd still work with dumb servers. >> It would not be really performant, would it? Besides, not all Web servers >> speak HTTP/1.1... > > Performant compared to downloading a huge packfile to get 10% of it? > Sure! It'd probably take a few trips, and you'd end up fetching 20% of > the file, still better than 100%. That's why you should have something akin to backup policy for pack files, like daily packs, weekly packs, ..., and the rest, just for the dumb protocols. -- Jakub Narebski Warsaw, Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP? 2007-05-16 21:26 ` Martin Langhoff 2007-05-16 21:54 ` Jakub Narebski @ 2007-05-17 0:52 ` Johannes Schindelin 2007-05-17 1:03 ` Shawn O. Pearce 2007-05-17 11:28 ` Matthieu Moy 1 sibling, 2 replies; 47+ messages in thread From: Johannes Schindelin @ 2007-05-17 0:52 UTC (permalink / raw) To: Martin Langhoff; +Cc: Jan Hudec, git Hi, On Thu, 17 May 2007, Martin Langhoff wrote: > On 5/16/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > > On Wed, 16 May 2007, Martin Langhoff wrote: > > > Do the indexes have enough info to use them with http ranges? It'd be > > > chunkier than a smart protocol, but it'd still work with dumb servers. > > It would not be really performant, would it? Besides, not all Web servers > > speak HTTP/1.1... > > Performant compared to downloading a huge packfile to get 10% of it? > Sure! It'd probably take a few trips, and you'd end up fetching 20% of > the file, still better than 100%. Don't forget that those 10% probably do not do you the favour to be in large chunks. Chances are that _every_ _single_ wanted object is separate from the others. > > Besides, not all Web servers speak HTTP/1.1... > > Are there any interesting webservers out there that don't? Hand-rolled > purpose-built webservers often don't but those don't serve files, they > serve web apps. When it comes to serving files, any webserver that is > supported (security-wise) these days is HTTP/1.1. > > And for services like SF.net it'd be a safe low-cpu way of serving git > files. 'cause the git protocol is quite expensive server-side (io+cpu) > as we've seen with kernel.org. Being really smart with a cgi is > probably going to be expensive too. It's probably better and faster than relying on a feature which does not exactly help. Ciao, Dscho ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP? 2007-05-17 0:52 ` Johannes Schindelin @ 2007-05-17 1:03 ` Shawn O. Pearce 2007-05-17 1:04 ` david 2007-05-17 3:45 ` Nicolas Pitre 2007-05-17 11:28 ` Matthieu Moy 1 sibling, 2 replies; 47+ messages in thread From: Shawn O. Pearce @ 2007-05-17 1:03 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Martin Langhoff, Jan Hudec, git Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > Don't forget that those 10% probably do not do you the favour to be in > large chunks. Chances are that _every_ _single_ wanted object is separate > from the others. That's completely possible. Assuming the objects even are packed in the first place. Its very unlikely that you would be able to fetch very large of a range from an existing packfile, you would be submitting most of your range requests for very very small sections. > > And for services like SF.net it'd be a safe low-cpu way of serving git > > files. 'cause the git protocol is quite expensive server-side (io+cpu) > > as we've seen with kernel.org. Being really smart with a cgi is > > probably going to be expensive too. > > It's probably better and faster than relying on a feature which does not > exactly help. Yes. Packing more often and pack v4 may help a lot there. The other thing is kernel.org should really try to encourage the folks with repositories there to try and share against one master repository, so the poor OS has a better chance at holding the bulk of linux-2.6.git in buffer cache. I'm not suggesting they share specifically against Linus' repository; maybe hpa and the other admins can host one seperately from Linus and enourage users to use that repository when on a system they maintain. In an SF.net type case this doesn't help however. Most of SF.net is tiny projects with very few, if any, developers. Hence most of that is going to be unsharable, infrequently accessed, and uh, not needed to be stored in buffer cache. For the few projects that are hosted there that have a large developer base they could use a shared repository approach as I just suggested for kernel.org. aka the "forks" thing in gitweb, and on repo.or.cz... -- Shawn. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP? 2007-05-17 1:03 ` Shawn O. Pearce @ 2007-05-17 1:04 ` david 2007-05-17 1:26 ` Shawn O. Pearce 2007-05-17 3:45 ` Nicolas Pitre 1 sibling, 1 reply; 47+ messages in thread From: david @ 2007-05-17 1:04 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Johannes Schindelin, Martin Langhoff, Jan Hudec, git On Wed, 16 May 2007, Shawn O. Pearce wrote: > Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > >>> And for services like SF.net it'd be a safe low-cpu way of serving git >>> files. 'cause the git protocol is quite expensive server-side (io+cpu) >>> as we've seen with kernel.org. Being really smart with a cgi is >>> probably going to be expensive too. >> >> It's probably better and faster than relying on a feature which does not >> exactly help. > > Yes. Packing more often and pack v4 may help a lot there. > > The other thing is kernel.org should really try to encourage the > folks with repositories there to try and share against one master > repository, so the poor OS has a better chance at holding the bulk > of linux-2.6.git in buffer cache. do you mean more precisely share against one object store or do you really mean repository? David Lang > I'm not suggesting they share specifically against Linus' repository; > maybe hpa and the other admins can host one seperately from Linus and > enourage users to use that repository when on a system they maintain. > > In an SF.net type case this doesn't help however. Most of SF.net > is tiny projects with very few, if any, developers. Hence most > of that is going to be unsharable, infrequently accessed, and uh, > not needed to be stored in buffer cache. For the few projects that > are hosted there that have a large developer base they could use > a shared repository approach as I just suggested for kernel.org. > > aka the "forks" thing in gitweb, and on repo.or.cz... > > ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP? 2007-05-17 1:04 ` david @ 2007-05-17 1:26 ` Shawn O. Pearce 2007-05-17 1:45 ` Shawn O. Pearce 0 siblings, 1 reply; 47+ messages in thread From: Shawn O. Pearce @ 2007-05-17 1:26 UTC (permalink / raw) To: david; +Cc: Johannes Schindelin, Martin Langhoff, Jan Hudec, git david@lang.hm wrote: > On Wed, 16 May 2007, Shawn O. Pearce wrote: > > > >The other thing is kernel.org should really try to encourage the > >folks with repositories there to try and share against one master > >repository, so the poor OS has a better chance at holding the bulk > >of linux-2.6.git in buffer cache. > > do you mean more precisely share against one object store or do you really > mean repository? Sorry, I did mean "object store". ;-) Repository is insanity, as the refs and tags namespaces are suddenly shared. What a nightmare that would become. -- Shawn. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP? 2007-05-17 1:26 ` Shawn O. Pearce @ 2007-05-17 1:45 ` Shawn O. Pearce 2007-05-17 12:36 ` Theodore Tso 0 siblings, 1 reply; 47+ messages in thread From: Shawn O. Pearce @ 2007-05-17 1:45 UTC (permalink / raw) To: david; +Cc: Johannes Schindelin, Martin Langhoff, Jan Hudec, git "Shawn O. Pearce" <spearce@spearce.org> wrote: > david@lang.hm wrote: > > On Wed, 16 May 2007, Shawn O. Pearce wrote: > > > > > >The other thing is kernel.org should really try to encourage the > > >folks with repositories there to try and share against one master > > >repository, so the poor OS has a better chance at holding the bulk > > >of linux-2.6.git in buffer cache. > > > > do you mean more precisely share against one object store or do you really > > mean repository? > > Sorry, I did mean "object store". ;-) And even there, I don't mean symlink objects to a shared database, I mean use the objects/info/alternates file to point to the shared, read-only location. Its not perfect. The hotter parts of the object database is almost always the recent stuff, as that's what people are actively trying to fetch, or are using as a base when they are trying to fetch from someone else. The hotter parts are also probably too new to be in the shared store offered by kernel.org admins, which means you cannot get good IO buffering. Back to the current set of problems. A single shared object directory that everyone can write new files into, but cannot modify or delete from, would help that problem quite a bit. But it opens up huge problems about pruning, as there is no way to perform garbage collection on that database without scanning every ref on the system, and that's just not simply possible on a busy system like kernel.org. -- Shawn. ^ permalink raw reply [flat|nested] 47+ messages in thread
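For concreteness, pointing a repository at such a shared read-only store is a
single line in objects/info/alternates; a small sketch with example paths:

# Make a repository borrow objects from a shared, read-only object store via
# objects/info/alternates, as described above.  Both paths are examples.
import os

repo_git_dir = "/home/user/linux-2.6/.git"
shared_objects = "/pub/scm/shared/linux-2.6.git/objects"

alternates = os.path.join(repo_git_dir, "objects", "info", "alternates")
os.makedirs(os.path.dirname(alternates), exist_ok=True)
with open(alternates, "a") as f:
    f.write(shared_objects + "\n")       # one object-directory path per line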
* Re: Smart fetch via HTTP? 2007-05-17 1:45 ` Shawn O. Pearce @ 2007-05-17 12:36 ` Theodore Tso 0 siblings, 0 replies; 47+ messages in thread From: Theodore Tso @ 2007-05-17 12:36 UTC (permalink / raw) To: Shawn O. Pearce Cc: david, Johannes Schindelin, Martin Langhoff, Jan Hudec, git On Wed, May 16, 2007 at 09:45:42PM -0400, Shawn O. Pearce wrote: > Its not perfect. The hotter parts of the object database is almost > always the recent stuff, as that's what people are actively trying > to fetch, or are using as a base when they are trying to fetch from > someone else. The hotter parts are also probably too new to be > in the shared store offered by kernel.org admins, which means you > cannot get good IO buffering. Back to the current set of problems. Actually, as long as objects/info/alternates is pointing at Linus's kernel.org tree, I would think that it should work relatively well, since everyone is normally basing their work on top of his tree as a starting point. - Ted ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP? 2007-05-17 1:03 ` Shawn O. Pearce 2007-05-17 1:04 ` david @ 2007-05-17 3:45 ` Nicolas Pitre 2007-05-17 10:48 ` Johannes Schindelin 1 sibling, 1 reply; 47+ messages in thread From: Nicolas Pitre @ 2007-05-17 3:45 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Johannes Schindelin, Martin Langhoff, Jan Hudec, git On Wed, 16 May 2007, Shawn O. Pearce wrote: > Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > > Don't forget that those 10% probably do not do you the favour to be in > > large chunks. Chances are that _every_ _single_ wanted object is separate > > from the others. > > That's completely possible. Assuming the objects even are packed > in the first place. Its very unlikely that you would be able to > fetch very large of a range from an existing packfile, you would be > submitting most of your range requests for very very small sections. Well, in the commit objects case you're likely to have a bunch of them all contigous. For tree and blob objects it is less likely. And of course there is the question of deltas for which you might or might not have the base object locally already. Still... I wonder if this could be actually workable. A typical daily update on the Linux kernel repository might consist of a couple hundreds or a few tousands objects. This could still be faster to fetch parts of a pack than the whole pack if the size difference is above a certain treshold. It is certainly not worse than fetching loose objects. Things would be pretty horrid if you think of fetching a commit object, parsing it to find out what tree object to fetch, then parse that tree object to find out what other objects to fetch, and so on. But if you only take the approach of fetching the pack index files, finding out about the objects that the remote has that are not available locally, and then fetching all those objects from within pack files without even looking at them (except for deltas), then it should be possible to issue a couple requests in parallel and possibly have decent performances. And if it turns out that more than, say, 70% of a particular pack is to be fetched (you can determine that up front), then it might be decided to fetch the whole pack. There is no way to sensibly keep those objects packed on the receiving end of course, but storing them as loose objects and repacking them afterwards should be just fine. Of course you'll get objects from branches in the remote repository you might not be interested in, but that's a price to pay for such a hack. On average the overhead shouldn't be that big anyway if branches within a repository are somewhat related. I think this is something worth experimenting. Nicolas ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP? 2007-05-17 3:45 ` Nicolas Pitre @ 2007-05-17 10:48 ` Johannes Schindelin 2007-05-17 14:41 ` Nicolas Pitre 0 siblings, 1 reply; 47+ messages in thread From: Johannes Schindelin @ 2007-05-17 10:48 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Shawn O. Pearce, Martin Langhoff, Jan Hudec, git Hi, On Wed, 16 May 2007, Nicolas Pitre wrote: > Still... I wonder if this could be actually workable. A typical daily > update on the Linux kernel repository might consist of a couple hundreds > or a few tousands objects. This could still be faster to fetch parts of > a pack than the whole pack if the size difference is above a certain > treshold. It is certainly not worse than fetching loose objects. > > Things would be pretty horrid if you think of fetching a commit object, > parsing it to find out what tree object to fetch, then parse that tree > object to find out what other objects to fetch, and so on. > > But if you only take the approach of fetching the pack index files, > finding out about the objects that the remote has that are not available > locally, and then fetching all those objects from within pack files > without even looking at them (except for deltas), then it should be > possible to issue a couple requests in parallel and possibly have decent > performances. And if it turns out that more than, say, 70% of a > particular pack is to be fetched (you can determine that up front), then > it might be decided to fetch the whole pack. > > There is no way to sensibly keep those objects packed on the receiving > end of course, but storing them as loose objects and repacking them > afterwards should be just fine. > > Of course you'll get objects from branches in the remote repository you > might not be interested in, but that's a price to pay for such a hack. > On average the overhead shouldn't be that big anyway if branches within > a repository are somewhat related. > > I think this is something worth experimenting. I am a bit wary about that, because it is so complex. IMHO a cgi which gets, say, up to a hundred refs (maybe something like ref~0, ref~1, ref~2, ref~4, ref~8, ref~16, ... for the refs), and then makes a bundle for that case on the fly, is easier to do. Of course, as with all cgi scripts, you have to make sure that DOS attacks have a low probability of success. Ciao, Dscho ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP? 2007-05-17 10:48 ` Johannes Schindelin @ 2007-05-17 14:41 ` Nicolas Pitre 2007-05-17 15:24 ` Martin Langhoff 2007-05-17 20:04 ` Jan Hudec 0 siblings, 2 replies; 47+ messages in thread From: Nicolas Pitre @ 2007-05-17 14:41 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Shawn O. Pearce, Martin Langhoff, Jan Hudec, git [-- Attachment #1: Type: TEXT/PLAIN, Size: 4719 bytes --] On Thu, 17 May 2007, Johannes Schindelin wrote: > Hi, > > On Wed, 16 May 2007, Nicolas Pitre wrote: > > > Still... I wonder if this could be actually workable. A typical daily > > update on the Linux kernel repository might consist of a couple hundreds > > or a few tousands objects. This could still be faster to fetch parts of > > a pack than the whole pack if the size difference is above a certain > > treshold. It is certainly not worse than fetching loose objects. > > > > Things would be pretty horrid if you think of fetching a commit object, > > parsing it to find out what tree object to fetch, then parse that tree > > object to find out what other objects to fetch, and so on. > > > > But if you only take the approach of fetching the pack index files, > > finding out about the objects that the remote has that are not available > > locally, and then fetching all those objects from within pack files > > without even looking at them (except for deltas), then it should be > > possible to issue a couple requests in parallel and possibly have decent > > performances. And if it turns out that more than, say, 70% of a > > particular pack is to be fetched (you can determine that up front), then > > it might be decided to fetch the whole pack. > > > > There is no way to sensibly keep those objects packed on the receiving > > end of course, but storing them as loose objects and repacking them > > afterwards should be just fine. > > > > Of course you'll get objects from branches in the remote repository you > > might not be interested in, but that's a price to pay for such a hack. > > On average the overhead shouldn't be that big anyway if branches within > > a repository are somewhat related. > > > > I think this is something worth experimenting. > > I am a bit wary about that, because it is so complex. IMHO a cgi which > gets, say, up to a hundred refs (maybe something like ref~0, ref~1, ref~2, > ref~4, ref~8, ref~16, ... for the refs), and then makes a bundle for that > case on the fly, is easier to do. And if you have 1) the permission and 2) the CPU power to execute such a cgi on the server and obviously 3) the knowledge to set it up properly, then why aren't you running the Git daemon in the first place? After all, they both boil down to running git-pack-objects and sending out the result. I don't think such a solution really buys much. On the other hand, if the client does all the work and provides the server with a list of ranges within a pack it wants to be sent, then you simply have zero special setup to perform on the hosting server and you keep the server load down due to not running pack-objects there. That, at least, is different enough from the Git daemon to be worth considering. Not only does it provide an advantage to those who cannot do anything but http out of their segregated network, but it also provide many advantages on the server side too while the cgi approach doesn't. And actually finding out the list of objects the remote has that you don't have is not that complex. It could go as follows: 1) Fetch every .idx files the remote has. 
2) From those .idx files, keep only a list of objects that are unknown
   locally. A good starting point for doing this really efficiently is
   the code for git-pack-redundant.
3) From the .idx files we got in (1), create a reverse index to get each
   object's size in the remote pack. The code to do this already exists
   in builtin-pack-objects.c.
4) With the list of missing objects from (2) along with their offset and
   size within a given pack file, fetch those objects from the remote
   server. Either perform multiple requests in parallel, or as someone
   mentioned already, provide the server with a list of ranges you want
   to be sent.
5) Store the received objects as loose objects locally. If a given
   object is a delta, verify if its base is available locally, or if it
   is listed amongst those objects to be fetched from the server. If
   not, add it to the list. In most cases, delta base objects will be
   objects already listed to be fetched anyway. To greatly simplify
   things, the loose delta object type from 2 years ago could be revived
   (commit 91d7b8afc2) since a repack will get rid of them.
6) Repeat (4) and (5) until everything has been fetched.
7) Run git-pack-objects with the list of fetched objects.
Et voilà. Oh, and of course update your local refs from the remote's.
Actually there is nothing really complex in the above operations. And
with this the server side remains really simple with no special setup
nor extra load beyond the simple serving of file content.
Nicolas
^ permalink raw reply [flat|nested] 47+ messages in thread
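A sketch of what steps (1) through (3) amount to in practice: parse a
version-1 pack .idx (the format in use at the time) to learn every remote
object's offset and stored size in the pack. The download routine, URL and
pack size are placeholders here.

# Sketch of steps (1)-(3) above: read a version-1 .idx file and derive each
# object's offset and on-disk size in the corresponding pack.  fetch() stands
# in for whatever HTTP download code is used; pack_size would come from a
# HEAD request (Content-Length) on the .pack file.
import struct

def parse_idx_v1(idx_data, pack_size):
    fanout = struct.unpack(">256I", idx_data[:1024])
    count = fanout[255]                      # total number of objects
    entries = []                             # (sha1_hex, offset_in_pack)
    pos = 1024
    for _ in range(count):
        offset, = struct.unpack(">I", idx_data[pos:pos + 4])
        sha1 = idx_data[pos + 4:pos + 24].hex()
        entries.append((sha1, offset))
        pos += 24
    # An object's stored size is the gap to the next object; the pack ends
    # with a 20-byte SHA-1 trailer after the last object.
    by_offset = sorted(entries, key=lambda e: e[1])
    ends = [off for _, off in by_offset[1:]] + [pack_size - 20]
    sizes = {sha1: end - off for (sha1, off), end in zip(by_offset, ends)}
    return {sha1: (off, sizes[sha1]) for sha1, off in entries}

# remote = parse_idx_v1(fetch(base_url + "pack-XXXX.idx"), remote_pack_size)
# missing = {s: loc for s, loc in remote.items() if s not in local_objects}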
* Re: Smart fetch via HTTP? 2007-05-17 14:41 ` Nicolas Pitre @ 2007-05-17 15:24 ` Martin Langhoff 2007-05-17 15:34 ` Nicolas Pitre 2007-05-17 20:04 ` Jan Hudec 1 sibling, 1 reply; 47+ messages in thread From: Martin Langhoff @ 2007-05-17 15:24 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Johannes Schindelin, Shawn O. Pearce, Jan Hudec, git On 5/18/07, Nicolas Pitre <nico@cam.org> wrote: > And if you have 1) the permission and 2) the CPU power to execute such a > cgi on the server and obviously 3) the knowledge to set it up properly, > then why aren't you running the Git daemon in the first place? And you probably _are_ running git daemon. But some clients may be on shitty connections that only allow http. That's one of the scenarios we're discussing. cheers, m ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP? 2007-05-17 15:24 ` Martin Langhoff @ 2007-05-17 15:34 ` Nicolas Pitre 0 siblings, 0 replies; 47+ messages in thread From: Nicolas Pitre @ 2007-05-17 15:34 UTC (permalink / raw) To: Martin Langhoff; +Cc: Johannes Schindelin, Shawn O. Pearce, Jan Hudec, git On Fri, 18 May 2007, Martin Langhoff wrote: > On 5/18/07, Nicolas Pitre <nico@cam.org> wrote: > > And if you have 1) the permission and 2) the CPU power to execute such a > > cgi on the server and obviously 3) the knowledge to set it up properly, > > then why aren't you running the Git daemon in the first place? > > And you probably _are_ running git daemon. But some clients may be on > shitty connections that only allow http. That's one of the scenarios > we're discussing. That's not what I'm disputing at all. I'm disputing the vertue of an HTTP solution involving a cgi with Git bundles vs an HTTP solution involving static file range serving. The clients on shitty connections don't care either ways. Nicolas ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP? 2007-05-17 14:41 ` Nicolas Pitre 2007-05-17 15:24 ` Martin Langhoff @ 2007-05-17 20:04 ` Jan Hudec 2007-05-17 20:31 ` Nicolas Pitre 2007-05-18 9:01 ` Johannes Schindelin 1 sibling, 2 replies; 47+ messages in thread From: Jan Hudec @ 2007-05-17 20:04 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Johannes Schindelin, Shawn O. Pearce, Martin Langhoff, git [-- Attachment #1: Type: text/plain, Size: 4738 bytes --] On Thu, May 17, 2007 at 10:41:37 -0400, Nicolas Pitre wrote: > On Thu, 17 May 2007, Johannes Schindelin wrote: > > On Wed, 16 May 2007, Nicolas Pitre wrote: > And if you have 1) the permission and 2) the CPU power to execute such a > cgi on the server and obviously 3) the knowledge to set it up properly, > then why aren't you running the Git daemon in the first place? After > all, they both boil down to running git-pack-objects and sending out the > result. I don't think such a solution really buys much. Yes, it does. I had 2 accounts where I could run CGI, but not separate server, at university while I studied and now I can get the same on friend's server. Neither of them would probably be ok for serving larger busy git repository, but something smaller accessed by several people is OK. I think this is quite common for university students. Of course your suggestion which moves the logic to client-side is a good one, but even the cgi with logic on server side would help in some situations. > On the other hand, if the client does all the work and provides the > server with a list of ranges within a pack it wants to be sent, then you > simply have zero special setup to perform on the hosting server and you > keep the server load down due to not running pack-objects there. That, > at least, is different enough from the Git daemon to be worth > considering. Not only does it provide an advantage to those who cannot > do anything but http out of their segregated network, but it also > provide many advantages on the server side too while the cgi approach > doesn't. > > And actually finding out the list of objects the remote has that you > don't have is not that complex. It could go as follows: > > 1) Fetch every .idx files the remote has. ... for git it's 1.2 MiB. And that definitely isn't a huge source tree. Of course the local side could remember which indices it already saw during previous fetch from that location and not re-fetch them. A slight problem is, that git-repack normally recombines everything to a single pack, so the index would have to be re-fetched again anyway. > 2) From those .idx files, keep only a list of objects that are unknown > locally. A good starting point for doing this really efficiently is > the code for git-pack-redundant. > > 3) From the .idx files we got in (1), create a reverse index to get each > object's size in the remote pack. The code to do this already exists > in builtin-pack-objects.c. > > 4) With the list of missing objects from (2) along with their offset and > size within a given pack file, fetch those objects from the remote > server. Either perform multiple requests in parallel, or as someone > mentioned already, provide the server with a list of ranges you want > to be sent. Does the git server really have to do so much beyond that? I didn't look at the algorithm that finds what deltas should be based on, but depending on that it might (or might not) be possible to proof the client has everything to understand if the server sends the objects as it currently has them. 
> 5) Store the received objects as loose objects locally. If a given > object is a delta, verify if its base is available locally, or if it > is listed amongst those objects to be fetched from the server. If > not, add it to the list. In most cases, delta base objects will be > objects already listed to be fetched anyway. To greatly simplify > things, the loose delta object type from 2 years ago could be revived > (commit 91d7b8afc2) since a repack will get rid of them. > > 6 Repeat (4) and (5) until everything has been fetched. Unless I am really seriously missing something, there is no point in repeating. For each pack you need to unpack a delta either: - you have it => ok. - you don't have it, but the server does => but than it's already in the fetch set calculated in 2. - you don't have it and nor does server => the repository at server is corrupted and you can't fix it. > 7) Run git-pack-objects with the list of fetched objects. > > Et voilà. Oh, and of course update your local refs from the remote's. > > Actually there is nothing really complex in the above operations. And > with this the server side remains really simple with no special setup > nor extra load beyond the simple serving of file content. On the other hand the amount of data transfered is larger, than with the git server approach, because at least the indices have to be transfered in entirety. So each approach has it's own advantages. -- Jan 'Bulb' Hudec <bulb@ucw.cz> [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP? 2007-05-17 20:04 ` Jan Hudec @ 2007-05-17 20:31 ` Nicolas Pitre 2007-05-17 21:00 ` david 2007-05-18 9:01 ` Johannes Schindelin 1 sibling, 1 reply; 47+ messages in thread From: Nicolas Pitre @ 2007-05-17 20:31 UTC (permalink / raw) To: Jan Hudec; +Cc: Johannes Schindelin, Shawn O. Pearce, Martin Langhoff, git On Thu, 17 May 2007, Jan Hudec wrote: > On Thu, May 17, 2007 at 10:41:37 -0400, Nicolas Pitre wrote: > > On Thu, 17 May 2007, Johannes Schindelin wrote: > > > On Wed, 16 May 2007, Nicolas Pitre wrote: > > And if you have 1) the permission and 2) the CPU power to execute such a > > cgi on the server and obviously 3) the knowledge to set it up properly, > > then why aren't you running the Git daemon in the first place? After > > all, they both boil down to running git-pack-objects and sending out the > > result. I don't think such a solution really buys much. > > Yes, it does. I had 2 accounts where I could run CGI, but not separate > server, at university while I studied and now I can get the same on friend's > server. Neither of them would probably be ok for serving larger busy git > repository, but something smaller accessed by several people is OK. I think > this is quite common for university students. > > Of course your suggestion which moves the logic to client-side is a good one, > but even the cgi with logic on server side would help in some situations. You could simply wrap git-bundle within a cgi. That is certainly easy enough. > > On the other hand, if the client does all the work and provides the > > server with a list of ranges within a pack it wants to be sent, then you > > simply have zero special setup to perform on the hosting server and you > > keep the server load down due to not running pack-objects there. That, > > at least, is different enough from the Git daemon to be worth > > considering. Not only does it provide an advantage to those who cannot > > do anything but http out of their segregated network, but it also > > provide many advantages on the server side too while the cgi approach > > doesn't. > > > > And actually finding out the list of objects the remote has that you > > don't have is not that complex. It could go as follows: > > > > 1) Fetch every .idx files the remote has. > > ... for git it's 1.2 MiB. And that definitely isn't a huge source tree. > Of course the local side could remember which indices it already saw during > previous fetch from that location and not re-fetch them. Right. The name of the pack/index plus its time stamp can be cached. If the remote doesn't repack too often then the overhead would be minimal. > > 2) From those .idx files, keep only a list of objects that are unknown > > locally. A good starting point for doing this really efficiently is > > the code for git-pack-redundant. > > > > 3) From the .idx files we got in (1), create a reverse index to get each > > object's size in the remote pack. The code to do this already exists > > in builtin-pack-objects.c. > > > > 4) With the list of missing objects from (2) along with their offset and > > size within a given pack file, fetch those objects from the remote > > server. Either perform multiple requests in parallel, or as someone > > mentioned already, provide the server with a list of ranges you want > > to be sent. > > Does the git server really have to do so much beyond that? Yes it does. 
The real thing perform a full object reachability walk and only the objects that are needed for the wanted branch(es) are sent in a custom pack meaning that the data transfer is really optimal. > > 5) Store the received objects as loose objects locally. If a given > > object is a delta, verify if its base is available locally, or if it > > is listed amongst those objects to be fetched from the server. If > > not, add it to the list. In most cases, delta base objects will be > > objects already listed to be fetched anyway. To greatly simplify > > things, the loose delta object type from 2 years ago could be revived > > (commit 91d7b8afc2) since a repack will get rid of them. > > > > 6 Repeat (4) and (5) until everything has been fetched. > > Unless I am really seriously missing something, there is no point in > repeating. For each pack you need to unpack a delta either: > - you have it => ok. > - you don't have it, but the server does => > but than it's already in the fetch set calculated in 2. > - you don't have it and nor does server => > the repository at server is corrupted and you can't fix it. You're right of course. Nicolas ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP? 2007-05-17 20:31 ` Nicolas Pitre @ 2007-05-17 21:00 ` david 0 siblings, 0 replies; 47+ messages in thread From: david @ 2007-05-17 21:00 UTC (permalink / raw) To: Nicolas Pitre Cc: Jan Hudec, Johannes Schindelin, Shawn O. Pearce, Martin Langhoff, git On Thu, 17 May 2007, Nicolas Pitre wrote: > On Thu, 17 May 2007, Jan Hudec wrote: > >> On Thu, May 17, 2007 at 10:41:37 -0400, Nicolas Pitre wrote: >>> On Thu, 17 May 2007, Johannes Schindelin wrote: >>>> On Wed, 16 May 2007, Nicolas Pitre wrote: >>> And if you have 1) the permission and 2) the CPU power to execute such a >>> cgi on the server and obviously 3) the knowledge to set it up properly, >>> then why aren't you running the Git daemon in the first place? After >>> all, they both boil down to running git-pack-objects and sending out the >>> result. I don't think such a solution really buys much. >> >> Yes, it does. I had 2 accounts where I could run CGI, but not separate >> server, at university while I studied and now I can get the same on friend's >> server. Neither of them would probably be ok for serving larger busy git >> repository, but something smaller accessed by several people is OK. I think >> this is quite common for university students. >> >> Of course your suggestion which moves the logic to client-side is a good one, >> but even the cgi with logic on server side would help in some situations. > > You could simply wrap git-bundle within a cgi. That is certainly easy > enough. isn't this (or something very similar) exactly what we want for a smalrt fetch via http? after all, we're completely in control of the client software, and the useual reason for HTTP-only access is on the client side rather then the server side. so http access that wraps the git protocol in http would make life much cleaner for lots of people there are a few cases where all you have is static web space, but I don't think it's worth trying to optimize that too much as you still have the safety issues to worry about David Lang ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP? 2007-05-17 20:04 ` Jan Hudec 2007-05-17 20:31 ` Nicolas Pitre @ 2007-05-18 9:01 ` Johannes Schindelin 2007-05-18 17:51 ` Jan Hudec 1 sibling, 1 reply; 47+ messages in thread From: Johannes Schindelin @ 2007-05-18 9:01 UTC (permalink / raw) To: Jan Hudec; +Cc: Nicolas Pitre, Shawn O. Pearce, Martin Langhoff, git [-- Attachment #1: Type: TEXT/PLAIN, Size: 1567 bytes --] Hi, On Thu, 17 May 2007, Jan Hudec wrote: > On Thu, May 17, 2007 at 10:41:37 -0400, Nicolas Pitre wrote: > > > And if you have 1) the permission and 2) the CPU power to execute such > > a cgi on the server and obviously 3) the knowledge to set it up > > properly, then why aren't you running the Git daemon in the first > > place? After all, they both boil down to running git-pack-objects and > > sending out the result. I don't think such a solution really buys > > much. > > Yes, it does. I had 2 accounts where I could run CGI, but not separate > server, at university while I studied and now I can get the same on > friend's server. Neither of them would probably be ok for serving larger > busy git repository, but something smaller accessed by several people is > OK. I think this is quite common for university students. 1) This has nothing to do with the way the repo is served, but how much you advertise it. The load will not be lower, just because you use a CGI script. 2) you say yourself that git-daemon would have less impact on the load: > > [...] > > > > Et voilà. Oh, and of course update your local refs from the > > remote's. > > > > Actually there is nothing really complex in the above operations. And > > with this the server side remains really simple with no special setup > > nor extra load beyond the simple serving of file content. > > On the other hand the amount of data transfered is larger, than with the > git server approach, because at least the indices have to be transfered > in entirety. Ciao, Dscho ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP? 2007-05-18 9:01 ` Johannes Schindelin @ 2007-05-18 17:51 ` Jan Hudec 0 siblings, 0 replies; 47+ messages in thread From: Jan Hudec @ 2007-05-18 17:51 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Nicolas Pitre, Shawn O. Pearce, Martin Langhoff, git [-- Attachment #1: Type: text/plain, Size: 2906 bytes --] On Fri, May 18, 2007 at 10:01:52 +0100, Johannes Schindelin wrote: > Hi, > > On Thu, 17 May 2007, Jan Hudec wrote: > > > On Thu, May 17, 2007 at 10:41:37 -0400, Nicolas Pitre wrote: > > > > > And if you have 1) the permission and 2) the CPU power to execute such > > > a cgi on the server and obviously 3) the knowledge to set it up > > > properly, then why aren't you running the Git daemon in the first > > > place? After all, they both boil down to running git-pack-objects and > > > sending out the result. I don't think such a solution really buys > > > much. > > > > Yes, it does. I had 2 accounts where I could run CGI, but not separate > > server, at university while I studied and now I can get the same on > > friend's server. Neither of them would probably be ok for serving larger > > busy git repository, but something smaller accessed by several people is > > OK. I think this is quite common for university students. > > 1) This has nothing to do with the way the repo is served, but how much > you advertise it. The load will not be lower, just because you use a CGI > script. That won't. But that was never the purpose of "smart cgi". The purpose was to minimize the bandwidth usage (and connectivity is still not so cheap that you'd not care) while still working over http either because the users need to access it from behind firewall or because administrator is not willing to set up git-daemon for you, while CGI you can run yourself. > 2) you say yourself that git-daemon would have less impact on the load: NO, I didn't -- at least not in the paragraph below. In the below paragraph I said, that *network* use will never be as good with *dumb* solution, as it can be with smart solution, no matter whether it is over special protocol or HTTP. --- Of course it would be less efficient in both CPU and network load, because there is the overhead of the web server and overhead of the http headers. Actually I like the ranges solution. If accompanied with repack stategy that does not pack everything together, but instead creates packs of limited number of objects -- so that the indices don't exceed configurable size, say 64kB -- could not so much less efficient for the network and have the advantage of working without ability to execute CGI. > > > [...] > > > > > > Et voilà. Oh, and of course update your local refs from the > > > remote's. > > > > > > Actually there is nothing really complex in the above operations. And > > > with this the server side remains really simple with no special setup > > > nor extra load beyond the simple serving of file content. > > > > On the other hand the amount of data transfered is larger, than with the > > git server approach, because at least the indices have to be transfered > > in entirety. -- Jan 'Bulb' Hudec <bulb@ucw.cz> [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP? 2007-05-17 0:52 ` Johannes Schindelin 2007-05-17 1:03 ` Shawn O. Pearce @ 2007-05-17 11:28 ` Matthieu Moy 2007-05-17 13:10 ` Martin Langhoff 1 sibling, 1 reply; 47+ messages in thread From: Matthieu Moy @ 2007-05-17 11:28 UTC (permalink / raw) To: git Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: > Hi, > > On Thu, 17 May 2007, Martin Langhoff wrote: > >> On 5/16/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: >> > On Wed, 16 May 2007, Martin Langhoff wrote: >> > > Do the indexes have enough info to use them with http ranges? It'd be >> > > chunkier than a smart protocol, but it'd still work with dumb servers. >> > It would not be really performant, would it? Besides, not all Web servers >> > speak HTTP/1.1... >> >> Performant compared to downloading a huge packfile to get 10% of it? >> Sure! It'd probably take a few trips, and you'd end up fetching 20% of >> the file, still better than 100%. > > Don't forget that those 10% probably do not do you the favour to be in > large chunks. Chances are that _every_ _single_ wanted object is separate > from the others. FYI, bzr uses HTTP range requests, and the introduction of this feature lead to significant performance improvement for them (bzr is more dumb-protocol oriented than git is, so that's really important there). They have this "index file+data file" system too, so you download the full index file, and then send an HTTP range request to get only the relevant parts of the data file. The thing is, AAUI, they don't send N range requests to get N chunks, but one HTTP request, requesting the N ranges at a time, and get the N chunks a a whole (IIRC, a kind of MIME-encoded response from the server). So, you pay the price of a longer HTTP request, but not the price of N networks round-trips. That's surely not as efficient as anything smart on the server, but might really help for the cases where the server is /not/ smart. -- Matthieu ^ permalink raw reply [flat|nested] 47+ messages in thread
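On the wire, the single multi-range request Matthieu describes is just one
Range header listing several byte spans; a sketch follows (the URL and offsets
are placeholders, and a 206 reply carrying several ranges comes back as
multipart/byteranges, which still has to be split on the boundary named in the
Content-Type header):

# Request several regions of a remote pack in one round trip, as described
# above.  The URL and offsets are placeholders for illustration.
import urllib.request

def fetch_ranges(url, ranges):
    spec = ",".join("%d-%d" % (off, off + size - 1) for off, size in ranges)
    req = urllib.request.Request(url, headers={"Range": "bytes=" + spec})
    with urllib.request.urlopen(req) as resp:
        if resp.status != 206:
            raise RuntimeError("server ignored the Range header")
        return resp.headers.get("Content-Type"), resp.read()

ctype, body = fetch_ranges(
    "http://example.org/repo.git/objects/pack/pack-1234.pack",
    [(12, 4096), (65536, 2048), (1 << 20, 8192)])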
* Re: Smart fetch via HTTP? 2007-05-17 11:28 ` Matthieu Moy @ 2007-05-17 13:10 ` Martin Langhoff 2007-05-17 13:47 ` Johannes Schindelin 0 siblings, 1 reply; 47+ messages in thread From: Martin Langhoff @ 2007-05-17 13:10 UTC (permalink / raw) To: git On 5/17/07, Matthieu Moy <Matthieu.Moy@imag.fr> wrote: > FYI, bzr uses HTTP range requests, and the introduction of this > feature lead to significant performance improvement for them (bzr is > more dumb-protocol oriented than git is, so that's really important > there). They have this "index file+data file" system too, so you > download the full index file, and then send an HTTP range request to > get only the relevant parts of the data file. That's the kind of thing I was imagining. Between the index and an additional "index-supplement-for-dumb-protocols" maintained by update-server-info, http ranges can be bent to our evil purposes. Of course it won't be as network-efficient as the git proto, or even as the git-over-cgi proto, but it'll surely be server-cpu-and-memory efficient. And people will benefit from it without having to do any additional setup. It might be hard to come up with a usable approach to http ranges. But I do think it's worth considering carefully. cheers, m ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP? 2007-05-17 13:10 ` Martin Langhoff @ 2007-05-17 13:47 ` Johannes Schindelin 2007-05-17 14:05 ` Matthieu Moy ` (2 more replies) 0 siblings, 3 replies; 47+ messages in thread From: Johannes Schindelin @ 2007-05-17 13:47 UTC (permalink / raw) To: Martin Langhoff; +Cc: git Hi, [I missed this mail, because Matthieu culled the Cc list again] On Fri, 18 May 2007, Martin Langhoff wrote: > On 5/17/07, Matthieu Moy <Matthieu.Moy@imag.fr> wrote: > > > FYI, bzr uses HTTP range requests, and the introduction of this > > feature lead to significant performance improvement for them (bzr is > > more dumb-protocol oriented than git is, so that's really important > > there). They have this "index file+data file" system too, so you > > download the full index file, and then send an HTTP range request to > > get only the relevant parts of the data file. > > That's the kind of thing I was imagining. Between the index and an > additional "index-supplement-for-dumb-protocols" maintained by > update-server-info, http ranges can be bent to our evil purposes. > > Of course it won't be as network-efficient as the git proto, or even > as the git-over-cgi proto, but it'll surely be server-cpu-and-memory > efficient. And people will benefit from it without having to do any > additional setup. Of course, the problem is that only the server can know beforehand which objects are needed. Imagine this: X - Y - Z \ A Client has "X", wants "Z", but not "A". Client needs "Y" and "Z". But client cannot know that it needs "Y" before getting "Z", except if the server says so. If you have a solution for that problem, please enlighten me: I don't. Ciao, Dscho ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP? 2007-05-17 13:47 ` Johannes Schindelin @ 2007-05-17 14:05 ` Matthieu Moy 2007-05-17 14:09 ` Martin Langhoff 2007-05-17 14:50 ` Nicolas Pitre 2 siblings, 0 replies; 47+ messages in thread From: Matthieu Moy @ 2007-05-17 14:05 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Martin Langhoff, git Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: > Hi, > > [I missed this mail, because Matthieu culled the Cc list again] Sorry about that, miss-configuration of my mailer. I didn't find time to solve it before. OTOH, since most people actually complain when you Cc them on a mailing list, the choice "To Cc or not to Cc" has no universal solution ;-). -- Matthieu ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP? 2007-05-17 13:47 ` Johannes Schindelin 2007-05-17 14:05 ` Matthieu Moy @ 2007-05-17 14:09 ` Martin Langhoff 2007-05-17 15:01 ` Nicolas Pitre 2007-05-17 23:14 ` Jakub Narebski 2007-05-17 14:50 ` Nicolas Pitre 2 siblings, 2 replies; 47+ messages in thread From: Martin Langhoff @ 2007-05-17 14:09 UTC (permalink / raw) To: Johannes Schindelin; +Cc: git On 5/18/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > If you have a solution for that problem, please enlighten me: I don't. Ok - worst case scenario - have a minimal hints file that tells me the ranges to fetch all commits and all trees. To reduce that Add to the hints file data to name the hashes (or even better - offsets) for the delta chains that contain commits+trees relevant to all the heads - minus 10, 20, 30, 40 commits and 1,2,4,8 and 16 days. So there's a good chance the client can get the commits+trees needed efficiently. For blobs, all you need is the index to mark the delta chains you need. cheers, m ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
  2007-05-17 14:09 ` Martin Langhoff
@ 2007-05-17 15:01   ` Nicolas Pitre
  2007-05-17 23:14   ` Jakub Narebski
  1 sibling, 0 replies; 47+ messages in thread
From: Nicolas Pitre @ 2007-05-17 15:01 UTC (permalink / raw)
To: Martin Langhoff; +Cc: Johannes Schindelin, git

On Fri, 18 May 2007, Martin Langhoff wrote:

> On 5/18/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > If you have a solution for that problem, please enlighten me: I don't.
>
> Ok - worst case scenario - have a minimal hints file that tells me the
> ranges to fetch all commits and all trees. To reduce that, add to the
> hints file data naming the hashes (or, even better, offsets) of the
> delta chains that contain the commits+trees relevant to all the heads
> minus 10, 20, 30, 40 commits and 1, 2, 4, 8 and 16 days.

NO !

This is unreliable, unnecessary, and it actually kills the beauty of the
solution's simplicity.

You get updates for every branch the remote has, period. No server-side
extra files, no guesses, no arbitrary ranges, no backward compatibility
issues, no crap!


Nicolas

^ permalink raw reply	[flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
  2007-05-17 14:09 ` Martin Langhoff
  2007-05-17 15:01 ` Nicolas Pitre
@ 2007-05-17 23:14   ` Jakub Narebski
  1 sibling, 0 replies; 47+ messages in thread
From: Jakub Narebski @ 2007-05-17 23:14 UTC (permalink / raw)
To: git

Martin Langhoff wrote:

> On 5/18/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
>> If you have a solution for that problem, please enlighten me: I don't.
>
> Ok - worst case scenario - have a minimal hints file that tells me the
> ranges to fetch all commits and all trees. To reduce that, add to the
> hints file data naming the hashes (or, even better, offsets) of the
> delta chains that contain the commits+trees relevant to all the heads
> minus 10, 20, 30, 40 commits and 1, 2, 4, 8 and 16 days.
>
> So there's a good chance the client can get the commits+trees it needs
> efficiently. For blobs, all you need is the index to mark the delta
> chains you need.

By the way, I think we should always get the whole delta chain, unless
we are absolutely sure that we have the base object(s) in the repo.

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
  2007-05-17 13:47 ` Johannes Schindelin
  2007-05-17 14:05 ` Matthieu Moy
  2007-05-17 14:09 ` Martin Langhoff
@ 2007-05-17 14:50   ` Nicolas Pitre
  2 siblings, 0 replies; 47+ messages in thread
From: Nicolas Pitre @ 2007-05-17 14:50 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Martin Langhoff, git

On Thu, 17 May 2007, Johannes Schindelin wrote:

> Hi,
>
> [I missed this mail, because Matthieu culled the Cc list again]
>
> On Fri, 18 May 2007, Martin Langhoff wrote:
>
> > On 5/17/07, Matthieu Moy <Matthieu.Moy@imag.fr> wrote:
> >
> > > FYI, bzr uses HTTP range requests, and the introduction of this
> > > feature led to a significant performance improvement for them (bzr is
> > > more dumb-protocol oriented than git is, so that's really important
> > > there). They have this "index file + data file" system too, so you
> > > download the full index file, and then send an HTTP range request to
> > > get only the relevant parts of the data file.
> >
> > That's the kind of thing I was imagining. Between the index and an
> > additional "index-supplement-for-dumb-protocols" maintained by
> > update-server-info, HTTP ranges can be bent to our evil purposes.
> >
> > Of course it won't be as network-efficient as the git proto, or even
> > as the git-over-cgi proto, but it'll surely be server-cpu-and-memory
> > efficient. And people will benefit from it without having to do any
> > additional setup.
>
> Of course, the problem is that only the server can know beforehand which
> objects are needed.

But the whole idea is that we don't care.

> Imagine this:
>
>         X - Y - Z
>          \
>           A
>
> The client has "X" and wants "Z", but not "A". The client needs "Y" and
> "Z", but it cannot know that it needs "Y" before getting "Z", except if
> the server says so.
>
> If you have a solution for that problem, please enlighten me: I don't.

We're talking about a _dumb_ protocol here. If you want something fancy,
just use the Git daemon.

Otherwise, you'll simply get everything the remote has that you don't
have, including A.

In practice this shouldn't be a problem because people tend to have clean
repositories on the machines where they want their stuff published,
meaning that those public repos are usually the result of pushes, hence
they contain only the minimum set of needed objects.

Of course you get every branch and not only a particular one, but that's
the price to pay with a dumb protocol.


Nicolas

^ permalink raw reply	[flat|nested] 47+ messages in thread
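[Illustration: the range-request approach quoted above, expressed with a
dumb HTTP client. The pack name is a placeholder and the byte range is
arbitrary; the commands only show the mechanism, they are not a proposed
git feature.]

        # Fetch the pack list and a pack's index in full (dumb HTTP).
        curl -O http://host/repo.git/objects/info/packs
        curl -O http://host/repo.git/objects/pack/pack-<sha1>.idx

        # After picking the wanted object offsets out of the .idx, ask
        # the server for just those byte ranges of the pack file.
        curl -r 3200-65535 -o partial.pack \
            http://host/repo.git/objects/pack/pack-<sha1>.pack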
* Re: Smart fetch via HTTP?
  2007-05-15 20:10 Smart fetch via HTTP? Jan Hudec
  ` (2 preceding siblings ...)
  2007-05-16  5:25 ` Martin Langhoff
@ 2007-05-17 12:40   ` Petr Baudis
  2007-05-17 12:48     ` Matthieu Moy
  2007-05-17 20:26     ` Jan Hudec
  3 siblings, 2 replies; 47+ messages in thread
From: Petr Baudis @ 2007-05-17 12:40 UTC (permalink / raw)
To: Jan Hudec; +Cc: git

Hi,

On Tue, May 15, 2007 at 10:10:06PM CEST, Jan Hudec wrote:
> Did anyone already think about fetching over HTTP working similarly to the
> native git protocol?
>
> That is rather than reading the raw content of the repository, there would be
> a CGI script (could be integrated to gitweb), that would negotiate what the
> client needs and then generate and send a single pack with it.

frankly, I'm not that excited. I'm not disputing that this would be
useful, but I have my doubts about just how *much* more useful it would
be - I'm not so sure the set of users affected is really all that large.
So I'm just cooling people down here. ;-))

> Mercurial and bzr both have this option. It would IMO have three benefits:
> - Fast access for people behind paranoid firewalls, that only let http and
>   https (you can tunnel anything through, but only to port 443) through.

How many users really have this problem? I'm not so sure. There are
certainly some, but enough for this to be a viable argument?

> - Can be run on a shared machine. If you have web space on a machine shared
>   by many people, you can set up your own gitweb, but cannot/are not allowed
>   to start your own network server for the git native protocol.

You need to have CGI-enabled hosting, set up the CGI script etc. -
overall, the setup is similarly complicated to a git-daemon setup, so
it's not a "zero-setup" solution anymore.

Again, I'm not sure just how many people are in the situation that
they can run real CGI (not just PHP) but not git-daemon.

> - Less things to set up. If you are setting up gitweb anyway, you'd not need
>   to set up an additional thing for providing fetch access.

Except, well, how do you "set it up"? You need to make sure
git-update-server-info is run, yes, but that shouldn't be a problem (I'm
not so sure if git does this for you automagically - Cogito would...).
I think 95% of people don't set up gitweb.cgi either for their small
HTTP repositories. :-)

Then again, it's not that it would be really technically complicated -
adding "give me a bundle" support to gitweb should be pretty easy.
However, this support has some "social" costs as well: no compatibility
with older git versions, support cost, confusion between the dumb HTTP
and gitweb HTTP transports, less motivation for improving the dumb HTTP
transport...

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Ever try. Ever fail. No matter. // Try again. Fail again. Fail better.
		-- Samuel Beckett

^ permalink raw reply	[flat|nested] 47+ messages in thread
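[For reference: the usual way to keep a dumb-HTTP export current is to run
git-update-server-info from the post-update hook, roughly as below. The
paths assume a bare repository that the web server exports as-is.]

        # Inside the repository that is exported over plain HTTP:
        printf '#!/bin/sh\nexec git-update-server-info\n' > hooks/post-update
        chmod +x hooks/post-update
        # From then on, every push refreshes info/refs and
        # objects/info/packs, which dumb HTTP clients rely on.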
* Re: Smart fetch via HTTP?
  2007-05-17 12:40 ` Petr Baudis
@ 2007-05-17 12:48   ` Matthieu Moy
  2007-05-18 18:27     ` Linus Torvalds
  2007-05-17 20:26 ` Jan Hudec
  1 sibling, 1 reply; 47+ messages in thread
From: Matthieu Moy @ 2007-05-17 12:48 UTC (permalink / raw)
To: git

Petr Baudis <pasky@suse.cz> writes:

>> Mercurial and bzr both have this option. It would IMO have three benefits:
>> - Fast access for people behind paranoid firewalls, that only let http and
>>   https (you can tunnel anything through, but only to port 443) through.
>
> How many users really have this problem? I'm not so sure.

Many (if not most?) of the people working in a big company, I'd say.
Yeah, it sucks, but people who have used a paranoid firewall with a
no-less-paranoid and broken proxy understand what I mean.

>> - Can be run on a shared machine. If you have web space on a machine shared
>>   by many people, you can set up your own gitweb, but cannot/are not allowed
>>   to start your own network server for the git native protocol.
>
> You need to have CGI-enabled hosting, set up the CGI script etc. -
> overall, the setup is similarly complicated to a git-daemon setup, so
> it's not a "zero-setup" solution anymore.
>
> Again, I'm not sure just how many people are in the situation that
> they can run real CGI (not just PHP) but not git-daemon.

Any volunteer to write a full-PHP version of git? ;-)

-- 
Matthieu

^ permalink raw reply	[flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
  2007-05-17 12:48 ` Matthieu Moy
@ 2007-05-18 18:27   ` Linus Torvalds
  2007-05-18 18:33     ` alan
  ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Linus Torvalds @ 2007-05-18 18:27 UTC (permalink / raw)
To: Matthieu Moy; +Cc: git

On Thu, 17 May 2007, Matthieu Moy wrote:
>
> Many (if not most?) of the people working in a big company, I'd say.
> Yeah, it sucks, but people who have used a paranoid firewall with a
> no-less-paranoid and broken proxy understand what I mean.

Well, we could try to support the git protocol over port 80..

IOW, it's probably easier to try to get people to use

	git clone git://some.host:80/project

and just run git-daemon on port 80, than it is to try to set up magic cgi
scripts etc.

Doing that with virtual hosts etc should be pretty trivial. Much more so
than trying to make a git-cgi script.

And yes, I do realize that in theory you can have http-aware firewalls
that expect to see the normal http sequences in the first few packets in
order to pass things through, but I seriously doubt it's very common.

		Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread
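[Concretely, the port-80 idea looks something like the sketch below. The
base path and host name are examples only, and binding to port 80 needs
root privileges or a port redirect.]

        # Server side: answer the native git protocol on port 80.
        git daemon --verbose --export-all --base-path=/srv/git --port=80

        # Client side: point the native protocol at port 80 explicitly.
        git clone git://some.host:80/project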
* Re: Smart fetch via HTTP?
  2007-05-18 18:27 ` Linus Torvalds
@ 2007-05-18 18:33   ` alan
  2007-05-18 19:01 ` Joel Becker
  2007-05-19  0:50 ` david
  2 siblings, 0 replies; 47+ messages in thread
From: alan @ 2007-05-18 18:33 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Matthieu Moy, git

On Fri, 18 May 2007, Linus Torvalds wrote:

> On Thu, 17 May 2007, Matthieu Moy wrote:
>>
>> Many (if not most?) of the people working in a big company, I'd say.
>> Yeah, it sucks, but people who have used a paranoid firewall with a
>> no-less-paranoid and broken proxy understand what I mean.
>
> Well, we could try to support the git protocol over port 80..
>
> IOW, it's probably easier to try to get people to use
>
> 	git clone git://some.host:80/project
>
> and just run git-daemon on port 80, than it is to try to set up magic cgi
> scripts etc.

Except some filtering firewalls try to strip content from the data (like
ActiveX controls). Running git on port 53 will bypass pretty much every
firewall out there.

(If you want to learn how to bypass an overactive firewall, talk to a
bunch of teenagers at a school with an aggressive porn filter.)

-- 
"ANSI C says access to the padding fields of a struct is undefined.
ANSI C also says that struct assignment is a memcpy. Therefore struct
assignment in ANSI C is a violation of ANSI C..."
                                  - Alan Cox

^ permalink raw reply	[flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
  2007-05-18 18:27 ` Linus Torvalds
  2007-05-18 18:33 ` alan
@ 2007-05-18 19:01   ` Joel Becker
  2007-05-18 20:06     ` Matthieu Moy
  2007-05-18 20:13     ` Linus Torvalds
  2 siblings, 2 replies; 47+ messages in thread
From: Joel Becker @ 2007-05-18 19:01 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Matthieu Moy, git

On Fri, May 18, 2007 at 11:27:22AM -0700, Linus Torvalds wrote:
> Well, we could try to support the git protocol over port 80..
>
> IOW, it's probably easier to try to get people to use
>
> 	git clone git://some.host:80/project
>
> and just run git-daemon on port 80, than it is to try to set up magic cgi
> scripts etc.

	Can we teach the git-daemon to parse the HTTP headers
(specifically, the URL) and return the appropriate HTTP response?

> And yes, I do realize that in theory you can have http-aware firewalls
> that expect to see the normal http sequences in the first few packets in
> order to pass things through, but I seriously doubt it's very common.

	It's not about packet scanning, it's about GET vs CONNECT. If
the proxy allows GET but not CONNECT, it's going to forward the HTTP
protocol to the server, and git-daemon is going to see "GET /project
HTTP/1.1" as its first input. Now, perhaps we can cook that up behind
some apache so that apache handles vhosting the URL, then calls
git-daemon, which can take the stdin. So we'd be doing POST, not GET.
	On the other hand, if the proxy allows CONNECT, there is no
scanning for HTTP sequences done by the proxy. It just allows all raw
data (as it figures you're doing SSL).
	A normal company needs to have their firewall allow CONNECT to
9418. Then git proxying over HTTP is possible to a standard git-daemon.

Joel

-- 
"The first requisite of a good citizen in this republic of ours is
 that he shall be able and willing to pull his weight."
	- Theodore Roosevelt

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
  2007-05-18 19:01 ` Joel Becker
@ 2007-05-18 20:06   ` Matthieu Moy
  2007-05-18 20:13 ` Linus Torvalds
  1 sibling, 0 replies; 47+ messages in thread
From: Matthieu Moy @ 2007-05-18 20:06 UTC (permalink / raw)
To: Joel Becker; +Cc: Linus Torvalds, git

Joel Becker <Joel.Becker@oracle.com> writes:

> 	A normal company needs to have their firewall allow CONNECT to
> 9418. Then git proxying over HTTP is possible to a standard git-daemon.

443 should work too (that's HTTPS, and the proxy can't filter it, since
this would be a man-in-the-middle attack).

-- 
Matthieu

^ permalink raw reply	[flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
  2007-05-18 19:01 ` Joel Becker
  2007-05-18 20:06 ` Matthieu Moy
@ 2007-05-18 20:13   ` Linus Torvalds
  2007-05-18 21:56     ` Joel Becker
  1 sibling, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2007-05-18 20:13 UTC (permalink / raw)
To: Joel Becker; +Cc: Matthieu Moy, git

On Fri, 18 May 2007, Joel Becker wrote:
>
> It's not about packet scanning, it's about GET vs CONNECT. If
> the proxy allows GET but not CONNECT, it's going to forward the HTTP
> protocol to the server, and git-daemon is going to see "GET /project
> HTTP/1.1" as its first input. Now, perhaps we can cook that up behind
> some apache so that apache handles vhosting the URL, then calls
> git-daemon, which can take the stdin. So we'd be doing POST, not GET.

If it's _just_ the initial GET/CONNECT strings, yeah, we could probably
easily make the git-daemon just ignore them. That shouldn't be a problem.

But if there's anything *else* required, it gets uglier much more quickly.

		Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
  2007-05-18 20:13 ` Linus Torvalds
@ 2007-05-18 21:56   ` Joel Becker
  2007-05-20 10:30     ` Jan Hudec
  0 siblings, 1 reply; 47+ messages in thread
From: Joel Becker @ 2007-05-18 21:56 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Matthieu Moy, git

On Fri, May 18, 2007 at 01:13:36PM -0700, Linus Torvalds wrote:
> If it's _just_ the initial GET/CONNECT strings, yeah, we could probably
> easily make the git-daemon just ignore them. That shouldn't be a problem.
>
> But if there's anything *else* required, it gets uglier much more quickly.

	With CONNECT, there isn't anything. That is, your
GIT_PROXY_COMMAND handles talking to the proxy, then gives git itself a
raw data pipe. My proxy allows CONNECT to 9418, and that's how I use it
today.
	If you tried to make POST work (it'd be POST, not GET, as you
need to connect up the sending side), either apache would have to front
it for us, or "git-daemon --http" would have to accept the HTTP headers
before the input, and output a proper HTTP response before sending
output. Seeing the headers would allow us to vhost, even.
	Hmm, but the proxy may not allow two-way communication. Does
the git protocol have more than one round-trip? That is:

Client:
	POST http://server.git.host:80/projects/thisproject HTTP/1.1
	Host: server.git.host

	fetch-pack <sha1>
	EOF

Server:
	HTTP/1.1 200 OK

	<data>
	EOF

should work, I'd think.

Joel

-- 
"Ninety feet between bases is perhaps as close as man has ever come to
 perfection."
	- Red Smith

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 47+ messages in thread
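[Sketch of the CONNECT case: git runs GIT_PROXY_COMMAND with the destination
host and port as arguments and expects a raw pipe on stdin/stdout. The
script name and proxy address below are made up; socat's PROXY address type
performs the CONNECT handshake.]

        #!/bin/sh
        # Save as e.g. /usr/local/bin/git-proxy and make it executable;
        # git invokes it as: git-proxy <host> <port>
        exec socat - "PROXY:proxy.example.com:$1:$2,proxyport=3128"

        # Then:
        #   export GIT_PROXY_COMMAND=/usr/local/bin/git-proxy
        #   git clone git://some.host/project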
* Re: Smart fetch via HTTP?
  2007-05-18 21:56 ` Joel Becker
@ 2007-05-20 10:30   ` Jan Hudec
  0 siblings, 0 replies; 47+ messages in thread
From: Jan Hudec @ 2007-05-20 10:30 UTC (permalink / raw)
To: Joel Becker; +Cc: Linus Torvalds, Matthieu Moy, git

[-- Attachment #1: Type: text/plain, Size: 2000 bytes --]

On Fri, May 18, 2007 at 14:56:07 -0700, Joel Becker wrote:
> On Fri, May 18, 2007 at 01:13:36PM -0700, Linus Torvalds wrote:
> > If it's _just_ the initial GET/CONNECT strings, yeah, we could probably
> > easily make the git-daemon just ignore them. That shouldn't be a problem.
> >
> > But if there's anything *else* required, it gets uglier much more quickly.
>
> 	With CONNECT, there isn't anything. That is, your
> GIT_PROXY_COMMAND handles talking to the proxy, then gives git itself a
> raw data pipe. My proxy allows CONNECT to 9418, and that's how I use it
> today.

Yes, CONNECT is easy. However, many companies only allow CONNECT to 443
(not that it's much more secure than allowing it anywhere, but at least
it has to block CONNECT to 25 to block sending spam).

> 	If you tried to make POST work (it'd be POST, not GET, as you
> need to connect up the sending side), either apache would have to front
> it for us, or "git-daemon --http" would have to accept the HTTP headers
> before the input, and output a proper HTTP response before sending
> output. Seeing the headers would allow us to vhost, even.
> 	Hmm, but the proxy may not allow two-way communication. Does
> the git protocol have more than one round-trip? That is:
>
> Client:
> 	POST http://server.git.host:80/projects/thisproject HTTP/1.1
> 	Host: server.git.host
>
> 	fetch-pack <sha1>
> 	EOF
>
> Server:
> 	HTTP/1.1 200 OK
>
> 	<data>
> 	EOF
>
> should work, I'd think.

Well, that does not require git at all -- apache can handle this all
right. But it's not network-efficient. To be network-efficient, it is
necessary to negotiate the list of objects that need to be sent. And
that requires more than one round-trip.

Additionally, the current git protocol is streaming -- the client sends
data without waiting for the server. So it would require a slightly
different protocol over HTTP.

-- 
				Jan 'Bulb' Hudec <bulb@ucw.cz>

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread
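[Sketch only: what a stateless negotiation might look like if every round
were a self-contained POST. The /git-fetch endpoint and the wire format are
purely hypothetical; the point is that the client must resend all of its
wants plus a growing set of haves each round, and only the final response
carries the pack.]

        # Round 1: state the wants and a first batch of haves.
        curl -d 'want <sha1-of-Z>' -d 'have <sha1-of-X>' \
            http://host/project/git-fetch
        # The server replies with ack/nak lines only.

        # Round 2..n: repeat all wants plus more haves until the server
        # acknowledges a common base; that response ends with the pack.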
* Re: Smart fetch via HTTP?
  2007-05-18 18:27 ` Linus Torvalds
  2007-05-18 18:33 ` alan
  2007-05-18 19:01 ` Joel Becker
@ 2007-05-19  0:50   ` david
  2007-05-19  3:58     ` Shawn O. Pearce
  2 siblings, 1 reply; 47+ messages in thread
From: david @ 2007-05-19 0:50 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Matthieu Moy, git

On Fri, 18 May 2007, Linus Torvalds wrote:

> On Thu, 17 May 2007, Matthieu Moy wrote:
>>
>> Many (if not most?) of the people working in a big company, I'd say.
>> Yeah, it sucks, but people who have used a paranoid firewall with a
>> no-less-paranoid and broken proxy understand what I mean.
>
> Well, we could try to support the git protocol over port 80..
>
> IOW, it's probably easier to try to get people to use
>
> 	git clone git://some.host:80/project
>
> and just run git-daemon on port 80, than it is to try to set up magic cgi
> scripts etc.
>
> Doing that with virtual hosts etc should be pretty trivial. Much more so
> than trying to make a git-cgi script.
>
> And yes, I do realize that in theory you can have http-aware firewalls
> that expect to see the normal http sequences in the first few packets in
> order to pass things through, but I seriously doubt it's very common.

They are actually more common than you think, and getting even more
common thanks to IE. When browsing a hostile website can allow that
website to take over the machine, demand is created for 'malware
filters' for http; to do this, the firewalls need to decode the http,
and in the process they limit you to only doing legitimate http.

It's also the case that the companies with firewalls paranoid enough to
not let you get to the git port are highly likely to be paranoid enough
to have a malware-filtering http firewall.

David Lang

^ permalink raw reply	[flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
  2007-05-19 0:50 ` david
@ 2007-05-19 3:58   ` Shawn O. Pearce
  2007-05-19 4:58     ` david
  0 siblings, 1 reply; 47+ messages in thread
From: Shawn O. Pearce @ 2007-05-19 3:58 UTC (permalink / raw)
To: david; +Cc: Linus Torvalds, Matthieu Moy, git

david@lang.hm wrote:
> When browsing a hostile website can allow that website to take over the
> machine, demand is created for 'malware filters' for http; to do this,
> the firewalls need to decode the http, and in the process they limit
> you to only doing legitimate http.
>
> It's also the case that the companies with firewalls paranoid enough to
> not let you get to the git port are highly likely to be paranoid enough
> to have a malware-filtering http firewall.

I'm behind such a filter, and fetch git.git via HTTP just to keep
my work system current with Junio. ;-)

Of course we're really really really paranoid about our firewall, but
we are also so paranoid that any web browser *except* Microsoft
Internet Explorer is thought to be a security risk and is more-or-less
banned from the network.

The kicker is some of our developers create public websites, where
testing your local webpage with Firefox and Safari is pretty much
required... but those browsers still aren't as trusted as IE and
require special clearances.  *shakes head*

We're pretty much limited to:

 *) Running the native Git protocol over SSL, where the remote system
    is answering on port 443. It may not need to be HTTP at all, but it
    probably has to smell enough like SSL to get it through the malware
    filter. Oh, what's that? The filter cannot actually filter the SSL
    data? Funny! ;-)

 *) Using a single POST upload followed by a response from the server,
    formatted with minimal HTTP headers. The real problem, as people
    have pointed out, is not the HTTP headers, but the single exchange.

One might think you could use HTTP pipelining to try to get a
bi-directional channel with the remote system, but I'm sure proxy
servers are not required to reuse the same TCP connection to the remote
HTTP server when the inside client pipelines a new request. So any sort
of hack on pipelining won't work.

If you really want a stateful exchange you have to treat HTTP as though
it were IP, but with reliable (and much more expensive) packet delivery,
and make the Git daemon keep track of the protocol state with the
client. Yes, that means that when the client suddenly goes away and
doesn't tell you he went away, you also have to garbage collect your
state. No nice messages from your local kernel. :-(

-- 
Shawn.

^ permalink raw reply	[flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
  2007-05-19 3:58 ` Shawn O. Pearce
@ 2007-05-19 4:58   ` david
  0 siblings, 0 replies; 47+ messages in thread
From: david @ 2007-05-19 4:58 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: Linus Torvalds, Matthieu Moy, git

On Fri, 18 May 2007, Shawn O. Pearce wrote:

> david@lang.hm wrote:
>> When browsing a hostile website can allow that website to take over the
>> machine, demand is created for 'malware filters' for http; to do this,
>> the firewalls need to decode the http, and in the process they limit
>> you to only doing legitimate http.
>>
>> It's also the case that the companies with firewalls paranoid enough to
>> not let you get to the git port are highly likely to be paranoid enough
>> to have a malware-filtering http firewall.
>
> I'm behind such a filter, and fetch git.git via HTTP just to keep
> my work system current with Junio. ;-)
>
> Of course we're really really really paranoid about our firewall, but
> we are also so paranoid that any web browser *except* Microsoft
> Internet Explorer is thought to be a security risk and is more-or-less
> banned from the network.
>
> The kicker is some of our developers create public websites, where
> testing your local webpage with Firefox and Safari is pretty much
> required... but those browsers still aren't as trusted as IE and
> require special clearances.  *shakes head*

This isn't paranoia, this is just bullheadedness.

> We're pretty much limited to:
>
>  *) Running the native Git protocol over SSL, where the remote system
>     is answering on port 443. It may not need to be HTTP at all, but it
>     probably has to smell enough like SSL to get it through the malware
>     filter. Oh, what's that? The filter cannot actually filter the SSL
>     data? Funny! ;-)

We're actually paranoid enough to have devices that do man-in-the-middle
decryption for some sites, and are given copies of the encryption keys
that other sites (and browsers) use, so that they can decrypt the SSL
and check it. I admit that this is far more paranoid than almost all
sites, though :-)

>  *) Using a single POST upload followed by a response from the server,
>     formatted with minimal HTTP headers. The real problem, as people
>     have pointed out, is not the HTTP headers, but the single exchange.

> If you really want a stateful exchange you have to treat HTTP as though
> it were IP, but with reliable (and much more expensive) packet delivery,
> and make the Git daemon keep track of the protocol state with the
> client. Yes, that means that when the client suddenly goes away and
> doesn't tell you he went away, you also have to garbage collect your
> state. No nice messages from your local kernel. :-(

Unfortunately you are right about this.

David Lang

^ permalink raw reply	[flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
  2007-05-17 12:40 ` Petr Baudis
  2007-05-17 12:48 ` Matthieu Moy
@ 2007-05-17 20:26   ` Jan Hudec
  2007-05-17 20:38     ` Nicolas Pitre
  1 sibling, 1 reply; 47+ messages in thread
From: Jan Hudec @ 2007-05-17 20:26 UTC (permalink / raw)
To: Petr Baudis; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 2165 bytes --]

On Thu, May 17, 2007 at 14:40:06 +0200, Petr Baudis wrote:
> On Tue, May 15, 2007 at 10:10:06PM CEST, Jan Hudec wrote:
> > - Can be run on a shared machine. If you have web space on a machine shared
> >   by many people, you can set up your own gitweb, but cannot/are not allowed
> >   to start your own network server for the git native protocol.
>
> You need to have CGI-enabled hosting, set up the CGI script etc. -
> overall, the setup is similarly complicated to a git-daemon setup, so
> it's not a "zero-setup" solution anymore.
>
> Again, I'm not sure just how many people are in the situation that
> they can run real CGI (not just PHP) but not git-daemon.

A particular case would be a group of students wanting to publish their
software project (I mean the PRG023 or equivalent). Private computers in
the hostel are not allowed to serve anything, so they'd use some of the
lab servers (eg. artax, ss1000...). All of them allow full CGI, but
running daemons is forbidden.

> > - Less things to set up. If you are setting up gitweb anyway, you'd not need
> >   to set up an additional thing for providing fetch access.
>
> Except, well, how do you "set it up"? You need to make sure
> git-update-server-info is run, yes, but that shouldn't be a problem (I'm
> not so sure if git does this for you automagically - Cogito would...).

No. If it worked similarly to git-upload-pack, only over http, it would
work without update-server-info, no?

> I think 95% of people don't set up gitweb.cgi either for their small
> HTTP repositories. :-)
>
> Then again, it's not that it would be really technically complicated -
> adding "give me a bundle" support to gitweb should be pretty easy.
> However, this support has some "social" costs as well: no compatibility
> with older git versions, support cost, confusion between the dumb HTTP
> and gitweb HTTP transports, less motivation for improving the dumb HTTP
> transport...

The dumb transport is definitely useful. Extending it to use ranges where
possible would be useful as well (and maybe more so than
upload-pack-over-http).

-- 
				Jan 'Bulb' Hudec <bulb@ucw.cz>

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
  2007-05-17 20:26 ` Jan Hudec
@ 2007-05-17 20:38   ` Nicolas Pitre
  2007-05-18 17:35     ` Jan Hudec
  0 siblings, 1 reply; 47+ messages in thread
From: Nicolas Pitre @ 2007-05-17 20:38 UTC (permalink / raw)
To: Jan Hudec; +Cc: Petr Baudis, git

On Thu, 17 May 2007, Jan Hudec wrote:

> A particular case would be a group of students wanting to publish their
> software project (I mean the PRG023 or equivalent). Private computers in
> the hostel are not allowed to serve anything, so they'd use some of the
> lab servers (eg. artax, ss1000...). All of them allow full CGI, but
> running daemons is forbidden.

And wouldn't the admin authority for those lab servers be amenable to
installing a Git daemon service? That'd be a much better solution to me.


Nicolas

^ permalink raw reply	[flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
  2007-05-17 20:38 ` Nicolas Pitre
@ 2007-05-18 17:35   ` Jan Hudec
  0 siblings, 0 replies; 47+ messages in thread
From: Jan Hudec @ 2007-05-18 17:35 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Petr Baudis, git

[-- Attachment #1: Type: text/plain, Size: 715 bytes --]

On Thu, May 17, 2007 at 16:38:41 -0400, Nicolas Pitre wrote:
> On Thu, 17 May 2007, Jan Hudec wrote:
>
> > A particular case would be a group of students wanting to publish their
> > software project (I mean the PRG023 or equivalent). Private computers in
> > the hostel are not allowed to serve anything, so they'd use some of the
> > lab servers (eg. artax, ss1000...). All of them allow full CGI, but
> > running daemons is forbidden.
>
> And wouldn't the admin authority for those lab servers be amenable to
> installing a Git daemon service? That'd be a much better solution to me.

It would. But it would really depend on the administrator's goodwill.

-- 
				Jan 'Bulb' Hudec <bulb@ucw.cz>

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread