git.vger.kernel.org archive mirror
* Smart fetch via HTTP?
@ 2007-05-15 20:10 Jan Hudec
  2007-05-15 22:30 ` A Large Angry SCM
                   ` (3 more replies)
  0 siblings, 4 replies; 47+ messages in thread
From: Jan Hudec @ 2007-05-15 20:10 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 1635 bytes --]

Hello,

Did anyone already think about fetching over HTTP working similarly to the
native git protocol?

That is rather than reading the raw content of the repository, there would be
a CGI script (could be integrated to gitweb), that would negotiate what the
client needs and then generate and send a single pack with it.

Mercurial and bzr both have this option. It would IMO have three benefits:
 - Fast access for people behind paranoid firewalls that only let http and
   https through (you can tunnel anything through, but only to port 443).
 - Can be run on a shared machine. If you have web space on a machine shared
   by many people, you can set up your own gitweb, but cannot/are not allowed
   to start your own network server for the git native protocol.
 - Fewer things to set up. If you are setting up gitweb anyway, you would not
   need to set up an additional thing to provide fetch access.

Then the question is how to implement it. The current protocol is stateful on
both sides, but the stateless nature of HTTP more or less requires the
protocol to be stateless on the server.

I think it would be possible to use basically the same protocol as now, but
make it stateless for the server. That is, the server first sends its heads,
and then the client repeatedly sends all its wants and some haves until the
server acks all of them and sends the pack.
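
Roughly, I imagine the client side of one such stateless exchange like this
(the /negotiate endpoint and the plain want/have text body are only made up
here for illustration, not an existing interface):

    import urllib.request

    URL = "http://example.org/repo.git/negotiate"   # hypothetical CGI endpoint

    def negotiate(wants, have_batches):
        # Re-send the full want list plus a growing have list until the
        # server has acked enough haves and answers with pack data.
        haves = []
        for batch in have_batches:
            haves.extend(batch)
            body = ("".join("want %s\n" % w for w in wants)
                    + "".join("have %s\n" % h for h in haves)
                    + "done\n").encode()
            req = urllib.request.Request(URL, data=body)
            reply = urllib.request.urlopen(req).read()
            if not reply.startswith(b"more"):
                return reply          # raw pack data
        return None                   # ran out of history to advertise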

Alternatively, I am thinking about using Bloom filters (somebody came up with
that idea on the bzr list when I still followed it). It might be useful, as
over HTTP we need to send as many haves as possible in one go.
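
Just to make that concrete, a client-side sketch (the filter size and the way
bit positions are derived are arbitrary choices of mine): the client POSTs the
bit array once, and the server tests its commits against it, accepting some
false positives.

    import hashlib

    M = 1 << 20          # bits in the filter (arbitrary)
    K = 4                # bit positions per commit

    def positions(sha1_hex):
        # Derive K bit positions from the object name.
        digest = hashlib.sha1(sha1_hex.encode()).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % M
                for i in range(K)]

    def make_filter(commit_sha1s):
        bits = bytearray(M // 8)
        for sha in commit_sha1s:
            for p in positions(sha):
                bits[p // 8] |= 1 << (p % 8)
        return bytes(bits)    # send this as the "haves" in one request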

-- 
						 Jan 'Bulb' Hudec <bulb@ucw.cz>

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-15 20:10 Smart fetch via HTTP? Jan Hudec
@ 2007-05-15 22:30 ` A Large Angry SCM
  2007-05-15 23:29 ` Shawn O. Pearce
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 47+ messages in thread
From: A Large Angry SCM @ 2007-05-15 22:30 UTC (permalink / raw)
  To: Jan Hudec; +Cc: git

Jan Hudec wrote:
> Hello,
> 
> Did anyone already think about fetching over HTTP working similarly to the
> native git protocol?
> 
> That is rather than reading the raw content of the repository, there would be
> a CGI script (could be integrated to gitweb), that would negotiate what the
> client needs and then generate and send a single pack with it.
> 
> Mercurial and bzr both have this option. It would IMO have three benefits:
>  - Fast access for people behind paranoid firewalls, that only let http and
>    https (you can tunel anything through, but only to port 443) through.
>  - Can be run on shared machine. If you have web space on machine shared
>    by many people, you can set up your own gitweb, but cannot/are not allowed
>    to start your own network server for git native protocol.
>  - Less things to set up. If you are setting up gitweb anyway, you'd not need
>    to set up additional thing for providing fetch access.
> 
> Than a question is how to implement it. The current protocol is stateful on
> both sides, but the stateless nature of HTTP more or less requires the
> protocol to be stateless on the server.
> 
> I think it would be possible to use basically the same protocol as now, but
> make it stateless for server. That is server first sends it's heads and than
> client repeatedly sends all it's wants and some haves until the server acks
> all of them and sends the pack.
> 
> Alternatively I am thinking about using Bloom filters (somebody came with
> such idea on the bzr list when I still followed it). It might be useful, as
> over HTTP we need to send as many haves as possible in one go.
> 

Bundles?

Client POSTs its ref set; server uses the ref set to generate and 
return the bundle.

Push over http(s) could work the same...
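
Roughly, such a CGI could just shell out to git-bundle; a sketch (the
repository path and the one-SHA-per-line POST body are made up for the
example):

    #!/usr/bin/env python
    # Sketch: client POSTs the commits it already has, one SHA-1 per line;
    # we create a bundle of everything else and stream it back.
    import os, subprocess, sys, tempfile

    REPO = "/srv/git/project.git"        # hypothetical repository path

    def main():
        haves = [line.strip() for line in sys.stdin if line.strip()]
        fd, path = tempfile.mkstemp()
        os.close(fd)
        args = ["git", "--git-dir", REPO, "bundle", "create", path, "--all"]
        args += ["^" + sha for sha in haves]   # exclude what the client has
        subprocess.check_call(args)
        sys.stdout.write("Content-Type: application/octet-stream\r\n\r\n")
        sys.stdout.flush()
        with open(path, "rb") as f:
            sys.stdout.buffer.write(f.read())
        os.unlink(path)

    if __name__ == "__main__":
        main()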

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-15 20:10 Smart fetch via HTTP? Jan Hudec
  2007-05-15 22:30 ` A Large Angry SCM
@ 2007-05-15 23:29 ` Shawn O. Pearce
  2007-05-16  0:38   ` Junio C Hamano
  2007-05-16  5:25 ` Martin Langhoff
  2007-05-17 12:40 ` Petr Baudis
  3 siblings, 1 reply; 47+ messages in thread
From: Shawn O. Pearce @ 2007-05-15 23:29 UTC (permalink / raw)
  To: Jan Hudec; +Cc: git

Jan Hudec <bulb@ucw.cz> wrote:
> Did anyone already think about fetching over HTTP working similarly to the
> native git protocol?

No work has been done on this (that I know of) but I've discussed
it to some extent with Simon 'corecode' Schubert on #git, and I
think he also brought it up on the mailing list not too long after.

I've certainly thought about adding some sort of pack-objects
frontend into gitweb.cgi for this exact purpose.  It is really
quite easy, except for the negotiation of what the client has.  ;-)
 
> Than a question is how to implement it. The current protocol is stateful on
> both sides, but the stateless nature of HTTP more or less requires the
> protocol to be stateless on the server.
> 
> I think it would be possible to use basically the same protocol as now, but
> make it stateless for server. That is server first sends it's heads and than
> client repeatedly sends all it's wants and some haves until the server acks
> all of them and sends the pack.

I think Simon was talking about doubling the number of haves the
client sends in each request.  So the client POSTs initially all
of its current refs; then current refs and their parents; then 4
commits back, then 8, etc.  The server replies to each POST request
with either a "send more please" or the packfile.
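
A client-side sketch of that doubling schedule, using git rev-list to walk
back exponentially far from each local ref (the helper and its limits are
purely illustrative):

    import subprocess

    def have_batches(refs, max_depth=1024):
        # Yield growing sets of "have" SHA-1s: 1, 2, 4, 8, ... commits deep
        # per ref, until the server is satisfied or history runs out.
        depth = 1
        while depth <= max_depth:
            haves = set()
            for ref in refs:
                out = subprocess.check_output(
                    ["git", "rev-list", "--max-count=%d" % depth, ref])
                haves.update(out.decode().split())
            yield sorted(haves)
            depth *= 2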

-- 
Shawn.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-15 23:29 ` Shawn O. Pearce
@ 2007-05-16  0:38   ` Junio C Hamano
  0 siblings, 0 replies; 47+ messages in thread
From: Junio C Hamano @ 2007-05-16  0:38 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Jan Hudec, git

"Shawn O. Pearce" <spearce@spearce.org> writes:

> Jan Hudec <bulb@ucw.cz> wrote:
>> Did anyone already think about fetching over HTTP working similarly to the
>> native git protocol?
>
> No work has been done on this (that I know of) but I've discussed
> it to some extent with Simon 'corecode' Schubert on #git, and I
> think he also brought it up on the mailing list not too long after.
>
> I've certainly thought about adding some sort of pack-objects
> frontend into gitweb.cgi for this exact purpose.  It is really
> quite easy, except for the negotation of what the client has.  ;-)
>  
>> Than a question is how to implement it. The current protocol is stateful on
>> both sides, but the stateless nature of HTTP more or less requires the
>> protocol to be stateless on the server.
>> 
>> I think it would be possible to use basically the same protocol as now, but
>> make it stateless for server. That is server first sends it's heads and than
>> client repeatedly sends all it's wants and some haves until the server acks
>> all of them and sends the pack.
>
> I think Simon was talking about doubling the number of haves the
> client sends in each request.  So the client POSTs initially all
> of its current refs; then current refs and their parents; then 4
> commits back, then 8, etc.  The server replies to each POST request
> with either a "send more please" or the packfile.

I kinda' like the bundle suggestion ;-)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-15 20:10 Smart fetch via HTTP? Jan Hudec
  2007-05-15 22:30 ` A Large Angry SCM
  2007-05-15 23:29 ` Shawn O. Pearce
@ 2007-05-16  5:25 ` Martin Langhoff
  2007-05-16 11:33   ` Johannes Schindelin
  2007-05-17 12:40 ` Petr Baudis
  3 siblings, 1 reply; 47+ messages in thread
From: Martin Langhoff @ 2007-05-16  5:25 UTC (permalink / raw)
  To: Jan Hudec; +Cc: git

On 5/16/07, Jan Hudec <bulb@ucw.cz> wrote:
> Did anyone already think about fetching over HTTP working similarly to the
> native git protocol?

Do the indexes have enough info to use them with http ranges? It'd be
chunkier than a smart protocol, but it'd still work with dumb servers.

cheers,


m

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-16  5:25 ` Martin Langhoff
@ 2007-05-16 11:33   ` Johannes Schindelin
  2007-05-16 21:26     ` Martin Langhoff
  0 siblings, 1 reply; 47+ messages in thread
From: Johannes Schindelin @ 2007-05-16 11:33 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Jan Hudec, git

Hi,

On Wed, 16 May 2007, Martin Langhoff wrote:

> On 5/16/07, Jan Hudec <bulb@ucw.cz> wrote:
> > Did anyone already think about fetching over HTTP working similarly to the
> > native git protocol?
> 
> Do the indexes have enough info to use them with http ranges? It'd be
> chunkier than a smart protocol, but it'd still work with dumb servers.

It would not be really performant, would it? Besides, not all Web servers 
speak HTTP/1.1...

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-16 11:33   ` Johannes Schindelin
@ 2007-05-16 21:26     ` Martin Langhoff
  2007-05-16 21:54       ` Jakub Narebski
  2007-05-17  0:52       ` Johannes Schindelin
  0 siblings, 2 replies; 47+ messages in thread
From: Martin Langhoff @ 2007-05-16 21:26 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Jan Hudec, git

On 5/16/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> On Wed, 16 May 2007, Martin Langhoff wrote:
> > Do the indexes have enough info to use them with http ranges? It'd be
> > chunkier than a smart protocol, but it'd still work with dumb servers.
> It would not be really performant, would it? Besides, not all Web servers
> speak HTTP/1.1...

Performant compared to downloading a huge packfile to get 10% of it?
Sure! It'd probably take a few trips, and you'd end up fetching 20% of
the file, still better than 100%.

> Besides, not all Web servers speak HTTP/1.1...

Are there any interesting webservers out there that don't? Hand-rolled,
purpose-built webservers often don't, but those don't serve files, they
serve web apps. When it comes to serving files, any webserver that is
still supported (security-wise) these days speaks HTTP/1.1.

And for services like SF.net it'd be a safe, low-cpu way of serving git
files, because the git protocol is quite expensive server-side (io+cpu),
as we've seen with kernel.org. Being really smart with a cgi is
probably going to be expensive too.

cheers,


m

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-16 21:26     ` Martin Langhoff
@ 2007-05-16 21:54       ` Jakub Narebski
  2007-05-17  0:52       ` Johannes Schindelin
  1 sibling, 0 replies; 47+ messages in thread
From: Jakub Narebski @ 2007-05-16 21:54 UTC (permalink / raw)
  To: git

Martin Langhoff wrote:

> On 5/16/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
>> On Wed, 16 May 2007, Martin Langhoff wrote:
>> > Do the indexes have enough info to use them with http ranges? It'd be
>> > chunkier than a smart protocol, but it'd still work with dumb servers.
>> It would not be really performant, would it? Besides, not all Web servers
>> speak HTTP/1.1...
> 
> Performant compared to downloading a huge packfile to get 10% of it?
> Sure! It'd probably take a few trips, and you'd end up fetching 20% of
> the file, still better than 100%.

That's why you should have something akin to a backup policy for pack files,
like daily packs, weekly packs, ..., and the rest, just for the dumb
protocols.
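
For example, a periodic job along these lines would keep small incremental
packs around for the dumb transports (the path and policy here are only an
illustration, not a recommendation):

    import subprocess

    GIT_DIR = "/srv/git/project.git"     # hypothetical repository path

    def incremental_repack():
        # Without -a, git repack only puts loose objects into a new pack,
        # leaving the existing (older, larger) packs alone.
        subprocess.check_call(["git", "--git-dir", GIT_DIR, "repack", "-d"])
        # Refresh objects/info/packs and info/refs for dumb HTTP clients.
        subprocess.check_call(
            ["git", "--git-dir", GIT_DIR, "update-server-info"])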

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-16 21:26     ` Martin Langhoff
  2007-05-16 21:54       ` Jakub Narebski
@ 2007-05-17  0:52       ` Johannes Schindelin
  2007-05-17  1:03         ` Shawn O. Pearce
  2007-05-17 11:28         ` Matthieu Moy
  1 sibling, 2 replies; 47+ messages in thread
From: Johannes Schindelin @ 2007-05-17  0:52 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Jan Hudec, git

Hi,

On Thu, 17 May 2007, Martin Langhoff wrote:

> On 5/16/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > On Wed, 16 May 2007, Martin Langhoff wrote:
> > > Do the indexes have enough info to use them with http ranges? It'd be
> > > chunkier than a smart protocol, but it'd still work with dumb servers.
> > It would not be really performant, would it? Besides, not all Web servers
> > speak HTTP/1.1...
> 
> Performant compared to downloading a huge packfile to get 10% of it?
> Sure! It'd probably take a few trips, and you'd end up fetching 20% of
> the file, still better than 100%.

Don't forget that those 10% probably do not do you the favour of being in 
large chunks. Chances are that _every_ _single_ wanted object is separate 
from the others.

> > Besides, not all Web servers speak HTTP/1.1...
> 
> Are there any interesting webservers out there that don't? Hand-rolled 
> purpose-built webservers often don't but those don't serve files, they 
> serve web apps. When it comes to serving files, any webserver that is 
> supported (security-wise) these days is HTTP/1.1.
> 
> And for services like SF.net it'd be a safe low-cpu way of serving git
> files. 'cause the git protocol is quite expensive server-side (io+cpu)
> as we've seen with kernel.org. Being really smart with a cgi is
> probably going to be expensive too.

It's probably better and faster than relying on a feature which does not 
exactly help.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-17  0:52       ` Johannes Schindelin
@ 2007-05-17  1:03         ` Shawn O. Pearce
  2007-05-17  1:04           ` david
  2007-05-17  3:45           ` Nicolas Pitre
  2007-05-17 11:28         ` Matthieu Moy
  1 sibling, 2 replies; 47+ messages in thread
From: Shawn O. Pearce @ 2007-05-17  1:03 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Martin Langhoff, Jan Hudec, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> Don't forget that those 10% probably do not do you the favour to be in 
> large chunks. Chances are that _every_ _single_ wanted object is separate 
> from the others.

That's completely possible, assuming the objects are even packed
in the first place.  It's very unlikely that you would be able to
fetch a very large range from an existing packfile; you would be
submitting most of your range requests for very, very small sections.
 
> > And for services like SF.net it'd be a safe low-cpu way of serving git
> > files. 'cause the git protocol is quite expensive server-side (io+cpu)
> > as we've seen with kernel.org. Being really smart with a cgi is
> > probably going to be expensive too.
> 
> It's probably better and faster than relying on a feature which does not 
> exactly help.

Yes.  Packing more often and pack v4 may help a lot there.

The other thing is kernel.org should really try to encourage the
folks with repositories there to try and share against one master
repository, so the poor OS has a better chance at holding the bulk
of linux-2.6.git in buffer cache.

I'm not suggesting they share specifically against Linus' repository;
maybe hpa and the other admins can host one separately from Linus and
encourage users to use that repository when on a system they maintain.

In an SF.net type case this doesn't help however.  Most of SF.net
is tiny projects with very few, if any, developers.  Hence most
of that is going to be unsharable, infrequently accessed, and uh,
not needed to be stored in buffer cache.  For the few projects that
are hosted there that have a large developer base they could use
a shared repository approach as I just suggested for kernel.org.

aka the "forks" thing in gitweb, and on repo.or.cz...

-- 
Shawn.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-17  1:03         ` Shawn O. Pearce
@ 2007-05-17  1:04           ` david
  2007-05-17  1:26             ` Shawn O. Pearce
  2007-05-17  3:45           ` Nicolas Pitre
  1 sibling, 1 reply; 47+ messages in thread
From: david @ 2007-05-17  1:04 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Johannes Schindelin, Martin Langhoff, Jan Hudec, git

On Wed, 16 May 2007, Shawn O. Pearce wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
>
>>> And for services like SF.net it'd be a safe low-cpu way of serving git
>>> files. 'cause the git protocol is quite expensive server-side (io+cpu)
>>> as we've seen with kernel.org. Being really smart with a cgi is
>>> probably going to be expensive too.
>>
>> It's probably better and faster than relying on a feature which does not
>> exactly help.
>
> Yes.  Packing more often and pack v4 may help a lot there.
>
> The other thing is kernel.org should really try to encourage the
> folks with repositories there to try and share against one master
> repository, so the poor OS has a better chance at holding the bulk
> of linux-2.6.git in buffer cache.

do you mean more precisely share against one object store or do you really 
mean repository?

David Lang

> I'm not suggesting they share specifically against Linus' repository;
> maybe hpa and the other admins can host one seperately from Linus and
> enourage users to use that repository when on a system they maintain.
>
> In an SF.net type case this doesn't help however.  Most of SF.net
> is tiny projects with very few, if any, developers.  Hence most
> of that is going to be unsharable, infrequently accessed, and uh,
> not needed to be stored in buffer cache.  For the few projects that
> are hosted there that have a large developer base they could use
> a shared repository approach as I just suggested for kernel.org.
>
> aka the "forks" thing in gitweb, and on repo.or.cz...
>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-17  1:04           ` david
@ 2007-05-17  1:26             ` Shawn O. Pearce
  2007-05-17  1:45               ` Shawn O. Pearce
  0 siblings, 1 reply; 47+ messages in thread
From: Shawn O. Pearce @ 2007-05-17  1:26 UTC (permalink / raw)
  To: david; +Cc: Johannes Schindelin, Martin Langhoff, Jan Hudec, git

david@lang.hm wrote:
> On Wed, 16 May 2007, Shawn O. Pearce wrote:
> >
> >The other thing is kernel.org should really try to encourage the
> >folks with repositories there to try and share against one master
> >repository, so the poor OS has a better chance at holding the bulk
> >of linux-2.6.git in buffer cache.
> 
> do you mean more precisely share against one object store or do you really 
> mean repository?

Sorry, I did mean "object store".  ;-)

Repository is insanity, as the refs and tags namespaces are suddenly
shared.  What a nightmare that would become.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-17  1:26             ` Shawn O. Pearce
@ 2007-05-17  1:45               ` Shawn O. Pearce
  2007-05-17 12:36                 ` Theodore Tso
  0 siblings, 1 reply; 47+ messages in thread
From: Shawn O. Pearce @ 2007-05-17  1:45 UTC (permalink / raw)
  To: david; +Cc: Johannes Schindelin, Martin Langhoff, Jan Hudec, git

"Shawn O. Pearce" <spearce@spearce.org> wrote:
> david@lang.hm wrote:
> > On Wed, 16 May 2007, Shawn O. Pearce wrote:
> > >
> > >The other thing is kernel.org should really try to encourage the
> > >folks with repositories there to try and share against one master
> > >repository, so the poor OS has a better chance at holding the bulk
> > >of linux-2.6.git in buffer cache.
> > 
> > do you mean more precisely share against one object store or do you really 
> > mean repository?
> 
> Sorry, I did mean "object store".  ;-)

And even there, I don't mean symlink objects to a shared database,
I mean use the objects/info/alternates file to point to the shared,
read-only location.
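
Setting that up is a single line in each borrowing repository; for example
(the paths here are of course made up):

    # Make a fork borrow objects from a shared, read-only object store
    # instead of keeping its own copies.
    import os

    repo = "/home/someuser/linux-2.6.git"                 # hypothetical fork
    shared = "/pub/scm/shared/linux-2.6.git/objects"      # hypothetical shared store

    info_dir = os.path.join(repo, "objects", "info")
    os.makedirs(info_dir, exist_ok=True)
    with open(os.path.join(info_dir, "alternates"), "w") as f:
        f.write(shared + "\n")    # git now also looks up objects there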

It's not perfect.  The hotter parts of the object database are almost
always the recent stuff, as that's what people are actively trying
to fetch, or are using as a base when they are trying to fetch from
someone else.  The hotter parts are also probably too new to be
in the shared store offered by kernel.org admins, which means you
cannot get good IO buffering.  Back to the current set of problems.

A single shared object directory that everyone can write new files
into, but cannot modify or delete from, would help that problem quite
a bit.  But it opens up huge problems about pruning, as there is no
way to perform garbage collection on that database without scanning
every ref on the system, and that's just not simply possible on a
busy system like kernel.org.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-17  1:03         ` Shawn O. Pearce
  2007-05-17  1:04           ` david
@ 2007-05-17  3:45           ` Nicolas Pitre
  2007-05-17 10:48             ` Johannes Schindelin
  1 sibling, 1 reply; 47+ messages in thread
From: Nicolas Pitre @ 2007-05-17  3:45 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Johannes Schindelin, Martin Langhoff, Jan Hudec, git

On Wed, 16 May 2007, Shawn O. Pearce wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > Don't forget that those 10% probably do not do you the favour to be in 
> > large chunks. Chances are that _every_ _single_ wanted object is separate 
> > from the others.
> 
> That's completely possible.  Assuming the objects even are packed
> in the first place.  Its very unlikely that you would be able to
> fetch very large of a range from an existing packfile, you would be
> submitting most of your range requests for very very small sections.

Well, in the commit objects case you're likely to have a bunch of them 
all contiguous.

For tree and blob objects it is less likely.

And of course there is the question of deltas for which you might or 
might not have the base object locally already.

Still... I wonder if this could actually be workable.  A typical daily 
update on the Linux kernel repository might consist of a couple of hundred 
or a few thousand objects.  It could still be faster to fetch parts of 
a pack than the whole pack if the size difference is above a certain 
threshold.  It is certainly not worse than fetching loose objects.

Things would be pretty horrid if you think of fetching a commit object, 
parsing it to find out what tree object to fetch, then parse that tree 
object to find out what other objects to fetch, and so on.

But if you only take the approach of fetching the pack index files, 
finding out about the objects that the remote has that are not available 
locally, and then fetching all those objects from within pack files 
without even looking at them (except for deltas), then it should be 
possible to issue a couple of requests in parallel and possibly get decent 
performance.  And if it turns out that more than, say, 70% of a 
particular pack is to be fetched (you can determine that up front), then 
it might be decided to fetch the whole pack.
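
The "whole pack or just ranges" decision itself is cheap once you know which
objects of a remote pack you are missing and how large they are; roughly
(the data layout here is only illustrative):

    def plan_fetch(pack_size, missing, threshold=0.70):
        # missing: list of (offset, size) pairs for objects we lack in this
        # pack.  Returns ("whole", None) or ("ranges", [(start, end), ...]).
        wanted = sum(size for _, size in missing)
        if wanted > threshold * pack_size:
            return ("whole", None)
        # Coalesce adjacent or overlapping regions so that the number of
        # ranges in the request stays small.
        ranges = []
        for off, size in sorted(missing):
            if ranges and off <= ranges[-1][1]:
                ranges[-1] = (ranges[-1][0], max(ranges[-1][1], off + size))
            else:
                ranges.append((off, off + size))
        return ("ranges", ranges)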

There is no way to sensibly keep those objects packed on the receiving 
end of course, but storing them as loose objects and repacking them 
afterwards should be just fine.

Of course you'll get objects from branches in the remote repository you 
might not be interested in, but that's a price to pay for such a hack.  
On average the overhead shouldn't be that big anyway if branches within 
a repository are somewhat related.

I think this is something worth experimenting with.


Nicolas

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-17  3:45           ` Nicolas Pitre
@ 2007-05-17 10:48             ` Johannes Schindelin
  2007-05-17 14:41               ` Nicolas Pitre
  0 siblings, 1 reply; 47+ messages in thread
From: Johannes Schindelin @ 2007-05-17 10:48 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Shawn O. Pearce, Martin Langhoff, Jan Hudec, git

Hi,

On Wed, 16 May 2007, Nicolas Pitre wrote:

> Still... I wonder if this could be actually workable.  A typical daily 
> update on the Linux kernel repository might consist of a couple hundreds 
> or a few tousands objects.  This could still be faster to fetch parts of 
> a pack than the whole pack if the size difference is above a certain 
> treshold.  It is certainly not worse than fetching loose objects.
> 
> Things would be pretty horrid if you think of fetching a commit object, 
> parsing it to find out what tree object to fetch, then parse that tree 
> object to find out what other objects to fetch, and so on.
> 
> But if you only take the approach of fetching the pack index files, 
> finding out about the objects that the remote has that are not available 
> locally, and then fetching all those objects from within pack files 
> without even looking at them (except for deltas), then it should be 
> possible to issue a couple requests in parallel and possibly have decent 
> performances.  And if it turns out that more than, say, 70% of a 
> particular pack is to be fetched (you can determine that up front), then 
> it might be decided to fetch the whole pack.
> 
> There is no way to sensibly keep those objects packed on the receiving 
> end of course, but storing them as loose objects and repacking them 
> afterwards should be just fine.
> 
> Of course you'll get objects from branches in the remote repository you 
> might not be interested in, but that's a price to pay for such a hack.  
> On average the overhead shouldn't be that big anyway if branches within 
> a repository are somewhat related.
> 
> I think this is something worth experimenting.

I am a bit wary about that, because it is so complex. IMHO a cgi which 
gets, say, up to a hundred refs (maybe something like ref~0, ref~1, ref~2, 
ref~4, ref~8, ref~16, ... for the refs), and then makes a bundle for that 
case on the fly, is easier to do.

Of course, as with all cgi scripts, you have to make sure that DOS attacks 
have a low probability of success.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-17  0:52       ` Johannes Schindelin
  2007-05-17  1:03         ` Shawn O. Pearce
@ 2007-05-17 11:28         ` Matthieu Moy
  2007-05-17 13:10           ` Martin Langhoff
  1 sibling, 1 reply; 47+ messages in thread
From: Matthieu Moy @ 2007-05-17 11:28 UTC (permalink / raw)
  To: git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> Hi,
>
> On Thu, 17 May 2007, Martin Langhoff wrote:
>
>> On 5/16/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
>> > On Wed, 16 May 2007, Martin Langhoff wrote:
>> > > Do the indexes have enough info to use them with http ranges? It'd be
>> > > chunkier than a smart protocol, but it'd still work with dumb servers.
>> > It would not be really performant, would it? Besides, not all Web servers
>> > speak HTTP/1.1...
>> 
>> Performant compared to downloading a huge packfile to get 10% of it?
>> Sure! It'd probably take a few trips, and you'd end up fetching 20% of
>> the file, still better than 100%.
>
> Don't forget that those 10% probably do not do you the favour to be in 
> large chunks. Chances are that _every_ _single_ wanted object is separate 
> from the others.

FYI, bzr uses HTTP range requests, and the introduction of this
feature led to a significant performance improvement for them (bzr is
more dumb-protocol oriented than git is, so that's really important
there). They have this "index file+data file" system too, so you
download the full index file, and then send an HTTP range request to
get only the relevant parts of the data file.

The thing is, AIUI, they don't send N range requests to get N chunks,
but one HTTP request asking for the N ranges at once, and get the N
chunks as a whole (IIRC, a kind of MIME-encoded response from the
server). So, you pay the price of a longer HTTP request, but not the
price of N network round-trips.
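
For what it's worth, such a multi-range request is plain HTTP/1.1; a sketch
of what a git client could send for a pack file (the URL and offsets are
made up, and the 206 reply comes back as a multipart/byteranges body):

    import http.client

    def fetch_ranges(host, path, ranges):
        # Fetch several byte ranges of one file in a single request.
        spec = ",".join("%d-%d" % (start, end - 1) for start, end in ranges)
        conn = http.client.HTTPConnection(host)
        conn.request("GET", path, headers={"Range": "bytes=" + spec})
        resp = conn.getresponse()          # expect 206 Partial Content
        return resp.status, resp.read()    # multipart body still to be split

    # e.g. fetch_ranges("example.org",
    #                   "/repo.git/objects/pack/pack-1234.pack",
    #                   [(12, 345), (4096, 5000)])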

That's surely not as efficient as anything smart on the server, but
might really help for the cases where the server is /not/ smart.

-- 
Matthieu

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-17  1:45               ` Shawn O. Pearce
@ 2007-05-17 12:36                 ` Theodore Tso
  0 siblings, 0 replies; 47+ messages in thread
From: Theodore Tso @ 2007-05-17 12:36 UTC (permalink / raw)
  To: Shawn O. Pearce
  Cc: david, Johannes Schindelin, Martin Langhoff, Jan Hudec, git

On Wed, May 16, 2007 at 09:45:42PM -0400, Shawn O. Pearce wrote:
> Its not perfect.  The hotter parts of the object database is almost
> always the recent stuff, as that's what people are actively trying
> to fetch, or are using as a base when they are trying to fetch from
> someone else.  The hotter parts are also probably too new to be
> in the shared store offered by kernel.org admins, which means you
> cannot get good IO buffering.  Back to the current set of problems.

Actually, as long as objects/info/alternates is pointing at Linus's
kernel.org tree, I would think that it should work relatively well,
since everyone is normally basing their work on top of his tree as a
starting point.

						- Ted

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-15 20:10 Smart fetch via HTTP? Jan Hudec
                   ` (2 preceding siblings ...)
  2007-05-16  5:25 ` Martin Langhoff
@ 2007-05-17 12:40 ` Petr Baudis
  2007-05-17 12:48   ` Matthieu Moy
  2007-05-17 20:26   ` Jan Hudec
  3 siblings, 2 replies; 47+ messages in thread
From: Petr Baudis @ 2007-05-17 12:40 UTC (permalink / raw)
  To: Jan Hudec; +Cc: git

  Hi,

On Tue, May 15, 2007 at 10:10:06PM CEST, Jan Hudec wrote:
> Did anyone already think about fetching over HTTP working similarly to the
> native git protocol?
> 
> That is rather than reading the raw content of the repository, there would be
> a CGI script (could be integrated to gitweb), that would negotiate what the
> client needs and then generate and send a single pack with it.

  frankly, I'm not that excited. I'm not disputing that this would be
useful, but I have my doubts about just how *much* use it would be - I'm
not so sure the set of users affected is really all that large. So I'm
just cooling people down here. ;-))

> Mercurial and bzr both have this option. It would IMO have three benefits:
>  - Fast access for people behind paranoid firewalls, that only let http and
>    https (you can tunel anything through, but only to port 443) through.

  How many users really have this problem? I'm not so sure. There are
certainly some, but enough for this to be a viable argument?

>  - Can be run on shared machine. If you have web space on machine shared
>    by many people, you can set up your own gitweb, but cannot/are not allowed
>    to start your own network server for git native protocol.

  You need to have CGI-enabled hosting, set up the CGI script etc. -
overall, the setup is about as complicated as a git-daemon setup, so
it's not a "zero-setup" solution anymore.

  Again, I'm not sure just how many people are in the situation that
they can run real CGI (not just PHP) but not git-daemon.

>  - Less things to set up. If you are setting up gitweb anyway, you'd not need
>    to set up additional thing for providing fetch access.

  Except, well, how do you "set it up"? You need to make sure
git-update-server-info is run, yes, but that shouldn't be a problem (I'm
not so sure if git does this for you automagically - Cogito would...).

  I think 95% of people don't set up gitweb.cgi either for their small
HTTP repositories. :-)

  Then again, it's not that it would be really technically complicated -
adding "give me a bundle" support to gitweb should be pretty easy.
However, this support has some "social" costs as well: no compatibility
with older git versions, support cost, confusion between dumb HTTP and
gitweb HTTP transports, more lack of motivation for improving dumb HTTP
transport...

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Ever try. Ever fail. No matter. // Try again. Fail again. Fail better.
		-- Samuel Beckett

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-17 12:40 ` Petr Baudis
@ 2007-05-17 12:48   ` Matthieu Moy
  2007-05-18 18:27     ` Linus Torvalds
  2007-05-17 20:26   ` Jan Hudec
  1 sibling, 1 reply; 47+ messages in thread
From: Matthieu Moy @ 2007-05-17 12:48 UTC (permalink / raw)
  To: git

Petr Baudis <pasky@suse.cz> writes:

>> Mercurial and bzr both have this option. It would IMO have three benefits:
>>  - Fast access for people behind paranoid firewalls, that only let http and
>>    https (you can tunel anything through, but only to port 443) through.
>
>   How many users really have this problem? I'm not so sure.

Many (if not most?) of the people working in a big company, I'd say.
Yeah, it sucks, but people who have used a paranoid firewall with a
no-less-paranoid and broken proxy understand what I mean.

>>  - Can be run on shared machine. If you have web space on machine shared
>>    by many people, you can set up your own gitweb, but cannot/are not allowed
>>    to start your own network server for git native protocol.
>
>   You need to have CGI-enabled hosting, set up the CGI script etc. -
> overally, the setup is similarly complicated as git-daemon setup, so
> it's not "zero-setup" solution anymore.
>
>   Again, I'm not sure just how many people are in the situation that
> they can run real CGI (not just PHP) but not git-daemon.

Any volunteer to write a full-PHP version of git? ;-)

-- 
Matthieu

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-17 11:28         ` Matthieu Moy
@ 2007-05-17 13:10           ` Martin Langhoff
  2007-05-17 13:47             ` Johannes Schindelin
  0 siblings, 1 reply; 47+ messages in thread
From: Martin Langhoff @ 2007-05-17 13:10 UTC (permalink / raw)
  To: git

On 5/17/07, Matthieu Moy <Matthieu.Moy@imag.fr> wrote:
> FYI, bzr uses HTTP range requests, and the introduction of this
> feature lead to significant performance improvement for them (bzr is
> more dumb-protocol oriented than git is, so that's really important
> there). They have this "index file+data file" system too, so you
> download the full index file, and then send an HTTP range request to
> get only the relevant parts of the data file.

That's the kind of thing I was imagining. Between the index and an
additional "index-supplement-for-dumb-protocols" maintained by
update-server-info, http ranges can be bent to our evil purposes.

Of course it won't be as network-efficient as the git proto, or even
as the git-over-cgi proto, but it'll surely be server-cpu-and-memory
efficient. And people will benefit from it without having to do any
additional setup.

It might be hard to come up with a usable approach to http ranges. But
I do think it's worth considering carefully.

cheers,



m

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-17 13:10           ` Martin Langhoff
@ 2007-05-17 13:47             ` Johannes Schindelin
  2007-05-17 14:05               ` Matthieu Moy
                                 ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Johannes Schindelin @ 2007-05-17 13:47 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: git

Hi,

[I missed this mail, because Matthieu culled the Cc list again]

On Fri, 18 May 2007, Martin Langhoff wrote:

> On 5/17/07, Matthieu Moy <Matthieu.Moy@imag.fr> wrote:
>
> > FYI, bzr uses HTTP range requests, and the introduction of this
> > feature lead to significant performance improvement for them (bzr is
> > more dumb-protocol oriented than git is, so that's really important
> > there). They have this "index file+data file" system too, so you
> > download the full index file, and then send an HTTP range request to
> > get only the relevant parts of the data file.
> 
> That's the kind of thing I was imagining. Between the index and an
> additional "index-supplement-for-dumb-protocols" maintained by
> update-server-info, http ranges can be bent to our evil purposes.
> 
> Of course it won't be as network-efficient as the git proto, or even
> as the git-over-cgi proto, but it'll surely be server-cpu-and-memory
> efficient. And people will benefit from it without having to do any
> additional setup.

Of course, the problem is that only the server can know beforehand which 
objects are needed. Imagine this:

X - Y - Z
  \
    A


Client has "X", wants "Z", but not "A". Client needs "Y" and "Z". But 
client cannot know that it needs "Y" before getting "Z", except if the 
server says so.

If you have a solution for that problem, please enlighten me: I don't.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-17 13:47             ` Johannes Schindelin
@ 2007-05-17 14:05               ` Matthieu Moy
  2007-05-17 14:09               ` Martin Langhoff
  2007-05-17 14:50               ` Nicolas Pitre
  2 siblings, 0 replies; 47+ messages in thread
From: Matthieu Moy @ 2007-05-17 14:05 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Martin Langhoff, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> Hi,
>
> [I missed this mail, because Matthieu culled the Cc list again]

Sorry about that, a misconfiguration of my mailer. I didn't find the time
to fix it before.

OTOH, since most people actually complain when you Cc them on a
mailing list, the choice "To Cc or not to Cc" has no universal
solution ;-).

-- 
Matthieu

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-17 13:47             ` Johannes Schindelin
  2007-05-17 14:05               ` Matthieu Moy
@ 2007-05-17 14:09               ` Martin Langhoff
  2007-05-17 15:01                 ` Nicolas Pitre
  2007-05-17 23:14                 ` Jakub Narebski
  2007-05-17 14:50               ` Nicolas Pitre
  2 siblings, 2 replies; 47+ messages in thread
From: Martin Langhoff @ 2007-05-17 14:09 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git

On 5/18/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> If you have a solution for that problem, please enlighten me: I don't.

Ok - worst case scenario - have a minimal hints file that tells me the
ranges to fetch all commits and all trees. To reduce that, add to the
hints file data naming the hashes (or even better - offsets) for the
delta chains that contain commits+trees relevant to all the heads -
minus 10, 20, 30, 40 commits and 1, 2, 4, 8 and 16 days.

So there's a good chance the client can get the commits+trees needed
efficiently. For blobs, all you need is the index to mark the delta
chains you need.

cheers,


m

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-17 10:48             ` Johannes Schindelin
@ 2007-05-17 14:41               ` Nicolas Pitre
  2007-05-17 15:24                 ` Martin Langhoff
  2007-05-17 20:04                 ` Jan Hudec
  0 siblings, 2 replies; 47+ messages in thread
From: Nicolas Pitre @ 2007-05-17 14:41 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Shawn O. Pearce, Martin Langhoff, Jan Hudec, git

[-- Attachment #1: Type: TEXT/PLAIN, Size: 4719 bytes --]

On Thu, 17 May 2007, Johannes Schindelin wrote:

> Hi,
> 
> On Wed, 16 May 2007, Nicolas Pitre wrote:
> 
> > Still... I wonder if this could be actually workable.  A typical daily 
> > update on the Linux kernel repository might consist of a couple hundreds 
> > or a few tousands objects.  This could still be faster to fetch parts of 
> > a pack than the whole pack if the size difference is above a certain 
> > treshold.  It is certainly not worse than fetching loose objects.
> > 
> > Things would be pretty horrid if you think of fetching a commit object, 
> > parsing it to find out what tree object to fetch, then parse that tree 
> > object to find out what other objects to fetch, and so on.
> > 
> > But if you only take the approach of fetching the pack index files, 
> > finding out about the objects that the remote has that are not available 
> > locally, and then fetching all those objects from within pack files 
> > without even looking at them (except for deltas), then it should be 
> > possible to issue a couple requests in parallel and possibly have decent 
> > performances.  And if it turns out that more than, say, 70% of a 
> > particular pack is to be fetched (you can determine that up front), then 
> > it might be decided to fetch the whole pack.
> > 
> > There is no way to sensibly keep those objects packed on the receiving 
> > end of course, but storing them as loose objects and repacking them 
> > afterwards should be just fine.
> > 
> > Of course you'll get objects from branches in the remote repository you 
> > might not be interested in, but that's a price to pay for such a hack.  
> > On average the overhead shouldn't be that big anyway if branches within 
> > a repository are somewhat related.
> > 
> > I think this is something worth experimenting.
> 
> I am a bit wary about that, because it is so complex. IMHO a cgi which 
> gets, say, up to a hundred refs (maybe something like ref~0, ref~1, ref~2, 
> ref~4, ref~8, ref~16, ... for the refs), and then makes a bundle for that 
> case on the fly, is easier to do.

And if you have 1) the permission and 2) the CPU power to execute such a 
cgi on the server and obviously 3) the knowledge to set it up properly, 
then why aren't you running the Git daemon in the first place?  After 
all, they both boil down to running git-pack-objects and sending out the 
result.  I don't think such a solution really buys much.

On the other hand, if the client does all the work and provides the 
server with a list of ranges within a pack it wants to be sent, then you 
simply have zero special setup to perform on the hosting server and you 
keep the server load down due to not running pack-objects there.  That, 
at least, is different enough from the Git daemon to be worth 
considering.  Not only does it provide an advantage to those who cannot 
do anything but http out of their segregated network, but it also 
provides many advantages on the server side too, while the cgi approach 
doesn't.

And actually finding out the list of objects the remote has that you 
don't have is not that complex.  It could go as follows:

1) Fetch every .idx files the remote has.

2) From those .idx files, keep only a list of objects that are unknown 
   locally.  A good starting point for doing this really efficiently is 
   the code for git-pack-redundant.

3) From the .idx files we got in (1), create a reverse index to get each 
   object's size in the remote pack.  The code to do this already exists 
   in builtin-pack-objects.c.

4) With the list of missing objects from (2) along with their offset and 
   size within a given pack file, fetch those objects from the remote 
   server.  Either perform multiple requests in parallel, or as someone 
   mentioned already, provide the server with a list of ranges you want 
   to be sent.

5) Store the received objects as loose objects locally.  If a given 
   object is a delta, verify if its base is available locally, or if it 
   is listed amongst those objects to be fetched from the server.  If 
   not, add it to the list.  In most cases, delta base objects will be 
   objects already listed to be fetched anyway.  To greatly simplify 
   things, the loose delta object type from 2 years ago could be revived 
   (commit 91d7b8afc2) since a repack will get rid of them.

6) Repeat (4) and (5) until everything has been fetched.

7) Run git-pack-objects with the list of fetched objects.

Et voilà.  Oh, and of course update your local refs from the remote's.

Actually there is nothing really complex in the above operations. And 
with this the server side remains really simple with no special setup 
nor extra load beyond the simple serving of file content.
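
To make steps (1) to (3) concrete, here is a rough sketch of reading a
version-1 .idx file to recover each object's offset and on-disk size in the
corresponding pack (error handling is omitted, and the pack size would come
from e.g. an HTTP HEAD request):

    import struct

    def parse_idx_v1(idx_data, pack_size):
        # Returns {sha1_hex: (offset, size_in_pack)}.  A v1 index is a
        # 256-entry fanout table followed by (4-byte offset, 20-byte SHA-1)
        # entries; the last 20 bytes of the pack itself are its checksum.
        fanout = struct.unpack(">256I", idx_data[:1024])
        nr_objects = fanout[255]
        entries = []
        pos = 1024
        for _ in range(nr_objects):
            offset, = struct.unpack(">I", idx_data[pos:pos + 4])
            sha1 = idx_data[pos + 4:pos + 24].hex()
            entries.append((offset, sha1))
            pos += 24
        entries.sort()                     # the "reverse index", by offset
        sizes = {}
        for i, (offset, sha1) in enumerate(entries):
            end = entries[i + 1][0] if i + 1 < len(entries) else pack_size - 20
            sizes[sha1] = (offset, end - offset)
        return sizes

    # Any SHA-1 in here that is not known locally becomes a candidate byte
    # range to fetch; deltas whose base is missing pull the base in as well.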


Nicolas

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-17 13:47             ` Johannes Schindelin
  2007-05-17 14:05               ` Matthieu Moy
  2007-05-17 14:09               ` Martin Langhoff
@ 2007-05-17 14:50               ` Nicolas Pitre
  2 siblings, 0 replies; 47+ messages in thread
From: Nicolas Pitre @ 2007-05-17 14:50 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Martin Langhoff, git

On Thu, 17 May 2007, Johannes Schindelin wrote:

> Hi,
> 
> [I missed this mail, because Matthieu culled the Cc list again]
> 
> On Fri, 18 May 2007, Martin Langhoff wrote:
> 
> > On 5/17/07, Matthieu Moy <Matthieu.Moy@imag.fr> wrote:
> >
> > > FYI, bzr uses HTTP range requests, and the introduction of this
> > > feature lead to significant performance improvement for them (bzr is
> > > more dumb-protocol oriented than git is, so that's really important
> > > there). They have this "index file+data file" system too, so you
> > > download the full index file, and then send an HTTP range request to
> > > get only the relevant parts of the data file.
> > 
> > That's the kind of thing I was imagining. Between the index and an
> > additional "index-supplement-for-dumb-protocols" maintained by
> > update-server-info, http ranges can be bent to our evil purposes.
> > 
> > Of course it won't be as network-efficient as the git proto, or even
> > as the git-over-cgi proto, but it'll surely be server-cpu-and-memory
> > efficient. And people will benefit from it without having to do any
> > additional setup.
> 
> Of course, the problem is that only the server can know beforehand which 
> objects are needed.

But the whole idea is that we don't care.

> Imagine this:
> 
> X - Y - Z
>   \
>     A
> 
> 
> Client has "X", wants "Z", but not "A". Client needs "Y" and "Z". But 
> client cannot know that it needs "Y" before getting "Z", except if the 
> server says so.
> 
> If you have a solution for that problem, please enlighten me: I don't.

We're talking about a _dumb_ protocol here.  If you want something 
fancy, just use the Git daemon.

Otherwise, you'll simply get everything the remote has that you don't 
have, including A.

In practice this shouldn't be a problem because people tend to have 
clean repositories on machines where they want their stuff published, 
meaning that those public repos are usually the result of pushes, hence 
they contain only the minimum set of needed objects.  Of course you get 
every branches and not only a particular one, but that's the price to 
pay with a dumb protocol.


Nicolas

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-17 14:09               ` Martin Langhoff
@ 2007-05-17 15:01                 ` Nicolas Pitre
  2007-05-17 23:14                 ` Jakub Narebski
  1 sibling, 0 replies; 47+ messages in thread
From: Nicolas Pitre @ 2007-05-17 15:01 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Johannes Schindelin, git

On Fri, 18 May 2007, Martin Langhoff wrote:

> On 5/18/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > If you have a solution for that problem, please enlighten me: I don't.
> 
> Ok - worst case scenario - have a minimal hints file that tells me the
> ranges to fetch all commits and all trees. To reduce that Add to the
> hints file data to name the hashes (or even better - offsets) for the
> delta chains that contain commits+trees relevant to all the heads -
> minus 10, 20, 30, 40 commits and 1,2,4,8 and 16 days.

NO !

This is unreliable, unnecessary, and actually kills the beauty of 
the solution's simplicity.

You get updates for every branch the remote has, period.

No server side extra files, no guesses, no arbitrary ranges, no backward 
compatibility issues, no crap!


Nicolas

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-17 14:41               ` Nicolas Pitre
@ 2007-05-17 15:24                 ` Martin Langhoff
  2007-05-17 15:34                   ` Nicolas Pitre
  2007-05-17 20:04                 ` Jan Hudec
  1 sibling, 1 reply; 47+ messages in thread
From: Martin Langhoff @ 2007-05-17 15:24 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Johannes Schindelin, Shawn O. Pearce, Jan Hudec, git

On 5/18/07, Nicolas Pitre <nico@cam.org> wrote:
> And if you have 1) the permission and 2) the CPU power to execute such a
> cgi on the server and obviously 3) the knowledge to set it up properly,
> then why aren't you running the Git daemon in the first place?

And you probably _are_ running git daemon. But some clients may be on
shitty connections that only allow http. That's one of the scenarios
we're discussing.

cheers,


m

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-17 15:24                 ` Martin Langhoff
@ 2007-05-17 15:34                   ` Nicolas Pitre
  0 siblings, 0 replies; 47+ messages in thread
From: Nicolas Pitre @ 2007-05-17 15:34 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Johannes Schindelin, Shawn O. Pearce, Jan Hudec, git

On Fri, 18 May 2007, Martin Langhoff wrote:

> On 5/18/07, Nicolas Pitre <nico@cam.org> wrote:
> > And if you have 1) the permission and 2) the CPU power to execute such a
> > cgi on the server and obviously 3) the knowledge to set it up properly,
> > then why aren't you running the Git daemon in the first place?
> 
> And you probably _are_ running git daemon. But some clients may be on
> shitty connections that only allow http. That's one of the scenarios
> we're discussing.

That's not what I'm disputing at all.

I'm disputing the virtue of an HTTP solution involving a cgi with Git 
bundles vs an HTTP solution involving static file range serving.  The 
clients on shitty connections don't care either way.


Nicolas

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-17 14:41               ` Nicolas Pitre
  2007-05-17 15:24                 ` Martin Langhoff
@ 2007-05-17 20:04                 ` Jan Hudec
  2007-05-17 20:31                   ` Nicolas Pitre
  2007-05-18  9:01                   ` Johannes Schindelin
  1 sibling, 2 replies; 47+ messages in thread
From: Jan Hudec @ 2007-05-17 20:04 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Johannes Schindelin, Shawn O. Pearce, Martin Langhoff, git

[-- Attachment #1: Type: text/plain, Size: 4738 bytes --]

On Thu, May 17, 2007 at 10:41:37 -0400, Nicolas Pitre wrote:
> On Thu, 17 May 2007, Johannes Schindelin wrote:
> > On Wed, 16 May 2007, Nicolas Pitre wrote:
> And if you have 1) the permission and 2) the CPU power to execute such a 
> cgi on the server and obviously 3) the knowledge to set it up properly, 
> then why aren't you running the Git daemon in the first place?  After 
> all, they both boil down to running git-pack-objects and sending out the 
> result.  I don't think such a solution really buys much.

Yes, it does. I had 2 accounts where I could run CGI, but not a separate
server, at university while I studied, and now I can get the same on a friend's
server. Neither of them would probably be OK for serving a larger, busy git
repository, but something smaller accessed by several people is OK. I think
this is quite common for university students.

Of course your suggestion, which moves the logic to the client side, is a good
one, but even the cgi with the logic on the server side would help in some
situations.

> On the other hand, if the client does all the work and provides the 
> server with a list of ranges within a pack it wants to be sent, then you 
> simply have zero special setup to perform on the hosting server and you 
> keep the server load down due to not running pack-objects there.  That, 
> at least, is different enough from the Git daemon to be worth 
> considering.  Not only does it provide an advantage to those who cannot 
> do anything but http out of their segregated network, but it also 
> provide many advantages on the server side too while the cgi approach 
> doesn't.
> 
> And actually finding out the list of objects the remote has that you 
> don't have is not that complex.  It could go as follows:
> 
> 1) Fetch every .idx files the remote has.

... for git it's 1.2 MiB. And that definitely isn't a huge source tree.
Of course the local side could remember which indices it already saw during
a previous fetch from that location and not re-fetch them.

A slight problem is that git-repack normally recombines everything into
a single pack, so the index would have to be re-fetched again anyway.

> 2) From those .idx files, keep only a list of objects that are unknown 
>    locally.  A good starting point for doing this really efficiently is 
>    the code for git-pack-redundant.
> 
> 3) From the .idx files we got in (1), create a reverse index to get each 
>    object's size in the remote pack.  The code to do this already exists 
>    in builtin-pack-objects.c.
> 
> 4) With the list of missing objects from (2) along with their offset and 
>    size within a given pack file, fetch those objects from the remote 
>    server.  Either perform multiple requests in parallel, or as someone 
>    mentioned already, provide the server with a list of ranges you want 
>    to be sent.

Does the git server really have to do so much beyond that? I didn't look at
the algorithm that decides what deltas should be based on, but depending on
that it might (or might not) be possible to prove that the client has
everything needed to understand the objects if the server sends them as it
currently has them.

> 5) Store the received objects as loose objects locally.  If a given 
>    object is a delta, verify if its base is available locally, or if it 
>    is listed amongst those objects to be fetched from the server.  If 
>    not, add it to the list.  In most cases, delta base objects will be 
>    objects already listed to be fetched anyway.  To greatly simplify 
>    things, the loose delta object type from 2 years ago could be revived 
>    (commit 91d7b8afc2) since a repack will get rid of them.
> 
> 6 Repeat (4) and (5) until everything has been fetched.

Unless I am really seriously missing something, there is no point in
repeating. For each base object you need in order to unpack a delta, either:
 - you have it => ok.
 - you don't have it, but the server does =>
    then it's already in the fetch set calculated in 2.
 - you don't have it and neither does the server =>
    the repository at the server is corrupted and you can't fix it.

> 7) Run git-pack-objects with the list of fetched objects.
> 
> Et voilà.  Oh, and of course update your local refs from the remote's.
> 
> Actually there is nothing really complex in the above operations. And 
> with this the server side remains really simple with no special setup 
> nor extra load beyond the simple serving of file content.

On the other hand the amount of data transferred is larger than with the git
server approach, because at least the indices have to be transferred in their
entirety. So each approach has its own advantages.

-- 
						 Jan 'Bulb' Hudec <bulb@ucw.cz>

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-17 12:40 ` Petr Baudis
  2007-05-17 12:48   ` Matthieu Moy
@ 2007-05-17 20:26   ` Jan Hudec
  2007-05-17 20:38     ` Nicolas Pitre
  1 sibling, 1 reply; 47+ messages in thread
From: Jan Hudec @ 2007-05-17 20:26 UTC (permalink / raw)
  To: Petr Baudis; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 2165 bytes --]

On Thu, May 17, 2007 at 14:40:06 +0200, Petr Baudis wrote:
> On Tue, May 15, 2007 at 10:10:06PM CEST, Jan Hudec wrote:
> >  - Can be run on shared machine. If you have web space on machine shared
> >    by many people, you can set up your own gitweb, but cannot/are not allowed
> >    to start your own network server for git native protocol.
> 
>   You need to have CGI-enabled hosting, set up the CGI script etc. -
> overally, the setup is similarly complicated as git-daemon setup, so
> it's not "zero-setup" solution anymore.
> 
>   Again, I'm not sure just how many people are in the situation that
> they can run real CGI (not just PHP) but not git-daemon.

A particular case would be a group of students wanting to publish their
software project (I mean the PRG023 or equivalent). Private computers in the
hostel are not allowed to serve anything, so they'd use some of the lab
servers (e.g. artax, ss1000...). All of them allow full CGI, but running
daemons is forbidden.

> >  - Less things to set up. If you are setting up gitweb anyway, you'd not need
> >    to set up additional thing for providing fetch access.
> 
>   Except, well, how do you "set it up"? You need to make sure
> git-update-server-info is run, yes, but that shouldn't be a problem (I'm
> not so sure if git does this for you automagically - Cogito would...).

No. If it worked similarly to git-upload-pack, only over http, it would work
without update-server-info, no?

>   I think 95% of people don't set up gitweb.cgi either for their small
> HTTP repositories. :-)
> 
>   Then again, it's not that it would be really technically complicated -
> adding "give me a bundle" support to gitweb should be pretty easy.
> However, this support has some "social" costs as well: no compatibility
> with older git versions, support cost, confusion between dumb HTTP and
> gitweb HTTP transports, more lack of motivation for improving dumb HTTP
> transport...

The dumb transport is definitely useful. Extending it to use ranges if
possible would be useful as well (and maybe more than upload-pack-over-http).

-- 
						 Jan 'Bulb' Hudec <bulb@ucw.cz>

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-17 20:04                 ` Jan Hudec
@ 2007-05-17 20:31                   ` Nicolas Pitre
  2007-05-17 21:00                     ` david
  2007-05-18  9:01                   ` Johannes Schindelin
  1 sibling, 1 reply; 47+ messages in thread
From: Nicolas Pitre @ 2007-05-17 20:31 UTC (permalink / raw)
  To: Jan Hudec; +Cc: Johannes Schindelin, Shawn O. Pearce, Martin Langhoff, git

On Thu, 17 May 2007, Jan Hudec wrote:

> On Thu, May 17, 2007 at 10:41:37 -0400, Nicolas Pitre wrote:
> > On Thu, 17 May 2007, Johannes Schindelin wrote:
> > > On Wed, 16 May 2007, Nicolas Pitre wrote:
> > And if you have 1) the permission and 2) the CPU power to execute such a 
> > cgi on the server and obviously 3) the knowledge to set it up properly, 
> > then why aren't you running the Git daemon in the first place?  After 
> > all, they both boil down to running git-pack-objects and sending out the 
> > result.  I don't think such a solution really buys much.
> 
> Yes, it does. I had 2 accounts where I could run CGI, but not a separate
> server, at university while I studied, and now I can get the same on a
> friend's server. Neither of them would probably be OK for serving a larger,
> busy git repository, but something smaller accessed by several people is OK.
> I think this is quite common for university students.
> 
> Of course your suggestion which moves the logic to client-side is a good one,
> but even the cgi with logic on server side would help in some situations.

You could simply wrap git-bundle within a cgi.  That is certainly easy 
enough.
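
For illustration only, such a wrapper could be as small as the sketch below
(Python; the hard-coded repository path, the temporary file and the lack of
any ref selection are all simplifications of the example):

    #!/usr/bin/env python3
    # Toy CGI: serve one bundle containing every ref of a single repository.
    import os, subprocess, sys, tempfile

    REPO = "/home/user/public_git/project.git"   # made up for the sketch

    def main():
        fd, path = tempfile.mkstemp(suffix=".bundle")
        os.close(fd)
        try:
            # Let git do all the work: one bundle with all refs.
            subprocess.check_call(["git", "--git-dir", REPO,
                                   "bundle", "create", path, "--all"])
            size = os.path.getsize(path)
            sys.stdout.write("Content-Type: application/octet-stream\r\n")
            sys.stdout.write("Content-Length: %d\r\n\r\n" % size)
            sys.stdout.flush()
            with open(path, "rb") as f:
                sys.stdout.buffer.write(f.read())
        finally:
            os.unlink(path)

    if __name__ == "__main__":
        main()

The downloaded file can then be fed straight to git-fetch or git-clone,
both of which accept bundles.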

> > On the other hand, if the client does all the work and provides the 
> > server with a list of ranges within a pack it wants to be sent, then you 
> > simply have zero special setup to perform on the hosting server and you 
> > keep the server load down due to not running pack-objects there.  That, 
> > at least, is different enough from the Git daemon to be worth 
> > considering.  Not only does it provide an advantage to those who cannot 
> > do anything but http out of their segregated network, but it also 
> > provide many advantages on the server side too while the cgi approach 
> > doesn't.
> > 
> > And actually finding out the list of objects the remote has that you 
> > don't have is not that complex.  It could go as follows:
> > 
> > 1) Fetch every .idx files the remote has.
> 
> ... for git it's 1.2 MiB. And that definitely isn't a huge source tree.
> Of course the local side could remember which indices it already saw during
> a previous fetch from that location and not re-fetch them.

Right.  The name of the pack/index plus its time stamp can be cached.  
If the remote doesn't repack too often then the overhead would be 
minimal.
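
A sketch of that cache on the client (Python; the base URL, the cache file
location and its one-pack-name-per-line format are invented for the example,
and it keys only on the pack name since names already encode the content;
it reuses the objects/info/packs list the dumb transport already relies on):

    import os
    import urllib.request

    BASE = "http://example.org/project.git"          # made-up remote
    CACHE = os.path.expanduser("~/.cache/seen-packs")

    def seen_packs():
        try:
            with open(CACHE) as f:
                return set(line.strip() for line in f)
        except FileNotFoundError:
            return set()

    def fetch_new_indices():
        seen = seen_packs()
        # objects/info/packs (written by update-server-info) has one
        # "P pack-<sha1>.pack" line per pack on the remote.
        listing = urllib.request.urlopen(BASE + "/objects/info/packs")
        names = [line.split()[1] for line in listing.read().decode().splitlines()
                 if line.startswith("P ")]
        new = [n for n in names if n not in seen]
        for pack in new:
            idx = pack[:-len(".pack")] + ".idx"
            data = urllib.request.urlopen(BASE + "/objects/pack/" + idx).read()
            with open(idx, "wb") as f:
                f.write(data)
        os.makedirs(os.path.dirname(CACHE), exist_ok=True)
        with open(CACHE, "a") as f:
            f.writelines(n + "\n" for n in new)
        return new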

> > 2) From those .idx files, keep only a list of objects that are unknown 
> >    locally.  A good starting point for doing this really efficiently is 
> >    the code for git-pack-redundant.
> > 
> > 3) From the .idx files we got in (1), create a reverse index to get each 
> >    object's size in the remote pack.  The code to do this already exists 
> >    in builtin-pack-objects.c.
> > 
> > 4) With the list of missing objects from (2) along with their offset and 
> >    size within a given pack file, fetch those objects from the remote 
> >    server.  Either perform multiple requests in parallel, or as someone 
> >    mentioned already, provide the server with a list of ranges you want 
> >    to be sent.
> 
> Does the git server really have to do so much beyond that?

Yes it does.  The real thing performs a full object reachability walk, and 
only the objects that are needed for the wanted branch(es) are sent in a 
custom pack, meaning that the data transfer is really optimal.
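
At its core that is close to the sketch below (the refs given are
placeholders, and the real want/have negotiation is of course richer than a
single pair):

    import subprocess

    def pack_for_client(want, have, out_path):
        # Walk only what is reachable from 'want' but not from 'have',
        # and stream exactly those objects into one custom pack.
        rev_list = subprocess.Popen(
            ["git", "rev-list", "--objects", want, "^" + have],
            stdout=subprocess.PIPE)
        with open(out_path, "wb") as out:
            subprocess.check_call(["git", "pack-objects", "--stdout"],
                                  stdin=rev_list.stdout, stdout=out)
        rev_list.stdout.close()
        if rev_list.wait() != 0:
            raise RuntimeError("rev-list failed")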

> > 5) Store the received objects as loose objects locally.  If a given 
> >    object is a delta, verify if its base is available locally, or if it 
> >    is listed amongst those objects to be fetched from the server.  If 
> >    not, add it to the list.  In most cases, delta base objects will be 
> >    objects already listed to be fetched anyway.  To greatly simplify 
> >    things, the loose delta object type from 2 years ago could be revived 
> >    (commit 91d7b8afc2) since a repack will get rid of them.
> > 
> 6) Repeat (4) and (5) until everything has been fetched.
> 
> Unless I am really seriously missing something, there is no point in
> repeating. For each pack you need to unpack a delta either:
>  - you have it => ok.
>  - you don't have it, but the server does =>
> >     then it's already in the fetch set calculated in 2.
> >  - you don't have it and neither does the server =>
>     the repository at server is corrupted and you can't fix it.

You're right of course.


Nicolas

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-17 20:26   ` Jan Hudec
@ 2007-05-17 20:38     ` Nicolas Pitre
  2007-05-18 17:35       ` Jan Hudec
  0 siblings, 1 reply; 47+ messages in thread
From: Nicolas Pitre @ 2007-05-17 20:38 UTC (permalink / raw)
  To: Jan Hudec; +Cc: Petr Baudis, git

On Thu, 17 May 2007, Jan Hudec wrote:

> A particular case would be a group of students wanting to publish their
> software project (I mean the PRG023 or equivalent). Private computers in the
> hostel are not allowed to serve anything, so they'd use some of the lab
> servers (eg. artax, ss1000...). All of them allow full CGI, but running
> daemons is forbidden.

And wouldn't the admin authority for those lab servers be amenable to 
installing a Git daemon service?  That'd be a much better solution to me.


Nicolas

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-17 20:31                   ` Nicolas Pitre
@ 2007-05-17 21:00                     ` david
  0 siblings, 0 replies; 47+ messages in thread
From: david @ 2007-05-17 21:00 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Jan Hudec, Johannes Schindelin, Shawn O. Pearce, Martin Langhoff,
	git

On Thu, 17 May 2007, Nicolas Pitre wrote:

> On Thu, 17 May 2007, Jan Hudec wrote:
>
>> On Thu, May 17, 2007 at 10:41:37 -0400, Nicolas Pitre wrote:
>>> On Thu, 17 May 2007, Johannes Schindelin wrote:
>>>> On Wed, 16 May 2007, Nicolas Pitre wrote:
>>> And if you have 1) the permission and 2) the CPU power to execute such a
>>> cgi on the server and obviously 3) the knowledge to set it up properly,
>>> then why aren't you running the Git daemon in the first place?  After
>>> all, they both boil down to running git-pack-objects and sending out the
>>> result.  I don't think such a solution really buys much.
>>
>> Yes, it does. I had 2 accounts where I could run CGI, but not separate
>> server, at university while I studied and now I can get the same on friend's
>> server. Neither of them would probably be ok for serving larger busy git
>> repository, but something smaller accessed by several people is OK. I think
>> this is quite common for university students.
>>
>> Of course your suggestion which moves the logic to client-side is a good one,
>> but even the cgi with logic on server side would help in some situations.
>
> You could simply wrap git-bundle within a cgi.  That is certainly easy
> enough.

isn't this (or something very similar) exactly what we want for a smart 
fetch via http?

after all, we're completely in control of the client software, and the 
usual reason for HTTP-only access is on the client side rather than the 
server side. so http access that wraps the git protocol in http would make 
life much cleaner for lots of people

there are a few cases where all you have is static web space, but I don't 
think it's worth trying to optimize that too much as you still have the 
safety issues to worry about

David Lang

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-17 14:09               ` Martin Langhoff
  2007-05-17 15:01                 ` Nicolas Pitre
@ 2007-05-17 23:14                 ` Jakub Narebski
  1 sibling, 0 replies; 47+ messages in thread
From: Jakub Narebski @ 2007-05-17 23:14 UTC (permalink / raw)
  To: git

Martin Langhoff wrote:

> On 5/18/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
>> If you have a solution for that problem, please enlighten me: I don't.
> 
> Ok - worst case scenario - have a minimal hints file that tells me the
> ranges to fetch all commits and all trees. To reduce that, add to the
> hints file data naming the hashes (or even better - offsets) for the
> delta chains that contain commits+trees relevant to all the heads -
> minus 10, 20, 30, 40 commits and 1,2,4,8 and 16 days.
> 
> So there's a good chance the client can get the commits+trees needed
> efficiently. For blobs, all you need is the index to mark the delta
> chains you need.

By the way, I think we should always get the whole delta chain, unless we
are absolutely sure that we have the base object(s) in the repo.

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-17 20:04                 ` Jan Hudec
  2007-05-17 20:31                   ` Nicolas Pitre
@ 2007-05-18  9:01                   ` Johannes Schindelin
  2007-05-18 17:51                     ` Jan Hudec
  1 sibling, 1 reply; 47+ messages in thread
From: Johannes Schindelin @ 2007-05-18  9:01 UTC (permalink / raw)
  To: Jan Hudec; +Cc: Nicolas Pitre, Shawn O. Pearce, Martin Langhoff, git

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1567 bytes --]

Hi,

On Thu, 17 May 2007, Jan Hudec wrote:

> On Thu, May 17, 2007 at 10:41:37 -0400, Nicolas Pitre wrote:
>
> > And if you have 1) the permission and 2) the CPU power to execute such 
> > a cgi on the server and obviously 3) the knowledge to set it up 
> > properly, then why aren't you running the Git daemon in the first 
> > place?  After all, they both boil down to running git-pack-objects and 
> > sending out the result.  I don't think such a solution really buys 
> > much.
> 
> Yes, it does. I had 2 accounts where I could run CGI, but not separate 
> server, at university while I studied and now I can get the same on 
> friend's server. Neither of them would probably be ok for serving larger 
> busy git repository, but something smaller accessed by several people is 
> OK. I think this is quite common for university students.

1) This has nothing to do with the way the repo is served, but with how much 
you advertise it. The load will not be lower just because you use a CGI 
script.

2) you say yourself that git-daemon would have less impact on the load:

> > [...]
> >
> > Et voilà.  Oh, and of course update your local refs from the 
> > remote's.
> > 
> > Actually there is nothing really complex in the above operations. And 
> > with this the server side remains really simple with no special setup 
> > nor extra load beyond the simple serving of file content.
> 
> On the other hand the amount of data transfered is larger, than with the 
> git server approach, because at least the indices have to be transfered 
> in entirety.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-17 20:38     ` Nicolas Pitre
@ 2007-05-18 17:35       ` Jan Hudec
  0 siblings, 0 replies; 47+ messages in thread
From: Jan Hudec @ 2007-05-18 17:35 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Petr Baudis, git

[-- Attachment #1: Type: text/plain, Size: 715 bytes --]

On Thu, May 17, 2007 at 16:38:41 -0400, Nicolas Pitre wrote:
> On Thu, 17 May 2007, Jan Hudec wrote:
> 
> > A particular case would be a group of students wanting to publish their
> > software project (I mean the PRG023 or equivalent). Private computers in the
> > hostel are not allowed to serve anything, so they'd use some of the lab
> > servers (eg. artax, ss1000...). All of them allow full CGI, but running
> > daemons is forbidden.
> 
> And wouldn't the admin authority for those lab servers be amenable to 
> install a Git daemon service?  That'd be a much better solution to me.

It would. But it would really depend on the administrator's goodwill.

-- 
						 Jan 'Bulb' Hudec <bulb@ucw.cz>

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-18  9:01                   ` Johannes Schindelin
@ 2007-05-18 17:51                     ` Jan Hudec
  0 siblings, 0 replies; 47+ messages in thread
From: Jan Hudec @ 2007-05-18 17:51 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Nicolas Pitre, Shawn O. Pearce, Martin Langhoff, git

[-- Attachment #1: Type: text/plain, Size: 2906 bytes --]

On Fri, May 18, 2007 at 10:01:52 +0100, Johannes Schindelin wrote:
> Hi,
> 
> On Thu, 17 May 2007, Jan Hudec wrote:
> 
> > On Thu, May 17, 2007 at 10:41:37 -0400, Nicolas Pitre wrote:
> >
> > > And if you have 1) the permission and 2) the CPU power to execute such 
> > > a cgi on the server and obviously 3) the knowledge to set it up 
> > > properly, then why aren't you running the Git daemon in the first 
> > > place?  After all, they both boil down to running git-pack-objects and 
> > > sending out the result.  I don't think such a solution really buys 
> > > much.
> > 
> > Yes, it does. I had 2 accounts where I could run CGI, but not separate 
> > server, at university while I studied and now I can get the same on 
> > friend's server. Neither of them would probably be ok for serving larger 
> > busy git repository, but something smaller accessed by several people is 
> > OK. I think this is quite common for university students.
> 
> 1) This has nothing to do with the way the repo is served, but how much 
> you advertise it. The load will not be lower, just because you use a CGI 
> script.

That won't. But that was never the purpose of the "smart cgi". The purpose was
to minimize the bandwidth usage (and connectivity is still not so cheap that
you'd not care) while still working over http, either because the users need to
access it from behind a firewall or because the administrator is not willing to
set up git-daemon for you, whereas a CGI you can run yourself.

> 2) you say yourself that git-daemon would have less impact on the load:

NO, I didn't -- at least not in the paragraph below.

In the paragraph below I said that *network* use will never be as good with a
*dumb* solution as it can be with a smart solution, no matter whether it is
over a special protocol or HTTP.

---

Of course it would be less efficient in both CPU and network load, because
there is the overhead of the web server and the overhead of the http headers.

Actually I like the ranges solution. If accompanied by a repack strategy that
does not pack everything together, but instead creates packs with a limited
number of objects -- so that the indices don't exceed a configurable size, say
64kB -- it would not be that much less efficient for the network and would have
the advantage of working without the ability to execute CGI.
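
The client end of the ranges idea is almost trivial; a sketch (Python, with
the pack URL and offsets obviously made up -- in reality they come from the
downloaded .idx):

    import urllib.request

    def fetch_ranges(pack_url, ranges):
        # ranges: (offset, length) pairs of the missing objects in the
        # remote pack.  Ask the plain web server for just those bytes.
        chunks = []
        for offset, length in ranges:
            req = urllib.request.Request(pack_url)
            req.add_header("Range", "bytes=%d-%d" % (offset, offset + length - 1))
            with urllib.request.urlopen(req) as resp:
                chunks.append(resp.read())   # 206 Partial Content
            # a server that ignores Range simply sends the whole pack instead
        return chunks

    # fetch_ranges("http://example.org/p.git/objects/pack/pack-1234.pack",
    #              [(12, 300), (4096, 150)])

HTTP also allows several ranges in a single request (the answer then comes
back as multipart/byteranges), which is closer to the "send the server a
list of ranges" variant mentioned earlier in the thread.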

> > > [...]
> > >
> > > Et voilà.  Oh, and of course update your local refs from the 
> > > remote's.
> > > 
> > > Actually there is nothing really complex in the above operations. And 
> > > with this the server side remains really simple with no special setup 
> > > nor extra load beyond the simple serving of file content.
> > 
> > On the other hand the amount of data transfered is larger, than with the 
> > git server approach, because at least the indices have to be transfered 
> > in entirety.

-- 
						 Jan 'Bulb' Hudec <bulb@ucw.cz>

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-17 12:48   ` Matthieu Moy
@ 2007-05-18 18:27     ` Linus Torvalds
  2007-05-18 18:33       ` alan
                         ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Linus Torvalds @ 2007-05-18 18:27 UTC (permalink / raw)
  To: Matthieu Moy; +Cc: git



On Thu, 17 May 2007, Matthieu Moy wrote:
> 
> Many (if not most?) of the people working in a big company, I'd say.
> Yeah, it sucks, but people having used a paranoid firewall with a
> not-less-paranoid and broken proxy understand what I mean.

Well, we could try to support the git protocol over port 80..

IOW, it's probably easier to try to get people to use

	git clone git://some.host:80/project

and just run git-daemon on port 80, than it is to try to set up magic cgi 
scripts etc.

Doing that with virtual hosts etc should be pretty trivial. Much more so 
than trying to make a git-cgi script.

And yes, I do realize that in theory you can have http-aware firewalls 
that expect to see the normal http sequences in the first few packets in 
order to pass things through, but I seriously doubt it's very common.

			Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-18 18:27     ` Linus Torvalds
@ 2007-05-18 18:33       ` alan
  2007-05-18 19:01       ` Joel Becker
  2007-05-19  0:50       ` david
  2 siblings, 0 replies; 47+ messages in thread
From: alan @ 2007-05-18 18:33 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Matthieu Moy, git

On Fri, 18 May 2007, Linus Torvalds wrote:

>
>
> On Thu, 17 May 2007, Matthieu Moy wrote:
>>
>> Many (if not most?) of the people working in a big company, I'd say.
>> Yeah, it sucks, but people having used a paranoid firewall with a
>> not-less-paranoid and broken proxy understand what I mean.
>
> Well, we could try to support the git protocol over port 80..
>
> IOW, it's probably easier to try to get people to use
>
> 	git clone git://some.host:80/project
>
> and just run git-daemon on port 80, than it is to try to set up magic cgi
> scripts etc.

Except some filtering firewalls try to strip content from data (like 
ActiveX controls).

Running git on port 53 will bypass pretty much every firewall out there.

(If you want to learn how to bypass an overactive firewall, talk to a 
bunch of teenagers at a school with an agressive porn filter.)

-- 
"ANSI C says access to the padding fields of a struct is undefined.
ANSI C also says that struct assignment is a memcpy. Therefore struct
assignment in ANSI C is a violation of ANSI C..."
                                   - Alan Cox

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-18 18:27     ` Linus Torvalds
  2007-05-18 18:33       ` alan
@ 2007-05-18 19:01       ` Joel Becker
  2007-05-18 20:06         ` Matthieu Moy
  2007-05-18 20:13         ` Linus Torvalds
  2007-05-19  0:50       ` david
  2 siblings, 2 replies; 47+ messages in thread
From: Joel Becker @ 2007-05-18 19:01 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Matthieu Moy, git

On Fri, May 18, 2007 at 11:27:22AM -0700, Linus Torvalds wrote:
> Well, we could try to support the git protocol over port 80..
> 
> IOW, it's probably easier to try to get people to use
> 
> 	git clone git://some.host:80/project
> 
> and just run git-daemon on port 80, than it is to try to set of magic cgi 
> scripts etc.

	Can we teach the git-daemon to parse the HTTP headers
(specifically, the URL) and return the appropriate HTTP response?

> And yes, I do realize that in theory you can have http-aware firewalls 
> that expect to see the normal http sequences in the first few packets in 
> order to pass things through, but I seriously doubt it's very common.

	It's not about packet scanning, it's about GET vs CONNECT.  If
the proxy allows GET but not CONNECT, it's going to forward the HTTP
protocol to the server, and git-daemon is going to see "GET /project
HTTP/1.1" as its first input.  Now, perhaps we can cook that up behind
some apache so that apache handles vhosting the URL, then calls
git-daemon which can take the stdin.  So we'd be doing POST, not GET.
	On the other hand, if the proxy allows CONNECT, there is no
scanning for HTTP sequences done by the proxy.  It just allows all raw
data (as it figures you're doing SSL).
	A normal company needs to have their firewall allow CONNECT to
9418.  Then git proxying over HTTP is possible to a standard git-daemon.

Joel

-- 

"The first requisite of a good citizen in this republic of ours
 is that he shall be able and willing to pull his weight."
	- Theodore Roosevelt

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-18 19:01       ` Joel Becker
@ 2007-05-18 20:06         ` Matthieu Moy
  2007-05-18 20:13         ` Linus Torvalds
  1 sibling, 0 replies; 47+ messages in thread
From: Matthieu Moy @ 2007-05-18 20:06 UTC (permalink / raw)
  To: Joel Becker; +Cc: Linus Torvalds, git

Joel Becker <Joel.Becker@oracle.com> writes:

> 	A normal company needs to have their firewall allow CONNECT to
> 9418.  Then git proxying over HTTP is possible to a standard
> git-daemon.

443 should work too (that's HTTPS, and the proxy can't filter it,
since this would be a man-in-the-middle attack).

-- 
Matthieu

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-18 19:01       ` Joel Becker
  2007-05-18 20:06         ` Matthieu Moy
@ 2007-05-18 20:13         ` Linus Torvalds
  2007-05-18 21:56           ` Joel Becker
  1 sibling, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2007-05-18 20:13 UTC (permalink / raw)
  To: Joel Becker; +Cc: Matthieu Moy, git



On Fri, 18 May 2007, Joel Becker wrote:
> 
> 	It's not about packet scanning, it's about GET vs CONNECT.  If
> the proxy allows GET but not CONNECT, it's going to forward the HTTP
> protocol to the server, and git-daemon is going to see "GET /project
> HTTP/1.1" as its first input.  Now, perhaps we can cook that up behind
> some apache so that apache handles vhosting the URL, then calls
> git-daemon which can take the stdin.  So we'd be doing POST, not GET.

If it's _just_ the initial GET/CONNECT strings, yeah, we could probably 
easily make the git-daemon just ignore them. That shouldn't be a problem.

But if there's anything *else* required, it gets uglier much more quickly.

		Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-18 20:13         ` Linus Torvalds
@ 2007-05-18 21:56           ` Joel Becker
  2007-05-20 10:30             ` Jan Hudec
  0 siblings, 1 reply; 47+ messages in thread
From: Joel Becker @ 2007-05-18 21:56 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Matthieu Moy, git

On Fri, May 18, 2007 at 01:13:36PM -0700, Linus Torvalds wrote:
> If it's _just_ the initial GET/CONNECT strings, yeah, we could probably 
> easily make the git-daemon just ignore them. That shouldn't be a problem.
> 
> But if there's anything *else* required, it gets uglier much more quickly.

	With CONNECT, there isn't anything.  That is, your
GIT_PROXY_COMMAND handles talking to the proxy, then gives git itself a
raw data pipe.  My proxy allows CONNECT to 9418, and that's how I use it
today.
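
For anyone who has not set that up: git runs GIT_PROXY_COMMAND with the
destination host and port as its two arguments and just speaks the normal
git protocol over the command's stdin/stdout. A bare-bones helper could look
like the sketch below (the proxy address is a placeholder, and real proxies
may additionally want authentication):

    #!/usr/bin/env python3
    # Toy GIT_PROXY_COMMAND: tunnel stdin/stdout through an HTTP proxy
    # using CONNECT.  git invokes it as: <script> <host> <port>
    import select, socket, sys

    PROXY = ("proxy.example.com", 3128)      # placeholder

    def main():
        host, port = sys.argv[1], sys.argv[2]
        s = socket.create_connection(PROXY)
        s.sendall(("CONNECT %s:%s HTTP/1.0\r\n\r\n" % (host, port)).encode())
        reply = b""
        while b"\r\n\r\n" not in reply:       # read the proxy's reply headers
            byte = s.recv(1)
            if not byte:
                sys.exit("proxy closed the connection")
            reply += byte
        if b" 200 " not in reply.splitlines()[0]:
            sys.exit("proxy refused CONNECT")
        stdin, stdout = sys.stdin.buffer, sys.stdout.buffer
        while True:                           # shuttle bytes both ways
            readable, _, _ = select.select([stdin, s], [], [])
            if stdin in readable:
                data = stdin.read1(65536)
                if not data:
                    break
                s.sendall(data)
            if s in readable:
                data = s.recv(65536)
                if not data:
                    break
                stdout.write(data)
                stdout.flush()

    if __name__ == "__main__":
        main()

With that exported as GIT_PROXY_COMMAND, plain git:// URLs go through the
proxy unchanged.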
	If you tried to make POST work (it'd be POST, not GET, as you
need to connect up the sending side), either apache would have to front
it for us, or "git-daemon --http" would have to accept the HTTP headers
before the input, and output a proper HTTP response before sending
output.  Seeing the headers would even allow us to vhost.
	Hmm, but the proxy may not allow two-way communication.  Does
the git protocol have more than one round-trip?  That is:

Client:
    POST http://server.git.host:80/projects/thisproject HTTP/1.1
    Host: server.git.host

    fetch-pack <sha1>
    EOF

Server:
    200 OK HTTP/1.1
    
    <data>
    EOF

should work, I'd think.

Joel


-- 

"Ninety feet between bases is perhaps as close as man has ever come
 to perfection."
	- Red Smith

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-18 18:27     ` Linus Torvalds
  2007-05-18 18:33       ` alan
  2007-05-18 19:01       ` Joel Becker
@ 2007-05-19  0:50       ` david
  2007-05-19  3:58         ` Shawn O. Pearce
  2 siblings, 1 reply; 47+ messages in thread
From: david @ 2007-05-19  0:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Matthieu Moy, git

On Fri, 18 May 2007, Linus Torvalds wrote:

> On Thu, 17 May 2007, Matthieu Moy wrote:
>>
>> Many (if not most?) of the people working in a big company, I'd say.
>> Yeah, it sucks, but people having used a paranoid firewall with a
>> not-less-paranoid and broken proxy understand what I mean.
>
> Well, we could try to support the git protocol over port 80..
>
> IOW, it's probably easier to try to get people to use
>
> 	git clone git://some.host:80/project
>
> and just run git-daemon on port 80, than it is to try to set up magic cgi
> scripts etc.
>
> Doing that with virtual hosts etc should be pretty trivial. Much more so
> than trying to make a git-cgi script.
>
> And yes, I do realize that in theory you can have http-aware firewalls
> that expect to see the normal http sequences in the first few packets in
> order to pass things through, but I seriously doubt it's very common.

they are actually more common than you think, and getting even more common 
thanks to IE

when a person browsing a hostile website can allow that website to take 
over the machine, a demand is created for 'malware filters' for http. to 
do this the firewalls need to decode the http, and in the process they limit 
you to only doing legitimate http.

it's also the case that the companies that have firewalls paranoid enough 
to not let you get to the git port are highly likely to be paranoid enough 
to have a malware filtering http firewall.

David Lang

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-19  0:50       ` david
@ 2007-05-19  3:58         ` Shawn O. Pearce
  2007-05-19  4:58           ` david
  0 siblings, 1 reply; 47+ messages in thread
From: Shawn O. Pearce @ 2007-05-19  3:58 UTC (permalink / raw)
  To: david; +Cc: Linus Torvalds, Matthieu Moy, git

david@lang.hm wrote:
> when a person browsing a hostile website will allow that website to take 
> over the machine the demand is created for 'malware filters' for http, to 
> do this the firewalls need to decode the http, and in the process limit 
> you to only doing legitimate http.
> 
> it's also the case that the companies that have firewalls paranoid enough 
> to not let you get to the git port are highly likely to be paranoid enough 
> to have a malware filtering http firewall.

I'm behind such a filter, and fetch git.git via HTTP just to keep
my work system current with Junio.  ;-)

Of course we're really really really paranoid about our firewall,
but are also so paranoid that any other web browser *except*
Microsoft Internet Explorer is thought to be a security risk and
is more-or-less banned from the network.

The kicker is some of our developers create public websites, where
testing your local webpage with Firefox and Safari is pretty much
required...  but those browsers still aren't as trusted as IE and
require special clearances.  *shakes head*

We're pretty much limited to:

 *) Running the native Git protocol over SSL, where the remote system
 is answering on port 443.  It may not need to be HTTP at all,
 but it probably has to smell enough like SSL to get it through
 the malware filter.  Oh, what's that?  The filter cannot actually
 filter the SSL data?  Funny!  ;-)

 *) Using a single POST upload followed by a response from the server,
 formatted with minimal HTTP headers.  The real problem, as people
 have pointed out, is not the HTTP headers but the single
 exchange.

One might think you could use HTTP pipelining to try and get a
bi-directional channel with the remote system, but I'm sure proxy
servers are not required to reuse the same TCP connection to the
remote HTTP server when the inside client piplines a new request.
So any sort of hack on pipelining won't work.

If you really want a stateful exchange you have to treat HTTP as
though it were IP, but with reliable (and much more expensive)
packet delivery, and make the Git daemon keep track of the protocol
state with the client.  Yes, that means that when the client suddenly
goes away and doesn't tell you he went away you also have to garbage
collect your state.  No nice messages from your local kernel.  :-(
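
In other words the daemon would need something like the toy session table
below, with a reaper standing in for the missing TCP teardown (the timeout
and the shape of the per-client state are arbitrary):

    import time, uuid

    SESSION_TIMEOUT = 300        # seconds, arbitrary
    sessions = {}                # token -> {"last_seen": ..., "haves": set()}

    def new_session():
        token = uuid.uuid4().hex          # handed back to the client
        sessions[token] = {"last_seen": time.time(), "haves": set()}
        return token

    def touch(token, haves):
        state = sessions[token]
        state["last_seen"] = time.time()
        state["haves"].update(haves)      # accumulated negotiation state
        return state

    def reap_stale_sessions():
        # No FIN/RST from an HTTP client that went away, so just expire it.
        cutoff = time.time() - SESSION_TIMEOUT
        for token in [t for t, s in sessions.items() if s["last_seen"] < cutoff]:
            del sessions[token]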

-- 
Shawn.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-19  3:58         ` Shawn O. Pearce
@ 2007-05-19  4:58           ` david
  0 siblings, 0 replies; 47+ messages in thread
From: david @ 2007-05-19  4:58 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Linus Torvalds, Matthieu Moy, git

On Fri, 18 May 2007, Shawn O. Pearce wrote:

> david@lang.hm wrote:
>> when a person browsing a hostile website will allow that website to take
>> over the machine the demand is created for 'malware filters' for http, to
>> do this the firewalls need to decode the http, and in the process limit
>> you to only doing legitimate http.
>>
>> it's also the case that the companies that have firewalls paranoid enough
>> to not let you get to the git port are highly likely to be paranoid enough
>> to have a malware filtering http firewall.
>
> I'm behind such a filter, and fetch git.git via HTTP just to keep
> my work system current with Junio.  ;-)
>
> Of course we're really really really paranoid about our firewall,
> but are also so paranoid that any other web browser *except*
> Microsoft Internet Explorer is thought to be a security risk and
> is more-or-less banned from the network.
>
> The kicker is some of our developers create public websites, where
> testing your local webpage with Firefox and Safari is pretty much
> required...  but those browsers still aren't as trusted as IE and
> require special clearances.  *shakes head*

this isn't paranoia, this is just bullheadedness

> We're pretty much limited to:
>
> *) Running the native Git protocol SSL, where the remote system
> is answering to port 443.  It may not need to be HTTP at all,
> but it probably has to smell enough like SSL to get it through
> the malware filter.  Oh, what's that?  The filter cannot actually
> filter the SSL data?  Funny!  ;-)

we're actually paranoid enough to have devices that do man-in-the-middle 
decryption for some sites, and are given copies of the encryption keys 
that other sites (and browsers) use so that it can decrypt the SSL and 
check it. I admit that this is far more paranoid than almost all sites 
though :-)

> *) Using a single POST upload followed by response from server,
> formatted with minimal HTTP headers.  The real problem as people
> have pointed out is not the HTTP headers, but it is the single
> exchange.

> If you really want a stateful exchange you have to treat HTTP as
> though it were IP, but with reliable (and much more expensive)
> packet delivery, and make the Git daemon keep track of the protocol
> state with the client.  Yes, that means that when the client suddenly
> goes away and doesn't tell you he went away you also have to garbage
> collect your state.  No nice messages from your local kernel.  :-(

unfortunately you are right about this.

David Lang

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Smart fetch via HTTP?
  2007-05-18 21:56           ` Joel Becker
@ 2007-05-20 10:30             ` Jan Hudec
  0 siblings, 0 replies; 47+ messages in thread
From: Jan Hudec @ 2007-05-20 10:30 UTC (permalink / raw)
  To: Joel Becker; +Cc: Linus Torvalds, Matthieu Moy, git

[-- Attachment #1: Type: text/plain, Size: 2000 bytes --]

On Fri, May 18, 2007 at 14:56:07 -0700, Joel Becker wrote:
> On Fri, May 18, 2007 at 01:13:36PM -0700, Linus Torvalds wrote:
> > If it's _just_ the initial GET/CONNECT strings, yeah, we could probably 
> > easily make the git-daemon just ignore them. That shouldn't be a problem.
> > 
> > But if there's anything *else* required, it gets uglier much more quickly.
> 
> 	With CONNECT, there isn't anything.  That is, your
> GIT_PROXY_COMMAND handles talking to the proxy, then gives git itself a
> raw data pipe.  My proxy allows CONNECT to 9418, and that's how I use it
> today.

Yes. CONNECT is easy. However, many companies only allow CONNECT to 443
(not that it's much more secure than allowing it anywhere else, but the proxy
at least has to block CONNECT to 25 to keep spam from being sent).

> 	If you tried to make POST work (It'd be POST, not GET, as you
> need to connect up the sending side), either apache would have to front
> it for us, or "git-daemon --http" would have to accept the HTTP headers
> on before the input, and output a proper HTTP response before sending
> output.  Seeing the headers would allow for us to vhost, even.
> 	Hmm, but the proxy may not allow two-way communication.  Does
> the git protocol have more than one round-trip?  That is:
> 
> Client:
>     POST http://server.git.host:80/projects/thisproject HTTP/1.1
>     Host: server.git.host
> 
>     fetch-pack <sha1>
>     EOF
> 
> Server:
>     200 OK HTTP/1.1
>     
>     <data>
>     EOF
> 
> should work, I'd think.

Well, that does not require git at all -- apache can handle this all right.
But it's not network-efficient. To be network-efficient, it is necessary to
negotiate the list of objects that need to be sent. And that requires more
than one round-trip. Additionally, the current git protocol is streaming --
the client sends data without waiting for the server. So it would require a
slightly different protocol over HTTP.
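
One stateless-on-the-server shape for that (echoing the proposal at the top
of the thread) is for the client to resend its wants plus everything it has
offered so far on every POST, until the server can cut a pack. A client-side
sketch, with a completely made-up endpoint, request format and content type:

    import urllib.request

    URL = "http://example.org/cgi-bin/git-fetch.cgi"   # invented endpoint

    def negotiate(wants, have_batches):
        # wants: sha1s we want; have_batches: successive lists of sha1s we
        # have.  Every round resends the full state, so the server keeps none.
        offered = []
        for batch in have_batches:
            offered.extend(batch)
            body = "\n".join(["want " + w for w in wants] +
                             ["have " + h for h in offered]).encode()
            resp = urllib.request.urlopen(urllib.request.Request(URL, data=body))
            if resp.headers.get("Content-Type") == "application/x-git-pack":
                return resp.read()          # server had enough: here is the pack
        # Nothing left to offer; ask for everything reachable from the wants.
        body = ("\n".join("want " + w for w in wants) + "\ndone\n").encode()
        return urllib.request.urlopen(urllib.request.Request(URL, data=body)).read()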

-- 
						 Jan 'Bulb' Hudec <bulb@ucw.cz>

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2007-05-20 10:30 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-05-15 20:10 Smart fetch via HTTP? Jan Hudec
2007-05-15 22:30 ` A Large Angry SCM
2007-05-15 23:29 ` Shawn O. Pearce
2007-05-16  0:38   ` Junio C Hamano
2007-05-16  5:25 ` Martin Langhoff
2007-05-16 11:33   ` Johannes Schindelin
2007-05-16 21:26     ` Martin Langhoff
2007-05-16 21:54       ` Jakub Narebski
2007-05-17  0:52       ` Johannes Schindelin
2007-05-17  1:03         ` Shawn O. Pearce
2007-05-17  1:04           ` david
2007-05-17  1:26             ` Shawn O. Pearce
2007-05-17  1:45               ` Shawn O. Pearce
2007-05-17 12:36                 ` Theodore Tso
2007-05-17  3:45           ` Nicolas Pitre
2007-05-17 10:48             ` Johannes Schindelin
2007-05-17 14:41               ` Nicolas Pitre
2007-05-17 15:24                 ` Martin Langhoff
2007-05-17 15:34                   ` Nicolas Pitre
2007-05-17 20:04                 ` Jan Hudec
2007-05-17 20:31                   ` Nicolas Pitre
2007-05-17 21:00                     ` david
2007-05-18  9:01                   ` Johannes Schindelin
2007-05-18 17:51                     ` Jan Hudec
2007-05-17 11:28         ` Matthieu Moy
2007-05-17 13:10           ` Martin Langhoff
2007-05-17 13:47             ` Johannes Schindelin
2007-05-17 14:05               ` Matthieu Moy
2007-05-17 14:09               ` Martin Langhoff
2007-05-17 15:01                 ` Nicolas Pitre
2007-05-17 23:14                 ` Jakub Narebski
2007-05-17 14:50               ` Nicolas Pitre
2007-05-17 12:40 ` Petr Baudis
2007-05-17 12:48   ` Matthieu Moy
2007-05-18 18:27     ` Linus Torvalds
2007-05-18 18:33       ` alan
2007-05-18 19:01       ` Joel Becker
2007-05-18 20:06         ` Matthieu Moy
2007-05-18 20:13         ` Linus Torvalds
2007-05-18 21:56           ` Joel Becker
2007-05-20 10:30             ` Jan Hudec
2007-05-19  0:50       ` david
2007-05-19  3:58         ` Shawn O. Pearce
2007-05-19  4:58           ` david
2007-05-17 20:26   ` Jan Hudec
2007-05-17 20:38     ` Nicolas Pitre
2007-05-18 17:35       ` Jan Hudec

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).