* Smart fetch via HTTP?
@ 2007-05-15 20:10 Jan Hudec
2007-05-15 22:30 ` A Large Angry SCM
` (3 more replies)
0 siblings, 4 replies; 47+ messages in thread
From: Jan Hudec @ 2007-05-15 20:10 UTC (permalink / raw)
To: git
[-- Attachment #1: Type: text/plain, Size: 1635 bytes --]
Hello,
Has anyone already thought about making fetching over HTTP work similarly to the
native git protocol?
That is, rather than reading the raw content of the repository, there would be
a CGI script (which could be integrated into gitweb) that negotiates what the
client needs and then generates and sends a single pack.
Mercurial and bzr both have this option. It would IMO have three benefits:
- Fast access for people behind paranoid firewalls that only let http and
https through (you can tunnel anything through, but only to port 443).
- Can be run on a shared machine. If you have web space on a machine shared
by many people, you can set up your own gitweb, but cannot/are not allowed
to start your own network server for the git native protocol.
- Fewer things to set up. If you are setting up gitweb anyway, you would not
need to set up anything extra to provide fetch access.
Then the question is how to implement it. The current protocol is stateful on
both sides, but the stateless nature of HTTP more or less requires the
protocol to be stateless on the server.
I think it would be possible to use basically the same protocol as now, but
make it stateless for the server. That is, the server first sends its heads, and then
the client repeatedly sends all its wants and some haves until the server acks
all of them and sends the pack.
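A minimal sketch of what one round-trip-per-POST negotiation could look like from the client side (the git-fetch endpoint, the plain-text wire format and the have_walker helper are hypothetical, purely for illustration):

import urllib.request

URL = "http://example.org/project.git/git-fetch"   # hypothetical CGI endpoint

def negotiate(wants, have_walker):
    """Since every POST is independent, re-send the full want list plus all
    haves gathered so far on each round, until the server finds a common
    cut point and replies with a pack."""
    haves = []
    while True:
        body = "".join("want %s\n" % w for w in wants)
        body += "".join("have %s\n" % h for h in haves)
        req = urllib.request.Request(URL, data=body.encode(),
                                     headers={"Content-Type": "text/plain"})
        with urllib.request.urlopen(req) as resp:
            if resp.headers.get("Content-Type") == "application/x-git-pack":
                return resp.read()               # server acked enough, sent the pack
        haves.extend(have_walker.next_batch())   # otherwise add some older commits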
Alternatively, I am thinking about using Bloom filters (somebody came up with
such an idea on the bzr list back when I still followed it). It might be useful, as
over HTTP we need to send as many haves as possible in one go.
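A toy illustration of how the client's haves could be summarized into one fixed-size filter (the sizes and hashing scheme below are arbitrary, just to show the principle):

import hashlib

class BloomFilter:
    """m-bit Bloom filter with k hash functions carved out of one SHA-1."""
    def __init__(self, m=8192, k=4):
        self.m, self.k, self.bits = m, k, bytearray(m // 8)

    def _positions(self, item):
        digest = hashlib.sha1(item.encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[i * 4:i * 4 + 4], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# The client adds every commit id it has and sends the bit array in one go;
# the server walks its own history and treats any commit that tests positive
# as a common ancestor, accepting a small false-positive rate.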
--
Jan 'Bulb' Hudec <bulb@ucw.cz>
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-15 20:10 Smart fetch via HTTP? Jan Hudec
@ 2007-05-15 22:30 ` A Large Angry SCM
2007-05-15 23:29 ` Shawn O. Pearce
` (2 subsequent siblings)
3 siblings, 0 replies; 47+ messages in thread
From: A Large Angry SCM @ 2007-05-15 22:30 UTC (permalink / raw)
To: Jan Hudec; +Cc: git
Jan Hudec wrote:
> Hello,
>
> Did anyone already think about fetching over HTTP working similarly to the
> native git protocol?
>
> That is rather than reading the raw content of the repository, there would be
> a CGI script (could be integrated to gitweb), that would negotiate what the
> client needs and then generate and send a single pack with it.
>
> Mercurial and bzr both have this option. It would IMO have three benefits:
> - Fast access for people behind paranoid firewalls, that only let http and
> https (you can tunnel anything through, but only to port 443) through.
> - Can be run on shared machine. If you have web space on machine shared
> by many people, you can set up your own gitweb, but cannot/are not allowed
> to start your own network server for git native protocol.
> - Less things to set up. If you are setting up gitweb anyway, you'd not need
> to set up additional thing for providing fetch access.
>
> Then the question is how to implement it. The current protocol is stateful on
> both sides, but the stateless nature of HTTP more or less requires the
> protocol to be stateless on the server.
>
> I think it would be possible to use basically the same protocol as now, but
> make it stateless for the server. That is the server first sends its heads and then
> the client repeatedly sends all its wants and some haves until the server acks
> all of them and sends the pack.
>
> Alternatively I am thinking about using Bloom filters (somebody came with
> such idea on the bzr list when I still followed it). It might be useful, as
> over HTTP we need to send as many haves as possible in one go.
>
Bundles?
Client POSTs its ref set; server uses the ref set to generate and
return the bundle.
Push over http(s) could work the same...
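A rough sketch of such a CGI, assuming the client POSTs one sha1 per line for the refs it already has (the repository path, POST format and endpoint are placeholders, not an existing interface):

#!/usr/bin/env python3
import os, re, subprocess, sys, tempfile

REPO = "/srv/git/project.git"            # assumed repository location

def main():
    length = int(os.environ.get("CONTENT_LENGTH", "0"))
    body = sys.stdin.read(length)
    # only accept well-formed sha1s, so nothing odd reaches the command line
    client_has = [l for l in body.split() if re.fullmatch(r"[0-9a-f]{40}", l)]
    with tempfile.TemporaryDirectory() as d:
        bundle = os.path.join(d, "out.bundle")
        # everything reachable from our refs, minus what the client already has
        cmd = ["git", "--git-dir", REPO, "bundle", "create", bundle, "--all"]
        cmd += ["^" + sha for sha in client_has]
        subprocess.run(cmd, check=True)
        sys.stdout.write("Content-Type: application/octet-stream\r\n\r\n")
        sys.stdout.flush()
        with open(bundle, "rb") as f:
            sys.stdout.buffer.write(f.read())

if __name__ == "__main__":
    main()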
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-15 20:10 Smart fetch via HTTP? Jan Hudec
2007-05-15 22:30 ` A Large Angry SCM
@ 2007-05-15 23:29 ` Shawn O. Pearce
2007-05-16 0:38 ` Junio C Hamano
2007-05-16 5:25 ` Martin Langhoff
2007-05-17 12:40 ` Petr Baudis
3 siblings, 1 reply; 47+ messages in thread
From: Shawn O. Pearce @ 2007-05-15 23:29 UTC (permalink / raw)
To: Jan Hudec; +Cc: git
Jan Hudec <bulb@ucw.cz> wrote:
> Did anyone already think about fetching over HTTP working similarly to the
> native git protocol?
No work has been done on this (that I know of) but I've discussed
it to some extent with Simon 'corecode' Schubert on #git, and I
think he also brought it up on the mailing list not too long after.
I've certainly thought about adding some sort of pack-objects
frontend into gitweb.cgi for this exact purpose. It is really
quite easy, except for the negotiation of what the client has. ;-)
> Then the question is how to implement it. The current protocol is stateful on
> both sides, but the stateless nature of HTTP more or less requires the
> protocol to be stateless on the server.
>
> I think it would be possible to use basically the same protocol as now, but
> make it stateless for the server. That is the server first sends its heads and then
> the client repeatedly sends all its wants and some haves until the server acks
> all of them and sends the pack.
I think Simon was talking about doubling the number of haves the
client sends in each request. So the client POSTs initially all
of its current refs; then current refs and their parents; then 4
commits back, then 8, etc. The server replies to each POST request
with either a "send more please" or the packfile.
--
Shawn.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-15 23:29 ` Shawn O. Pearce
@ 2007-05-16 0:38 ` Junio C Hamano
0 siblings, 0 replies; 47+ messages in thread
From: Junio C Hamano @ 2007-05-16 0:38 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: Jan Hudec, git
"Shawn O. Pearce" <spearce@spearce.org> writes:
> Jan Hudec <bulb@ucw.cz> wrote:
>> Did anyone already think about fetching over HTTP working similarly to the
>> native git protocol?
>
> No work has been done on this (that I know of) but I've discussed
> it to some extent with Simon 'corecode' Schubert on #git, and I
> think he also brought it up on the mailing list not too long after.
>
> I've certainly thought about adding some sort of pack-objects
> frontend into gitweb.cgi for this exact purpose. It is really
> quite easy, except for the negotiation of what the client has. ;-)
>
>> Then the question is how to implement it. The current protocol is stateful on
>> both sides, but the stateless nature of HTTP more or less requires the
>> protocol to be stateless on the server.
>>
>> I think it would be possible to use basically the same protocol as now, but
>> make it stateless for the server. That is the server first sends its heads and then
>> the client repeatedly sends all its wants and some haves until the server acks
>> all of them and sends the pack.
>
> I think Simon was talking about doubling the number of haves the
> client sends in each request. So the client POSTs initially all
> of its current refs; then current refs and their parents; then 4
> commits back, then 8, etc. The server replies to each POST request
> with either a "send more please" or the packfile.
I kinda' like the bundle suggestion ;-)
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-15 20:10 Smart fetch via HTTP? Jan Hudec
2007-05-15 22:30 ` A Large Angry SCM
2007-05-15 23:29 ` Shawn O. Pearce
@ 2007-05-16 5:25 ` Martin Langhoff
2007-05-16 11:33 ` Johannes Schindelin
2007-05-17 12:40 ` Petr Baudis
3 siblings, 1 reply; 47+ messages in thread
From: Martin Langhoff @ 2007-05-16 5:25 UTC (permalink / raw)
To: Jan Hudec; +Cc: git
On 5/16/07, Jan Hudec <bulb@ucw.cz> wrote:
> Did anyone already think about fetching over HTTP working similarly to the
> native git protocol?
Do the indexes have enough info to use them with http ranges? It'd be
chunkier than a smart protocol, but it'd still work with dumb servers.
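For a single object it would boil down to something like the sketch below, assuming the byte offset and length were already worked out from the corresponding .idx file (the URL and numbers are placeholders):

import urllib.request

def fetch_pack_slice(pack_url, start, end):
    """Return bytes [start, end] (inclusive) of a remote packfile."""
    req = urllib.request.Request(pack_url,
                                 headers={"Range": "bytes=%d-%d" % (start, end)})
    with urllib.request.urlopen(req) as resp:
        # 206 means the server honoured the range; a dumb HTTP/1.0 server
        # may instead answer 200 with the whole file.
        if resp.status != 206:
            raise RuntimeError("server did not honour the Range header")
        return resp.read()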
cheers,
m
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-16 5:25 ` Martin Langhoff
@ 2007-05-16 11:33 ` Johannes Schindelin
2007-05-16 21:26 ` Martin Langhoff
0 siblings, 1 reply; 47+ messages in thread
From: Johannes Schindelin @ 2007-05-16 11:33 UTC (permalink / raw)
To: Martin Langhoff; +Cc: Jan Hudec, git
Hi,
On Wed, 16 May 2007, Martin Langhoff wrote:
> On 5/16/07, Jan Hudec <bulb@ucw.cz> wrote:
> > Did anyone already think about fetching over HTTP working similarly to the
> > native git protocol?
>
> Do the indexes have enough info to use them with http ranges? It'd be
> chunkier than a smart protocol, but it'd still work with dumb servers.
It would not be really performant, would it? Besides, not all Web servers
speak HTTP/1.1...
Ciao,
Dscho
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-16 11:33 ` Johannes Schindelin
@ 2007-05-16 21:26 ` Martin Langhoff
2007-05-16 21:54 ` Jakub Narebski
2007-05-17 0:52 ` Johannes Schindelin
0 siblings, 2 replies; 47+ messages in thread
From: Martin Langhoff @ 2007-05-16 21:26 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Jan Hudec, git
On 5/16/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> On Wed, 16 May 2007, Martin Langhoff wrote:
> > Do the indexes have enough info to use them with http ranges? It'd be
> > chunkier than a smart protocol, but it'd still work with dumb servers.
> It would not be really performant, would it? Besides, not all Web servers
> speak HTTP/1.1...
Performant compared to downloading a huge packfile to get 10% of it?
Sure! It'd probably take a few trips, and you'd end up fetching 20% of
the file, still better than 100%.
> Besides, not all Web servers speak HTTP/1.1...
Are there any interesting webservers out there that don't? Hand-rolled,
purpose-built webservers often don't, but those don't serve files; they
serve web apps. When it comes to serving files, any webserver that is
supported (security-wise) these days is HTTP/1.1.
And for services like SF.net it'd be a safe low-cpu way of serving git
files, 'cause the git protocol is quite expensive server-side (io+cpu),
as we've seen with kernel.org. Being really smart with a cgi is
probably going to be expensive too.
cheers,
m
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-16 21:26 ` Martin Langhoff
@ 2007-05-16 21:54 ` Jakub Narebski
2007-05-17 0:52 ` Johannes Schindelin
1 sibling, 0 replies; 47+ messages in thread
From: Jakub Narebski @ 2007-05-16 21:54 UTC (permalink / raw)
To: git
Martin Langhoff wrote:
> On 5/16/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
>> On Wed, 16 May 2007, Martin Langhoff wrote:
>> > Do the indexes have enough info to use them with http ranges? It'd be
>> > chunkier than a smart protocol, but it'd still work with dumb servers.
>> It would not be really performant, would it? Besides, not all Web servers
>> speak HTTP/1.1...
>
> Performant compared to downloading a huge packfile to get 10% of it?
> Sure! It'd probably take a few trips, and you'd end up fetching 20% of
> the file, still better than 100%.
That's why you should have something akin to a backup policy for pack files,
like daily packs, weekly packs, ..., and the rest, just for the dumb
protocols.
--
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-16 21:26 ` Martin Langhoff
2007-05-16 21:54 ` Jakub Narebski
@ 2007-05-17 0:52 ` Johannes Schindelin
2007-05-17 1:03 ` Shawn O. Pearce
2007-05-17 11:28 ` Matthieu Moy
1 sibling, 2 replies; 47+ messages in thread
From: Johannes Schindelin @ 2007-05-17 0:52 UTC (permalink / raw)
To: Martin Langhoff; +Cc: Jan Hudec, git
Hi,
On Thu, 17 May 2007, Martin Langhoff wrote:
> On 5/16/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > On Wed, 16 May 2007, Martin Langhoff wrote:
> > > Do the indexes have enough info to use them with http ranges? It'd be
> > > chunkier than a smart protocol, but it'd still work with dumb servers.
> > It would not be really performant, would it? Besides, not all Web servers
> > speak HTTP/1.1...
>
> Performant compared to downloading a huge packfile to get 10% of it?
> Sure! It'd probably take a few trips, and you'd end up fetching 20% of
> the file, still better than 100%.
Don't forget that those 10% probably won't do you the favour of being in
large chunks. Chances are that _every_ _single_ wanted object is separate
from the others.
> > Besides, not all Web servers speak HTTP/1.1...
>
> Are there any interesting webservers out there that don't? Hand-rolled
> purpose-built webservers often don't but those don't serve files, they
> serve web apps. When it comes to serving files, any webserver that is
> supported (security-wise) these days is HTTP/1.1.
>
> And for services like SF.net it'd be a safe low-cpu way of serving git
> files. 'cause the git protocol is quite expensive server-side (io+cpu)
> as we've seen with kernel.org. Being really smart with a cgi is
> probably going to be expensive too.
It's probably better and faster than relying on a feature which does not
exactly help.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-17 0:52 ` Johannes Schindelin
@ 2007-05-17 1:03 ` Shawn O. Pearce
2007-05-17 1:04 ` david
2007-05-17 3:45 ` Nicolas Pitre
2007-05-17 11:28 ` Matthieu Moy
1 sibling, 2 replies; 47+ messages in thread
From: Shawn O. Pearce @ 2007-05-17 1:03 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Martin Langhoff, Jan Hudec, git
Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> Don't forget that those 10% probably do not do you the favour to be in
> large chunks. Chances are that _every_ _single_ wanted object is separate
> from the others.
That's completely possible, assuming the objects are even packed
in the first place. It's very unlikely that you would be able to
fetch a very large range from an existing packfile; you would be
submitting most of your range requests for very, very small sections.
> > And for services like SF.net it'd be a safe low-cpu way of serving git
> > files. 'cause the git protocol is quite expensive server-side (io+cpu)
> > as we've seen with kernel.org. Being really smart with a cgi is
> > probably going to be expensive too.
>
> It's probably better and faster than relying on a feature which does not
> exactly help.
Yes. Packing more often and pack v4 may help a lot there.
The other thing is kernel.org should really try to encourage the
folks with repositories there to try and share against one master
repository, so the poor OS has a better chance at holding the bulk
of linux-2.6.git in buffer cache.
I'm not suggesting they share specifically against Linus' repository;
maybe hpa and the other admins can host one separately from Linus and
encourage users to use that repository when on a system they maintain.
In an SF.net type case this doesn't help however. Most of SF.net
is tiny projects with very few, if any, developers. Hence most
of that is going to be unsharable, infrequently accessed, and uh,
not needed to be stored in buffer cache. For the few projects that
are hosted there that have a large developer base they could use
a shared repository approach as I just suggested for kernel.org.
aka the "forks" thing in gitweb, and on repo.or.cz...
--
Shawn.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-17 1:03 ` Shawn O. Pearce
@ 2007-05-17 1:04 ` david
2007-05-17 1:26 ` Shawn O. Pearce
2007-05-17 3:45 ` Nicolas Pitre
1 sibling, 1 reply; 47+ messages in thread
From: david @ 2007-05-17 1:04 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: Johannes Schindelin, Martin Langhoff, Jan Hudec, git
On Wed, 16 May 2007, Shawn O. Pearce wrote:
> Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
>
>>> And for services like SF.net it'd be a safe low-cpu way of serving git
>>> files. 'cause the git protocol is quite expensive server-side (io+cpu)
>>> as we've seen with kernel.org. Being really smart with a cgi is
>>> probably going to be expensive too.
>>
>> It's probably better and faster than relying on a feature which does not
>> exactly help.
>
> Yes. Packing more often and pack v4 may help a lot there.
>
> The other thing is kernel.org should really try to encourage the
> folks with repositories there to try and share against one master
> repository, so the poor OS has a better chance at holding the bulk
> of linux-2.6.git in buffer cache.
do you mean more precisely share against one object store or do you really
mean repository?
David Lang
> I'm not suggesting they share specifically against Linus' repository;
> maybe hpa and the other admins can host one separately from Linus and
> encourage users to use that repository when on a system they maintain.
>
> In an SF.net type case this doesn't help however. Most of SF.net
> is tiny projects with very few, if any, developers. Hence most
> of that is going to be unsharable, infrequently accessed, and uh,
> not needed to be stored in buffer cache. For the few projects that
> are hosted there that have a large developer base they could use
> a shared repository approach as I just suggested for kernel.org.
>
> aka the "forks" thing in gitweb, and on repo.or.cz...
>
>
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-17 1:04 ` david
@ 2007-05-17 1:26 ` Shawn O. Pearce
2007-05-17 1:45 ` Shawn O. Pearce
0 siblings, 1 reply; 47+ messages in thread
From: Shawn O. Pearce @ 2007-05-17 1:26 UTC (permalink / raw)
To: david; +Cc: Johannes Schindelin, Martin Langhoff, Jan Hudec, git
david@lang.hm wrote:
> On Wed, 16 May 2007, Shawn O. Pearce wrote:
> >
> >The other thing is kernel.org should really try to encourage the
> >folks with repositories there to try and share against one master
> >repository, so the poor OS has a better chance at holding the bulk
> >of linux-2.6.git in buffer cache.
>
> do you mean more precisely share against one object store or do you really
> mean repository?
Sorry, I did mean "object store". ;-)
Repository is insanity, as the refs and tags namespaces are suddenly
shared. What a nightmare that would become.
--
Shawn.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-17 1:26 ` Shawn O. Pearce
@ 2007-05-17 1:45 ` Shawn O. Pearce
2007-05-17 12:36 ` Theodore Tso
0 siblings, 1 reply; 47+ messages in thread
From: Shawn O. Pearce @ 2007-05-17 1:45 UTC (permalink / raw)
To: david; +Cc: Johannes Schindelin, Martin Langhoff, Jan Hudec, git
"Shawn O. Pearce" <spearce@spearce.org> wrote:
> david@lang.hm wrote:
> > On Wed, 16 May 2007, Shawn O. Pearce wrote:
> > >
> > >The other thing is kernel.org should really try to encourage the
> > >folks with repositories there to try and share against one master
> > >repository, so the poor OS has a better chance at holding the bulk
> > >of linux-2.6.git in buffer cache.
> >
> > do you mean more precisely share against one object store or do you really
> > mean repository?
>
> Sorry, I did mean "object store". ;-)
And even there, I don't mean symlink objects to a shared database,
I mean use the objects/info/alternates file to point to the shared,
read-only location.
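In other words, something along these lines for each fork (the paths here are invented; the point is just that alternates is a plain-text file with one object-directory path per line):

from pathlib import Path

fork = Path("/repos/developer/linux-2.6.git")
shared_objects = "/repos/shared/linux-2.6.git/objects"   # read-only shared store

alt = fork / "objects" / "info" / "alternates"
alt.parent.mkdir(parents=True, exist_ok=True)
existing = alt.read_text().splitlines() if alt.exists() else []
if shared_objects not in existing:
    with alt.open("a") as f:
        f.write(shared_objects + "\n")    # object lookups now fall through here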
It's not perfect. The hotter parts of the object database are almost
always the recent stuff, as that's what people are actively trying
to fetch, or are using as a base when they are trying to fetch from
someone else. The hotter parts are also probably too new to be
in the shared store offered by kernel.org admins, which means you
cannot get good IO buffering. Back to the current set of problems.
A single shared object directory that everyone can write new files
into, but cannot modify or delete from, would help that problem quite
a bit. But it opens up huge problems about pruning, as there is no
way to perform garbage collection on that database without scanning
every ref on the system, and that's just not simply possible on a
busy system like kernel.org.
--
Shawn.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-17 1:03 ` Shawn O. Pearce
2007-05-17 1:04 ` david
@ 2007-05-17 3:45 ` Nicolas Pitre
2007-05-17 10:48 ` Johannes Schindelin
1 sibling, 1 reply; 47+ messages in thread
From: Nicolas Pitre @ 2007-05-17 3:45 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: Johannes Schindelin, Martin Langhoff, Jan Hudec, git
On Wed, 16 May 2007, Shawn O. Pearce wrote:
> Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > Don't forget that those 10% probably do not do you the favour to be in
> > large chunks. Chances are that _every_ _single_ wanted object is separate
> > from the others.
>
> That's completely possible. Assuming the objects even are packed
> in the first place. Its very unlikely that you would be able to
> fetch very large of a range from an existing packfile, you would be
> submitting most of your range requests for very very small sections.
Well, in the commit objects case you're likely to have a bunch of them
all contiguous.
For tree and blob objects it is less likely.
And of course there is the question of deltas for which you might or
might not have the base object locally already.
Still... I wonder if this could actually be workable. A typical daily
update on the Linux kernel repository might consist of a couple hundred
or a few thousand objects. It could still be faster to fetch parts of
a pack than the whole pack if the size difference is above a certain
threshold. It is certainly not worse than fetching loose objects.
Things would be pretty horrid if you think of fetching a commit object,
parsing it to find out what tree object to fetch, then parsing that tree
object to find out what other objects to fetch, and so on.
But if you only take the approach of fetching the pack index files,
finding out about the objects that the remote has that are not available
locally, and then fetching all those objects from within pack files
without even looking at them (except for deltas), then it should be
possible to issue a couple of requests in parallel and possibly get decent
performance. And if it turns out that more than, say, 70% of a
particular pack is to be fetched (you can determine that up front), then
it might be decided to fetch the whole pack.
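That decision is cheap to make once the remote .idx has been read; a toy sketch (the offsets and sizes would come from the index, and the 70% cut-off is arbitrary):

def plan_fetch(missing_spans, pack_size, threshold=0.70):
    """missing_spans: (offset, length) pairs of wanted objects in one pack."""
    if not missing_spans:
        return ("nothing", [])
    if sum(length for _, length in missing_spans) > threshold * pack_size:
        return ("whole-pack", None)
    # coalesce touching/overlapping spans so fewer range requests are issued
    spans = sorted(missing_spans)
    merged = [spans[0]]
    for off, length in spans[1:]:
        last_off, last_len = merged[-1]
        if off <= last_off + last_len:
            merged[-1] = (last_off, max(last_len, off + length - last_off))
        else:
            merged.append((off, length))
    return ("ranges", merged)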
There is no way to sensibly keep those objects packed on the receiving
end of course, but storing them as loose objects and repacking them
afterwards should be just fine.
Of course you'll get objects from branches in the remote repository you
might not be interested in, but that's a price to pay for such a hack.
On average the overhead shouldn't be that big anyway if branches within
a repository are somewhat related.
I think this is something worth experimenting with.
Nicolas
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-17 3:45 ` Nicolas Pitre
@ 2007-05-17 10:48 ` Johannes Schindelin
2007-05-17 14:41 ` Nicolas Pitre
0 siblings, 1 reply; 47+ messages in thread
From: Johannes Schindelin @ 2007-05-17 10:48 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Shawn O. Pearce, Martin Langhoff, Jan Hudec, git
Hi,
On Wed, 16 May 2007, Nicolas Pitre wrote:
> Still... I wonder if this could be actually workable. A typical daily
> update on the Linux kernel repository might consist of a couple hundreds
> or a few tousands objects. This could still be faster to fetch parts of
> a pack than the whole pack if the size difference is above a certain
> treshold. It is certainly not worse than fetching loose objects.
>
> Things would be pretty horrid if you think of fetching a commit object,
> parsing it to find out what tree object to fetch, then parse that tree
> object to find out what other objects to fetch, and so on.
>
> But if you only take the approach of fetching the pack index files,
> finding out about the objects that the remote has that are not available
> locally, and then fetching all those objects from within pack files
> without even looking at them (except for deltas), then it should be
> possible to issue a couple requests in parallel and possibly have decent
> performances. And if it turns out that more than, say, 70% of a
> particular pack is to be fetched (you can determine that up front), then
> it might be decided to fetch the whole pack.
>
> There is no way to sensibly keep those objects packed on the receiving
> end of course, but storing them as loose objects and repacking them
> afterwards should be just fine.
>
> Of course you'll get objects from branches in the remote repository you
> might not be interested in, but that's a price to pay for such a hack.
> On average the overhead shouldn't be that big anyway if branches within
> a repository are somewhat related.
>
> I think this is something worth experimenting.
I am a bit wary about that, because it is so complex. IMHO a cgi which
gets, say, up to a hundred refs (maybe something like ref~0, ref~1, ref~2,
ref~4, ref~8, ref~16, ... for the refs), and then makes a bundle for that
case on the fly, is easier to do.
Of course, as with all cgi scripts, you have to make sure that DOS attacks
have a low probability of success.
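The client side of that is simple enough to sketch; something like the following could produce the sampled ref list to send along (git rev-parse is real, but the sampling scheme itself is just illustrative):

import subprocess

def sampled_haves(refs, max_depth=16):
    haves = []
    for ref in refs:
        depth = 0
        while depth <= max_depth:
            spec = "%s~%d" % (ref, depth)
            out = subprocess.run(["git", "rev-parse", "--verify", "--quiet", spec],
                                 capture_output=True, text=True)
            if out.returncode != 0:
                break                                  # ran off the start of history
            haves.append(out.stdout.strip())
            depth = 1 if depth == 0 else depth * 2     # 0, 1, 2, 4, 8, 16, ...
    return haves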
Ciao,
Dscho
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-17 0:52 ` Johannes Schindelin
2007-05-17 1:03 ` Shawn O. Pearce
@ 2007-05-17 11:28 ` Matthieu Moy
2007-05-17 13:10 ` Martin Langhoff
1 sibling, 1 reply; 47+ messages in thread
From: Matthieu Moy @ 2007-05-17 11:28 UTC (permalink / raw)
To: git
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> Hi,
>
> On Thu, 17 May 2007, Martin Langhoff wrote:
>
>> On 5/16/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
>> > On Wed, 16 May 2007, Martin Langhoff wrote:
>> > > Do the indexes have enough info to use them with http ranges? It'd be
>> > > chunkier than a smart protocol, but it'd still work with dumb servers.
>> > It would not be really performant, would it? Besides, not all Web servers
>> > speak HTTP/1.1...
>>
>> Performant compared to downloading a huge packfile to get 10% of it?
>> Sure! It'd probably take a few trips, and you'd end up fetching 20% of
>> the file, still better than 100%.
>
> Don't forget that those 10% probably do not do you the favour to be in
> large chunks. Chances are that _every_ _single_ wanted object is separate
> from the others.
FYI, bzr uses HTTP range requests, and the introduction of this
feature led to a significant performance improvement for them (bzr is
more dumb-protocol oriented than git is, so that's really important
there). They have this "index file+data file" system too, so you
download the full index file, and then send an HTTP range request to
get only the relevant parts of the data file.
The thing is, AIUI, they don't send N range requests to get N chunks,
but one HTTP request, requesting the N ranges at a time, and get the N
chunks as a whole (IIRC, a kind of MIME-encoded response from the
server). So, you pay the price of a longer HTTP request, but not the
price of N network round-trips.
That's surely not as efficient as anything smart on the server, but
might really help for the cases where the server is /not/ smart.
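On the wire that is a single request with a comma-separated Range header; a small sketch (the URL and offsets are placeholders, and parsing the multipart reply body is left out):

import urllib.request

def fetch_ranges(url, spans):
    """spans: list of (start, end) byte ranges, inclusive."""
    ranges = ",".join("%d-%d" % (s, e) for s, e in spans)
    req = urllib.request.Request(url, headers={"Range": "bytes=" + ranges})
    with urllib.request.urlopen(req) as resp:
        # a server that supports this answers 206 with a multipart/byteranges body
        return resp.status, resp.headers.get("Content-Type"), resp.read()

# e.g. fetch_ranges("http://example.org/repo.git/objects/pack/pack-1234.pack",
#                   [(4096, 8191), (131072, 132095)])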
--
Matthieu
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-17 1:45 ` Shawn O. Pearce
@ 2007-05-17 12:36 ` Theodore Tso
0 siblings, 0 replies; 47+ messages in thread
From: Theodore Tso @ 2007-05-17 12:36 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: david, Johannes Schindelin, Martin Langhoff, Jan Hudec, git
On Wed, May 16, 2007 at 09:45:42PM -0400, Shawn O. Pearce wrote:
> It's not perfect. The hotter parts of the object database are almost
> always the recent stuff, as that's what people are actively trying
> to fetch, or are using as a base when they are trying to fetch from
> someone else. The hotter parts are also probably too new to be
> in the shared store offered by kernel.org admins, which means you
> cannot get good IO buffering. Back to the current set of problems.
Actually, as long as objects/info/alternates is pointing at Linus's
kernel.org tree, I would think that it should work relatively well,
since everyone is normally basing their work on top of his tree as a
starting point.
- Ted
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-15 20:10 Smart fetch via HTTP? Jan Hudec
` (2 preceding siblings ...)
2007-05-16 5:25 ` Martin Langhoff
@ 2007-05-17 12:40 ` Petr Baudis
2007-05-17 12:48 ` Matthieu Moy
2007-05-17 20:26 ` Jan Hudec
3 siblings, 2 replies; 47+ messages in thread
From: Petr Baudis @ 2007-05-17 12:40 UTC (permalink / raw)
To: Jan Hudec; +Cc: git
Hi,
On Tue, May 15, 2007 at 10:10:06PM CEST, Jan Hudec wrote:
> Did anyone already think about fetching over HTTP working similarly to the
> That is rather than reading the raw content of the repository, there would be
> a CGI script (could be integrated to gitweb), that would negotiate what the
> client needs and then generate and send a single pack with it.
frankly, I'm not that excited. I'm not disputing that this would be
useful, but I have my doubts about just how *much* use it would be - I'm
not so sure the set of users affected is really all that large. So I'm
just cooling people down here. ;-))
> Mercurial and bzr both have this option. It would IMO have three benefits:
> - Fast access for people behind paranoid firewalls, that only let http and
> https (you can tunnel anything through, but only to port 443) through.
How many users really have this problem? I'm not so sure. There are
certainly some, but enough for this to be a viable argument?
> - Can be run on shared machine. If you have web space on machine shared
> by many people, you can set up your own gitweb, but cannot/are not allowed
> to start your own network server for git native protocol.
You need to have CGI-enabled hosting, set up the CGI script etc. -
overall, the setup is about as complicated as a git-daemon setup, so
it's not a "zero-setup" solution anymore.
Again, I'm not sure just how many people are in the situation that
they can run real CGI (not just PHP) but not git-daemon.
> - Less things to set up. If you are setting up gitweb anyway, you'd not need
> to set up additional thing for providing fetch access.
Except, well, how do you "set it up"? You need to make sure
git-update-server-info is run, yes, but that shouldn't be a problem (I'm
not so sure if git does this for you automagically - Cogito would...).
I think 95% of people don't set up gitweb.cgi either for their small
HTTP repositories. :-)
Then again, it's not that it would be really technically complicated -
adding "give me a bundle" support to gitweb should be pretty easy.
However, this support has some "social" costs as well: no compatibility
with older git versions, support cost, confusion between dumb HTTP and
gitweb HTTP transports, more lack of motivation for improving dumb HTTP
transport...
--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Ever try. Ever fail. No matter. // Try again. Fail again. Fail better.
-- Samuel Beckett
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-17 12:40 ` Petr Baudis
@ 2007-05-17 12:48 ` Matthieu Moy
2007-05-18 18:27 ` Linus Torvalds
2007-05-17 20:26 ` Jan Hudec
1 sibling, 1 reply; 47+ messages in thread
From: Matthieu Moy @ 2007-05-17 12:48 UTC (permalink / raw)
To: git
Petr Baudis <pasky@suse.cz> writes:
>> Mercurial and bzr both have this option. It would IMO have three benefits:
>> - Fast access for people behind paranoid firewalls, that only let http and
>> https (you can tunnel anything through, but only to port 443) through.
>
> How many users really have this problem? I'm not so sure.
Many (if not most?) of the people working in a big company, I'd say.
Yeah, it sucks, but people who have used a paranoid firewall with a
no-less-paranoid and broken proxy understand what I mean.
>> - Can be run on shared machine. If you have web space on machine shared
>> by many people, you can set up your own gitweb, but cannot/are not allowed
>> to start your own network server for git native protocol.
>
> You need to have CGI-enabled hosting, set up the CGI script etc. -
> overally, the setup is similarly complicated as git-daemon setup, so
> it's not "zero-setup" solution anymore.
>
> Again, I'm not sure just how many people are in the situation that
> they can run real CGI (not just PHP) but not git-daemon.
Any volunteer to write a full-PHP version of git? ;-)
--
Matthieu
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-17 11:28 ` Matthieu Moy
@ 2007-05-17 13:10 ` Martin Langhoff
2007-05-17 13:47 ` Johannes Schindelin
0 siblings, 1 reply; 47+ messages in thread
From: Martin Langhoff @ 2007-05-17 13:10 UTC (permalink / raw)
To: git
On 5/17/07, Matthieu Moy <Matthieu.Moy@imag.fr> wrote:
> FYI, bzr uses HTTP range requests, and the introduction of this
> feature lead to significant performance improvement for them (bzr is
> more dumb-protocol oriented than git is, so that's really important
> there). They have this "index file+data file" system too, so you
> download the full index file, and then send an HTTP range request to
> get only the relevant parts of the data file.
That's the kind of thing I was imagining. Between the index and an
additional "index-supplement-for-dumb-protocols" maintained by
update-server-info, http ranges can be bent to our evil purposes.
Of course it won't be as network-efficient as the git proto, or even
as the git-over-cgi proto, but it'll surely be server-cpu-and-memory
efficient. And people will benefit from it without having to do any
additional setup.
It might be hard to come up with a usable approach to http ranges. But
I do think it's worth considering carefully.
cheers,
m
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-17 13:10 ` Martin Langhoff
@ 2007-05-17 13:47 ` Johannes Schindelin
2007-05-17 14:05 ` Matthieu Moy
` (2 more replies)
0 siblings, 3 replies; 47+ messages in thread
From: Johannes Schindelin @ 2007-05-17 13:47 UTC (permalink / raw)
To: Martin Langhoff; +Cc: git
Hi,
[I missed this mail, because Matthieu culled the Cc list again]
On Fri, 18 May 2007, Martin Langhoff wrote:
> On 5/17/07, Matthieu Moy <Matthieu.Moy@imag.fr> wrote:
>
> > FYI, bzr uses HTTP range requests, and the introduction of this
> > feature lead to significant performance improvement for them (bzr is
> > more dumb-protocol oriented than git is, so that's really important
> > there). They have this "index file+data file" system too, so you
> > download the full index file, and then send an HTTP range request to
> > get only the relevant parts of the data file.
>
> That's the kind of thing I was imagining. Between the index and an
> additional "index-supplement-for-dumb-protocols" maintained by
> update-server-info, http ranges can be bent to our evil purposes.
>
> Of course it won't be as network-efficient as the git proto, or even
> as the git-over-cgi proto, but it'll surely be server-cpu-and-memory
> efficient. And people will benefit from it without having to do any
> additional setup.
Of course, the problem is that only the server can know beforehand which
objects are needed. Imagine this:
X - Y - Z
 \
  A
Client has "X", wants "Z", but not "A". Client needs "Y" and "Z". But
client cannot know that it needs "Y" before getting "Z", except if the
server says so.
If you have a solution for that problem, please enlighten me: I don't.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-17 13:47 ` Johannes Schindelin
@ 2007-05-17 14:05 ` Matthieu Moy
2007-05-17 14:09 ` Martin Langhoff
2007-05-17 14:50 ` Nicolas Pitre
2 siblings, 0 replies; 47+ messages in thread
From: Matthieu Moy @ 2007-05-17 14:05 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Martin Langhoff, git
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> Hi,
>
> [I missed this mail, because Matthieu culled the Cc list again]
Sorry about that, misconfiguration of my mailer. I didn't find time
to solve it before.
OTOH, since most people actually complain when you Cc them on a
mailing list, the choice "To Cc or not to Cc" has no universal
solution ;-).
--
Matthieu
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-17 13:47 ` Johannes Schindelin
2007-05-17 14:05 ` Matthieu Moy
@ 2007-05-17 14:09 ` Martin Langhoff
2007-05-17 15:01 ` Nicolas Pitre
2007-05-17 23:14 ` Jakub Narebski
2007-05-17 14:50 ` Nicolas Pitre
2 siblings, 2 replies; 47+ messages in thread
From: Martin Langhoff @ 2007-05-17 14:09 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: git
On 5/18/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> If you have a solution for that problem, please enlighten me: I don't.
Ok - worst case scenario - have a minimal hints file that tells me the
ranges to fetch all commits and all trees. To reduce that, add to the
hints file data naming the hashes (or even better - offsets) of the
delta chains that contain the commits+trees relevant to all the heads -
minus 10, 20, 30, 40 commits and 1, 2, 4, 8 and 16 days.
So there's a good chance the client can get the commits+trees needed
efficiently. For blobs, all you need is the index to mark the delta
chains you need.
cheers,
m
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-17 10:48 ` Johannes Schindelin
@ 2007-05-17 14:41 ` Nicolas Pitre
2007-05-17 15:24 ` Martin Langhoff
2007-05-17 20:04 ` Jan Hudec
0 siblings, 2 replies; 47+ messages in thread
From: Nicolas Pitre @ 2007-05-17 14:41 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Shawn O. Pearce, Martin Langhoff, Jan Hudec, git
[-- Attachment #1: Type: TEXT/PLAIN, Size: 4719 bytes --]
On Thu, 17 May 2007, Johannes Schindelin wrote:
> Hi,
>
> On Wed, 16 May 2007, Nicolas Pitre wrote:
>
> > Still... I wonder if this could be actually workable. A typical daily
> > update on the Linux kernel repository might consist of a couple hundred
> > or a few thousand objects. This could still be faster to fetch parts of
> > a pack than the whole pack if the size difference is above a certain
> > threshold. It is certainly not worse than fetching loose objects.
> >
> > Things would be pretty horrid if you think of fetching a commit object,
> > parsing it to find out what tree object to fetch, then parse that tree
> > object to find out what other objects to fetch, and so on.
> >
> > But if you only take the approach of fetching the pack index files,
> > finding out about the objects that the remote has that are not available
> > locally, and then fetching all those objects from within pack files
> > without even looking at them (except for deltas), then it should be
> > possible to issue a couple requests in parallel and possibly have decent
> > performances. And if it turns out that more than, say, 70% of a
> > particular pack is to be fetched (you can determine that up front), then
> > it might be decided to fetch the whole pack.
> >
> > There is no way to sensibly keep those objects packed on the receiving
> > end of course, but storing them as loose objects and repacking them
> > afterwards should be just fine.
> >
> > Of course you'll get objects from branches in the remote repository you
> > might not be interested in, but that's a price to pay for such a hack.
> > On average the overhead shouldn't be that big anyway if branches within
> > a repository are somewhat related.
> >
> > I think this is something worth experimenting.
>
> I am a bit wary about that, because it is so complex. IMHO a cgi which
> gets, say, up to a hundred refs (maybe something like ref~0, ref~1, ref~2,
> ref~4, ref~8, ref~16, ... for the refs), and then makes a bundle for that
> case on the fly, is easier to do.
And if you have 1) the permission and 2) the CPU power to execute such a
cgi on the server and obviously 3) the knowledge to set it up properly,
then why aren't you running the Git daemon in the first place? After
all, they both boil down to running git-pack-objects and sending out the
result. I don't think such a solution really buys much.
On the other hand, if the client does all the work and provides the
server with a list of ranges within a pack it wants to be sent, then you
simply have zero special setup to perform on the hosting server and you
keep the server load down due to not running pack-objects there. That,
at least, is different enough from the Git daemon to be worth
considering. Not only does it provide an advantage to those who cannot
do anything but http out of their segregated network, but it also
provides many advantages on the server side too, while the cgi approach
doesn't.
And actually finding out the list of objects the remote has that you
don't have is not that complex. It could go as follows:
1) Fetch every .idx files the remote has.
2) From those .idx files, keep only a list of objects that are unknown
locally. A good starting point for doing this really efficiently is
the code for git-pack-redundant.
3) From the .idx files we got in (1), create a reverse index to get each
object's size in the remote pack. The code to do this already exists
in builtin-pack-objects.c.
4) With the list of missing objects from (2) along with their offset and
size within a given pack file, fetch those objects from the remote
server. Either perform multiple requests in parallel, or as someone
mentioned already, provide the server with a list of ranges you want
to be sent.
5) Store the received objects as loose objects locally. If a given
object is a delta, verify if its base is available locally, or if it
is listed amongst those objects to be fetched from the server. If
not, add it to the list. In most cases, delta base objects will be
objects already listed to be fetched anyway. To greatly simplify
things, the loose delta object type from 2 years ago could be revived
(commit 91d7b8afc2) since a repack will get rid of them.
6) Repeat (4) and (5) until everything has been fetched.
7) Run git-pack-objects with the list of fetched objects.
Et voilà. Oh, and of course update your local refs from the remote's.
Actually there is nothing really complex in the above operations. And
with this the server side remains really simple with no special setup
nor extra load beyond the simple serving of file content.
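Just to make the shape of it concrete, here is a very rough sketch of steps (1)-(4), with the .idx parsing and the delta/loose-object handling of steps (5)-(7) stubbed out; every helper name and URL below is a placeholder, not existing git code:

import urllib.request

def http_get(url, headers=None):
    req = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def dumb_range_fetch(base_url, remote_packs, local_objects, parse_idx,
                     whole_pack_threshold=0.70):
    """base_url: e.g. http://example.org/repo.git/objects/pack/
    remote_packs: pack names taken from objects/info/packs
    parse_idx: callable turning raw .idx bytes into {sha1: (offset, size)}"""
    fetched = {}
    for pack in remote_packs:
        entries = parse_idx(http_get(base_url + pack + ".idx"))      # step 1
        missing = {sha: span for sha, span in entries.items()
                   if sha not in local_objects}                      # steps 2-3
        if not missing:
            continue
        pack_url = base_url + pack + ".pack"
        pack_size = sum(size for _, size in entries.values())
        if sum(size for _, size in missing.values()) > whole_pack_threshold * pack_size:
            fetched[pack] = http_get(pack_url)          # cheaper to take it all
            continue
        for sha, (offset, size) in missing.items():                  # step 4
            fetched[sha] = http_get(pack_url,
                {"Range": "bytes=%d-%d" % (offset, offset + size - 1)})
    return fetched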
Nicolas
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-17 13:47 ` Johannes Schindelin
2007-05-17 14:05 ` Matthieu Moy
2007-05-17 14:09 ` Martin Langhoff
@ 2007-05-17 14:50 ` Nicolas Pitre
2 siblings, 0 replies; 47+ messages in thread
From: Nicolas Pitre @ 2007-05-17 14:50 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Martin Langhoff, git
On Thu, 17 May 2007, Johannes Schindelin wrote:
> Hi,
>
> [I missed this mail, because Matthieu culled the Cc list again]
>
> On Fri, 18 May 2007, Martin Langhoff wrote:
>
> > On 5/17/07, Matthieu Moy <Matthieu.Moy@imag.fr> wrote:
> >
> > > FYI, bzr uses HTTP range requests, and the introduction of this
> > > feature lead to significant performance improvement for them (bzr is
> > > more dumb-protocol oriented than git is, so that's really important
> > > there). They have this "index file+data file" system too, so you
> > > download the full index file, and then send an HTTP range request to
> > > get only the relevant parts of the data file.
> >
> > That's the kind of thing I was imagining. Between the index and an
> > additional "index-supplement-for-dumb-protocols" maintained by
> > update-server-info, http ranges can be bent to our evil purposes.
> >
> > Of course it won't be as network-efficient as the git proto, or even
> > as the git-over-cgi proto, but it'll surely be server-cpu-and-memory
> > efficient. And people will benefit from it without having to do any
> > additional setup.
>
> Of course, the problem is that only the server can know beforehand which
> objects are needed.
But the whole idea is that we don't care.
> Imagine this:
>
> X - Y - Z
>  \
>   A
>
>
> Client has "X", wants "Z", but not "A". Client needs "Y" and "Z". But
> client cannot know that it needs "Y" before getting "Z", except if the
> server says so.
>
> If you have a solution for that problem, please enlighten me: I don't.
We're talking about a _dumb_ protocol here. If you want something
fancy, just use the Git daemon.
Otherwise, you'll simply get everything the remote has that you don't
have, including A.
In practice this shouldn't be a problem because people tend to have
clean repositories on machines they want their stuff to be published,
meaning that those public repos are usually the result of pushes, hence
they contain only the minimum set of needed objects. Of course you get
every branch and not only a particular one, but that's the price to
pay with a dumb protocol.
Nicolas
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-17 14:09 ` Martin Langhoff
@ 2007-05-17 15:01 ` Nicolas Pitre
2007-05-17 23:14 ` Jakub Narebski
1 sibling, 0 replies; 47+ messages in thread
From: Nicolas Pitre @ 2007-05-17 15:01 UTC (permalink / raw)
To: Martin Langhoff; +Cc: Johannes Schindelin, git
On Fri, 18 May 2007, Martin Langhoff wrote:
> On 5/18/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > If you have a solution for that problem, please enlighten me: I don't.
>
> Ok - worst case scenario - have a minimal hints file that tells me the
> ranges to fetch all commits and all trees. To reduce that Add to the
> hints file data to name the hashes (or even better - offsets) for the
> delta chains that contain commits+trees relevant to all the heads -
> minus 10, 20, 30, 40 commits and 1,2,4,8 and 16 days.
NO !
This is unreliable, unnecessary, and actually kills the beauty of
the solution's simplicity.
You get updates for every branch the remote has, period.
No server side extra files, no guesses, no arbitrary ranges, no backward
compatibility issues, no crap!
Nicolas
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-17 14:41 ` Nicolas Pitre
@ 2007-05-17 15:24 ` Martin Langhoff
2007-05-17 15:34 ` Nicolas Pitre
2007-05-17 20:04 ` Jan Hudec
1 sibling, 1 reply; 47+ messages in thread
From: Martin Langhoff @ 2007-05-17 15:24 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Johannes Schindelin, Shawn O. Pearce, Jan Hudec, git
On 5/18/07, Nicolas Pitre <nico@cam.org> wrote:
> And if you have 1) the permission and 2) the CPU power to execute such a
> cgi on the server and obviously 3) the knowledge to set it up properly,
> then why aren't you running the Git daemon in the first place?
And you probably _are_ running git daemon. But some clients may be on
shitty connections that only allow http. That's one of the scenarios
we're discussing.
cheers,
m
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-17 15:24 ` Martin Langhoff
@ 2007-05-17 15:34 ` Nicolas Pitre
0 siblings, 0 replies; 47+ messages in thread
From: Nicolas Pitre @ 2007-05-17 15:34 UTC (permalink / raw)
To: Martin Langhoff; +Cc: Johannes Schindelin, Shawn O. Pearce, Jan Hudec, git
On Fri, 18 May 2007, Martin Langhoff wrote:
> On 5/18/07, Nicolas Pitre <nico@cam.org> wrote:
> > And if you have 1) the permission and 2) the CPU power to execute such a
> > cgi on the server and obviously 3) the knowledge to set it up properly,
> > then why aren't you running the Git daemon in the first place?
>
> And you probably _are_ running git daemon. But some clients may be on
> shitty connections that only allow http. That's one of the scenarios
> we're discussing.
That's not what I'm disputing at all.
I'm disputing the virtue of an HTTP solution involving a cgi with Git
bundles vs an HTTP solution involving static file range serving. The
clients on shitty connections don't care either ways.
Nicolas
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-17 14:41 ` Nicolas Pitre
2007-05-17 15:24 ` Martin Langhoff
@ 2007-05-17 20:04 ` Jan Hudec
2007-05-17 20:31 ` Nicolas Pitre
2007-05-18 9:01 ` Johannes Schindelin
1 sibling, 2 replies; 47+ messages in thread
From: Jan Hudec @ 2007-05-17 20:04 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Johannes Schindelin, Shawn O. Pearce, Martin Langhoff, git
[-- Attachment #1: Type: text/plain, Size: 4738 bytes --]
On Thu, May 17, 2007 at 10:41:37 -0400, Nicolas Pitre wrote:
> On Thu, 17 May 2007, Johannes Schindelin wrote:
> > On Wed, 16 May 2007, Nicolas Pitre wrote:
> And if you have 1) the permission and 2) the CPU power to execute such a
> cgi on the server and obviously 3) the knowledge to set it up properly,
> then why aren't you running the Git daemon in the first place? After
> all, they both boil down to running git-pack-objects and sending out the
> result. I don't think such a solution really buys much.
Yes, it does. I had 2 accounts where I could run CGI, but not a separate
server, at university while I studied, and now I can get the same on a friend's
server. Neither of them would probably be ok for serving a larger, busy git
repository, but something smaller accessed by several people is OK. I think
this is quite common for university students.
Of course your suggestion, which moves the logic to the client side, is a good one,
but even a cgi with the logic on the server side would help in some situations.
> On the other hand, if the client does all the work and provides the
> server with a list of ranges within a pack it wants to be sent, then you
> simply have zero special setup to perform on the hosting server and you
> keep the server load down due to not running pack-objects there. That,
> at least, is different enough from the Git daemon to be worth
> considering. Not only does it provide an advantage to those who cannot
> do anything but http out of their segregated network, but it also
> provide many advantages on the server side too while the cgi approach
> doesn't.
>
> And actually finding out the list of objects the remote has that you
> don't have is not that complex. It could go as follows:
>
> 1) Fetch every .idx files the remote has.
... for git it's 1.2 MiB. And that definitely isn't a huge source tree.
Of course the local side could remember which indices it already saw during
a previous fetch from that location and not re-fetch them.
A slight problem is that git-repack normally recombines everything into
a single pack, so the index would have to be re-fetched again anyway.
> 2) From those .idx files, keep only a list of objects that are unknown
> locally. A good starting point for doing this really efficiently is
> the code for git-pack-redundant.
>
> 3) From the .idx files we got in (1), create a reverse index to get each
> object's size in the remote pack. The code to do this already exists
> in builtin-pack-objects.c.
>
> 4) With the list of missing objects from (2) along with their offset and
> size within a given pack file, fetch those objects from the remote
> server. Either perform multiple requests in parallel, or as someone
> mentioned already, provide the server with a list of ranges you want
> to be sent.
Does the git server really have to do so much beyond that? I didn't look at
the algorithm that decides what deltas should be based on, but depending on
that it might (or might not) be possible to prove that the client has everything
needed to understand the objects if the server sends them as it currently has them.
> 5) Store the received objects as loose objects locally. If a given
> object is a delta, verify if its base is available locally, or if it
> is listed amongst those objects to be fetched from the server. If
> not, add it to the list. In most cases, delta base objects will be
> objects already listed to be fetched anyway. To greatly simplify
> things, the loose delta object type from 2 years ago could be revived
> (commit 91d7b8afc2) since a repack will get rid of them.
>
> 6 Repeat (4) and (5) until everything has been fetched.
Unless I am really seriously missing something, there is no point in
repeating. For each delta you need to unpack, either:
- you have the base => ok.
- you don't have it, but the server does =>
then it's already in the fetch set calculated in (2).
- you don't have it and neither does the server =>
the repository at the server is corrupted and you can't fix it.
> 7) Run git-pack-objects with the list of fetched objects.
>
> Et voilà. Oh, and of course update your local refs from the remote's.
>
> Actually there is nothing really complex in the above operations. And
> with this the server side remains really simple with no special setup
> nor extra load beyond the simple serving of file content.
On the other hand the amount of data transferred is larger than with the git
server approach, because at least the indices have to be transferred in their
entirety. So each approach has its own advantages.
--
Jan 'Bulb' Hudec <bulb@ucw.cz>
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-17 12:40 ` Petr Baudis
2007-05-17 12:48 ` Matthieu Moy
@ 2007-05-17 20:26 ` Jan Hudec
2007-05-17 20:38 ` Nicolas Pitre
1 sibling, 1 reply; 47+ messages in thread
From: Jan Hudec @ 2007-05-17 20:26 UTC (permalink / raw)
To: Petr Baudis; +Cc: git
[-- Attachment #1: Type: text/plain, Size: 2165 bytes --]
On Thu, May 17, 2007 at 14:40:06 +0200, Petr Baudis wrote:
> On Tue, May 15, 2007 at 10:10:06PM CEST, Jan Hudec wrote:
> > - Can be run on shared machine. If you have web space on machine shared
> > by many people, you can set up your own gitweb, but cannot/are not allowed
> > to start your own network server for git native protocol.
>
> You need to have CGI-enabled hosting, set up the CGI script etc. -
> overally, the setup is similarly complicated as git-daemon setup, so
> it's not "zero-setup" solution anymore.
>
> Again, I'm not sure just how many people are in the situation that
> they can run real CGI (not just PHP) but not git-daemon.
A particular case would be a group of students wanting to publish their
software project (I mean the PRG023 or equivalent). Private computers in the
hostel are not allowed to serve anything, so they'd use some of the lab
servers (e.g. artax, ss1000...). All of them allow full CGI, but running
daemons is forbidden.
> > - Less things to set up. If you are setting up gitweb anyway, you'd not need
> > to set up additional thing for providing fetch access.
>
> Except, well, how do you "set it up"? You need to make sure
> git-update-server-info is run, yes, but that shouldn't be a problem (I'm
> not so sure if git does this for you automagically - Cogito would...).
No. If it worked similarly to git-upload-pack, only over http, it would work
without update-server-info, no?
> I think 95% of people don't set up gitweb.cgi either for their small
> HTTP repositories. :-)
>
> Then again, it's not that it would be really technically complicated -
> adding "give me a bundle" support to gitweb should be pretty easy.
> However, this support has some "social" costs as well: no compatibility
> with older git versions, support cost, confusion between dumb HTTP and
> gitweb HTTP transports, more lack of motivation for improving dumb HTTP
> transport...
The dumb transport is definitely useful. Extending it to use ranges if
possible would be useful as well (and maybe more than upload-pack-over-http).
--
Jan 'Bulb' Hudec <bulb@ucw.cz>
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-17 20:04 ` Jan Hudec
@ 2007-05-17 20:31 ` Nicolas Pitre
2007-05-17 21:00 ` david
2007-05-18 9:01 ` Johannes Schindelin
1 sibling, 1 reply; 47+ messages in thread
From: Nicolas Pitre @ 2007-05-17 20:31 UTC (permalink / raw)
To: Jan Hudec; +Cc: Johannes Schindelin, Shawn O. Pearce, Martin Langhoff, git
On Thu, 17 May 2007, Jan Hudec wrote:
> On Thu, May 17, 2007 at 10:41:37 -0400, Nicolas Pitre wrote:
> > On Thu, 17 May 2007, Johannes Schindelin wrote:
> > > On Wed, 16 May 2007, Nicolas Pitre wrote:
> > And if you have 1) the permission and 2) the CPU power to execute such a
> > cgi on the server and obviously 3) the knowledge to set it up properly,
> > then why aren't you running the Git daemon in the first place? After
> > all, they both boil down to running git-pack-objects and sending out the
> > result. I don't think such a solution really buys much.
>
> Yes, it does. I had 2 accounts where I could run CGI, but not separate
> server, at university while I studied and now I can get the same on friend's
> server. Neither of them would probably be ok for serving larger busy git
> repository, but something smaller accessed by several people is OK. I think
> this is quite common for university students.
>
> Of course your suggestion which moves the logic to client-side is a good one,
> but even the cgi with logic on server side would help in some situations.
You could simply wrap git-bundle within a CGI script. That is certainly easy
enough.
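A minimal sketch of such a wrapper -- untested, with made-up paths, just to
show how little there is to it:

    #!/bin/sh
    # bundle.cgi: hypothetical CGI that hands out a bundle of the whole repository
    REPO=/home/user/public_git/project.git    # example location
    TMP=$(mktemp) || exit 1
    git --git-dir="$REPO" bundle create "$TMP" --all >/dev/null 2>&1
    echo "Content-Type: application/octet-stream"
    echo
    cat "$TMP"
    rm -f "$TMP"

The client downloads the bundle over plain HTTP and then fetches from it as
from any other repository.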
> > On the other hand, if the client does all the work and provides the
> > server with a list of ranges within a pack it wants to be sent, then you
> > simply have zero special setup to perform on the hosting server and you
> > keep the server load down due to not running pack-objects there. That,
> > at least, is different enough from the Git daemon to be worth
> > considering. Not only does it provide an advantage to those who cannot
> > do anything but http out of their segregated network, but it also
> > provide many advantages on the server side too while the cgi approach
> > doesn't.
> >
> > And actually finding out the list of objects the remote has that you
> > don't have is not that complex. It could go as follows:
> >
> > 1) Fetch every .idx files the remote has.
>
> ... for git it's 1.2 MiB. And that definitely isn't a huge source tree.
> Of course the local side could remember which indices it already saw during
> previous fetch from that location and not re-fetch them.
Right. The name of the pack/index plus its time stamp can be cached.
If the remote doesn't repack too often then the overhead would be
minimal.
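(A cheap way to do that on the client is curl's time-conditional fetch; the
pack name below is just a placeholder:

    # refetch the index only if the server's copy is newer than the cached one
    curl -s -z pack-XXXX.idx -o pack-XXXX.idx.new \
        http://remote.host/repo.git/objects/pack/pack-XXXX.idx

and only replace the cached copy if something was actually downloaded.)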
> > 2) From those .idx files, keep only a list of objects that are unknown
> > locally. A good starting point for doing this really efficiently is
> > the code for git-pack-redundant.
> >
> > 3) From the .idx files we got in (1), create a reverse index to get each
> > object's size in the remote pack. The code to do this already exists
> > in builtin-pack-objects.c.
> >
> > 4) With the list of missing objects from (2) along with their offset and
> > size within a given pack file, fetch those objects from the remote
> > server. Either perform multiple requests in parallel, or as someone
> > mentioned already, provide the server with a list of ranges you want
> > to be sent.
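In practice (4) needs nothing more exotic than ordinary HTTP Range requests
against the static pack file, e.g. (offsets invented for illustration, taken
from the .idx data gathered in (2) and (3)):

    curl -s -r 1200-4523 -o obj-0001.raw \
        http://remote.host/repo.git/objects/pack/pack-XXXX.pack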
>
> Does the git server really have to do so much beyond that?
Yes it does. The real thing performs a full object reachability walk, and
only the objects that are needed for the wanted branch(es) are sent in a
custom pack, meaning that the data transfer is really optimal.
> > 5) Store the received objects as loose objects locally. If a given
> > object is a delta, verify if its base is available locally, or if it
> > is listed amongst those objects to be fetched from the server. If
> > not, add it to the list. In most cases, delta base objects will be
> > objects already listed to be fetched anyway. To greatly simplify
> > things, the loose delta object type from 2 years ago could be revived
> > (commit 91d7b8afc2) since a repack will get rid of them.
> >
> > 6) Repeat (4) and (5) until everything has been fetched.
>
> Unless I am really seriously missing something, there is no point in
> repeating. For each delta you need to unpack, either:
> - you have it => ok.
> - you don't have it, but the server does =>
> but than it's already in the fetch set calculated in 2.
> - you don't have it and nor does server =>
> the repository at server is corrupted and you can't fix it.
You're right of course.
Nicolas
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-17 20:26 ` Jan Hudec
@ 2007-05-17 20:38 ` Nicolas Pitre
2007-05-18 17:35 ` Jan Hudec
0 siblings, 1 reply; 47+ messages in thread
From: Nicolas Pitre @ 2007-05-17 20:38 UTC (permalink / raw)
To: Jan Hudec; +Cc: Petr Baudis, git
On Thu, 17 May 2007, Jan Hudec wrote:
> A particular case would be a group of students wanting to publish their
> software project (I mean the PRG023 or equivalent). Private computers in the
> hostel are not allowed to serve anything, so they'd use some of the lab
> servers (e.g. artax, ss1000, ...). All of them allow full CGI, but running
> daemons is forbidden.
And wouldn't the admin authority for those lab servers be amenable to
installing a Git daemon service? That'd be a much better solution to me.
Nicolas
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-17 20:31 ` Nicolas Pitre
@ 2007-05-17 21:00 ` david
0 siblings, 0 replies; 47+ messages in thread
From: david @ 2007-05-17 21:00 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Jan Hudec, Johannes Schindelin, Shawn O. Pearce, Martin Langhoff,
git
On Thu, 17 May 2007, Nicolas Pitre wrote:
> On Thu, 17 May 2007, Jan Hudec wrote:
>
>> On Thu, May 17, 2007 at 10:41:37 -0400, Nicolas Pitre wrote:
>>> On Thu, 17 May 2007, Johannes Schindelin wrote:
>>>> On Wed, 16 May 2007, Nicolas Pitre wrote:
>>> And if you have 1) the permission and 2) the CPU power to execute such a
>>> cgi on the server and obviously 3) the knowledge to set it up properly,
>>> then why aren't you running the Git daemon in the first place? After
>>> all, they both boil down to running git-pack-objects and sending out the
>>> result. I don't think such a solution really buys much.
>>
>> Yes, it does. I had 2 accounts where I could run CGI, but not separate
>> server, at university while I studied and now I can get the same on friend's
>> server. Neither of them would probably be ok for serving larger busy git
>> repository, but something smaller accessed by several people is OK. I think
>> this is quite common for university students.
>>
>> Of course your suggestion which moves the logic to client-side is a good one,
>> but even the cgi with logic on server side would help in some situations.
>
> You could simply wrap git-bundle within a cgi. That is certainly easy
> enough.
Isn't this (or something very similar) exactly what we want for a smart
fetch via HTTP?
After all, we're completely in control of the client software, and the
usual reason for HTTP-only access is on the client side rather than the
server side. So HTTP access that wraps the git protocol in HTTP would make
life much cleaner for lots of people.
There are a few cases where all you have is static web space, but I don't
think it's worth trying to optimize that too much, as you still have the
safety issues to worry about.
David Lang
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-17 14:09 ` Martin Langhoff
2007-05-17 15:01 ` Nicolas Pitre
@ 2007-05-17 23:14 ` Jakub Narebski
1 sibling, 0 replies; 47+ messages in thread
From: Jakub Narebski @ 2007-05-17 23:14 UTC (permalink / raw)
To: git
Martin Langhoff wrote:
> On 5/18/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
>> If you have a solution for that problem, please enlighten me: I don't.
>
> Ok - worst case scenario - have a minimal hints file that tells me the
> ranges to fetch all commits and all trees. To reduce that, add to the
> hints file data naming the hashes (or even better - offsets) for the
> delta chains that contain commits+trees relevant to all the heads -
> minus 10, 20, 30, 40 commits and 1,2,4,8 and 16 days.
>
> So there's a good chance the client can get the commits+trees needed
> efficiently. For blobs, all you need is the index to mark the delta
> chains you need.
By the way, I think we should always get the whole delta chain, unless we
are absolutely sure that we have the base object(s) in the repo.
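Locally, git-verify-pack shows the chains in question: with -v it prints each
deltified object's delta depth and base object, plus a chain-length histogram
at the end (the pack name below is only a placeholder):

    git verify-pack -v .git/objects/pack/pack-XXXX.idx | less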
--
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-17 20:04 ` Jan Hudec
2007-05-17 20:31 ` Nicolas Pitre
@ 2007-05-18 9:01 ` Johannes Schindelin
2007-05-18 17:51 ` Jan Hudec
1 sibling, 1 reply; 47+ messages in thread
From: Johannes Schindelin @ 2007-05-18 9:01 UTC (permalink / raw)
To: Jan Hudec; +Cc: Nicolas Pitre, Shawn O. Pearce, Martin Langhoff, git
[-- Attachment #1: Type: TEXT/PLAIN, Size: 1567 bytes --]
Hi,
On Thu, 17 May 2007, Jan Hudec wrote:
> On Thu, May 17, 2007 at 10:41:37 -0400, Nicolas Pitre wrote:
>
> > And if you have 1) the permission and 2) the CPU power to execute such
> > a cgi on the server and obviously 3) the knowledge to set it up
> > properly, then why aren't you running the Git daemon in the first
> > place? After all, they both boil down to running git-pack-objects and
> > sending out the result. I don't think such a solution really buys
> > much.
>
> Yes, it does. I had 2 accounts where I could run CGI, but not separate
> server, at university while I studied and now I can get the same on
> friend's server. Neither of them would probably be ok for serving larger
> busy git repository, but something smaller accessed by several people is
> OK. I think this is quite common for university students.
1) This has nothing to do with the way the repo is served, but with how much
you advertise it. The load will not be lower just because you use a CGI
script.
2) You say yourself that git-daemon would have less impact on the load:
> > [...]
> >
> > Et voilà. Oh, and of course update your local refs from the
> > remote's.
> >
> > Actually there is nothing really complex in the above operations. And
> > with this the server side remains really simple with no special setup
> > nor extra load beyond the simple serving of file content.
>
> On the other hand the amount of data transferred is larger than with the
> git server approach, because at least the indices have to be transferred
> in their entirety.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-17 20:38 ` Nicolas Pitre
@ 2007-05-18 17:35 ` Jan Hudec
0 siblings, 0 replies; 47+ messages in thread
From: Jan Hudec @ 2007-05-18 17:35 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Petr Baudis, git
[-- Attachment #1: Type: text/plain, Size: 715 bytes --]
On Thu, May 17, 2007 at 16:38:41 -0400, Nicolas Pitre wrote:
> On Thu, 17 May 2007, Jan Hudec wrote:
>
> > A particular case would be a group of students wanting to publish their
> > software project (I mean the PRG023 or equivalent). Private computers in the
> > hostel are not allowed to serve anything, so they'd use some of the lab
> servers (e.g. artax, ss1000, ...). All of them allow full CGI, but running
> daemons is forbidden.
>
> And wouldn't the admin authority for those lab servers be amenable to
> install a Git daemon service? That'd be a much better solution to me.
It would. But it would really depend on the administrator's goodwill.
--
Jan 'Bulb' Hudec <bulb@ucw.cz>
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-18 9:01 ` Johannes Schindelin
@ 2007-05-18 17:51 ` Jan Hudec
0 siblings, 0 replies; 47+ messages in thread
From: Jan Hudec @ 2007-05-18 17:51 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Nicolas Pitre, Shawn O. Pearce, Martin Langhoff, git
[-- Attachment #1: Type: text/plain, Size: 2906 bytes --]
On Fri, May 18, 2007 at 10:01:52 +0100, Johannes Schindelin wrote:
> Hi,
>
> On Thu, 17 May 2007, Jan Hudec wrote:
>
> > On Thu, May 17, 2007 at 10:41:37 -0400, Nicolas Pitre wrote:
> >
> > > And if you have 1) the permission and 2) the CPU power to execute such
> > > a cgi on the server and obviously 3) the knowledge to set it up
> > > properly, then why aren't you running the Git daemon in the first
> > > place? After all, they both boil down to running git-pack-objects and
> > > sending out the result. I don't think such a solution really buys
> > > much.
> >
> > Yes, it does. I had 2 accounts where I could run CGI, but not separate
> > server, at university while I studied and now I can get the same on
> > friend's server. Neither of them would probably be ok for serving larger
> > busy git repository, but something smaller accessed by several people is
> > OK. I think this is quite common for university students.
>
> 1) This has nothing to do with the way the repo is served, but how much
> you advertise it. The load will not be lower, just because you use a CGI
> script.
That won't. But that was never the purpose of the "smart CGI". The purpose was
to minimize the bandwidth usage (and connectivity is still not so cheap that
you wouldn't care) while still working over HTTP, either because the users need
to access it from behind a firewall or because the administrator is not willing
to set up git-daemon for you, whereas a CGI you can run yourself.
> 2) you say yourself that git-daemon would have less impact on the load:
NO, I didn't -- at least not in the paragraph below.
In the paragraph below I said that *network* use will never be as good with a
*dumb* solution as it can be with a smart solution, no matter whether it is
over a special protocol or HTTP.
---
Of course it would be less efficient in both CPU and network load, because
there is the overhead of the web server and of the HTTP headers.
Actually, I like the ranges solution. If accompanied by a repack strategy that
does not pack everything together, but instead creates packs with a limited
number of objects -- so that the indices don't exceed a configurable size, say
64kB -- it could be not much less efficient for the network and would have the
advantage of working without the ability to execute CGI.
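(For scale: a version-1 pack index is a 1 kB fan-out table plus 24 bytes per
object and 40 bytes of trailing checksums, so a 64kB index corresponds to
roughly (65536 - 1024 - 40) / 24, i.e. about 2700 objects.)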
> > > [...]
> > >
> > > Et voilà. Oh, and of course update your local refs from the
> > > remote's.
> > >
> > > Actually there is nothing really complex in the above operations. And
> > > with this the server side remains really simple with no special setup
> > > nor extra load beyond the simple serving of file content.
> >
> > On the other hand the amount of data transferred is larger than with the
> > git server approach, because at least the indices have to be transferred
> > in their entirety.
--
Jan 'Bulb' Hudec <bulb@ucw.cz>
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-17 12:48 ` Matthieu Moy
@ 2007-05-18 18:27 ` Linus Torvalds
2007-05-18 18:33 ` alan
` (2 more replies)
0 siblings, 3 replies; 47+ messages in thread
From: Linus Torvalds @ 2007-05-18 18:27 UTC (permalink / raw)
To: Matthieu Moy; +Cc: git
On Thu, 17 May 2007, Matthieu Moy wrote:
>
> Many (if not most?) of the people working in a big company, I'd say.
> Yeah, it sucks, but people having used a paranoid firewall with a
> not-less-paranoid and broken proxy understand what I mean.
Well, we could try to support the git protocol over port 80..
IOW, it's probably easier to try to get people to use
git clone git://some.host:80/project
and just run git-daemon on port 80, than it is to try to set up magic cgi
scripts etc.
Doing that with virtual hosts etc should be pretty trivial. Much more so
than trying to make a git-cgi script.
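Something like this should be all that's needed on the serving side (flags
straight from git-daemon, paths made up; port 80 of course needs root or a
port redirect):

    git daemon --verbose --export-all --base-path=/srv/git --port=80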
And yes, I do realize that in theory you can have http-aware firewalls
that expect to see the normal http sequences in the first few packets in
order to pass things through, but I seriously doubt it's very common.
Linus
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-18 18:27 ` Linus Torvalds
@ 2007-05-18 18:33 ` alan
2007-05-18 19:01 ` Joel Becker
2007-05-19 0:50 ` david
2 siblings, 0 replies; 47+ messages in thread
From: alan @ 2007-05-18 18:33 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Matthieu Moy, git
On Fri, 18 May 2007, Linus Torvalds wrote:
>
>
> On Thu, 17 May 2007, Matthieu Moy wrote:
>>
>> Many (if not most?) of the people working in a big company, I'd say.
>> Yeah, it sucks, but people having used a paranoid firewall with a
>> not-less-paranoid and broken proxy understand what I mean.
>
> Well, we could try to support the git protocol over port 80..
>
> IOW, it's probably easier to try to get people to use
>
> git clone git://some.host:80/project
>
> and just run git-daemon on port 80, than it is to try to set up magic cgi
> scripts etc.
Except some filtering firewalls try to strip content from data (like
ActiveX controls).
Running git on port 53 will bypass pretty much every firewall out there.
(If you want to learn how to bypass an overactive firewall, talk to a
bunch of teenagers at a school with an aggressive porn filter.)
--
"ANSI C says access to the padding fields of a struct is undefined.
ANSI C also says that struct assignment is a memcpy. Therefore struct
assignment in ANSI C is a violation of ANSI C..."
- Alan Cox
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-18 18:27 ` Linus Torvalds
2007-05-18 18:33 ` alan
@ 2007-05-18 19:01 ` Joel Becker
2007-05-18 20:06 ` Matthieu Moy
2007-05-18 20:13 ` Linus Torvalds
2007-05-19 0:50 ` david
2 siblings, 2 replies; 47+ messages in thread
From: Joel Becker @ 2007-05-18 19:01 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Matthieu Moy, git
On Fri, May 18, 2007 at 11:27:22AM -0700, Linus Torvalds wrote:
> Well, we could try to support the git protocol over port 80..
>
> IOW, it's probably easier to try to get people to use
>
> git clone git://some.host:80/project
>
> and just run git-daemon on port 80, than it is to try to set up magic cgi
> scripts etc.
Can we teach the git-daemon to parse the HTTP headers
(specifically, the URL) and return the appropriate HTTP response?
> And yes, I do realize that in theory you can have http-aware firewalls
> that expect to see the normal http sequences in the first few packets in
> order to pass things through, but I seriously doubt it's very common.
It's not about packet scanning, it's about GET vs CONNECT. If
the proxy allows GET but not CONNECT, it's going to forward the HTTP
protocol to the server, and git-daemon is going to see "GET /project
HTTP/1.1" as its first input. Now, perhaps we can cook that up behind
some apache so that apache handles vhosting the URL, then calls
git-daemon which can take the stdin. So we'd be doing POST, not GET.
On the other hand, if the proxy allows CONNECT, there is no
scanning for HTTP sequences done by the proxy. It just allows all raw
data (as it figures you're doing SSL).
A normal company needs to have their firewall allow CONNECT to
9418. Then git proxying over HTTP is possible to a standard git-daemon.
Joel
--
"The first requisite of a good citizen in this republic of ours
is that he shall be able and willing to pull his weight."
- Theodore Roosevelt
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-18 19:01 ` Joel Becker
@ 2007-05-18 20:06 ` Matthieu Moy
2007-05-18 20:13 ` Linus Torvalds
1 sibling, 0 replies; 47+ messages in thread
From: Matthieu Moy @ 2007-05-18 20:06 UTC (permalink / raw)
To: Joel Becker; +Cc: Linus Torvalds, git
Joel Becker <Joel.Becker@oracle.com> writes:
> A normal company needs to have their firewall allow CONNECT to
> 9418. Then git proxying over HTTP is possible to a standard
> git-daemon.
443 should work too (that's HTTPS, and the proxy can't filter it,
since this would be a man-in-the-middle attack).
--
Matthieu
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-18 19:01 ` Joel Becker
2007-05-18 20:06 ` Matthieu Moy
@ 2007-05-18 20:13 ` Linus Torvalds
2007-05-18 21:56 ` Joel Becker
1 sibling, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2007-05-18 20:13 UTC (permalink / raw)
To: Joel Becker; +Cc: Matthieu Moy, git
On Fri, 18 May 2007, Joel Becker wrote:
>
> It's not about packet scanning, it's about GET vs CONNECT. If
> the proxy allows GET but not CONNECT, it's going to forward the HTTP
> protocol to the server, and git-daemon is going to see "GET /project
> HTTP/1.1" as its first input. Now, perhaps we can cook that up behind
> some apache so that apache handles vhosting the URL, then calls
> git-daemon which can take the stdin. So we'd be doing POST, not GET.
If it's _just_ the initial GET/CONNECT strings, yeah, we could probably
easily make the git-daemon just ignore them. That shouldn't be a problem.
But if there's anything *else* required, it gets uglier much more quickly.
Linus
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-18 20:13 ` Linus Torvalds
@ 2007-05-18 21:56 ` Joel Becker
2007-05-20 10:30 ` Jan Hudec
0 siblings, 1 reply; 47+ messages in thread
From: Joel Becker @ 2007-05-18 21:56 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Matthieu Moy, git
On Fri, May 18, 2007 at 01:13:36PM -0700, Linus Torvalds wrote:
> If it's _just_ the initial GET/CONNECT strings, yeah, we could probably
> easily make the git-daemon just ignore them. That shouldn't be a problem.
>
> But if there's anything *else* required, it gets uglier much more quickly.
With CONNECT, there isn't anything. That is, your
GIT_PROXY_COMMAND handles talking to the proxy, then gives git itself a
raw data pipe. My proxy allows CONNECT to 9418, and that's how I use it
today.
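For the record, such a proxy command can be tiny; corkscrew is just one of
several tools that speak CONNECT (the proxy host and port here are made up):

    #!/bin/sh
    # git invokes this with the destination host and port as $1 and $2;
    # point GIT_PROXY_COMMAND (or core.gitproxy) at this script.
    exec corkscrew proxy.example.com 8080 "$1" "$2"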
If you tried to make POST work (it'd be POST, not GET, as you
need to connect up the sending side), either apache would have to front
it for us, or "git-daemon --http" would have to accept the HTTP headers
before the input, and output a proper HTTP response before sending
output. Seeing the headers would allow us to vhost, even.
Hmm, but the proxy may not allow two-way communication. Does
the git protocol have more than one round-trip? That is:
Client:
POST http://server.git.host:80/projects/thisproject HTTP/1.1
Host: server.git.host
fetch-pack <sha1>
EOF
Server:
HTTP/1.1 200 OK
<data>
EOF
should work, I'd think.
Joel
--
"Ninety feet between bases is perhaps as close as man has ever come
to perfection."
- Red Smith
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-18 18:27 ` Linus Torvalds
2007-05-18 18:33 ` alan
2007-05-18 19:01 ` Joel Becker
@ 2007-05-19 0:50 ` david
2007-05-19 3:58 ` Shawn O. Pearce
2 siblings, 1 reply; 47+ messages in thread
From: david @ 2007-05-19 0:50 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Matthieu Moy, git
On Fri, 18 May 2007, Linus Torvalds wrote:
> On Thu, 17 May 2007, Matthieu Moy wrote:
>>
>> Many (if not most?) of the people working in a big company, I'd say.
>> Yeah, it sucks, but people having used a paranoid firewall with a
>> not-less-paranoid and broken proxy understand what I mean.
>
> Well, we could try to support the git protocol over port 80..
>
> IOW, it's probably easier to try to get people to use
>
> git clone git://some.host:80/project
>
> and just run git-daemon on port 80, than it is to try to set up magic cgi
> scripts etc.
>
> Doing that with virtual hosts etc should be pretty trivial. Much more so
> than trying to make a git-cgi script.
>
> And yes, I do realize that in theory you can have http-aware firewalls
> that expect to see the normal http sequences in the first few packets in
> order to pass things through, but I seriously doubt it's very common.
They are actually more common than you think, and getting even more common
thanks to IE.
When a person browsing a hostile website can let that website take
over the machine, demand is created for 'malware filters' for HTTP; to
do this the firewalls need to decode the HTTP, and in the process they limit
you to only doing legitimate HTTP.
It's also the case that the companies that have firewalls paranoid enough
not to let you get to the git port are highly likely to be paranoid enough
to have a malware-filtering HTTP firewall.
David Lang
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-19 0:50 ` david
@ 2007-05-19 3:58 ` Shawn O. Pearce
2007-05-19 4:58 ` david
0 siblings, 1 reply; 47+ messages in thread
From: Shawn O. Pearce @ 2007-05-19 3:58 UTC (permalink / raw)
To: david; +Cc: Linus Torvalds, Matthieu Moy, git
david@lang.hm wrote:
> when a person browsing a hostile website will allow that website to take
> over the machine the demand is created for 'malware filters' for http, to
> do this the firewalls need to decode the http, and in the process limit
> you to only doing legitimate http.
>
> it's also the case that the companies that have firewalls paranoid enough
> to not let you get to the git port are highly likely to be paranoid enough
> to have a malware filtering http firewall.
I'm behind such a filter, and fetch git.git via HTTP just to keep
my work system current with Junio. ;-)
Of course we're really really really paranoid about our firewall,
but are also so paranoid that any other web browser *except*
Microsoft Internet Explorer is thought to be a security risk and
is more-or-less banned from the network.
The kicker is some of our developers create public websites, where
testing your local webpage with Firefox and Safari is pretty much
required... but those browsers still aren't as trusted as IE and
require special clearances. *shakes head*
We're pretty much limited to:
*) Running the native Git protocol over SSL, where the remote system
is answering to port 443. It may not need to be HTTP at all,
but it probably has to smell enough like SSL to get it through
the malware filter. Oh, what's that? The filter cannot actually
filter the SSL data? Funny! ;-)
*) Using a single POST upload followed by response from server,
formatted with minimal HTTP headers. The real problem as people
have pointed out is not the HTTP headers, but it is the single
exchange.
One might think you could use HTTP pipelining to try and get a
bi-directional channel with the remote system, but I'm sure proxy
servers are not required to reuse the same TCP connection to the
remote HTTP server when the inside client pipelines a new request.
So any sort of hack on pipelining won't work.
If you really want a stateful exchange you have to treat HTTP as
though it were IP, but with reliable (and much more expensive)
packet delivery, and make the Git daemon keep track of the protocol
state with the client. Yes, that means that when the client suddenly
goes away and doesn't tell you he went away you also have to garbage
collect your state. No nice messages from your local kernel. :-(
--
Shawn.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-19 3:58 ` Shawn O. Pearce
@ 2007-05-19 4:58 ` david
0 siblings, 0 replies; 47+ messages in thread
From: david @ 2007-05-19 4:58 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: Linus Torvalds, Matthieu Moy, git
On Fri, 18 May 2007, Shawn O. Pearce wrote:
> david@lang.hm wrote:
>> when a person browsing a hostile website will allow that website to take
>> over the machine the demand is created for 'malware filters' for http, to
>> do this the firewalls need to decode the http, and in the process limit
>> you to only doing legitimate http.
>>
>> it's also the case that the companies that have firewalls paranoid enough
>> to not let you get to the git port are highly likely to be paranoid enough
>> to have a malware filtering http firewall.
>
> I'm behind such a filter, and fetch git.git via HTTP just to keep
> my work system current with Junio. ;-)
>
> Of course we're really really really paranoid about our firewall,
> but are also so paranoid that any other web browser *except*
> Microsoft Internet Explorer is thought to be a security risk and
> is more-or-less banned from the network.
>
> The kicker is some of our developers create public websites, where
> testing your local webpage with Firefox and Safari is pretty much
> required... but those browsers still aren't as trusted as IE and
> require special clearances. *shakes head*
This isn't paranoia, this is just bullheadedness.
> We're pretty much limited to:
>
> *) Running the native Git protocol SSL, where the remote system
> is answering to port 443. It may not need to be HTTP at all,
> but it probably has to smell enough like SSL to get it through
> the malware filter. Oh, what's that? The filter cannot actually
> filter the SSL data? Funny! ;-)
We're actually paranoid enough to have devices that do man-in-the-middle
decryption for some sites, and they are given copies of the encryption keys
that other sites (and browsers) use so that they can decrypt the SSL and
check it. I admit that this is far more paranoid than almost all sites
though :-)
> *) Using a single POST upload followed by response from server,
> formatted with minimal HTTP headers. The real problem as people
> have pointed out is not the HTTP headers, but it is the single
> exchange.
> If you really want a stateful exchange you have to treat HTTP as
> though it were IP, but with reliable (and much more expensive)
> packet delivery, and make the Git daemon keep track of the protocol
> state with the client. Yes, that means that when the client suddenly
> goes away and doesn't tell you he went away you also have to garbage
> collect your state. No nice messages from your local kernel. :-(
Unfortunately, you are right about this.
David Lang
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Smart fetch via HTTP?
2007-05-18 21:56 ` Joel Becker
@ 2007-05-20 10:30 ` Jan Hudec
0 siblings, 0 replies; 47+ messages in thread
From: Jan Hudec @ 2007-05-20 10:30 UTC (permalink / raw)
To: Joel Becker; +Cc: Linus Torvalds, Matthieu Moy, git
[-- Attachment #1: Type: text/plain, Size: 2000 bytes --]
On Fri, May 18, 2007 at 14:56:07 -0700, Joel Becker wrote:
> On Fri, May 18, 2007 at 01:13:36PM -0700, Linus Torvalds wrote:
> > If it's _just_ the initial GET/CONNECT strings, yeah, we could probably
> > easily make the git-daemon just ignore them. That shouldn't be a problem.
> >
> > But if there's anything *else* required, it gets uglier much more quickly.
>
> With CONNECT, there isn't anything. That is, your
> GIT_PROXY_COMMAND handles talking to the proxy, then gives git itself a
> raw data pipe. My proxy allows CONNECT to 9418, and that's how I use it
> today.
Yes. CONNECT is easy. However, many companies only allow CONNECT to 443
(not that it's much more secure than allowing it anywhere, but at least they
have to block CONNECT to 25 to block sending spam).
> If you tried to make POST work (It'd be POST, not GET, as you
> need to connect up the sending side), either apache would have to front
> it for us, or "git-daemon --http" would have to accept the HTTP headers
> on before the input, and output a proper HTTP response before sending
> output. Seeing the headers would allow for us to vhost, even.
> Hmm, but the proxy may not allow two-way communication. Does
> the git protocol have more than one round-trip? That is:
>
> Client:
> POST http://server.git.host:80/projects/thisproject HTTP/1.1
> Host: server.git.host
>
> fetch-pack <sha1>
> EOF
>
> Server:
> HTTP/1.1 200 OK
>
> <data>
> EOF
>
> should work, I'd think.
Well, that does not require git at all -- apache can handle this all right.
But it's not network-efficient. To be network-efficient, it is necessary to
negotiate the list of objects that need to be sent. And that requires more
than one round-trip. Additionally, the current git protocol is streaming --
the client sends data without waiting for the server. So it would require a
slightly different protocol over HTTP.
--
Jan 'Bulb' Hudec <bulb@ucw.cz>
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 47+ messages in thread
end of thread, other threads:[~2007-05-20 10:30 UTC | newest]
Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-05-15 20:10 Smart fetch via HTTP? Jan Hudec
2007-05-15 22:30 ` A Large Angry SCM
2007-05-15 23:29 ` Shawn O. Pearce
2007-05-16 0:38 ` Junio C Hamano
2007-05-16 5:25 ` Martin Langhoff
2007-05-16 11:33 ` Johannes Schindelin
2007-05-16 21:26 ` Martin Langhoff
2007-05-16 21:54 ` Jakub Narebski
2007-05-17 0:52 ` Johannes Schindelin
2007-05-17 1:03 ` Shawn O. Pearce
2007-05-17 1:04 ` david
2007-05-17 1:26 ` Shawn O. Pearce
2007-05-17 1:45 ` Shawn O. Pearce
2007-05-17 12:36 ` Theodore Tso
2007-05-17 3:45 ` Nicolas Pitre
2007-05-17 10:48 ` Johannes Schindelin
2007-05-17 14:41 ` Nicolas Pitre
2007-05-17 15:24 ` Martin Langhoff
2007-05-17 15:34 ` Nicolas Pitre
2007-05-17 20:04 ` Jan Hudec
2007-05-17 20:31 ` Nicolas Pitre
2007-05-17 21:00 ` david
2007-05-18 9:01 ` Johannes Schindelin
2007-05-18 17:51 ` Jan Hudec
2007-05-17 11:28 ` Matthieu Moy
2007-05-17 13:10 ` Martin Langhoff
2007-05-17 13:47 ` Johannes Schindelin
2007-05-17 14:05 ` Matthieu Moy
2007-05-17 14:09 ` Martin Langhoff
2007-05-17 15:01 ` Nicolas Pitre
2007-05-17 23:14 ` Jakub Narebski
2007-05-17 14:50 ` Nicolas Pitre
2007-05-17 12:40 ` Petr Baudis
2007-05-17 12:48 ` Matthieu Moy
2007-05-18 18:27 ` Linus Torvalds
2007-05-18 18:33 ` alan
2007-05-18 19:01 ` Joel Becker
2007-05-18 20:06 ` Matthieu Moy
2007-05-18 20:13 ` Linus Torvalds
2007-05-18 21:56 ` Joel Becker
2007-05-20 10:30 ` Jan Hudec
2007-05-19 0:50 ` david
2007-05-19 3:58 ` Shawn O. Pearce
2007-05-19 4:58 ` david
2007-05-17 20:26 ` Jan Hudec
2007-05-17 20:38 ` Nicolas Pitre
2007-05-18 17:35 ` Jan Hudec