git.vger.kernel.org archive mirror
* New Feature wanted: Is it possible to let git clone continue last break point?
       [not found] <CAEZo+gfKVY-YgMjd=bEYzRV4-460kqDik-yVcQ9Xs=DoCZOMDg@mail.gmail.com>
@ 2011-10-31  2:28 ` netroby
  2011-10-31  4:00   ` Tay Ray Chuan
                     ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: netroby @ 2011-10-31  2:28 UTC (permalink / raw)
  To: Git Mail List

Is it possible to let git clone continue from the last break point?
When we git clone a very large project from the web, we may face an
interruption, and then we must clone it from zero again.

It is a bad experience for users with slow connections.

Please help us out.

We need git clone to continue from the last break point.

netroby
----------------------------------
http://www.netroby.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: New Feature wanted: Is it possible to let git clone continue last break point?
  2011-10-31  2:28 ` New Feature wanted: Is it possible to let git clone continue last break point? netroby
@ 2011-10-31  4:00   ` Tay Ray Chuan
  2011-10-31  9:07   ` Jonathan Nieder
  2011-10-31  9:14   ` Jakub Narebski
  2 siblings, 0 replies; 18+ messages in thread
From: Tay Ray Chuan @ 2011-10-31  4:00 UTC (permalink / raw)
  To: netroby; +Cc: Git Mail List

This is a hard problem that hasn't been solved. Year after year, it's
a GSoC proposal...

What you can do is use --depth 1 with your git-clone; then "extend"
the depth incrementally.
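
For example (only a rough sketch; the URL and the depth values below
are placeholders):

  git clone --depth 1 git://example.com/big-project.git
  cd big-project
  git fetch --depth 50      # deepen the history a little at a time
  git fetch --depth 500
  git fetch --depth 5000    # ...and so on, until you have enough history
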
--
Cheers,
Ray Chuan

On Mon, Oct 31, 2011 at 10:28 AM, netroby <hufeng1987@gmail.com> wrote:
> Is it possible to let git clone continue from the last break point?
> When we git clone a very large project from the web, we may face an
> interruption, and then we must clone it from zero again.
>
> It is a bad experience for users with slow connections.
>
> Please help us out.
>
> We need git clone to continue from the last break point.
>
> netroby
> ----------------------------------
> http://www.netroby.com
> --
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: New Feature wanted: Is it possible to let git clone continue last break point?
  2011-10-31  2:28 ` New Feature wanted: Is it possible to let git clone continue last break point? netroby
  2011-10-31  4:00   ` Tay Ray Chuan
@ 2011-10-31  9:07   ` Jonathan Nieder
  2011-10-31  9:16     ` netroby
  2011-11-02 22:06     ` Jeff King
  2011-10-31  9:14   ` Jakub Narebski
  2 siblings, 2 replies; 18+ messages in thread
From: Jonathan Nieder @ 2011-10-31  9:07 UTC (permalink / raw)
  To: netroby; +Cc: Git Mail List, Tomas Carnecky, Jeff King

Hi,

netroby wrote:

> Is it possible to let git clone continue from the last break point?
> When we git clone a very large project from the web, we may face an
> interruption, and then we must clone it from zero again.

You might find [1] useful as a stopgap (thanks, Tomas!).

Something like Jeff's "priming the well with a server-specified
bundle" proposal[2] might be a good way to make the same trick
transparent to clients in the future.

Even with that, later fetches, which grab a pack generated on the fly
to only contain the objects not already fetched, are generally not
resumable.  Overcoming that would presumably require larger protocol
changes, and I don't know of anyone working on it.  (My workaround
when in a setup where this mattered was to use the old-fashioned
"dumb" http protocol.  It worked fine.)

Hope that helps,
Jonathan

[1] http://thread.gmane.org/gmane.comp.version-control.git/181380
[2] http://thread.gmane.org/gmane.comp.version-control.git/164569/focus=164701
    http://thread.gmane.org/gmane.comp.version-control.git/168906/focus=168912

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: New Feature wanted: Is it possible to let git clone continue last break point?
  2011-10-31  2:28 ` New Feature wanted: Is it possible to let git clone continue last break point? netroby
  2011-10-31  4:00   ` Tay Ray Chuan
  2011-10-31  9:07   ` Jonathan Nieder
@ 2011-10-31  9:14   ` Jakub Narebski
  2011-10-31 12:49     ` Michael Schubert
  2 siblings, 1 reply; 18+ messages in thread
From: Jakub Narebski @ 2011-10-31  9:14 UTC (permalink / raw)
  To: netroby; +Cc: Git Mail List

netroby <hufeng1987@gmail.com> writes:

> Is it possible to let git clone continue from the last break point?
> When we git clone a very large project from the web, we may face an
> interruption, and then we must clone it from zero again.
> 
> It is a bad experience for users with slow connections.
> 
> Please help us out.
> 
> We need git clone to continue from the last break point.

Resuming "git clone" is not currently possible in Git, and it would be
difficult to add such feature to Git; there were several attempts and
neither succeeded.

What you can do is generate a starter bundle out of your repository
(using "git bundle"), and serve this file via HTTP / FTP / BitTorrent,
i.e. some resumable transport.  Then you "git clone <bundle file>",
fix up configuration, and fetch the rest since bundle creation.

Though this is possible only if it is your project... or can ask
project administrator to provide bundle.
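
For example, a very rough sketch of that workflow (repository names,
URLs and transports here are only placeholders):

  # on the serving side, or done by the project admin:
  git bundle create repo.bundle HEAD --all
  # publish repo.bundle over some resumable transport, e.g. plain HTTP

  # on the client:
  wget -c http://example.com/repo.bundle   # -c resumes a broken download
  git clone repo.bundle project
  cd project
  git remote set-url origin git://example.com/project.git
  git fetch origin                         # catch up since the bundle was made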

-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: New Feature wanted: Is it possible to let git clone continue last break point?
  2011-10-31  9:07   ` Jonathan Nieder
@ 2011-10-31  9:16     ` netroby
  2011-11-02 22:06     ` Jeff King
  1 sibling, 0 replies; 18+ messages in thread
From: netroby @ 2011-10-31  9:16 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Git Mail List, Tomas Carnecky, Jeff King

For example: I want to clone the FreeBSD and Linux kernel git repos to
view their source code.

git://github.com/freebsd/freebsd.git

git://github.com/torvalds/linux.git


They are big projects, so the clones are huge.

Thanks for your tips; I will give them a try.


I am currently on a 256K ADSL connection, so it is very unstable while a clone is in progress.

netroby
----------------------------------
http://www.netroby.com



On Mon, Oct 31, 2011 at 17:07, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Hi,
>
> netroby wrote:
>
>> Is it possible to let git clone continue from the last break point?
>> When we git clone a very large project from the web, we may face an
>> interruption, and then we must clone it from zero again.
>
> You might find [1] useful as a stopgap (thanks, Tomas!).
>
> Something like Jeff's "priming the well with a server-specified
> bundle" proposal[2] might be a good way to make the same trick
> transparent to clients in the future.
>
> Even with that, later fetches, which grab a pack generated on the fly
> to only contain the objects not already fetched, are generally not
> resumable.  Overcoming that would presumably require larger protocol
> changes, and I don't know of anyone working on it.  (My workaround
> when in a setup where this mattered was to use the old-fashioned
> "dumb" http protocol.  It worked fine.)
>
> Hope that helps,
> Jonathan
>
> [1] http://thread.gmane.org/gmane.comp.version-control.git/181380
> [2] http://thread.gmane.org/gmane.comp.version-control.git/164569/focus=164701
>    http://thread.gmane.org/gmane.comp.version-control.git/168906/focus=168912
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: New Feature wanted: Is it possible to let git clone continue last break point?
  2011-10-31  9:14   ` Jakub Narebski
@ 2011-10-31 12:49     ` Michael Schubert
  0 siblings, 0 replies; 18+ messages in thread
From: Michael Schubert @ 2011-10-31 12:49 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: netroby, Git Mail List

On 10/31/2011 10:14 AM, Jakub Narebski wrote:
> netroby <hufeng1987@gmail.com> writes:
> 
>> Is it possible to let git clone continue from the last break point?
>> When we git clone a very large project from the web, we may face an
>> interruption, and then we must clone it from zero again.
>>
>> It is a bad experience for users with slow connections.
>>
>> Please help us out.
>>
>> We need git clone to continue from the last break point.
> 
> Resuming "git clone" is not currently possible in Git, and it would be
> difficult to add such feature to Git; there were several attempts and
> neither succeeded.
> 
> What you can do is generate a starter bundle out of your repository
> (using "git bundle"), and serve this file via HTTP / FTP / BitTorrent,
> i.e. some resumable transport.  Then you "git clone <bundle file>",
> fix up configuration, and fetch the rest since bundle creation.

There's also a "git bundler service":

http://comments.gmane.org/gmane.comp.version-control.git/181380

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: New Feature wanted: Is it possible to let git clone continue last break point?
  2011-10-31  9:07   ` Jonathan Nieder
  2011-10-31  9:16     ` netroby
@ 2011-11-02 22:06     ` Jeff King
  2011-11-02 22:41       ` Junio C Hamano
  1 sibling, 1 reply; 18+ messages in thread
From: Jeff King @ 2011-11-02 22:06 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: netroby, Git Mail List, Tomas Carnecky

On Mon, Oct 31, 2011 at 04:07:18AM -0500, Jonathan Nieder wrote:

> Something like Jeff's "priming the well with a server-specified
> bundle" proposal[2] might be a good way to make the same trick
> transparent to clients in the future.

Yes, that is one of the use cases I hope to address. But it will require
the publisher to specify a mirror location (it's possible we could add
some kind of automagic "hit a bundler service first" config option,
though I fear that the existing small-time bundler services would
crumble under the load).

So in the general case (and in the meantime), you may have to learn to
manually prime the repo using a bundle.

I haven't started on the patches for communicating mirror sites between
the server and client, but I did just write some patches to handle "git
fetch http://host/path/to/file.bundle" automatically, which is the first
step. They need a few finishing touches and some testing, though.

> Even with that, later fetches, which grab a pack generated on the fly
> to only contain the objects not already fetched, are generally not
> resumable.  Overcoming that would presumably require larger protocol
> changes, and I don't know of anyone working on it.  (My workaround
> when in a setup where this mattered was to use the old-fashioned
> "dumb" http protocol.  It worked fine.)

My goal was for the mirror communication between client and server to be
something like:

  - if you don't have object XXXXXX, then prime with URL
    http://host/bundle1

  - if you don't have object YYYYYY, then prime with URL
    http://host/bundle2

and so forth. A cloning client would grab the first bundle, then the
second, and then hit the real repo via the git protocol. A client who
had previously cloned might have XXX, but would now grab bundle2, and
then hit the real repo.

So depending on how often the server side feels like creating new
bundles, you would get most of the changes via bundles, and then only
be getting a small number of objects via git.

The downside of cumulative fetching is that the bundles can only serve
well-known checkpoints. So if you have a timeline like this:

  t0: server publishes bundle/mirror config with one line (the XXX bit
      above)

  t1: you clone, getting the whole bundle. No waste, because you had
      nothing in the first place, and you needed everything.

  t2: you fetch again, getting N commits worth of history via the git
      protocol

  t3: server decides a lot of new objects (let's say M commits worth)
      have accumulated, and generates a new line (the YYY line).

  t4: you fetch, see that you don't yet have YYY, and grab the second
      bundle

But in t4 you grabbed a bundle containing M commits, when you already
had the first N of them. So you actually wasted bandwidth getting
objects you already had. The only benefit is that you grabbed a static
file, which is resumable.

So I suspect there is some black magic involved in deciding when to
create a new bundle, and at what tip. If you create a bundle once a
month, but include only commits up to a week ago, then people pulling
weekly will never grab the bundle, but people pulling less frequently
will get the whole month as a bundle.

A secondary issue is also that in a scheme like this, your mirror list
will grow without bound. So you'd want to periodically repack everything
into a single bundle. But then people who are fetching wouldn't want
that, as it is just an exacerbated version of the same problem above.

Which is all a roundabout way of saying that the git protocol is really
the sane way to do efficient transfers. An alternative, much simpler
scheme would be for the server to just say:

  - if you have nothing, then prime with URL http://host/bundle

And then _only_ clone would bother with checking mirrors. People doing
fetch would be expected to do it often enough that not being resumable
isn't a big deal.

-Peff

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: New Feature wanted: Is it possible to let git clone continue last break point?
  2011-11-02 22:06     ` Jeff King
@ 2011-11-02 22:41       ` Junio C Hamano
  2011-11-02 23:27         ` Jeff King
  0 siblings, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2011-11-02 22:41 UTC (permalink / raw)
  To: Jeff King; +Cc: Jonathan Nieder, netroby, Git Mail List, Tomas Carnecky

Jeff King <peff@peff.net> writes:

> Which is all a roundabout way of saying that the git protocol is really
> the sane way to do efficient transfers. An alternative, much simpler
> scheme would be for the server to just say:
>
>   - if you have nothing, then prime with URL http://host/bundle
>
> And then _only_ clone would bother with checking mirrors. People doing
> fetch would be expected to do it often enough that not being resumable
> isn't a big deal.

I think that is a sensible place to start.

A more fancy conditional "If you have X then fetch this, if you have Y
fetch that, ..." sounds nice but depending on what branch you are fetching
the answer has to be different. If we were to do that, the natural place
for the server to give the redirect instruction to the client is after the
client finishes saying "want", and before the client starts saying "have".

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: New Feature wanted: Is it possible to let git clone continue last break point?
  2011-11-02 22:41       ` Junio C Hamano
@ 2011-11-02 23:27         ` Jeff King
  2011-11-03  0:06           ` Shawn Pearce
  0 siblings, 1 reply; 18+ messages in thread
From: Jeff King @ 2011-11-02 23:27 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Nieder, netroby, Git Mail List, Tomas Carnecky

On Wed, Nov 02, 2011 at 03:41:36PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Which is all a roundabout way of saying that the git protocol is really
> > the sane way to do efficient transfers. An alternative, much simpler
> > scheme would be for the server to just say:
> >
> >   - if you have nothing, then prime with URL http://host/bundle
> >
> > And then _only_ clone would bother with checking mirrors. People doing
> > fetch would be expected to do it often enough that not being resumable
> > isn't a big deal.
> 
> I think that is a sensible place to start.

OK. That had been my original intent, but somebody (you?) mentioned the
"if you have X" thing at the GitTogether, which got me thinking.

I don't mind starting slow, as long as we don't paint ourselves into a
corner for future expansion. I'll try to design the data format for
specifying the mirror locations with that extension in mind.

Even if the bundle thing ends up too wasteful, it may still be useful to
offer a "if you don't have X, go see Y" type of mirror when "Y" is
something efficient, like git:// at a faster host (i.e., the "I built 3
commits on top of Linus" case).

> A more fancy conditional "If you have X then fetch this, if you have Y
> fetch that, ..." sounds nice but depending on what branch you are fetching
> the answer has to be different. If we were to do that, the natural place
> for the server to give the redirect instruction to the client is after the
> client finishes saying "want", and before the client starts saying "have".

Agreed. I was really trying to avoid protocol extensions, though, at
least for an initial version. I'd like to see how far we can get doing
the simplest thing.

-Peff

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: New Feature wanted: Is it possible to let git clone continue last break point?
  2011-11-02 23:27         ` Jeff King
@ 2011-11-03  0:06           ` Shawn Pearce
  2011-11-03  2:42             ` Jeff King
  0 siblings, 1 reply; 18+ messages in thread
From: Shawn Pearce @ 2011-11-03  0:06 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Jonathan Nieder, netroby, Git Mail List,
	Tomas Carnecky

On Wed, Nov 2, 2011 at 16:27, Jeff King <peff@peff.net> wrote:
> On Wed, Nov 02, 2011 at 03:41:36PM -0700, Junio C Hamano wrote:
>> Jeff King <peff@peff.net> writes:
>>
>> > Which is all a roundabout way of saying that the git protocol is really
>> > the sane way to do efficient transfers. An alternative, much simpler
>> > scheme would be for the server to just say:
>> >
>> >   - if you have nothing, then prime with URL http://host/bundle
>> >
>> > And then _only_ clone would bother with checking mirrors. People doing
>> > fetch would be expected to do it often enough that not being resumable
>> > isn't a big deal.
>>
>> I think that is a sensible place to start.

Yup, I agree. The "repo" tool used by Android does this in Python
right now[1].  It's a simple hack: if the protocol is HTTP or HTTPS, the
client first tries to download $URL/clone.bundle. My servers have
rules that trap on */clone.bundle and issue an HTTP 302 Found response
to direct the client to a CDN. Works. :-)

[1] http://code.google.com/p/git-repo/source/detail?r=f322b9abb4cadc67b991baf6ba1b9f2fbd5d7812&name=stable
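
Roughly, the client side of that hack boils down to something like this
(a sketch in shell rather than Python; the URL is just a placeholder):

  url=https://android.example.com/platform/manifest.git
  if curl -f -L -o clone.bundle "$url/clone.bundle"; then
      # got a bundle (possibly via a 302 to a CDN); clone from it,
      # then fetch whatever the bundle does not contain yet
      git clone clone.bundle project
      (cd project && git fetch "$url")
  else
      # no clone.bundle published; fall back to a normal clone
      git clone "$url" project
  fi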

> OK. That had been my original intent, but somebody (you?) mentioned the
> "if you have X" thing at the GitTogether, which got me thinking.
>
> I don't mind starting slow, as long as we don't paint ourselves into a
> corner for future expansion. I'll try to design the data format for
> specifying the mirror locations with that extension in mind.

Right. Aside from the fact that $URL/clone.bundle is perhaps a bad way
to decide on the URL to actually fetch (and isn't supportable over
git:// or ssh://)... we should start with the clone case and worry
about incremental updates later.

> Even if the bundle thing ends up too wasteful, it may still be useful to
> offer a "if you don't have X, go see Y" type of mirror when "Y" is
> something efficient, like git:// at a faster host (i.e., the "I built 3
> commits on top of Linus" case).

Actually, I really think the bundle thing is wasteful. It's a ton of
additional disk. Hosts like kernel.org want to use sendfile() when
possible to handle bulk transfers. git:// is not efficient for them
because we don't have sendfile() capability.

It's also expensive for kernel.org to create each Git repository twice
on disk. The disk is cheap. It's the kernel buffer cache that is damned
expensive. Assume for a minute that Linus' kernel repository is a
popular thing to access. If 400M of that history is available in a
normal pack file on disk, and again 400M is available as a "clone
bundle thingy", kernel.org now has to eat 800M of disk buffer cache
for that one Git repository, because both of those files are going to
be hot.

I think I messed up with "repo" using a Git bundle file as its data
source. What we should have done was a bog standard pack file. Then
the client can download the pack file into the .git/objects/pack
directory and just generate the index, reusing the entire dumb
protocol transport logic. It also allows the server to pass out the
same file the server retains for the repository itself, and thus makes
the disk buffer cache only 400M for Linus' repository.

> Agreed. I was really trying to avoid protocol extensions, though, at
> least for an initial version. I'd like to see how far we can get doing
> the simplest thing.

One (maybe dumb) idea I had was making the $GIT_DIR/objects/info/packs
file contain other lines to list reference tips at the time the pack
was made. The client just needs the SHA-1s; it doesn't necessarily
need the branch names themselves. A client could initialize itself by
getting this set of references, creating temporary dummy references at
those SHA-1s, and downloading the corresponding pack file, indexing
it, then resuming with a normal fetch.
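
A hypothetical sketch of that client-side bootstrap (the mechanism does
not exist; every URL, pack name and SHA-1 below is made up):

  git init project && cd project
  # grab an existing pack over a resumable transport
  curl -C - -o .git/objects/pack/pack-1234.pack \
       http://host/repo.git/objects/pack/pack-1234.pack
  git index-pack .git/objects/pack/pack-1234.pack   # build the .idx
  # pin the tips the server listed with temporary dummy refs
  tip=0123456789abcdef0123456789abcdef01234567      # advertised tip (made up)
  git update-ref refs/dummy/tip1 "$tip"
  # then resume with a normal fetch; the server only sends what's missing
  git remote add origin git://host/repo.git
  git fetch origin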

Then we wind up with a git:// or ssh:// protocol extension that
enables sendfile() on an entire pack, and to provide the matching
objects/info/packs data to help a client over git:// or ssh://
initialize off the existing pack files.


Obviously there is the existing security feature that over git:// or
ssh:// (or even smart HTTP), a deleted or rewound reference stops
exposing the content in the repository that isn't reachable from the
other reference tips. The repository owner / server administrator will
have to make a choice here, either the existing packs are not exposed
as available via sendfile() until after GC can be run to rebuild them
around the right content set, or they are exposed and the time to
expunge/hide an unreferenced object is expanded until the GC completes
(rather than being immediate after the reference updates).

But either way, I like the idea of coupling the "resumable pack
download" to the *existing* pack files, because this is easy to deal
with. If you do have a rewind/delete and need to expunge content,
users/administrators already know how to run `git gc --expire=now` to
accomplish a full erase. Adding another thing with bundle files
somewhere else that may or may not contain the data you want to erase
and remembering to clean that up is not a good idea.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: New Feature wanted: Is it possible to let git clone continue last break point?
  2011-11-03  0:06           ` Shawn Pearce
@ 2011-11-03  2:42             ` Jeff King
  2011-11-03  4:19               ` Shawn Pearce
  0 siblings, 1 reply; 18+ messages in thread
From: Jeff King @ 2011-11-03  2:42 UTC (permalink / raw)
  To: Shawn Pearce
  Cc: Junio C Hamano, Jonathan Nieder, netroby, Git Mail List,
	Tomas Carnecky

On Wed, Nov 02, 2011 at 05:06:53PM -0700, Shawn O. Pearce wrote:

> Yup, I agree. The "repo" tool used by Android does this in Python
> right now[1].  It's a simple hack: if the protocol is HTTP or HTTPS, the
> client first tries to download $URL/clone.bundle. My servers have
> rules that trap on */clone.bundle and issue an HTTP 302 Found response
> to direct the client to a CDN. Works. :-)

I thought of doing something like that, but I wanted to be able to make
cross-domain links. The "302 to a CDN" thing is a clever hack, but it
requires more control of the webserver than some users might have. And
of course it doesn't work for the "redirect to git:// on a different
server" trick. Or redirect from "git://".

My thought of having it in "refs/mirrors" is only slightly less hacky,
but I think covers all of those cases. :)

> > Even if the bundle thing ends up too wasteful, it may still be useful to
> > offer a "if you don't have X, go see Y" type of mirror when "Y" is
> > something efficient, like git:// at a faster host (i.e., the "I built 3
> > commits on top of Linus" case).
> 
> Actually, I really think the bundle thing is wasteful. It's a ton of
> additional disk. Hosts like kernel.org want to use sendfile() when
> possible to handle bulk transfers. git:// is not efficient for them
> because we don't have sendfile() capability.

I didn't quite parse this. You say it is wasteful, but then indicate
that it can use sendfile(), which is a good thing.

However, I do agree with this:

> It's also expensive for kernel.org to create each Git repository twice
> on disk. The disk is cheap. It's the kernel buffer cache that is damned
> expensive. Assume for a minute that Linus' kernel repository is a
> popular thing to access. If 400M of that history is available in a
> normal pack file on disk, and again 400M is available as a "clone
> bundle thingy", kernel.org now has to eat 800M of disk buffer cache
> for that one Git repository, because both of those files are going to
> be hot.

Doubling the disk cache required is evil and ugly. I was hoping it
wouldn't matter because the bundle would be hosted on some far-away CDN
server anyway, though. But that is highly dependent on your setup. And
it's really just glossing over the fact that you have twice as many
servers. ;)

> I think I messed up with "repo" using a Git bundle file as its data
> source. What we should have done was a bog standard pack file. Then
> the client can download the pack file into the .git/objects/pack
> directory and just generate the index, reusing the entire dumb
> protocol transport logic. It also allows the server to pass out the
> same file the server retains for the repository itself, and thus makes
> the disk buffer cache only 400M for Linus' repository.

That would be cool, but what about ref tips? The pack is just a big blob
of objects, but we need ref tips to advertise to the server when we come
back via the smart protocol. We can make a guess about them, obviously,
but it would be nice to communicate them. I guess the mirror data could
include the tips and a pointer to a pack file.

Another issue with packs is that they generally aren't supposed to be
--thin on disk, whereas bundles can be. So I could point you to a
succession of bundles. Which is maybe a feature, or maybe just makes
things insanely complex[1].

> One (maybe dumb) idea I had was making the $GIT_DIR/objects/info/packs
> file contain other lines to list reference tips at the time the pack
> was made.

So yeah, that's another solution to the ref tip thingy, and that would
work. I don't think it would make a big difference whether the tips were
in the "mirror" file, or alongside the packfile. The latter I guess
might make administration easier. The "real" repo points its mirror one
time to a static pack store, and then the client goes and grabs whatever
it can from that store.

> Then we wind up with a git:// or ssh:// protocol extension that
> enables sendfile() on an entire pack, and to provide the matching
> objects/info/packs data to help a client over git:// or ssh://
> initialize off the existing pack files.

I think we can get around this by pointing git:// clients, either via
protocol extension or via a magic ref, to an http pack store. Sure, it's
an extra TCP connection, but that's not a big deal compared to doing an
initial clone of most repos.

So the sendfile() stuff would always happen over http.

> But either way, I like the idea of coupling the "resumable pack
> download" to the *existing* pack files, because this is easy to deal
> with.

Yeah, I'm liking that idea. In reference to my [1] above, what I've
started with is making:

  git fetch http://host/foo.bundle

work automatically. And it does work. But it actually spools the bundle
to disk and then unpacks from it, rather than placing it right into the
objects/pack directory. I did this because:

  1. We have to feed it to "index-pack --fix-thin", because bundles can
     be thin. So they're not suitable for sticking right into the pack
     directory.

  2. We could feed it straight to an index-pack pipe, but then we don't
     have a byte-for-byte file on disk to resume an interrupted
     transfer.

But spooling sucks, of course. It means we use twice as much disk space
during the index-pack as we would otherwise need to, not to mention the
latency of not starting the index-pack until we get the whole file.

Pulling down a non-thin packfile makes the problem go away. We can spool
it right into objects/pack, index it on the fly, and if all is well,
move it into its final filename. If the transfer is interrupted, you
drop what's been indexed so far, finish the transfer, and then re-start
the indexing from scratch (actually, the "on the fly" would probably
involve teaching index-pack to be clever about incrementally reading a
partially written file, but it should be possible).

-Peff

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: New Feature wanted: Is it possible to let git clone continue last break point?
  2011-11-03  2:42             ` Jeff King
@ 2011-11-03  4:19               ` Shawn Pearce
  2011-11-04  8:56                 ` Clemens Buchacher
  0 siblings, 1 reply; 18+ messages in thread
From: Shawn Pearce @ 2011-11-03  4:19 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Jonathan Nieder, netroby, Git Mail List,
	Tomas Carnecky

On Wed, Nov 2, 2011 at 19:42, Jeff King <peff@peff.net> wrote:
> On Wed, Nov 02, 2011 at 05:06:53PM -0700, Shawn O. Pearce wrote:
>
> I thought of doing something like that, but I wanted to be able to make
> cross-domain links. The "302 to a CDN" thing is a clever hack, but it
> requires more control of the webserver than some users might have. And
> of course it doesn't work for the "redirect to git:// on a different
> server" trick. Or redirect from "git://".

I agree. Later I said I regret this being a bundle file. I also regret
it being this $URL/clone.bundle thing. It's a reasonable quick hack in
Python for repo. It's cheap for my servers to respond 404 Not Found or
302 Found, and cheap to use the CDN. But it isn't the right solution
for git-core.

It has given us some useful information already in the context of
android.googlesource.com. It appears to work quite well for
distributing the large Android operating system. So the notion of
making packs available from a URL other than the main repository, doing
it primarily as a pack rather than the native Git protocol, with a
follow-up incremental fetch to bring the client current, seems to
work. :-)

> My thought of having it in "refs/mirrors" is only slightly less hacky,
> but I think covers all of those cases. :)

Right, but this would have been a bit more work for me to code in Python. :-)

Long term this may be a better approach, because it does allow the
user to control the redirect without having full control over their
HTTP server. It also supports redirections across protocols like you
noted above. So it's probably the direction we will see git-core take.

>> Actually, I really think the bundle thing is wasteful.... sendfile() capability.
>
> I didn't quite parse this. You say it is wasteful, but then indicate
> that it can use sendfile(), which is a good thing.

Apparently I was babbling. Based on what else you say, we agree. That
is good enough for me.

> However, I do agree with this:
>
>> It's also expensive for kernel.org to create each Git repository twice
>> on disk. The disk is cheap. It's the kernel buffer cache that is damned
>> expensive.
>
> Doubling the disk cache required is evil and ugly. I was hoping it
> wouldn't matter because the bundle would be hosted on some far-away CDN
> server anyway, though. But that is highly dependent on your setup. And
> it's really just glossing over the fact that you have twice as many
> servers. ;)

Right. :-)

In my opinion this is the important part. We shouldn't double the disk
usage required to support this. Most users can't afford the extra disk
cache or the extra server required to make this work well. But they
can use sendfile() on the server they have and get a lot of
improvement in clone speed due to lower system load, plus resumable
clone for the relatively stable history part.

> Another issue with packs is that they generally aren't supposed to be
> --thin on disk, whereas bundles can be. So I could point you to a
> succession of bundles. Which is maybe a feature, or maybe just makes
> things insanely complex[1].

Actually we can store --thin on disk safely.  Don't laugh until you
finish reading it through.

To build an incremental pack we modify pack-objects to construct a
completed thin pack on disk. Build up the list of objects that you
want in the thin pack, as though it were thin. Use REF_DELTA format to
reference objects that are not in this set but are delta bases. Copy
the necessary delta bases from the base pack over to the thin pack, at
the end just like it would be if received over the wire. The pack is
now self-contained like it's supposed to be, but the tail of it is
redundant information.

If you cache alongside of the pack the "thin" object count, the cut
offset of the thin vs. completed bases, and the SHA-1 of the "thin"
pack, you can serve the "thin" pack by copying the header, then the
region of the file up to the cut point, and the final SHA-1. And there
are no pack file format changes involved.

:-)

Obviously this has some downside. Using REF_DELTA instead of OFS_DELTA
for the relatively small number of references from the "thin" part to
the completed part at the tail isn't a big disk space overhead. The
big overhead is storing the boundary data that served as delta bases
at the tail of this incremental pack. But we already do that when you
transfer this section of data over the network and it was more than
100 objects.

So I think we can get away with doing this. The serving repository is
in no worse state than if the owner had just pushed all of that
incremental stuff into the serving repository and it completed as a
thin pack. With only 2 packs in the serving repository (e.g. the
historical stuff that is stable, and the incremental current thin pack
+ completed bases), git gc --auto wouldn't even kick in to GC this
thing for a while *anyway*. So we already probably have a ton of
repositories in the wild that exhibit this disk layout and space
usage, and nobody has complained about it.

For a server admin or repository owner who cares about his users'
resumable clone support, carrying around a historical pack and a
single new incremental pack for say 2-3 months before repacking the
entire thing down to 1 new historical pack... the disk space and
additional completed base data is an acceptable cost. We already do
it.

Clients can figure out whether or not they should use an incremental
pack download vs. the native Git protocol if the incremental pack, like
a bundle, stores the base information alongside it.
Actually you don't want the base (the ^ lines in a bundle), but the
immediate child of those. If the client has any of those children,
there is some chance the client has other objects in the pack and
should favor native protocol. But if the client has none of those base
children, but does have the base, it may be more efficient to download
the entire pack to bring the client current.

The problem with incremental pack updates is balancing the number of
round-trip requests against the update rate of the repository against
the polling frequency of the client. It's not an easy thing to solve.

However, we may be able to do better if the server can do a reasonably
fast concat of these thin pack slices together by writing a new object
header and computing the SHA-1 trailer as it goes. Instead of
computing actual graph connectivity, just concat packs together
between the base children and the requested tips. This probably
requires that the client ask for every branch (e.g. the typical
refs/heads/*:refs/remotes/origin/* refspec) and that branches didn't
rewind. But I think this is so common its perhaps worthwhile to look
into optimizing. But note we can do this in the native protocol at the
server side without telling the client anything, or changing the
protocol. It just isn't resumable without a bit more glue to have a
state marker available to the client. Nor does it work on a CDN
without giving the client more information. :-)

> So the sendfile() stuff would always happen over http.

I'm OK with that. I was just saying we may be able to also support
sendfile() over git:// if the repository owner / git-daemon owner
wants us to. Or if not sendfile(), a simple read-write loop that
doesn't have to look at the data, since the client will validate it
all.

> Yeah, I'm liking that idea. In reference to my [1] above, what I've
> started with is making:
>
>  git fetch http://host/foo.bundle

This should work, whether or not we use it for resumable clone. It's
just nice to have that tiny bit of extra glue to make it easy to pull
a bundle. So I'd like this too. :-)

> Pulling down a non-thin packfile makes the problem go away. We can spool
> it right into objects/pack, index it on the fly, and if all is well,
> move it into its final filename. If the transfer is interrupted, you
> drop what's been indexed so far, finish the transfer, and then re-start
> the indexing from scratch (actually, the "on the fly" would probably
> involve teaching index-pack to be clever about incrementally reading a
> partially written file, but it should be possible).

I wonder if we can teach index-pack to work with a thin pack on disk
and complete that by appending to the file, in addition to the
streaming from stdin it supports. Seems like that should be possible.
So then you could save a thin pack to a temp file on disk, and thus
could split a bundle header from its pack content, saving them into
two different temp files, allowing index-pack to avoid copying the
pack portion if it's non-thin, or if it's a huge thin pack.

I did think about doing this in "repo" and decided it was complex, and
not worth the effort. So we spool. 2G+ bundles. It's not the most
pleasant user experience. If I had more time, I would have tried to
split the bundle header from the pack and written the pack directly
off for index-pack to read from disk.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: New Feature wanted: Is it possible to let git clone continue last break point?
  2011-11-03  4:19               ` Shawn Pearce
@ 2011-11-04  8:56                 ` Clemens Buchacher
  2011-11-04  9:35                   ` Johannes Sixt
  0 siblings, 1 reply; 18+ messages in thread
From: Clemens Buchacher @ 2011-11-04  8:56 UTC (permalink / raw)
  To: Shawn Pearce
  Cc: Jeff King, Junio C Hamano, Jonathan Nieder, netroby,
	Git Mail List, Tomas Carnecky

On Wed, Nov 02, 2011 at 09:19:03PM -0700, Shawn Pearce wrote:
> 
> [...] But they can use sendfile() on the server they have and get
> a lot of improvement in clone speed due to lower system load,
> plus resumable clone for the relatively stable history part.

Setting aside the system load issue for now, couldn't we simply do
the following?

1. Figure out HAVE's and WANT's [1], based on which an ad-hoc pack
   will be made and sent to the client.
2. Cache the information on disk (not the pack but the information
   to re-create it), and give the client a 'ticket number' which
   corresponds to that ad-hoc pack.
3. Start downloading the packfile.

When the connection drops, we can resume like this:

1. Send the previously received 'ticket number', and the amount of
   previously received data.
2. Re-generate the pack from the HAVE's and WANT's cached under
   'ticket number'. (This may fail if the repo state has changed
   such that previously accessible refs are now inaccessible.)
3. Resume download of that pack.

The upside of this approach is that it would work automatically,
without any manual setup by the server admin. All the previously
discussed ideas skip the step where we figure out the HAVE's and
WANT's. And to me that implies that we manually prepare a packfile
somewhere on disk, which contains what the user usually WANT's and
is allowed to have (think per-branch access control). Even if we
disregard access control, wouldn't that at least require the server
to create a "clean" pack which does not contain any objects from
the reflog?

The whole mirror thing could be pursued independently of the resume
capability, and if each git repo is capable of resuming, the mirrors
can be plain git clones as well.

Just my 2 cents,
Clemens

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: New Feature wanted: Is it possible to let git clone continue last break point?
  2011-11-04  8:56                 ` Clemens Buchacher
@ 2011-11-04  9:35                   ` Johannes Sixt
  2011-11-04 14:22                     ` Shawn Pearce
  0 siblings, 1 reply; 18+ messages in thread
From: Johannes Sixt @ 2011-11-04  9:35 UTC (permalink / raw)
  To: Clemens Buchacher
  Cc: Shawn Pearce, Jeff King, Junio C Hamano, Jonathan Nieder, netroby,
	Git Mail List, Tomas Carnecky

Am 11/4/2011 9:56, schrieb Clemens Buchacher:
> Cache ... not the pack but the information
>    to re-create it...

It has been discussed. It doesn't work. Because with threaded pack
generation, the resulting pack is not deterministic.

-- Hannes

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: New Feature wanted: Is it possible to let git clone continue last break point?
  2011-11-04  9:35                   ` Johannes Sixt
@ 2011-11-04 14:22                     ` Shawn Pearce
  2011-11-04 15:55                       ` Jakub Narebski
                                         ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Shawn Pearce @ 2011-11-04 14:22 UTC (permalink / raw)
  To: Johannes Sixt
  Cc: Clemens Buchacher, Jeff King, Junio C Hamano, Jonathan Nieder,
	netroby, Git Mail List, Tomas Carnecky

On Fri, Nov 4, 2011 at 02:35, Johannes Sixt <j.sixt@viscovery.net> wrote:
> Am 11/4/2011 9:56, schrieb Clemens Buchacher:
>> Cache ... not the pack but the information
>>    to re-create it...
>
> It has been discussed. It doesn't work. Because with threaded pack
> generation, the resulting pack is not deterministic.

The information to create a pack for a repository with 2M objects
(e.g. Linux kernel tree) is *at least* 152M of data. This is just a
first order approximation of what it takes to write out the 2M SHA-1s,
along with, say, a 4-byte length so that, given an offset provided by
the client, you can find roughly where to resume in the object stream.
This is like 25% of the pack size itself. Ouch.

This data is still insufficient to resume from. A correct solution
would allow you to resume in the middle of an object, which means we
also need to store some sort of indicator of which representation was
chosen from an existing pack file for object reuse. Which adds more
data to the stream. And then there is the not so simple problem of how
to resume in the middle of an object that was being recompressed on
the fly, such as a large loose object.

By the time you get done with all of that, your "ticket" might as well
be the name of a pack file. And your "resume information" is just a
pack file itself. Which would be very expensive to recreate.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: New Feature wanted: Is it possible to let git clone continue last break point?
  2011-11-04 14:22                     ` Shawn Pearce
@ 2011-11-04 15:55                       ` Jakub Narebski
  2011-11-04 16:05                       ` Nguyen Thai Ngoc Duy
  2011-11-05 10:00                       ` Clemens Buchacher
  2 siblings, 0 replies; 18+ messages in thread
From: Jakub Narebski @ 2011-11-04 15:55 UTC (permalink / raw)
  To: Shawn Pearce
  Cc: Johannes Sixt, Clemens Buchacher, Jeff King, Junio C Hamano,
	Jonathan Nieder, netroby, Git Mail List, Tomas Carnecky

Shawn Pearce <spearce@spearce.org> writes:
> On Fri, Nov 4, 2011 at 02:35, Johannes Sixt <j.sixt@viscovery.net> wrote:
> > Am 11/4/2011 9:56, schrieb Clemens Buchacher:

> > > Cache ... not the pack but the information
> > >    to re-create it...
> >
> > It has been discussed. It doesn't work. Because with threaded pack
> > generation, the resulting pack is not deterministic.
> 
> The information to create a pack for a repository with 2M objects
> (e.g. Linux kernel tree) is *at least* 152M of data. This is just a
> first order approximation of what it takes to write out the 2M SHA-1s,
> along with, say, a 4-byte length so that, given an offset provided by
> the client, you can find roughly where to resume in the object stream.
> This is like 25% of the pack size itself. Ouch.

Well, perhaps cache a few of the most popular packs (a packfile is
saved to disk as it is streamed if we detect that it will be large),
indexed by WANT / HAVE?
 
> This data is still insufficient to resume from. A correct solution
> would allow you to resume in the middle of an object, which means we
> also need to store some sort of indicator of which representation was
> chosen from an existing pack file for object reuse. Which adds more
> data to the stream. And then there is the not so simple problem of how
> to resume in the middle of an object that was being recompressed on
> the fly, such as a large loose object.

Well, so you wouldn't be able to just concatenate packs^W received
data.  Still, it should be possible to "repair" a halfway-downloaded
partial pack...
 
Just my 2 eurocents^W groszy.
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: New Feature wanted: Is it possible to let git clone continue last break point?
  2011-11-04 14:22                     ` Shawn Pearce
  2011-11-04 15:55                       ` Jakub Narebski
@ 2011-11-04 16:05                       ` Nguyen Thai Ngoc Duy
  2011-11-05 10:00                       ` Clemens Buchacher
  2 siblings, 0 replies; 18+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-11-04 16:05 UTC (permalink / raw)
  To: Shawn Pearce
  Cc: Johannes Sixt, Clemens Buchacher, Jeff King, Junio C Hamano,
	Jonathan Nieder, netroby, Git Mail List, Tomas Carnecky

2011/11/4 Shawn Pearce <spearce@spearce.org>:
> By the time you get done with all of that, your "ticket" might as well
> be the name of a pack file. And your "resume information" is just a
> pack file itself. Which would be very expensive to recreate.

I'll deal with the initial clone case only here. Can we make the git
protocol send multiple packs, i.e. send the on-disk packs one by one,
each together with its pack SHA-1? This way we do not need to recreate
anything. If new packs are created during cloning, the git client
should be able to construct the "have" list from the good packs and
fetch updates from the server again.
-- 
Duy

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: New Feature wanted: Is it possible to let git clone continue last break point?
  2011-11-04 14:22                     ` Shawn Pearce
  2011-11-04 15:55                       ` Jakub Narebski
  2011-11-04 16:05                       ` Nguyen Thai Ngoc Duy
@ 2011-11-05 10:00                       ` Clemens Buchacher
  2 siblings, 0 replies; 18+ messages in thread
From: Clemens Buchacher @ 2011-11-05 10:00 UTC (permalink / raw)
  To: Shawn Pearce
  Cc: Johannes Sixt, Jeff King, Junio C Hamano, Jonathan Nieder,
	netroby, Git Mail List, Tomas Carnecky

On Fri, Nov 04, 2011 at 07:22:20AM -0700, Shawn Pearce wrote:
> On Fri, Nov 4, 2011 at 02:35, Johannes Sixt <j.sixt@viscovery.net> wrote:
> > Am 11/4/2011 9:56, schrieb Clemens Buchacher:
> >> Cache ... not the pack but the information
> >>    to re-create it...
> >
> > It has been discussed. It doesn't work. Because with threaded pack
> > generation, the resulting pack is not deterministic.

So let the client disable it, if they'd rather have a resumable
fetch than a fast one.

Sorry if I'm being obstinate here. But I don't understand the
problem and I can't find an explanation in related discussions.

> The information to create a pack for a repository with 2M objects
> (e.g. Linux kernel tree) is *at least* 152M of data. This is just a
> first order approximation of what it takes to write out the 2M SHA-1s,
> along with, say, a 4-byte length so that, given an offset provided by
> the client, you can find roughly where to resume in the object stream.
> This is like 25% of the pack size itself. Ouch.

Sorry, I should not have said HAVEs. All we need is the common
commits, and the sha1s of the WANTed branch heads at the time of
the initial fetch. That shouldn't be more than 10 or so in typical
cases.

> This data is still insufficient to resume from. A correct solution
> would allow you to resume in the middle of an object, which means we
> also need to store some sort of indicator of which representation was
> chosen from an existing pack file for object reuse. Which adds more
> data to the stream. And then there is the not so simple problem of how
> to resume in the middle of an object that was being recompressed on
> the fly, such as a large loose object.

How often does the "representation chosen from an existing pack
file for object reuse" change? Long term determinism is a problem,
yes. But I see no reason why it should not work for this short-term
case. So long as the pack is created by one particular git and libz
version, and for this particular consecutive run of fetches, we do
not need to store anything about the pack. The client downloads n
MB of data until the drop. To resume, the client says it already
has n MB of data.

No?

Clemens

^ permalink raw reply	[flat|nested] 18+ messages in thread

