* New Feature wanted: Is it possible to let git clone continue last break point?

From: netroby @ 2011-10-31 2:28 UTC
To: Git Mail List

Is it possible to let git clone continue from the last break point?
When we git clone a very large project from the web, we may hit an
interruption, and then we have to clone it again from zero.

That is a bad experience for users on slow connections.

Please help us out; we need git clone to be able to continue from the
last break point.

netroby
----------------------------------
http://www.netroby.com

* Re: New Feature wanted: Is it possible to let git clone continue last break point?

From: Tay Ray Chuan @ 2011-10-31 4:00 UTC
To: netroby
Cc: Git Mail List

This is a hard problem that hasn't been solved; year after year it
comes up as a GSoC proposal...

What you can do is use --depth 1 with your git-clone, then "extend"
the depth incrementally.

--
Cheers,
Ray Chuan

On Mon, Oct 31, 2011 at 10:28 AM, netroby <hufeng1987@gmail.com> wrote:
> Is it possible to let git clone continue from the last break point?
> When we git clone a very large project from the web, we may hit an
> interruption, and then we have to clone it again from zero.
> [...]

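A minimal sketch of the shallow-clone-and-deepen workaround Ray
describes (the kernel URL is only an example):

    # Grab just the most recent commit first: a small transfer.
    git clone --depth 1 git://github.com/torvalds/linux.git
    cd linux

    # Deepen the history in steps; each fetch is a smaller pack, so an
    # interrupted transfer loses less work.
    git fetch --depth 100
    git fetch --depth 1000
    git fetch --depth 100000
    # (More recent versions of git can finish with: git fetch --unshallow)
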
* Re: New Feature wanted: Is it possible to let git clone continue last break point?

From: Jonathan Nieder @ 2011-10-31 9:07 UTC
To: netroby
Cc: Git Mail List, Tomas Carnecky, Jeff King

Hi,

netroby wrote:

> Is it possible to let git clone continue from the last break point?
> When we git clone a very large project from the web, we may hit an
> interruption, and then we have to clone it again from zero.

You might find [1] useful as a stopgap (thanks, Tomas!).

Something like Jeff's "priming the well with a server-specified
bundle" proposal [2] might be a good way to make the same trick
transparent to clients in the future.

Even with that, later fetches, which grab a pack generated on the fly
to contain only the objects not already fetched, are generally not
resumable. Overcoming that would presumably require larger protocol
changes, and I don't know of anyone working on it. (My workaround, in
a setup where this mattered, was to use the old-fashioned "dumb" HTTP
protocol. It worked fine.)

Hope that helps,
Jonathan

[1] http://thread.gmane.org/gmane.comp.version-control.git/181380
[2] http://thread.gmane.org/gmane.comp.version-control.git/164569/focus=164701
    http://thread.gmane.org/gmane.comp.version-control.git/168906/focus=168912

* Re: New Feature wanted: Is it possible to let git clone continue last break point?

From: netroby @ 2011-10-31 9:16 UTC
To: Jonathan Nieder
Cc: Git Mail List, Tomas Carnecky, Jeff King

For example, I want to clone the FreeBSD and Linux kernel git repos to
read their source code:

  git://github.com/freebsd/freebsd.git
  git://github.com/torvalds/linux.git

They are big projects, so the repositories are huge.

Thanks for your tips; I will give them a try. I am currently on 256K
ADSL, so the connection is not very stable while a clone is in
progress.

netroby
----------------------------------
http://www.netroby.com

On Mon, Oct 31, 2011 at 17:07, Jonathan Nieder <jrnieder@gmail.com> wrote:
> You might find [1] useful as a stopgap (thanks, Tomas!).
> [...]

* Re: New Feature wanted: Is it possible to let git clone continue last break point?

From: Jeff King @ 2011-11-02 22:06 UTC
To: Jonathan Nieder
Cc: netroby, Git Mail List, Tomas Carnecky

On Mon, Oct 31, 2011 at 04:07:18AM -0500, Jonathan Nieder wrote:

> Something like Jeff's "priming the well with a server-specified
> bundle" proposal [2] might be a good way to make the same trick
> transparent to clients in the future.

Yes, that is one of the use cases I hope to address. But it will
require the publisher specifying a mirror location (it's possible we
could add some kind of automagic "hit a bundler service first" config
option, though I fear that the existing small-time bundler services
would crumble under the load). So in the general case (and in the
meantime), you may have to learn to manually prime the repo using a
bundle.

I haven't started on the patches for communicating mirror sites
between the server and client, but I did just write some patches to
handle "git fetch http://host/path/to/file.bundle" automatically,
which is the first step. They need a few finishing touches and some
testing, though.

> Even with that, later fetches, which grab a pack generated on the fly
> to contain only the objects not already fetched, are generally not
> resumable. Overcoming that would presumably require larger protocol
> changes, and I don't know of anyone working on it. (My workaround, in
> a setup where this mattered, was to use the old-fashioned "dumb" HTTP
> protocol. It worked fine.)

My goal was for the mirror communication between client and server to
be something like:

  - if you don't have object XXXXXX, then prime with URL
    http://host/bundle1

  - if you don't have object YYYYYY, then prime with URL
    http://host/bundle2

and so forth. A cloning client would grab the first bundle, then the
second, and then hit the real repo via the git protocol. A client who
had previously cloned might have XXXXXX, but would now grab bundle2,
and then hit the real repo. So depending on how often the server side
feels like creating new bundles, you would get most of the changes via
bundles, and then only be getting a small number of objects via git.

The downside of cumulative fetching is that the bundles can only serve
well-known checkpoints. So if you have a timeline like this:

  t0: server publishes a bundle/mirror config with one line (the
      XXXXXX bit above)

  t1: you clone, getting the whole bundle. No waste, because you had
      nothing in the first place, and you needed everything.

  t2: you fetch again, getting N commits' worth of history via the
      git protocol

  t3: server decides a lot of new objects (let's say M commits' worth)
      have accumulated, and generates a new line (the YYYYYY line)

  t4: you fetch, see that you don't yet have YYYYYY, and grab the
      second bundle

But at t4 you grabbed a bundle containing M commits' worth of objects,
when you already had the first N of them. So you actually wasted
bandwidth getting objects you already had. The only benefit is that
you grabbed a static file, which is resumable.

So I suspect there is some black magic involved in deciding when to
create a new bundle, and at what tip. If you create a bundle once a
month, but include only commits up to a week ago, then people pulling
weekly will never grab the bundle, but people pulling less frequently
will get the whole month as a bundle.

A secondary issue is that in a scheme like this, your mirror list will
grow without bound. So you'd want to periodically repack everything
into a single bundle. But then people who are fetching wouldn't want
that, as it is just an exacerbated version of the same problem above.

Which is all a roundabout way of saying that the git protocol is
really the sane way to do efficient transfers. An alternative, much
simpler scheme would be for the server to just say:

  - if you have nothing, then prime with URL http://host/bundle

And then _only_ clone would bother with checking mirrors. People doing
fetch would be expected to do it often enough that not being resumable
isn't a big deal.

-Peff

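To make the proposal concrete, here is a sketch of what such a
server-published mirror list and the matching client logic might look
like. The file format is entirely hypothetical (no such format exists
in git); only the individual git commands are real:

    # Hypothetical mirror list: "<checkpoint object> <prime URL>".
    cat > mirror-list.txt <<'EOF'
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX http://host/bundle1
    YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY http://host/bundle2
    EOF

    # Client side: prime from each bundle whose checkpoint object is
    # still missing locally, then fall through to a normal fetch.
    while read obj url; do
        if ! git cat-file -e "$obj" 2>/dev/null; then
            wget -c "$url" -O prime.bundle    # -c makes it resumable
            git bundle unbundle prime.bundle
        fi
    done < mirror-list.txt
    git fetch origin
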
* Re: New Feature wanted: Is it possible to let git clone continue last break point?

From: Junio C Hamano @ 2011-11-02 22:41 UTC
To: Jeff King
Cc: Jonathan Nieder, netroby, Git Mail List, Tomas Carnecky

Jeff King <peff@peff.net> writes:

> Which is all a roundabout way of saying that the git protocol is
> really the sane way to do efficient transfers. An alternative, much
> simpler scheme would be for the server to just say:
>
>   - if you have nothing, then prime with URL http://host/bundle
>
> And then _only_ clone would bother with checking mirrors. People
> doing fetch would be expected to do it often enough that not being
> resumable isn't a big deal.

I think that is a sensible place to start.

A more fancy conditional "if you have X then fetch this, if you have Y
then fetch that, ..." sounds nice, but depending on what branch you
are fetching, the answer has to be different. If we were to do that,
the natural place for the server to give the redirect instruction to
the client is after the client finishes saying "want", and before the
client starts saying "have".

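As an illustration of where that redirect would sit in the fetch
negotiation, a hypothetical transcript (this extension does not exist;
the "redirect" line is invented purely for the sketch):

    C: want 1a2b3c... refs/heads/master   # client names what it wants
    C: (flush)
    S: redirect http://host/bundle        # hypothetical new step
    C: have 9f8e7d...                     # negotiation resumes here
    ...
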
* Re: New Feature wanted: Is it possible to let git clone continue last break point?

From: Jeff King @ 2011-11-02 23:27 UTC
To: Junio C Hamano
Cc: Jonathan Nieder, netroby, Git Mail List, Tomas Carnecky

On Wed, Nov 02, 2011 at 03:41:36PM -0700, Junio C Hamano wrote:

> > ... An alternative, much simpler scheme would be for the server to
> > just say:
> >
> >   - if you have nothing, then prime with URL http://host/bundle
>
> I think that is a sensible place to start.

OK. That had been my original intent, but somebody (you?) mentioned
the "if you have X" thing at the GitTogether, which got me thinking.

I don't mind starting slow, as long as we don't paint ourselves into a
corner for future expansion. I'll try to design the data format for
specifying the mirror locations with that extension in mind.

Even if the bundle thing ends up too wasteful, it may still be useful
to offer a "if you don't have X, go see Y" type of mirror when "Y" is
something efficient, like git:// at a faster host (i.e., the "I built
3 commits on top of Linus" case).

> A more fancy conditional "if you have X then fetch this, if you have
> Y then fetch that, ..." sounds nice, but depending on what branch you
> are fetching, the answer has to be different. If we were to do that,
> the natural place for the server to give the redirect instruction to
> the client is after the client finishes saying "want", and before the
> client starts saying "have".

Agreed. I was really trying to avoid protocol extensions, though, at
least for an initial version. I'd like to see how far we can get doing
the simplest thing.

-Peff

* Re: New Feature wanted: Is it possible to let git clone continue last break point?

From: Shawn Pearce @ 2011-11-03 0:06 UTC
To: Jeff King
Cc: Junio C Hamano, Jonathan Nieder, netroby, Git Mail List, Tomas Carnecky

On Wed, Nov 2, 2011 at 16:27, Jeff King <peff@peff.net> wrote:
> On Wed, Nov 02, 2011 at 03:41:36PM -0700, Junio C Hamano wrote:
> > > ... An alternative, much simpler scheme would be for the server
> > > to just say:
> > >
> > >   - if you have nothing, then prime with URL http://host/bundle
> >
> > I think that is a sensible place to start.

Yup, I agree. The "repo" tool used by Android does this in Python
right now [1]. It's a simple hack: if the protocol is HTTP or HTTPS,
the client first tries to download $URL/clone.bundle. My servers have
rules that trap on */clone.bundle and issue an HTTP "302 Found"
response to direct the client to a CDN. Works. :-)

[1] http://code.google.com/p/git-repo/source/detail?r=f322b9abb4cadc67b991baf6ba1b9f2fbd5d7812&name=stable

> OK. That had been my original intent, but somebody (you?) mentioned
> the "if you have X" thing at the GitTogether, which got me thinking.
>
> I don't mind starting slow, as long as we don't paint ourselves into
> a corner for future expansion. I'll try to design the data format for
> specifying the mirror locations with that extension in mind.

Right. Aside from the fact that $URL/clone.bundle is perhaps a bad way
to decide on the URL to actually fetch (and isn't supportable over
git:// or ssh://)... we should start with the clone case and worry
about incremental updates later.

> Even if the bundle thing ends up too wasteful, it may still be useful
> to offer a "if you don't have X, go see Y" type of mirror when "Y" is
> something efficient, like git:// at a faster host (i.e., the "I built
> 3 commits on top of Linus" case).

Actually, I really do think the bundle thing is wasteful. It's a ton
of additional disk. Hosts like kernel.org want to use sendfile() when
possible to handle bulk transfers; git:// is not efficient for them
because we don't have sendfile() capability.

It's also expensive for kernel.org to store each Git repository twice
on disk. The disk is cheap; it's the kernel buffer cache that is
damned expensive. Assume for a minute that Linus's kernel repository
is a popular thing to access. If 400M of that history is available in
a normal pack file on disk, and again 400M is available as a "clone
bundle thingy", kernel.org now has to eat 800M of disk buffer cache
for that one Git repository, because both of those files are going to
be hot.

I think I messed up by having "repo" use a Git bundle file as its data
source. What we should have done was a bog-standard pack file. Then
the client can download the pack file into the .git/objects/pack
directory and just generate the index, reusing the entire dumb
protocol transport logic. It also allows the server to pass out the
same file the server retains for the repository itself, and thus keeps
the disk buffer cache at only 400M for Linus's repository.

> Agreed. I was really trying to avoid protocol extensions, though, at
> least for an initial version. I'd like to see how far we can get
> doing the simplest thing.

One (maybe dumb) idea I had was making the $GIT_DIR/objects/info/packs
file contain extra lines listing the reference tips at the time each
pack was made. The client just needs the SHA-1s; it doesn't
necessarily need the branch names themselves. A client could
initialize itself by getting this set of references, creating
temporary dummy references at those SHA-1s, downloading the
corresponding pack file, indexing it, and then resuming with a normal
fetch.

Then we wind up with a git:// or ssh:// protocol extension that
enables sendfile() on an entire pack, and provides the matching
objects/info/packs data to help a client over git:// or ssh://
initialize off the existing pack files.

Obviously there is the existing security feature that over git:// or
ssh:// (or even smart HTTP), a deleted or rewound reference stops
exposing the content in the repository that isn't reachable from the
other reference tips. The repository owner / server administrator will
have to make a choice here: either the existing packs are not exposed
as available via sendfile() until after GC can be run to rebuild them
around the right content set, or they are exposed and the time to
expunge/hide an unreferenced object is extended until the GC completes
(rather than being immediate after the reference updates).

But either way, I like the idea of coupling the "resumable pack
download" to the *existing* pack files, because this is easy to deal
with. If you do have a rewind/delete and need to expunge content,
users/administrators already know how to run `git gc --prune=now` to
accomplish a full erase. Adding another thing with bundle files
somewhere else that may or may not contain the data you want to erase,
and remembering to clean that up, is not a good idea.

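A sketch of the clone.bundle trick as "repo" implements it, translated
to shell (the URL is illustrative, and this assumes the bundle records
the branch heads so that clone-from-bundle works):

    URL=https://example.com/platform/manifest.git
    # Try the static bundle first; -C - resumes a partial download,
    # and -L follows the 302 redirect to a CDN.
    if curl -f -L -C - -o clone.bundle "$URL/clone.bundle"; then
        git clone clone.bundle project    # clone from the local file
        cd project
        git remote set-url origin "$URL"  # repoint at the live repo
        git fetch origin                  # catch up past the bundle
    else
        git clone "$URL" project          # no bundle: normal clone
    fi
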
* Re: New Feature wanted: Is it possible to let git clone continue last break point?

From: Jeff King @ 2011-11-03 2:42 UTC
To: Shawn Pearce
Cc: Junio C Hamano, Jonathan Nieder, netroby, Git Mail List, Tomas Carnecky

On Wed, Nov 02, 2011 at 05:06:53PM -0700, Shawn O. Pearce wrote:

> Yup, I agree. The "repo" tool used by Android does this in Python
> right now [1]. It's a simple hack: if the protocol is HTTP or HTTPS,
> the client first tries to download $URL/clone.bundle. My servers have
> rules that trap on */clone.bundle and issue an HTTP "302 Found"
> response to direct the client to a CDN. Works. :-)

I thought of doing something like that, but I wanted to be able to
make cross-domain links. The "302 to a CDN" thing is a clever hack,
but it requires more control of the webserver than some users might
have. And of course it doesn't work for the "redirect to git:// on a
different server" trick. Or redirecting from "git://".

My thought of having it in "refs/mirrors" is only slightly less hacky,
but I think it covers all of those cases. :)

> Actually, I really do think the bundle thing is wasteful. It's a ton
> of additional disk. Hosts like kernel.org want to use sendfile() when
> possible to handle bulk transfers; git:// is not efficient for them
> because we don't have sendfile() capability.

I didn't quite parse this. You say it is wasteful, but then indicate
that it can use sendfile(), which is a good thing. However, I do agree
with this:

> It's also expensive for kernel.org to store each Git repository twice
> on disk. The disk is cheap; it's the kernel buffer cache that is
> damned expensive. Assume for a minute that Linus's kernel repository
> is a popular thing to access. If 400M of that history is available in
> a normal pack file on disk, and again 400M is available as a "clone
> bundle thingy", kernel.org now has to eat 800M of disk buffer cache
> for that one Git repository, because both of those files are going to
> be hot.

Doubling the required disk cache is evil and ugly. I was hoping it
wouldn't matter because the bundle would be hosted on some far-away
CDN server anyway, though that is highly dependent on your setup. And
it's really just glossing over the fact that you have twice as many
servers. ;)

> I think I messed up by having "repo" use a Git bundle file as its
> data source. What we should have done was a bog-standard pack file.
> Then the client can download the pack file into the .git/objects/pack
> directory and just generate the index, reusing the entire dumb
> protocol transport logic. It also allows the server to pass out the
> same file the server retains for the repository itself, and thus
> keeps the disk buffer cache at only 400M for Linus's repository.

That would be cool, but what about ref tips? The pack is just a big
blob of objects, but we need ref tips to advertise to the server when
we come back via the smart protocol. We can make a guess about them,
obviously, but it would be nice to communicate them. I guess the
mirror data could include the tips and a pointer to a pack file.

Another issue with packs is that they generally aren't supposed to be
--thin on disk, whereas bundles can be. So I could point you to a
succession of bundles. Which is maybe a feature, or maybe just makes
things insanely complex [1].

> One (maybe dumb) idea I had was making the
> $GIT_DIR/objects/info/packs file contain extra lines listing the
> reference tips at the time each pack was made.

So yeah, that's another solution to the ref tip thingy, and that would
work. I don't think it would make a big difference whether the tips
were in the "mirror" file or alongside the packfile. The latter, I
guess, might make administration easier: the "real" repo points its
mirror one time at a static pack store, and then the client goes and
grabs whatever it can from that store.

> Then we wind up with a git:// or ssh:// protocol extension that
> enables sendfile() on an entire pack, and provides the matching
> objects/info/packs data to help a client over git:// or ssh://
> initialize off the existing pack files.

I think we can get around this by pointing git:// clients, either via
a protocol extension or via a magic ref, to an http pack store. Sure,
it's an extra TCP connection, but that's not a big deal compared to
doing an initial clone of most repos. So the sendfile() stuff would
always happen over http.

> But either way, I like the idea of coupling the "resumable pack
> download" to the *existing* pack files, because this is easy to deal
> with.

Yeah, I'm liking that idea. In reference to my [1] above, what I've
started with is making:

    git fetch http://host/foo.bundle

work automatically. And it does work. But it actually spools the
bundle to disk and then unpacks from it, rather than placing it right
into the objects/pack directory. I did this because:

  1. We have to feed it to "index-pack --fix-thin", because bundles
     can be thin. So they're not suitable for sticking right into the
     pack directory.

  2. We could feed it straight to an index-pack pipe, but then we
     don't have a byte-for-byte file on disk to resume an interrupted
     transfer from.

But spooling sucks, of course. It means we use twice as much disk
space during the index-pack as we would otherwise need, not to mention
the latency of not starting the index-pack until we have the whole
file.

Pulling down a non-thin packfile makes the problem go away. We can
spool it right into objects/pack, index it on the fly, and if all is
well, move it into its final filename. If the transfer is interrupted,
you drop what's been indexed so far, finish the transfer, and then
re-start the indexing from scratch (actually, the "on the fly" part
would probably involve teaching index-pack to be clever about
incrementally reading a partially written file, but it should be
possible).

-Peff

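A sketch of that non-thin resumable flow from the client's side (the
URL and filenames are hypothetical):

    cd repo/.git
    # Resumable download straight into the object store; -C - picks up
    # where a previous interrupted transfer left off.
    curl -f -C - -o objects/pack/incoming.pack http://host/big.pack
    # Build the .idx; a complete (non-thin) pack needs no --fix-thin.
    # index-pack prints the hash used in the final pack-<hash> name,
    # so the two files can then be renamed into place.
    git index-pack -v objects/pack/incoming.pack
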
* Re: New Feature wanted: Is it possible to let git clone continue last break point?

From: Shawn Pearce @ 2011-11-03 4:19 UTC
To: Jeff King
Cc: Junio C Hamano, Jonathan Nieder, netroby, Git Mail List, Tomas Carnecky

On Wed, Nov 2, 2011 at 19:42, Jeff King <peff@peff.net> wrote:

> I thought of doing something like that, but I wanted to be able to
> make cross-domain links. The "302 to a CDN" thing is a clever hack,
> but it requires more control of the webserver than some users might
> have. And of course it doesn't work for the "redirect to git:// on a
> different server" trick. Or redirecting from "git://".

I agree. Later I said I regret this being a bundle file; I also regret
it being this $URL/clone.bundle thing. It's a reasonable quick hack in
Python for repo: it's cheap for my servers to respond "404 Not Found"
or "302 Found", and cheap to use the CDN. But it isn't the right
solution for git-core.

It has given us some useful information already in the context of
android.googlesource.com. It appears to work quite well for
distributing the large Android operating system. So the notion of
making packs available from another URL than the main repository,
doing it primarily as a pack rather than via the native Git protocol,
with a follow-up incremental fetch to bring the client current, seems
to work. :-)

> My thought of having it in "refs/mirrors" is only slightly less
> hacky, but I think it covers all of those cases. :)

Right, but this would have been a bit more work for me to code in
Python. :-) Long term this may be a better approach, because it allows
the user to control the redirect without having full control over
their HTTP server. It also supports redirections across protocols,
like you noted above. So it's probably the direction we will see
git-core take.

> > Actually, I really do think the bundle thing is wasteful...
> > sendfile() capability.
>
> I didn't quite parse this. You say it is wasteful, but then indicate
> that it can use sendfile(), which is a good thing.

Apparently I was babbling. Based on what else you say, we agree; that
is good enough for me.

> > It's also expensive for kernel.org to store each Git repository
> > twice on disk. The disk is cheap; it's the kernel buffer cache that
> > is damned expensive.
>
> Doubling the required disk cache is evil and ugly. I was hoping it
> wouldn't matter because the bundle would be hosted on some far-away
> CDN server anyway, though that is highly dependent on your setup. And
> it's really just glossing over the fact that you have twice as many
> servers. ;)

Right. :-) In my opinion this is the important part: we shouldn't
double the disk usage required to support this. Most users can't
afford the extra disk cache or the extra server required to make this
work well. But they can use sendfile() on the server they have and get
a lot of improvement in clone speed due to lower system load, plus a
resumable clone for the relatively stable part of the history.

> Another issue with packs is that they generally aren't supposed to be
> --thin on disk, whereas bundles can be. So I could point you to a
> succession of bundles. Which is maybe a feature, or maybe just makes
> things insanely complex [1].

Actually, we can store --thin on disk safely. Don't laugh until you
finish reading it through.

To build an incremental pack, we modify pack-objects to construct a
completed thin pack on disk. Build up the list of objects that you
want in the thin pack, as though it were thin. Use the REF_DELTA
format to reference objects that are not in this set but are delta
bases. Then copy the necessary delta bases from the base pack over to
the thin pack, at the end, just as would happen if it were received
over the wire. The pack is now self-contained, like it's supposed to
be, but the tail of it is redundant information.

If you cache alongside the pack the "thin" object count, the cut
offset between the thin part and the completed bases, and the SHA-1 of
the "thin" pack, you can serve the "thin" pack by copying the header,
then the region of the file up to the cut point, and then the final
SHA-1. And there are no pack file format changes involved. :-)

Obviously this has some downside. Using REF_DELTA instead of OFS_DELTA
for the relatively small number of references from the "thin" part to
the completed part at the tail isn't a big disk space overhead. The
big overhead is storing the boundary data that served as delta bases
at the tail of this incremental pack. But we already do that when such
a section of data is transferred over the network and it contains more
than 100 objects. So I think we can get away with this. The serving
repository is in no worse state than if the owner had just pushed all
of that incremental stuff into it and the result was completed as a
thin pack. With only 2 packs in the serving repository (e.g. the
historical stuff that is stable, plus the current incremental thin
pack with its completed bases), git gc --auto wouldn't even kick in to
GC this thing for a while *anyway*. So we probably already have a ton
of repositories in the wild that exhibit this disk layout and space
usage, and nobody has complained about it.

For a server admin or repository owner who cares about his users'
resumable clone support, carrying around a historical pack and a
single new incremental pack for, say, 2-3 months before repacking the
entire thing down to 1 new historical pack... the disk space and
additional completed-base data is an acceptable cost. We already do
it.

Clients can figure out whether they should use an incremental pack
download or the native Git protocol if the incremental pack does what
a bundle does and stores the base information alongside it. Actually,
you don't want the bases themselves (the ^ lines in a bundle), but
their immediate children. If the client has any of those children,
there is some chance the client has other objects in the pack and
should favor the native protocol. But if the client has none of those
base children, yet does have the bases, it may be more efficient to
download the entire pack to bring the client current.

The problem with incremental pack updates is balancing the number of
round-trip requests against the update rate of the repository against
the polling frequency of the client. It's not an easy thing to solve.

However, we may be able to do better if the server can do a reasonably
fast concatenation of these thin-pack slices, writing a new object
header and computing the SHA-1 trailer as it goes. Instead of
computing actual graph connectivity, just concatenate packs together
between the base children and the requested tips. This probably
requires that the client ask for every branch (e.g. the typical
refs/heads/*:refs/remotes/origin/* refspec) and that branches didn't
rewind. But I think this case is so common that it is perhaps
worthwhile to look into optimizing.

But note that we can do this in the native protocol on the server side
without telling the client anything, or changing the protocol. It just
isn't resumable without a bit more glue to make a state marker
available to the client. Nor does it work on a CDN without giving the
client more information. :-)

> So the sendfile() stuff would always happen over http.

I'm OK with that. I was just saying we may be able to also support
sendfile() over git:// if the repository owner / git-daemon owner
wants us to. Or if not sendfile(), a simple read-write loop that
doesn't have to look at the data, since the client will validate it
all.

> Yeah, I'm liking that idea. In reference to my [1] above, what I've
> started with is making:
>
>     git fetch http://host/foo.bundle

This should work, whether or not we use it for resumable clone. It's
just nice to have that tiny bit of extra glue to make it easy to pull
a bundle. So I'd like this too. :-)

> Pulling down a non-thin packfile makes the problem go away. We can
> spool it right into objects/pack, index it on the fly, and if all is
> well, move it into its final filename.

I wonder if we can teach index-pack to work with a thin pack on disk
and complete it by appending to the file, in addition to the streaming
from stdin that it already supports. It seems like that should be
possible. Then you could save a thin pack to a temp file on disk, and
thus could split a bundle header from its pack content, saving them
into two different temp files, allowing index-pack to avoid copying
the pack portion if it's non-thin, or if it's a huge thin pack.

I did think about doing this in "repo" and decided it was complex and
not worth the effort. So we spool. 2G+ bundles. It's not the most
pleasant user experience. If I had more time, I would have tried to
split the bundle header from the pack and written the pack directly
off for index-pack to read from disk.

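For what it's worth, the header/pack split Shawn mentions is fairly
mechanical, since a v2 bundle is a short text header (the "# v2 git
bundle" line plus prerequisite and ref lines), a blank line, and then
a raw pack. A rough sketch, assuming GNU grep for the byte offset:

    # Byte offset of the first blank line, which ends the header.
    offset=$(grep -abm1 '^$' clone.bundle | cut -d: -f1)
    # Everything before it is the header (prerequisites and refs)...
    head -c "$offset" clone.bundle > header.txt
    # ...and everything after it is a plain pack. If the bundle had
    # prerequisites, the pack is thin, so complete it via --fix-thin.
    tail -c +"$((offset + 2))" clone.bundle > incoming.pack
    git index-pack --stdin --fix-thin completed.pack < incoming.pack
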
* Re: New Feature wanted: Is it possible to let git clone continue last break point?

From: Clemens Buchacher @ 2011-11-04 8:56 UTC
To: Shawn Pearce
Cc: Jeff King, Junio C Hamano, Jonathan Nieder, netroby, Git Mail List, Tomas Carnecky

On Wed, Nov 02, 2011 at 09:19:03PM -0700, Shawn Pearce wrote:

> [...] But they can use sendfile() on the server they have and get a
> lot of improvement in clone speed due to lower system load, plus a
> resumable clone for the relatively stable part of the history.

Setting aside the system load issue for now, couldn't we simply do the
following?

 1. Figure out the HAVEs and WANTs [1], based on which an ad-hoc pack
    will be made and sent to the client.

 2. Cache that information on disk (not the pack, but the information
    needed to re-create it), and give the client a "ticket number"
    which corresponds to that ad-hoc pack.

 3. Start downloading the packfile.

When the connection drops, we can resume like this:

 1. Send the previously received "ticket number" and the amount of
    previously received data.

 2. Re-generate the pack from the HAVEs and WANTs cached under that
    ticket number. (This may fail if the repo state has changed such
    that previously accessible refs are now inaccessible.)

 3. Resume the download of that pack.

The upside of this approach is that it would work automatically,
without any manual setup by the server admin.

All the previously discussed ideas skip the step where we figure out
the HAVEs and WANTs. And to me that implies that we manually prepare a
packfile somewhere on disk, which contains what the user usually WANTs
and is allowed to have (think per-branch access control). Even if we
disregard access control, wouldn't that at least require the server to
create a "clean" pack which does not contain any objects reachable
only from the reflog?

The whole mirror thing could be pursued independently of the resume
capability, and if each git repo is capable of resuming, the mirrors
can be plain git clones as well.

Just my 2 cents,
Clemens

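To make the proposal concrete, a hypothetical wire exchange (the
"ticket" and "resume" verbs do not exist in the git protocol; they are
invented purely to illustrate Clemens's idea):

    C: want <sha1> ... / have <sha1> ...   # normal negotiation
    S: ticket 8f31c2                       # server caches wants/haves
    S: <pack data...>                      # transfer begins
       --- connection drops after N bytes ---
    C: resume 8f31c2 <N>                   # client reconnects
    S: <pack data from byte N onward>      # the same pack, re-created
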
* Re: New Feature wanted: Is it possible to let git clone continue last break point?

From: Johannes Sixt @ 2011-11-04 9:35 UTC
To: Clemens Buchacher
Cc: Shawn Pearce, Jeff King, Junio C Hamano, Jonathan Nieder, netroby, Git Mail List, Tomas Carnecky

Am 11/4/2011 9:56, schrieb Clemens Buchacher:

> Cache ... not the pack but the information to re-create it...

It has been discussed. It doesn't work, because with threaded pack
generation the resulting pack is not deterministic.

-- Hannes

* Re: New Feature wanted: Is it possible to let git clone continue last break point?

From: Shawn Pearce @ 2011-11-04 14:22 UTC
To: Johannes Sixt
Cc: Clemens Buchacher, Jeff King, Junio C Hamano, Jonathan Nieder, netroby, Git Mail List, Tomas Carnecky

On Fri, Nov 4, 2011 at 02:35, Johannes Sixt <j.sixt@viscovery.net> wrote:
> Am 11/4/2011 9:56, schrieb Clemens Buchacher:
> > Cache ... not the pack but the information to re-create it...
>
> It has been discussed. It doesn't work, because with threaded pack
> generation the resulting pack is not deterministic.

The information needed to re-create a pack for a repository with 2M
objects (e.g. the Linux kernel tree) is *at least* 152M of data. This
is just a first-order approximation of what it takes to write out the
2M SHA-1s, along with, say, a 4-byte length so that, given an offset
provided by the client, you can find roughly where to resume in the
object stream. This is like 25% of the pack size itself. Ouch.

This data is still insufficient to resume from. A correct solution
would allow you to resume in the middle of an object, which means we
also need to store some sort of indicator of which representation was
chosen from an existing pack file for object reuse. Which adds more
data to the stream. And then there is the not-so-simple problem of how
to resume in the middle of an object that was being recompressed on
the fly, such as a large loose object.

By the time you get done with all of that, your "ticket" might as well
be the name of a pack file. And your "resume information" is just a
pack file itself, which would be very expensive to recreate.

* Re: New Feature wanted: Is it possible to let git clone continue last break point?

From: Jakub Narebski @ 2011-11-04 15:55 UTC
To: Shawn Pearce
Cc: Johannes Sixt, Clemens Buchacher, Jeff King, Junio C Hamano, Jonathan Nieder, netroby, Git Mail List, Tomas Carnecky

Shawn Pearce <spearce@spearce.org> writes:

> The information needed to re-create a pack for a repository with 2M
> objects (e.g. the Linux kernel tree) is *at least* 152M of data.
> [...] This is like 25% of the pack size itself. Ouch.

Well, perhaps we could keep a few of the most popular packs in some
kind of cache (the packfile being saved to disk as it is streamed, if
we detect that it will be large), indexed by WANTs / HAVEs?

> This data is still insufficient to resume from. A correct solution
> would allow you to resume in the middle of an object [...]

Well, so you wouldn't be able to just concatenate the received data.
Still, it should be possible to "repair" a halfway-downloaded partial
pack...

Just my 2 eurocents^W groszy.

--
Jakub Narębski

* Re: New Feature wanted: Is it possible to let git clone continue last break point?

From: Nguyen Thai Ngoc Duy @ 2011-11-04 16:05 UTC
To: Shawn Pearce
Cc: Johannes Sixt, Clemens Buchacher, Jeff King, Junio C Hamano, Jonathan Nieder, netroby, Git Mail List, Tomas Carnecky

2011/11/4 Shawn Pearce <spearce@spearce.org>:
> By the time you get done with all of that, your "ticket" might as
> well be the name of a pack file. And your "resume information" is
> just a pack file itself, which would be very expensive to recreate.

I'll deal with the initial clone case only here. Can we make the git
protocol send multiple packs, i.e. send the on-disk packs one by one,
each together with its pack SHA-1? This way we do not need to
re-create anything. If new packs are created during cloning, the git
client should be able to construct its "have" list from the good packs
and fetch the updates from the server again.

--
Duy

* Re: New Feature wanted: Is it possible to let git clone continue last break point?

From: Clemens Buchacher @ 2011-11-05 10:00 UTC
To: Shawn Pearce
Cc: Johannes Sixt, Jeff King, Junio C Hamano, Jonathan Nieder, netroby, Git Mail List, Tomas Carnecky

On Fri, Nov 04, 2011 at 07:22:20AM -0700, Shawn Pearce wrote:
> On Fri, Nov 4, 2011 at 02:35, Johannes Sixt <j.sixt@viscovery.net> wrote:
> > It has been discussed. It doesn't work, because with threaded pack
> > generation the resulting pack is not deterministic.

So let the client disable it, if they'd rather have a resumable fetch
than a fast one. Sorry if I'm being obstinate here, but I don't
understand the problem, and I can't find an explanation in the related
discussions.

> The information needed to re-create a pack for a repository with 2M
> objects (e.g. the Linux kernel tree) is *at least* 152M of data.
> [...] This is like 25% of the pack size itself. Ouch.

Sorry, I should not have said HAVEs. All we need is the common
commits, and the SHA-1s of the WANTed branch heads at the time of the
initial fetch. That shouldn't be more than 10 or so in typical cases.

> This data is still insufficient to resume from. A correct solution
> would allow you to resume in the middle of an object, which means we
> also need to store some sort of indicator of which representation was
> chosen from an existing pack file for object reuse. [...]

How often does the "representation chosen from an existing pack file
for object reuse" change? Long-term determinism is a problem, yes. But
I see no reason why it should not work for this short-term case. So
long as the pack is created by one particular git and libz version,
and for this particular consecutive run of fetches, we do not need to
store anything about the pack. The client downloads n MB of data until
the connection drops. To resume, the client says it already has n MB
of data. No?

Clemens

* Re: New Feature wanted: Is it possible to let git clone continue last break point?

From: Jakub Narebski @ 2011-10-31 9:14 UTC
To: netroby
Cc: Git Mail List

netroby <hufeng1987@gmail.com> writes:

> Is it possible to let git clone continue from the last break point?
> When we git clone a very large project from the web, we may hit an
> interruption, and then we have to clone it again from zero.

Resuming "git clone" is not currently possible in Git, and it would be
difficult to add such a feature; there have been several attempts, and
none succeeded.

What you can do is generate a starter bundle out of your repository
(using "git bundle") and serve this file via HTTP / FTP / BitTorrent,
i.e. some resumable transport. Then you "git clone <bundle file>", fix
up the configuration, and fetch the rest since bundle creation. Though
this is possible only if it is your project... or you can ask the
project administrator to provide a bundle.

--
Jakub Narębski

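A minimal sketch of the bundle workflow Jakub describes (host names
and paths are placeholders):

    # On the server (or by the project admin): bundle all branches.
    git bundle create starter.bundle --all
    # Publish starter.bundle over any resumable transport.

    # On the client: resumable download, then clone from the file.
    wget -c http://host/starter.bundle
    git clone starter.bundle project
    cd project
    # Fix up the configuration: point origin at the live repository.
    git remote set-url origin git://host/project.git
    # Fetch whatever was added after the bundle was created.
    git fetch origin
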
* Re: New Feature wanted: Is it possible to let git clone continue last break point?

From: Michael Schubert @ 2011-10-31 12:49 UTC
To: Jakub Narebski
Cc: netroby, Git Mail List

On 10/31/2011 10:14 AM, Jakub Narebski wrote:

> Resuming "git clone" is not currently possible in Git, and it would
> be difficult to add such a feature; there have been several attempts,
> and none succeeded.
>
> What you can do is generate a starter bundle out of your repository
> (using "git bundle") and serve this file via HTTP / FTP / BitTorrent,
> i.e. some resumable transport. Then you "git clone <bundle file>",
> fix up the configuration, and fetch the rest since bundle creation.

There's also a "git bundler service":
http://comments.gmane.org/gmane.comp.version-control.git/181380
