* Resumable git clone?
  From: Josh Triplett @ 2016-03-02  1:30 UTC
  To: git; +Cc: sarah, viro

If you clone a repository, and the connection drops, the next attempt
will have to start from scratch.  This can add significant time and
expense if you're on a low-bandwidth or metered connection trying to
clone something like Linux.

Would it be possible to make git clone resumable after a partial clone?
(And, ideally, to make that the default?)

In a discussion elsewhere, Al Viro suggested taking the partial pack
received so far, repairing any truncation, indexing the objects it
contains, and then re-running clone and not having to fetch those
objects.  This may also require extending receive-pack's protocol for
determining objects the recipient already has, as the partial pack may
not have a consistent set of reachable objects.

Before starting down the path of developing patches for this, does the
approach seem potentially reasonable?

- Josh Triplett

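(A minimal sketch of that salvage step using plumbing that exists
today; "partial.pack" is a hypothetical name for whatever the
truncated transfer left behind:

    $ git unpack-objects -r < partial.pack   # -r: keep going past the
                                             # corruption, recovering as
                                             # many objects as possible
    $ git fsck                               # see what was salvaged

A re-run clone would then need a protocol extension, as discussed
below, to avoid re-fetching the salvaged, now-loose objects.)
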
* Re: Resumable git clone?
  From: Stefan Beller @ 2016-03-02  1:40 UTC
  To: Josh Triplett, Duy Nguyen; +Cc: git@vger.kernel.org, sarah, viro

+ Duy, who tried resumable clone a few days/weeks ago

On Tue, Mar 1, 2016 at 5:30 PM, Josh Triplett <josh@joshtriplett.org> wrote:
> If you clone a repository, and the connection drops, the next attempt
> will have to start from scratch.  This can add significant time and
> expense if you're on a low-bandwidth or metered connection trying to
> clone something like Linux.
>
> Would it be possible to make git clone resumable after a partial clone?
> (And, ideally, to make that the default?)
>
> In a discussion elsewhere, Al Viro suggested taking the partial pack
> received so far,

ok,

> repairing any truncation,

So throwing away half-finished stuff while keeping the front load?

> indexing the objects it
> contains, and then re-running clone and not having to fetch those
> objects.

The pack is not deterministic for a given repository.  When creating
the pack, you may encounter races between threads, such that the order
in a pack differs.

> This may also require extending receive-pack's protocol for
> determining objects the recipient already has, as the partial pack may
> not have a consistent set of reachable objects.
>
> Before starting down the path of developing patches for this, does the
> approach seem potentially reasonable?

That sounds reasonable on a high level, but I'd expect it to blow up
in complexity, either in the receive-pack protocol or in the code that
has to handle the partial state.

Thanks,
Stefan

* Re: Resumable git clone?
  From: Al Viro @ 2016-03-02  2:30 UTC
  To: Stefan Beller; +Cc: Josh Triplett, Duy Nguyen, git@vger.kernel.org, sarah

On Tue, Mar 01, 2016 at 05:40:28PM -0800, Stefan Beller wrote:

> So throwing away half-finished stuff while keeping the front load?

Throw away the object that got truncated, and the ones whose delta
chains don't resolve entirely within the transferred part.

> > indexing the objects it
> > contains, and then re-running clone and not having to fetch those
> > objects.
>
> The pack is not deterministic for a given repository.  When creating
> the pack, you may encounter races between threads, such that the order
> in a pack differs.

FWIW, I wasn't proposing to recreate the remaining bits of that _pack_;
just do the normal pull with one addition: start with sending the list
of sha1 of objects you are about to send and let the recipient reply
with "I already have <set of sha1>, don't bother with those".  And
exclude those from the transfer.  The encoding for "the set I already
have" is an interesting variable here -- it might be a plain list of
sha1, might be its complement ("I want the following subset"), might be
"145th to 1029th, 1517th and 1890th to 1920th of the list you've sent";
which form ends up more efficient needs to be found experimentally...

IIRC, the objection had been that the organisation of the pack will
lead to many cases where deltas are transferred *first*, with the base
object not getting there prior to disconnect.  I suspect that the
fraction of the objects getting through would still be worth it, but I
haven't experimented enough to be able to tell...  I was more
interested in resumable _pull_, with a restarted clone treated as a
special case of that.

* Re: Resumable git clone?
  From: Junio C Hamano @ 2016-03-02  6:31 UTC
  To: Al Viro
  Cc: Stefan Beller, Josh Triplett, Duy Nguyen, git@vger.kernel.org, sarah

Al Viro <viro@ZenIV.linux.org.uk> writes:

> FWIW, I wasn't proposing to recreate the remaining bits of that _pack_;
> just do the normal pull with one addition: start with sending the list
> of sha1 of objects you are about to send and let the recipient reply
> with "I already have <set of sha1>, don't bother with those".  And
> exclude those from the transfer.

I did a quick-and-dirty unscientific experiment.

I had a clone of Linus's repository that was about a week old, whose
tip was at 4de8ebef (Merge tag 'trace-fixes-v4.5-rc5' of
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace,
2016-02-22).  To bring it up to date (i.e. pull about a week's worth
of progress) to f691b77b (Merge branch 'for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs, 2016-03-01):

    $ git rev-list --objects 4de8ebef..f691b77b1fc | wc -l
    1396
    $ git rev-parse 4de8ebef..f691b77b1fc |
      git pack-objects --revs --delta-base-offset --stdout |
      wc -c
    2444127

So in order to salvage some transfer out of 2.4MB, the hypothetical
Al protocol would first have upload-pack give 20*1396 = 28kB of
object names to fetch-pack; no matter how fetch-pack encodes its
preference, its answer would be less than 28kB.  We would likely
design this part of the new protocol in line with the existing part
and use textual object names, so let's round that up to 100kB.  That
is quite small; even if you are on a crappy connection and need to
retry 5 times, the additional overhead to negotiate the list of
objects alone would be 0.5MB (or less than 20% of the real transfer).
That is quite interesting [*1*].

For the approach to be practical, you would have to write a program
that reads from a truncated packfile and writes a new packfile,
excising deltas that lack their bases, to salvage objects from a
half-transferred packfile; it is however unclear how involved that
code would get.  It is probably OK for a tiny pack that has only 1400
objects--we could just pass the early part through unpack-objects and
let it die when it hits EOF--but for a "resumable clone", I do not
think you can afford to unpack the 4.6M objects in the kernel
repository into loose objects.

The approach of course requires the server end to spend 5 times as
many cycles as usual in order to help a client that retries 5 times.
On the other hand, the resumable "clone" we were discussing--allowing
the server to respond with a slightly older bundle or pack and then
asking the client to fill in the latest bits with a follow-up
fetch--aims to reduce the load on the server side (the "slightly
older" part can be offloaded to a CDN).  It is a happy side effect
that material offloaded to a CDN can be obtained more easily over
HTTPS, which is trivially resumable ;-)

I think your "I've got these already" extension may be worth trying,
and it is definitely better than "let's make sure the server end
creates a byte-for-byte identical pack stream, and discard the early
part without sending it to the network", and it may help resuming a
small incremental fetch, but I do not think it is advisable to use it
for a full clone, given that it is very likely that we will be adding
the "offload 'clone' to CDN" kind.  Even though I can foresee both
kinds co-existing, I do not think it is practical to offer it for
resuming a multi-hour clone of the kernel repository (or worse, the
Android repositories) over a trans-Pacific link, for example.

[Footnote]

*1* To update v4.5-rc1 to today's HEAD involves 10809 objects, and
    the pack data takes 14955728 bytes.  That translates to ~440kB
    needed to advertise a list of textual object names to salvage an
    object transfer of 15MB.

* Re: Resumable git clone?
  From: Duy Nguyen @ 2016-03-02  7:37 UTC
  To: Junio C Hamano
  Cc: Al Viro, Stefan Beller, Josh Triplett, git@vger.kernel.org, sarah

On Wed, Mar 2, 2016 at 1:31 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Al Viro <viro@ZenIV.linux.org.uk> writes:
>
>> FWIW, I wasn't proposing to recreate the remaining bits of that _pack_;
>> just do the normal pull with one addition: start with sending the list
>> of sha1 of objects you are about to send and let the recipient reply
>> with "I already have <set of sha1>, don't bother with those".  And
>> exclude those from the transfer.
>
> I did a quick-and-dirty unscientific experiment.
>
> I had a clone of Linus's repository that was about a week old, whose
> tip was at 4de8ebef (Merge tag 'trace-fixes-v4.5-rc5' of
> git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace,
> 2016-02-22).  To bring it up to date (i.e. pull about a week's worth
> of progress) to f691b77b (Merge branch 'for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs, 2016-03-01):
>
>     $ git rev-list --objects 4de8ebef..f691b77b1fc | wc -l
>     1396
>     $ git rev-parse 4de8ebef..f691b77b1fc |
>       git pack-objects --revs --delta-base-offset --stdout |
>       wc -c
>     2444127
>
> So in order to salvage some transfer out of 2.4MB, the hypothetical
> Al protocol would first have upload-pack give 20*1396 = 28kB of

It could be 10*1396 or less.  If the server calculates the shortest
unambiguous SHA-1 length (quite cheap on a fully packed repo) and sends
it to the client, the client can just send short SHA-1s instead.  It's
racy, though, because objects are being added to the server and the
abbreviation length may go up.  But we can check ambiguity for all the
SHA-1s sent by the client and ask for a resend of the ambiguous ones.

On my linux-2.6.git, 10 hex digits (so 5 bytes) are needed for an
unambiguous short SHA-1.  But we could even go optimistic and ask the
client for shorter SHA-1s, hoping that the resends won't be many.

> object names to fetch-pack; no matter how fetch-pack encodes its
> preference, its answer would be less than 28kB.  We would likely
> design this part of the new protocol in line with the existing part
> and use textual object names, so let's round that up to 100kB.
--
Duy

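(A rough way to measure that shortest unambiguous length from the
shell -- a sketch, assuming a git recent enough for cat-file's
--batch-all-objects: sort all object names and find the longest common
prefix between neighbours; one more hex digit than that is enough:

    $ git cat-file --batch-all-objects --batch-check='%(objectname)' |
      sort |
      awk 'NR > 1 { n = 0
                    while (substr($0, n + 1, 1) == substr(prev, n + 1, 1)) n++
                    if (n > max) max = n }
           { prev = $0 }
           END { print max + 1 }'

On a fully packed repository this is cheap, as Duy says.)
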
* Re: Resumable git clone?
  From: Duy Nguyen @ 2016-03-02  7:44 UTC
  To: Junio C Hamano
  Cc: Al Viro, Stefan Beller, Josh Triplett, git@vger.kernel.org, sarah

On Wed, Mar 2, 2016 at 2:37 PM, Duy Nguyen <pclouds@gmail.com> wrote:
>> So in order to salvage some transfer out of 2.4MB, the hypothetical
>> Al protocol would first have upload-pack give 20*1396 = 28kB of
>
> It could be 10*1396 or less....

Oops, somehow I read the previous mails as the client sending SHA-1s
to the server, not the other way around as you and Al were discussing.
But the same principle applies in the other direction, I think.
--
Duy

* Re: Resumable git clone?
  From: Josh Triplett @ 2016-03-02  7:54 UTC
  To: Duy Nguyen
  Cc: Junio C Hamano, Al Viro, Stefan Beller, git@vger.kernel.org, sarah

On Wed, Mar 02, 2016 at 02:37:53PM +0700, Duy Nguyen wrote:
> On Wed, Mar 2, 2016 at 1:31 PM, Junio C Hamano <gitster@pobox.com> wrote:
> > So in order to salvage some transfer out of 2.4MB, the hypothetical
> > Al protocol would first have upload-pack give 20*1396 = 28kB of
>
> It could be 10*1396 or less.  If the server calculates the shortest
> unambiguous SHA-1 length (quite cheap on a fully packed repo) and sends
> it to the client, the client can just send short SHA-1s instead.  It's
> racy, though, because objects are being added to the server and the
> abbreviation length may go up.  But we can check ambiguity for all the
> SHA-1s sent by the client and ask for a resend of the ambiguous ones.
>
> On my linux-2.6.git, 10 hex digits (so 5 bytes) are needed for an
> unambiguous short SHA-1.  But we could even go optimistic and ask the
> client for shorter SHA-1s, hoping that the resends won't be many.

I don't think it's worth the trouble and ambiguity to send abbreviated
object names over the wire.  I think several simpler optimizations seem
preferable, such as binary object names, and abbreviating complete
object sets ("I have these commits/trees and everything they need
recursively; I also have this stack of random objects").  That would
work especially well for resumable pull, or for the case of optimizing
pull during the merge window.

- Josh Triplett

* Re: Resumable git clone?
  From: Junio C Hamano @ 2016-03-02  8:31 UTC
  To: Josh Triplett
  Cc: Duy Nguyen, Al Viro, Stefan Beller, git@vger.kernel.org, sarah

Josh Triplett <josh@joshtriplett.org> writes:

> I don't think it's worth the trouble and ambiguity to send abbreviated
> object names over the wire.

Yup.  My unscientific experiment was to show that the list would be
far smaller than the actual transfer, and that between full binary and
full textual object name representations there would not be much
meaningful difference--you seem to have a better design sense and
grasped that point ;-)

> I think several simpler optimizations seem
> preferable, such as binary object names, and abbreviating complete
> object sets ("I have these commits/trees and everything they need
> recursively; I also have this stack of random objects").

Given the way a pack stream is organized (i.e. commits first, and then
trees and blobs that belong to the same delta chain together), and our
assumed goal being to salvage objects from an interrupted transfer of
a packfile, you are unlikely to ever see "I have these commits/trees
and everything they need" salvaged from such a failed transfer.  So I
doubt such an optimization is worth doing.  Besides, it is very
expensive to compute (the computation is done on the client side, so
the cycles burned and the time the user has to wait are of much less
concern, though); you'd essentially be doing "git fsck" to find the
"dangling" objects.

The list of what would be transferred needs to come in full from the
server end, as the list names objects that the receiving end may not
have seen, but the response by the client could be encoded much more
tightly.  For the full list of N objects from the server, we can think
of your response as a bitstream of N bits, each on-bit in which
signals an unwanted object in the list.  You can optimize this
transfer by RLE-compressing the bitstream, for example.

As git-over-HTTP is stateless, however, you cannot assume that the
server side remembers what it sent to the client (instead, the client
side needs to re-post what it heard from the server in the previous
exchange to allow the server side to use it after validating).  So an
"objects at these indices in your list" kind of optimization may not
work very well in that environment.  I'd imagine that an exchange of
"Here is the list of objects", "Give me these objects" done naively in
full 40-hex object names would work OK there, though.

* Re: Resumable git clone?
  From: Duy Nguyen @ 2016-03-02  9:28 UTC
  To: Junio C Hamano
  Cc: Josh Triplett, Al Viro, Stefan Beller, git@vger.kernel.org, sarah

On Wed, Mar 2, 2016 at 3:31 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Josh Triplett <josh@joshtriplett.org> writes:
>
>> I don't think it's worth the trouble and ambiguity to send abbreviated
>> object names over the wire.
>
> Yup.  My unscientific experiment was to show that the list would be
> far smaller than the actual transfer, and that between full binary and
> full textual object name representations there would not be much
> meaningful difference--you seem to have a better design sense and
> grasped that point ;-)

It may matter, depending on your target users.  In order for a
fetch/pull to make progress, I need to get at least one object before
my connection goes down.  Picking a random blob in the "large file"
range in linux-2.6: fs/nls/nls_cp950.c, about 500kB.  Assume the worst
case, that the blob is transferred gzipped and not deltified: that's
about 100kB.  Assume also that I'm a lazy Linux lurker who only
fetches after every release; the rev-list output between v4.2 and v4.3
is 6MB.  Even if we transfer this list over http with compression, the
list is 2.9MB, way bigger than that one blob transfer--which raises
the bar for my fetch to make any progress at all.
--
Duy

* Re: Resumable git clone?
  From: Josh Triplett @ 2016-03-02 16:41 UTC
  To: Junio C Hamano
  Cc: Duy Nguyen, Al Viro, Stefan Beller, git@vger.kernel.org, sarah

On Wed, Mar 02, 2016 at 12:31:16AM -0800, Junio C Hamano wrote:
> Josh Triplett <josh@joshtriplett.org> writes:
> > I think several simpler optimizations seem
> > preferable, such as binary object names, and abbreviating complete
> > object sets ("I have these commits/trees and everything they need
> > recursively; I also have this stack of random objects").
>
> Given the way a pack stream is organized (i.e. commits first, and then
> trees and blobs that belong to the same delta chain together), and our
> assumed goal being to salvage objects from an interrupted transfer of
> a packfile, you are unlikely to ever see "I have these commits/trees
> and everything they need" salvaged from such a failed transfer.  So I
> doubt such an optimization is worth doing.

True for the resumable clone case.  For that optimization, I was
thinking of the "pull during the merge window" case that Al Viro was
also interested in optimizing.

> Besides, it is very expensive to compute (the computation is done on
> the client side, so the cycles burned and the time the user has to
> wait are of much less concern, though); you'd essentially be doing
> "git fsck" to find the "dangling" objects.

Trading client-side computation for bandwidth can potentially be
worthwhile if you have plenty of local compute but a slow and metered
link.

> The list of what would be transferred needs to come in full from the
> server end, as the list names objects that the receiving end may not
> have seen, but the response by the client could be encoded much more
> tightly.  For the full list of N objects from the server, we can think
> of your response as a bitstream of N bits, each on-bit in which
> signals an unwanted object in the list.  You can optimize this
> transfer by RLE-compressing the bitstream, for example.
>
> As git-over-HTTP is stateless, however, you cannot assume that the
> server side remembers what it sent to the client (instead, the client
> side needs to re-post what it heard from the server in the previous
> exchange to allow the server side to use it after validating).  So an
> "objects at these indices in your list" kind of optimization may not
> work very well in that environment.  I'd imagine that an exchange of
> "Here is the list of objects", "Give me these objects" done naively in
> full 40-hex object names would work OK there, though.

Good point.  Between statelessness and Duy's point about the client
list usually being smaller than the server list, perhaps it would make
sense to not have the server send a list at all, and just have the
client send its own list.

- Josh Triplett

* Re: Resumable git clone?
  From: Josh Triplett @ 2016-03-02  8:13 UTC
  To: Al Viro; +Cc: Stefan Beller, Duy Nguyen, git@vger.kernel.org, sarah

On Wed, Mar 02, 2016 at 02:30:24AM +0000, Al Viro wrote:
> On Tue, Mar 01, 2016 at 05:40:28PM -0800, Stefan Beller wrote:
>
> > So throwing away half-finished stuff while keeping the front load?
>
> Throw away the object that got truncated, and the ones whose delta
> chains don't resolve entirely within the transferred part.
>
> FWIW, I wasn't proposing to recreate the remaining bits of that _pack_;
> just do the normal pull with one addition: start with sending the list
> of sha1 of objects you are about to send and let the recipient reply
> with "I already have <set of sha1>, don't bother with those".  And
> exclude those from the transfer.  The encoding for "the set I already
> have" is an interesting variable here -- it might be a plain list of
> sha1, might be its complement ("I want the following subset"), might be
> "145th to 1029th, 1517th and 1890th to 1920th of the list you've sent";
> which form ends up more efficient needs to be found experimentally...

As a simple proposal, the server could send the list of hashes (in
approximately the same order it would send the pack), the client could
send back a bitmap where '0' means "send it" and '1' means "got that
one already", and the client could compress that bitmap.  That gives
you the RLE and similar without having to write it yourself.  That
might not be optimal, but it would likely set a high bar with minimal
effort.

One debatable optimization on top of that would rely on git object
structure to imply object hashes without sending them: the message
from the server could have a list of commit/tree hashes that imply
sending all objects reachable from those, without having to send all
the implied hashes.  However, that would then make the message back
from the client about what it already has larger and more complicated;
that might not make it worthwhile.

This seems like a good case for doing the simplest possible thing
first (complete hash list, compressed "got it already" bitmap), seeing
how much benefit that provides, and creating a v2 protocol if some
additional optimization proves sufficiently worthwhile.

- Josh Triplett

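(Illustrative only: mock up such a bitmap for a 4.6M-object clone that
died at 95%, one ASCII digit per object, and see what an off-the-shelf
compressor makes of it:

    $ { head -c 4370000 /dev/zero | tr '\0' '1'
        head -c  230000 /dev/zero | tr '\0' '0'
      } | gzip -9 | wc -c

Long runs like this deflate down to a few kilobytes, which is the
point: RLE-like behaviour comes for free, without a hand-written
encoding.)
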
* Re: Resumable git clone?
  From: Duy Nguyen @ 2016-03-02  8:22 UTC
  To: Josh Triplett; +Cc: Al Viro, Stefan Beller, git@vger.kernel.org, sarah

On Wed, Mar 2, 2016 at 3:13 PM, Josh Triplett <josh@joshtriplett.org> wrote:
> As a simple proposal, the server could send the list of hashes (in
> approximately the same order it would send the pack), the client could
> send back a bitmap where '0' means "send it" and '1' means "got that
> one already", and the client could compress that bitmap.  That gives
> you the RLE and similar without having to write it yourself.  That
> might not be optimal, but it would likely set a high bar with minimal
> effort.

We have an implementation of EWAH bitmap compression, so compressing
is not a problem.

But I still don't see why it's more efficient to have the server send
the hash list to the client.  Assume you need to transfer N objects.
That direction makes you always send N hashes.  But if the client
sends the list of already-fetched objects, M, then M <= N.  And we
won't need to send the bitmap.  What did I miss?
--
Duy

* Re: Resumable git clone?
  From: Jeff King @ 2016-03-02  8:32 UTC
  To: Duy Nguyen
  Cc: Josh Triplett, Al Viro, Stefan Beller, git@vger.kernel.org, sarah

On Wed, Mar 02, 2016 at 03:22:17PM +0700, Duy Nguyen wrote:
> We have an implementation of EWAH bitmap compression, so compressing
> is not a problem.
>
> But I still don't see why it's more efficient to have the server send
> the hash list to the client.  Assume you need to transfer N objects.
> That direction makes you always send N hashes.  But if the client
> sends the list of already-fetched objects, M, then M <= N.  And we
> won't need to send the bitmap.  What did I miss?

Right, I don't see what the point is in compressing the bitmap.  The
sha1 list for a clone of linux.git is 87 megabytes.  The return
bitmap, even naively, is 500K.  Unless you are trying to optimize for
wildly asymmetric links.

If the client just naively sends "here's what I have", then we know it
can never be _more_ than 87 megabytes.  And as a bonus, the longer the
list is, the more we are saving (so at the moment you are sending
82MB, it's really worth it, because you do have 95% of the pack, which
is worth amortizing).

I'm still a little dubious that anything involving "send all the
hashes" is going to be useful in practice, especially for something
like the kernel (where you have a huge number of small objects that
delta well).  It would work better when you have gigantic objects that
don't delta (so the cost of a sha1 versus the object size is way
better), but then I think we'd do better to transfer all of the
normal-sized bits up front, and then allow fetching the large stuff
separately.

-Peff

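(The arithmetic behind those figures, using the 4.6M-object count
Junio quoted earlier in the thread; sizes are approximate:

    $ echo $((4600000 * 20 / 1024 / 1024))  # binary sha1 list, in MB: 87
    $ echo $((4600000 / 8 / 1024))          # 1-bit-per-object bitmap, in kB: 561
)
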
* Re: Resumable git clone?
  From: Bhavik Bavishi @ 2016-03-02 10:47 UTC
  To: git; +Cc: Josh Triplett, Al Viro, Stefan Beller, git@vger.kernel.org, sarah

On 3/2/16 2:02 PM, Jeff King wrote:
> I'm still a little dubious that anything involving "send all the
> hashes" is going to be useful in practice, especially for something
> like the kernel (where you have a huge number of small objects that
> delta well).  It would work better when you have gigantic objects that
> don't delta (so the cost of a sha1 versus the object size is way
> better), but then I think we'd do better to transfer all of the
> normal-sized bits up front, and then allow fetching the large stuff
> separately.

What if we had an object-lookup database storing, for each object, its
SHA-1, type, parent (if any), and size--the entire hierarchy tree, but
without payload data such as commit messages or tag names?  This
implementation would duplicate some existing information.  At initial
clone time the server sends the object-lookup database to the client;
the client then reads it and asks the server for objects by SHA-1,
possibly fetching in parallel.  The process may not be
transfer-efficient, but it is resumable, as the client knows what has
been synced, what remains, and which SHA-1 refers to which object
type.

* Re: Resumable git clone?
  From: Josh Triplett @ 2016-03-02 16:40 UTC
  To: Duy Nguyen; +Cc: Al Viro, Stefan Beller, git@vger.kernel.org, sarah

On Wed, Mar 02, 2016 at 03:22:17PM +0700, Duy Nguyen wrote:
> But I still don't see why it's more efficient to have the server send
> the hash list to the client.  Assume you need to transfer N objects.
> That direction makes you always send N hashes.  But if the client
> sends the list of already-fetched objects, M, then M <= N.  And we
> won't need to send the bitmap.  What did I miss?

M can potentially be larger than N if you have many remotes and
branches in your local repository that the server doesn't have.
However, that certainly wouldn't be the common case, and in that case
heuristics on the client side could help in determining a subset to
send.

I can't think of any good argument for the server's hash list; a
client-sent list does seem reasonable.

- Josh Triplett

* Re: Resumable git clone?
  From: Duy Nguyen @ 2016-03-02  8:14 UTC
  To: Al Viro; +Cc: Stefan Beller, Josh Triplett, git@vger.kernel.org, sarah

On Wed, Mar 2, 2016 at 9:30 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> IIRC, the objection had been that the organisation of the pack will
> lead to many cases where deltas are transferred *first*, with the base
> object not getting there prior to disconnect.  I suspect that the
> fraction of the objects getting through would still be worth it, but I
> haven't experimented enough to be able to tell...

No.  If deltas refer to base objects by offset, the (unsigned) offset
is negated before use, so base objects must always be sent first.  If
deltas refer to base objects by full SHA-1, then base objects can in
theory appear anywhere in the pack.  But I think we only use full
SHA-1 references for out-of-thin-pack objects, never for an existing
object in the same pack.
--
Duy

* Re: Resumable git clone?
  From: Duy Nguyen @ 2016-03-02  1:45 UTC
  To: Josh Triplett; +Cc: Git Mailing List, sarah, viro

On Wed, Mar 2, 2016 at 8:30 AM, Josh Triplett <josh@joshtriplett.org> wrote:
> Would it be possible to make git clone resumable after a partial clone?
> (And, ideally, to make that the default?)
>
> Before starting down the path of developing patches for this, does the
> approach seem potentially reasonable?

This topic came up recently (thanks Sarah!) and Shawn proposed a
different approach [1] that (I think) is simpler and more effective
for the resumable _clone_ case.  I'm not sure if anybody is
implementing it, though.

[1] http://thread.gmane.org/gmane.comp.version-control.git/285921
--
Duy

* Re: Resumable git clone?
  From: Junio C Hamano @ 2016-03-02  8:41 UTC
  To: Josh Triplett, Konstantin Ryabitsev; +Cc: git, sarah, viro

Josh Triplett <josh@joshtriplett.org> writes:

> If you clone a repository, and the connection drops, the next attempt
> will have to start from scratch.  This can add significant time and
> expense if you're on a low-bandwidth or metered connection trying to
> clone something like Linux.

For this particular issue, your friendly k.org administrator already
has a solution.  torvalds/linux.git is made into a bundle weekly with

    $ git bundle create clone.bundle --all

and the result is placed on the k.org CDN.  So low-bandwidth cloners
can grab it over resumable http, clone from the bundle, and then fill
in the most recent part by fetching from k.org.

The tooling to let this kind of "bundle" (and possibly other forms of
"CDN offload" material) be transparently used by "git clone" was the
proposal by Shawn Pearce mentioned elsewhere in this thread.

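(The low-bandwidth workflow this enables looks roughly like the
following; the bundle URL here is an illustrative placeholder, not the
real location -- see Konstantin's write-up below for the actual
procedure:

    $ wget -c https://cdn.kernel.org/.../clone.bundle   # -c: resume a
                                                        # partial download
    $ git clone clone.bundle linux
    $ cd linux
    $ git remote set-url origin \
        https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
    $ git fetch origin                  # fill in the most recent part
)
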
* Re: Resumable git clone?
  From: Konstantin Ryabitsev @ 2016-03-02 15:51 UTC
  To: Junio C Hamano; +Cc: Josh Triplett, git, sarah, viro

On Wed, Mar 02, 2016 at 12:41:20AM -0800, Junio C Hamano wrote:
> For this particular issue, your friendly k.org administrator already
> has a solution.  torvalds/linux.git is made into a bundle weekly with
>
>     $ git bundle create clone.bundle --all
>
> and the result is placed on the k.org CDN.  So low-bandwidth cloners
> can grab it over resumable http, clone from the bundle, and then fill
> in the most recent part by fetching from k.org.

I finally got around to documenting this here:
https://kernel.org/cloning-linux-from-a-bundle.html

> The tooling to let this kind of "bundle" (and possibly other forms of
> "CDN offload" material) be transparently used by "git clone" was the
> proposal by Shawn Pearce mentioned elsewhere in this thread.

To reiterate, I believe that would be an awesome feature.

Regards,
--
Konstantin Ryabitsev
Linux Foundation Collab Projects
Montréal, Québec

* Re: Resumable git clone?
  From: Josh Triplett @ 2016-03-02 16:49 UTC
  To: Junio C Hamano; +Cc: Konstantin Ryabitsev, git, sarah, viro

On Wed, Mar 02, 2016 at 12:41:20AM -0800, Junio C Hamano wrote:
> For this particular issue, your friendly k.org administrator already
> has a solution.  [...]
>
> The tooling to let this kind of "bundle" (and possibly other forms of
> "CDN offload" material) be transparently used by "git clone" was the
> proposal by Shawn Pearce mentioned elsewhere in this thread.

That does help in the case of cloning torvalds/linux.git from
kernel.org, and I'd love to see it used transparently.

However, even with that, I still also see value in a resumable git
clone (or git pull) for many other repositories elsewhere, with a
somewhat lower pull-to-push ratio than kernel.org.  Supporting
resumption based on objects, without the repository needing to
generate and keep around a bundle, seems preferable for such
repositories.

* Re: Resumable git clone?
  From: Junio C Hamano @ 2016-03-02 17:57 UTC
  To: Josh Triplett; +Cc: Konstantin Ryabitsev, git, sarah, viro

Josh Triplett <josh@joshtriplett.org> writes:

> That does help in the case of cloning torvalds/linux.git from
> kernel.org, and I'd love to see it used transparently.
>
> However, even with that, I still also see value in a resumable git
> clone (or git pull) for many other repositories elsewhere, [...]

By "transparently" the statement you are responding to meant many
things.  "git clone" of course needs to be updated on the client side,
but things like "git repack" that are run on the server end may start
producing extra files in the repository, and an updated "git daemon"
and/or "git upload-pack" would take those extra files as a signal that
the material produced during the last repack is usable for
bootstrapping a new clone with a "wget -c" equivalent.

So even if you are not yet automatically offloading to a CDN, such a
set of updates on the server side would "transparently" enable
resumable clone for all repositories elsewhere, once deployed and
enabled ;-)

* Re: Resumable git clone?
  From: Philip Oakley @ 2016-03-24  8:00 UTC
  To: Junio C Hamano, Josh Triplett, Konstantin Ryabitsev
  Cc: Git List, sarah, viro

From: "Junio C Hamano" <gitster@pobox.com>
Sent: Wednesday, March 02, 2016 8:41 AM
> For this particular issue, your friendly k.org administrator already
> has a solution.  torvalds/linux.git is made into a bundle weekly with
>
>     $ git bundle create clone.bundle --all

Isn't this use of '--all' a bit of oversharing?

I had proposed a doc patch to the bundle manpage way back (see
$gmane/205897) to give the user that example, but it wasn't accepted,
as it was thought wrong:

  "I also think "--all" is bad advice for another reason.  Doesn't it
  shove refs from the refs/remotes/* hierarchy into the resulting
  bundle?  It is fine for archiving purposes, but it does not seem to
  be good advice for creating a bundle to clone from."

Perhaps the '--clone-bundle' (or maybe '--bundle-clone') option from
$gmane/288222 ("[PATCH] index-pack: --clone-bundle option",
2016-03-03) may be a suitable new <rev-list-arg> to get just the right
content?

> and the result is placed on the k.org CDN.  So low-bandwidth cloners
> can grab it over resumable http, clone from the bundle, and then fill
> in the most recent part by fetching from k.org.
>
> The tooling to let this kind of "bundle" (and possibly other forms of
> "CDN offload" material) be transparently used by "git clone" was the
> proposal by Shawn Pearce mentioned elsewhere in this thread.

* Re: Resumable git clone?
  From: Junio C Hamano @ 2016-03-24 15:53 UTC
  To: Philip Oakley
  Cc: Josh Triplett, Konstantin Ryabitsev, Git List, sarah, viro

"Philip Oakley" <philipoakley@iee.org> writes:

> From: "Junio C Hamano" <gitster@pobox.com>
>> For this particular issue, your friendly k.org administrator already
>> has a solution.  torvalds/linux.git is made into a bundle weekly with
>>
>>     $ git bundle create clone.bundle --all
>>
>> and the result is placed on the k.org CDN.  So low-bandwidth cloners
>> can grab it over resumable http, clone from the bundle, and then fill
>> in the most recent part by fetching from k.org.
>
> Isn't this use of '--all' a bit of oversharing?

Not for the exact use case mentioned; the k.org administrator knows
what is in Linus's repository and is aware that there are no
remote-tracking branches or secret branches that would make the
resulting bundle unsuitable for priming a clone.

> "I also think "--all" is bad advice for another reason.

I do not think it is good advice for everybody, but the thing is,
what you are responding to is not advice.  It is just a statement of
fact, what is already done--one of the existing practices that an
approach to "resumable clone" may want to help.

* Re: Resumable git clone?
  From: Philip Oakley @ 2016-03-24 21:08 UTC
  To: Junio C Hamano
  Cc: Josh Triplett, Konstantin Ryabitsev, Git List, sarah, viro

From: "Junio C Hamano" <gitster@pobox.com>
> Not for the exact use case mentioned; the k.org administrator knows
> what is in Linus's repository and is aware that there are no
> remote-tracking branches or secret branches that would make the
> resulting bundle unsuitable for priming a clone.

OK.

> I do not think it is good advice for everybody, but the thing is,
> what you are responding to is not advice.  It is just a statement of
> fact, what is already done--one of the existing practices that an
> approach to "resumable clone" may want to help.

I was picking up on the need, for others who may be generating clone
bundles, to note that '--all' may not be the right thing for them, and
that somewhere we should record whatever is deemed the equivalent of
the current clone command.  This would get away from the web examples
which show '--all' as a quick solution for bundling (I'm one of the
offenders there).

If I understand the clone code, the rev-list args would be "HEAD
--branches".  But I could well be wrong.
--
Philip

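(In other words, something like the following, with the caveat above
that the exact rev-list arguments are an educated guess:

    $ git bundle create clone.bundle HEAD --branches
)
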