* Resumable git clone?
  From: Josh Triplett @ 2016-03-02  1:30 UTC
  To: git; +Cc: sarah, viro

If you clone a repository, and the connection drops, the next attempt
will have to start from scratch.  This can add significant time and
expense if you're on a low-bandwidth or metered connection trying to
clone something like Linux.

Would it be possible to make git clone resumable after a partial clone?
(And, ideally, to make that the default?)

In a discussion elsewhere, Al Viro suggested taking the partial pack
received so far, repairing any truncation, indexing the objects it
contains, and then re-running clone and not having to fetch those
objects.  This may also require extending receive-pack's protocol for
determining objects the recipient already has, as the partial pack may
not have a consistent set of reachable objects.

Before starting down the path of developing patches for this, does the
approach seem potentially reasonable?

- Josh Triplett

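(A minimal sketch of that salvage step using plumbing that exists
today; "partial.pack" is a hypothetical name for whatever the
truncated transfer left behind:

    $ git unpack-objects -r < partial.pack   # -r: keep going past the
                                             # corruption, recovering as
                                             # many objects as possible
    $ git fsck                               # see what was salvaged

A re-run clone would then need a protocol extension, as discussed
below, to avoid re-fetching the salvaged, now-loose objects.)
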
* Re: Resumable git clone?
  From: Stefan Beller @ 2016-03-02  1:40 UTC
  To: Josh Triplett, Duy Nguyen; +Cc: git@vger.kernel.org, sarah, viro

+ Duy, who tried resumable clone a few days/weeks ago

On Tue, Mar 1, 2016 at 5:30 PM, Josh Triplett <josh@joshtriplett.org> wrote:
> If you clone a repository, and the connection drops, the next attempt
> will have to start from scratch.  This can add significant time and
> expense if you're on a low-bandwidth or metered connection trying to
> clone something like Linux.
>
> Would it be possible to make git clone resumable after a partial clone?
> (And, ideally, to make that the default?)
>
> In a discussion elsewhere, Al Viro suggested taking the partial pack
> received so far,

ok,

> repairing any truncation,

So throwing away half-finished stuff while keeping the front load?

> indexing the objects it
> contains, and then re-running clone and not having to fetch those
> objects.

The pack is not deterministic for a given repository.  When creating
the pack, you may encounter races between threads, such that the order
in a pack differs.

> This may also require extending receive-pack's protocol for
> determining objects the recipient already has, as the partial pack may
> not have a consistent set of reachable objects.
>
> Before starting down the path of developing patches for this, does the
> approach seem potentially reasonable?

That sounds reasonable on a high level, but I'd expect it to blow up
in complexity, either in the receive-pack protocol or in the code that
has to handle the partial state.

Thanks,
Stefan

* Re: Resumable git clone?
  From: Al Viro @ 2016-03-02  2:30 UTC
  To: Stefan Beller; +Cc: Josh Triplett, Duy Nguyen, git@vger.kernel.org, sarah

On Tue, Mar 01, 2016 at 05:40:28PM -0800, Stefan Beller wrote:

> So throwing away half-finished stuff while keeping the front load?

Throw away the object that got truncated, and the ones whose delta
chains don't resolve entirely within the transferred part.

> > indexing the objects it
> > contains, and then re-running clone and not having to fetch those
> > objects.
>
> The pack is not deterministic for a given repository.  When creating
> the pack, you may encounter races between threads, such that the order
> in a pack differs.

FWIW, I wasn't proposing to recreate the remaining bits of that _pack_;
just do the normal pull with one addition: start with sending the list
of sha1 of objects you are about to send and let the recipient reply
with "I already have <set of sha1>, don't bother with those".  And
exclude those from the transfer.  The encoding for "the set I already
have" is an interesting variable here -- it might be a plain list of
sha1, might be its complement ("I want the following subset"), might be
"145th to 1029th, 1517th and 1890th to 1920th of the list you've sent";
which form ends up more efficient needs to be found experimentally...

IIRC, the objection had been that the organisation of the pack will
lead to many cases where deltas are transferred *first*, with the base
object not getting there prior to disconnect.  I suspect that the
fraction of the objects getting through would still be worth it, but I
haven't experimented enough to be able to tell...  I was more
interested in resumable _pull_, with a restarted clone treated as a
special case of that.

* Re: Resumable git clone?
  From: Junio C Hamano @ 2016-03-02  6:31 UTC
  To: Al Viro
  Cc: Stefan Beller, Josh Triplett, Duy Nguyen, git@vger.kernel.org, sarah

Al Viro <viro@ZenIV.linux.org.uk> writes:

> FWIW, I wasn't proposing to recreate the remaining bits of that _pack_;
> just do the normal pull with one addition: start with sending the list
> of sha1 of objects you are about to send and let the recipient reply
> with "I already have <set of sha1>, don't bother with those".  And
> exclude those from the transfer.

I did a quick-and-dirty unscientific experiment.

I had a clone of Linus's repository that was about a week old, whose
tip was at 4de8ebef (Merge tag 'trace-fixes-v4.5-rc5' of
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace,
2016-02-22).  To bring it up to date (i.e. pull about a week's worth
of progress) to f691b77b (Merge branch 'for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs, 2016-03-01):

    $ git rev-list --objects 4de8ebef..f691b77b1fc | wc -l
    1396
    $ git rev-parse 4de8ebef..f691b77b1fc |
      git pack-objects --revs --delta-base-offset --stdout |
      wc -c
    2444127

So in order to salvage some transfer out of 2.4MB, the hypothetical
Al protocol would first have upload-pack give 20*1396 = 28kB of
object names to fetch-pack; no matter how fetch-pack encodes its
preference, its answer would be less than 28kB.  We would likely
design this part of the new protocol in line with the existing part
and use textual object names, so let's round that up to 100kB.  That
is quite small; even if you are on a crappy connection and need to
retry 5 times, the additional overhead to negotiate the list of
objects alone would be 0.5MB (or less than 20% of the real transfer).
That is quite interesting [*1*].

For the approach to be practical, you would have to write a program
that reads from a truncated packfile and writes a new packfile,
excising deltas that lack their bases, to salvage objects from a
half-transferred packfile; it is however unclear how involved that
code would get.  It is probably OK for a tiny pack that has only 1400
objects--we could just pass the early part through unpack-objects and
let it die when it hits EOF--but for a "resumable clone", I do not
think you can afford to unpack the 4.6M objects in the kernel
repository into loose objects.

The approach of course requires the server end to spend 5 times as
many cycles as usual in order to help a client that retries 5 times.
On the other hand, the resumable "clone" we were discussing--allowing
the server to respond with a slightly older bundle or pack and then
asking the client to fill in the latest bits with a follow-up
fetch--aims to reduce the load on the server side (the "slightly
older" part can be offloaded to a CDN).  It is a happy side effect
that material offloaded to a CDN can be obtained more easily over
HTTPS, which is trivially resumable ;-)

I think your "I've got these already" extension may be worth trying,
and it is definitely better than "let's make sure the server end
creates a byte-for-byte identical pack stream, and discard the early
part without sending it to the network", and it may help resuming a
small incremental fetch, but I do not think it is advisable to use it
for a full clone, given that it is very likely that we will be adding
the "offload 'clone' to CDN" kind.  Even though I can foresee both
kinds co-existing, I do not think it is practical to offer it for
resuming a multi-hour clone of the kernel repository (or worse, the
Android repositories) over a trans-Pacific link, for example.

[Footnote]

*1* To update v4.5-rc1 to today's HEAD involves 10809 objects, and
    the pack data takes 14955728 bytes.  That translates to ~440kB
    needed to advertise a list of textual object names to salvage an
    object transfer of 15MB.

* Re: Resumable git clone?
  From: Duy Nguyen @ 2016-03-02  7:37 UTC
  To: Junio C Hamano
  Cc: Al Viro, Stefan Beller, Josh Triplett, git@vger.kernel.org, sarah

On Wed, Mar 2, 2016 at 1:31 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Al Viro <viro@ZenIV.linux.org.uk> writes:
>
>> FWIW, I wasn't proposing to recreate the remaining bits of that _pack_;
>> just do the normal pull with one addition: start with sending the list
>> of sha1 of objects you are about to send and let the recipient reply
>> with "I already have <set of sha1>, don't bother with those".  And
>> exclude those from the transfer.
>
> I did a quick-and-dirty unscientific experiment.
>
> I had a clone of Linus's repository that was about a week old, whose
> tip was at 4de8ebef (Merge tag 'trace-fixes-v4.5-rc5' of
> git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace,
> 2016-02-22).  To bring it up to date (i.e. pull about a week's worth
> of progress) to f691b77b (Merge branch 'for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs, 2016-03-01):
>
>     $ git rev-list --objects 4de8ebef..f691b77b1fc | wc -l
>     1396
>     $ git rev-parse 4de8ebef..f691b77b1fc |
>       git pack-objects --revs --delta-base-offset --stdout |
>       wc -c
>     2444127
>
> So in order to salvage some transfer out of 2.4MB, the hypothetical
> Al protocol would first have upload-pack give 20*1396 = 28kB of

It could be 10*1396 or less.  If the server calculates the shortest
unambiguous SHA-1 length (quite cheap on a fully packed repo) and sends
it to the client, the client can just send short SHA-1s instead.  It's
racy, though, because objects are being added to the server and the
abbreviation length may go up.  But we can check ambiguity for all the
SHA-1s sent by the client and ask for a resend of the ambiguous ones.

On my linux-2.6.git, 10 hex digits (so 5 bytes) are needed for an
unambiguous short SHA-1.  But we could even go optimistic and ask the
client for shorter SHA-1s, hoping that the resends won't be many.

> object names to fetch-pack; no matter how fetch-pack encodes its
> preference, its answer would be less than 28kB.  We would likely
> design this part of the new protocol in line with the existing part
> and use textual object names, so let's round that up to 100kB.
--
Duy

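(A rough way to measure that shortest unambiguous length from the
shell -- a sketch, assuming a git recent enough for cat-file's
--batch-all-objects: sort all object names and find the longest common
prefix between neighbours; one more hex digit than that is enough:

    $ git cat-file --batch-all-objects --batch-check='%(objectname)' |
      sort |
      awk 'NR > 1 { n = 0
                    while (substr($0, n + 1, 1) == substr(prev, n + 1, 1)) n++
                    if (n > max) max = n }
           { prev = $0 }
           END { print max + 1 }'

On a fully packed repository this is cheap, as Duy says.)
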
* Re: Resumable git clone?
  From: Duy Nguyen @ 2016-03-02  7:44 UTC
  To: Junio C Hamano
  Cc: Al Viro, Stefan Beller, Josh Triplett, git@vger.kernel.org, sarah

On Wed, Mar 2, 2016 at 2:37 PM, Duy Nguyen <pclouds@gmail.com> wrote:
>> So in order to salvage some transfer out of 2.4MB, the hypothetical
>> Al protocol would first have upload-pack give 20*1396 = 28kB of
>
> It could be 10*1396 or less....

Oops, somehow I read the previous mails as the client sending SHA-1s
to the server, not the other way around as you and Al were discussing.
But the same principle applies in the other direction, I think.
--
Duy

* Re: Resumable git clone?
  From: Josh Triplett @ 2016-03-02  7:54 UTC
  To: Duy Nguyen
  Cc: Junio C Hamano, Al Viro, Stefan Beller, git@vger.kernel.org, sarah

On Wed, Mar 02, 2016 at 02:37:53PM +0700, Duy Nguyen wrote:
> On Wed, Mar 2, 2016 at 1:31 PM, Junio C Hamano <gitster@pobox.com> wrote:
> > So in order to salvage some transfer out of 2.4MB, the hypothetical
> > Al protocol would first have upload-pack give 20*1396 = 28kB of
>
> It could be 10*1396 or less.  If the server calculates the shortest
> unambiguous SHA-1 length (quite cheap on a fully packed repo) and sends
> it to the client, the client can just send short SHA-1s instead.  It's
> racy, though, because objects are being added to the server and the
> abbreviation length may go up.  But we can check ambiguity for all the
> SHA-1s sent by the client and ask for a resend of the ambiguous ones.
>
> On my linux-2.6.git, 10 hex digits (so 5 bytes) are needed for an
> unambiguous short SHA-1.  But we could even go optimistic and ask the
> client for shorter SHA-1s, hoping that the resends won't be many.

I don't think it's worth the trouble and ambiguity to send abbreviated
object names over the wire.  I think several simpler optimizations seem
preferable, such as binary object names, and abbreviating complete
object sets ("I have these commits/trees and everything they need
recursively; I also have this stack of random objects").  That would
work especially well for resumable pull, or for the case of optimizing
pull during the merge window.

- Josh Triplett

* Re: Resumable git clone?
  From: Junio C Hamano @ 2016-03-02  8:31 UTC
  To: Josh Triplett
  Cc: Duy Nguyen, Al Viro, Stefan Beller, git@vger.kernel.org, sarah

Josh Triplett <josh@joshtriplett.org> writes:

> I don't think it's worth the trouble and ambiguity to send abbreviated
> object names over the wire.

Yup.  My unscientific experiment was to show that the list would be
far smaller than the actual transfer, and that between full binary and
full textual object name representations there would not be much
meaningful difference--you seem to have a better design sense and
grasped that point ;-)

> I think several simpler optimizations seem
> preferable, such as binary object names, and abbreviating complete
> object sets ("I have these commits/trees and everything they need
> recursively; I also have this stack of random objects").

Given the way a pack stream is organized (i.e. commits first, and then
trees and blobs that belong to the same delta chain together), and our
assumed goal being to salvage objects from an interrupted transfer of
a packfile, you are unlikely to ever see "I have these commits/trees
and everything they need" salvaged from such a failed transfer.  So I
doubt such an optimization is worth doing.  Besides, it is very
expensive to compute (the computation is done on the client side, so
the cycles burned and the time the user has to wait are of much less
concern, though); you'd essentially be doing "git fsck" to find the
"dangling" objects.

The list of what would be transferred needs to come in full from the
server end, as the list names objects that the receiving end may not
have seen, but the response by the client could be encoded much more
tightly.  For the full list of N objects from the server, we can think
of your response as a bitstream of N bits, each on-bit in which
signals an unwanted object in the list.  You can optimize this
transfer by RLE-compressing the bitstream, for example.

As git-over-HTTP is stateless, however, you cannot assume that the
server side remembers what it sent to the client (instead, the client
side needs to re-post what it heard from the server in the previous
exchange to allow the server side to use it after validating).  So an
"objects at these indices in your list" kind of optimization may not
work very well in that environment.  I'd imagine that an exchange of
"Here is the list of objects", "Give me these objects" done naively in
full 40-hex object names would work OK there, though.

* Re: Resumable git clone?
  From: Duy Nguyen @ 2016-03-02  9:28 UTC
  To: Junio C Hamano
  Cc: Josh Triplett, Al Viro, Stefan Beller, git@vger.kernel.org, sarah

On Wed, Mar 2, 2016 at 3:31 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Josh Triplett <josh@joshtriplett.org> writes:
>
>> I don't think it's worth the trouble and ambiguity to send abbreviated
>> object names over the wire.
>
> Yup.  My unscientific experiment was to show that the list would be
> far smaller than the actual transfer, and that between full binary and
> full textual object name representations there would not be much
> meaningful difference--you seem to have a better design sense and
> grasped that point ;-)

It may matter, depending on your target users.  In order for a
fetch/pull to make progress, I need to get at least one object before
my connection goes down.  Picking a random blob in the "large file"
range in linux-2.6: fs/nls/nls_cp950.c, about 500kB.  Assume the worst
case, that the blob is transferred gzipped and not deltified: that's
about 100kB.  Assume also that I'm a lazy Linux lurker who only
fetches after every release; the rev-list output between v4.2 and v4.3
is 6MB.  Even if we transfer this list over http with compression, the
list is 2.9MB, way bigger than that one blob transfer--which raises
the bar for my fetch to make any progress at all.
--
Duy

* Re: Resumable git clone?
  From: Josh Triplett @ 2016-03-02 16:41 UTC
  To: Junio C Hamano
  Cc: Duy Nguyen, Al Viro, Stefan Beller, git@vger.kernel.org, sarah

On Wed, Mar 02, 2016 at 12:31:16AM -0800, Junio C Hamano wrote:
> Josh Triplett <josh@joshtriplett.org> writes:
> > I think several simpler optimizations seem
> > preferable, such as binary object names, and abbreviating complete
> > object sets ("I have these commits/trees and everything they need
> > recursively; I also have this stack of random objects").
>
> Given the way a pack stream is organized (i.e. commits first, and then
> trees and blobs that belong to the same delta chain together), and our
> assumed goal being to salvage objects from an interrupted transfer of
> a packfile, you are unlikely to ever see "I have these commits/trees
> and everything they need" salvaged from such a failed transfer.  So I
> doubt such an optimization is worth doing.

True for the resumable clone case.  For that optimization, I was
thinking of the "pull during the merge window" case that Al Viro was
also interested in optimizing.

> Besides, it is very expensive to compute (the computation is done on
> the client side, so the cycles burned and the time the user has to
> wait are of much less concern, though); you'd essentially be doing
> "git fsck" to find the "dangling" objects.

Trading client-side computation for bandwidth can potentially be
worthwhile if you have plenty of local compute but a slow and metered
link.

> The list of what would be transferred needs to come in full from the
> server end, as the list names objects that the receiving end may not
> have seen, but the response by the client could be encoded much more
> tightly.  For the full list of N objects from the server, we can think
> of your response as a bitstream of N bits, each on-bit in which
> signals an unwanted object in the list.  You can optimize this
> transfer by RLE-compressing the bitstream, for example.
>
> As git-over-HTTP is stateless, however, you cannot assume that the
> server side remembers what it sent to the client (instead, the client
> side needs to re-post what it heard from the server in the previous
> exchange to allow the server side to use it after validating).  So an
> "objects at these indices in your list" kind of optimization may not
> work very well in that environment.  I'd imagine that an exchange of
> "Here is the list of objects", "Give me these objects" done naively in
> full 40-hex object names would work OK there, though.

Good point.  Between statelessness and Duy's point about the client
list usually being smaller than the server list, perhaps it would make
sense to not have the server send a list at all, and just have the
client send its own list.

- Josh Triplett

* Re: Resumable git clone?
  From: Josh Triplett @ 2016-03-02  8:13 UTC
  To: Al Viro; +Cc: Stefan Beller, Duy Nguyen, git@vger.kernel.org, sarah

On Wed, Mar 02, 2016 at 02:30:24AM +0000, Al Viro wrote:
> On Tue, Mar 01, 2016 at 05:40:28PM -0800, Stefan Beller wrote:
>
> > So throwing away half-finished stuff while keeping the front load?
>
> Throw away the object that got truncated, and the ones whose delta
> chains don't resolve entirely within the transferred part.
>
> FWIW, I wasn't proposing to recreate the remaining bits of that _pack_;
> just do the normal pull with one addition: start with sending the list
> of sha1 of objects you are about to send and let the recipient reply
> with "I already have <set of sha1>, don't bother with those".  And
> exclude those from the transfer.  The encoding for "the set I already
> have" is an interesting variable here -- it might be a plain list of
> sha1, might be its complement ("I want the following subset"), might be
> "145th to 1029th, 1517th and 1890th to 1920th of the list you've sent";
> which form ends up more efficient needs to be found experimentally...

As a simple proposal, the server could send the list of hashes (in
approximately the same order it would send the pack), the client could
send back a bitmap where '0' means "send it" and '1' means "got that
one already", and the client could compress that bitmap.  That gives
you the RLE and similar without having to write it yourself.  That
might not be optimal, but it would likely set a high bar with minimal
effort.

One debatable optimization on top of that would rely on git object
structure to imply object hashes without sending them: the message
from the server could have a list of commit/tree hashes that imply
sending all objects reachable from those, without having to send all
the implied hashes.  However, that would then make the message back
from the client about what it already has larger and more complicated;
that might not make it worthwhile.

This seems like a good case for doing the simplest possible thing
first (complete hash list, compressed "got it already" bitmap), seeing
how much benefit that provides, and creating a v2 protocol if some
additional optimization proves sufficiently worthwhile.

- Josh Triplett

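(Illustrative only: mock up such a bitmap for a 4.6M-object clone that
died at 95%, one ASCII digit per object, and see what an off-the-shelf
compressor makes of it:

    $ { head -c 4370000 /dev/zero | tr '\0' '1'
        head -c  230000 /dev/zero | tr '\0' '0'
      } | gzip -9 | wc -c

Long runs like this deflate down to a few kilobytes, which is the
point: RLE-like behaviour comes for free, without a hand-written
encoding.)
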
* Re: Resumable git clone?
  From: Duy Nguyen @ 2016-03-02  8:22 UTC
  To: Josh Triplett; +Cc: Al Viro, Stefan Beller, git@vger.kernel.org, sarah

On Wed, Mar 2, 2016 at 3:13 PM, Josh Triplett <josh@joshtriplett.org> wrote:
> As a simple proposal, the server could send the list of hashes (in
> approximately the same order it would send the pack), the client could
> send back a bitmap where '0' means "send it" and '1' means "got that
> one already", and the client could compress that bitmap.  That gives
> you the RLE and similar without having to write it yourself.  That
> might not be optimal, but it would likely set a high bar with minimal
> effort.

We have an implementation of EWAH bitmap compression, so compressing
is not a problem.

But I still don't see why it's more efficient to have the server send
the hash list to the client.  Assume you need to transfer N objects.
That direction makes you always send N hashes.  But if the client
sends the list of already-fetched objects, M, then M <= N.  And we
won't need to send the bitmap.  What did I miss?
--
Duy

* Re: Resumable git clone?
  From: Jeff King @ 2016-03-02  8:32 UTC
  To: Duy Nguyen
  Cc: Josh Triplett, Al Viro, Stefan Beller, git@vger.kernel.org, sarah

On Wed, Mar 02, 2016 at 03:22:17PM +0700, Duy Nguyen wrote:
> We have an implementation of EWAH bitmap compression, so compressing
> is not a problem.
>
> But I still don't see why it's more efficient to have the server send
> the hash list to the client.  Assume you need to transfer N objects.
> That direction makes you always send N hashes.  But if the client
> sends the list of already-fetched objects, M, then M <= N.  And we
> won't need to send the bitmap.  What did I miss?

Right, I don't see what the point is in compressing the bitmap.  The
sha1 list for a clone of linux.git is 87 megabytes.  The return
bitmap, even naively, is 500K.  Unless you are trying to optimize for
wildly asymmetric links.

If the client just naively sends "here's what I have", then we know it
can never be _more_ than 87 megabytes.  And as a bonus, the longer the
list is, the more we are saving (so at the moment you are sending
82MB, it's really worth it, because you do have 95% of the pack, which
is worth amortizing).

I'm still a little dubious that anything involving "send all the
hashes" is going to be useful in practice, especially for something
like the kernel (where you have a huge number of small objects that
delta well).  It would work better when you have gigantic objects that
don't delta (so the cost of a sha1 versus the object size is way
better), but then I think we'd do better to transfer all of the
normal-sized bits up front, and then allow fetching the large stuff
separately.

-Peff

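(The arithmetic behind those figures, using the 4.6M-object count
Junio quoted earlier in the thread; sizes are approximate:

    $ echo $((4600000 * 20 / 1024 / 1024))  # binary sha1 list, in MB: 87
    $ echo $((4600000 / 8 / 1024))          # 1-bit-per-object bitmap, in kB: 561
)
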
* Re: Resumable git clone?
  From: Bhavik Bavishi @ 2016-03-02 10:47 UTC
  To: git; +Cc: Josh Triplett, Al Viro, Stefan Beller, git@vger.kernel.org, sarah

On 3/2/16 2:02 PM, Jeff King wrote:
> I'm still a little dubious that anything involving "send all the
> hashes" is going to be useful in practice, especially for something
> like the kernel (where you have a huge number of small objects that
> delta well).  It would work better when you have gigantic objects that
> don't delta (so the cost of a sha1 versus the object size is way
> better), but then I think we'd do better to transfer all of the
> normal-sized bits up front, and then allow fetching the large stuff
> separately.

What if we had an object-lookup database storing, for each object, its
SHA-1, type, parent (if any), and size--the entire hierarchy tree, but
without payload data such as commit messages or tag names?  This
implementation would duplicate some existing information.  At initial
clone time the server sends the object-lookup database to the client;
the client then reads it and asks the server for objects by SHA-1,
possibly fetching in parallel.  The process may not be
transfer-efficient, but it is resumable, as the client knows what has
been synced, what remains, and which SHA-1 refers to which object
type.

* Re: Resumable git clone?
  From: Josh Triplett @ 2016-03-02 16:40 UTC
  To: Duy Nguyen; +Cc: Al Viro, Stefan Beller, git@vger.kernel.org, sarah

On Wed, Mar 02, 2016 at 03:22:17PM +0700, Duy Nguyen wrote:
> But I still don't see why it's more efficient to have the server send
> the hash list to the client.  Assume you need to transfer N objects.
> That direction makes you always send N hashes.  But if the client
> sends the list of already-fetched objects, M, then M <= N.  And we
> won't need to send the bitmap.  What did I miss?

M can potentially be larger than N if you have many remotes and
branches in your local repository that the server doesn't have.
However, that certainly wouldn't be the common case, and in that case
heuristics on the client side could help in determining a subset to
send.

I can't think of any good argument for the server's hash list; a
client-sent list does seem reasonable.

- Josh Triplett

* Re: Resumable git clone?
  From: Duy Nguyen @ 2016-03-02  8:14 UTC
  To: Al Viro; +Cc: Stefan Beller, Josh Triplett, git@vger.kernel.org, sarah

On Wed, Mar 2, 2016 at 9:30 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> IIRC, the objection had been that the organisation of the pack will
> lead to many cases where deltas are transferred *first*, with the base
> object not getting there prior to disconnect.  I suspect that the
> fraction of the objects getting through would still be worth it, but I
> haven't experimented enough to be able to tell...

No.  If deltas refer to base objects by offset, the (unsigned) offset
is negated before use, so base objects must always be sent first.  If
deltas refer to base objects by full SHA-1, then base objects can in
theory appear anywhere in the pack.  But I think we only use full
SHA-1 references for out-of-thin-pack objects, never for an existing
object in the same pack.
--
Duy

* Re: Resumable git clone?
  From: Duy Nguyen @ 2016-03-02  1:45 UTC
  To: Josh Triplett; +Cc: Git Mailing List, sarah, viro

On Wed, Mar 2, 2016 at 8:30 AM, Josh Triplett <josh@joshtriplett.org> wrote:
> Would it be possible to make git clone resumable after a partial clone?
> (And, ideally, to make that the default?)
>
> Before starting down the path of developing patches for this, does the
> approach seem potentially reasonable?

This topic came up recently (thanks Sarah!) and Shawn proposed a
different approach [1] that (I think) is simpler and more effective
for the resumable _clone_ case.  I'm not sure if anybody is
implementing it, though.

[1] http://thread.gmane.org/gmane.comp.version-control.git/285921
--
Duy

* Re: Resumable git clone?
  From: Junio C Hamano @ 2016-03-02  8:41 UTC
  To: Josh Triplett, Konstantin Ryabitsev; +Cc: git, sarah, viro

Josh Triplett <josh@joshtriplett.org> writes:

> If you clone a repository, and the connection drops, the next attempt
> will have to start from scratch.  This can add significant time and
> expense if you're on a low-bandwidth or metered connection trying to
> clone something like Linux.

For this particular issue, your friendly k.org administrator already
has a solution.  torvalds/linux.git is made into a bundle weekly with

    $ git bundle create clone.bundle --all

and the result is placed on the k.org CDN.  So low-bandwidth cloners
can grab it over resumable http, clone from the bundle, and then fill
in the most recent part by fetching from k.org.

The tooling to let this kind of "bundle" (and possibly other forms of
"CDN offload" material) be transparently used by "git clone" was the
proposal by Shawn Pearce mentioned elsewhere in this thread.

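(The low-bandwidth workflow this enables looks roughly like the
following; the bundle URL here is an illustrative placeholder, not the
real location -- see Konstantin's write-up below for the actual
procedure:

    $ wget -c https://cdn.kernel.org/.../clone.bundle   # -c: resume a
                                                        # partial download
    $ git clone clone.bundle linux
    $ cd linux
    $ git remote set-url origin \
        https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
    $ git fetch origin                  # fill in the most recent part
)
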
* Re: Resumable git clone?
  From: Konstantin Ryabitsev @ 2016-03-02 15:51 UTC
  To: Junio C Hamano; +Cc: Josh Triplett, git, sarah, viro

On Wed, Mar 02, 2016 at 12:41:20AM -0800, Junio C Hamano wrote:
> For this particular issue, your friendly k.org administrator already
> has a solution.  torvalds/linux.git is made into a bundle weekly with
>
>     $ git bundle create clone.bundle --all
>
> and the result is placed on the k.org CDN.  So low-bandwidth cloners
> can grab it over resumable http, clone from the bundle, and then fill
> in the most recent part by fetching from k.org.

I finally got around to documenting this here:
https://kernel.org/cloning-linux-from-a-bundle.html

> The tooling to let this kind of "bundle" (and possibly other forms of
> "CDN offload" material) be transparently used by "git clone" was the
> proposal by Shawn Pearce mentioned elsewhere in this thread.

To reiterate, I believe that would be an awesome feature.

Regards,
--
Konstantin Ryabitsev
Linux Foundation Collab Projects
Montréal, Québec

* Re: Resumable git clone?
  From: Josh Triplett @ 2016-03-02 16:49 UTC
  To: Junio C Hamano; +Cc: Konstantin Ryabitsev, git, sarah, viro

On Wed, Mar 02, 2016 at 12:41:20AM -0800, Junio C Hamano wrote:
> For this particular issue, your friendly k.org administrator already
> has a solution.  [...]
>
> The tooling to let this kind of "bundle" (and possibly other forms of
> "CDN offload" material) be transparently used by "git clone" was the
> proposal by Shawn Pearce mentioned elsewhere in this thread.

That does help in the case of cloning torvalds/linux.git from
kernel.org, and I'd love to see it used transparently.

However, even with that, I still also see value in a resumable git
clone (or git pull) for many other repositories elsewhere, with a
somewhat lower pull-to-push ratio than kernel.org.  Supporting
resumption based on objects, without the repository needing to
generate and keep around a bundle, seems preferable for such
repositories.

* Re: Resumable git clone?
  From: Junio C Hamano @ 2016-03-02 17:57 UTC
  To: Josh Triplett; +Cc: Konstantin Ryabitsev, git, sarah, viro

Josh Triplett <josh@joshtriplett.org> writes:

> That does help in the case of cloning torvalds/linux.git from
> kernel.org, and I'd love to see it used transparently.
>
> However, even with that, I still also see value in a resumable git
> clone (or git pull) for many other repositories elsewhere, [...]

By "transparently" the statement you are responding to meant many
things.  "git clone" of course needs to be updated on the client side,
but things like "git repack" that are run on the server end may start
producing extra files in the repository, and an updated "git daemon"
and/or "git upload-pack" would take those extra files as a signal that
the material produced during the last repack is usable for
bootstrapping a new clone with a "wget -c" equivalent.

So even if you are not yet automatically offloading to a CDN, such a
set of updates on the server side would "transparently" enable
resumable clone for all repositories elsewhere, once deployed and
enabled ;-)

* Re: Resumable git clone?
  From: Philip Oakley @ 2016-03-24  8:00 UTC
  To: Junio C Hamano, Josh Triplett, Konstantin Ryabitsev
  Cc: Git List, sarah, viro

From: "Junio C Hamano" <gitster@pobox.com>
Sent: Wednesday, March 02, 2016 8:41 AM
> For this particular issue, your friendly k.org administrator already
> has a solution.  torvalds/linux.git is made into a bundle weekly with
>
>     $ git bundle create clone.bundle --all

Isn't this use of '--all' a bit of oversharing?

I had proposed a doc patch to the bundle manpage way back (see
$gmane/205897) to give the user that example, but it wasn't accepted,
as it was thought wrong:

  "I also think "--all" is bad advice for another reason.  Doesn't it
  shove refs from the refs/remotes/* hierarchy into the resulting
  bundle?  It is fine for archiving purposes, but it does not seem to
  be good advice for creating a bundle to clone from."

Perhaps the '--clone-bundle' (or maybe '--bundle-clone') option from
$gmane/288222 ("[PATCH] index-pack: --clone-bundle option",
2016-03-03) may be a suitable new <rev-list-arg> to get just the right
content?

> and the result is placed on the k.org CDN.  So low-bandwidth cloners
> can grab it over resumable http, clone from the bundle, and then fill
> in the most recent part by fetching from k.org.
>
> The tooling to let this kind of "bundle" (and possibly other forms of
> "CDN offload" material) be transparently used by "git clone" was the
> proposal by Shawn Pearce mentioned elsewhere in this thread.

* Re: Resumable git clone?
  From: Junio C Hamano @ 2016-03-24 15:53 UTC
  To: Philip Oakley
  Cc: Josh Triplett, Konstantin Ryabitsev, Git List, sarah, viro

"Philip Oakley" <philipoakley@iee.org> writes:

> From: "Junio C Hamano" <gitster@pobox.com>
>> For this particular issue, your friendly k.org administrator already
>> has a solution.  torvalds/linux.git is made into a bundle weekly with
>>
>>     $ git bundle create clone.bundle --all
>>
>> and the result is placed on the k.org CDN.  So low-bandwidth cloners
>> can grab it over resumable http, clone from the bundle, and then fill
>> in the most recent part by fetching from k.org.
>
> Isn't this use of '--all' a bit of oversharing?

Not for the exact use case mentioned; the k.org administrator knows
what is in Linus's repository and is aware that there are no
remote-tracking branches or secret branches that would make the
resulting bundle unsuitable for priming a clone.

> "I also think "--all" is bad advice for another reason.

I do not think it is good advice for everybody, but the thing is,
what you are responding to is not advice.  It is just a statement of
fact, what is already done--one of the existing practices that an
approach to "resumable clone" may want to help.

* Re: Resumable git clone?
  From: Philip Oakley @ 2016-03-24 21:08 UTC
  To: Junio C Hamano
  Cc: Josh Triplett, Konstantin Ryabitsev, Git List, sarah, viro

From: "Junio C Hamano" <gitster@pobox.com>
> Not for the exact use case mentioned; the k.org administrator knows
> what is in Linus's repository and is aware that there are no
> remote-tracking branches or secret branches that would make the
> resulting bundle unsuitable for priming a clone.

OK.

> I do not think it is good advice for everybody, but the thing is,
> what you are responding to is not advice.  It is just a statement of
> fact, what is already done--one of the existing practices that an
> approach to "resumable clone" may want to help.

I was picking up on the need, for others who may be generating clone
bundles, to note that '--all' may not be the right thing for them, and
that somewhere we should record whatever is deemed the equivalent of
the current clone command.  This would get away from the web examples
which show '--all' as a quick solution for bundling (I'm one of the
offenders there).

If I understand the clone code, the rev-list args would be "HEAD
--branches".  But I could well be wrong.
--
Philip

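(In other words, something like the following, with the caveat above
that the exact rev-list arguments are an educated guess:

    $ git bundle create clone.bundle HEAD --branches
)
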