Re: Continue git clone after interruption

From: Jakub Narebski <jnareb@gmail.com>
To: Nicolas Pitre <nico@cam.org>
Cc: Tomasz Kontusz <roverorna@gmail.com>, git <git@vger.kernel.org>,
	Johannes Schindelin <Johannes.Schindelin@gmx.de>,
	Scott Chacon <schacon@gmail.com>
Subject: Re: Continue git clone after interruption
Date: Fri, 21 Aug 2009 23:41:30 +0200
Message-ID: <200908212341.33324.jnareb@gmail.com>
In-Reply-To: <alpine.LFD.2.00.0908211614220.6044@xanadu.home>

On Fri, 21 Aug 2009, Nicolas Pitre wrote:
> On Fri, 21 Aug 2009, Jakub Narebski wrote:
>> On Thu, 20 Aug 2009, Nicolas Pitre wrote:
>>> On Thu, 20 Aug 2009, Jakub Narebski wrote:

>>>> It is however only 2.5 MB out of 37 MB that are resumable, which is 7%
>>>> (well, that of course depends on repository).  Not that much that is
>>>> resumable.
>>> 
>>> Take the Linux kernel then.  It is more like 75 MB.
>> 
>> Ah... good example.
>> 
>> On the other hand Linux is fairly large project in terms of LoC, but
>> it had its history cut when moving to Git, so the ratio of git-archive
>> of HEAD to the size of packfile is overemphasized here.
> 
> That doesn't matter.  You still need that amount of data up front to do 
> anything.  And I doubt people with slow links will want the full history 
> anyway, regardless if it goes backward 4 years or 18 years back.

On the other hand, an unreliable link doesn't have to mean an
unreasonably slow link.

Hopefully GitTorrent / git-mirror-sync will finally come out of
vapourware and won't share the fate of Duke Nukem Forever ;-),
and we will have it as an alternative way to clone large repositories.
Well, supposedly there is some code, and last year's GSoC project at
least shook the dust off the initial design and made it simpler, IIUC.
 
>> You make use here of a few facts:
[...]

>> 2. There is support in git pack format to do 'deepening' of shallow
>>    clone, which means that git can generate incrementals in top-down
>>    order, _similar to how objects are ordered in packfile_.
> 
> Well... the pack format was not meant for that "support".  The fact that 
> the typical object order used by pack-objects when serving fetch request 
> is amenable to incremental top-down updates is rather coincidental and 
> not really planned.

Oops.  I meant "git pack PROTOCOL" here, not "git pack _format_":
the one about the want/have/shallow/deepen exchange.
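
To make the exchange concrete, here is a minimal sketch of how a
client might frame a "deepen" request.  It assumes only the pkt-line
framing (four hex digits of total length, header included, followed by
the payload; "0000" is the flush packet).  The function names are made
up for illustration, and capability negotiation on the first "want"
line is left out, so this is not a complete client.

```python
def pkt_line(payload: bytes) -> bytes:
    """Frame one pkt-line: 4 hex digits of total length, then the payload."""
    length = len(payload) + 4  # the length counts its own 4 header bytes
    if length > 0xFFFF:
        raise ValueError("payload does not fit in a 4-hex-digit length")
    return b"%04x" % length + payload


FLUSH_PKT = b"0000"  # special packet that ends a group of lines


def deepen_request(want: str, shallow: list[str], depth: int) -> bytes:
    """Build the want/shallow/deepen part of a fetch request.

    Hypothetical helper: a real client also sends capabilities on the
    first "want" line and later negotiates with "have" lines.
    """
    lines = [pkt_line(f"want {want}\n".encode())]
    # Tell the server where our history is currently cut off.
    lines += [pkt_line(f"shallow {oid}\n".encode()) for oid in shallow]
    lines.append(pkt_line(f"deepen {depth}\n".encode()))
    lines.append(FLUSH_PKT)
    return b"".join(lines)
```

The "shallow" lines are what would let a resumed clone say "I already
have history down to these commits".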
 
[...]
>>> A special 
>>> mode to pack-object could place commit objects only after all the 
>>> objects needed to create that revision.  So once you get a commit object 
>>> on the receiving end, you could assume that all objects reachable from 
>>> that commit are already received, or you had them locally already.
>> 
>> Yes, with such mode (which I think wouldn't reduce / interfere with
>> ability for upload-pack to pack more tightly by reordering objects
>> and choosing different deltas) it would be easy to do a salvage of
>> a partially completed / transferred packfile.  Even if there is no
>> extension to tell git server which objects we have ("have" is only
>> about commits), if there is at least one commit object in received
>> part of packfile, we can try to continue from later (from more);
>> there is less left to download.
> 
> Exact.  Suffice to set the last received commit(s) (after validation) as 
> one of the shallow points.

Assuming that the received commit is complete (has all its
prerequisites), and is connected to the rest of the partially
[shallow] cloned repository.
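
The validation step could be sketched roughly as below.  This is a toy
model, not git code: objects are plain tuples, `refs` holds what an
object points at directly (a commit's tree, a tree's entries; parent
commits are deliberately left out), and all names are hypothetical.
A commit from the partial pack qualifies as a shallow point only if
everything reachable from its tree was received or is already local.

```python
from typing import NamedTuple


class Obj(NamedTuple):
    oid: str
    kind: str           # "commit", "tree" or "blob"
    refs: tuple = ()    # commit: (tree,), tree: its entries, blob: ()


def complete_commits(received, have_locally=frozenset()):
    """Return ids of received commits whose whole tree is available.

    `received` is the list of objects salvaged from the truncated pack,
    in stream order; `have_locally` is the set of ids we already had.
    """
    by_id = {obj.oid: obj for obj in received}

    def is_complete(oid, seen):
        if oid in have_locally or oid in seen:
            return True
        obj = by_id.get(oid)
        if obj is None:
            return False        # prerequisite never arrived
        seen.add(oid)
        return all(is_complete(ref, seen) for ref in obj.refs)

    return [obj.oid for obj in received
            if obj.kind == "commit" and is_complete(obj.oid, set())]
```

The commits it returns are the candidates to record as shallow points
before asking the server for the rest.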

>>>> Documentation/technical/shallow.txt doesn't cover "shallow", "unshallow"
>>>> and "deepen" commands from 'shallow' capability extension to git pack
>>>> protocol (http://git-scm.com/gitserver.txt).
>>> 
>>> 404 Not Found
>>> 
>>> Maybe that should be committed to git in Documentation/technical/  as 
>>> well?
>> 
>> This was plain text RFC for the Git Packfile Protocol, generated from
>> rfc2629 XML sources at http://github.com/schacon/gitserver-rfc
> 
> I suggest you track it down and prod/propose a version for merging in 
> the git repository.

Scott Chacon was (and is) CC-ed.
 
I don't know if you remember the earlier discussion about the pack
protocol, stemming from the fact that some git (re)implementations
(Dulwich, JGit) failed to implement it properly, where properly = same
as git-core, i.e. the original implementation in C... because there
was not enough documentation.


>>>> P.S. As you can see implementing resumable clone isn't easy...
>>> 
>>> I've been saying that all along for quite a while now.   ;-)
>> 
>> Well, on the other hand we have the example of how long it took to
>> come to the current implementation of git submodules.  But it finally
>> got done.
> 
> In this case there is still no new line of code whatsoever.  Thinking 
> it through is what takes time.

Measure twice, cut once :-)

In this case I think designing up front is a good approach.
 
>> The git-archive + deepening approach you proposed can be split into
>> smaller individual improvements.  You don't need to implement it all
>> at once.
[...]

>> 3. Create new git-archive pseudoformat, used to transfer single commit
>>    (with commit object and original branch name in some extended header,
>>    similar to how commit ID is stored in extended pax header or ZIP
>>    comment).  It would imply not using export-* gitattributes.
> 
> The format I was envisioning is really simple:
> 
> First the size of the raw commit object data content in decimal, 
> followed by a 0 byte, followed by the actual content of the commit 
> object, followed by a 0 byte.  (Note: this could be the exact same 
> content as the canonical commit object data with the "commit" prefix, 
> but as all the rest are all blob content this would be redundant.)
> 
> Then, for each file:
> 
>  - The file mode in octal notation just as in tree objects
>  - a space
>  - the size of the file in decimal
>  - a tab
>  - the full path of the file
>  - a 0 byte
>  - the file content as found in the corresponding blob
>  - a 0 byte
> 
> And finally some kind of marker to indicate the end of the stream.
> 
> Put the lot through zlib and you're done.

So you don't want to just tack the commit object (as an extended pax
header, or a comment - if that is at all possible) onto the existing
'tar' and 'zip' archive formats.  It is probably better to design the
format from scratch.
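
For reference, the layout described above is small enough to sketch in
a few lines.  The following is only an illustration under stated
assumptions: the end-of-stream marker is left open in the proposal, so
a lone 0 byte in place of the next file header is my own choice, and
the function names are made up.

```python
import zlib

# Assumption: the proposal does not define the end marker; a lone 0 byte
# where the next file header would start is used here for illustration.
END_MARKER = b"\0"


def encode_snapshot(commit: bytes, files) -> bytes:
    """Serialize a commit plus (mode, path, content) triples, then deflate."""
    out = [b"%d\0" % len(commit), commit, b"\0"]
    for mode, path, content in files:
        # "<octal mode> <decimal size>\t<full path>\0<content>\0"
        out.append(b"%o %d\t" % (mode, len(content)) + path + b"\0")
        out.append(content + b"\0")
    out.append(END_MARKER)
    return zlib.compress(b"".join(out))


def decode_snapshot(stream: bytes):
    """Inverse of encode_snapshot; raises ValueError on a damaged stream."""
    data = zlib.decompress(stream)

    def take_until(pos, sep):
        end = data.index(sep, pos)      # ValueError if sep is missing
        return data[pos:end], end + 1

    def take_body(pos, size):
        if data[pos + size:pos + size + 1] != b"\0":
            raise ValueError("missing 0 byte after content")
        return data[pos:pos + size], pos + size + 1

    size, pos = take_until(0, b"\0")
    commit, pos = take_body(pos, int(size))
    files = []
    while data[pos:pos + 1] != END_MARKER:
        header, pos = take_until(pos, b"\0")
        meta, path = header.split(b"\t", 1)   # path itself may contain tabs
        mode, size = meta.split(b" ")
        content, pos = take_body(pos, int(size))
        files.append((int(mode, 8), path, content))
    return commit, files
```

Because every content block is preceded by its size, file contents may
contain 0 bytes without confusing the parser; the trailing 0 byte is
just a sanity check.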
 
>> 4. Implement alternate ordering of objects in packfile, so commit object
>>    is put immediately after all its prerequisites.
> 
> That would require some changes in the object enumeration code which is 
> an area of the code I don't know well.

Oh.
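
Just to pin down what ordering I mean, here is a toy sketch; it has
nothing to do with the actual object enumeration code in pack-objects.
The object graph is a plain dict (hypothetical `refs`: commit -> its
tree, tree -> its entries, parents omitted on purpose), and each object
is emitted after everything it points to, so a commit always lands
right after the last object its revision needs.

```python
def prerequisite_first_order(commits, refs):
    """Order objects so each commit follows all objects of its revision.

    commits: commit ids, newest first (top-down, as in a fetch).
    refs:    dict mapping an id to the ids it references directly.
    """
    emitted = set()
    order = []

    def emit(oid):
        if oid in emitted:
            return              # shared with a newer revision, already sent
        emitted.add(oid)
        for child in refs.get(oid, ()):
            emit(child)         # prerequisites go out first
        order.append(oid)

    for commit in commits:
        emit(commit)
    return order
```

With refs = {"c2": ["t2"], "t2": ["b1", "b2"], "c1": ["t1"],
"t1": ["b1"]} and commits ["c2", "c1"], the result is
b1, b2, t2, c2, t1, c1: cutting the stream anywhere after c2 still
leaves one complete revision.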

-- 
Jakub Narebski
Poland
