From: Nicolas Pitre <nico@cam.org>
To: Jakub Narebski <jnareb@gmail.com>
Cc: Tomasz Kontusz <roverorna@gmail.com>, git <git@vger.kernel.org>,
	Johannes Schindelin <Johannes.Schindelin@gmx.de>
Subject: Re: Continue git clone after interruption
Date: Wed, 19 Aug 2009 17:13:59 -0400 (EDT)	[thread overview]
Message-ID: <alpine.LFD.2.00.0908191552020.6044@xanadu.home> (raw)
In-Reply-To: <200908192142.51384.jnareb@gmail.com>

On Wed, 19 Aug 2009, Jakub Narebski wrote:

> Cc-ed Dscho, so he can participate in this subthread more easily.
> 
> On Wed, 19 Aug 2009, Nicolas Pitre wrote:
> > On Wed, 19 Aug 2009, Jakub Narebski wrote:
> 
> > > P.S. What do you think about 'bundle' capability extension mentioned
> > >      in a side sub-thread?
> > 
> > I don't like it.  Reason is that it forces the server to be (somewhat) 
> > stateful by having to keep track of those bundles and cycle them, and it 
> > doubles the disk usage by having one copy of the repository in the form 
> > of the original pack(s) and another copy as a bundle.
> 
> I agree about problems with disk usage, but I disagree about the server
> having to be stateful; the server can simply scan for bundles and
> offer links to them if the client requests the 'bundles' capability,
> somewhere around the initial git-ls-remote list of refs.

But that way it's the client that has to deal with whatever the server 
wants to offer, instead of the server actually serving data the way the 
client wants it.

> Well, offering a daily bundle in addition to a daily snapshot could be
> a good practice, at least until git acquires resumable fetch (resumable
> clone).

Outside of Git: maybe.  Through the git protocol: no.  And what would 
that bundle contain over the daily snapshot?  The whole history?  If so, 
that goes against the premise that the people concerned by all this have 
slow links and probably can't afford the time to download it all.  If 
the bundle contains only the top revision then it has no advantage over 
the snapshot.  Somewhere in the middle?  Sure, but then where to draw 
the line?  That's for the client to decide, not the server 
administrator.
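
For illustration, the client can already express that choice today with 
the --depth parameter of a shallow clone (the depth value below is an 
arbitrary example):

	git clone --depth=10 git://git.kernel.org/pub/scm/git/git.git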

And what if you start your slow transfer and it breaks in the middle?  
The next morning you want to restart it, in the hope that you might 
resume the transfer of the incomplete bundle.  But crap, the server has 
updated its bundle overnight and your half-bundle is now useless. 
You've wasted your bandwidth for nothing.

> > If you think about git.kernel.org, which has maybe hundreds of 
> > repositories, the big majority of which are actually forks of Linus' 
> > own repository, then having all those forks reference Linus' repository 
> > is a big disk space saver (and IO too as the referenced repository is 
> > likely to remain cached in memory).  Having a bundle ready for each of 
> > them will simply kill that space advantage, unless they all share the 
> > same bundle.
> 
> I am thinking about sharing the same bundle for related projects.

... meaning more administrative burden.

> > Now sharing that common bundle could be done of course, but that makes 
> > things yet more complex while still wasting IO because some requests 
> > will hit the common pack and some others will hit the bundle, making 
> > less efficient usage of the disk cache on the server.
> 
> Hmmm... true (unless bundles are on separate server).

... meaning additional but avoidable costs.

> > Yet, that bundle would probably not contain the latest revision if it is 
> > only periodically updated, even less so if it is shared between multiple 
> > repositories as outlined above.  And what people with slow/unreliable 
> > network links are probably most interested in is the latest revision and 
> > maybe a few older revisions, but probably not the whole repository as 
> > that is simply too long to wait for.  Hence having a big bundle is not 
> > flexible either with regards to the actual data transfer size.
> 
> I agree that bundle would be useful for restartable clone, and not
> useful for restartable fetch.  Well, unless you count (non-existing)
> GitTorrent / git-mirror-sync as this solution... ;-)

I don't think fetches after a clone are such an issue.  They are 
typically transfers orders of magnitude smaller than the initial 
clone.  Same goes for fetches to deepen a shallow clone, which are in 
fact fetches going back in history instead of forward.  I still stand 
by my assertion that bundles are suboptimal for a restartable clone.

As for GitTorrent / git-mirror-sync... those are still vaporware to me, 
and I therefore have doubts about their actual feasibility.  So no, I 
don't count on them.

> > Hence having a restartable git-archive service to create the top 
> > revision with the ability to cheaply (in terms of network bandwidth) 
> > deepen the history afterwards is probably the most straightforward way 
> > to achieve that.  The server need not be aware of separate bundles, etc.  
> > And the shared object store still works as usual with the same cached IO 
> > whether the data is needed for a traditional fetch or a "git archive" 
> > operation.
> 
> It's the "cheaply deepen history" that I doubt would be easy.  This is
> the most difficult part, I think (see also below).

Don't think so.  Try this:

	mkdir test
	cd test
	git init
	git fetch --depth=1 git://git.kernel.org/pub/scm/git/git.git

Result:

remote: Counting objects: 1824, done.
remote: Compressing objects: 100% (1575/1575), done.
Receiving objects: 100% (1824/1824), 3.01 MiB | 975 KiB/s, done.
remote: Total 1824 (delta 299), reused 1165 (delta 180)
Resolving deltas: 100% (299/299), done.
From git://git.kernel.org/pub/scm/git/git
 * branch            HEAD       -> FETCH_HEAD

You'll get the very latest revision for HEAD, and only that.  The size 
of the transfer will be roughly the size of a daily snapshot, except it 
is fully up to date.  It is, however, not resumable in the event of a 
network outage.  My proposal is to replace this with a "git archive" 
call.  It won't get all branches, but for the purpose of initialising 
one's repository that should be good enough.  And the "git archive" can 
be fully resumable as I explained.
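
To sketch what the resumable sequence could look like (note: the 
--skip-files and --skip-bytes options below are hypothetical, they do 
not exist in git today):

	# initial attempt, interrupted partway through the transfer
	git archive --remote=git://git.kernel.org/pub/scm/git/git.git \
		HEAD > snapshot.tar
	# resume: tell the server we already hold n complete files plus
	# m bytes of the next one (hypothetical options)
	git archive --remote=git://git.kernel.org/pub/scm/git/git.git \
		--skip-files=n --skip-bytes=m HEAD >> snapshot.tar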

Now to deepen that history.  Let's say you want 10 more revisions going 
back; then you simply perform the fetch again with --depth=10, as shown 
below.  Right now it doesn't seem to work optimally, but the pack that 
is then being sent could be made of deltas against objects found in the 
commits we already have.  Currently it seems the created pack also 
includes objects we already have in addition to those we want, which 
is IMHO a flaw in the shallow support that shouldn't be too hard to fix.  
Each level of deepening should then be as small as standard fetches 
going forward when updating the repository with new revisions.
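
Concretely, still in the same test repository as above:

	git fetch --depth=10 git://git.kernel.org/pub/scm/git/git.git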

> > Why "git archive"?  Because its content is well defined.  So if you give 
> > it a commit SHA1 you will always get the same stream of bytes (after 
> > decompression) since the way git sorts files is strictly defined.  It is 
> > therefore easy to tell a remote "git archive" instance that we want the 
> > content for commit xyz but that we already have n files, and that 
> > the last file we've got has m bytes.  There is simply no confusion about 
> > what we've got already, unlike with a partial pack which might need 
> > yet-to-be-received objects in order to make sense of what has been 
> > already received.  The server simply has to skip that many files and 
> > resume the transfer at that point, independently of the compression or 
> > even the archive format.
> 
> Let's reiterate it to check if I understand it correctly:
> 
> Any "restartable clone" / "resumable fetch" solution must begin with
> a file which is rock-solid stable wrt. reproducibility given the same
> parameters.  git-archive has this feature, a packfile doesn't (so I guess
> a bundle also doesn't, unless it was cached / saved on disk).

Right.
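
A quick way to convince oneself of that byte-stability (assuming the 
same git version on both ends, and any two clones of the same 
repository sitting at the same commit):

	# same commit in, same bytes out, hence the same digest everywhere
	git archive --format=tar HEAD | sha1sum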

> It would be useful if it were possible to generate part of this rock-solid
> file for a partial (range, resume) request, without the need to generate
> (calculate) parts that the client already downloaded.  Otherwise the server
> has to either waste disk space and IO on caching, or waste CPU (and IO)
> generating the part which is not needed and dropping it to /dev/null.
> git-archive, you say, has this feature.

"Could easily have" is more appropriate.

> Next you need to tell the server that you have those objects obtained
> via the resumable download ("git archive HEAD" in your proposal), and
> that it can use them and not include them in the prepared file/pack.
> "have" is limited to commits, and "have <sha1>" tells the server that
> you have <sha1> and all its prerequisites (dependencies).  You can't 
> use "have <sha1>" with the git-archive solution.  I don't know enough
> about the 'shallow' capability (and what it enables) to know whether
> it can be used for that.  Can you elaborate?

See above, or Documentation/technical/shallow.txt.
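
In short: after a shallow fetch the cut-off points are recorded 
locally, and the protocol's "shallow"/"unshallow" lines communicate 
them to the server, playing the role that "have" can't.  You can 
inspect the boundary yourself:

	# the commit(s) at which our shallow history is cut off
	cat .git/shallow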

> Then you have to finish the clone / fetch.  All solutions so far include
> some kind of incremental improvement.  My first proposal of bisect
> fetching 1/nth or a predefined-size pack is a bottom-up solution, where
> we build the full clone from the root commits up.  You propose, from
> what I understand, building the full clone from the top commit down,
> using deepening from a shallow clone.  In this step you either get the
> full incremental or not; downloading an incremental (from what I
> understand) is not resumable / incrementals do not support partial fetch.

Right.  However, like I said, the incremental part should be much 
smaller and therefore less susceptible to network troubles.


Nicolas
