From: Sam Vilain <sam@vilain.net>
To: Nicolas Pitre <nico@cam.org>
Cc: Jakub Narebski <jnareb@gmail.com>,
Tomasz Kontusz <roverorna@gmail.com>, git <git@vger.kernel.org>,
Johannes Schindelin <Johannes.Schindelin@gmx.de>
Subject: Re: Continue git clone after interruption
Date: Thu, 20 Aug 2009 12:26:38 +1200
Message-ID: <1250727998.5867.185.camel@maia.lan>
In-Reply-To: <alpine.LFD.2.00.0908191552020.6044@xanadu.home>
On Wed, 2009-08-19 at 17:13 -0400, Nicolas Pitre wrote:
> > It's the "cheaply deepen history" that I doubt would be easy. This is
> > the most difficult part, I think (see also below).
>
> Don't think so. Try this:
>
> mkdir test
> cd test
> git init
> git fetch --depth=1 git://git.kernel.org/pub/scm/git/git.git
>
> Result:
>
> remote: Counting objects: 1824, done.
> remote: Compressing objects: 100% (1575/1575), done.
> Receiving objects: 100% (1824/1824), 3.01 MiB | 975 KiB/s, done.
> remote: Total 1824 (delta 299), reused 1165 (delta 180)
> Resolving deltas: 100% (299/299), done.
> From git://git.kernel.org/pub/scm/git/git
> * branch HEAD -> FETCH_HEAD
>
> You'll get the very latest revision for HEAD, and only that. The size
> of the transfer will be roughly the size of a daily snapshot, except it
> is fully up to date. It is however non-resumable in the event of a
> network outage. My proposal is to replace this with a "git archive"
> call. It won't get all branches, but for the purpose of initialising
> one's repository that should be good enough. And the "git archive" can
> be fully resumable as I explained.
>
> Now to deepen that history. Let's say you want 10 more revisions going
> back; then you simply perform the fetch again with --depth=10. Right
> now it doesn't seem to work optimally, but the pack that is then being
> sent could be made of deltas against objects found in the commits we
> already have. Currently it seems that the pack created also includes
> objects we already have in addition to those we want, which is IMHO a
> flaw in the shallow support that shouldn't be too hard to fix. Each
> level of deepening should then be as small as standard fetches going
> forward when updating the repository with new revisions.
Nicolas, apart from starting with the most recent commits and working
backwards, this is very similar to the "bundle slicing" idea defined in
GitTorrent. What the GitTorrent research project has achieved so far is
defining a slicing algorithm and measuring how well slicing works, in
terms of wasted bandwidth.
If you do it right, you can also spread the download across mirrors.
E.g., given a starting point, a 'slice size' - which I based on
uncompressed object size, but which could just as well be based on
commit count - and a slice number to fetch, you should be able to look
up in the revision-list index which revisions to select, and then make
a thin pack corresponding to those commits. Creating this index is
currently the slowest part of creating bundle fragments in my Perl
implementation.
Once Nick Edelen's project is mergeable, we will have a mechanism for
relatively quickly drawing up a manifest of objects for these slices.
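For the curious, the slicing step can be sketched roughly like this.
This is a hypothetical illustration, not my actual Perl implementation;
in reality the object names and uncompressed sizes would come from
something like `git rev-list --objects` plus `git cat-file
--batch-check`:

```python
# Hypothetical sketch of bundle slicing: walk the object list in
# revision order and cut a new slice whenever the cumulative
# uncompressed size reaches the block size.
def slice_objects(objects, block_size):
    """objects: iterable of (sha1, uncompressed_size) in rev order."""
    slices, current, current_size = [], [], 0
    for sha, size in objects:
        current.append(sha)
        current_size += size
        if current_size >= block_size:
            slices.append(current)
            current, current_size = [], 0
    if current:
        slices.append(current)  # the final, short slice
    return slices

# e.g. slice_objects([("a", 600), ("b", 500), ("c", 700)], 1000)
# gives two slices: ["a", "b"] and ["c"]
```

Each resulting slice would then be handed to pack-objects as a thin
pack, with the preceding slices' commits as the delta base.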
So how much bandwidth is lost?
E.g., for git.git, taking the complete object list, slicing it into
1024k (uncompressed) bundle slices, and making thin packs from those
slices, we get:
Generating index...
Length is 1291327524, 1232 blocks
Slice #0: 1050390 => 120406 (11%)
Slice #1: 1058162 => 124978 (11%)
Slice #2: 1049858 => 104363 (9%)
...
Slice #51: 1105090 => 43140 (3%)
Slice #52: 1091282 => 45367 (4%)
Slice #53: 1067675 => 39792 (3%)
...
Slice #211: 1086238 => 25451 (2%)
Slice #212: 1055705 => 31294 (2%)
Slice #213: 1059460 => 7767 (0%)
...
Slice #1129: 1109209 => 38182 (3%)
Slice #1130: 1125925 => 29829 (2%)
Slice #1131: 1120203 => 14446 (1%)
Final slice: 623055 => 49345
Overall compressed: 39585851
Calculating Repository bundle size...
Counting objects: 107369, done.
Compressing objects: 100% (28059/28059), done.
Writing objects: 100% (107369/107369), 29.20 MiB | 48321 KiB/s, done.
Total 107369 (delta 78185), reused 106770 (delta 77609)
Bundle size: 30638967
Overall inefficiency: 29%
In the above output, the first figure is the complete un-delta'd,
uncompressed size of the slice - that is, the size of all of the new
objects that its commits introduce. The second figure is the full size
of a thin pack containing those objects. I.e. the above tells me that
git.git contains 1.2GB of uncompressed objects. Each slice ends up
varying in size between about 10k and 200k, but most of the slices end
up between 15k and 50k.
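For reference, the "overall inefficiency" figure is just the total of
all the thin-pack slices relative to a single bundle of the whole
repository; a quick check of the numbers above:

```python
# Reproducing the "Overall inefficiency: 29%" figure from the 1024k
# run: total thin-pack bytes across all slices, versus one full bundle.
slices_total = 39585851   # "Overall compressed" above
bundle_size = 30638967    # "Bundle size" above
inefficiency = slices_total / bundle_size - 1
print(f"{inefficiency:.0%}")  # prints "29%"
```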
Actually, the test script was thrown off by a loose root, which added
about 3MB to the compressed size, so the overall inefficiency at this
block size is actually more like 20%. I think I am also running into
the flaw you mention above, especially when I do a run with a larger
block size:
Generating index...
Length is 1291327524, 62 blocks
Slice #0: 21000218 => 1316165 (6%)
Slice #1: 20988208 => 1107636 (5%)
...
Slice #59: 21102776 => 1387722 (6%)
Slice #60: 20974960 => 876648 (4%)
Final slice: 6715954 => 261218
Overall compressed: 50071857
Calculating Repository bundle size...
Counting objects: 107369, done.
Compressing objects: 100% (28059/28059), done.
Writing objects: 100% (107369/107369), 29.20 MiB | 48353 KiB/s, done.
Total 107369 (delta 78185), reused 106770 (delta 77609)
Bundle size: 30638967
Overall inefficiency: 63%
Oddly, even though each slice is larger, the total packed size went
up. Trying with 100MB "blocks" I get:
Generating index...
Length is 1291327524, 13 blocks
Slice #0: 104952661 => 4846553 (4%)
Slice #1: 104898188 => 2830056 (2%)
Slice #2: 105007998 => 2856535 (2%)
Slice #3: 104909972 => 2583402 (2%)
Slice #4: 104909440 => 2187708 (2%)
Slice #5: 104859786 => 2555686 (2%)
Slice #6: 104873317 => 2358914 (2%)
Slice #7: 104881597 => 2183894 (2%)
Slice #8: 104863418 => 3555224 (3%)
Slice #9: 104896599 => 3192564 (3%)
Slice #10: 104876697 => 3895707 (3%)
Slice #11: 104903491 => 3731555 (3%)
Final slice: 32494360 => 1270887
Overall compressed: 38048685
Calculating Repository bundle size...
Counting objects: 107369, done.
Compressing objects: 100% (28059/28059), done.
Writing objects: 100% (107369/107369), 29.20 MiB | 48040 KiB/s, done.
Total 107369 (delta 78185), reused 106770 (delta 77609)
Bundle size: 30638967
Overall inefficiency: 24%
In the above, we broke the git.git download into 13 partial downloads
of a few megabytes each, at the expense of an extra 24% of download.
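To summarise the tradeoff across the three runs (the ~20MB label for
the 62-block run is inferred from 1291327524 / 62; the other numbers
are copied straight from the output above):

```python
# Overhead of sliced transfer vs. a single bundle, for the three
# block sizes tried above.
bundle = 30638967            # single-bundle size in bytes
runs = [                     # (label, slice count, total sliced bytes)
    ("1MB blocks",    1232, 39585851),
    ("~20MB blocks",    62, 50071857),
    ("100MB blocks",    13, 38048685),
]
for label, nslices, total in runs:
    print(f"{label}: {nslices} slices, {total / bundle - 1:.0%} overhead")
# 1MB blocks: 1232 slices, 29% overhead
# ~20MB blocks: 62 slices, 63% overhead
# 100MB blocks: 13 slices, 24% overhead
```

So the overhead is not monotonic in block size, which again suggests
the thin packs are carrying objects the receiver already has.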
Anyway, I hopefully have more to add to this, but it will do for a
starting point.
Sam