From: Shawn Pearce <spearce@spearce.org>
To: Nicolas Pitre <nico@fluxnic.net>
Cc: Johannes Sixt <j.sixt@viscovery.net>,
git@vger.kernel.org, Junio C Hamano <gitster@pobox.com>,
John Hawley <warthog19@eaglescrag.net>
Subject: Re: [RFC] Add --create-cache to repack
Date: Mon, 31 Jan 2011 10:47:34 -0800
Message-ID: <AANLkTimW=fuKrhw6+ZDipEtSGX_oR4LbTZzyAxZ8Pry1@mail.gmail.com>
In-Reply-To: <AANLkTi=U7qRRij=BQXC1Goqa9toDFfaVKT=+-8zYxCcc@mail.gmail.com>
On Fri, Jan 28, 2011 at 17:32, Shawn Pearce <spearce@spearce.org> wrote:
>>> >
>>> >> This started because I was looking for a way to speed up clones coming
>>> >> from a JGit server. Cloning the linux-2.6 repository is painful,
>
> Well, scratch the idea in this thread. I think.
Nope, I'm back in favor of this after fixing JGit's thin pack
generation. Here's why.
Take linux-2.6.git as of Jan 12th, with the cache root as of Dec 28th:
$ git update-ref HEAD f878133bf022717b880d0e0995b8f91436fd605c
$ git-repack.sh --create-cache \
--cache-root=b52e2a6d6d05421dea6b6a94582126af8cd5cca2 \
--cache-include=v2.6.11-tree
$ git repack -a -d
$ ls -lh objects/pack/
total 456M
1.4M pack-74af5edca80797736fe4de7279b2a81af98470a5.idx
38M pack-74af5edca80797736fe4de7279b2a81af98470a5.pack
49M pack-d3e77c8b3045c7f54fa2fb6bbfd4dceca1e2b9fa.idx
89 pack-d3e77c8b3045c7f54fa2fb6bbfd4dceca1e2b9fa.keep
368M pack-d3e77c8b3045c7f54fa2fb6bbfd4dceca1e2b9fa.pack
Our "recent history" is 38M, and our "cached pack" is 368M. Together
that is 406M, a bit more disk than is strictly necessary; the
repository should fit in ~380M. Call it ~26M of wasted disk. The
"cached object list" I proposed elsewhere in this thread would cost
about 41M of disk and is utterly useless except for initial clones.
Here we are wasting about 26M of disk to have slightly shorter delta
chains in the cached pack (otherwise known as our ancient history).
So it's a slightly smaller waste, and we get some (minor) benefit.
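As a rough check of the disk accounting above, here is a minimal Python sketch. The variable names are illustrative only, and the sizes are the rounded figures from the `ls -lh` listing, so the results are approximate.

```python
# Sizes in MiB, rounded as printed by `ls -lh` in the listing above.
recent_pack = 38            # "recent history" pack
cached_pack = 368           # "cached pack" (the one with the .keep file)
single_pack_estimate = 380  # approximate size the repository should need
object_list_cost = 41       # estimated cost of the "cached object list" idea

on_disk = recent_pack + cached_pack
wasted = on_disk - single_pack_estimate
saved_vs_object_list = object_list_cost - wasted

print(f"packs on disk: {on_disk}M")
print(f"overhead versus ~{single_pack_estimate}M: ~{wasted}M")
print(f"smaller than the cached object list by: ~{saved_vs_object_list}M")
```

The index files are left out of the comparison, as they are in the text.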
Clone without pack caching:
$ time git clone --bare git://localhost/tmp_linux26_withTag tmp_in.git
Cloning into bare repository tmp_in.git...
remote: Counting objects: 1861830, done
remote: Finding sources: 100% (1861830/1861830)
remote: Getting sizes: 100% (88243/88243)
remote: Compressing objects: 100% (88184/88184)
Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done.
remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844)
Resolving deltas: 100% (1564621/1564621), done.
real 3m19.005s
user 1m36.250s
sys 0m10.290s
Clone with pack caching:
$ time git clone --bare git://localhost/tmp_linux26_withTag tmp_in.git
Cloning into bare repository tmp_in.git...
remote: Counting objects: 1601, done
remote: Counting objects: 1828460, done
remote: Finding sources: 100% (50475/50475)
remote: Getting sizes: 100% (18843/18843)
remote: Compressing objects: 100% (7585/7585)
remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510)
Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done.
Resolving deltas: 100% (1559477/1559477), done.
real 2m2.938s
user 1m35.890s
sys 0m9.830s
Using the cached pack increased our total data transfer by 2.39 MiB,
but saved about 1m16s on server computation time (3m19s versus 2m03s
real). If we go back and look at our cached pack size (368M), the
leading thin-pack should be about 10.4 MiB (378.40M - 368M = 10.4M).
If I modify the tmp_in.git client to have only the cached pack's tip
and fetch using CGit, we see the thin pack to bring ourselves current
is 11.07 MiB (JGit does this in 10.96 MiB):
$ cd tmp_in.git
$ git update-ref HEAD b52e2a6d6d05421dea6b6a94582126af8cd5cca2
$ git repack -a -d ; # yay we are at ~1 month ago
$ time git fetch ../tmp_linux26_withTag
remote: Counting objects: 60570, done.
remote: Compressing objects: 100% (11924/11924), done.
remote: Total 49804 (delta 42196), reused 44837 (delta 37231)
Receiving objects: 100% (49804/49804), 11.07 MiB | 7.37 MiB/s, done.
Resolving deltas: 100% (42196/42196), completed with 4956 local objects.
From ../tmp_linux26_withTag
* branch HEAD -> FETCH_HEAD
real 0m35.083s
user 0m25.710s
sys 0m1.190s
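The figures from the two clone runs and the fetch above can be rechecked with a short Python sketch. The helper `seconds` and the variable names are made up for illustration; the inputs are copied from the `time` output and the "Receiving objects" lines. Note that the time difference here is wall-clock ("real") time as seen by the client, which is only a proxy for server computation time.

```python
def seconds(minutes, secs):
    """Convert a `time`-style NmS.SSSs reading to seconds."""
    return minutes * 60 + secs

# Wall-clock ("real") time of each clone.
plain_real = seconds(3, 19.005)   # without pack caching
cached_real = seconds(2, 2.938)   # with pack caching

# Bytes received by the client, in MiB.
plain_mib = 376.01
cached_mib = 378.40
cached_pack_mib = 368             # size of the cached pack on disk (rounded)
cgit_thin_pack_mib = 11.07        # incremental fetch measured with C Git

extra_transfer = cached_mib - plain_mib
time_saved = plain_real - cached_real
thin_pack_estimate = cached_mib - cached_pack_mib

print(f"extra transfer: {extra_transfer:.2f} MiB")
print(f"wall-clock time saved: {time_saved:.1f} s")
print(f"estimated leading thin pack: {thin_pack_estimate:.1f} MiB")
print(f"measured incremental fetch: {cgit_thin_pack_mib:.2f} MiB")
```

The estimated thin pack and the measured incremental fetch land within about 1 MiB of each other, which is the basis for the "no worse" comparison.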
The pack caching feature is *no worse* in transfer size than if the
client copied the pack from 1 month ago, and then did an incremental
fetch to bring themselves current. Compared to the naive clone, it
saves an incredible amount of working set space and CPU time. The
server only needs to keep track of the incremental thin pack, and can
completely ignore the ancient history objects. It's a great
alternative for projects that want users to pull down a large stable
repository over rsync or dumb HTTP transport, then incrementally fetch
to bring themselves current. Or for busy mirror sites that are willing
to trade some small bandwidth for server CPU and memory.
In this particular example, there is ~11 MiB of data that cannot be
safely resumed, or the first 2.9%. At 56 KiB/s, a client needs to get
through the first 3 minutes of transfer before they can reach the
resumable checkpoint (where the thin pack ends, and the cached pack
starts). It would be better if we could resume anywhere in the
stream, but being able to resume the last 97% is infinitely better
than being able to resume nothing. If someone wants to really go
crazy, this is where a "gittorrent" client could start up and handle
the remaining 97% of the transfer. :-)
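The resumability numbers above follow from simple arithmetic, sketched here in Python. The names are illustrative, the thin pack is taken as the rounded ~11 MiB from the text, and the link is assumed to sustain exactly 56 KiB/s with no protocol overhead.

```python
thin_pack_mib = 11.0     # leading thin pack that cannot be safely resumed
total_mib = 378.40       # total transfer for the cached-pack clone
link_kib_per_s = 56      # assumed sustained client bandwidth

non_resumable_fraction = thin_pack_mib / total_mib
resumable_fraction = 1 - non_resumable_fraction
seconds_to_checkpoint = thin_pack_mib * 1024 / link_kib_per_s

print(f"not resumable: {non_resumable_fraction:.1%}")
print(f"resumable: {resumable_fraction:.1%}")
print(f"time to reach the checkpoint: {seconds_to_checkpoint / 60:.1f} min")
```

Any shrinkage of the leading thin pack moves the checkpoint earlier in proportion.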
I think this is worthwhile. If we are afraid of the extra 2.39 MiB
data transfer this forces on the client when the repository owner
enables the feature, we should go back and improve our thin-pack code.
Transferring 11 MiB to catch up a kernel from Dec 28th to Jan 12th
sounds like a lot of data, and any improvements in the general
thin-pack code would shrink the leading thin-pack, possibly getting us
that 2.39 MiB back.
--
Shawn.