From: Avery Pennarun <apenwarr@gmail.com>
To: Git Mailing List <git@vger.kernel.org>
Subject: Idea: global git object cache
Date: Fri, 8 Jan 2010 16:05:15 -0500
Message-ID: <32541b131001081305nc25a811i73c96d1d252b9246@mail.gmail.com>
Hi all,
One thing I find curious about git is how objects mostly aren't shared
between multiple repositories on the local system. For example, if I
do:

    git clone git://git.kernel.org/pub/scm/git/git.git git1
    git clone git://git.kernel.org/pub/scm/git/git.git git2

Then I end up downloading the same objects from kernel.org *twice*.
If I use --reference on the second clone, then I can avoid
re-downloading all the objects, and it's much faster.
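
To make the --reference case concrete, here is a minimal sketch of a
wrapper that adds the option when a local repository is available. The
function name clone_with_reference and its "run" parameter are made up for
illustration; only "git clone --reference <repo>" itself is real.

```python
import subprocess


def clone_with_reference(url, dest, reference=None, run=subprocess.run):
    """Clone `url` into `dest`, borrowing objects from `reference` if given.

    `run` is injectable so the command line can be inspected without
    actually invoking git (an assumption made for this sketch).
    """
    cmd = ["git", "clone"]
    if reference is not None:
        # Objects already present in `reference` are not downloaded again.
        cmd += ["--reference", reference]
    cmd += [url, dest]
    run(cmd, check=True)
    return cmd
```

For the example above, the second clone would be
clone_with_reference("git://git.kernel.org/pub/scm/git/git.git", "git2",
reference="git1").
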
Unfortunately, I have to provide that option by hand, which is a
problem for git-submodule: it goes out to clone someone else's
repository automatically and doesn't know how to guess a value for
--reference. Another thing I commonly want to do with submodules is
to rm -rf the submodule's files, eg. because I change branches and git
doesn't clean it automatically. But then when I switch branches back
to the one with the submodule, git wants to go re-download the
submodule *again*. Redoing the checkout makes sense to me (just as
git deletes/recreates files when I normally switch branches) but
re-downloading seems silly.
So here's my suggestion to minimize downloads in a pretty easy way:
- whenever git creates a packfile in any repo (eg. during git gc or
git fetch), make an *extra* hardlink of it into
~/.gitcache/objects/pack.
- whenever git is considering which objects it does/doesn't currently
have, also consider the packs in ~/.gitcache/objects/pack (ie. using
the .git/objects/info/alternates mechanism). If one of the packs qualifies,
hardlink it into the current repo. Maybe give it a .keep file to
indicate that it's counterproductive to repack this pack.
- after git deletes a packfile in any repo (eg. during git gc), check
the link count of that pack in ~/.gitcache/objects/pack; if it's now
down to just 1, there are no other users of the pack, so delete it
there too. You would also need to prune the cachedir occasionally to
deal with repositories that were deleted in other ways (eg. rm -rf).
- share the list of refs in a similar way (bearing in mind that
different repos will probably have different refs that are all named
"refs/heads/master", of course) so that fetches will be efficient.
- extra improvement to submodule behaviour: hardlink packs from the
submodule into the superproject's objects/pack directory (or use a
different directory like .git/submodules/pack to keep things
separate). Also, submodules should use the superproject's pack
directory as an alternate, in case (as often happens for me) the
superproject already contains a bunch of objects from the submodule,
because the modules were split at some point.
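
The first three steps in the list above could be sketched roughly as
follows. This is only an illustration of the hardlink bookkeeping, not how
git does or would implement it: the function names (publish_pack,
adopt_pack, prune_cache) are hypothetical, and deciding whether a cached
pack "qualifies" is left out entirely.

```python
import os
from pathlib import Path

# Cache location named in the proposal; nothing in git uses it today.
CACHE_PACK_DIR = Path.home() / ".gitcache" / "objects" / "pack"


def _link_pack_files(pack, dest_dir):
    """Hardlink a .pack file and its .idx (if present) into dest_dir."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    pack = Path(pack)
    for src in (pack, pack.with_suffix(".idx")):
        dst = dest_dir / src.name
        if src.exists() and not dst.exists():
            os.link(src, dst)


def publish_pack(pack, cache_dir=CACHE_PACK_DIR):
    """Step 1: after a repo creates a pack, add an extra link in the cache."""
    _link_pack_files(pack, cache_dir)


def adopt_pack(pack_name, repo_pack_dir, cache_dir=CACHE_PACK_DIR):
    """Step 2: link a cached pack into a repo and mark it with a .keep file."""
    _link_pack_files(Path(cache_dir) / pack_name, repo_pack_dir)
    keep = (Path(repo_pack_dir) / pack_name).with_suffix(".keep")
    keep.touch()
    return keep


def prune_cache(cache_dir=CACHE_PACK_DIR):
    """Step 3: drop cached files whose only remaining link is the cache's."""
    removed = []
    for entry in sorted(Path(cache_dir).iterdir()):
        if entry.is_file() and entry.stat().st_nlink == 1:
            entry.unlink()
            removed.append(entry.name)
    return removed
```

Note that hardlinks only work within a single filesystem, so this sketch
assumes the cache and the repositories live on the same one.
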
I believe this would be quite easy to implement and would give an
immediate efficiency improvement. The ~/.gitcache feature could be
enabled/disabled by a config option. Is there any reason not to do
it?
Thanks,
Avery
P.S. I've been testing git's behaviour with lots of very large packs -
I'm currently using about 58 packs of about 1 GB each - as part of my
'bup' git-based backup tool (http://apenwarr.ca/log/?m=201001#04).
Repacking and fsck are obviously horrendously slow with that much
data, but bup avoids those operations as much as possible, and a
~/.gitcache wouldn't need to worry about them either (since each repo
is still responsible for repacking its own packs). Overall
performance for other git operations seems to be fine, though. And
searching the cache as a last resort can be optimized by always
searching packs in MRU order, in case git doesn't already do this.
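
As a rough illustration of the MRU idea: keep the packs in a list, and
whenever a lookup hits a pack, move that pack to the front so it is tried
first next time. The MruPackList class below is hypothetical, and it
models each pack as a plain set of object ids rather than reading real
.idx files.

```python
class MruPackList:
    """Search packs in most-recently-used order."""

    def __init__(self, packs):
        # packs: iterable of (pack_name, set_of_object_ids) pairs.
        self._packs = list(packs)

    def find(self, object_id):
        """Return the name of the pack holding object_id, or None."""
        for i, (name, ids) in enumerate(self._packs):
            if object_id in ids:
                # Move the pack that just hit to the front of the list.
                self._packs.insert(0, self._packs.pop(i))
                return name
        return None

    def order(self):
        """Current search order, most recently used first."""
        return [name for name, _ in self._packs]
```
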