From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: git <git@vger.kernel.org>,
Git for human beings <git-users@googlegroups.com>
Cc: Christian Couder <christian.couder@gmail.com>
Subject: How de-duplicate similar repositories with alternates
Date: Thu, 29 Nov 2018 15:59:26 +0100
Message-ID: <87zhtsx73l.fsf@evledraar.gmail.com>
A co-worker asked me today how space could be saved when you have
multiple checkouts of the same repository (at different revs) on the
same machine. I said since these won't block-level de-duplicate well[1]
one way to do this is with alternates.
However, I didn't know how to get those gains for an existing clone
without a full re-clone, though I hadn't looked deeply into it. As it
turns out I was wrong about that, which I found while writing the
following test-case, which shows that it works:
(
cd /tmp &&
rm -rf /tmp/git-{master,pu,pu-alt}.git &&
# Normal clones
git clone --bare --no-tags --single-branch --branch master https://github.com/git/git.git /tmp/git-master.git &&
git clone --bare --no-tags --single-branch --branch pu https://github.com/git/git.git /tmp/git-pu.git &&
# An 'alternate' clone using 'master' objects from another repo
git --bare init /tmp/git-pu-alt.git &&
for git in git-pu.git git-pu-alt.git
do
echo /tmp/git-master.git/objects >/tmp/$git/objects/info/alternates
done &&
git -C git-pu-alt.git fetch --no-tags https://github.com/git/git.git pu:pu &&
# Respective sizes, 'alternate' clone much smaller
du -shc /tmp/git-*.git &&
# GC them all. Compacts the git-pu.git to git-pu-alt.git's size
for repo in git-*.git
do
git -C $repo gc
done &&
du -shc /tmp/git-*.git &&
# Add another big history (GFW) to git-{pu,master}.git (in that order!)
for repo in $(ls -d /tmp/git-*.git | sort -r)
do
git -C $repo fetch --no-tags https://github.com/git-for-windows/git master:master-gfw
done &&
du -shc /tmp/git-*.git &&
# Another GC. The objects now in git-master.git will be de-duped by all
for repo in git-*.git
do
git -C $repo gc
done &&
du -shc /tmp/git-*.git
)
This shows a scenario where we clone git.git at "master" and "pu" in
different places. After clone the relevant sizes are:
108M /tmp/git-master.git
3.2M /tmp/git-pu-alt.git
109M /tmp/git-pu.git
219M total
I.e. git-pu-alt.git is much smaller since it points via alternates to
git-master.git, and the history of "pu" shares most of the objects with
"master". But then how do you get those gains for git-pu.git? It turns
out you just run "git gc":
111M /tmp/git-master.git
2.1M /tmp/git-pu-alt.git
2.1M /tmp/git-pu.git
115M total
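For an existing clone, the two steps above (point objects/info/alternates
at the other repository's object directory, then "git gc") could be
wrapped up roughly as follows. This is only a minimal sketch: it assumes
bare repositories as in the test-case, and add_alternate is a made-up
name for illustration, not something git provides.

```shell
# Hypothetical helper, not part of git: record ALT's object directory in
# REPO's objects/info/alternates. Assumes both are bare repositories, so
# the object directory is simply <repo>/objects.
add_alternate () {
	repo=$1 alt=$2
	if ! test -d "$alt/objects"
	then
		echo "add_alternate: $alt/objects is not a directory" >&2
		return 1
	fi
	# Use an absolute path so it doesn't matter where git is run from.
	alt_objects=$(cd "$alt/objects" && pwd) || return 1
	mkdir -p "$repo/objects/info" || return 1
	# Append rather than overwrite, and skip entries already present.
	if ! grep -qxF "$alt_objects" "$repo/objects/info/alternates" 2>/dev/null
	then
		echo "$alt_objects" >>"$repo/objects/info/alternates"
	fi
}

# Usage, followed by the gc that drops the now-redundant local objects:
#   add_alternate /tmp/git-pu.git /tmp/git-master.git &&
#   git -C /tmp/git-pu.git gc
```

Unlike the plain ">" redirect in the test-case this appends, so any
alternates already listed for the repository are kept.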
This is the thing I was wrong about. In retrospect that's probably
because I'd been putting PATH_TO_REPO in objects/info/alternates, when
we actually need PATH_TO_REPO/objects, and neither "git gc" nor "git
fsck" warns about this. It's probably a good idea to patch that at some
point, i.e. whine about paths in alternates that don't have objects, or
at the very least those that don't exist. #leftoverbits
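Until git itself warns about this, a check along the following lines
could catch the mistake. It is only a sketch of the idea:
check_alternates is a hypothetical helper, it assumes a bare repository,
it doesn't handle quoted paths, and the "looks like a repository" test
(an objects/ subdirectory next to a HEAD file) is a heuristic chosen for
this sketch, not what git does.

```shell
# Hypothetical helper, not part of git: sanity-check each entry in a bare
# repository's objects/info/alternates and complain about suspicious ones.
# Returns 1 if any entry looks wrong, 0 otherwise.
check_alternates () {
	repo=$1
	file=$repo/objects/info/alternates
	test -f "$file" || return 0
	ret=0
	# The "|| test -n" part picks up a last line without a trailing newline.
	while IFS= read -r line || test -n "$line"
	do
		case "$line" in
		''|'#'*) continue ;;	# blank lines and comments
		/*) path=$line ;;
		*) path=$repo/objects/$line ;;	# relative to the object directory
		esac
		if ! test -d "$path"
		then
			echo "warning: alternate '$path' does not exist" >&2
			ret=1
		elif test -d "$path/objects" && test -f "$path/HEAD"
		then
			# Heuristic: this looks like a repository, not its objects/
			echo "warning: alternate '$path' looks like a repository; did you mean '$path/objects'?" >&2
			ret=1
		fi
	done <"$file"
	return $ret
}

# Usage:
#   check_alternates /tmp/git-pu.git || echo "fix objects/info/alternates first"
```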
Then when we fetch git-for-windows:master to all the repos they all grow
by the amount git-for-windows has diverged:
144M /tmp/git-master.git
36M /tmp/git-pu-alt.git
36M /tmp/git-pu.git
214M total
Note that the "sort -r" is critical here. If we fetched git-master.git
first (at this point the alternate for git-pu*.git) we wouldn't get the
duplication in the first place, but instead:
144M /tmp/git-master.git
2.1M /tmp/git-pu-alt.git
2.1M /tmp/git-pu.git
148M total
This shows the importance of keeping such an 'alternate' repo
up-to-date: if it is, we don't get the duplication in the first place.
Regardless, a "git gc" will coalesce them (this is from the run with
sort -r):
131M /tmp/git-master.git
2.1M /tmp/git-pu-alt.git
2.2M /tmp/git-pu.git
135M total
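Rather than relying on the sort order of the directory names, a fetch
loop can visit the repositories that have no alternates file first. A
minimal sketch, assuming bare repositories and only one level of
alternates; alternates_first is a hypothetical helper, not a git
command:

```shell
# Hypothetical helper, not part of git: print the given bare repositories
# with those that borrow from nobody first, then those that have a
# non-empty objects/info/alternates.
alternates_first () {
	# First pass: repositories without alternates.
	for repo
	do
		if ! test -s "$repo/objects/info/alternates"
		then
			printf '%s\n' "$repo"
		fi
	done
	# Second pass: repositories that have alternates.
	for repo
	do
		if test -s "$repo/objects/info/alternates"
		then
			printf '%s\n' "$repo"
		fi
	done
}

# Usage (paths must not contain whitespace for the unquoted expansion):
#   for repo in $(alternates_first /tmp/git-*.git)
#   do
#       git -C "$repo" fetch --no-tags https://github.com/git-for-windows/git master:master-gfw
#   done
```

With the three repositories from the test-case this fetches into
git-master.git before git-pu.git and git-pu-alt.git, so the new objects
are already available via the alternate by the time the others fetch.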
If you find this interesting make sure to read my
https://public-inbox.org/git/87k1s3bomt.fsf@evledraar.gmail.com/ and
https://public-inbox.org/git/87in7nbi5b.fsf@evledraar.gmail.com/ for the
caveats, i.e. if this is something intended for users then no ref in the
alternate can ever be rewound, since that can potentially result in
repository corruption.
1. https://public-inbox.org/git/87bmhiykvw.fsf@evledraar.gmail.com/