From: Michael Haggerty <mhagger@alum.mit.edu>
To: Barry Roberts <blr@robertsr.us>
Cc: git <git@vger.kernel.org>
Subject: Re: Replacing large blobs in git history
Date: Wed, 07 Mar 2012 10:04:37 +0100
Message-ID: <4F5724A5.7050405@alum.mit.edu>
In-Reply-To: <CAD-6W7byTiuE9MFZY1yG_ann-Ox7+wGjYduZ=Wwmw0ToF5Pynw@mail.gmail.com>

On 03/06/2012 05:09 PM, Barry Roberts wrote:
> I started this question on #git last week, but this is getting long,
> and things have changed some, so I'm going to try here.
>
> I had a 3rd party jar file checked in to our git repository. It was
> about 4 mb, so no big deal. Then about 17 months ago somebody checked
> in a 550 mb version. There were several versions of the original file
> in several different directories. The large version replaced the
> small version in some of those directories (but not all of them).
> Then somebody found a "small" version that was only 110 mb and
> replaced some of the 550 mb files and some of the old 4 mb files.
> Finally several months after that we got the correct updated 5 mb
> latest version. But I'm still carrying around an extra 660 mb in my
> object database, and we are adding developers and moving to an
> off-site location with lower bandwidth and higher latency, so I would
> like to clean this up.
>
> My first attempt just removed the blob (by hash ID). It's been over a
> year since the small correct file was checked in, so the odds of ever
> needing to build anything that old are very slim. But after thinking
> about it some, I came up with this to replace the blob with the
> correct one and wanted to see if this is a reasonable way to do this
> before I actually backup and then replace my central git repository.
>
> git filter-branch --index-filter 'killem=$(git ls-files --stage |
> grep 7a36af54a6c47\\\|abe809091bcb3 ) ; if [ -n "$killem" ] ; then git
> ls-files --stage |grep 7a36af54a6c47\\\|abe809091bcb3 | sed -f
> /home/blr/tmp/chgblob.sed | git update-index --index-info ; fi'
>
> chgblob.sed looks like this:
> s/7a36af54a6c47a29eb9690caefa132489d39c4d0/8924ef0f78b3d09957a8697ca93cce6700771071/g
> s/abe809091bcb37a06284f8353366074622d72373/8924ef0f78b3d09957a8697ca93cce6700771071/g
>
> 7a36af is the 550 mb blob, abe80909 is the 110 mb, and 8924ef0f is the
> 5 mb new version.

You could use "git replace" to cause the bad blobs to be replaced
everywhere they appear:

    $ git replace 7a36af54a6c47a29eb9690caefa132489d39c4d0 \
          8924ef0f78b3d09957a8697ca93cce6700771071
    $ git replace abe809091bcb37a06284f8353366074622d72373 \
          8924ef0f78b3d09957a8697ca93cce6700771071

Then you could use "git filter-branch" to "bake in" the substitutions
(but please see the caveats mentioned by Neal).

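To see what the "git replace" commands above do before touching a real
repository, here is a minimal sketch in a throwaway repository. The file
name lib.jar, the one-line contents, and the demo identity are all made up
for illustration; only "git replace", "git hash-object", "git cat-file",
and the --no-replace-objects option are real. It covers only the
replacement step, not the filter-branch rewrite.

```shell
#!/bin/sh
# Scratch-repo sketch: file names and contents are placeholders.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .

# Commit a stand-in for the oversized jar and note its blob ID.
printf 'big contents\n' > lib.jar
git add lib.jar
git commit -q -m 'add oversized jar'
big=$(git rev-parse HEAD:lib.jar)

# Store a stand-in for the correct jar as a blob in the object database.
small=$(printf 'small contents\n' | git hash-object -w --stdin)

# Map the big blob to the small one (this creates refs/replace/$big).
git replace "$big" "$small"

# Ordinary object lookups now return the replacement ...
seen=$(git cat-file -p "$big")
# ... while --no-replace-objects still reaches the original blob.
raw=$(git --no-replace-objects cat-file -p "$big")

echo "with replace refs:    $seen"
echo "without replace refs: $raw"
git replace -l    # lists the IDs of objects that have a replacement
```

The original blob stays in the object database the whole time; the
replace ref only changes what is returned when the old ID is looked up.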
It seems like an alternative to using "git filter-branch" would be to
share the "git replace" references across repositories. This would make
the small version of the file appear wherever it should without
requiring history to be rewritten. But I don't believe that this
approach would allow the large versions of the file to be discarded
by the git garbage collector, so it would not help you reduce clone sizes.

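A rough sketch of sharing the replace references, again in throwaway
repositories: the names central.git, dev1 and dev2 and the file contents
are invented for the example, and the explicit refs/replace/* refspec is
one possible way to transfer the references, not necessarily the only one.
The last check illustrates the size concern: the clone still contains the
original blob.

```shell
#!/bin/sh
# Scratch sketch: repository names and contents are placeholders.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"
git init -q --bare central.git

# First developer: commit the big blob, then define the replacement.
git init -q dev1
cd dev1
printf 'big contents\n' > lib.jar
git add lib.jar
git commit -q -m 'add oversized jar'
big=$(git rev-parse HEAD:lib.jar)
small=$(printf 'small contents\n' | git hash-object -w --stdin)
git replace "$big" "$small"

# Publish the branch and, explicitly, the replace refs.
branch=$(git symbolic-ref --short HEAD)
git remote add origin ../central.git
git push -q origin "$branch" 'refs/replace/*:refs/replace/*'
cd ..

# Second developer: a plain clone does not fetch refs/replace/*,
# so they are requested with an explicit refspec.
git clone -q central.git dev2
cd dev2
git fetch -q origin 'refs/replace/*:refs/replace/*'

# Lookups of the old ID now yield the small blob ...
seen=$(git cat-file -p "$big")
# ... but the original big blob was still transferred by the clone.
if git --no-replace-objects cat-file -e "$big"; then
    still_there=yes
else
    still_there=no
fi
echo "replacement seen: $seen; original still present: $still_there"
```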
Michael
--
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/