git.vger.kernel.org archive mirror
* Replacing large blobs in git history
@ 2012-03-06 16:09 Barry Roberts
  2012-03-06 20:49 ` Neal Kreitzinger
  2012-03-07  9:04 ` Michael Haggerty
  0 siblings, 2 replies; 6+ messages in thread
From: Barry Roberts @ 2012-03-06 16:09 UTC
  To: git

I started this question on #git last week, but this is getting long,
and things have changed some, so I'm going to try here.

I had a 3rd party jar file checked in to our git repository.  It was
about 4 mb, so no big deal.  Then about 17 months ago somebody checked
in a 550 mb version.  There were several versions of the original file
in several different directories.  The large version replaced the
small version in some of those directories (but not all of them).
Then somebody found a "small" version that was only 110 mb and
replaced some of the 550 mb files and some of the old 4 mb files.
Finally several months after that we got the correct updated 5 mb
latest version.  But I'm still carrying around an extra 660 mb in my
object database, and we are adding developers and moving to an
off-site location with lower bandwidth and higher latency, so I would
like to clean this up.

My first attempt just removed the blob (by hash ID).  It's been over a
year since the small correct file was checked in, so the odds of ever
needing to build anything that old are very slim. But after thinking
about it some, I came up with this to replace the blob with the
correct one and wanted to see if this is a reasonable way to do this
before I actually back up and then replace my central git repository.

git filter-branch --index-filter 'killem=$(git ls-files --stage |
grep 7a36af54a6c47\\\|abe809091bcb3 ) ; if [ -n "$killem" ] ; then
git ls-files --stage | grep 7a36af54a6c47\\\|abe809091bcb3 |
sed -f /home/blr/tmp/chgblob.sed | git update-index --index-info ; fi'

chgblob.sed looks like this:
s/7a36af54a6c47a29eb9690caefa132489d39c4d0/8924ef0f78b3d09957a8697ca93cce6700771071/g
s/abe809091bcb37a06284f8353366074622d72373/8924ef0f78b3d09957a8697ca93cce6700771071/g

7a36af is the 550 mb blob, abe80909 is the 110 mb, and 8924ef0f is the
5 mb new version.

This isn't extremely efficient since it does the 'git ls-files
--stage' twice (once to see if the blob is used, then again to change
it ONLY if the blob is referenced in the current index).  But that
only adds a few seconds to the 28 minute runtime, so I'm not too
worried about that.  And yes, I could just check for the return value
of grep, but I did echo $killem while I was debugging and that was
useful, so I just left it like that.
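
For comparison, here is a minimal single-pass sketch of the same idea. It
relies on 'sed -n' with the 'p' flag printing only the index entries whose
blob ID was actually rewritten, so the separate grep check is not needed.
The function name swap_blobs and the file path in the usage comment are
hypothetical, and the sketch assumes that 'git update-index --index-info'
treats empty input as a no-op.

```shell
#!/bin/sh
# Blob IDs taken from chgblob.sed above.
OLD_550=7a36af54a6c47a29eb9690caefa132489d39c4d0
OLD_110=abe809091bcb37a06284f8353366074622d72373
NEW_5=8924ef0f78b3d09957a8697ca93cce6700771071

# Hypothetical helper: reads 'git ls-files --stage' lines on stdin and
# prints only the entries whose blob ID was replaced. Anchoring on the
# leading mode field keeps the pattern from matching inside a path name.
swap_blobs() {
    sed -n \
        -e "s/^\([0-7]* \)$OLD_550 /\1$NEW_5 /p" \
        -e "s/^\([0-7]* \)$OLD_110 /\1$NEW_5 /p"
}

# Possible use (assumes this file is saved at the hypothetical path
# /home/blr/tmp/swapblobs.sh and sourced inside the filter):
#   git filter-branch --index-filter '. /home/blr/tmp/swapblobs.sh ;
#     git ls-files --stage | swap_blobs | git update-index --index-info'
```

Commits that never reference either large blob produce no output from
swap_blobs, so nothing is fed to update-index for them.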

Does this look like a reasonable way to accomplish what I'm trying to
do, or am I doing something that's going to cause grief later?

Thanks,
Barry


Thread overview: 6+ messages
2012-03-06 16:09 Replacing large blobs in git history Barry Roberts
2012-03-06 20:49 ` Neal Kreitzinger
2012-03-07 21:27   ` Ævar Arnfjörð Bjarmason
2012-03-08 15:39     ` Holger Hellmuth
2012-03-08 21:22       ` Junio C Hamano
2012-03-07  9:04 ` Michael Haggerty
