* Replacing large blobs in git history
@ 2012-03-06 16:09 Barry Roberts
2012-03-06 20:49 ` Neal Kreitzinger
2012-03-07 9:04 ` Michael Haggerty
0 siblings, 2 replies; 6+ messages in thread
From: Barry Roberts @ 2012-03-06 16:09 UTC (permalink / raw)
To: git
I started this question on #git last week, but this is getting long,
and things have changed some, so I'm going to try here.
I had a 3rd party jar file checked in to our git repository. It was
about 4 mb, so no big deal. Then about 17 months ago somebody checked
in a 550 mb version. There were several versions of the original file
in several different directories. The large version replaced the
small version in some of those directories (but not all of them).
Then somebody found a "small" version that was only 110 mb and
replaced some of the 550 mb files and some of the old 4 mb files.
Finally several months after that we got the correct updated 5 mb
latest version. But I'm still carrying around an extra 660 mb in my
object database, and we are adding developers and moving to an
off-site location with lower bandwidth and higher latency, so I would
like to clean this up.
My first attempt just removed the blob (by hash ID). It's been over a
year since the small correct file was checked in, so the odds of ever
needing to build anything that old are very slim. But after thinking
about it some, I came up with this to replace the blob with the
correct one and wanted to see if this is a reasonable way to do this
before I actually backup and then replace my central git repository.
git filter-branch --index-filter '
    killem=$(git ls-files --stage | grep "7a36af54a6c47\|abe809091bcb3")
    if [ -n "$killem" ]; then
        git ls-files --stage | grep "7a36af54a6c47\|abe809091bcb3" |
            sed -f /home/blr/tmp/chgblob.sed | git update-index --index-info
    fi'
chgblob.sed looks like this:
s/7a36af54a6c47a29eb9690caefa132489d39c4d0/8924ef0f78b3d09957a8697ca93cce6700771071/g
s/abe809091bcb37a06284f8353366074622d72373/8924ef0f78b3d09957a8697ca93cce6700771071/g
7a36af is the 550 mb blob, abe80909 is the 110 mb, and 8924ef0f is the
5 mb new version.
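For anyone following along, the index filter operates on lines in 'git
ls-files --stage' format, and the sed script only rewrites the hash
field. A minimal sketch of the substitution on a single made-up index
line (the jar path is invented for illustration):

```shell
#!/bin/sh
# One line in the format 'git ls-files --stage' emits:
#   <mode> <sha1> <stage>\t<path>        (path made up for the demo)
printf '100644 7a36af54a6c47a29eb9690caefa132489d39c4d0 0\tlib/thirdparty.jar\n' |
  sed -e 's/7a36af54a6c47a29eb9690caefa132489d39c4d0/8924ef0f78b3d09957a8697ca93cce6700771071/g' \
      -e 's/abe809091bcb37a06284f8353366074622d72373/8924ef0f78b3d09957a8697ca93cce6700771071/g'
# Output keeps mode, stage, and path; only the sha1 changes, and that
# line is exactly what 'git update-index --index-info' expects.
```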
This isn't extremely efficient since it does the 'git ls-files
--stage' twice (once to see if the blob is used, then again to change
it ONLY if the blob is referenced in the current index). But that
only adds a few seconds to the 28 minute runtime, so I'm not too
worried about that. And yes, I could just check for the return value
of grep, but I did echo $killem while I was debugging and that was
useful, so I just left it like that.
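For the record, the return-value variant is just grep -q, which exits 0
on the first match without printing anything; a tiny sketch against a
made-up index line (inside the real filter, the same condition would
wrap the 'git ls-files --stage | grep -q ...' pipeline):

```shell
#!/bin/sh
# grep -q stands in for: killem=$(... | grep ...); [ -n "$killem" ]
sample=$(printf '100644 7a36af54a6c47a29eb9690caefa132489d39c4d0 0\tlib/thirdparty.jar')
if printf '%s\n' "$sample" | grep -q '7a36af54a6c47\|abe809091bcb3'; then
  echo "blob referenced in this index"
fi
```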
Does this look like a reasonable way to accomplish what I'm trying to
do, or am I doing something that's going to cause grief later?
Thanks,
Barry
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Replacing large blobs in git history
2012-03-06 16:09 Replacing large blobs in git history Barry Roberts
@ 2012-03-06 20:49 ` Neal Kreitzinger
2012-03-07 21:27 ` Ævar Arnfjörð Bjarmason
2012-03-07 9:04 ` Michael Haggerty
1 sibling, 1 reply; 6+ messages in thread
From: Neal Kreitzinger @ 2012-03-06 20:49 UTC (permalink / raw)
To: Barry Roberts; +Cc: git
On 3/6/2012 10:09 AM, Barry Roberts wrote:
> I started this question on #git last week, but this is getting long,
> and things have changed some, so I'm going to try here.
>
> I had a 3rd party jar file checked in to our git repository. It was
> about 4 mb, so no big deal. Then about 17 months ago somebody
> checked in a 550 mb version. There were several versions of the
> original file in several different directories. The large version
> replaced the small version in some of those directories (but not all
> of them). Then somebody found a "small" version that was only 110 mb
> and replaced some of the 550 mb files and some of the old 4 mb
> files. Finally several months after that we got the correct updated 5
> mb latest version. But I'm still carrying around an extra 660 mb in
> my object database, and we are adding developers and moving to an
> off-site location with lower bandwidth and higher latency, so I
> would like to clean this up.
>
> My first attempt just removed the blob (by hash ID). It's been over
> a year since the small correct file was checked in, so the odds of
> ever needing to build anything that old are very slim. But after
> thinking about it some, I came up with this to replace the blob with
> the correct one and wanted to see if this is a reasonable way to do
> this before I actually backup and then replace my central git
> repository.
>
> git filter-branch --index-filter 'killem=$(git ls-files --stage |
> grep 7a36af54a6c47\\\|abe809091bcb3 ) ; if [ -n "$killem" ] ; then
> git ls-files --stage |grep 7a36af54a6c47\\\|abe809091bcb3 | sed -f
> /home/blr/tmp/chgblob.sed | git update-index --index-info ; fi'
>
> chgblob.sed looks like this:
> s/7a36af54a6c47a29eb9690caefa132489d39c4d0/8924ef0f78b3d09957a8697ca93cce6700771071/g
> s/abe809091bcb37a06284f8353366074622d72373/8924ef0f78b3d09957a8697ca93cce6700771071/g
>
> 7a36af is the 550 mb blob, abe80909 is the 110 mb, and 8924ef0f is
> the 5 mb new version.
>
> This isn't extremely efficient since it does the 'git ls-files
> --stage' twice (once to see if the blob is used, then again to
> change it ONLY if the blob is referenced in the current index). But
> that only adds a few seconds to the 28 minute runtime, so I'm not
> too worried about that. And yes, I could just check for the return
> value of grep, but I did echo $killem while I was debugging and that
> was useful, so I just left it like that.
>
> Does this look like a reasonable way to accomplish what I'm trying
> to do, or am I doing something that's going to cause grief later?
>
Be aware that you are rewriting history. I assume this is published
history that you are going to run filter-branch on. That means everyone
who cloned from the old history (pre-filter-branch), not to mention
those who also have WIP based on the old history, will need to somehow
adjust to the new history. How do you plan on addressing that? (see
git-rebase manpage section "recovering from upstream rebase" for more
info on the implications of rewriting history.)
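To make that concrete, here is a self-contained sketch of the
downstream situation, using throwaway repositories (all names and paths
below are invented; a real recovery with unpushed WIP would replay it
with 'git rebase --onto' as the manpage describes):

```shell
#!/bin/sh
# Simulate: a server whose history gets rewritten, and a clone that
# must re-point its branch afterwards. (Names are made up.)
set -e
tmp=$(mktemp -d); cd "$tmp"

git init -q server
( cd server &&
  git config user.email demo@example.com &&
  git config user.name  demo &&
  echo v1 > file && git add file && git commit -qm "initial" )
branch=$(cd server && git symbolic-ref --short HEAD)

git clone -q "$tmp/server" developer      # a collaborator's clone

# Stand-in for the filter-branch rewrite on the server:
( cd server &&
  echo v1-clean > file && git add file &&
  git commit -q --amend -m "initial" )

# Recovery in the clone: fetch the new history, re-point the branch.
# With unpushed work you would instead replay it, roughly:
#   git rebase --onto origin/$branch <old-origin-tip> <topic-branch>
( cd developer &&
  git fetch -q origin &&
  git reset -q --hard "origin/$branch" )
cat developer/file                        # now the rewritten content
```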
(I have never done filter-branch, and am not an expert on git, but do
find this subject relevant to normal use of git.)
v/r,
neal
* Re: Replacing large blobs in git history
2012-03-06 20:49 ` Neal Kreitzinger
@ 2012-03-07 21:27 ` Ævar Arnfjörð Bjarmason
2012-03-08 15:39 ` Holger Hellmuth
0 siblings, 1 reply; 6+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2012-03-07 21:27 UTC (permalink / raw)
To: Neal Kreitzinger; +Cc: Barry Roberts, git
On Tue, Mar 6, 2012 at 21:49, Neal Kreitzinger <nkreitzinger@gmail.com> wrote:
> On 3/6/2012 10:09 AM, Barry Roberts wrote:
> Be aware that you are rewriting history. I assume this is published
> history that you are going to run filter-branch on. That means everyone who
> cloned from the old history (pre-filter-branch), not to mention those who
> also have WIP based on the old history, will need to somehow adjust to the
> new history.
Does something other than git-fsck actually check whether the
collection of blobs you're getting from the remote when you clone
has sensible sha1's?
What'll happen if he replaces that 550MB blob with a 0 byte blob but
hacks the object store so that it pretends to have the same sha1?
Of course the real solution to this issue is to either rewrite
history, or to change Git to support partially fetching the old blobs
in your project.
* Re: Replacing large blobs in git history
2012-03-07 21:27 ` Ævar Arnfjörð Bjarmason
@ 2012-03-08 15:39 ` Holger Hellmuth
2012-03-08 21:22 ` Junio C Hamano
0 siblings, 1 reply; 6+ messages in thread
From: Holger Hellmuth @ 2012-03-08 15:39 UTC (permalink / raw)
To: Ævar Arnfjörð Bjarmason
Cc: Neal Kreitzinger, Barry Roberts, git
On 07.03.2012 22:27, Ævar Arnfjörð Bjarmason wrote:
> Does something other than git-fsck actually check whether the
> collection of blobs you're getting from the remote when you clone have
> sensible sha1's?
>
> What'll happen if he replaces that 550MB blob with a 0 byte blob but
> hacks the object store so that it pretends to have the same sha1?
This is something I tested once because of security concerns (i.e. what
happens if a malicious intruder just drops something else into the
object store) and if I remember correctly only git-fsck was able to spot
the switch. But I didn't test cloning, only a few local operations.
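For the curious, that experiment is easy to reproduce in a throwaway
repo (the file name and contents below are made up):

```shell
#!/bin/sh
# Swap the bytes behind a blob's sha1 and see which commands notice.
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q victim; cd victim
git config user.email demo@example.com
git config user.name  demo
echo real > file && git add file && git commit -qm "add file"
sha=$(git rev-parse :file)              # the blob the index records

# Write a second blob, then drop its loose-object file over the first:
fake=$(echo fake | git hash-object -w --stdin)
obj=".git/objects/$(echo "$sha"  | cut -c1-2)/$(echo "$sha"  | cut -c3-)"
src=".git/objects/$(echo "$fake" | cut -c1-2)/$(echo "$fake" | cut -c3-)"
rm -f "$obj" && cp "$src" "$obj"

# A casual read trusts the filename and returns the wrong bytes:
git cat-file blob "$sha"                # prints "fake", no complaint
# Only fsck recomputes every object's hash and flags the mismatch:
git fsck >/dev/null 2>&1 || echo "git fsck flagged the corruption"
```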
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Replacing large blobs in git history
2012-03-08 15:39 ` Holger Hellmuth
@ 2012-03-08 21:22 ` Junio C Hamano
0 siblings, 0 replies; 6+ messages in thread
From: Junio C Hamano @ 2012-03-08 21:22 UTC (permalink / raw)
To: Holger Hellmuth
Cc: Ævar Arnfjörð Bjarmason, Neal Kreitzinger,
Barry Roberts, git
Holger Hellmuth <hellmuth@ira.uka.de> writes:
> On 07.03.2012 22:27, Ævar Arnfjörð Bjarmason wrote:
>> Does something other than git-fsck actually check whether the
>> collection of blobs you're getting from the remote when you clone have
>> sensible sha1's?
>>
>> What'll happen if he replaces that 550MB blob with a 0 byte blob but
>> hacks the object store so that it pretends to have the same sha1?
>
> This is something I tested once because of security concerns
> (i.e. what happens if a malicious intruder just drops something else
> into the object store) and if I remember correctly only git-fsck was
> able to spot the switch. But I didn't test cloning, only a few local
> operations.
Local operations that do not have to look at such a corrupt blob will
not verify everything under the sun every time, for obvious reasons.
An operation to transfer objects out of the repository (e.g. serving
as the source of "clone" from elsewhere) will notice when it has to
send such a corrupt object and you will be prevented from spreading
the damage.
The same goes for a transfer in the reverse direction. When the
other side tells us that it is giving us everything we asked for, we
still check all the objects we received to make sure.
* Re: Replacing large blobs in git history
2012-03-06 16:09 Replacing large blobs in git history Barry Roberts
2012-03-06 20:49 ` Neal Kreitzinger
@ 2012-03-07 9:04 ` Michael Haggerty
1 sibling, 0 replies; 6+ messages in thread
From: Michael Haggerty @ 2012-03-07 9:04 UTC (permalink / raw)
To: Barry Roberts; +Cc: git
On 03/06/2012 05:09 PM, Barry Roberts wrote:
> I started this question on #git last week, but this is getting long,
> and things have changed some, so I'm going to try here.
>
> I had a 3rd party jar file checked in to our git repository. It was
> about 4 mb, so no big deal. Then about 17 months ago somebody checked
> in a 550 mb version. There were several versions of the original file
> in several different directories. The large version replaced the
> small version in some of those directories (but not all of them).
> Then somebody found a "small" version that was only 110 mb and
> replaced some of the 550 mb files and some of the old 4 mb files.
> Finally several months after that we got the correct updated 5 mb
> latest version. But I'm still carrying around an extra 660 mb in my
> object database, and we are adding developers and moving to an
> off-site location with lower bandwidth and higher latency, so I would
> like to clean this up.
>
> My first attempt just removed the blob (by hash ID). It's been over a
> year since the small correct file was checked in, so the odds of ever
> needing to build anything that old are very slim. But after thinking
> about it some, I came up with this to replace the blob with the
> correct one and wanted to see if this is a reasonable way to do this
> before I actually backup and then replace my central git repository.
>
> git filter-branch --index-filter 'killem=$(git ls-files --stage |
> grep 7a36af54a6c47\\\|abe809091bcb3 ) ; if [ -n "$killem" ] ; then git
> ls-files --stage |grep 7a36af54a6c47\\\|abe809091bcb3 | sed -f
> /home/blr/tmp/chgblob.sed | git update-index --index-info ; fi'
>
> chgblob.sed looks like this:
> s/7a36af54a6c47a29eb9690caefa132489d39c4d0/8924ef0f78b3d09957a8697ca93cce6700771071/g
> s/abe809091bcb37a06284f8353366074622d72373/8924ef0f78b3d09957a8697ca93cce6700771071/g
>
> 7a36af is the 550 mb blob, abe80909 is the 110 mb, and 8924ef0f is the
> 5 mb new version.
You could use "git replace" to cause the bad blobs to be replaced
everywhere they appear:
$ git replace 7a36af54a6c47a29eb9690caefa132489d39c4d0 \
8924ef0f78b3d09957a8697ca93cce6700771071
$ git replace abe809091bcb37a06284f8353366074622d72373 \
8924ef0f78b3d09957a8697ca93cce6700771071
Then you could use "git filter-branch" to "bake in" the substitutions
(but please see the caveats mentioned by Neal).
It seems like an alternative to using "git filter-branch" would be to
share the "git replace" references across repositories. This would make
the small versions of the file appear wherever they should without
requiring history to be rewritten at all. But I don't believe that
this approach would allow the large versions of the file to be discarded
by the git garbage collector, so it would not help you reduce clone sizes.
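To make the mechanics concrete: a replace ref lives under refs/replace/
and redirects reads of one object to another, and since it is an
ordinary ref it can be pushed and fetched with an explicit refspec.
A small sketch in a throwaway repo (the blob contents stand in for the
real jars, and 'origin' in the comments is a placeholder):

```shell
#!/bin/sh
# Minimal demonstration of the replace mechanism.
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q demo; cd demo

big=$(echo "pretend this is the 550 mb jar" | git hash-object -w --stdin)
small=$(echo "the 5 mb jar" | git hash-object -w --stdin)

git replace "$big" "$small"

# Reads of $big are now transparently redirected to $small:
git cat-file blob "$big"                      # prints "the 5 mb jar"
git --no-replace-objects cat-file blob "$big" # original still there

# The mapping is an ordinary ref, so it can be shared explicitly:
git for-each-ref refs/replace/
#   git push  origin 'refs/replace/*:refs/replace/*'
#   git fetch origin 'refs/replace/*:refs/replace/*'
```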
Michael
--
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/