git.vger.kernel.org archive mirror
* Trying to use git-filter-branch to compress history by removing large, obsolete binary files
@ 2007-10-07 21:23 Elijah Newren
  2007-10-07 21:38 ` Frank Lichtenheld
  2007-10-07 22:08 ` Alex Riesen
  0 siblings, 2 replies; 31+ messages in thread
From: Elijah Newren @ 2007-10-07 21:23 UTC (permalink / raw)
  To: git

Hi,

I'm using git-cvsimport to import some CVS repos, which unfortunately
included dozens of large regression test output files in their ancient
history...some of which measure hundreds of megabytes in size.  I'd
like to prune them out of the git history (I don't have access to
prune them out of the CVS history), but I'm running into problems.

The following set of instructions will duplicate my problem with a
smaller repo; why is the local git repository bigger after running
git-filter-branch rather than smaller as I'd expect?  I'm probably
missing something obvious, but I have no idea what it is.

The steps:

# Make a small repo
mkdir test
cd test
git init
echo hi > there
git add there
git commit -m 'Small repo'

# Add a random 10M binary file
dd if=/dev/urandom of=testme.txt count=10 bs=1M
git add testme.txt
git commit -m 'Add big binary file'

# Remove the 10M binary file
git rm testme.txt
git commit -m 'Remove big binary file'

# Compress the repo, see how big the repo is
git gc --aggressive --prune
du -ks .                       # 10548K
du -ks .git                    # 10532K

# Try to rewrite history to remove the binary file
git-filter-branch --tree-filter 'rm -f testme.txt' HEAD
git reset --hard

# Try to recompress and clean up, then check the new size
git gc --aggressive --prune
du -ks .                       # 10580K !?!?!?
du -ks .git                    # 10564K
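
One way to see what is taking up the space is to ask Git which packed object is largest and which path it was committed under. The following is only a sketch: it rebuilds the test repo with a 1M file instead of 10M, and the temporary path, the identity settings and the variable names are illustrative assumptions rather than anything from this message.

```shell
#!/bin/sh
# Sketch: find the largest blob in a repository's packs and the path it was stored under.
set -e
repo=$(mktemp -d)                        # throwaway path, not from the thread
cd "$repo"
git init -q
git config user.name "Example User"      # placeholder identity for the commits
git config user.email user@example.com

echo hi > there
git add there
git commit -q -m 'Small repo'
dd if=/dev/urandom of=testme.txt bs=1024 count=1024 2>/dev/null   # 1M keeps the demo quick
git add testme.txt
git commit -q -m 'Add big binary file'
git rm -q testme.txt
git commit -q -m 'Remove big binary file'
git gc -q                                # move the loose objects into a pack

# verify-pack -v prints "<id> <type> <size> <size-in-pack> <offset>" per object;
# keep only the blobs, sort them by size and take the largest.
biggest=$(git verify-pack -v .git/objects/pack/pack-*.idx |
    grep ' blob ' | sort -k3,3n | tail -n 1)
big_id=${biggest%% *}

# rev-list --objects prints "<id> <path>", which maps the id back to a file name.
big_path=$(git rev-list --objects --all | grep "^$big_id" | cut -d' ' -f2-)
echo "largest blob: $big_id ($big_path)"
```

The file is reported even though it was deleted in the last commit, because the commit that added it is still part of the history.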


Thanks,
Elijah

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files
  2007-10-07 21:23 Trying to use git-filter-branch to compress history by removing large, obsolete binary files Elijah Newren
@ 2007-10-07 21:38 ` Frank Lichtenheld
  2007-10-07 22:00   ` Elijah Newren
  2007-10-07 22:08 ` Alex Riesen
  1 sibling, 1 reply; 31+ messages in thread
From: Frank Lichtenheld @ 2007-10-07 21:38 UTC (permalink / raw)
  To: Elijah Newren; +Cc: git

On Sun, Oct 07, 2007 at 03:23:59PM -0600, Elijah Newren wrote:
> The following set of instructions will duplicate my problem with a
> smaller repo; why is the local git repository bigger after running
> git-filter-branch rather than smaller as I'd expect?  I'm probably
> missing something obvious, but I have no idea what it is.

The usual suspect would be the reflog.

Gruesse,
-- 
Frank Lichtenheld <frank@lichtenheld.de>
www: http://www.djpig.de/


* Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files
  2007-10-07 21:38 ` Frank Lichtenheld
@ 2007-10-07 22:00   ` Elijah Newren
  2007-10-07 22:19     ` Alex Riesen
  2007-10-07 23:19     ` Johannes Schindelin
  0 siblings, 2 replies; 31+ messages in thread
From: Elijah Newren @ 2007-10-07 22:00 UTC (permalink / raw)
  To: Frank Lichtenheld; +Cc: git

On 10/7/07, Frank Lichtenheld <frank@lichtenheld.de> wrote:
> On Sun, Oct 07, 2007 at 03:23:59PM -0600, Elijah Newren wrote:
> > The following set of instructions will duplicate my problem with a
> > smaller repo; why is the local git repository bigger after running
> > git-filter-branch rather than smaller as I'd expect?  I'm probably
> > missing something obvious, but I have no idea what it is.
>
> The usual suspect would be the reflog.

The git-filter-branch documentation mentions creating refs/original
under .git.  Unfortunately, it doesn't contain any links or
documentation on how I'd clean those out and I haven't been able to
figure it out.  I asked on #git how to clean these out and got some
answers that didn't work (git branch -d and something else I don't
remember).  So...how do I fix the reflog, and then repack to have a
pack under 11MB in size?

Thanks,
Elijah


* Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files
  2007-10-07 21:23 Trying to use git-filter-branch to compress history by removing large, obsolete binary files Elijah Newren
  2007-10-07 21:38 ` Frank Lichtenheld
@ 2007-10-07 22:08 ` Alex Riesen
  1 sibling, 0 replies; 31+ messages in thread
From: Alex Riesen @ 2007-10-07 22:08 UTC (permalink / raw)
  To: Elijah Newren; +Cc: git

Elijah Newren, Sun, Oct 07, 2007 23:23:59 +0200:
> # Try to recompress and clean up, then check the new size
> git gc --aggressive --prune
> du -ks .                       # 10580K !?!?!?
> du -ks .git                    # 10564K

git-filter-branch makes a backup of your original references:

$ git filter-branch --help
...
       Always verify that the rewritten version is correct: The original refs,
       if different from the rewritten ones, will be stored in the namespace
       refs/original/.
...

These will keep your big files in the repository.
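
To check this on a scratch repository, the backup refs can be listed with git for-each-ref. This is a minimal sketch, assuming a reasonably recent Git; the temporary path, the 1M file size, the identity settings and the FILTER_BRANCH_SQUELCH_WARNING setting are additions for the demo and do not come from the thread.

```shell
#!/bin/sh
# Sketch: show that filter-branch leaves a backup ref that still reaches the big file.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # read only by newer Git; skips its deprecation pause
repo=$(mktemp -d)                        # throwaway path, not from the thread
cd "$repo"
git init -q
git config user.name "Example User"      # placeholder identity for the commits
git config user.email user@example.com

echo hi > there
git add there
git commit -q -m 'Small repo'
dd if=/dev/urandom of=testme.txt bs=1024 count=1024 2>/dev/null
git add testme.txt
git commit -q -m 'Add big binary file'
git rm -q testme.txt
git commit -q -m 'Remove big binary file'

git filter-branch --tree-filter 'rm -f testme.txt' HEAD >/dev/null

# The rewritten branch gets a backup under refs/original/.
backup=$(git for-each-ref --format='%(refname)' refs/original)
echo "backup ref: $backup"

# The old commits that touched the file are still reachable through the backup.
git log --oneline "$backup" -- testme.txt
```

As long as that ref exists, gc treats the old commits and the blob as reachable and keeps them.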


* Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files
  2007-10-07 22:00   ` Elijah Newren
@ 2007-10-07 22:19     ` Alex Riesen
  2007-10-07 22:24       ` Elijah Newren
  2007-10-07 23:19     ` Johannes Schindelin
  1 sibling, 1 reply; 31+ messages in thread
From: Alex Riesen @ 2007-10-07 22:19 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Frank Lichtenheld, git

Elijah Newren, Mon, Oct 08, 2007 00:00:51 +0200:
> On 10/7/07, Frank Lichtenheld <frank@lichtenheld.de> wrote:
> > On Sun, Oct 07, 2007 at 03:23:59PM -0600, Elijah Newren wrote:
> > > The following set of instructions will duplicate my problem with a
> > > smaller repo; why is the local git repository bigger after running
> > > git-filter-branch rather than smaller as I'd expect?  I'm probably
> > > missing something obvious, but I have no idea what it is.
> >
> > The usual suspect would be the reflog.
> 
> The git-filter-branch documentation mentions creating refs/original
> under .git.  Unfortunately, it doesn't contain any links or
> documentation on how I'd clean those out and I haven't been able to
> figure it out.  I asked on #git how to clean these out and got some
> answers that didn't work (git branch -d and something else I don't
> remember).

rm -rf .git/refs/original/refs/heads/<the branch where HEAD pointed to>
(assuming you haven't repacked yet)

or just edit .git/packed-refs and remove everything "refs/original"
which fits the criteria

> So...how do I fix the reflog, and then repack to have a
> pack under 11MB in size?

git reflog expire --all (it is a bit too much; you can just edit
.git/logs/* in any text editor)
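
An alternative to removing files under .git by hand is to delete the backup refs with git update-ref -d, which should work whether a ref is stored loose or in packed-refs. A minimal sketch on a scratch repository; the temporary path, the identity settings, the 1M file size and the pack-refs call (added only to exercise the packed case) are assumptions for the demo.

```shell
#!/bin/sh
# Sketch: delete every ref under refs/original without touching .git by hand.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # read only by newer Git; skips its deprecation pause
repo=$(mktemp -d)                        # throwaway path, not from the thread
cd "$repo"
git init -q
git config user.name "Example User"      # placeholder identity for the commits
git config user.email user@example.com

echo hi > there
git add there
git commit -q -m 'Small repo'
dd if=/dev/urandom of=testme.txt bs=1024 count=1024 2>/dev/null
git add testme.txt
git commit -q -m 'Add big binary file'
git rm -q testme.txt
git commit -q -m 'Remove big binary file'
git filter-branch --tree-filter 'rm -f testme.txt' HEAD >/dev/null

git pack-refs --all                      # simulate the "already repacked" situation

# Walk the backup namespace and drop each ref, loose or packed.
git for-each-ref --format='%(refname)' refs/original |
while read -r ref; do
    git update-ref -d "$ref"
done

remaining=$(git for-each-ref refs/original)
echo "backup refs left: ${remaining:-none}"
```

This only removes the refs. The objects stay on disk until the reflog entries that mention them have expired and gc has pruned them.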


* Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files
  2007-10-07 22:19     ` Alex Riesen
@ 2007-10-07 22:24       ` Elijah Newren
  2007-10-07 23:40         ` Alex Riesen
  2007-10-07 23:43         ` Dmitry Potapov
  0 siblings, 2 replies; 31+ messages in thread
From: Elijah Newren @ 2007-10-07 22:24 UTC (permalink / raw)
  To: Alex Riesen; +Cc: Frank Lichtenheld, git

On 10/7/07, Alex Riesen <raa.lkml@gmail.com> wrote:
<snip>
> rm -rf .git/refs/original/refs/heads/<the branch where HEAD pointed to>
> (assuming you haven't repacked yet)
>
> or just edit .git/packed-refs and remove everything "refs/original"
> which fits the criteria
>
> > So...how do I fix the reflog, and then repack to have a
> > pack under 11MB in size?
>
> git reflog expire --all (it is a bit too much; you can just edit
> .git/logs/* in any text editor)

So...

$ du -hs .
11M     .
$ rm -rf .git/refs/original/
$ vi .git/packed-refs
# Remove the line referring to refs/original...
$ git reflog expire --all
$ git gc --aggressive --prune
$ du -hs .
11M     .

It's still 11MB.

Any other ideas?

Elijah


* Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files
  2007-10-07 22:00   ` Elijah Newren
  2007-10-07 22:19     ` Alex Riesen
@ 2007-10-07 23:19     ` Johannes Schindelin
  2007-10-07 23:24       ` Elijah Newren
  1 sibling, 1 reply; 31+ messages in thread
From: Johannes Schindelin @ 2007-10-07 23:19 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Frank Lichtenheld, git

Hi,

On Sun, 7 Oct 2007, Elijah Newren wrote:

> So...how do I fix the reflog, and then repack to have a pack under 11MB 
> in size?

Just clone it.  The clone will be much smaller.

Ciao,
Dscho


* Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files
  2007-10-07 23:19     ` Johannes Schindelin
@ 2007-10-07 23:24       ` Elijah Newren
  2007-10-07 23:28         ` Johannes Schindelin
  0 siblings, 1 reply; 31+ messages in thread
From: Elijah Newren @ 2007-10-07 23:24 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Frank Lichtenheld, git

On 10/7/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> Hi,
>
> On Sun, 7 Oct 2007, Elijah Newren wrote:
>
> > So...how do I fix the reflog, and then repack to have a pack under 11MB
> > in size?
>
> Just clone it.  The clone will be much smaller.

$ git clone test test2
<snip>
$ du -hs test
11M     test
$ du -hs test2
11M     test2

Any other ideas?


* Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files
  2007-10-07 23:24       ` Elijah Newren
@ 2007-10-07 23:28         ` Johannes Schindelin
  2007-10-07 23:38           ` Elijah Newren
  0 siblings, 1 reply; 31+ messages in thread
From: Johannes Schindelin @ 2007-10-07 23:28 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Frank Lichtenheld, git

Hi,

On Sun, 7 Oct 2007, Elijah Newren wrote:

> On 10/7/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
>
> > On Sun, 7 Oct 2007, Elijah Newren wrote:
> >
> > > So...how do I fix the reflog, and then repack to have a pack under 
> > > 11MB in size?
> >
> > Just clone it.  The clone will be much smaller.
> 
> $ git clone test test2
> <snip>
> $ du -hs test
> 11M     test
> $ du -hs test2
> 11M     test2
> 
> Any other ideas?

Yep.  Maybe it is necessary to run "git gc" in test2.

Ciao,
Dscho


* Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files
  2007-10-07 23:28         ` Johannes Schindelin
@ 2007-10-07 23:38           ` Elijah Newren
  2007-10-08  0:34             ` Johannes Schindelin
  0 siblings, 1 reply; 31+ messages in thread
From: Elijah Newren @ 2007-10-07 23:38 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Frank Lichtenheld, git

Hi,

On 10/7/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> On Sun, 7 Oct 2007, Elijah Newren wrote:
<snip>
> > $ git clone test test2
> > <snip>
> > $ du -hs test
> > 11M     test
> > $ du -hs test2
> > 11M     test2
> >
> > Any other ideas?
>
> Yep.  Maybe it is necessary to run "git gc" in test2.

Sweet, finally solved!  That brings test2 down to 340K.

However, the solution seems somewhat involved...it requires running
git-filter-branch, git reset, removing the .git/refs/original/
directory, editing .git/packed-refs in some editor, running git reflog
expire, cloning the resulting repository, and running git gc yet
again.  It seems like there has to be an easier way.  (Anyone have
one?)

Oh, and git-filter-branch could really use some explanatory note about
how to actually complete rewriting the history.

Thanks,
Elijah


* Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files
  2007-10-07 22:24       ` Elijah Newren
@ 2007-10-07 23:40         ` Alex Riesen
  2007-10-08  0:09           ` Elijah Newren
  2007-10-07 23:43         ` Dmitry Potapov
  1 sibling, 1 reply; 31+ messages in thread
From: Alex Riesen @ 2007-10-07 23:40 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Frank Lichtenheld, git

Elijah Newren, Mon, Oct 08, 2007 00:24:49 +0200:
> On 10/7/07, Alex Riesen <raa.lkml@gmail.com> wrote:
> > rm -rf .git/refs/original/refs/heads/<the branch where HEAD pointed to>
> > (assuming you haven't repacked yet)
> >
> > or just edit .git/packed-refs and remove everything "refs/original"
> > which fits the criteria
> >
> > > So...how do I fix the reflog, and then repack to have a
> > > pack under 11MB in size?
> >
> > git reflog expire --all (it is a bit too much; you can just edit
> > .git/logs/* in any text editor)
> 
> So...
> 
> $ du -hs .
> 11M     .
> $ rm -rf .git/refs/original/
> $ vi .git/packed-refs
> # Remove the line referring to refs/original...
> $ git reflog expire --all
> $ git gc --aggressive --prune
> $ du -hs .
> 11M     .
> 
> It's still 11MB.
> 
> Any other ideas?

you missed something. Your example compresses to about 124k.


* Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files
  2007-10-07 22:24       ` Elijah Newren
  2007-10-07 23:40         ` Alex Riesen
@ 2007-10-07 23:43         ` Dmitry Potapov
  2007-10-08  0:22           ` Elijah Newren
  1 sibling, 1 reply; 31+ messages in thread
From: Dmitry Potapov @ 2007-10-07 23:43 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Alex Riesen, Frank Lichtenheld, git

On Sun, Oct 07, 2007 at 04:24:49PM -0600, Elijah Newren wrote:
> $ git reflog expire --all
> $ git gc --aggressive --prune

I believe this should work:

git reflog expire --all --expire-unreachable=0
git gc --prune

Warning: all unreachable references will be removed!
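
Put together on a scratch repository, the whole sequence might look like the sketch below. It uses the spellings that current Git documents (--expire-unreachable=now and gc --prune=now) rather than the exact options quoted above, and the temporary path, identity settings, 1M file size and variable names are assumptions for the demo.

```shell
#!/bin/sh
# Sketch: rewrite history, drop the backups and reflog entries, then prune and compare sizes.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # read only by newer Git; skips its deprecation pause
repo=$(mktemp -d)                        # throwaway path, not from the thread
cd "$repo"
git init -q
git config user.name "Example User"      # placeholder identity for the commits
git config user.email user@example.com

echo hi > there
git add there
git commit -q -m 'Small repo'
dd if=/dev/urandom of=testme.txt bs=1024 count=1024 2>/dev/null
git add testme.txt
git commit -q -m 'Add big binary file'
big_id=$(git rev-parse HEAD:testme.txt)  # remember the blob so we can check it later
git rm -q testme.txt
git commit -q -m 'Remove big binary file'

git filter-branch --tree-filter 'rm -f testme.txt' HEAD >/dev/null
before=$(du -ks .git | cut -f1)

# 1. Remove the backup refs that filter-branch created.
git for-each-ref --format='%(refname)' refs/original |
while read -r ref; do
    git update-ref -d "$ref"
done

# 2. Expire reflog entries for commits no longer reachable from the branch tips.
git reflog expire --expire-unreachable=now --all

# 3. Repack and delete unreachable objects immediately.
git gc -q --prune=now

after=$(du -ks .git | cut -f1)
echo "before: ${before}K  after: ${after}K"
```

All three steps matter: either a leftover backup ref or a leftover reflog entry is enough to keep the old blob alive through gc.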

Dmitry


* Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files
  2007-10-07 23:40         ` Alex Riesen
@ 2007-10-08  0:09           ` Elijah Newren
  2007-10-08  6:15             ` Alex Riesen
  0 siblings, 1 reply; 31+ messages in thread
From: Elijah Newren @ 2007-10-08  0:09 UTC (permalink / raw)
  To: Alex Riesen; +Cc: Frank Lichtenheld, git

On 10/7/07, Alex Riesen <raa.lkml@gmail.com> wrote:
> you missed something. Your example compresses to about 124k.

What version of git are you running?  I reran all the steps to which
you responded (repeated below for clarity) with git-1.5.3.3 and still
get 11MB.  Also, you must have different filesystem extents than me
since an empty git repo takes 196k here[1], so I don't think any repo
is going to get down to 124k.

My understanding of the steps you suggest would work:

# Make a small repo
mkdir test
cd test
git init
echo hi > there
git add there
git commit -m 'Small repo'

# Add a random 10M binary file
dd if=/dev/urandom of=testme.txt count=10 bs=1M
git add testme.txt
git commit -m 'Add big binary file'

# Remove the 10M binary file
git rm testme.txt
git commit -m 'Remove big binary file'

# Compress the repo, see how big the repo is
git gc --aggressive --prune
du -ks .                       # 10548K
du -ks .git                    # 10532K

# Try to rewrite history to remove the binary file
git-filter-branch --tree-filter 'rm -f testme.txt' HEAD
git reset --hard

# Try to recompress and clean up, then check the new size
git gc --aggressive --prune
du -ks .                       # 10580K !?!?!?
du -ks .git                    # 10564K

# Do the stuff Alex suggests to trim the history
rm -rf .git/refs/original/
vi .git/packed-refs
# Use vi to remove the line referring to refs/original...
git reflog expire --all
git gc --aggressive --prune
du -ks .                      # Still 10564K


Thanks,
Elijah

[1] An empty git repo takes 196k for me, as can be seen by:
$ mkdir tmp
$ cd tmp
$ git init
$ du -hs .
196K    .


* Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files
  2007-10-07 23:43         ` Dmitry Potapov
@ 2007-10-08  0:22           ` Elijah Newren
  2007-10-08  1:06             ` Dmitry Potapov
  0 siblings, 1 reply; 31+ messages in thread
From: Elijah Newren @ 2007-10-08  0:22 UTC (permalink / raw)
  To: Dmitry Potapov; +Cc: Alex Riesen, Frank Lichtenheld, git

On 10/7/07, Dmitry Potapov <dpotapov@gmail.com> wrote:
> On Sun, Oct 07, 2007 at 04:24:49PM -0600, Elijah Newren wrote:
> > $ git reflog expire --all
> > $ git gc --aggressive --prune
>
> I believe this should work:
>
> git reflog expire --all --expire-unreachable=0
> git gc --prune

Yes, this seems to work.  So the history-rewriting steps are

git-filter-branch --tree-filter 'rm -f testme.txt' HEAD
git reset --hard
rm -rf .git/refs/original/
vi .git/packed-refs
# Use vi to remove the line referring to refs/original...
git reflog expire --all --expire-unreachable=0
git gc --prune

Seems like a wrapper is needed.  :-)

> Warning: all unreachable references will be removed!

What other scenarios could lead to unreachable references?  I don't
know how to determine whether this is safe or not (except that these
were test repositories anyway, so I don't care what happens to them).

Thanks!
Elijah


* Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files
  2007-10-07 23:38           ` Elijah Newren
@ 2007-10-08  0:34             ` Johannes Schindelin
  2007-10-08  0:47               ` Elijah Newren
  2007-10-08  1:00               ` J. Bruce Fields
  0 siblings, 2 replies; 31+ messages in thread
From: Johannes Schindelin @ 2007-10-08  0:34 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Frank Lichtenheld, git

Hi,

On Sun, 7 Oct 2007, Elijah Newren wrote:

> On 10/7/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > On Sun, 7 Oct 2007, Elijah Newren wrote:
> <snip>
> > > $ git clone test test2
> > > <snip>
> > > $ du -hs test
> > > 11M     test
> > > $ du -hs test2
> > > 11M     test2
> > >
> > > Any other ideas?
> >
> > Yep.  Maybe it is necessary to run "git gc" in test2.
> 
> Sweet, finally solved!  That brings test2 down to 340K.
> 
> However, the solution seems somewhat involved...it requires running 
> git-filter-branch, git reset, removing the .git/refs/original/ 
> directory, editing .git/packed-refs in some editor, running git reflog 
> expire, cloning the resulting repository, and running git gc yet again.  
> It seems like there has to be an easier way.  (Anyone have one?)

It should be as easy as git filter-branch and git clone.

> Oh, and git-filter-branch could really use some explanatory note about 
> how to actually complete rewriting the history.

It does what it should do.  It is _your_ task to check refs/original/* 
to see whether everything went all right.  Then you just delete the
refs you have checked.

What made your case so cumbersome was that you wanted the big objects out 
_now_, instead of having them in for a grace period.  BTW this grace 
period is in place to help _you_, not the program.  (In case you fscked up 
and need those objects back.)

Ciao,
Dscho


* Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files
  2007-10-08  0:34             ` Johannes Schindelin
@ 2007-10-08  0:47               ` Elijah Newren
  2007-10-08  2:28                 ` Sam Vilain
  2007-10-08  1:00               ` J. Bruce Fields
  1 sibling, 1 reply; 31+ messages in thread
From: Elijah Newren @ 2007-10-08  0:47 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Frank Lichtenheld, git

On 10/7/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> It should be as easy as git filter-branch and git clone.

Yes, a git filter-branch, git clone, AND git gc in the clone avoids
all those funny ref editing commands.  However, cloning a 5.6GB repo
(the size of one of the real repos I'm dealing with) will likely take
a long time (and may push me past the limits of disk space), so using
other steps to avoid the need to clone actually seems nicer.

> > Oh, and git-filter-branch could really use some explanatory note about
> > how to actually complete rewriting the history.
>
> It does what it should do.  It is _your_ task to check refs/original/*
> to see whether everything went all right.  Then you just delete the
> refs you have checked.
>
> What made your case so cumbersome was that you wanted the big objects out
> _now_, instead of having them in for a grace period.  BTW this grace
> period is in place to help _you_, not the program.  (In case you fscked up
> and need those objects back.)

Sure, I think that's a sane default.  And I think it's fine that it
should be my task to look at the refs to check that everything worked
okay and delete them.  But it's nearly impossible to figure out how to
do that!  _That_ is my complaint.  I got multiple misleading or
incomplete answers (both on this list and in #git) before getting some
working solutions, so this task is obviously far from trivial.  I
really think that adding instructions about how to check and delete
the relevant refs would be a very useful addition to the
documentation.

Thanks everyone for the help!

Elijah


* Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files
  2007-10-08  0:34             ` Johannes Schindelin
  2007-10-08  0:47               ` Elijah Newren
@ 2007-10-08  1:00               ` J. Bruce Fields
  2007-10-08  1:06                 ` Johannes Schindelin
  1 sibling, 1 reply; 31+ messages in thread
From: J. Bruce Fields @ 2007-10-08  1:00 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Elijah Newren, Frank Lichtenheld, git

On Mon, Oct 08, 2007 at 01:34:07AM +0100, Johannes Schindelin wrote:
> It does what it should do.  It is _your_ task to check refs/original/* 
> to see whether everything went all right.  Then you just delete the
> refs you have checked.

It seems odd to me, by the way, that filter-branch has its own
home-grown backup mechanism.  Lots of other commands can "lose" commits,
but none of them keep an extra backup like this.

And I find it tedious for quicker jobs which it might otherwise be
useful for (e.g. rewrites of commits in my tree not yet in upstream),
unless I wrap it in a script that cleans up after itself.

--b.


* Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files
  2007-10-08  1:00               ` J. Bruce Fields
@ 2007-10-08  1:06                 ` Johannes Schindelin
  2007-10-08  6:22                   ` Johannes Sixt
  0 siblings, 1 reply; 31+ messages in thread
From: Johannes Schindelin @ 2007-10-08  1:06 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Elijah Newren, Frank Lichtenheld, git

Hi,

On Sun, 7 Oct 2007, J. Bruce Fields wrote:

> On Mon, Oct 08, 2007 at 01:34:07AM +0100, Johannes Schindelin wrote:
> > It does what it should do.  It is _your_ task to check refs/original/* 
> > to see whether everything went all right.  Then you just delete the
> > refs you have checked.
> 
> It seems odd to me, by the way, that filter-branch has its own 
> home-grown backup mechanism.  Lots of other commands can "lose" commits, 
> but none of them keep an extra backup like this.

The rationale was this: filter-branch recently learnt how to rewrite many 
branches, and it might be tedious to find out which ones.  But then, there 
is git log --no-walk --all, so maybe I really should get rid of 
refs/original/*?

I'd like to have some comments from the heavier filter-branch users on 
that...

Ciao,
Dscho


* Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files
  2007-10-08  0:22           ` Elijah Newren
@ 2007-10-08  1:06             ` Dmitry Potapov
  2007-10-08  9:27               ` Andreas Ericsson
  0 siblings, 1 reply; 31+ messages in thread
From: Dmitry Potapov @ 2007-10-08  1:06 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Alex Riesen, Frank Lichtenheld, git

On Sun, Oct 07, 2007 at 06:22:28PM -0600, Elijah Newren wrote:
> 
> git-filter-branch --tree-filter 'rm -f testme.txt' HEAD
> git reset --hard
> rm -rf .git/refs/original/
> vi .git/packed-refs
> # Use vi to remove the line referring to refs/original...
> git reflog expire --all --expire-unreachable=0
> git gc --prune
> 
> Seems like a wrapper is needed.  :-)

Actually, I would rather not, because you rarely need to remove anything
immediately, and a 30-day delay is a reasonable time to give you a chance
to recover what you removed accidentally. You can reduce it by setting
an appropriate value for gc.reflogExpireUnreachable in your configuration.
The only thing you need to do is to remove .git/refs/original/refs/heads/something
after you are sure that git-filter-branch did exactly what you wanted.
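
Shortening that window for a single repository could look like the sketch below. The value "7 days" is an arbitrary example and the temporary repository path is an assumption for the demo.

```shell
#!/bin/sh
# Sketch: shorten the grace period for reflog entries of unreachable commits in one repo.
set -e
repo=$(mktemp -d)                        # throwaway path, not from the thread
cd "$repo"
git init -q

# Stored in this repository's .git/config; "7 days" is only an example value.
git config gc.reflogExpireUnreachable "7 days"

# Read the setting back to confirm what a later "git gc" will use.
expiry=$(git config --get gc.reflogExpireUnreachable)
echo "unreachable reflog entries expire after: $expiry"
```

With this set, a plain "git gc" expires those reflog entries sooner, with no need to pass options to git reflog expire by hand.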

> 
> > Warning: all unreachable references will be removed!
> 
> What other scenarios could lead to unreachable references?

Any re-writing of history leads to that.

> I don't
> know how to determine whether this is safe or not (except that these
> were test repositories anyway, so I don't care what happens to them).

Git logs all your actions, so even rewriting history would not be
so disastrous if you suddenly realized that you did something wrong.
Reflog entries for unreachable commits are kept for 30 days by default.
Usually, you do not need to mess with Git internals like you did above.
Your useless files will still disappear after being unreachable for 30 days.

OTOH, if you want to have a clean repository immediately, I believe
'git clone' is a better option. After you made a local clone using
it, 'git gc' should remove old garbage.
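
A note on why the earlier plain-path clone stayed at 11M: a local clone given a filesystem path reuses the existing object files, so the old pack comes along and only a later gc drops it. Cloning through a file:// URL goes through the normal transfer protocol, which sends only objects reachable from the cloned branches. A minimal sketch, with the temporary paths, identity settings and 1M file size as assumptions for the demo:

```shell
#!/bin/sh
# Sketch: clone a rewritten repo through file:// so that only reachable objects are copied.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # read only by newer Git; skips its deprecation pause
repo=$(mktemp -d)                        # throwaway path, not from the thread
cd "$repo"
git init -q
git config user.name "Example User"      # placeholder identity for the commits
git config user.email user@example.com

echo hi > there
git add there
git commit -q -m 'Small repo'
dd if=/dev/urandom of=testme.txt bs=1024 count=1024 2>/dev/null
git add testme.txt
git commit -q -m 'Add big binary file'
big_id=$(git rev-parse HEAD:testme.txt)  # remember the blob so we can check it later
git rm -q testme.txt
git commit -q -m 'Remove big binary file'
git filter-branch --tree-filter 'rm -f testme.txt' HEAD >/dev/null

# The source still holds the blob (backup ref and reflog); the clone never receives it.
clone_parent=$(mktemp -d)
clone="$clone_parent/test2"
git clone -q "file://$repo" "$clone"

src_kb=$(du -ks "$repo/.git" | cut -f1)
dst_kb=$(du -ks "$clone/.git" | cut -f1)
echo "source: ${src_kb}K  clone: ${dst_kb}K"
```

The backup under refs/original/ is not a branch or tag, so it is not cloned either, and the copy needs no further cleanup.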

Dmitry


* Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files
  2007-10-08  0:47               ` Elijah Newren
@ 2007-10-08  2:28                 ` Sam Vilain
  0 siblings, 0 replies; 31+ messages in thread
From: Sam Vilain @ 2007-10-08  2:28 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Johannes Schindelin, Frank Lichtenheld, git

Elijah Newren wrote:
> On 10/7/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
>> It should be as easy as git filter-branch and git clone.
> 
> Yes, a git filter-branch, git clone, AND git gc in the clone avoids
> all those funny ref editing commands.  However, cloning a 5.6GB repo
> (the size of one of the real repos I'm dealing with) will likely take
> a long time (and may push me past the limits of disk space), so using
> other steps to avoid the need to clone actually seems nicer.

You can just delete the logs and references that you don't want and run
git gc --prune.

However.

git gc creates a new pack before deleting the old one.  Garbage
collection usually does this; make a copy of everything to a new place
and then free all of the old space.  If *that* is a problem, i.e. you
don't have enough space for two copies of the repository and the junk,
you'll have to do a partial import, leave the junk you don't want
unpacked, clean up and prune, then finish the import.  Which sounds like
a lot of hassle when you should really just find a place with more space
to work with!
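
A rough way to check for that in advance is to compare the current pack size with the free space on the filesystem. This is only a sketch: the comparison is deliberately pessimistic (the new pack is normally no larger than the old one), and the scratch repository, identity settings and variable names are assumptions for the demo.

```shell
#!/bin/sh
# Sketch: compare the size of the existing packs with the free space before running gc.
set -e
repo=$(mktemp -d)                        # throwaway path, not from the thread
cd "$repo"
git init -q
git config user.name "Example User"      # placeholder identity for the commit
git config user.email user@example.com
echo hi > there
git add there
git commit -q -m 'Small repo'
git gc -q                                # make sure there is a pack to measure

# count-objects -v reports "size-pack" in KiB.
pack_kb=$(git count-objects -v | awk '/^size-pack:/ {print $2}')

# df -Pk prints one header line, then the filesystem line; column 4 is the free KiB.
free_kb=$(df -Pk . | awk 'NR==2 {print $4}')

if [ "$free_kb" -gt "$pack_kb" ]; then
    echo "enough room for a second copy of the pack (${pack_kb}K needed, ${free_kb}K free)"
else
    echo "not enough room: ${pack_kb}K needed, ${free_kb}K free"
fi
```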

Sam.


* Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files
  2007-10-08  0:09           ` Elijah Newren
@ 2007-10-08  6:15             ` Alex Riesen
  2007-10-08  9:23               ` Andreas Ericsson
  0 siblings, 1 reply; 31+ messages in thread
From: Alex Riesen @ 2007-10-08  6:15 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Frank Lichtenheld, git

Elijah Newren, Mon, Oct 08, 2007 02:09:50 +0200:
> On 10/7/07, Alex Riesen <raa.lkml@gmail.com> wrote:
> > you missed something. Your example compresses to about 124k.
> 
> What version of git are you running?  I reran all the steps to which

git version 1.5.3.4.225.g31b973 (irrelevant custom modifications)

> you responded (repeated below for clarity) with git-1.5.3.3 and still
> get 11MB.  Also, you must have different filesystem extents than me
> since an empty git repo takes 196k here[1], so I don't think any repo
> is going to get down to 124k.

it is ext3. I do not install the hooks (~8k apparent, ~32k fs blocks)
and never activate logs by default.

> # Use vi to remove the line referring to refs/original...
> git reflog expire --all

another part of the suggestion re reflogs was to look into the logs,
to check if expire actually removed anything. It seems to have been
the culprit.


* Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files
  2007-10-08  1:06                 ` Johannes Schindelin
@ 2007-10-08  6:22                   ` Johannes Sixt
  2007-10-08 14:36                     ` J. Bruce Fields
  0 siblings, 1 reply; 31+ messages in thread
From: Johannes Sixt @ 2007-10-08  6:22 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: J. Bruce Fields, Elijah Newren, Frank Lichtenheld, git

Johannes Schindelin schrieb:
> The rationale was this: filter-branch recently learnt how to rewrite many 
> branches, and it might be tedious to find out which ones.  But then, there 
> is git log --no-walk --all, so maybe I really should get rid of 
> refs/original/*?
> 
> I'd like to have some comments from the heavier filter-branch users on 
> that...

IMHO, a backup of the original refs is needed. However, it may be wise to 
store them in the refs/heads namespace so that 'git branch -d' can delete 
them and 'git branch -m' can move them back if something went wrong.

-- Hannes


* Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files
  2007-10-08  6:15             ` Alex Riesen
@ 2007-10-08  9:23               ` Andreas Ericsson
  0 siblings, 0 replies; 31+ messages in thread
From: Andreas Ericsson @ 2007-10-08  9:23 UTC (permalink / raw)
  To: Alex Riesen; +Cc: Elijah Newren, Frank Lichtenheld, git

Alex Riesen wrote:
> Elijah Newren, Mon, Oct 08, 2007 02:09:50 +0200:
>> On 10/7/07, Alex Riesen <raa.lkml@gmail.com> wrote:
>>> you missed something. Your example compresses to about 124k.
>> What version of git are you running?  I reran all the steps to which
> 
> git version 1.5.3.4.225.g31b973 (irrelevant custom modifications)
> 
>> you responded (repeated below for clarity) with git-1.5.3.3 and still
>> get 11MB.  Also, you must have different filesystem extents than me
>> since an empty git repo takes 196k here[1], so I don't think any repo
>> is going to get down to 124k.
> 
> it is ext3. I do not install the hooks (~8k apparent, ~32k fs blocks)
> and never activate logs by default.
> 
>> # Use vi to remove the line referring to refs/original...
>> git reflog expire --all
> 
> another part of the suggestion re reflogs was to look into the logs,
> to check if expire actually removed anything. It seems to have been
> the culprit.
> 

On my system, running git version 1.5.3.3.131.g34c6d,

	git reflog expire --all

does absolutely nothing.

	git reflog expire --expire=0 --all

truncates all the reflogs. I'm not sure if this is intended or not.
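
The difference can be reproduced by counting reflog entries before and after each command. A minimal sketch on a scratch repository, using --expire=now (the spelling current Git documents) in place of --expire=0; the temporary path, identity settings and helper function name are assumptions for the demo.

```shell
#!/bin/sh
# Sketch: compare "reflog expire --all" with and without an explicit cutoff.
set -e
repo=$(mktemp -d)                        # throwaway path, not from the thread
cd "$repo"
git init -q
git config user.name "Example User"      # placeholder identity for the commits
git config user.email user@example.com
echo hi > there
git add there
git commit -q -m 'first'
echo again >> there
git commit -q -a -m 'second'

# Hypothetical helper: number of entries in HEAD's reflog.
count_entries() { git reflog 2>/dev/null | wc -l | tr -d ' '; }

before=$(count_entries)
git reflog expire --all                  # default cutoffs: fresh entries are too young to expire
kept=$(count_entries)
git reflog expire --expire=now --all     # explicit cutoff: every entry is older than "now"
after=$(count_entries)
echo "before=$before kept=$kept after=$after"
```

Without an explicit cutoff the command applies the configured defaults, so entries created moments ago are left alone, which would explain the behaviour seen earlier in the thread.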

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231


* Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files
  2007-10-08  1:06             ` Dmitry Potapov
@ 2007-10-08  9:27               ` Andreas Ericsson
  2007-10-08 10:05                 ` Karl Hasselström
  2007-10-08 12:40                 ` Dmitry Potapov
  0 siblings, 2 replies; 31+ messages in thread
From: Andreas Ericsson @ 2007-10-08  9:27 UTC (permalink / raw)
  To: Dmitry Potapov; +Cc: Elijah Newren, Alex Riesen, Frank Lichtenheld, git

Dmitry Potapov wrote:
> On Sun, Oct 07, 2007 at 06:22:28PM -0600, Elijah Newren wrote:
>> git-filter-branch --tree-filter 'rm -f testme.txt' HEAD
>> git reset --hard
>> rm -rf .git/refs/original/
>> vi .git/packed-refs
>> # Use vi to remove the line referring to refs/original...
>> git reflog expire --all --expire-unreachable=0
>> git gc --prune
>>
>> Seems like a wrapper is needed.  :-)
> 
> Actually, I would rather not, because you rarely need to remove anything
> immediately, and a 30-day delay is a reasonable time to give you a chance
> to recover what you removed accidentally. You can reduce it by setting
> appropriate value for gc.reflogExpireUnreachable in your configuration.
> The only thing you need to do is to remove .git/refs/original/heads/something
> after you are sure that git-filter-branch did exactly what you wanted.
> 
>>> Warning: all unreachable references will be removed!
>> What other scenarios could lead to unreachable references?
> 
> Any re-writing of history leads to that.
> 

git-rebase being the most common culprit, right alongside 'git commit --amend'.
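
A minimal sketch of the 'git commit --amend' case, assuming a throwaway
repository with a placeholder identity: after the amend, the original
commit is still in the object store, but only the reflog refers to it.

```shell
# Sketch: an amended commit leaves the old commit unreachable from any ref.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email you@example.com   # placeholder identity for the demo
git config user.name "Example User"

echo hi > there
git add there
git commit -q -m 'Original message'
old=$(git rev-parse HEAD)

git commit -q --amend -m 'Amended message'
new=$(git rev-parse HEAD)

# --no-reflogs tells fsck to ignore reflog entries when deciding what is
# reachable, so the pre-amend commit shows up as unreachable.
unreachable=$(git fsck --unreachable --no-reflogs 2>/dev/null |
              grep -c "commit $old" || true)

echo "old=$old new=$new listed_as_unreachable=$unreachable"
```

Without --no-reflogs the old commit counts as reachable, which is why it
survives 'git gc' until its reflog entry expires.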

>> I don't
>> know how to determine whether this is safe or not (except that these
>> were test repositories anyway, so I don't care what happens to them).
> 
> Git logs all your actions, so even re-writing history would not be
> so disastrous if you suddenly realized that you did something wrong.
> The history is stored for 30 days by default. Usually, you do not
> need to mess with Git internals like you did above. Your useless
> files will still disappear after being unreachable for 30 days.
> 
> OTOH, if you want to have a clean repository immediately, I believe
> 'git clone' is a better option. After you made a local clone using
> it, 'git gc' should remove old garbage.
> 

A clone only fetches revs reachable from a ref, so pruning immediately
after a clone is completely pointless.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231


* Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files
  2007-10-08  9:27               ` Andreas Ericsson
@ 2007-10-08 10:05                 ` Karl Hasselström
  2007-10-08 12:40                 ` Dmitry Potapov
  1 sibling, 0 replies; 31+ messages in thread
From: Karl Hasselström @ 2007-10-08 10:05 UTC (permalink / raw)
  To: Andreas Ericsson
  Cc: Dmitry Potapov, Elijah Newren, Alex Riesen, Frank Lichtenheld,
	git

On 2007-10-08 11:27:33 +0200, Andreas Ericsson wrote:

> Dmitry Potapov wrote:
>
> > On Sun, Oct 07, 2007 at 06:22:28PM -0600, Elijah Newren wrote:
> >
> > > What other scenarios could lead to unreachable references?
> >
> > Any re-writing of history leads to that.
>
> git-rebase being the most common culprit, right alongside 'git
> commit --amend'.

StGit (and presumably guilt, and any other similar tool) are just
glorified rebase wrappers, so they'll generate tons of unreachable
objects too.

-- 
Karl Hasselström, kha@treskal.com
      www.treskal.com/kalle


* Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files
  2007-10-08  9:27               ` Andreas Ericsson
  2007-10-08 10:05                 ` Karl Hasselström
@ 2007-10-08 12:40                 ` Dmitry Potapov
  2007-10-08 13:01                   ` Karl Hasselström
  1 sibling, 1 reply; 31+ messages in thread
From: Dmitry Potapov @ 2007-10-08 12:40 UTC (permalink / raw)
  To: Andreas Ericsson; +Cc: Elijah Newren, Alex Riesen, Frank Lichtenheld, git

On Mon, Oct 08, 2007 at 11:27:33AM +0200, Andreas Ericsson wrote:
> Dmitry Potapov wrote:
> >OTOH, if you want to have a clean repository immediately, I believe
> >'git clone' is a better option. After you made a local clone using
> >it, 'git gc' should remove old garbage.
> >
> 
> A clone only fetches revs reachable from a ref, so pruning immediately
> after a clone is completely pointless.

Not true. git-clone copies the whole pack, so it can contain unreachable
objects. Here is a simple script that demonstrates that without garbage
collection the size of the cloned repository will be the same as the
original one.

===========================================
# Make a small repo
mkdir test
cd test
git init
echo hi > there
git add there
git commit -m 'Small repo'

# Add a random 10M binary file
dd if=/dev/urandom of=testme.txt count=10 bs=1M
git add testme.txt
git commit -m 'Add big binary file'

# Remove the 10M binary file
git rm testme.txt
git commit -m 'Remove big binary file'

# Compress the repo, see how big the repo is
git gc --aggressive --prune
du -ks .                       # 10348
du -ks .git                    # 10344

git-whatchanged

# Try to rewrite history to remove the binary file
git-filter-branch --tree-filter 'rm -f testme.txt' HEAD
git reset --hard

# Remove original refs
rm .git/refs/original/refs/heads/master

# Move back up to the parent directory
cd ..

# Clone repository
git-clone -l test/.git test2

cd test2
du -ks .git # 10360

# Now run garbage collection
git gc
du -ks .git # 96

===========================================

Dmitry


* Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files
  2007-10-08 12:40                 ` Dmitry Potapov
@ 2007-10-08 13:01                   ` Karl Hasselström
  0 siblings, 0 replies; 31+ messages in thread
From: Karl Hasselström @ 2007-10-08 13:01 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Andreas Ericsson, Elijah Newren, Alex Riesen, Frank Lichtenheld,
	git

On 2007-10-08 16:40:17 +0400, Dmitry Potapov wrote:

> On Mon, Oct 08, 2007 at 11:27:33AM +0200, Andreas Ericsson wrote:
>
> > A clone only fetches revs reachable from a ref, so pruning
> > immediately after a clone is completely pointless.
>
> Not true. git-clone copies the whole pack, so it can contain
> unreachable objects.
[...]
> git-clone -l test/.git test2

Try without the -l option and with a file:// URL:

  git clone file:///path/to/test/.git test2

From the git-clone man page:

--local::
-l::
        When the repository to clone from is on a local machine, this
        flag bypasses normal "git aware" transport mechanism and
        clones the repository by making a copy of HEAD and everything
        under objects and refs directories. The files under
        `.git/objects/` directory are hardlinked to save space when
        possible. This is now the default when the source repository
        is specified with `/path/to/repo` syntax, so it essentially is
        a no-op option. To force copying instead of hardlinking (which
        may be desirable if you are trying to make a back-up of your
        repository), but still avoid the usual "git aware" transport
        mechanism, `--no-hardlinks` can be used.
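
As a rough sketch of the difference, the following creates one object
that no ref points to and then clones the repository both ways. The
directory names (src, via-path, via-url) and the orphaned blob are made
up for the demo, and it assumes a git new enough to support 'git -C'.

```shell
# Sketch: a path clone copies the object store, a file:// clone does not.
set -e
work=$(mktemp -d)
cd "$work"
git init -q src
cd src
git config user.email you@example.com   # placeholder identity for the demo
git config user.name "Example User"
echo hi > there
git add there
git commit -q -m 'Small repo'

# Write a blob that no commit references, i.e. an unreachable object.
blob=$(echo 'orphaned content' | git hash-object -w --stdin)
cd ..

# Plain path: objects are hardlinked or copied wholesale.
git clone -q src via-path
# file:// URL: goes through the normal transport, which only sends
# objects reachable from the refs.
git clone -q "file://$work/src" via-url

if git -C via-path cat-file -e "$blob" 2>/dev/null; then path_has=yes; else path_has=no; fi
if git -C via-url cat-file -e "$blob" 2>/dev/null; then url_has=yes; else url_has=no; fi

echo "path clone has orphan: $path_has; file:// clone has orphan: $url_has"
```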

-- 
Karl Hasselström, kha@treskal.com
      www.treskal.com/kalle


* Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files
  2007-10-08  6:22                   ` Johannes Sixt
@ 2007-10-08 14:36                     ` J. Bruce Fields
  2007-10-08 16:37                       ` Theodore Tso
  0 siblings, 1 reply; 31+ messages in thread
From: J. Bruce Fields @ 2007-10-08 14:36 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: Johannes Schindelin, Elijah Newren, Frank Lichtenheld, git

On Mon, Oct 08, 2007 at 08:22:42AM +0200, Johannes Sixt wrote:
> Johannes Schindelin schrieb:
>> The rationale was this: filter-branch recently learnt how to rewrite many 
>> branches, and it might be tedious to find out which ones.  But then, there 
>> is git log --no-walk --all, so maybe I really should get rid of 
>> refs/original/*?
>> I'd like to have some comments from the heavier filter-branch users on 
>> that...
>
> IMHO, a backup of the original refs is needed.

And we can't rely instead on reflogs or some other existing mechanism?

> However, it may be wise to store them in the refs/heads namespace so
> that 'git branch -d' can delete them and 'git branch -m' can move them
> back if something went wrong.

If people want backups like this it'd seem easier to turn this on
optionally with commandline switches, like patch's --backup, --prefix,
--suffix options.

Having it by default leave these backups around, even when everything
succeeds, makes for unnecessary cleanup work in the normal case, and is
inconsistent with the behavior of other git commands that destroy or
rewrite history.

--b.


* Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files
  2007-10-08 14:36                     ` J. Bruce Fields
@ 2007-10-08 16:37                       ` Theodore Tso
  2007-10-08 19:05                         ` J. Bruce Fields
  2007-10-09 10:37                         ` Johannes Schindelin
  0 siblings, 2 replies; 31+ messages in thread
From: Theodore Tso @ 2007-10-08 16:37 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Johannes Sixt, Johannes Schindelin, Elijah Newren,
	Frank Lichtenheld, git

On Mon, Oct 08, 2007 at 10:36:50AM -0400, J. Bruce Fields wrote:
> Having it by default leave these backups around, even when everything
> succeeds, makes for unnecessary cleanup work in the normal case, and is
> inconsistent with the behavior of other git commands that destroy or
> rewrite history.

I think what makes git-filter-branch different is that you can change
a large amount of history with git-filter-branch, including large
numbers of tags, etc.  The reflog is quite sufficient to recover from
a screwed up "git commit --amend".  But I don't think the reflog is
going to be sufficient given the kinds of changes that
git-filter-branch can potentially do to your repository.  Maybe the
default of --backup vs --no-backup could be changed via a config
parameter, but I think backing up refs by default is a good
thing....

Perhaps a solution would be to add a "git-filter-branch --cleanup" that
clears the reflog and wipes the backed up tags; perhaps first
asking interactively if the user is really sure he/she wants to do
this.
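
No such option exists in the thread; as a sketch of what a cleanup step
along these lines might look like, here is a hypothetical shell helper
(the name filter_branch_cleanup is invented). It assumes a recent git
that accepts --expire=now and --prune=now, and it is destructive, so it
asks for confirmation first.

```shell
# Hypothetical helper, not part of git: discard filter-branch backups and
# everything that only they (or the reflogs) keep alive.
filter_branch_cleanup() {
    printf 'Really delete refs/original/* and all reflogs? [y/N] '
    read -r answer
    case "$answer" in
        y|Y) ;;
        *) echo 'Aborted.'; return 1 ;;
    esac

    # Delete every backup ref under refs/original/, loose or packed.
    git for-each-ref --format='%(refname)' refs/original/ |
    while read -r ref; do
        git update-ref -d "$ref"
    done

    # Drop all reflog entries so they stop pinning the old objects.
    git reflog expire --expire=now --all

    # Repack and prune unreachable objects immediately.
    git gc --quiet --prune=now
}
```

Run from inside the rewritten repository, e.g. 'filter_branch_cleanup',
and only after checking that the rewrite did what was intended.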

						- Ted


* Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files
  2007-10-08 16:37                       ` Theodore Tso
@ 2007-10-08 19:05                         ` J. Bruce Fields
  2007-10-09 10:37                         ` Johannes Schindelin
  1 sibling, 0 replies; 31+ messages in thread
From: J. Bruce Fields @ 2007-10-08 19:05 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Johannes Sixt, Johannes Schindelin, Elijah Newren,
	Frank Lichtenheld, git

On Mon, Oct 08, 2007 at 12:37:01PM -0400, Theodore Tso wrote:
> I think what makes git-filter-branch different is that you can change
> a large amount of history with git-filter-branch, including large
> numbers of tags, etc.  The reflog is quite sufficient to recover from
> a screwed up "git commit --amend".  But I don't think the reflog is
> going to be sufficient given the kinds of changes that
> git-filter-branch can potentially do to your repository.  Maybe the
> default of --backup vs --no-backup could be changed via a config
> parameter, but I think backing up refs by default is a good
> thing....

Yeah, it's clearly designed with rewriting a whole repo in mind.

It might also be handy, though, as a quick way to rewrite a single
branch.  (E.g., "add 'Acked-by: Joe' to everything in 'for-upstream' not
in 'origin'", or "rename foo to bar in every commit in 'topic' not in
'origin'".).
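
A sketch of the first case in a throwaway repository, assuming a git
that still ships filter-branch. Here 'origin' is a local branch standing
in for the upstream branch, and the commit contents are made up.
FILTER_BRANCH_SQUELCH_WARNING only silences the notice that newer git
versions print before running.

```shell
# Sketch: append an Acked-by line to every commit in for-upstream that is
# not in origin.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email you@example.com   # placeholder identity for the demo
git config user.name "Example User"

echo base > file
git add file
git commit -q -m 'Base commit'
git branch origin                        # local stand-in for upstream

git checkout -q -b for-upstream
echo one >> file
git commit -q -a -m 'First change'
echo two >> file
git commit -q -a -m 'Second change'

# The message filter receives the old message on stdin and must print
# the new one; cat passes it through, then a blank line and the tag follow.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch \
    --msg-filter 'cat && echo && echo "Acked-by: Joe"' \
    origin..for-upstream >/dev/null

git log --format=%B origin..for-upstream
```

The old tip of for-upstream is saved under refs/original/, which is the
leftover being discussed here.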

I find the current defaults awkward for that case.  Maybe it'd make
sense to treat the two cases differently.

> Perhaps a solution would be to add a "git-filter-branch --cleanup" that
> clears the reflog and wipes the backed up tags; perhaps first
> asking interactively if the user is really sure he/she wants to do
> this.

Maybe.

--b.


* Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files
  2007-10-08 16:37                       ` Theodore Tso
  2007-10-08 19:05                         ` J. Bruce Fields
@ 2007-10-09 10:37                         ` Johannes Schindelin
  1 sibling, 0 replies; 31+ messages in thread
From: Johannes Schindelin @ 2007-10-09 10:37 UTC (permalink / raw)
  To: Theodore Tso
  Cc: J. Bruce Fields, Johannes Sixt, Elijah Newren, Frank Lichtenheld,
	git

Hi,

On Mon, 8 Oct 2007, Theodore Tso wrote:

> On Mon, Oct 08, 2007 at 10:36:50AM -0400, J. Bruce Fields wrote:
> > Having it by default leave these backups around, even when everything
> > succeeds, makes for unnecessary cleanup work in the normal case, and is
> > inconsistent with the behavior of other git commands that destroy or
> > rewrite history.
> 
> I think what makes git-filter-branch different is that you can change a 
> large amount of history with git-filter-branch, including large numbers 
> of tags, etc.  The reflog is quite sufficient to recover from a screwed 
> up "git commit --amend".
>
> [...]
>
> But I don't think the reflog is going to be sufficient given the kinds 
> of changes that git-filter-branch can potentially do to your repository.

FWIW after reading Bruce's reasoning, I tend towards having no "backups" 
by default (I say "backups", since they are _only_ written when the 
respective branch has changed).

And I do not think that the config variable is a good approach; whether 
you want backups or not is a per-case decision.  So your proposal would 
only result in even more confusion.

My preference ATM is to write nothing per default, but only when 
--original <namespace> was given.

Ciao,
Dscho

