git.vger.kernel.org archive mirror
* git repack vs git gc --aggressive
@ 2012-08-07 18:22 Felix Natter
  2012-08-07 18:44 ` Jeff King
  0 siblings, 1 reply; 7+ messages in thread
From: Felix Natter @ 2012-08-07 18:22 UTC (permalink / raw)
  To: git

hello,

I read this:
  http://metalinguist.wordpress.com/2007/12/06/the-woes-of-git-gc-aggressive-and-how-git-deltas-work/
where
  git repack -a -d --depth=250 --window=250
is mentioned as a (recommended) alternative to git gc --aggressive.

I am a bit confused, because the page also mentions that git gc --aggressive
is recommended when a repo has been imported using git fast-import.

So my questions are:

1. is the above repack command (with --depth=500) safe? Of course I want
   to be absolutely sure that our repo will be consistent.
   Do I need another command ("git gc", "git prune") as well?

2. is it the right tool for the job or shall I use git gc --aggressive?

Thanks!
-- 
Felix Natter

* Re: git repack vs git gc --aggressive
  2012-08-07 18:22 git repack vs git gc --aggressive Felix Natter
@ 2012-08-07 18:44 ` Jeff King
  2012-08-07 19:05   ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff King @ 2012-08-07 18:44 UTC (permalink / raw)
  To: Felix Natter; +Cc: git

On Tue, Aug 07, 2012 at 08:22:21PM +0200, Felix Natter wrote:

> I read this:
>   http://metalinguist.wordpress.com/2007/12/06/the-woes-of-git-gc-aggressive-and-how-git-deltas-work/
> where
>   git repack -a -d --depth=250 --window=250
> is mentioned as a (recommended) alternative to git gc --aggressive.

Note how old that post is. In fact, on the very same day it was posted,
the discussion on the mailing list resulted in this commit:

  commit 1c192f3442414a6ce83f9a524806fc26a0861d2d
  Author: Johannes Schindelin <Johannes.Schindelin@gmx.de>
  Date:   Thu Dec 6 12:03:38 2007 +0000

      gc --aggressive: make it really aggressive

      The default was not to change the window or depth at all.  As suggested
      by Jon Smirl, Linus Torvalds and others, default to

          --window=250 --depth=250

So the packing parameters are the same these days for either method.
Note that "git gc --aggressive" will also use "-f" to recompute all
deltas. This is more expensive, but gives git more flexibility if the
old deltas were sub-optimal (typically, this is the case if the existing
pack was generated by fast-import, which favors speed of import versus
coming up with an optimal storage pattern).
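As a rough sketch of the two forms compared above, the commands below run both against a throwaway repository. The temporary directory, file name, and committer identity are placeholders, not anything from this thread. The 250/250 values are the ones quoted in the commit message above; other Git versions may use different aggressive defaults, which the `gc.aggressiveDepth` and `gc.aggressiveWindow` settings control.

```shell
# Throwaway repository so the commands are safe to run as-is.
repo=$(mktemp -d)
git init -q "$repo"
echo "hello" > "$repo/file.txt"
git -C "$repo" add file.txt
git -C "$repo" -c user.name="Example" -c user.email="example@example.com" \
    commit -q -m "initial"

# Explicit form: -a packs everything into one pack, -d removes packs made
# redundant, -f discards existing deltas and recomputes them.
git -C "$repo" repack -q -a -d -f --depth=250 --window=250

# gc form: also recomputes deltas, using its configured aggressive settings.
git -C "$repo" gc -q --aggressive

# Report pack count and pack size after repacking.
git -C "$repo" count-objects -v
```

On a real repository, only one of the two repacking commands is needed; both appear here purely for comparison.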

> So my questions are:
> 
> 1. is the above repack command (with --depth=500) safe? Of course I want
>    to be absolutely sure that our repo will be consistent.
>    Do I need another command ("git gc", "git prune") as well?

Yes, it's safe. Changing the depth parameter can never lose data.
However, it's probably not a good idea for two reasons:

  1. It probably does nothing. You're not likely to hit a 500-depth
     delta chain (the point of the "250" in --aggressive is that it is
     already ridiculously high).

  2. Even if you did come up with a 500-depth delta chain, it may not be
     a good tradeoff. You might save a little bit of space, but keep in
     mind that to generate the object data, it means that git will have
     to follow a chain of 500 deltas to regenerate the object.

Of course, every workload is different. One can develop pathological
cases where --depth=500 saves a lot of space. But it's unlikely that it
is the case for a normal repository. You can always try both and see the
result.
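One way to "try both" is sketched below, assuming a small synthetic repository stands in for the real one. The helper name `pack_size_for_depth`, the file name, and the number of revisions are illustrative choices. Each depth is applied to a fresh copy, `git fsck` confirms the copy is still consistent, and the resulting pack size is read from `git count-objects -v`.

```shell
# Build a throwaway repository with many revisions of one file
# (placeholder content; substitute a copy of a real repository to measure it).
base=$(mktemp -d)
git init -q "$base/repo"
i=1
while [ "$i" -le 30 ]; do
    echo "line $i" >> "$base/repo/data.txt"
    git -C "$base/repo" add data.txt
    git -C "$base/repo" -c user.name="Example" -c user.email="example@example.com" \
        commit -q -m "revision $i"
    i=$((i + 1))
done

# Repack a fresh copy with the given depth and print the pack size in KiB.
pack_size_for_depth() {
    copy="$base/depth-$1"
    cp -R "$base/repo" "$copy"
    git -C "$copy" repack -q -a -d -f --window=250 --depth="$1"
    # Consistency check: fail if the repacked copy does not verify.
    git -C "$copy" fsck --full >/dev/null 2>&1 || return 1
    git -C "$copy" count-objects -v | awk '/^size-pack:/ {print $2}'
}

size_250=$(pack_size_for_depth 250)
size_500=$(pack_size_for_depth 500)
echo "depth=250: ${size_250} KiB, depth=500: ${size_500} KiB"
```

A repository this small is unlikely to show a meaningful difference; the point is only the measuring procedure, which can be pointed at a copy of the real repository instead.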

In fact, I'd also test how just "git gc" behaves versus "git gc
--aggressive" for your repo. The former is much less expensive to run.
You really shouldn't need to be running "--aggressive" all the time, so
if you are looking at doing a nightly repack or similar, just "git gc"
is probably fine.

-Peff

* Re: git repack vs git gc --aggressive
  2012-08-07 18:44 ` Jeff King
@ 2012-08-07 19:05   ` Junio C Hamano
  2012-08-10 19:09     ` Felix Natter
  0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2012-08-07 19:05 UTC (permalink / raw)
  To: Jeff King; +Cc: Felix Natter, git

Jeff King <peff@peff.net> writes:

> So the packing parameters are the same these days for either method.
> Note that "git gc --aggressive" will also use "-f" to recompute all
> deltas. This is more expensive, but gives git more flexibility if the
> old deltas were sub-optimal (typically, this is the case if the existing
> pack was generated by fast-import, which favors speed of import versus
> coming up with an optimal storage pattern).

Also, a fetch often stores the pack received from the other end
straight into your local repository (appending at the end any objects
the other end did not send but that are needed to complete the pack).
If the server side hasn't been packed with "-f", you will
inherit the badness until you repack with "-f".

> Of course, every workload is different. One can develop pathological
> cases where --depth=500 saves a lot of space. But it's unlikely that it
> is the case for a normal repository. You can always try both and see the
> result.

For a dataset where ridiculously large depth really is a win, these
objects would have to be reasonably large, and the cost of expanding the
base and then applying hundreds of deltas to recover one object may
not be negligible. The user should consider whether he is willing to pay
that price every time he does a local Git operation.

> In fact, I'd also test how just "git gc" behaves versus "git gc
> --aggressive" for your repo. The former is much less expensive to run.
> You really shouldn't need to be running "--aggressive" all the time, so
> if you are looking at doing a nightly repack or similar, just "git gc"
> is probably fine.

As I am coming from "large depth is harmful" school, I would
recommend

 - "git repack -a -d -f" with large "--window" with reasonably short
   "--depth" once, and mark the result with .keep;
 
 - "git repack -a -d -f" once every several weeks; and

 - "git gc" or "git repack" (without any other options) daily.

and ignore "--aggressive" entirely.
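The routine above could be sketched as three shell functions, one per bullet. The function names and the `--window=250 --depth=32` values are illustrative assumptions (the message only says "large" and "reasonably short"), and marking the result with .keep is left out of this sketch. The demonstration runs on a throwaway repository with a placeholder file and committer identity.

```shell
# One-time: recompute all deltas with a large window and a short depth.
initial_repack() {
    git -C "$1" repack -q -a -d -f --window=250 --depth=32
}

# Every several weeks: full repack that recomputes deltas,
# using Git's default window and depth.
periodic_repack() {
    git -C "$1" repack -q -a -d -f
}

# Daily: cheap housekeeping without --aggressive.
daily_maintenance() {
    git -C "$1" gc -q
}

# Demonstration on a throwaway repository.
repo=$(mktemp -d)
git init -q "$repo"
echo "hello" > "$repo/file.txt"
git -C "$repo" add file.txt
git -C "$repo" -c user.name="Example" -c user.email="example@example.com" \
    commit -q -m "initial"

initial_repack "$repo"
periodic_repack "$repo"
daily_maintenance "$repo"
git -C "$repo" count-objects -v
```

In practice the last two functions would be driven by a scheduler such as cron rather than called back to back.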

* Re: git repack vs git gc --aggressive
  2012-08-07 19:05   ` Junio C Hamano
@ 2012-08-10 19:09     ` Felix Natter
  2012-08-10 20:09       ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Felix Natter @ 2012-08-10 19:09 UTC (permalink / raw)
  To: git

Junio C Hamano <gitster@pobox.com> writes:

> Jeff King <peff@peff.net> writes:
>
>> So the packing parameters are the same these days for either method.
>> Note that "git gc --aggressive" will also use "-f" to recompute all
>> deltas. This is more expensive, but gives git more flexibility if the
>> old deltas were sub-optimal (typically, this is the case if the existing
>> pack was generated by fast-import, which favors speed of import versus
>> coming up with an optimal storage pattern).
>
> Also, a fetch often stores the pack received from the other end
> straight into your local repository (appending at the end any objects
> the other end did not send but that are needed to complete the pack).
> If the server side hasn't been packed with "-f", you will
> inherit the badness until you repack with "-f".
>
>> Of course, every workload is different. One can develop pathological
>> cases where --depth=500 saves a lot of space. But it's unlikely that it
>> is the case for a normal repository. You can always try both and see the
>> result.
>
> For a dataset where ridiculously large depth really is a win, these
> objects would have to be reasonably large, and the cost of expanding the
> base and then applying hundreds of deltas to recover one object may
> not be negligible. The user should consider whether he is willing to pay
> that price every time he does a local Git operation.
>
>> In fact, I'd also test how just "git gc" behaves versus "git gc
>> --aggressive" for your repo. The former is much less expensive to run.
>> You really shouldn't need to be running "--aggressive" all the time, so
>> if you are looking at doing a nightly repack or similar, just "git gc"
>> is probably fine.

Thank you both very much for your answers!

I have a few questions about this:

> As I am coming from "large depth is harmful" school, I would
> recommend
>
>  - "git repack -a -d -f" with large "--window" with reasonably short
>    "--depth" once, 

So something like --depth=250 and --window=500? 

> and mark the result with .keep;

I guess you refer to a toplevel '.keep' file. But what does
that do (sorry, couldn't find anything on google)?
  
>  - "git repack -a -d -f" once every several weeks; and
>
>  - "git gc" or "git repack" (without any other options) daily.
>
> and ignore "--aggressive" entirely.

One more question: I use bzr fast-export | git fast-import to import
branches from bzr:

    bzr fast-export --marks=$MARKS_BZR --git-branch="$BRANCHNAME" "$BZR_FREEPLANE_REPO/$BRANCHNAME/" | \
        git fast-import --import-marks=$MARKS_GIT --export-marks=$MARKS_GIT

Will those marks files (which remember which commits are already there in
the git repo) also work after I have done git repack / git gc?
In other words, can I import bzr-branches after I have run git repack /
git gc on the repo?

Thank you!
-- 
Felix Natter

* Re: git repack vs git gc --aggressive
  2012-08-10 19:09     ` Felix Natter
@ 2012-08-10 20:09       ` Junio C Hamano
  2012-08-13 14:20         ` Marc Branchaud
  0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2012-08-10 20:09 UTC (permalink / raw)
  To: Felix Natter; +Cc: git

Felix Natter <fnatter@gmx.net> writes:

> I have a few questions about this:
>
>> As I am coming from "large depth is harmful" school, I would
>> recommend
>>
>>  - "git repack -a -d -f" with large "--window" with reasonably short
>>    "--depth" once, 
>
> So something like --depth=250 and --window=500? 

I would use more like --depth=16 or 32 in my local repositories.

>> and mark the result with .keep;
>
> I guess you refer to a toplevel '.keep' file.

Not at all.  And it is not documented, it seems X-<.

Typically you have a pair of files in .git/objects/pack, e.g.

  .git/objects/pack/pack-2e3e3b332b446278f9ff91c4f497bc6ed2626d00.idx
  .git/objects/pack/pack-2e3e3b332b446278f9ff91c4f497bc6ed2626d00.pack

And you can add another file next to them

  .git/objects/pack/pack-2e3e3b332b446278f9ff91c4f497bc6ed2626d00.keep

to prevent the pack from getting repacked.  I think "git clone" does
this for you after an initial import.
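A minimal sketch of marking a pack this way, assuming a throwaway repository; the `commit_file` helper, file names, and committer identity are hypothetical. It repacks once, creates an empty `.keep` file next to the resulting pack, then shows that a later `git repack -a -d` writes new objects to a separate pack and leaves the kept one in place.

```shell
repo=$(mktemp -d)
git init -q "$repo"

# Hypothetical helper: write one file and commit it.
commit_file() {
    echo "$2" > "$repo/$1"
    git -C "$repo" add "$1"
    git -C "$repo" -c user.name="Example" -c user.email="example@example.com" \
        commit -q -m "add $1"
}

commit_file first.txt "first"
git -C "$repo" repack -q -a -d -f

# Mark the pack just written: an empty pack-<hash>.keep file
# next to the .pack/.idx pair.
kept_pack=$(ls "$repo"/.git/objects/pack/pack-*.pack)
touch "${kept_pack%.pack}.keep"

# Later work is repacked into a separate pack; the kept pack is left alone.
commit_file second.txt "second"
git -C "$repo" repack -q -a -d

ls "$repo/.git/objects/pack"
```

Removing the `.keep` file makes the pack eligible for repacking again.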

* Re: git repack vs git gc --aggressive
  2012-08-10 20:09       ` Junio C Hamano
@ 2012-08-13 14:20         ` Marc Branchaud
  2012-08-13 17:19           ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Marc Branchaud @ 2012-08-13 14:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Felix Natter, git

On 12-08-10 04:09 PM, Junio C Hamano wrote:
> Felix Natter <fnatter@gmx.net> writes:
> 
>> I have a few questions about this:
>>
>>> As I am coming from "large depth is harmful" school, I would
>>> recommend
>>>
>>>  - "git repack -a -d -f" with large "--window" with reasonably short
>>>    "--depth" once, 
>>
>> So something like --depth=250 and --window=500? 
> 
> I would use more like --depth=16 or 32 in my local repositories.
> 
>>> and mark the result with .keep;
>>
>> I guess you refer to a toplevel '.keep' file.
> 
> Not at all.  And it is not documented, it seems X-<.
> 
> Typically you have a pair of files in .git/objects/pack, e.g.
> 
>   .git/objects/pack/pack-2e3e3b332b446278f9ff91c4f497bc6ed2626d00.idx
>   .git/objects/pack/pack-2e3e3b332b446278f9ff91c4f497bc6ed2626d00.pack
> 
> And you can add another file next to them
> 
>   .git/objects/pack/pack-2e3e3b332b446278f9ff91c4f497bc6ed2626d00.keep
> 
> to prevent the pack from getting repacked.  I think "git clone" does
> this for you after an initial import.

1.7.12.rc1 does not.

I even cloned from a repo with a few .keep files, but ended up with only one
big .pack file.

Maybe clone should preserve the packs it gets from the upstream repo?  For
example, our main repo has a 690MB pack file that's marked .keep, but the
clone just ends up with a single 725MB pack file.  Would our clones see
performance improvements if they kept that big 690MB pack separate from the others?

Perhaps the fact that clone creates a single pack file makes it impossible to
preserve the .keep packs from the upstream?

(I figure it's probably not a good idea for clone to .keep the single pack
file it creates.)

		M.

* Re: git repack vs git gc --aggressive
  2012-08-13 14:20         ` Marc Branchaud
@ 2012-08-13 17:19           ` Junio C Hamano
  0 siblings, 0 replies; 7+ messages in thread
From: Junio C Hamano @ 2012-08-13 17:19 UTC (permalink / raw)
  To: marcnarc; +Cc: Felix Natter, git

Marc Branchaud <mbranchaud@xiplink.com> writes:

> On 12-08-10 04:09 PM, Junio C Hamano wrote:
>> Felix Natter <fnatter@gmx.net> writes:
>> 
>>> I have a few questions about this:
>>>
>>>> As I am coming from "large depth is harmful" school, I would
>>>> recommend
>>>>
>>>>  - "git repack -a -d -f" with large "--window" with reasonably short
>>>>    "--depth" once, 
>>>
>>> So something like --depth=250 and --window=500? 
>> 
>> I would use more like --depth=16 or 32 in my local repositories.
>> 
>>>> and mark the result with .keep;
>>>
>>> I guess you refer to a toplevel '.keep' file.
>> 
>> Not at all.  And it is not documented, it seems X-<.
>> 
>> Typically you have a pair of files in .git/objects/pack, e.g.
>> 
>>   .git/objects/pack/pack-2e3e3b332b446278f9ff91c4f497bc6ed2626d00.idx
>>   .git/objects/pack/pack-2e3e3b332b446278f9ff91c4f497bc6ed2626d00.pack
>> 
>> And you can add another file next to them
>> 
>>   .git/objects/pack/pack-2e3e3b332b446278f9ff91c4f497bc6ed2626d00.keep
>> 
>> to prevent the pack from getting repacked.  I think "git clone" does
>> this for you after an initial import.
>
> 1.7.12.rc1 does not.

Sorry, I misremembered.  It was removed at 1db4a75 (Remove
unnecessary pack-*.keep file after successful git-clone,
2008-07-08), so even when the sender gave you a crappy pack, you can
repack locally to correct it.

> Maybe clone should preserve the packs it gets from the upstream repo?

That was part of the intention of the code 1db4a75 removed.

> For
> example, our main repo has a 690MB pack file that's marked .keep, but the
> clone just ends up with a single 725MB pack file.  Would our clones see
> performance improvements if they kept that big 690MB pack separate from the others?

There is no "pack boundary" in the object transfer protocol.  What
comes out of the wire is a single stream of pack data, so the above
is not feasible without major surgery and backward incompatible
change.
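The effect can be observed with a small experiment, sketched here under the assumption of a throwaway source repository that holds two packs; the `commit_file` and `count_packs` helpers and all paths are hypothetical. `--no-local` makes `git clone` use the regular transfer mechanism even for a local path, so the clone receives one stream of pack data.

```shell
base=$(mktemp -d)
src="$base/src"
git init -q "$src"

# Hypothetical helper: write one file in the source repository and commit it.
commit_file() {
    echo "$2" > "$src/$1"
    git -C "$src" add "$1"
    git -C "$src" -c user.name="Example" -c user.email="example@example.com" \
        commit -q -m "add $1"
}

# Hypothetical helper: number of packs in the given repository.
count_packs() {
    git -C "$1" count-objects -v | awk '/^packs:/ {print $2}'
}

commit_file one.txt "one"
git -C "$src" repack -q -d     # first pack
commit_file two.txt "two"
git -C "$src" repack -q -d     # second, incremental pack

# Clone through the transfer protocol rather than copying the object store.
git clone -q --no-local "$src" "$base/clone"

src_packs=$(count_packs "$src")
clone_packs=$(count_packs "$base/clone")
echo "source packs: $src_packs, clone packs: $clone_packs"
```

Without `--no-local`, a clone from a local path may copy or hard-link the object store directly, which would not exercise the transfer protocol at all.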
