* dangling commits and blobs: is this normal?
@ 2009-04-21 21:46 John Dlugosz
2009-04-22 15:27 ` Jeff King
0 siblings, 1 reply; 21+ messages in thread
From: John Dlugosz @ 2009-04-21 21:46 UTC (permalink / raw)
To: git
Immediately after doing a git gc, a git fsck --full reports dangling
objects. Is this normal? What does dangling mean, if not those things
that gc finds?
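For concreteness, the sequence in question (the hashes below are made
up; the output shape is what fsck typically prints):

    git gc
    git fsck --full
    # dangling blob 59d7bb163bbbc5ba2b7cb10dce222e56f4d387a7
    # dangling commit 17a44a00b4251847bdbda4c2ebbc66d7ad9bd562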
--John
* Re: dangling commits and blobs: is this normal?
2009-04-21 21:46 dangling commits and blobs: is this normal? John Dlugosz
@ 2009-04-22 15:27 ` Jeff King
2009-04-22 16:53 ` Brandon Casey
2009-04-22 20:15 ` John Dlugosz
0 siblings, 2 replies; 21+ messages in thread
From: Jeff King @ 2009-04-22 15:27 UTC (permalink / raw)
To: John Dlugosz; +Cc: git
On Tue, Apr 21, 2009 at 05:46:16PM -0400, John Dlugosz wrote:
> Immediately after doing a git gc, a git fsck --full reports dangling
> objects. Is this normal? What does dangling mean, if not those things
> that gc finds?
gc will leave dangling loose objects for a set expiration time
(defaulting to two weeks). This makes it safe to run even if there are
operations in progress that want those dangling objects, but haven't yet
added a reference to them (as long as said operation takes less than two
weeks).
You can also end up with dangling objects in packs. When that pack is
repacked, those objects will be loosened, and then eventually expired
under the rule mentioned above. However, I believe gc will not always
repack old packs; it will make new packs until you have a lot of packs,
and then combine them all (at least that is what "gc --auto" will do; I
don't recall whether just "git gc" follows the same rule).
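The knobs involved, for reference (names per git's documentation; the
exact defaults, and the --prune=<date> form, may need a recent-enough
git, so treat this as a sketch):

    git config gc.pruneExpire "2.weeks.ago"   # how long dangling loose objects survive
    git config gc.autoPackLimit 50            # packs tolerated before --auto consolidates
    git gc --prune=now    # expire dangling objects immediately -- unsafe if
                          # another operation is in flight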
-Peff
* Re: dangling commits and blobs: is this normal?
2009-04-22 15:27 ` Jeff King
@ 2009-04-22 16:53 ` Brandon Casey
2009-04-22 17:39 ` Nicolas Pitre
2009-04-22 20:15 ` John Dlugosz
1 sibling, 1 reply; 21+ messages in thread
From: Brandon Casey @ 2009-04-22 16:53 UTC (permalink / raw)
To: Jeff King; +Cc: John Dlugosz, git
Jeff King wrote:
> On Tue, Apr 21, 2009 at 05:46:16PM -0400, John Dlugosz wrote:
>
>> Immediately after doing a git gc, a git fsck --full reports dangling
>> objects. Is this normal? What does dangling mean, if not those things
>> that gc finds?
>
> gc will leave dangling loose objects for a set expiration time
> (defaulting to two weeks). This makes it safe to run even if there are
> operations in progress that want those dangling objects, but haven't yet
> added a reference to them (as long as said operation takes less than two
> weeks).
>
> You can also end up with dangling objects in packs. When that pack is
> repacked, those objects will be loosened, and then eventually expired
> under the rule mentioned above. However, I believe gc will not always
> repack old packs; it will make new packs until you have a lot of packs,
> and then combine them all (at least that is what "gc --auto" will do; I
> don't recall whether just "git gc" follows the same rule).
'git gc' (without --auto) always creates one new pack.
I've often wondered whether a plain 'git gc' should adopt the behavior
of --auto with respect to the number of packs. If there were few packs,
then 'git gc' would do an incremental repack, rather than a 'repack -A -d -l'.
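That is, roughly the difference between (options per git-repack(1)):

    git repack -d -l      # incremental: pack loose objects into a new pack,
                          # leaving existing packs alone
    git repack -A -d -l   # consolidate everything into one pack, loosening
                          # unreachable objects instead of dropping them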
I'm still on the fence about it. I think 'git gc' is supposed to be a
do-the-right-thing command, so in that sense I think it would be good
behavior and it would probably be what most less experienced users want.
But 'git gc' is also used by experienced users who may expect the
historical behavior and may _want_ everything packed into one pack. It
could also be that those experienced users are in practice the only
ones who run 'git gc' by hand, and everyone else just relies on
'git gc --auto' being invoked automatically.
Not sure.
-brandon
* Re: dangling commits and blobs: is this normal?
2009-04-22 16:53 ` Brandon Casey
@ 2009-04-22 17:39 ` Nicolas Pitre
2009-04-22 18:15 ` Matthieu Moy
2009-04-22 19:26 ` Brandon Casey
0 siblings, 2 replies; 21+ messages in thread
From: Nicolas Pitre @ 2009-04-22 17:39 UTC (permalink / raw)
To: Brandon Casey; +Cc: Jeff King, John Dlugosz, git
On Wed, 22 Apr 2009, Brandon Casey wrote:
> Jeff King wrote:
> > On Tue, Apr 21, 2009 at 05:46:16PM -0400, John Dlugosz wrote:
> >
> >> Immediately after doing a git gc, a git fsck --full reports dangling
> >> objects. Is this normal? What does dangling mean, if not those things
> >> that gc finds?
> >
> > gc will leave dangling loose objects for a set expiration time
> > (defaulting to two weeks). This makes it safe to run even if there are
> > operations in progress that want those dangling objects, but haven't yet
> > added a reference to them (as long as said operation takes less than two
> > weeks).
> >
> > You can also end up with dangling objects in packs. When that pack is
> > repacked, those objects will be loosened, and then eventually expired
> > under the rule mentioned above. However, I believe gc will not always
> > repack old packs; it will make new packs until you have a lot of packs,
> > and then combine them all (at least that is what "gc --auto" will do; I
> > don't recall whether just "git gc" follows the same rule).
>
> 'git gc' (without --auto) always creates one new pack.
>
> I've often wondered whether a plain 'git gc' should adopt the behavior
> of --auto with respect to the number of packs. If there were few packs,
> then 'git gc' would do an incremental repack, rather than a 'repack -A -d -l'.
Why so? Having fewer packs is always a good thing. Having only one
pack is of course the optimal situation. The --auto version doesn't do
it in the hope of being lighter and less noticeable to the user.
However, a user manually invoking gc should expect that some actual
work will happen. If you don't want the whole repo read from one pack
just to be written out into another (say, the repo is huge and waiting
for the I/O isn't worth it), then just mark such a pack with a .keep
file.
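Concretely (the pack name is illustrative; use the actual pack-*.pack
file in your repository):

    touch .git/objects/pack/pack-<sha1>.keep   # repack/gc now leave this pack alone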
Nicolas
* Re: dangling commits and blobs: is this normal?
2009-04-22 17:39 ` Nicolas Pitre
@ 2009-04-22 18:15 ` Matthieu Moy
2009-04-22 19:08 ` Jeff King
2009-04-22 19:14 ` Nicolas Pitre
2009-04-22 19:26 ` Brandon Casey
1 sibling, 2 replies; 21+ messages in thread
From: Matthieu Moy @ 2009-04-22 18:15 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Brandon Casey, Jeff King, John Dlugosz, git
Nicolas Pitre <nico@cam.org> writes:
> Why so? Having fewer packs is always a good thing. Having only one
> pack is of course the optimal situation.
Good and optimal wrt Git, but not wrt an incremental backup system for
example. I have a "git gc" running daily in a cron job in each of my
repositories, but to be nice to my sysadmin, I don't want to rewrite
tens of megabytes of data each night just because I committed a
two-line patch somewhere.
--
Matthieu
* Re: dangling commits and blobs: is this normal?
2009-04-22 18:15 ` Matthieu Moy
@ 2009-04-22 19:08 ` Jeff King
2009-04-22 19:45 ` Brandon Casey
2009-04-23 11:51 ` Matthieu Moy
2009-04-22 19:14 ` Nicolas Pitre
1 sibling, 2 replies; 21+ messages in thread
From: Jeff King @ 2009-04-22 19:08 UTC (permalink / raw)
To: Matthieu Moy; +Cc: Nicolas Pitre, Brandon Casey, John Dlugosz, git
On Wed, Apr 22, 2009 at 08:15:56PM +0200, Matthieu Moy wrote:
> Nicolas Pitre <nico@cam.org> writes:
>
> > Why so? Having fewer packs is always a good thing. Having only one
> > pack is of course the optimal situation.
>
> Good and optimal wrt Git, but not wrt an incremental backup system for
> example. I have a "git gc" running daily in a cron job in each of my
> repositories, but to be nice to my sysadmin, I don't want to rewrite
> tens of megabytes of data each night just because I committed a
> two-line patch somewhere.
You can mark your "big" pack with a .keep, then do your nightly gc as
usual. You'll have a smaller pack being rewritten each night. When it
gets big enough, drop the .keep, gc, and then .keep the new pack.
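Sketched out (the pack names are placeholders for the real files):

    cd repo
    touch .git/objects/pack/pack-big.keep    # freeze the big pack
    git gc                                   # nightly: only the small pack is rewritten
    # ...later, once the unkept pack has grown big enough:
    rm .git/objects/pack/pack-big.keep
    git gc                                   # consolidates everything into one new pack
    touch .git/objects/pack/pack-new.keep    # freeze that one in turn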
Yes, it's a bit more work for you, but having "git gc" optimize by
default for git's performance seems to be the only sensible course.
Your idea of what is "big enough" above is somewhat outside the realm of
git, so you have to pay the price to specify it by tweaking the
keep-files.
-Peff
* Re: dangling commits and blobs: is this normal?
2009-04-22 18:15 ` Matthieu Moy
2009-04-22 19:08 ` Jeff King
@ 2009-04-22 19:14 ` Nicolas Pitre
1 sibling, 0 replies; 21+ messages in thread
From: Nicolas Pitre @ 2009-04-22 19:14 UTC (permalink / raw)
To: Matthieu Moy; +Cc: Brandon Casey, Jeff King, John Dlugosz, git
On Wed, 22 Apr 2009, Matthieu Moy wrote:
> Nicolas Pitre <nico@cam.org> writes:
>
> > Why so? Having fewer packs is always a good thing. Having only one
> > pack is of course the optimal situation.
>
> Good and optimal wrt Git, but not wrt an incremental backup system for
> example.
It goes without saying that git should optimize for its own usage by
default, and not for a particular backup system.
> I have a "git gc" running daily in a cron job in each of my
> repositories, but to be nice to my sysadmin, I don't want to rewrite
> tens of megabytes of data each night just because I committed a
> two-line patch somewhere.
Just add a .keep file alongside your .pack file after repacking.
Nicolas
* Re: dangling commits and blobs: is this normal?
2009-04-22 17:39 ` Nicolas Pitre
2009-04-22 18:15 ` Matthieu Moy
@ 2009-04-22 19:26 ` Brandon Casey
2009-04-22 20:00 ` Nicolas Pitre
1 sibling, 1 reply; 21+ messages in thread
From: Brandon Casey @ 2009-04-22 19:26 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Jeff King, John Dlugosz, git
Nicolas Pitre wrote:
> On Wed, 22 Apr 2009, Brandon Casey wrote:
>> I've often wondered whether a plain 'git gc' should adopt the behavior
>> of --auto with respect to the number of packs. If there were few packs,
>> then 'git gc' would do an incremental repack, rather than a 'repack -A -d -l'.
>
> Why so? Having fewer packs is always a good thing. Having only one
> pack is of course the optimal situation. The --auto version doesn't do
> it in the hope of being lighter and less noticeable to the user.
In this case too, the only reason to avoid packing everything into one
pack would be speed. I recall reading complaints or surprise about gc
repacking all packs into one, so I'm only trying to think about how to
match program behavior with user expectations. gc does a lot already,
and even Jeff wasn't sure what to expect from 'git gc' with respect to
packs. Possibly an acceptable tradeoff between speed and optimal packing
would be to adopt the --auto behavior for deciding when to pass '-A' to
repack.
> However, a user manually invoking gc should expect that some actual
> work will happen. If you don't want the whole repo read from one pack
> just to be written out into another (say, the repo is huge and waiting
> for the I/O isn't worth it), then just mark such a pack with a .keep
> file.
That's true, but a user who knows about the .keep mechanism would also
not be afraid to run 'repack -d -l' (I'm ignoring the other operations
of gc).
-brandon
* Re: dangling commits and blobs: is this normal?
2009-04-22 19:08 ` Jeff King
@ 2009-04-22 19:45 ` Brandon Casey
2009-04-22 19:58 ` Jeff King
2009-04-22 20:07 ` Nicolas Pitre
2009-04-23 11:51 ` Matthieu Moy
1 sibling, 2 replies; 21+ messages in thread
From: Brandon Casey @ 2009-04-22 19:45 UTC (permalink / raw)
To: Jeff King; +Cc: Matthieu Moy, Nicolas Pitre, John Dlugosz, git
Jeff King wrote:
> On Wed, Apr 22, 2009 at 08:15:56PM +0200, Matthieu Moy wrote:
>
>> Nicolas Pitre <nico@cam.org> writes:
>>
>>> Why so? Having fewer packs is always a good thing. Having only one
>>> pack is of course the optimal situation.
>> Good and optimal wrt Git, but not wrt an incremental backup system for
>> example. I have a "git gc" running daily in a cron job in each of my
>> repositories, but to be nice to my sysadmin, I don't want to rewrite
>> tens of megabytes of data each night just because I committed a
>> two-line patch somewhere.
>
> You can mark your "big" pack with a .keep, then do your nightly gc as
> usual. You'll have a smaller pack being rewritten each night. When it
> gets big enough, drop the .keep, gc, and then .keep the new pack.
>
> Yes, it's a bit more work for you, but having "git gc" optimize by
> default for git's performance seems to be the only sensible course.
> Your idea of what is "big enough" above is somewhat outside the realm of
> git, so you have to pay the price to specify it by tweaking the
> keep-files.
But isn't git-gc supposed to be the "high-level" command that just does
the right thing? It doesn't seem to me to be outside the scope of this
command to make a decision about trading off speed/io for optimal repo
layout. In fact, it does do this already. The default window, depth and
compression settings are chosen to be "good enough", not to produce the
absolute optimum repo.
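For reference, the knobs in question (defaults as documented around
this time, so treat the numbers as approximate):

    git config pack.window 10      # delta search window (the documented default)
    git config pack.depth 50       # maximum delta-chain length (the documented default)
    git repack -a -d -f --window=250 --depth=50   # much slower, tighter pack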
I'm just pointing out that everything is a tradeoff. So I think saying
something like "gc must optimize for git's performance" is not entirely
accurate. We make tradeoffs now. Other tradeoffs may be helpful.
Also, don't interpret my comments as me being convinced that a change to
gc should be made. It's a trivial patch, but I'm not yet certain one
way or the other.
-brandon
* Re: dangling commits and blobs: is this normal?
2009-04-22 19:45 ` Brandon Casey
@ 2009-04-22 19:58 ` Jeff King
2009-04-22 20:07 ` Nicolas Pitre
1 sibling, 0 replies; 21+ messages in thread
From: Jeff King @ 2009-04-22 19:58 UTC (permalink / raw)
To: Brandon Casey; +Cc: Matthieu Moy, Nicolas Pitre, John Dlugosz, git
On Wed, Apr 22, 2009 at 02:45:29PM -0500, Brandon Casey wrote:
> > Yes, it's a bit more work for you, but having "git gc" optimize by
> > default for git's performance seems to be the only sensible course.
> > Your idea of what is "big enough" above is somewhat outside the realm of
> > git, so you have to pay the price to specify it by tweaking the
> > keep-files.
>
> But isn't git-gc supposed to be the "high-level" command that just does
> the right thing? It doesn't seem to me to be outside the scope of this
> command to make a decision about trading off speed/io for optimal repo
> layout. In fact, it does do this already. The default window, depth and
> compression settings are chosen to be "good enough", not to produce the
> absolute optimum repo.
>
> I'm just pointing out that everything is a tradeoff. So I think saying
> something like "gc must optimize for git's performance" is not entirely
> accurate. We make tradeoffs now. Other tradeoffs may be helpful.
Sure, but my point was that git doesn't even know _how_ to make that
tradeoff. It doesn't know what you consider a reasonable size of backup
for your incremental backups, how often you might want to roll over your
keep files, how often you expect to commit and how big the commits will
be, etc.
So it does the most reasonable thing, which is to optimize for git
itself based on what it does know. If there is any improvement to be
made, it is probably to provide a simpler way for the user to specify
that external knowledge to git (because tweaking .keep files really is
unnecessarily complex for Matthieu's scenario). And maybe that is just
adding a config variable analogous to "gc.autopacklimit" to be used for
regular gc, but one that would default to 0 (i.e., default to the
current behavior of always repacking).
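Something like this, say (hypothetical -- no such variable actually
exists):

    # if >0, plain 'git gc' would repack incrementally until the repo
    # holds more than this many packs; 0 keeps today's always-repack rule
    git config gc.packlimit 0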
But I don't think it makes sense to change the default.
-Peff
* Re: dangling commits and blobs: is this normal?
2009-04-22 19:26 ` Brandon Casey
@ 2009-04-22 20:00 ` Nicolas Pitre
2009-04-22 20:05 ` Jeff King
0 siblings, 1 reply; 21+ messages in thread
From: Nicolas Pitre @ 2009-04-22 20:00 UTC (permalink / raw)
To: Brandon Casey; +Cc: Jeff King, John Dlugosz, git
On Wed, 22 Apr 2009, Brandon Casey wrote:
> Nicolas Pitre wrote:
> > On Wed, 22 Apr 2009, Brandon Casey wrote:
>
> >> I've often wondered whether a plain 'git gc' should adopt the behavior
> >> of --auto with respect to the number of packs. If there were few packs,
> >> then 'git gc' would do an incremental repack, rather than a 'repack -A -d -l'.
> >
> > Why so? Having fewer packs is always a good thing. Having only one
> > pack is of course the optimal situation. The --auto version doesn't do
> > it in the hope of being lighter and less noticeable to the user.
>
> The only reason for avoiding packing all packs into one would be speed in
> this case also. I recall reading complaints or surprise about gc
> repacking all packs into one, so I'm only trying to think about how to
> match program behavior with user expectations.
It's the user's expectations that need adjusting, then. Making a single
pack is indeed the job of an explicit gc invocation.
> gc does a lot already, and even Jeff wasn't sure what to expect from
> 'git gc' with respect to packs. Possibly an acceptable tradeoff
> between speed and optimal packing would be to adopt the --auto
> behavior for deciding when to pass '-A' to repack.
And what would be the point of manually running 'git gc' then, given
that 'git gc --auto' is already invoked automatically after most
commit-creating commands?
I mean, if you consider an explicit 'git gc' too slow, then simply
wait until you can spare the time, if ever. It is not as if a non-gc'd
repository suddenly becomes non-functional.
WRT tradeoffs, the current behavior is already a pretty good compromise
between speed and optimal packing, the latter implying passing -f to
'git repack', which is far, far slower.
Nicolas
* Re: dangling commits and blobs: is this normal?
2009-04-22 20:00 ` Nicolas Pitre
@ 2009-04-22 20:05 ` Jeff King
2009-04-22 20:11 ` Nicolas Pitre
2009-04-23 17:43 ` Geert Bosch
0 siblings, 2 replies; 21+ messages in thread
From: Jeff King @ 2009-04-22 20:05 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Brandon Casey, John Dlugosz, git
On Wed, Apr 22, 2009 at 04:00:06PM -0400, Nicolas Pitre wrote:
> And what would be the point of manually running 'git gc' then, given
> that 'git gc --auto' is already invoked automatically after most
> commit-creating commands?
>
> I mean, if you consider an explicit 'git gc' too slow, then simply
> wait until you can spare the time, if ever. It is not as if a
> non-gc'd repository suddenly becomes non-functional.
The other tradeoff, mentioned by Matthieu, is not about speed, but about
rollover of files on disk. I think he would be in favor of a less
optimal pack setup if it meant rewriting the largest packfile less
frequently.
However, it may be reasonable to suggest that he just not manually "gc"
then. If he is not generating enough commits to warrant an auto-gc, then
he is probably not losing much by having loose objects. And if he is,
then auto-gc is already taking care of it.
-Peff
* Re: dangling commits and blobs: is this normal?
2009-04-22 19:45 ` Brandon Casey
2009-04-22 19:58 ` Jeff King
@ 2009-04-22 20:07 ` Nicolas Pitre
1 sibling, 0 replies; 21+ messages in thread
From: Nicolas Pitre @ 2009-04-22 20:07 UTC (permalink / raw)
To: Brandon Casey; +Cc: Jeff King, Matthieu Moy, John Dlugosz, git
On Wed, 22 Apr 2009, Brandon Casey wrote:
> But isn't git-gc supposed to be the "high-level" command that just does
> the right thing? It doesn't seem to me to be outside the scope of this
> command to make a decision about trading off speed/io for optimal repo
> layout. In fact, it does do this already. The default window, depth and
> compression settings are chosen to be "good enough", not to produce the
> absolute optimum repo.
Exactly.
> I'm just pointing out that everything is a tradeoff. So I think saying
> something like "gc must optimize for git's performance" is not entirely
> accurate. We make tradeoffs now. Other tradeoffs may be helpful.
Git makes tradeoffs for itself. Trying to optimize by _default_ for
some random backup system, or any other environmental component not
involved in git usage, is completely silly.
> Also, don't interpret my comments as me being convinced that a change to
> gc should be made. It's a trivial patch, but I'm not yet certain one
> way or the other.
Feel free to interpret my replies as me being certain that no such
change should be made.
Nicolas
* Re: dangling commits and blobs: is this normal?
2009-04-22 20:05 ` Jeff King
@ 2009-04-22 20:11 ` Nicolas Pitre
2009-04-23 17:43 ` Geert Bosch
1 sibling, 0 replies; 21+ messages in thread
From: Nicolas Pitre @ 2009-04-22 20:11 UTC (permalink / raw)
To: Jeff King; +Cc: Brandon Casey, John Dlugosz, git
On Wed, 22 Apr 2009, Jeff King wrote:
> On Wed, Apr 22, 2009 at 04:00:06PM -0400, Nicolas Pitre wrote:
>
> > And what would be the point of manually running 'git gc' then, given
> > that 'git gc --auto' is already invoked automatically after most
> > commit-creating commands?
> >
> > I mean, if you consider an explicit 'git gc' too slow, then simply
> > wait until you can spare the time, if ever. It is not as if a
> > non-gc'd repository suddenly becomes non-functional.
>
> The other tradeoff, mentioned by Matthieu, is not about speed, but about
> rollover of files on disk. I think he would be in favor of a less
> optimal pack setup if it meant rewriting the largest packfile less
> frequently.
>
> However, it may be reasonable to suggest that he just not manually "gc"
> then. If he is not generating enough commits to warrant an auto-gc, then
> he is probably not losing much by having loose objects. And if he is,
> then auto-gc is already taking care of it.
My point exactly.
And those people savvy enough to automate 'git gc' nightly should be
able to cope with .keep files as well.
Nicolas
* RE: dangling commits and blobs: is this normal?
2009-04-22 15:27 ` Jeff King
2009-04-22 16:53 ` Brandon Casey
@ 2009-04-22 20:15 ` John Dlugosz
1 sibling, 0 replies; 21+ messages in thread
From: John Dlugosz @ 2009-04-22 20:15 UTC (permalink / raw)
To: Jeff King; +Cc: git
> -----Original Message-----
> From: Jeff King [mailto:peff@peff.net]
> Sent: Wednesday, April 22, 2009 10:27 AM
> To: John Dlugosz
> Cc: git@vger.kernel.org
> Subject: Re: dangling commits and blobs: is this normal?
>
>
> gc will leave dangling loose objects for a set expiration time
> (defaulting to two weeks). This makes it safe to run even if there
> are operations in progress that want those dangling objects, but
> haven't yet added a reference to them (as long as said operation
> takes less than two weeks).
Ah, very enlightening. I see: it's not just reflog stuff (which gc should know are root entry points and not complain about), it really does leave uncollected garbage on purpose, in case something is in progress.
--John
* Re: dangling commits and blobs: is this normal?
2009-04-22 19:08 ` Jeff King
2009-04-22 19:45 ` Brandon Casey
@ 2009-04-23 11:51 ` Matthieu Moy
1 sibling, 0 replies; 21+ messages in thread
From: Matthieu Moy @ 2009-04-23 11:51 UTC (permalink / raw)
To: Jeff King; +Cc: Nicolas Pitre, Brandon Casey, John Dlugosz, git
Jeff King <peff@peff.net> writes:
> On Wed, Apr 22, 2009 at 08:15:56PM +0200, Matthieu Moy wrote:
>
>> Nicolas Pitre <nico@cam.org> writes:
>>
>> > Why so? Having fewer packs is always a good thing. Having only one
>> > pack is of course the optimal situation.
>>
>> Good and optimal wrt Git, but not wrt an incremental backup system for
>> example. I have a "git gc" running daily in a cron job in each of my
>> repositories, but to be nice to my sysadmin, I don't want to rewrite
>> tens of megabytes of data each night just because I committed a
>> two-line patch somewhere.
>
> You can mark your "big" pack with a .keep, then do your nightly gc as
> usual. You'll have a smaller pack being rewritten each night. When it
> gets big enough, drop the .keep, gc, and then .keep the new pack.
(thanks, I wasn't aware of this .keep thing before reading this
thread)
> Yes, it's a bit more work for you, but having "git gc" optimize by
> default for git's performance seems to be the only sensible course.
Sure. Sorry if my message read as "git gc does the wrong thing"; I was
just mentioning that it's not optimal with respect to everything.
--
Matthieu
* Re: dangling commits and blobs: is this normal?
2009-04-22 20:05 ` Jeff King
2009-04-22 20:11 ` Nicolas Pitre
@ 2009-04-23 17:43 ` Geert Bosch
2009-04-23 17:56 ` Shawn O. Pearce
2009-04-23 18:51 ` Nicolas Pitre
1 sibling, 2 replies; 21+ messages in thread
From: Geert Bosch @ 2009-04-23 17:43 UTC (permalink / raw)
To: Jeff King; +Cc: Nicolas Pitre, Brandon Casey, John Dlugosz, git
On Apr 22, 2009, at 16:05, Jeff King wrote:
> The other tradeoff, mentioned by Matthieu, is not about speed, but
> about rollover of files on disk. I think he would be in favor of a
> less optimal pack setup if it meant rewriting the largest packfile
> less frequently.
>
> However, it may be reasonable to suggest that he just not manually
> "gc" then. If he is not generating enough commits to warrant an
> auto-gc, then he is probably not losing much by having loose objects.
> And if he is, then auto-gc is already taking care of it.
For large repositories with lots of large files, git spends too much
time copying large packs for relatively little gain. This is obvious
when you include a few dozen large objects in any repository.
Currently, there is no limit to the number of times this data may be
copied. In particular, the average amount of I/O needed for changes of
size X depends linearly on the size of the total repository. So, the
mere presence of a couple of large objects has a large distributed
overhead.
Wouldn't it be better to have a maximum of N packs, named
pack_0 .. pack_(N - 1), in the repository, with each pack_i being
between 2^i and 2^(i+1)-1 bytes large? We could even dispense
completely with loose objects and instead have each git operation
create a single new pack.
Then the repacking rule simply becomes: if a new pack_i would
overwrite one of the same name, both packs are merged into a new
pack_(i+1).
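A toy model of that rule, just to make the cascade concrete (this
simulates the proposal only; it is not how git behaves):

    #!/usr/bin/env bash
    # Toy model of the proposed scheme: slot i holds pack_i, whose size
    # must lie in [2^i, 2^(i+1)). A new pack whose slot is occupied is
    # merged with the occupant into pack_(i+1), cascading upward.
    declare -A slot

    add_pack() {
        local size=$1 i=0
        while (( size >= (1 << (i + 1)) )); do i=$((i + 1)); done  # i = floor(log2(size))
        while [[ -n ${slot[$i]:-} ]]; do      # slot taken: merge and promote
            size=$((size + slot[$i]))
            unset "slot[$i]"
            i=$((i + 1))
        done
        slot[$i]=$size
    }

    for s in 700 900 650 3000 800; do add_pack "$s"; done
    for i in $(printf '%s\n' "${!slot[@]}" | sort -n); do
        echo "pack_$i: ${slot[$i]} bytes"
    done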
To analyze performance, let's assume the worst case, where the size of
a pack is equal to the expanded size of all objects contained in it,
and new packs only have unique objects. With these assumptions, an
object residing in pack_i can only be merged into a pack_j with j > i.
So, if any repository of size n has k objects, the maximum total I/O
required to create the repository (counting all operations in its
history) is O(n log k). In the current situation, the number of repacks
required is linear in the number of objects, so the total work required
is more like O(n k).
While I understand that the above is a gross simplification, and actual
performance is dictated by packing efficiency and constant factors
rather than asymptotic performance, I think the general idea of
limiting the number of packs in the way described is useful and will
lead to significant speedups, especially during large imports that
currently require frequent repacking of the entire repository.
* Re: dangling commits and blobs: is this normal?
2009-04-23 17:43 ` Geert Bosch
@ 2009-04-23 17:56 ` Shawn O. Pearce
2009-04-23 18:10 ` Geert Bosch
2009-04-23 18:51 ` Nicolas Pitre
1 sibling, 1 reply; 21+ messages in thread
From: Shawn O. Pearce @ 2009-04-23 17:56 UTC (permalink / raw)
To: Geert Bosch; +Cc: Jeff King, Nicolas Pitre, Brandon Casey, John Dlugosz, git
Geert Bosch <bosch@adacore.com> wrote:
> significant speedups, especially during large imports that currently
> require frequent repacking of the entire repository.
Large imports should be using fast-import, and then issue a single
massive `git repack -f --window=250 --depth=50` or some such repack
command after the entire import is complete.
If your favorite import tool (*cough* git-svn *cough*) can't use
fast-import, and you are importing a large enough repository that
this matters to you, use another importer that can use fast-import.
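In other words, the general shape is (the exporter is a placeholder
for whatever tool emits a fast-import stream):

    your-exporter --emit-fast-import-stream | git fast-import
    git repack -a -d -f --window=250 --depth=50   # one big optimizing repack at the end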
--
Shawn.
* Re: dangling commits and blobs: is this normal?
2009-04-23 17:56 ` Shawn O. Pearce
@ 2009-04-23 18:10 ` Geert Bosch
2009-04-23 18:17 ` Matthias Andree
0 siblings, 1 reply; 21+ messages in thread
From: Geert Bosch @ 2009-04-23 18:10 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Jeff King, Nicolas Pitre, Brandon Casey, John Dlugosz, git
On Apr 23, 2009, at 13:56, Shawn O. Pearce wrote:
> If your favorite import tool (*cough* git-svn *cough*) can't use
> fast-import, and you are importing a large enough repository that
> this matters to you, use another importer that can use fast-import.
How did you guess? :) You're right of course, except that I can't
use fast-import AFAIK. The issue is also more general, as the
same scenario of adding new objects and repacking occurs
outside the context of git-svn.
-Geert
* Re: dangling commits and blobs: is this normal?
2009-04-23 18:10 ` Geert Bosch
@ 2009-04-23 18:17 ` Matthias Andree
0 siblings, 0 replies; 21+ messages in thread
From: Matthias Andree @ 2009-04-23 18:17 UTC (permalink / raw)
To: Geert Bosch, Shawn O. Pearce
Cc: Jeff King, Nicolas Pitre, Brandon Casey, John Dlugosz, git
On 23.04.2009 at 20:10, Geert Bosch <bosch@adacore.com> wrote:
>
> On Apr 23, 2009, at 13:56, Shawn O. Pearce wrote:
>
>> If your favorite import tool (*cough* git-svn *cough*) can't use
>> fast-import, and you are importing a large enough repository that
>> this matters to you, use another importer that can use fast-import.
>
> How did you guess? :) You're right of course, except that I can't
> use fast-import AFAIK. The issue is also more general, as the
> same scenario of adding new objects and repacking occurs
> outside the context of git-svn.
If you can't use fast-import for lack of access to the SVN repo, svnsync
may help with that part. Or easier: ask the admin to upload a dump or
provide one for download... :-)
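Roughly (the URLs are placeholders; svnsync needs a pre-revprop-change
hook on the mirror that exits 0):

    svnadmin create /srv/svn-mirror
    printf '#!/bin/sh\nexit 0\n' > /srv/svn-mirror/hooks/pre-revprop-change
    chmod +x /srv/svn-mirror/hooks/pre-revprop-change
    svnsync initialize file:///srv/svn-mirror http://svn.example.org/repo
    svnsync synchronize file:///srv/svn-mirror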
--
Matthias Andree
* Re: dangling commits and blobs: is this normal?
2009-04-23 17:43 ` Geert Bosch
2009-04-23 17:56 ` Shawn O. Pearce
@ 2009-04-23 18:51 ` Nicolas Pitre
1 sibling, 0 replies; 21+ messages in thread
From: Nicolas Pitre @ 2009-04-23 18:51 UTC (permalink / raw)
To: Geert Bosch; +Cc: Jeff King, Brandon Casey, John Dlugosz, git
On Thu, 23 Apr 2009, Geert Bosch wrote:
>
> On Apr 22, 2009, at 16:05, Jeff King wrote:
> > The other tradeoff, mentioned by Matthieu, is not about speed, but about
> > rollover of files on disk. I think he would be in favor of a less
> > optimal pack setup if it meant rewriting the largest packfile less
> > frequently.
> >
> > However, it may be reasonable to suggest that he just not manually "gc"
> > then. If he is not generating enough commits to warrant an auto-gc, then
> > he is probably not losing much by having loose objects. And if he is,
> > then auto-gc is already taking care of it.
>
> For large repositories with lots of large files, git spends too much
> time copying large packs for relatively little gain. This is obvious when
> you include a few dozen large objects in any repository.
> Currently, there is no limit to the number of times this data may
> be copied. In particular, the average amount of I/O needed for
> changes of size X depends linearly on the size of the total repository.
> So, the mere presence of a couple of large objects has a large
> distributed overhead.
You can put a limit on the number of times this data is copied, and even
set the limit to zero. Just add a .keep file next to your .pack file and
that data will be set in stone. Any further repack will consider only
those newly added objects you may have.
> Wouldn't it be better to have a maximum of N packs, named
> pack_0 .. pack_(N - 1), in the repository with each pack_i being
> between 2^i and 2^(i+1)-1 bytes large? We could even dispense
> completely with loose objects and instead have each git operation
> create a single new pack.
I already suggested that for large enough objects. For small objects
this makes no sense, as you may accumulate too many packs, and each one
would need to be opened in order to find out whether it contains the
desired object, whereas currently a loose object needs only a simple
directory lookup.
> number of packs in the way described is useful and will lead to significant
> speedups, especially during large imports that currently require frequent
> repacking of the entire repository.
Others commented on that issue already.
Nicolas