git.vger.kernel.org archive mirror
* What happens when the repository is bigger than gc.autopacklimit * pack.packSizeLimit?
@ 2014-08-27 19:36 Dale R. Worley
  2014-08-27 19:47 ` Jeff King
  2014-08-27 20:52 ` Junio C Hamano
  0 siblings, 2 replies; 7+ messages in thread
From: Dale R. Worley @ 2014-08-27 19:36 UTC (permalink / raw)
  To: git

[Previously sent to the git-users mailing list, but it probably should
be addressed here.]

A number of commands invoke "git gc --auto" to clean up the repository
when there might be a lot of dangling objects and/or there might be
far too many unpacked files.  The manual pages say:

    git gc:
       --auto
           With this option, git gc checks whether any housekeeping is
           required; if not, it exits without performing any work. Some git
           commands run git gc --auto after performing operations that could
           create many loose objects.

           Housekeeping is required if there are too many loose objects or too
           many packs in the repository. If the number of loose objects
           exceeds the value of the gc.auto configuration variable, then all
           loose objects are combined into a single pack using git repack -d
           -l. Setting the value of gc.auto to 0 disables automatic packing of
           loose objects.

    git config:
       gc.autopacklimit
           When there are more than this many packs that are not marked with
           *.keep file in the repository, git gc --auto consolidates them into
           one larger pack. The default value is 50. Setting this to 0
           disables it.

What happens when the amount of data in the repository exceeds
gc.autopacklimit * pack.packSizeLimit?  According to the
documentation, "git gc --auto" will then *always* repack the
repository, whether it needs it or not, because the data will require
more than gc.autopacklimit pack files.

And an experiment confirms that this is what happens.  I have a
repository with pack.packSizeLimit = 99m and 104 pack files, and even
immediately after "git gc" has finished, "git gc --auto" will run
git-repack again.

Looking at the code, I see:

builtin/gc.c:
static int too_many_packs(void)
{
	struct packed_git *p;
	int cnt;

	if (gc_auto_pack_limit <= 0)
		return 0;

	prepare_packed_git();
	for (cnt = 0, p = packed_git; p; p = p->next) {
		if (!p->pack_local)
			continue;
		if (p->pack_keep)
			continue;
		/*
		 * Perhaps check the size of the pack and count only
		 * very small ones here?
		 */
		cnt++;
	}
	return gc_auto_pack_limit <= cnt;
}

Yes, perhaps you *should* check the size of the pack!

What is a good strategy for making this function behave as we want it to?

Dale

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: What happens when the repository is bigger than gc.autopacklimit * pack.packSizeLimit?
  2014-08-27 19:36 What happens when the repository is bigger than gc.autopacklimit * pack.packSizeLimit? Dale R. Worley
@ 2014-08-27 19:47 ` Jeff King
  2014-08-29 15:38   ` Dale R. Worley
  2014-08-27 20:52 ` Junio C Hamano
  1 sibling, 1 reply; 7+ messages in thread
From: Jeff King @ 2014-08-27 19:47 UTC (permalink / raw)
  To: Dale R. Worley; +Cc: git

On Wed, Aug 27, 2014 at 03:36:53PM -0400, Dale R. Worley wrote:

> And it appears from an experiment that this is what happens.  I have a
> repository with pack.packSizeLimit = 99m, and there are 104 pack
> files, and even when "git gc" is done, if I do "git gc --auto", it
> will do git-repack again.

I agree that "gc --auto" could be smarter here, but I have to wonder:
why are you setting the packsize limit to 99m in the first place? It is
generally much more efficient to place everything in a single pack.
There are more delta opportunities, fewer base objects, lookup is faster
(we binary search each pack index, but linearly move through the list of
indices), and it is required for advanced techniques like bitmaps.
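As a rough illustration of that lookup cost, here is a toy model (the
names toy_pack and find_object are made up for this sketch, not git's
actual data structures): each pack index is a sorted array we binary
search, and the packs themselves are walked linearly, so a lookup
costs O(P log N) over P packs instead of O(log(P*N)) for one pack.

```c
#include <stddef.h>

/* Toy stand-in for a pack and its .idx: a sorted array of object ids. */
struct toy_pack {
	const int *ids;         /* sorted object ids */
	size_t nr;
	struct toy_pack *next;
};

/* Walk the pack list linearly; binary search within each index. */
int find_object(struct toy_pack *packs, int id)
{
	for (struct toy_pack *p = packs; p; p = p->next) {
		size_t lo = 0, hi = p->nr;
		while (lo < hi) {
			size_t mid = lo + (hi - lo) / 2;
			if (p->ids[mid] == id)
				return 1;
			if (p->ids[mid] < id)
				lo = mid + 1;
			else
				hi = mid;
		}
	}
	return 0;
}
```

With 104 packs, every miss in an early pack costs a full binary search
before moving on to the next, which is the linear factor Peff is
describing.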

-Peff


* Re: What happens when the repository is bigger than gc.autopacklimit * pack.packSizeLimit?
  2014-08-27 19:36 What happens when the repository is bigger than gc.autopacklimit * pack.packSizeLimit? Dale R. Worley
  2014-08-27 19:47 ` Jeff King
@ 2014-08-27 20:52 ` Junio C Hamano
  2014-08-29 15:47   ` Dale R. Worley
  1 sibling, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2014-08-27 20:52 UTC (permalink / raw)
  To: Dale R. Worley; +Cc: git

worley@alum.mit.edu (Dale R. Worley) writes:

> builtin/gc.c:
> static int too_many_packs(void)
> {
> 	struct packed_git *p;
> 	int cnt;
>
> 	if (gc_auto_pack_limit <= 0)
> 		return 0;
>
> 	prepare_packed_git();
> 	for (cnt = 0, p = packed_git; p; p = p->next) {
> 		if (!p->pack_local)
> 			continue;
> 		if (p->pack_keep)
> 			continue;
> 		/*
> 		 * Perhaps check the size of the pack and count only
> 		 * very small ones here?
> 		 */
> 		cnt++;
> 	}
> 	return gc_auto_pack_limit <= cnt;
> }
>
> Yes, perhaps you *should* check the size of the pack!
>
> What is a good strategy for making this function behave as we want it to?

Whoever decides the details of "as we want it to" gets to decide
;-).

I think what we want is a mode where we repack only loose objects
and "small" packs by concatenating them into a single "large" one
(with deduping of base objects, the total becomes smaller than the
sum), while leaving existing "large" ones alone.  Daily repacking
would then just coalesce new objects into a "current" pack that grows
gradually; at some point it stops growing and joins the longer-term
"large" ones, until a full gc is done to optimize the overall history
traversal, or something.

But if your definition of the boundary between "small" and "large"
is unreasonably low (and/or your definition of "too many" is
unreasonably small), you will always have the problem you found.


* Re: What happens when the repository is bigger than gc.autopacklimit * pack.packSizeLimit?
  2014-08-27 19:47 ` Jeff King
@ 2014-08-29 15:38   ` Dale R. Worley
  2014-08-29 18:47     ` Jeff King
  0 siblings, 1 reply; 7+ messages in thread
From: Dale R. Worley @ 2014-08-29 15:38 UTC (permalink / raw)
  To: Jeff King; +Cc: git

> From: Jeff King <peff@peff.net>

> why are you setting the packsize limit to 99m in the first place?

I want to copy the Git repository to box.com as a backup measure, and
my account on box.com limits files to 100 MB.

> There are more delta opportunities

In this repository, only the smallest files are text files; the bulk
of the files are executable binaries.  So I've set
core.bigFileThreshold to 10k to stop Git from attempting
delta-compression of the binaries.  That makes the repository slightly
larger, but it dramatically speeds the repacking process.

Dale


* Re: What happens when the repository is bigger than gc.autopacklimit * pack.packSizeLimit?
  2014-08-27 20:52 ` Junio C Hamano
@ 2014-08-29 15:47   ` Dale R. Worley
  0 siblings, 0 replies; 7+ messages in thread
From: Dale R. Worley @ 2014-08-29 15:47 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

> From: Junio C Hamano <gitster@pobox.com>

> But if your definition of the boundary between "small" and "large"
> is unreasonably low (and/or your definition of "too many" is
> unreasonably small), you will always have the problem you found.

I would propose that a pack whose size is "close enough" to
packSizeLimit should be assumed to have already been built by
repacking, and shouldn't count against autopacklimit.

That's easy to implement, and has the desirable result that "git gc
--auto" isn't triggered immediately after repacking.

Of course, eventually there will be enough loose objects, and
everything will get repacked (even the "full" packs).  But that will
happen only occasionally.

That does leave open the question of what is "close enough".  Off the
top of my head, a pack which is larger than packSizeLimit minus (the
size limit for files we put in packs) can be considered "full" in this
test.
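As a sketch, that counting rule could look like this (simplified
stand-ins, not a patch against git's real internals: struct pack and
too_many_small_packs are hypothetical names, and the size fields and
limits are passed in explicitly rather than read from config):

```c
#include <stddef.h>

/* Simplified stand-in for git's packed_git; real fields live in cache.h. */
struct pack {
	size_t size;            /* on-disk pack size in bytes */
	int local;
	int keep;
	struct pack *next;
};

/*
 * Count only packs that are not "close enough" to full.  A pack larger
 * than pack_size_limit minus big_file_threshold is assumed to be the
 * product of an earlier repack and is left out of the count.
 */
int too_many_small_packs(struct pack *packs,
			 size_t pack_size_limit,
			 size_t big_file_threshold,
			 int auto_pack_limit)
{
	size_t full_threshold = pack_size_limit - big_file_threshold;
	int cnt = 0;

	if (auto_pack_limit <= 0)
		return 0;
	for (struct pack *p = packs; p; p = p->next) {
		if (!p->local || p->keep)
			continue;
		if (p->size >= full_threshold)
			continue;       /* treat as "full"; don't count it */
		cnt++;
	}
	return auto_pack_limit <= cnt;
}
```

Under this rule, the 104 "full" 99m packs from the earlier experiment
would no longer count toward the limit, and only genuinely small packs
would accumulate toward triggering a repack.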

Then again, maybe the solution is to just set autopacklimit very high,
perhaps even by default -- in real use, eventually the gc.auto test
will be triggered.

Dale


* Re: What happens when the repository is bigger than gc.autopacklimit * pack.packSizeLimit?
  2014-08-29 15:38   ` Dale R. Worley
@ 2014-08-29 18:47     ` Jeff King
  2014-08-29 18:54       ` Dale R. Worley
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff King @ 2014-08-29 18:47 UTC (permalink / raw)
  To: Dale R. Worley; +Cc: git

On Fri, Aug 29, 2014 at 11:38:00AM -0400, Dale R. Worley wrote:

> > From: Jeff King <peff@peff.net>
> 
> > why are you setting the packsize limit to 99m in the first place?
> 
> I want to copy the Git repository to box.com as a backup measure, and
> my account on box.com limits files to 100 MB.

That makes sense, though I question whether packs are really helping you
in the first place. I wonder if you would be better off keeping your
non-delta binaries as loose objects (this would require a new option to
pack-objects and teaching "gc --auto" to ignore these when counting
loose objects, but would be fairly straightforward).

-Peff


* Re: What happens when the repository is bigger than gc.autopacklimit * pack.packSizeLimit?
  2014-08-29 18:47     ` Jeff King
@ 2014-08-29 18:54       ` Dale R. Worley
  0 siblings, 0 replies; 7+ messages in thread
From: Dale R. Worley @ 2014-08-29 18:54 UTC (permalink / raw)
  To: Jeff King; +Cc: git

> From: Jeff King <peff@peff.net>

> That makes sense, though I question whether packs are really helping you
> in the first place. I wonder if you would be better off keep your
> non-delta binaries as loose objects (this would require a new option to
> pack-objects and teaching "gc --auto" to ignore these when counting
> loose objects, but would be fairly straightforward).

Having 40,000 loose objects might be troublesome in its own way.

Dale

