git.vger.kernel.org archive mirror
* git pull & git gc
@ 2015-03-18 13:53 Дилян Палаузов
  2015-03-18 14:16 ` Duy Nguyen
  0 siblings, 1 reply; 21+ messages in thread
From: Дилян Палаузов @ 2015-03-18 13:53 UTC (permalink / raw)
  To: git

Hello,

I have a local folder with the git repository, whose .git/config 
contains:

[remote "origin"]
	url = git://github.com/git/git.git
	fetch = +refs/heads/*:refs/remotes/origin/*

There I run "git pull".

Usually the output is
   Already up to date

but since today it prints
   Auto packing the repository in background for optimum performance.
   See "git help gc" for manual housekeeping.
   Already up-to-date.

and starts in the background a "git gc --auto" process.  This is all 
fine, however, when the "git gc" process finishes, and I do again "git 
pull" I get the same message, as above (git gc is again started).

My understanding is that "git gc" has to be run only occasionally, and 
that once the garbage collection is done it stays done for a while.  In 
the concrete case it is fine that "git pull" starts "git gc" in the 
background and prints a message about it, but when "git pull" runs 
again shortly after the garbage collection was done, there should be 
neither a message nor another "git gc" action.

My system-wide gitconfig contains "[pack] threads = 1".

I have "tar xJf"'ed my local git repository and have put it under
   http://mail.aegee.org/dpa/v/git-repository.tar.xz

The question is:

Why does "git pull" print information about "git gc" every time it is 
invoked today?

I have git 2.3.3, built with "./configure --with-openssl 
--with-libpcre --with-curl --with-expat".

Thanks in advance for your answer

Dilian

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git pull & git gc
  2015-03-18 13:53 git pull & git gc Дилян Палаузов
@ 2015-03-18 14:16 ` Duy Nguyen
  2015-03-18 14:23   ` Дилян Палаузов
  0 siblings, 1 reply; 21+ messages in thread
From: Duy Nguyen @ 2015-03-18 14:16 UTC (permalink / raw)
  To: Дилян Палаузов
  Cc: Git Mailing List

On Wed, Mar 18, 2015 at 8:53 PM, Дилян Палаузов
<dilyan.palauzov@aegee.org> wrote:
> Hello,
>
> I have a local folder with the git-repository (so that its .git/config
> contains ([remote "origin"]\n    url = git://github.com/git/git.git\nfetch =
> +refs/heads/*:refs/remotes/origin/* )
>
> I do there "git pull".
>
> Usually the output is
>   Already up to date
>
> but since today it prints
>   Auto packing the repository in background for optimum performance.
>   See "git help gc" for manual housekeeping.
>   Already up-to-date.
>
> and starts in the background a "git gc --auto" process.  This is all fine,
> however, when the "git gc" process finishes, and I do again "git pull" I get
> the same message, as above (git gc is again started).

So if you do "git gc --auto" now, does it exit immediately or go
through the garbage collection process again (it'll print something)?
What does "git count-objects -v" show?
-- 
Duy


* Re: git pull & git gc
  2015-03-18 14:16 ` Duy Nguyen
@ 2015-03-18 14:23   ` Дилян Палаузов
  2015-03-18 14:33     ` Duy Nguyen
  0 siblings, 1 reply; 21+ messages in thread
From: Дилян Палаузов @ 2015-03-18 14:23 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Git Mailing List

Hello,

# git gc --auto
Auto packing the repository in background for optimum performance.
See "git help gc" for manual housekeeping.

and starts in the background:

25618     1  0 32451   884   1 14:20 ?        00:00:00 git gc --auto
25639 25618 51 49076 49428   0 14:20 ?        00:00:07 git prune 
--expire 2.weeks.ago

# git count-objects -v
count: 6039
size: 65464
in-pack: 185432
packs: 1
size-pack: 46687
prune-packable: 0
garbage: 0
size-garbage: 0

Regards
   Dilian


On 18.03.2015 15:16, Duy Nguyen wrote:
> On Wed, Mar 18, 2015 at 8:53 PM, Дилян Палаузов
> <dilyan.palauzov@aegee.org> wrote:
>> Hello,
>>
>> I have a local folder with the git-repository (so that its .git/config
>> contains ([remote "origin"]\n    url = git://github.com/git/git.git\nfetch =
>> +refs/heads/*:refs/remotes/origin/* )
>>
>> I do there "git pull".
>>
>> Usually the output is
>>    Already up to date
>>
>> but since today it prints
>>    Auto packing the repository in background for optimum performance.
>>    See "git help gc" for manual housekeeping.
>>    Already up-to-date.
>>
>> and starts in the background a "git gc --auto" process.  This is all fine,
>> however, when the "git gc" process finishes, and I do again "git pull" I get
>> the same message, as above (git gc is again started).
>
> So if you do "git gc --auto" now, does it exit immediately or go
> through the garbage collection process again (it'll print something)?
> What does "git count-objects -v" show?
>


* Re: git pull & git gc
  2015-03-18 14:23   ` Дилян Палаузов
@ 2015-03-18 14:33     ` Duy Nguyen
  2015-03-18 14:41       ` Duy Nguyen
  2015-03-18 14:48       ` Дилян Палаузов
  0 siblings, 2 replies; 21+ messages in thread
From: Duy Nguyen @ 2015-03-18 14:33 UTC (permalink / raw)
  To: Дилян Палаузов
  Cc: Git Mailing List

On Wed, Mar 18, 2015 at 9:23 PM, Дилян Палаузов
<dilyan.palauzov@aegee.org> wrote:
> Hello,
>
> # git gc --auto
> Auto packing the repository in background for optimum performance.
> See "git help gc" for manual housekeeping.
>
> and calls in the background:
>
> 25618     1  0 32451   884   1 14:20 ?        00:00:00 git gc --auto
> 25639 25618 51 49076 49428   0 14:20 ?        00:00:07 git prune --expire
> 2.weeks.ago
>
> # git count-objects -v
> count: 6039

The loose-object threshold is 6700, unless you tweaked something. But
there's a twist; we'll come back to this.

> size: 65464
> in-pack: 185432
> packs: 1

Pack threshold is 50. You only have one pack, good.

OK, back to the "count: 6039" above. You have that many loose objects.
But 'git gc' is lazier than 'git count-objects'. It assumes a flat
distribution, counts only the objects in the .git/objects/17
directory, and extrapolates from that to the total number.

So can you check how many files you have in the directory
.git/objects/17? That number, multiplied by 256, should be greater
than 6700. If that's the case, "git gc"'s laziness is the problem. If
not, I made some mistake in analyzing this and we'll start again.
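The sampling heuristic described above can be sketched roughly like this (a Python illustration, not git's actual C code; the function name is invented, and git's real check compares the sample count against gc.auto/256 rather than multiplying the sample up, which is the same comparison seen from the other side):

```python
import os

GC_AUTO_DEFAULT = 6700  # default value of the gc.auto config knob

def too_many_loose_objects(objdir, limit=GC_AUTO_DEFAULT):
    """Estimate the total number of loose objects the way
    'git gc --auto' does: count only the .git/objects/17 fan-out
    directory and assume a flat distribution over all 256 of them."""
    sample_dir = os.path.join(objdir, "17")
    try:
        sample = len(os.listdir(sample_dir))
    except FileNotFoundError:
        sample = 0
    # 30 files here means an estimated 30 * 256 = 7680 loose objects,
    # which is over the default 6700 threshold, so gc would trigger.
    return sample * 256 >= limit
```

This is why the estimate can disagree with "git count-objects -v": gc never enumerates the other 255 directories.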
-- 
Duy


* Re: git pull & git gc
  2015-03-18 14:33     ` Duy Nguyen
@ 2015-03-18 14:41       ` Duy Nguyen
  2015-03-18 14:58         ` John Keeping
  2015-03-18 14:48       ` Дилян Палаузов
  1 sibling, 1 reply; 21+ messages in thread
From: Duy Nguyen @ 2015-03-18 14:41 UTC (permalink / raw)
  To: Дилян Палаузов
  Cc: Git Mailing List

On Wed, Mar 18, 2015 at 9:33 PM, Duy Nguyen <pclouds@gmail.com> wrote:
> If not, I made some mistake in analyzing this and we'll start again.

I did make one mistake: the first "gc" should have reduced the number
of loose objects to zero. Why didn't it?  I'll come back to this
tomorrow if nobody finds out first :)
-- 
Duy


* Re: git pull & git gc
  2015-03-18 14:33     ` Duy Nguyen
  2015-03-18 14:41       ` Duy Nguyen
@ 2015-03-18 14:48       ` Дилян Палаузов
  2015-03-18 21:07         ` Jeff King
  1 sibling, 1 reply; 21+ messages in thread
From: Дилян Палаузов @ 2015-03-18 14:48 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Git Mailing List

Hello Duy,

# ls .git/objects/17/*  | wc -l
30

30 * 256 = 7 680 > 6 700

And now?  Do I have to run "git gc --aggressive"?

Kind regards
   Dilian


On 18.03.2015 15:33, Duy Nguyen wrote:
> On Wed, Mar 18, 2015 at 9:23 PM, Дилян Палаузов
> <dilyan.palauzov@aegee.org> wrote:
>> Hello,
>>
>> # git gc --auto
>> Auto packing the repository in background for optimum performance.
>> See "git help gc" for manual housekeeping.
>>
>> and calls in the background:
>>
>> 25618     1  0 32451   884   1 14:20 ?        00:00:00 git gc --auto
>> 25639 25618 51 49076 49428   0 14:20 ?        00:00:07 git prune --expire
>> 2.weeks.ago
>>
>> # git count-objects -v
>> count: 6039
>
> loose number threshold is 6700, unless you tweaked something. But
> there's a tweak, we'll come back to this.
>
>> size: 65464
>> in-pack: 185432
>> packs: 1
>
> Pack threshold is 50, You only have one pack, good
>
> OK back to the "count 6039" above. You have that many loose objects.
> But 'git gc' is lazier than 'git count-objects'. It assume a flat
> distribution, and only counts the number of objects in .git/objects/17
> directory only, then extrapolate for the total number.
>
> So can you see how many files you have in this directory
> .git/objects/17? That number, multiplied by 256, should be greater
> than 6700. If that's the case "git gc" laziness is the problem. If
> not, I made some mistake in analyzing this and we'll start again.
>


* Re: git pull & git gc
  2015-03-18 14:41       ` Duy Nguyen
@ 2015-03-18 14:58         ` John Keeping
  2015-03-18 21:04           ` Jeff King
  2015-03-19  9:47           ` Duy Nguyen
  0 siblings, 2 replies; 21+ messages in thread
From: John Keeping @ 2015-03-18 14:58 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Дилян Палаузов,
	Git Mailing List

On Wed, Mar 18, 2015 at 09:41:59PM +0700, Duy Nguyen wrote:
> On Wed, Mar 18, 2015 at 9:33 PM, Duy Nguyen <pclouds@gmail.com> wrote:
> > If not, I made some mistake in analyzing this and we'll start again.
> 
> I did make one mistake, the first "gc" should have reduced the number
> of loose objects to zero. Why didn't it.?  I'll come back to this
> tomorrow if nobody finds out first :)

Most likely they are not referenced by anything but are younger than 2
weeks.

I saw a similar issue with automatic gc triggering after every operation
when I did something equivalent to:

	git add <lots of files>
	git commit
	git reset --hard HEAD^

which creates a lot of unreachable objects which are not old enough to
be pruned.


* Re: git pull & git gc
  2015-03-18 14:58         ` John Keeping
@ 2015-03-18 21:04           ` Jeff King
  2015-03-19  0:31             ` Duy Nguyen
  2015-03-19  9:47           ` Duy Nguyen
  1 sibling, 1 reply; 21+ messages in thread
From: Jeff King @ 2015-03-18 21:04 UTC (permalink / raw)
  To: John Keeping
  Cc: Duy Nguyen,
	Дилян Палаузов,
	Git Mailing List

On Wed, Mar 18, 2015 at 02:58:15PM +0000, John Keeping wrote:

> On Wed, Mar 18, 2015 at 09:41:59PM +0700, Duy Nguyen wrote:
> > On Wed, Mar 18, 2015 at 9:33 PM, Duy Nguyen <pclouds@gmail.com> wrote:
> > > If not, I made some mistake in analyzing this and we'll start again.
> > 
> > I did make one mistake, the first "gc" should have reduced the number
> > of loose objects to zero. Why didn't it.?  I'll come back to this
> > tomorrow if nobody finds out first :)
> 
> Most likely they are not referenced by anything but are younger than 2
> weeks.
> 
> I saw a similar issue with automatic gc triggering after every operation
> when I did something equivalent to:
> 
> 	git add <lots of files>
> 	git commit
> 	git reset --hard HEAD^
> 
> which creates a lot of unreachable objects which are not old enough to
> be pruned.

Yes, this is almost certainly the problem. Though to be pedantic, the
command above will still have a reflog entry, so the objects will be
reachable (and packed). But there are other variants that don't leave
the objects reachable from even reflogs.

I don't know if there is an easy way around this. Auto-gc's object count
is making the assumption that running the gc will reduce the number of
objects, but obviously it does not always do so. We could do a more
thorough check and find the number of actual packable and prune-able
objects. The "prune-able" part of that is easy; just omit objects from
the count that are newer than 2 weeks. But "packable" is expensive. You
would have to compute reachability by walking from the tips. That can
take tens of seconds on a large repo.

You could perhaps cut off the walk early when you hit a packed commit
(this does not strictly imply that all of the related objects are
packed, but it would be good enough for a heuristic). But even that is
probably too expensive for "gc --auto".

-Peff

PS Note that in git v2.2.0 and up, prune will leave not only "recent"
   unreachable objects, but older objects which are reachable from those
   recent ones (so that we keep or prune whole chunks of history, rather
   than dropping part and leaving the rest broken). Technically this
   exacerbates the problem (we keep more objects), though I doubt it
   makes much difference in practice (most chunks of history were
   created at similar times, so the mtimes of the whole chunk will be
   close together).


* Re: git pull & git gc
  2015-03-18 14:48       ` Дилян Палаузов
@ 2015-03-18 21:07         ` Jeff King
  0 siblings, 0 replies; 21+ messages in thread
From: Jeff King @ 2015-03-18 21:07 UTC (permalink / raw)
  To: Дилян Палаузов
  Cc: Duy Nguyen, Git Mailing List

On Wed, Mar 18, 2015 at 03:48:42PM +0100, Дилян Палаузов wrote:

> #ls .git/objects/17/*  | wc -l
> 30
> 
> 30 * 256 = 7 680 > 6 700
> 
> And now?  Do I have to run git gc --aggressive ?

No, aggressive just controls the time we spend on repacking. If the
guess is correct that the objects are kept because they are unreachable
but "recent", then shortening the prune expiration time would get rid of
them. E.g., "git gc --prune=1.hour.ago".

That does not solve the underlying problem discussed elsewhere in the
thread, but it would make this particular instance of it go away. :)

-Peff


* Re: git pull & git gc
  2015-03-18 21:04           ` Jeff King
@ 2015-03-19  0:31             ` Duy Nguyen
  2015-03-19  1:27               ` Jeff King
  0 siblings, 1 reply; 21+ messages in thread
From: Duy Nguyen @ 2015-03-19  0:31 UTC (permalink / raw)
  To: Jeff King
  Cc: John Keeping,
	Дилян Палаузов,
	Git Mailing List

On Thu, Mar 19, 2015 at 4:04 AM, Jeff King <peff@peff.net> wrote:
> On Wed, Mar 18, 2015 at 02:58:15PM +0000, John Keeping wrote:
>
>> On Wed, Mar 18, 2015 at 09:41:59PM +0700, Duy Nguyen wrote:
>> > On Wed, Mar 18, 2015 at 9:33 PM, Duy Nguyen <pclouds@gmail.com> wrote:
>> > > If not, I made some mistake in analyzing this and we'll start again.
>> >
>> > I did make one mistake, the first "gc" should have reduced the number
>> > of loose objects to zero. Why didn't it.?  I'll come back to this
>> > tomorrow if nobody finds out first :)
>>
>> Most likely they are not referenced by anything but are younger than 2
>> weeks.
>>
>> I saw a similar issue with automatic gc triggering after every operation
>> when I did something equivalent to:
>>
>>       git add <lots of files>
>>       git commit
>>       git reset --hard HEAD^
>>
>> which creates a lot of unreachable objects which are not old enough to
>> be pruned.
>
> Yes, this is almost certainly the problem. Though to be pedantic, the
> command above will still have a reflog entry, so the objects will be
> reachable (and packed). But there are other variants that don't leave
> the objects reachable from even reflogs.
>
> I don't know if there is an easy way around this. Auto-gc's object count
> is making the assumption that running the gc will reduce the number of
> objects, but obviously it does not always do so. We could do a more
> thorough check and find the number of actual packable and prune-able
> objects. The "prune-able" part of that is easy; just omit objects from
> the count that are newer than 2 weeks. But "packable" is expensive. You
> would have to compute reachability by walking from the tips. That can
> take tens of seconds on a large repo.

Or we could count/estimate the number of loose objects again after
repack/prune. Then we could have a way to prevent the next gc that we
know will not improve the situation anyway. One option is to pack the
unreachable objects into a second pack. That would stop the next gc,
but it would screw up prune because the st_mtime info is gone. Maybe
we just save a file telling gc to ignore the number of loose objects
until after a specific date.
-- 
Duy


* Re: git pull & git gc
  2015-03-19  0:31             ` Duy Nguyen
@ 2015-03-19  1:27               ` Jeff King
  2015-03-19  2:01                 ` Mike Hommey
                                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Jeff King @ 2015-03-19  1:27 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: John Keeping,
	Дилян Палаузов,
	Git Mailing List

On Thu, Mar 19, 2015 at 07:31:48AM +0700, Duy Nguyen wrote:

> Or we could count/estimate the number of loose objects again after
> repack/prune. Then we can maybe have a way to prevent the next gc that
> we know will not improve the situation anyway. One option is pack
> unreachable objects in the second pack. This would stop the next gc,
> but that would screw prune up because st_mtime info is gone.. Maybe we
> just save a file to tell gc to ignore the number of loose objects
> until after a specific date.

I don't think packing the unreachables is a good plan. They just end up
accumulating then, and they never expire, because we keep refreshing
their mtime at each pack (unless you pack them once and then leave them
to expire, but then you end up with a large number of packs).

Keeping a file that says "I ran gc at time T, and there were still N
objects left over" is probably the best bet. When the next "gc --auto"
runs, if T is recent enough, subtract N from the estimated number of
objects. I'm not sure of the right value for "recent enough" there,
though. If it is too far back, you will not gc when you could. If it is
too close, then you will end up running gc repeatedly, waiting for those
objects to leave the expiration window.

I guess leaving a bunch of loose objects around longer than necessary
isn't the end of the world. It wastes space, but it does not actively
make the rest of git slower (whereas having a large number of packs does
impact performance). So you could probably make "recent enough" be "T <
now - gc.pruneExpire / 4" or something. At most we would try to gc 4
times before dropping unreachable objects, and for the default period,
that's only once every couple days.
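The proposal above could be sketched like this (a hypothetical Python illustration with invented names, not an existing git interface; 2.weeks is the gc.pruneExpire default, and the cutoff is the pruneExpire/4 suggested here):

```python
PRUNE_EXPIRE = 14 * 24 * 3600       # gc.pruneExpire default: 2.weeks, in seconds
RECENT_WINDOW = PRUNE_EXPIRE // 4   # the "T < now - gc.pruneExpire / 4" cutoff

def should_auto_gc(estimated_loose, threshold, last_gc_time, leftover, now):
    """Decide whether 'gc --auto' should run again.  If a recent gc
    recorded that 'leftover' objects survived it (unreachable but too
    young to prune), discount them from the current estimate."""
    if last_gc_time is not None and now - last_gc_time < RECENT_WINDOW:
        estimated_loose -= leftover   # ignore the known-unprunable cruft
    return estimated_loose >= threshold
```

With the numbers from this thread: an estimate of 7680 with 7000 recorded leftovers from a gc run minutes ago yields 680, well below the 6700 threshold, so the repeated background gc would stop.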

-Peff


* Re: git pull & git gc
  2015-03-19  1:27               ` Jeff King
@ 2015-03-19  2:01                 ` Mike Hommey
  2015-03-19  4:14                   ` Jeff King
  2015-03-19  2:27                 ` Junio C Hamano
  2015-03-19  4:15                 ` Duy Nguyen
  2 siblings, 1 reply; 21+ messages in thread
From: Mike Hommey @ 2015-03-19  2:01 UTC (permalink / raw)
  To: Jeff King
  Cc: Duy Nguyen, John Keeping,
	Дилян Палаузов,
	Git Mailing List

On Wed, Mar 18, 2015 at 09:27:22PM -0400, Jeff King wrote:
> On Thu, Mar 19, 2015 at 07:31:48AM +0700, Duy Nguyen wrote:
> 
> > Or we could count/estimate the number of loose objects again after
> > repack/prune. Then we can maybe have a way to prevent the next gc that
> > we know will not improve the situation anyway. One option is pack
> > unreachable objects in the second pack. This would stop the next gc,
> > but that would screw prune up because st_mtime info is gone.. Maybe we
> > just save a file to tell gc to ignore the number of loose objects
> > until after a specific date.
> 
> I don't think packing the unreachables is a good plan. They just end up
> accumulating then, and they never expire, because we keep refreshing
> their mtime at each pack (unless you pack them once and then leave them
> to expire, but then you end up with a large number of packs).

Note, sometimes I wish unreachables were packed. Recently, I ended up in
a situation where running gc created something like 3GB of data as per
du, because I suddenly had something like 600K unreachable objects, each
of them, as a loose object, taking at least 4K on disk. This made my
.git take 5GB instead of 2GB. That surely didn't feel like garbage
collection.

Mike


* Re: git pull & git gc
  2015-03-19  1:27               ` Jeff King
  2015-03-19  2:01                 ` Mike Hommey
@ 2015-03-19  2:27                 ` Junio C Hamano
  2015-03-19  4:09                   ` Jeff King
  2015-03-19  4:15                 ` Duy Nguyen
  2 siblings, 1 reply; 21+ messages in thread
From: Junio C Hamano @ 2015-03-19  2:27 UTC (permalink / raw)
  To: Jeff King
  Cc: Duy Nguyen, John Keeping,
	Дилян Палаузов,
	Git Mailing List

On Wed, Mar 18, 2015 at 6:27 PM, Jeff King <peff@peff.net> wrote:
>
> Keeping a file that says "I ran gc at time T, and there were still N
> objects left over" is probably the best bet. When the next "gc --auto"
> runs, if T is recent enough, subtract N from the estimated number of
> objects. I'm not sure of the right value for "recent enough" there,
> though. If it is too far back, you will not gc when you could. If it is
> too close, then you will end up running gc repeatedly, waiting for those
> objects to leave the expiration window.
>
> I guess leaving a bunch of loose objects around longer than necessary
> isn't the end of the world. It wastes space, but it does not actively
> make the rest of git slower (whereas having a large number of packs does
> impact performance). So you could probably make "recent enough" be "T <
> now - gc.pruneExpire / 4" or something. At most we would try to gc 4
> times before dropping unreachable objects, and for the default period,
> that's only once every couple days.

We could simply prune unreachables more aggressively, and it would
solve this issue at the root cause, no?

We do keep things reachable from reflogs, so the only thing you gain
by leaving the unreachables around is the chance for an expert to
perform some forensic analysis---and when there are that many loose
objects, all unreachable, nobody sane can go through them one by one
and guess correctly which of them they would have wished to keep had
their ancient reflog entries extended a few weeks more.

That is, unless there is some tool to analyse the unreachable loose
objects, collect them into meaningful islands, and present them in
some way that the end user can make sense of, which I do not think
exists (yet).


* Re: git pull & git gc
  2015-03-19  2:27                 ` Junio C Hamano
@ 2015-03-19  4:09                   ` Jeff King
  0 siblings, 0 replies; 21+ messages in thread
From: Jeff King @ 2015-03-19  4:09 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Duy Nguyen, John Keeping,
	Дилян Палаузов,
	Git Mailing List

On Wed, Mar 18, 2015 at 07:27:46PM -0700, Junio C Hamano wrote:

> > I guess leaving a bunch of loose objects around longer than necessary
> > isn't the end of the world. It wastes space, but it does not actively
> > make the rest of git slower (whereas having a large number of packs does
> > impact performance). So you could probably make "recent enough" be "T <
> > now - gc.pruneExpire / 4" or something. At most we would try to gc 4
> > times before dropping unreachable objects, and for the default period,
> > that's only once every couple days.
> 
> We could simply prune unreachables more aggressively, and it would
> solve this issue at the root cause, no?

Yes, but not too aggressively. You mentioned object archaeology, but my
main interest is avoiding corruption. The mtime check is the thing that
prevents us from pruning objects being used for an operation-in-progress
that has not yet updated a ref.  For some long-running operations, like
adding files to a commit, we take into account references like a blob
being mentioned in the index. But I do not know offhand if there are
other long-running operations that would run into problems if we
shortened the expiration time drastically.  Anything building a
temporary index is potentially problematic.

But if we assume that operations like that tend to create and reference
their objects within a reasonable time period (say, seconds to minutes)
then the current default of 2 weeks is absurd for this purpose.  For
raciness within a single operation, a few seconds is probably enough
(e.g., we may write out a commit object and then update the ref a few
milliseconds later).

The potential for problems is exacerbated by the fact that object `X`
may exist in the filesystem with an old mtime, and then a new operation
wants to reference it. That's made somewhat better by 33d4221
(write_sha1_file: freshen existing objects, 2014-10-15), as before we
could silently turn a file write into a noop. But it's still racy to do:

  git cat-file -e $commit
  git update-ref refs/heads/foo $commit

as we do not update the mtime for a read-only operation like cat-file
(and even if we did, it's still somewhat racy as prune does not
atomically check the mtime and remove the file).

So I think there's definitely some possible danger with dropping the
default prune expiration time.

For a long time GitHub ran with it as 1.hour.ago. We definitely saw some
oddities and corruption over the years that were apparently caused by
over-aggressive pruning and/or raciness. I've fixed a number of bugs,
and things did get better as a result. But I could not say whether all
such problems are gone. These days we do our regular repacks with
"--keep-unreachable" and almost never prune anything.

It's also not clear whether GitHub represents anything close to "normal"
use. We have a much smaller array of operations that we perform (most
objects are either from a push, or from a test-merge between a topic
branch and HEAD). But we also have busy repos that are frequently doing
gc in the background (especially because we share object storage, so
activity on another fork can trigger a gc job that affects a whole
repository network). On workstations, I'd guess most git-gc jobs run
during a fairly quiescent period.

All of which is to say that I don't really know the answer, and there
may be dragons. I'd imagine that dropping the default expiration time
from 2 weeks to 1 day would probably be fine. A good way to experiment
would be for some brave souls to set gc.pruneexpire themselves, run with
it for a few weeks or months, and see if anything goes wrong.

-Peff


* Re: git pull & git gc
  2015-03-19  2:01                 ` Mike Hommey
@ 2015-03-19  4:14                   ` Jeff King
  2015-03-19  4:26                     ` Mike Hommey
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff King @ 2015-03-19  4:14 UTC (permalink / raw)
  To: Mike Hommey
  Cc: Duy Nguyen, John Keeping,
	Дилян Палаузов,
	Git Mailing List

On Thu, Mar 19, 2015 at 11:01:17AM +0900, Mike Hommey wrote:

> > I don't think packing the unreachables is a good plan. They just end up
> > accumulating then, and they never expire, because we keep refreshing
> > their mtime at each pack (unless you pack them once and then leave them
> > to expire, but then you end up with a large number of packs).
> 
> Note, sometimes I wish unreachables were packed. Recently, I ended up in
> a situation where running gc created something like 3GB of data as per
> du, because I suddenly had something like 600K unreachable objects, each
> of them, as a loose object, taking at least 4K on disk. This made my
> .git take 5GB instead of 2GB. That surely didn't feel like garbage
> collection.

That's definitely a thing that happens, but it is a bit of a corner
case. It's unusual to have such a large number of unreferenced objects
all at once.

I don't suppose you happen to remember the details, but would a lower
expiration time (e.g., 1 day or 1 hour) have made all of those objects
go away? Or were they really from some extremely recent event (of
course, "event" here might just have been "I did a full repack right
before rewriting history" which would freshen the mtimes on everything
in the pack).

Certainly the "loosening" behavior for unreachable objects has corner
cases like this, and they suck when you hit one. Leaving the objects
packed would be better, but IMHO is not a viable alternative unless
somebody comes up with a plan for segregating the "old" objects in a way
that they actually expire eventually, and don't just keep getting
repacked and freshened over and over.

-Peff


* Re: git pull & git gc
  2015-03-19  1:27               ` Jeff King
  2015-03-19  2:01                 ` Mike Hommey
  2015-03-19  2:27                 ` Junio C Hamano
@ 2015-03-19  4:15                 ` Duy Nguyen
  2015-03-19  4:20                   ` Jeff King
  2 siblings, 1 reply; 21+ messages in thread
From: Duy Nguyen @ 2015-03-19  4:15 UTC (permalink / raw)
  To: Jeff King
  Cc: John Keeping,
	Дилян Палаузов,
	Git Mailing List

On Thu, Mar 19, 2015 at 8:27 AM, Jeff King <peff@peff.net> wrote:
> Keeping a file that says "I ran gc at time T, and there were still N
> objects left over" is probably the best bet. When the next "gc --auto"
> runs, if T is recent enough, subtract N from the estimated number of
> objects. I'm not sure of the right value for "recent enough" there,
> though. If it is too far back, you will not gc when you could. If it is
> too close, then you will end up running gc repeatedly, waiting for those
> objects to leave the expiration window.

And it would not be hard to implement either. git-gc is already
prepared to deal with a stale gc.pid, which stops git-gc for a day or
so before it deletes gc.pid and starts anyway. All we need to do is
check at the end of git-gc: if we know for sure the next 'gc --auto'
would be a waste, leave gc.pid behind.
-- 
Duy


* Re: git pull & git gc
  2015-03-19  4:15                 ` Duy Nguyen
@ 2015-03-19  4:20                   ` Jeff King
  2015-03-19  4:29                     ` Duy Nguyen
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff King @ 2015-03-19  4:20 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: John Keeping,
	Дилян Палаузов,
	Git Mailing List

On Thu, Mar 19, 2015 at 11:15:19AM +0700, Duy Nguyen wrote:

> On Thu, Mar 19, 2015 at 8:27 AM, Jeff King <peff@peff.net> wrote:
> > Keeping a file that says "I ran gc at time T, and there were still N
> > objects left over" is probably the best bet. When the next "gc --auto"
> > runs, if T is recent enough, subtract N from the estimated number of
> > objects. I'm not sure of the right value for "recent enough" there,
> > though. If it is too far back, you will not gc when you could. If it is
> > too close, then you will end up running gc repeatedly, waiting for those
> > objects to leave the expiration window.
> 
> And would not be hard to implement either. git-gc is already prepared
> to deal with stale gc.pid, which would stop git-gc for a day or so
> before it deletes gc.pid and starts anyway. All we need to do is check
> at the end of git-gc, if we know for sure the next 'gc --auto' is a
> waste, then leave gc.pid behind.

That omits the "N objects left over" information. Which I think may be
useful, because otherwise the rule is basically "don't do another gc at
all for X time units". That's OK for most use, but it has its own corner
cases. E.g., imagine you are doing an SVN import that does an auto-gc
check every 1000 commits. You have some unreferenced objects in your
repository. After the first 1000 commits, we do a gc, and then say "wow,
still a lot of cruft; let's block gc for a day". Five minutes later,
after another 1000 commits, we run "gc --auto" again. It doesn't run
because of the cruft-check, even though there are a _huge_ number of new
packable objects.

If the blocker file tells us "7000 extra objects" and we see that there
are 17,000 in the repo, then we know it's still worth doing the gc
> (i.e., we know that we'll probably end up ignoring the 7000 cruft
that didn't get cleaned up last time, but we also know that there are
10,000 new objects).
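The decision rule being proposed here could be sketched roughly as follows. This is a minimal illustration, not git's actual implementation; the blocker-file mechanics are hypothetical, though the 6700 default matches git's `gc.auto` threshold:

```python
def should_gc(loose_now, cruft_recorded, gc_auto=6700):
    """Decide whether 'gc --auto' should run, per the proposed heuristic.

    loose_now      -- loose objects currently estimated in the repo
    cruft_recorded -- "N objects left over" recorded by the last gc
    """
    # Estimate genuinely new packable objects by subtracting the cruft
    # already known to be unprunable from the last gc run.
    new_objects = loose_now - cruft_recorded
    return new_objects > gc_auto

# The numbers from the example above: 17,000 total with 7,000 known
# cruft leaves 10,000 new objects, so gc is still worth running.
worth_it = should_gc(17000, 7000)
not_worth_it = should_gc(7500, 7000)  # only 500 new objects
```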

-Peff

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git pull & git gc
  2015-03-19  4:14                   ` Jeff King
@ 2015-03-19  4:26                     ` Mike Hommey
  0 siblings, 0 replies; 21+ messages in thread
From: Mike Hommey @ 2015-03-19  4:26 UTC (permalink / raw)
  To: Jeff King
  Cc: Duy Nguyen, John Keeping,
	Дилян Палаузов,
	Git Mailing List

On Thu, Mar 19, 2015 at 12:14:53AM -0400, Jeff King wrote:
> On Thu, Mar 19, 2015 at 11:01:17AM +0900, Mike Hommey wrote:
> 
> > > I don't think packing the unreachables is a good plan. They just end up
> > > accumulating then, and they never expire, because we keep refreshing
> > > their mtime at each pack (unless you pack them once and then leave them
> > > to expire, but then you end up with a large number of packs).
> > 
> > Note, sometimes I wish unreachables were packed. Recently, I ended up in
> > a situation where running gc created something like 3GB of data as per
> > du, because I suddenly had something like 600K unreachable objects, each
> > of them, as a loose object, taking at least 4K on disk. This made my
> > .git take 5GB instead of 2GB. That surely didn't feel like garbage
> > collection.
> 
> That's definitely a thing that happens, but it is a bit of a corner
> case. It's unusual to have such a large number of unreferenced objects
> all at once.
> 
> I don't suppose you happen to remember the details, but would a lower
> expiration time (e.g., 1 day or 1 hour) have made all of those objects
> go away? Or were they really from some extremely recent event (of
> course, "event" here might just have been "I did a full repack right
> before rewriting history" which would freshen the mtimes on everything
> in the pack).

Unfortunately, I don't know the exact details. But yes, I guess a lower
expiration time might have helped.

> Certainly the "loosening" behavior for unreachable objects has corner
> cases like this, and they suck when you hit one. Leaving the objects
> packed would be better, but IMHO is not a viable alternative unless
> somebody comes up with a plan for segregating the "old" objects in a way
> that they actually expire eventually, and don't just keep getting
> repacked and freshened over and over.

It sure is a corner case; on the other hand, when it happens, every
single git operation calls git gc --auto, which happily spends 5 minutes
of CPU time only to end up doing nothing in practice. And it rubs salt
in the wound if you are on battery.

6700 loose objects seems easy to reach on a repo with 6M objects...

Mike

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git pull & git gc
  2015-03-19  4:20                   ` Jeff King
@ 2015-03-19  4:29                     ` Duy Nguyen
  2015-03-19  4:34                       ` Jeff King
  0 siblings, 1 reply; 21+ messages in thread
From: Duy Nguyen @ 2015-03-19  4:29 UTC (permalink / raw)
  To: Jeff King
  Cc: John Keeping,
	Дилян Палаузов,
	Git Mailing List

On Thu, Mar 19, 2015 at 11:20 AM, Jeff King <peff@peff.net> wrote:
> On Thu, Mar 19, 2015 at 11:15:19AM +0700, Duy Nguyen wrote:
>
>> On Thu, Mar 19, 2015 at 8:27 AM, Jeff King <peff@peff.net> wrote:
>> > Keeping a file that says "I ran gc at time T, and there were still N
>> > objects left over" is probably the best bet. When the next "gc --auto"
>> > runs, if T is recent enough, subtract N from the estimated number of
>> > objects. I'm not sure of the right value for "recent enough" there,
>> > though. If it is too far back, you will not gc when you could. If it is
>> > too close, then you will end up running gc repeatedly, waiting for those
>> > objects to leave the expiration window.
>>
>> And would not be hard to implement either. git-gc is already prepared
>> to deal with stale gc.pid, which would stop git-gc for a day or so
>> before it deletes gc.pid and starts anyway. All we need to do is check
>> at the end of git-gc, if we know for sure the next 'gc --auto' is a
>> waste, then leave gc.pid behind.
>
> That omits the "N objects left over" information. Which I think may be
> useful, because otherwise the rule is basically "don't do another gc at
> all for X time units". That's OK for most use, but it has its own corner
> cases.

True. But saving "N objects left over" in a file also has a corner
case. If the user runs "prune --expire=now" manually, the next 'gc --auto'
still thinks we have that many leftovers and keeps delaying gc for
some more time. Unless we make 'prune' (or any other command that
deletes leftovers) also delete this file. Yeah, maybe saving this
info in a file will work.

> E.g., imagine you are doing an SVN import that does an auto-gc
> check every 1000 commits. You have some unreferenced objects in your
> repository. After the first 1000 commits, we do a gc, and then say "wow,
> still a lot of cruft; let's block gc for a day". Five minutes later,
> after another 1000 commits, we run "gc --auto" again. It doesn't run
> because of the cruft-check, even though there are a _huge_ number of new
> packable objects.
>
> If the blocker file tells us "7000 extra objects" and we see that there
> are 17,000 in the repo, then we know it's still worth doing the gc
> (i.e., we know that we'll probably end up ignoring the 7000 cruft
> that didn't get cleaned up last time, but we also know that there are
> 10,000 new objects).
-- 
Duy

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git pull & git gc
  2015-03-19  4:29                     ` Duy Nguyen
@ 2015-03-19  4:34                       ` Jeff King
  0 siblings, 0 replies; 21+ messages in thread
From: Jeff King @ 2015-03-19  4:34 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: John Keeping,
	Дилян Палаузов,
	Git Mailing List

On Thu, Mar 19, 2015 at 11:29:57AM +0700, Duy Nguyen wrote:

> > That omits the "N objects left over" information. Which I think may be
> > useful, because otherwise the rule is basically "don't do another gc at
> > all for X time units". That's OK for most use, but it has its own corner
> > cases.
> 
> True. But saving "N objects left over" in a file also has a corner
> case. If the user runs "prune --expire=now" manually, the next 'gc --auto'
> still thinks we have that many leftovers and keeps delaying gc for
> some more time. Unless we make 'prune' (or any other command that
> deletes leftovers) also delete this file. Yeah, maybe saving this
> info in a file will work.

I assumed that the user would not run prune manually, but would run "git
gc --prune=now". And yeah, definitely any time gc runs, it should update
the file (if there are fewer than `gc.auto` objects, I think it could
just delete the file).
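That end-of-gc bookkeeping might look something like this sketch. The file name and format are made up for illustration; the proposal in this thread never specified either:

```python
import os
import tempfile

def update_blocker(git_dir, leftover_count, gc_auto=6700):
    """Record leftover-object count at the end of gc (hypothetical)."""
    path = os.path.join(git_dir, "gc-leftover")
    if leftover_count < gc_auto:
        # Few enough leftovers that the next 'gc --auto' check is
        # meaningful on its own, so drop the blocker file entirely.
        if os.path.exists(path):
            os.remove(path)
    else:
        with open(path, "w") as f:
            f.write("%d\n" % leftover_count)

# Demonstrate both branches in a throwaway directory.
demo_dir = tempfile.mkdtemp()
update_blocker(demo_dir, 7000)  # plenty of cruft: record it
recorded = open(os.path.join(demo_dir, "gc-leftover")).read().strip()
update_blocker(demo_dir, 100)   # below gc.auto: file goes away
blocker_gone = not os.path.exists(os.path.join(demo_dir, "gc-leftover"))
```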

We could also apply that rule to any run of "git prune", but my mental
model is that "git gc" is the magical porcelain that will do this stuff
for you, and "git prune" is the plumbing that users shouldn't need to
call themselves. I don't know if that model is shared by users, though. :)

-Peff

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git pull & git gc
  2015-03-18 14:58         ` John Keeping
  2015-03-18 21:04           ` Jeff King
@ 2015-03-19  9:47           ` Duy Nguyen
  1 sibling, 0 replies; 21+ messages in thread
From: Duy Nguyen @ 2015-03-19  9:47 UTC (permalink / raw)
  To: John Keeping
  Cc: Дилян Палаузов,
	Git Mailing List

On Wed, Mar 18, 2015 at 9:58 PM, John Keeping <john@keeping.me.uk> wrote:
> On Wed, Mar 18, 2015 at 09:41:59PM +0700, Duy Nguyen wrote:
>> On Wed, Mar 18, 2015 at 9:33 PM, Duy Nguyen <pclouds@gmail.com> wrote:
>> > If not, I made some mistake in analyzing this and we'll start again.
>>
>> I did make one mistake: the first "gc" should have reduced the number
>> of loose objects to zero. Why didn't it?  I'll come back to this
>> tomorrow if nobody finds out first :)
>
> Most likely they are not referenced by anything but are younger than 2
> weeks.
>
> I saw a similar issue with automatic gc triggering after every operation
> when I did something equivalent to:
>
>         git add <lots of files>
>         git commit
>         git reset --hard HEAD^
>
> which creates a lot of unreachable objects which are not old enough to
> be pruned.
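The scenario above can be modeled simply: loose objects younger than the prune expiry window (gc.pruneExpire, default two weeks) survive every gc, so the loose-object count never drops below the gc.auto threshold. A rough Python illustration, not git's code:

```python
from datetime import datetime, timedelta

PRUNE_EXPIRE = timedelta(weeks=2)  # default gc.pruneExpire

def surviving(mtimes, now):
    # prune only deletes unreachable loose objects whose mtime is
    # older than the expiry window; younger ones are kept.
    return [t for t in mtimes if now - t < PRUNE_EXPIRE]

now = datetime(2015, 3, 19)
young = [now - timedelta(days=1)] * 7000   # fresh reset --hard leftovers
old = [now - timedelta(days=30)] * 500     # stale enough to prune
kept = len(surviving(young + old, now))
# The 7000 day-old objects survive, keeping the count above the
# gc.auto default (6700), so every subsequent 'gc --auto' fires again.
```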

And there's another problem caused by background gc. If gc is not run
in the background, it would print this

warning: There are too many unreachable loose objects; run 'git prune'
to remove them.

but because background gc does not have access to stdout/stderr
anymore, this is lost.
-- 
Duy

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2015-03-19  9:48 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-18 13:53 git pull & git gc Дилян Палаузов
2015-03-18 14:16 ` Duy Nguyen
2015-03-18 14:23   ` Дилян Палаузов
2015-03-18 14:33     ` Duy Nguyen
2015-03-18 14:41       ` Duy Nguyen
2015-03-18 14:58         ` John Keeping
2015-03-18 21:04           ` Jeff King
2015-03-19  0:31             ` Duy Nguyen
2015-03-19  1:27               ` Jeff King
2015-03-19  2:01                 ` Mike Hommey
2015-03-19  4:14                   ` Jeff King
2015-03-19  4:26                     ` Mike Hommey
2015-03-19  2:27                 ` Junio C Hamano
2015-03-19  4:09                   ` Jeff King
2015-03-19  4:15                 ` Duy Nguyen
2015-03-19  4:20                   ` Jeff King
2015-03-19  4:29                     ` Duy Nguyen
2015-03-19  4:34                       ` Jeff King
2015-03-19  9:47           ` Duy Nguyen
2015-03-18 14:48       ` Дилян Палаузов
2015-03-18 21:07         ` Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).