Re: [BUG?] gc and impatience

* Re: [BUG?] gc and impatience
       [not found] <1rpxs5pa827iefbyduyodlc7.1375495435629@email.android.com>
@ 2013-08-05 17:34 ` Ramkumar Ramachandra
  2013-08-05 18:45   ` Martin Fick
  0 siblings, 1 reply; 10+ messages in thread
From: Ramkumar Ramachandra @ 2013-08-05 17:34 UTC (permalink / raw)
  To: Martin Fick; +Cc: Git List

Martin Fick wrote:
> https://gerrit-review.googlesource.com/#/c/35215/

Very cool. From what I understood:

So, the problem is that my .git/objects/pack is polluted with little
packs every time I fetch (or push, if you're the server), and this is
problematic from the perspective of an overly (naively) aggressive gc
that hammers out all fragmentation.  So, on the first run, the little
packfiles I have are all "consolidated" into big packfiles; you also
write .keep files to say that "don't gc these big packs we just
generated".  In subsequent runs, the little packfiles from the fetch
are absorbed into a pack that is immune to gc.  You're also using a
size heuristic, to consolidate similarly sized packfiles.  You also
have a --ratio to tweak the ratio of sizes.
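
To make the size heuristic concrete, here is a minimal sketch. It is not
the logic of the Gerrit change itself: the function name
plan_consolidation and the exact rule (absorb the next pack while it is
at most `ratio` times the batch so far) are assumptions made purely for
illustration.

```python
def plan_consolidation(pack_sizes, ratio=10):
    """Pick the packs to repack together on this run (illustrative rule only).

    Walk the packs from smallest to largest and keep absorbing the next one
    while it is no more than `ratio` times the size of the batch so far.
    Anything larger is left alone, as if it were protected by a .keep file.
    """
    batch = []
    total = 0
    for size in sorted(pack_sizes):
        if batch and size > ratio * total:
            break  # the next pack is much bigger than the batch: stop here
        batch.append(size)
        total += size
    # A batch of one pack would only rewrite that pack, so report nothing.
    return batch if len(batch) > 1 else []
```

With sizes [1, 2, 3, 500, 100000] and ratio 10, only the three small
packs are selected; the two big ones are never rewritten, which is where
the IO savings would come from under this assumed rule.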

I've checked it in and started using it; so yeah: I'll chew on it for
a few weeks.

Thanks.

* Re: [BUG?] gc and impatience
  2013-08-05 17:34 ` [BUG?] gc and impatience Ramkumar Ramachandra
@ 2013-08-05 18:45   ` Martin Fick
  2013-08-06  2:59     ` Ramkumar Ramachandra
  0 siblings, 1 reply; 10+ messages in thread
From: Martin Fick @ 2013-08-05 18:45 UTC (permalink / raw)
  To: Ramkumar Ramachandra; +Cc: Martin Fick, Git List

On Monday, August 05, 2013 11:34:24 am Ramkumar Ramachandra 
wrote:
> Martin Fick wrote:
> > https://gerrit-review.googlesource.com/#/c/35215/
> 
> Very cool. From what I understood:
> 
> So, the problem is that my .git/objects/pack is polluted
> with little packs every time I fetch (or push, if you're
> the server), and this is problematic from the
> perspective of an overly (naively) aggressive gc that
> hammers out all fragmentation.  So, on the first run,
> the little packfiles I have are all "consolidated" into
> big packfiles; you also write .keep files to say that
> "don't gc these big packs we just generated".  In
> subsequent runs, the little packfiles from the fetch are
> absorbed into a pack that is immune to gc.  You're also
> using a size heuristic, to consolidate similarly sized
> packfiles.  You also have a --ratio to tweak the ratio
> of sizes.

Yes, pretty much.  

I suspect that a smarter implementation would do a "less 
good job of packing" to save time also.  I think this can be 
done by further limiting much of the lookups to the packs 
being packed (or some limited set of the greater packfiles).  
I admit I don't really understand how much the packing does 
today, but I believe it still looks at the larger packs with 
keeps to potentially deltafy against them, or to determine 
which objects are duplicated and thus should not be put into 
the new smaller packfiles?  I say this because the time 
savings of this script is not as significant as I would have 
expected it to be (but the IO is).  I think that it is 
possible to design a git gc using this rolling approach that 
would actually greatly reduce the time spent packing also.  
However, I don't think that can easily be done in a script 
like mine which just wraps itself around git gc.  I hope 
that someone more familiar with git gc than me might take 
this on some day. :)

> I've checked it in and started using it; so yeah: I'll
> chew on it for a few weeks.

The script also does some nasty timestamp manipulations that 
I am not proud of.  They had significant time impacts for 
us, and likely could have been achieved some other way.  
They shouldn't be relevant to the packing algo though.  I 
hope it doesn't interfere with the evaluation of the 
approach.

Thanks for taking an interest in it,

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation

* Re: [BUG?] gc and impatience
  2013-08-05 18:45   ` Martin Fick
@ 2013-08-06  2:59     ` Ramkumar Ramachandra
  0 siblings, 0 replies; 10+ messages in thread
From: Ramkumar Ramachandra @ 2013-08-06  2:59 UTC (permalink / raw)
  To: Martin Fick; +Cc: Martin Fick, Git List

Martin Fick wrote:
> I hope
> that someone more familiar with git gc than me might take
> this on some day. :)

More likely scenario: someone who is unfamiliar with it will read and
patch it little by little :)

* [BUG?] gc and impatience
@ 2013-08-03  1:48 Ramkumar Ramachandra
  2013-08-03  3:53 ` Duy Nguyen
  0 siblings, 1 reply; 10+ messages in thread
From: Ramkumar Ramachandra @ 2013-08-03  1:48 UTC (permalink / raw)
  To: Git List

Hi,

I was pulling in some changes in the morning to find:

 Auto packing the repository for optimum performance. You may also
 run "git gc" manually. See "git help gc" for more information.

Being my usual impatient self, I opened another prompt and started
merging changes. After the checkout, it started running another gc
(why!?), which I attempted to kill using ^C.

  Counting objects: 449291   x$

It didn't just fail to stop; it kept writing output, making my
prompt completely unusable. I finally just killed the pane. Now, it's
struggling to update-index and update my cache (read: more waiting).

Why is gc not designed for impatient people, and what needs to be done
to change this?

Ram

* Re: [BUG?] gc and impatience
  2013-08-03  1:48 Ramkumar Ramachandra
@ 2013-08-03  3:53 ` Duy Nguyen
  2013-08-03  4:44   ` Junio C Hamano
  2013-08-06  2:14   ` Ramkumar Ramachandra
  0 siblings, 2 replies; 10+ messages in thread
From: Duy Nguyen @ 2013-08-03  3:53 UTC (permalink / raw)
  To: Ramkumar Ramachandra; +Cc: Git List

On Sat, Aug 3, 2013 at 8:48 AM, Ramkumar Ramachandra <artagnon@gmail.com> wrote:
>  Auto packing the repository for optimum performance. You may also
>  run "git gc" manually. See "git help gc" for more information.
>
> Being my usual impatient self, I opened another prompt and started
> merging changes. After the checkout, it started running another gc
> (why!?),

Good point. I think that is because gc does not check if gc is already
running. Adding such a check should not be too hard. I think gc could
save its pid in $GIT_DIR/auto-gc.pid. The next auto-gc checks this, if
the pid is valid, skip auto-gc.
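
A minimal sketch of that check, assuming a POSIX system and a pid file
that holds nothing but a decimal pid. The helper names record_own_pid and
gc_seems_to_be_running are made up for illustration; this is not how git
itself is written.

```python
import os


def record_own_pid(pid_path):
    """Write the current process id to pid_path (hypothetical file layout)."""
    with open(pid_path, "w") as f:
        f.write(str(os.getpid()))


def gc_seems_to_be_running(pid_path):
    """Return True if pid_path names a process that currently exists.

    POSIX only: sending signal 0 performs the existence and permission
    checks without delivering any signal.
    """
    try:
        with open(pid_path) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False  # no file, or contents that are not a pid
    if pid <= 0:
        return False  # 0 and negative values address process groups
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process: the file is left over
    except PermissionError:
        return True   # the process exists but belongs to another user
    return True
```

An auto-gc would call gc_seems_to_be_running() first and return early
when it is true, otherwise call record_own_pid() and proceed. Note that
the check and the write are not atomic here, so two processes can still
race past it.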

> Why is gc not designed for impatient people, and what needs to be done
> to change this?

Some improvements could be made in gc, for example warn users about
upcoming gc so they can run it in background (of course the above bug
should be fixed)

http://thread.gmane.org/gmane.comp.version-control.git/197716/focus=197877

or speed up repack by implementing pack-objects --merge-pack:

http://thread.gmane.org/gmane.comp.version-control.git/57672/focus=57943

Or you could just make a cron job to gc all repos every week and the
problem goes away ;-)
-- 
Duy

* Re: [BUG?] gc and impatience
  2013-08-03  3:53 ` Duy Nguyen
@ 2013-08-03  4:44   ` Junio C Hamano
  2013-08-03  5:25     ` Duy Nguyen
  2013-08-06  2:14   ` Ramkumar Ramachandra
  1 sibling, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2013-08-03  4:44 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Ramkumar Ramachandra, Git List

On Fri, Aug 2, 2013 at 8:53 PM, Duy Nguyen <pclouds@gmail.com> wrote:
> Good point. I think that is because gc does not check if gc is already
> running. Adding such a check should not be too hard. I think gc could
> save its pid in $GIT_DIR/auto-gc.pid. The next auto-gc checks this, if
> the pid is valid, skip auto-gc.

Defining "valid" is a tricky business, though, as pid can and will
wrap around, and the directory could be shared on multiple machines. A
pid written by a process on one machine has no relation to any pid on
another machine.
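
One possible way to handle the shared-directory case is to record the
host name next to the pid and to trust the pid only when the host
matches. The sketch below assumes a one-line "pid hostname" format; both
that format and the function names are hypothetical, chosen only to
illustrate the point.

```python
def make_lock_line(pid, host):
    """Format the lock contents as '<pid> <hostname>' (assumed layout)."""
    return f"{pid} {host}\n"


def local_pid_from_lock(line, local_host):
    """Return the recorded pid if it was written on local_host, else None.

    None means the pid cannot be checked from this machine; the caller
    needs some other policy for that case, such as the age of the file.
    """
    parts = line.split(None, 1)
    if len(parts) != 2:
        return None  # malformed: no host recorded
    pid_text, host = parts[0], parts[1].strip()
    if host != local_host:
        return None  # written elsewhere: the pid means nothing here
    try:
        pid = int(pid_text)
    except ValueError:
        return None
    return pid if pid > 0 else None
```

This does nothing about pid wrap-around; it only avoids comparing a pid
against the process table of the wrong machine.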

* Re: [BUG?] gc and impatience
  2013-08-03  4:44   ` Junio C Hamano
@ 2013-08-03  5:25     ` Duy Nguyen
  2013-08-05 15:24       ` Junio C Hamano
  0 siblings, 1 reply; 10+ messages in thread
From: Duy Nguyen @ 2013-08-03  5:25 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Ramkumar Ramachandra, Git List

On Sat, Aug 3, 2013 at 11:44 AM, Junio C Hamano <gitster@pobox.com> wrote:
> On Fri, Aug 2, 2013 at 8:53 PM, Duy Nguyen <pclouds@gmail.com> wrote:
>> Good point. I think that is because gc does not check if gc is already
>> running. Adding such a check should not be too hard. I think gc could
>> save its pid in $GIT_DIR/auto-gc.pid. The next auto-gc checks this, if
>> the pid is valid, skip auto-gc.
>
> Defining "valid" is a tricky business, though, as pid can and will
> wrap around,

Yes, there is a chance that the old pid is now used by another process,
and it gets worse when that process is a daemon and runs forever.
If we go the optimistic way, we could check the mtime of auto-gc.pid. If
it's older than a couple of hours, ignore it and run gc anyway, assuming
gc can't last longer than an hour or so. A more reliable way is to save a
Unix socket instead of auto-gc.pid and send something over the socket
to verify it's gc, but I think it's overkill.
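
The optimistic mtime check could look roughly like this. It is a sketch
under assumptions: the two-hour default and the name lock_can_be_ignored
are placeholders, not anything git defines.

```python
import os
import time


def lock_can_be_ignored(pid_path, max_age_seconds=2 * 60 * 60, now=None):
    """Return True if there is no lock file or it is older than the limit.

    An old file is assumed to be left over from a gc that died, or to hold
    a pid that has since been reused by an unrelated process.
    """
    try:
        mtime = os.stat(pid_path).st_mtime
    except FileNotFoundError:
        return True  # nothing recorded, so nothing to respect
    if now is None:
        now = time.time()
    return now - mtime > max_age_seconds
```

The trade-off is the one described above: a gc that really does run
longer than the limit will be joined by a second one.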

> and the directory could be shared on multiple machines. A
> pid written by a process on one machine has no relation to any pid on
> another machine.

I worry less about this. It's not the right model to have two machines
modify the same shared repository (gc --auto is only triggered when we
think there are new objects) even though I think we support it. If
it's two _scripts_ modifying the same repo, I don't care as this is
more about user interaction. If it's two people modifying the same
repo, it sounds like an insane setup and there may be more issues to
worry about than gc --auto.
-- 
Duy

* Re: [BUG?] gc and impatience
  2013-08-03  5:25     ` Duy Nguyen
@ 2013-08-05 15:24       ` Junio C Hamano
  2013-08-05 15:54         ` Ramkumar Ramachandra
  0 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2013-08-05 15:24 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Ramkumar Ramachandra, Git List

Duy Nguyen <pclouds@gmail.com> writes:

> I worry less about this. It's not the right model to have two machines
> modify the same shared repository (gc --auto is only triggered when we
> think there are new objects) even though I think we support it.

I am a bit hesitant to dismiss this with "It's not the right model", as
the original case of accessing the repository from two terminals while
one clearly is being accessed busily by gc falls into the same
category.

> If
> it's two _scripts_ modifying the same repo, I don't care as this is
> more about user interaction.

It can very well be two terminals, one on each machine, both
with the same human end-user interaction.

* Re: [BUG?] gc and impatience
  2013-08-05 15:24       ` Junio C Hamano
@ 2013-08-05 15:54         ` Ramkumar Ramachandra
  0 siblings, 0 replies; 10+ messages in thread
From: Ramkumar Ramachandra @ 2013-08-05 15:54 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Duy Nguyen, Git List

Junio C Hamano wrote:
> I am a bit hesitant to dismiss this with "It's not the right model", as
> the original case of accessing the repository from two terminals while
> one clearly is being accessed busily by gc falls into the same
> category.

As to why I think it makes sense: garbage collecting unreferenced
objects has nothing to do with updating refs, or checking out a
worktree.  Think about my earlier "make push.default = current resolve
HEAD early"; why would the user want to update the ref that is being
pushed?  She'd most likely want to continue working on another feature
on some other branch, and that's perfectly fine.

In long-running runtimes, garbage collection is absolutely essential
to the performance. Often, stupidly written garbage collectors that
stop-the-world (the execution of the program), compact the memory
after collection, and then restart the program, can cause the user to
throw that runtime out the window (Emacs has a really stupid one, by
the way).  Most modern runtimes have concurrent garbage collectors
that are allocated very fine-grained slots by the scheduler: so, the
program won't suddenly come to a grinding halt to do garbage
collection. The reason it's so hard to do concurrent gc is because
there can be races between data modification via variables (main
program), and data being moved around in memory for compacting (gc).

Having said all this, the problem is highly simplified in git, because
the object store is a const-store. A particular key (sha-1) is
guaranteed never to point to the wrong data.  Frankly, even if there
is concurrent access to the object store, the worst thing that can
happen is that the gc didn't collect some dangling objects that were
created during the gc run.
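
The const-store property can be shown in a few lines. The blob id below
follows git's scheme for blobs (SHA-1 over a "blob <size>" header, a NUL
byte, and the content); the ObjectStore class is only a toy dictionary
written for this illustration, not git's object database.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Return the SHA-1 git assigns to a blob with this content."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


class ObjectStore:
    """Toy content-addressed store: the key is derived from the value."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        key = git_blob_id(data)
        # Writing the same content again is a no-op, so concurrent writers
        # of identical data cannot make a key point to different bytes.
        self._objects.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]
```

Because a key is a function of the content, a reader never needs to ask
which version of an object it got; the only open question during a
concurrent gc is whether an object is present.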

Unless you have some irrational fear of introducing some unexpected
behavior in some convoluted corner case, I really don't see what the
problem is.  I'm sure server-side implementations have to do it all
the time: GitHub and Gerrit certainly don't say "I'm gc'ing; please
pull after 10 mins".  Perhaps they're more conservative than the
client side about gc (space is cheap), but that's just a sane default.

> It can very well be two terminals, one on each machine, both
> with the same human end-user interaction.

Someone SSHes into my machine from a submarine in Russia over a slow
connection. I remove an ordinary file, while she's trying to write to
it. When did anyone make any guarantees about no races? What does git
gc specifically have to do with this?

For the record, you can easily mess up your worktree by running two
different worktree updates (checkout/merge) on two different
terminals: nothing forbidding it. I don't see how _not_ forbidding gc
on two different terminals is better than forbidding it. This is quite
an obscure feature for few super-impatient people, and we haven't even
advertised it in any documentation.

Unless you can present an alternative now (patch-form, please), I
think you're being irrationally conservative about this.

* Re: [BUG?] gc and impatience
  2013-08-03  3:53 ` Duy Nguyen
  2013-08-03  4:44   ` Junio C Hamano
@ 2013-08-06  2:14   ` Ramkumar Ramachandra
  1 sibling, 0 replies; 10+ messages in thread
From: Ramkumar Ramachandra @ 2013-08-06  2:14 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Git List

Duy Nguyen wrote:
> Good point. I think that is because gc does not check if gc is already
> running. Adding such a check should not be too hard. I think gc could
> save its pid in $GIT_DIR/auto-gc.pid. The next auto-gc checks this, if
> the pid is valid, skip auto-gc.

Check.  I also talked about gc not catching SIGINT properly: I'm
looking into the issue.

> Or you could just make a cron job to gc all repos every week and the
> problem goes away ;-)

Fundamentally, we need to fix these problems:

1. Don't make the repo unusable when a gc is running: I don't expect
anything more than minor annoyances after your patch is checked in.

2. Improve the IO profile, so gc doesn't aggressively hammer out tiny
fragmentations. For this, git-exproll.sh is definitely a step in the
right direction.

3. Improve how gc fundamentally works, so we can minimize rebuilds and
CPU time. jc's git merge-pack is interesting, but I'm not very hopeful
about a naive incremental-packing. We need to keep the major
undeltified objects near the top of the file, and build an idx sorted
by SHA-1; mangling the offsets in the header after a packfile has been
written is both complicated and dangerous (we might introduce subtle
bugs corrupting the packfile), I think. I haven't thought about it
hard enough though.
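
To illustrate why the idx constraint in point 3 gets in the way of
appending to an existing pack, here is a rough sketch of a SHA-1-sorted
index with a 256-entry cumulative fanout table keyed on the first byte.
It works on in-memory hex strings and is not the on-disk idx format;
build_index and lookup are names invented for this example.

```python
import bisect


def build_index(entries):
    """Build (shas, offsets, fanout) from (sha_hex, offset) pairs.

    shas is sorted; fanout[b] counts the entries whose first byte is <= b.
    Hex strings are assumed to be lowercase and of equal length.
    """
    ordered = sorted(entries)
    shas = [sha for sha, _ in ordered]
    offsets = [offset for _, offset in ordered]
    fanout = [0] * 256
    for sha in shas:
        fanout[int(sha[:2], 16)] += 1
    for b in range(1, 256):
        fanout[b] += fanout[b - 1]  # turn per-byte counts into running totals
    return shas, offsets, fanout


def lookup(index, sha_hex):
    """Return the pack offset recorded for sha_hex, or None if absent."""
    shas, offsets, fanout = index
    first = int(sha_hex[:2], 16)
    lo = fanout[first - 1] if first else 0
    hi = fanout[first]
    # Binary search only within the slice that shares the first byte.
    i = bisect.bisect_left(shas, sha_hex, lo, hi)
    if i < hi and shas[i] == sha_hex:
        return offsets[i]
    return None
```

Adding even one object changes the sort order and every fanout total
after its first byte, so in this model the whole index has to be rebuilt
rather than appended to.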

We'll tackle these problems bit by bit in future patches.

Thanks.
