git.vger.kernel.org archive mirror
* Benchmarks regarding git's gc
@ 2011-11-08 11:34 Felipe Contreras
  2011-11-08 14:37 ` Nguyen Thai Ngoc Duy
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Felipe Contreras @ 2011-11-08 11:34 UTC
  To: git

Has anybody seen these?
http://draketo.de/proj/hg-vs-git-server/test-results.html#results

Seems like a potential area of improvement.

Cheers.

-- 
Felipe Contreras


* Re: Benchmarks regarding git's gc
  2011-11-08 11:34 Benchmarks regarding git's gc Felipe Contreras
@ 2011-11-08 14:37 ` Nguyen Thai Ngoc Duy
  2011-11-08 16:28   ` Felipe Contreras
  2011-11-08 16:40 ` Brandon Casey
  2011-11-09  5:12 ` Michael Haggerty
  2 siblings, 1 reply; 7+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-11-08 14:37 UTC
  To: Felipe Contreras; +Cc: git

On Tue, Nov 8, 2011 at 6:34 PM, Felipe Contreras
<felipe.contreras@gmail.com> wrote:
> Has anybody seen these?
> http://draketo.de/proj/hg-vs-git-server/test-results.html#results
>
> Seems like a potential area of improvement.

The proportion between time and commits may have something to do with
the reachability test, where we traverse all commits and trees (I think
twice in git-gc: once when it runs "reflog expire" and once for "prune").
packv4 is supposed to make tree traversal faster. It'd be best, though,
if we could avoid this test altogether.
-- 
Duy


* Re: Benchmarks regarding git's gc
  2011-11-08 14:37 ` Nguyen Thai Ngoc Duy
@ 2011-11-08 16:28   ` Felipe Contreras
  0 siblings, 0 replies; 7+ messages in thread
From: Felipe Contreras @ 2011-11-08 16:28 UTC
  To: Nguyen Thai Ngoc Duy; +Cc: git

On Tue, Nov 8, 2011 at 4:37 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> On Tue, Nov 8, 2011 at 6:34 PM, Felipe Contreras
> <felipe.contreras@gmail.com> wrote:
>> Has anybody seen these?
>> http://draketo.de/proj/hg-vs-git-server/test-results.html#results
>>
>> Seems like a potential area of improvement.
>
> The proportion between time and commits may have something to do with
> the reachability test, where we traverse all commits and trees (I think
> twice in git-gc: once when it runs "reflog expire" and once for "prune").
> packv4 is supposed to make tree traversal faster. It'd be best, though,
> if we could avoid this test altogether.

Is there someone working on a way to avoid this test?

-- 
Felipe Contreras


* Re: Benchmarks regarding git's gc
  2011-11-08 11:34 Benchmarks regarding git's gc Felipe Contreras
  2011-11-08 14:37 ` Nguyen Thai Ngoc Duy
@ 2011-11-08 16:40 ` Brandon Casey
  2011-11-08 21:58   ` Jeff King
  2011-11-09  5:12 ` Michael Haggerty
  2 siblings, 1 reply; 7+ messages in thread
From: Brandon Casey @ 2011-11-08 16:40 UTC
  To: Felipe Contreras; +Cc: git

On Tue, Nov 8, 2011 at 5:34 AM, Felipe Contreras
<felipe.contreras@gmail.com> wrote:
> Has anybody seen these?
> http://draketo.de/proj/hg-vs-git-server/test-results.html#results
>
> Seems like a potential area of improvement.

I think this is a case of designing the problem space so that your
intended winner wins and your intended loser loses.

'git gc' is designed so that it can be run out-of-band.  It doesn't
make a lot of sense to design your application so that the end-user
has to wait for git-gc to run, and I don't think anyone ever would.
But that is similar to how mercurial works, and that is why the author
of that page measured git's performance that way.

Off the top of my head, it seems to me that running 'git gc --auto' in
a cron job with SCHED_BATCH scheduling and some simple locking may
satisfy his requirement of keeping the repository size bounded and
ensuring that the pack operation does not affect the operation of the
web app.
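
A minimal sketch of such a cron job, assuming Linux and Python 3. The
names gc_auto and lock_path are made up for illustration; the lock is a
plain flock(2) on a file, and SCHED_BATCH is requested only where the
platform offers it.

```python
import fcntl
import os
import subprocess


def _use_batch_scheduling():
    """Ask the kernel to treat this process as a batch job (Linux only)."""
    if hasattr(os, "SCHED_BATCH"):
        try:
            os.sched_setscheduler(0, os.SCHED_BATCH, os.sched_param(0))
        except OSError:
            pass  # carry on with the default scheduling policy


def gc_auto(repo, lock_path, command=("git", "gc", "--auto")):
    """Run `command` inside `repo` unless another run holds the lock.

    Returns True if the command ran, False if it was skipped.
    """
    with open(lock_path, "w") as lock:
        try:
            # Non-blocking: if the previous run is still going, skip this one.
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False
        # The scheduling policy is changed in the child only, just before
        # it executes the command.
        subprocess.run(list(command), cwd=repo, check=True,
                       preexec_fn=_use_batch_scheduling)
        return True
```

Called from cron every few minutes as gc_auto("/path/to/repo",
"/path/to/gc.lock"), a run that overlaps a slow gc simply returns False
instead of piling up behind it.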

-Brandon


* Re: Benchmarks regarding git's gc
  2011-11-08 16:40 ` Brandon Casey
@ 2011-11-08 21:58   ` Jeff King
  2011-11-09 12:34     ` David Michael Barr
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff King @ 2011-11-08 21:58 UTC
  To: Brandon Casey; +Cc: Felipe Contreras, git

On Tue, Nov 08, 2011 at 10:40:15AM -0600, Brandon Casey wrote:

> On Tue, Nov 8, 2011 at 5:34 AM, Felipe Contreras
> <felipe.contreras@gmail.com> wrote:
> > Has anybody seen these?
> > http://draketo.de/proj/hg-vs-git-server/test-results.html#results
> >
> > Seems like a potential area of improvement.
> 
> I think this is a case of designing the problem space so that your
> intended winner wins and your intended loser loses.

Sort of. It is a real problem space, and mercurial does have some
advantage in that area.

His problem definition is that of a git-backed server database that is
under constant load creating new commits. So imagine wikipedia backed by
git.

Mercurial's strategy (as I understand it) is to always calculate and
store deltas as new commits are created. Git's strategy is to store full
objects, and then worry about deltification later. So of course git is
going to do more work, and especially more I/O.

Git's strategy is fine for the workload for which it was designed:
people making commits in bursts, and occasionally doing book-keeping to
make things smaller.

But for a constant-commit workflow, the burstiness is annoying, and the
amount of I/O can be cumbersome.  We realized this long ago when
importing old histories into git. And that's why fast-import was born:
it does at least a minimal level of delta and puts everything into a
single packfile, instead of writing out loose objects.

If you were writing commits at some fast constant rate into your
repository, then you'd probably want to do the same thing. And it would
be fairly easy to do on top of git's object model. At best, it's just a
specialized commit command (like fast-import), and at worst it's
probably a more incremental object store.
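
To make the "specialized commit command" idea concrete, here is a rough
sketch that turns a series of page revisions into a fast-import stream,
so that every commit goes through one long-running "git fast-import"
process instead of being written out as loose objects. The function
name, the committer identity and the path are placeholders, not anything
from the benchmark.

```python
def fast_import_stream(revisions, ref="refs/heads/master",
                       committer="Wiki Bot <bot@example.com>",
                       path="page.txt"):
    """Build a fast-import stream with one commit per (timestamp, text).

    Each commit replaces the content of `path`. Commits after the first
    need no "from" line: fast-import continues from the current tip of
    the branch it is already building.
    """
    out = []
    for number, (timestamp, text) in enumerate(revisions, start=1):
        body = text.encode("utf-8")
        message = b"revision %d\n" % number
        out.append(b"commit %s\n" % ref.encode())
        out.append(b"committer %s %d +0000\n" % (committer.encode(), timestamp))
        out.append(b"data %d\n%s" % (len(message), message))
        # "inline" puts the file content directly after the M line.
        out.append(b"M 100644 inline %s\n" % path.encode())
        out.append(b"data %d\n%s\n" % (len(body), body))
    return b"".join(out)
```

The resulting bytes can be piped into "git fast-import" run inside the
target repository. A real server would keep that pipe open and write
commits to it as they arrive, rather than building the whole stream up
front as this sketch does.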

So he may have a point that mercurial might perform better for some
metrics than git in the current state. But I think a lot of that is
because nobody has bothered putting git into this situation and doing the
tweaks needed to make it fast. You can argue that git sucks because it
needs tweaking, of course, but if I were picking between the two systems
to implement something like this, I'd consider picking git and doing the
tweaks (of course, I'm far from unbiased).

-Peff


* Re: Benchmarks regarding git's gc
  2011-11-08 11:34 Benchmarks regarding git's gc Felipe Contreras
  2011-11-08 14:37 ` Nguyen Thai Ngoc Duy
  2011-11-08 16:40 ` Brandon Casey
@ 2011-11-09  5:12 ` Michael Haggerty
  2 siblings, 0 replies; 7+ messages in thread
From: Michael Haggerty @ 2011-11-09  5:12 UTC
  To: Felipe Contreras; +Cc: git

On 11/08/2011 12:34 PM, Felipe Contreras wrote:
> Has anybody seen these?
> http://draketo.de/proj/hg-vs-git-server/test-results.html#results
> 
> Seems like a potential area of improvement.

The fact that git requires periodic garbage collection is indeed
annoying (even in interactive use) and even more annoying in the
scenario discussed by the author of this article.

With respect to the article's claims about the overall efficiency of
Mercurial vs. git, I would like to point out that the author's use of a
test repository with a linear history avoids one of Mercurial's big
design weaknesses.  If the repository had had a branching history,
Mercurial's numbers would probably be significantly less flattering.

Mercurial's revlog repository format [1] (at least the last time I
checked) uses a single data file to hold the contents of all versions of
a single file in the working copy.  It appends a delta to the end of the
revlog file for each revision, with periodic fulltexts.  It is designed
to make it possible to reconstruct any file revision via a single seek
and a single read of at most twice the length of the file's fulltext
(assuming that the index is already known).  The avoidance of disk seeks
goes a long way to explaining Mercurial's competitive performance
despite the fact that it is written in Python.

However, the deltas stored in revlog are not relative to a revision's
parent(s), but rather relative to the previous revision in the revlog
file, which is typically the most recent revision committed *to any
branch*.  Therefore, revlog is very good at storing a linear series of
commits, but is considerably less efficient at storing a history with
lots of branches that were under development concurrently.  The net
result is that the history of a branchy repository can take up much more
space than that of a linear repository.
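
A toy model of the effect, under loud simplifications: a revision is
just a set of lines, the size of a delta is the number of lines that
differ, and periodic fulltexts are ignored. Two branches fork from one
root and are committed to alternately; the helper names are invented for
this illustration.

```python
def build_history(commits_per_branch):
    """Two branches forked from one root, committed to alternately.

    Returns a list of (contents, parent_index) in commit order.
    """
    root = frozenset({"root"})
    history = [(root, None)]
    tips = {"a": 0, "b": 0}  # branch name -> index of its latest commit
    for step in range(1, commits_per_branch + 1):
        for branch in ("a", "b"):
            parent = tips[branch]
            # Each commit adds one new line on top of its own branch.
            contents = history[parent][0] | {f"{branch}{step}"}
            history.append((contents, parent))
            tips[branch] = len(history) - 1
    return history


def total_delta(history, against):
    """Sum of delta sizes, each taken against "parent" or "previous"."""
    total = 0
    for index in range(1, len(history)):
        contents, parent = history[index]
        base = parent if against == "parent" else index - 1
        total += len(contents ^ history[base][0])
    return total


history = build_history(5)
print(total_delta(history, "parent"), total_delta(history, "previous"))
# prints: 10 55
```

Against the parent, every delta is a single line. Against the previous
entry in the file, each delta has to undo the other branch's lines and
redo its own, so the total grows quadratically with the length of the
interleaved history.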

There was a GSOC "parentdelta" project to allow deltas to be computed
against parents [2], later replaced by a second "generaldelta" scheme
[3], but AFAICT this is still experimental and they are struggling with
its performance.

There is also a script in contrib that reorders the revisions in a
revlog file to put topological neighbors closer together [4].  This can
shrink the size of the file dramatically.  But of course this script is
something like "git gc" in the sense that it would presumably need to be
run periodically, and each run would have to lock the repo for some time.

All this is not to detract from the fact that Mercurial, by not
requiring garbage collection, has a big advantage over git in certain
scenarios.

Michael

[1]
http://mercurial.selenic.com/wiki/FAQ#FAQ.2BAC8-TechnicalDetails.How_does_Mercurial_store_its_data.3F
[2] http://mercurial.selenic.com/wiki/ParentDeltaPlan
[3]
http://mercurial.selenic.com/wiki/WhatsNew#Mercurial_1.9_.282011-07-01.29
[4] http://selenic.com/hg/file/54c0517c0fe8/contrib/shrink-revlog.py

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/


* Re: Benchmarks regarding git's gc
  2011-11-08 21:58   ` Jeff King
@ 2011-11-09 12:34     ` David Michael Barr
  0 siblings, 0 replies; 7+ messages in thread
From: David Michael Barr @ 2011-11-09 12:34 UTC
  To: Jeff King; +Cc: Brandon Casey, Felipe Contreras, git

On Wed, Nov 9, 2011 at 8:58 AM, Jeff King <peff@peff.net> wrote:
> On Tue, Nov 08, 2011 at 10:40:15AM -0600, Brandon Casey wrote:
>
>> On Tue, Nov 8, 2011 at 5:34 AM, Felipe Contreras
>> <felipe.contreras@gmail.com> wrote:
>> > Has anybody seen these?
>> > http://draketo.de/proj/hg-vs-git-server/test-results.html#results
>> >
>> > Seems like a potential area of improvement.
>>
>> I think this is a case of designing the problem space so that your
>> intended winner wins and your intended loser loses.
>
> Sort of. It is a real problem space, and mercurial does have some
> advantage in that area.
[...]
> So he may have a point that mercurial might perform better for some
> metrics than git in the current state. But I think a lot of that is
> because nobody has bothered putting git into this situation and doing the
> tweaks needed to make it fast. You can argue that git sucks because it
> needs tweaking, of course, but if I were picking between the two systems
> to implement something like this, I'd consider picking git and doing the
> tweaks (of course, I'm far from unbiased).

The default behaviour of git gc --auto is indeed far from optimal.
I've been playing with ways to achieve both better asymptotic
performance and less jitter.

One part of that is choosing "good" packing parameters for a given repo.
I did this in a partially automated fashion for WebKit, but I think the
process can be generalised.

The other issue is how often you repack ancient history; the potential
waste is obvious. To this end I propose a repacking strategy in the
spirit of merge-sort: if you can maintain the constraint that the sizes
of packs in a repo form a geometric sequence, my napkin says the
amortised cost of gc is log(n).
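
The napkin estimate can be checked with a small simulation. This is only
a sketch under assumptions that are not part of the proposal itself:
every new object arrives as a pack of size 1, merging two packs costs
the size of the result, and the geometric ratio is 2.

```python
def simulate(n_objects, ratio=2):
    """Add objects one at a time, merging packs to keep sizes geometric.

    Invariant after each step: every pack is at least `ratio` times the
    size of the next newer one. Returns (pack sizes, objects rewritten).
    """
    packs = []     # pack sizes, oldest (largest) first
    rewritten = 0  # total objects written out again by merges
    for _ in range(n_objects):
        packs.append(1)
        # Merge the two newest packs until the invariant holds again.
        while len(packs) >= 2 and packs[-2] < ratio * packs[-1]:
            merged = packs.pop() + packs.pop()
            packs.append(merged)
            rewritten += merged
    return packs, rewritten


packs, rewritten = simulate(1024)
print(packs, rewritten / 1024)
# prints: [1024] 10.0
```

With ratio 2 this behaves like a binary counter: each object is
rewritten once per level it climbs, so 1024 objects cost 10 rewrites
apiece, which is log2(n). Old history sits in the large packs and is
touched only rarely.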

--
David Barr.

