git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Michael Haggerty <mhagger@alum.mit.edu>
To: Felipe Contreras <felipe.contreras@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: Benchmarks regarding git's gc
Date: Wed, 09 Nov 2011 06:12:55 +0100	[thread overview]
Message-ID: <4EBA0BD7.3050301@alum.mit.edu> (raw)
In-Reply-To: <CAMP44s3E-YCMQQzJAU2wjjD-adpNy6bGb-=iUQ=p=bFTWxm+Ng@mail.gmail.com>

On 11/08/2011 12:34 PM, Felipe Contreras wrote:
> Has anybody seen these?
> http://draketo.de/proj/hg-vs-git-server/test-results.html#results
> 
> Seems like a potential area of improvement.

The fact that git requires periodic garbage collection is indeed
annoying (even in interactive use) and even more annoying in the
scenario discussed by the author of this article.

With respect to the article's claims about the overall efficiency of
Mercurial vs. git, I would like to point out that the author's use of a
test repository with a linear history avoids one of Mercurial's big
design weaknesses.  If the repository had had a branching history,
Mercurial's numbers would probably be significantly less flattering.

Mercurial's revlog repository format [1] (at least the last time I
checked) uses a single data file to hold the contents of all versions of
a single file in the working copy.  It appends a delta to the end of the
revlog file for each revision, with periodic fulltexts.  It is designed
to make it possible to reconstruct any file revision via a single seek
and a single read of at most twice the length of the file's fulltext
(assuming that the index is already known).  The avoidance of disk seeks
goes a long way to explaining Mercurial's competitive performance
despite the fact that it is written in Python.

However, the deltas stored in revlog are not relative to a revision's
parent(s), but rather relative to the previous revision in the revlog
file, which is typically the most recent revision committed *to any
branch*.  Therefore, revlog is very good at storing a linear series of
commits, but is considerably less efficient at storing a history with
lots of branches that were under development concurrently.  The net
result is that the history of a branchy repository can take up much more
space than that of a linear repository.

There was a GSOC "parentdelta" project to allow deltas to be computed
against parents [2], later replaced by a second "generaldelta" scheme
[3], but AFAICT this is still experimental and they are struggling with
its performance.

There is also a script in contrib that reorders the revisions in a
revlog file to put topological neighbors closer together [4].  This can
shrink the size of the file dramatically.  But of course this script is
something like "git gc" in the sense that it would presumably need to be
run periodically, and each run would have to lock the repo for some time.

All this is not to detract from the fact that Mercurial, by not
requiring garbage collection, has a big advantage against git in certain
scenarios.

Michael

[1]
http://mercurial.selenic.com/wiki/FAQ#FAQ.2BAC8-TechnicalDetails.How_does_Mercurial_store_its_data.3F
[2] http://mercurial.selenic.com/wiki/ParentDeltaPlan
[3]
http://mercurial.selenic.com/wiki/WhatsNew#Mercurial_1.9_.282011-07-01.29
[4] http://selenic.com/hg/file/54c0517c0fe8/contrib/shrink-revlog.py

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

      parent reply	other threads:[~2011-11-09  5:13 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-11-08 11:34 Benchmarks regarding git's gc Felipe Contreras
2011-11-08 14:37 ` Nguyen Thai Ngoc Duy
2011-11-08 16:28   ` Felipe Contreras
2011-11-08 16:40 ` Brandon Casey
2011-11-08 21:58   ` Jeff King
2011-11-09 12:34     ` David Michael Barr
2011-11-09  5:12 ` Michael Haggerty [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4EBA0BD7.3050301@alum.mit.edu \
    --to=mhagger@alum.mit.edu \
    --cc=felipe.contreras@gmail.com \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).