* Benchmarks regarding git's gc
From: Felipe Contreras @ 2011-11-08 11:34 UTC
To: git

Has anybody seen these?
http://draketo.de/proj/hg-vs-git-server/test-results.html#results

Seems like a potential area of improvement.

Cheers.

-- 
Felipe Contreras

* Re: Benchmarks regarding git's gc
From: Nguyen Thai Ngoc Duy @ 2011-11-08 14:37 UTC
To: Felipe Contreras; +Cc: git

On Tue, Nov 8, 2011 at 6:34 PM, Felipe Contreras
<felipe.contreras@gmail.com> wrote:
> Has anybody seen these?
> http://draketo.de/proj/hg-vs-git-server/test-results.html#results
>
> Seems like a potential area of improvement.

The proportion between time and commits may have something to do with
the reachability test, where we traverse all commits and trees (I think
twice in git-gc: once when it runs "reflog expire" and once for "prune").
packv4 is supposed to make tree traversal faster, although it would be
best if we could avoid this test.

-- 
Duy

* Re: Benchmarks regarding git's gc
From: Felipe Contreras @ 2011-11-08 16:28 UTC
To: Nguyen Thai Ngoc Duy; +Cc: git

On Tue, Nov 8, 2011 at 4:37 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> On Tue, Nov 8, 2011 at 6:34 PM, Felipe Contreras
> <felipe.contreras@gmail.com> wrote:
>> Has anybody seen these?
>> http://draketo.de/proj/hg-vs-git-server/test-results.html#results
>>
>> Seems like a potential area of improvement.
>
> The proportion between time and commits may have something to do with
> the reachability test, where we traverse all commits and trees (I think
> twice in git-gc: once when it runs "reflog expire" and once for "prune").
> packv4 is supposed to make tree traversal faster, although it would be
> best if we could avoid this test.

Is there someone working on a way to avoid this test?

-- 
Felipe Contreras

* Re: Benchmarks regarding git's gc
From: Brandon Casey @ 2011-11-08 16:40 UTC
To: Felipe Contreras; +Cc: git

On Tue, Nov 8, 2011 at 5:34 AM, Felipe Contreras
<felipe.contreras@gmail.com> wrote:
> Has anybody seen these?
> http://draketo.de/proj/hg-vs-git-server/test-results.html#results
>
> Seems like a potential area of improvement.

I think this is a case of designing the problem space so that your
intended winner wins and your intended loser loses.

'git gc' is designed so that it can be run out-of-band. It doesn't
make a lot of sense to design your application so that the end-user
has to wait for git-gc to run, and I don't think anyone ever would.
But that is similar to how mercurial works, and that is why the author
of that page measured git's performance that way.

Off the top of my head, it seems to me that running 'git gc --auto' in
a cron job with SCHED_BATCH scheduling and some simple locking may
satisfy his requirement of keeping the repository size bounded and
ensuring that the pack operation does not affect the operation of the
web app.

-Brandon

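The cron-job idea above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a tested deployment recipe: the function name `run_gc_in_background`, the lock-file path, and the `command` parameter are hypothetical, and the SCHED_BATCH call is Linux-specific (it is skipped where the platform does not provide it).

```python
import fcntl
import os
import subprocess


def run_gc_in_background(repo_path, lock_path, command=("git", "gc", "--auto")):
    """Run `command` inside repo_path at batch priority.

    Returns the command's exit code, or None if another run already
    holds the lock. All names here are illustrative assumptions.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: "some simple locking".
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None  # a previous gc run is still in progress

        # Ask the kernel to treat this process (and the child it spawns)
        # as a batch job. SCHED_BATCH requires a static priority of 0.
        if hasattr(os, "SCHED_BATCH") and hasattr(os, "sched_setscheduler"):
            try:
                os.sched_setscheduler(0, os.SCHED_BATCH, os.sched_param(0))
            except OSError:
                pass  # not permitted here; fall back to normal scheduling

        result = subprocess.run(list(command), cwd=repo_path)
        return result.returncode
        # The lock is released when lock_file is closed on leaving `with`.
```

A cron entry would invoke a small script that calls this function with the repository path and a lock file outside the work tree. Whether batch scheduling is enough to keep a busy web application responsive would have to be measured; the thread does not report numbers for this setup.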
* Re: Benchmarks regarding git's gc
From: Jeff King @ 2011-11-08 21:58 UTC
To: Brandon Casey; +Cc: Felipe Contreras, git

On Tue, Nov 08, 2011 at 10:40:15AM -0600, Brandon Casey wrote:

> On Tue, Nov 8, 2011 at 5:34 AM, Felipe Contreras
> <felipe.contreras@gmail.com> wrote:
> > Has anybody seen these?
> > http://draketo.de/proj/hg-vs-git-server/test-results.html#results
> >
> > Seems like a potential area of improvement.
>
> I think this is a case of designing the problem space so that your
> intended winner wins and your intended loser loses.

Sort of. It is a real problem space, and mercurial does have some
advantage in that area.

His problem definition is that of a git-backed server database that is
under constant load creating new commits. So imagine wikipedia backed by
git.

Mercurial's strategy (as I understand it) is to always calculate and
store deltas as new commits are created. Git's strategy is to store full
objects, and then worry about deltification later. So of course git is
going to do more work, and especially more I/O.

Git's strategy is fine for the workload for which it was designed:
people making commits in bursts, and occasionally doing book-keeping to
make things smaller. But for a constant-commit workflow, the burstiness
is annoying, and the amount of I/O can be cumbersome.

We realized this long ago when importing old histories into git. And
that's why fast-import was born: it does at least a minimal level of
delta compression and puts everything into a single packfile, instead
of writing out loose objects.

If you were writing commits at some fast constant rate into your
repository, then you'd probably want to do the same thing. And it would
be fairly easy to do on top of git's object model. At best, it's just a
specialized commit command (like fast-import), and at worst it's
probably a more incremental object store.

So he may have a point that mercurial might perform better for some
metrics than git in the current state. But I think a lot of that is
because nobody has bothered putting git into this situation and doing
the tweaks needed to make it fast. You can argue that git sucks because
it needs tweaking, of course, but if I were picking between the two
systems to implement something like this, I'd consider picking git and
doing the tweaks (of course, I'm far from unbiased).

-Peff

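To make the fast-import route concrete, here is a small sketch that builds one `commit` command in the text stream format that `git fast-import` reads on standard input. The helper name `fast_import_commit` and its parameters are made up for illustration, paths needing quoting are not handled, and the timezone is fixed at +0000 for brevity.

```python
def fast_import_commit(ref, name, email, timestamp, message, files):
    """Return the bytes of one fast-import 'commit' command.

    `files` maps a path (str) to its new content (bytes). Hypothetical
    helper: it only covers adding or modifying regular files inline.
    """
    msg = message.encode("utf-8")
    out = [
        b"commit " + ref.encode("utf-8") + b"\n",
        ("committer %s <%s> %d +0000\n" % (name, email, timestamp)).encode("utf-8"),
        b"data %d\n" % len(msg),  # byte count, then the raw commit message
        msg,
        b"\n",
    ]
    for path, content in files.items():
        # 'M <mode> inline <path>' is followed by a data block with the blob.
        out.append(b"M 100644 inline " + path.encode("utf-8") + b"\n")
        out.append(b"data %d\n" % len(content))
        out.append(content + b"\n")
    out.append(b"\n")  # optional blank line that ends the commit command
    return b"".join(out)
```

A long-running writer could keep a single `git fast-import` process open and feed it one such command per incoming change, so that new objects land in a packfile rather than as loose objects. When no `from` line is given for a branch that already exists, fast-import uses the current branch tip as the parent, which is why the sketch omits it.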
* Re: Benchmarks regarding git's gc
From: David Michael Barr @ 2011-11-09 12:34 UTC
To: Jeff King; +Cc: Brandon Casey, Felipe Contreras, git

On Wed, Nov 9, 2011 at 8:58 AM, Jeff King <peff@peff.net> wrote:
> On Tue, Nov 08, 2011 at 10:40:15AM -0600, Brandon Casey wrote:
>
>> On Tue, Nov 8, 2011 at 5:34 AM, Felipe Contreras
>> <felipe.contreras@gmail.com> wrote:
>> > Has anybody seen these?
>> > http://draketo.de/proj/hg-vs-git-server/test-results.html#results
>> >
>> > Seems like a potential area of improvement.
>>
>> I think this is a case of designing the problem space so that your
>> intended winner wins and your intended loser loses.
>
> Sort of. It is a real problem space, and mercurial does have some
> advantage in that area.
[...]
> So he may have a point that mercurial might perform better for some
> metrics than git in the current state. But I think a lot of that is
> because nobody has bothered putting git into this situation and doing
> the tweaks needed to make it fast. You can argue that git sucks because
> it needs tweaking, of course, but if I were picking between the two
> systems to implement something like this, I'd consider picking git and
> doing the tweaks (of course, I'm far from unbiased).

It is the case that the default behaviour of git gc --auto is far from
optimal. I've been playing with ways to achieve both better asymptotic
performance and less jitter.

One part of that is choosing "good" packing parameters for a given repo.
I did this in a partially automated fashion for WebKit but I think the
process can be generalised.

The other issue is how often you repack ancient history; the potential
waste is obvious. To this end I propose a repacking strategy in the
spirit of merge-sort: if you can maintain the constraint that the sizes
of packs in a repo form a geometric sequence, my napkin says the
amortised cost of gc is log(n).

-- 
David Barr.

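The napkin estimate can be checked with a toy simulation. The sketch below is an illustration under simplifying assumptions, not the proposed implementation: every commit arrives as a pack of one object, the cost of a merge is the number of objects rewritten, and the function name `add_pack` and the growth factor of 2 are chosen here for the example.

```python
def add_pack(packs, size, factor=2):
    """Add a new pack, then restore the geometric-size invariant.

    `packs` is a list of pack sizes, largest first. After the call,
    every pack is at least `factor` times as large as the next smaller
    one. Returns the number of objects rewritten by merges.
    """
    packs.append(size)
    rewritten = 0
    # Merge the two smallest packs while the invariant is violated,
    # in the same way merge-sort combines runs of similar length.
    while len(packs) >= 2 and packs[-2] < factor * packs[-1]:
        merged = packs.pop() + packs.pop()
        packs.append(merged)
        rewritten += merged
    return rewritten


def simulate(n_commits):
    """Return (final pack sizes, total objects rewritten) for n commits."""
    packs = []
    total = sum(add_pack(packs, 1) for _ in range(n_commits))
    return packs, total
```

In this model 1024 commits end in a single pack of 1024 objects after rewriting 10240 objects in total, that is 10 = log2(1024) rewrites per commit on average. Repacking everything after each commit would instead rewrite 1 + 2 + ... + 1024 = 524800 objects.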
* Re: Benchmarks regarding git's gc
From: Michael Haggerty @ 2011-11-09 5:12 UTC
To: Felipe Contreras; +Cc: git

On 11/08/2011 12:34 PM, Felipe Contreras wrote:
> Has anybody seen these?
> http://draketo.de/proj/hg-vs-git-server/test-results.html#results
>
> Seems like a potential area of improvement.

The fact that git requires periodic garbage collection is indeed
annoying (even in interactive use) and even more annoying in the
scenario discussed by the author of this article.

With respect to the article's claims about the overall efficiency of
Mercurial vs. git, I would like to point out that the author's use of a
test repository with a linear history avoids one of Mercurial's big
design weaknesses. If the repository had had a branching history,
Mercurial's numbers would probably be significantly less flattering.

Mercurial's revlog repository format [1] (at least the last time I
checked) uses a single data file to hold the contents of all versions of
a single file in the working copy. It appends a delta to the end of the
revlog file for each revision, with periodic fulltexts. It is designed
to make it possible to reconstruct any file revision via a single seek
and a single read of at most twice the length of the file's fulltext
(assuming that the index is already known). The avoidance of disk seeks
goes a long way to explaining Mercurial's competitive performance
despite the fact that it is written in Python.

However, the deltas stored in revlog are not relative to a revision's
parent(s), but rather relative to the previous revision in the revlog
file, which is typically the most recent revision committed *to any
branch*. Therefore, revlog is very good at storing a linear series of
commits, but is considerably less efficient at storing a history with
lots of branches that were under development concurrently. The net
result is that the history of a branchy repository can take up much more
space than that of a linear repository.

There was a GSoC "parentdelta" project to allow deltas to be computed
against parents [2], later replaced by a second "generaldelta" scheme
[3], but AFAICT this is still experimental and they are struggling with
its performance.

There is also a script in contrib that reorders the revisions in a
revlog file to put topological neighbors closer together [4]. This can
shrink the size of the file dramatically. But of course this script is
something like "git gc" in the sense that it would presumably need to be
run periodically, and each run would have to lock the repo for some
time.

All this is not to detract from the fact that Mercurial, by not
requiring garbage collection, has a big advantage over git in certain
scenarios.

Michael

[1] http://mercurial.selenic.com/wiki/FAQ#FAQ.2BAC8-TechnicalDetails.How_does_Mercurial_store_its_data.3F
[2] http://mercurial.selenic.com/wiki/ParentDeltaPlan
[3] http://mercurial.selenic.com/wiki/WhatsNew#Mercurial_1.9_.282011-07-01.29
[4] http://selenic.com/hg/file/54c0517c0fe8/contrib/shrink-revlog.py

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

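The effect of delta-against-previous versus delta-against-parent on a branchy history can be illustrated with a toy model. Everything in this sketch is an assumption made for the example, not Mercurial's actual code or format: a file is modelled as a set of lines, the size of a delta is the number of lines added or removed, two branches "A" and "B" commit alternately, and the periodic fulltexts mentioned above are ignored.

```python
def delta_cost(old, new):
    """Proxy for delta size: lines present in exactly one of the two versions."""
    return len(old ^ new)


def simulate(rounds):
    """Return (cost with deltas vs. previous revision, cost vs. parent).

    Each round appends one commit on branch A and one on branch B, in
    that order, so neighbouring revisions in the toy revlog always come
    from different branches.
    """
    base = frozenset("base%d" % i for i in range(10))
    tips = {"A": base, "B": base}   # current content at each branch tip
    previous = base                 # last revision appended to the file
    cost_previous = 0
    cost_parent = 0
    for i in range(rounds):
        for branch in ("A", "B"):
            parent = tips[branch]
            new = parent | {"%s%d" % (branch, i)}  # one new line per commit
            cost_previous += delta_cost(previous, new)
            cost_parent += delta_cost(parent, new)
            tips[branch] = new
            previous = new
    return cost_previous, cost_parent
```

With 5 rounds (10 revisions) the parent-relative deltas total 10 lines, while the previous-revision deltas total 1 + 2 + ... + 10 = 55 lines, because each delta must undo the other branch's accumulated changes and redo its own. A real revlog would cap this growth by storing a fulltext, which is itself a space cost that the linear-history benchmark does not exercise.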