git performance

git.vger.kernel.org archive mirror

* git performance
@ 2008-10-22 20:17 Edward Ned Harvey
  2008-10-22 20:36 ` Jeff King
  2008-10-22 22:42 ` Jakub Narebski
  0 siblings, 2 replies; 22+ messages in thread
From: Edward Ned Harvey @ 2008-10-22 20:17 UTC (permalink / raw)
  To: git

I see things all over the Internet saying git is fast.  I'm currently struggling with poor svn performance and poor attitude of svn developers, so I'd like to consider switching to git.  A quick question first.

The core of the performance problem I'm facing is the need to "walk the tree" for many thousand files.  Every time I do "svn update" or "svn status" the svn client must stat every file to check for local modifications (a coffee cup or a beer worth of stats).  In essence, this is unavoidable if there is no mechanism to constantly monitor filesystem activity during normal operations.  Analogous to filesystem journaling.

So - I didn't see anything out there saying "git is fast because it uses inotify" or anything like that.  Perhaps git would not help me at all?  Because git still needs to stat all the files in the tree?


* Re: git performance
  2008-10-22 20:17 git performance Edward Ned Harvey
@ 2008-10-22 20:36 ` Jeff King
  2008-10-22 21:13   ` Peter Harris
  2008-10-22 21:55   ` Edward Ned Harvey
  2008-10-22 22:42 ` Jakub Narebski
  1 sibling, 2 replies; 22+ messages in thread
From: Jeff King @ 2008-10-22 20:36 UTC (permalink / raw)
  To: Edward Ned Harvey; +Cc: git

On Wed, Oct 22, 2008 at 04:17:16PM -0400, Edward Ned Harvey wrote:

> So - I didn't see anything out there saying "git is fast because it
> uses inotify" or anything like that.  Perhaps git would not help me at
> all?  Because git still needs to stat all the files in the tree?

Yes, it does stat all the files. How many files are you talking about,
and what platform?  From a warm cache on Linux, the 23,000 files kernel
repo takes about a tenth of a second to stat all files for me (and this
on a several year-old machine). And of course many operations don't
require stat'ing at all (like looking at logs, or diffs that don't
involve the working tree).
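
To get a comparable number for your own tree, one rough approach is to time a bare stat pass over the working files. This is only a sketch: count_and_stat is a made-up helper name, and "find ... -exec ls -ld" merely approximates the per-file lstat() work; it is not how git itself is implemented.

```shell
#!/bin/sh
# count_and_stat DIR: stat every regular file under DIR (skipping .git)
# and print how many files were visited. Hypothetical helper for illustration.
count_and_stat() {
    # "ls -ld" needs each file's metadata but never reads file contents.
    find "$1" -name .git -prune -o -type f -exec ls -ld {} + | wc -l | tr -d ' '
}

# Tiny demo tree so the sketch runs anywhere.
demo=$(mktemp -d)
mkdir -p "$demo/src" "$demo/doc"
: > "$demo/src/a.c"
: > "$demo/src/b.c"
: > "$demo/doc/readme"
nfiles=$(count_and_stat "$demo")
echo "stat'ed $nfiles files"
```

Running "time count_and_stat /path/to/tree" next to "time git status" on the same machine should show how much of the wall-clock time is plain stat traffic.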

-Peff


* Re: git performance
  2008-10-22 20:36 ` Jeff King
@ 2008-10-22 21:13   ` Peter Harris
  2008-10-22 21:55   ` Edward Ned Harvey
  1 sibling, 0 replies; 22+ messages in thread
From: Peter Harris @ 2008-10-22 21:13 UTC (permalink / raw)
  To: Jeff King; +Cc: Edward Ned Harvey, git

On Wed, Oct 22, 2008 at 4:36 PM, Jeff King wrote:
> On Wed, Oct 22, 2008 at 04:17:16PM -0400, Edward Ned Harvey wrote:
>
>> So - I didn't see anything out there saying "git is fast because it
>> uses inotify" or anything like that.  Perhaps git would not help me at
>> all?  Because git still needs to stat all the files in the tree?
>
> Yes, it does stat all the files. How many files are you talking about,
> and what platform?  From a warm cache on Linux, the 23,000 files kernel
> repo takes about a tenth of a second to stat all files for me (and this
> on a several year-old machine). And of course many operations don't
> require stat'ing at all (like looking at logs, or diffs that don't
> involve the working tree).

Windows is rather slower than Linux, so differences are more obvious.
I find git feels "only" about 2x as fast as svn at status. svn has to
stat all of its base files too, whereas git has the index. git pull
(vs svn update) feels better than 2x faster, since git doesn't need to
walk the tree and lock every sub-dir before it even connects to the
remote server.

So we're not talking 'inotify' fast, but maybe half a cup of coffee
instead of a full cup if you have that many files.

"git-svn" is really quite good. I recommend you try a quick (trunk and
maybe one branch only, last few revisions only) import of your svn
tree to test with.
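
A trial import along those lines might look like the sketch below. It only prints the command rather than running it; trial_import_cmd is a hypothetical helper, and the URL, starting revision and destination directory are placeholders, not values from this thread. It assumes the svn repository keeps its main line in a directory named "trunk".

```shell
#!/bin/sh
# trial_import_cmd URL START_REV DEST: print (not run) a git-svn command
# that imports only trunk, starting at START_REV. Hypothetical helper;
# the arguments used below are placeholders.
trial_import_cmd() {
    printf 'git svn clone --trunk=trunk -r %s:HEAD %s %s\n' "$2" "$1" "$3"
}

cmd=$(trial_import_cmd "http://svn.example.com/repo" 4900 trial-import)
echo "$cmd"
```

Adding --branches=branches to the printed command would pull in the branches as well, at the cost of a longer import.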

Peter Harris


* RE: git performance
  2008-10-22 20:36 ` Jeff King
  2008-10-22 21:13   ` Peter Harris
@ 2008-10-22 21:55   ` Edward Ned Harvey
  2008-10-23  7:11     ` Andreas Ericsson
                       ` (7 more replies)
  1 sibling, 8 replies; 22+ messages in thread
From: Edward Ned Harvey @ 2008-10-22 21:55 UTC (permalink / raw)
  To: git

> Yes, it does stat all the files. How many files are you talking about,
> and what platform?  From a warm cache on Linux, the 23,000 files kernel
> repo takes about a tenth of a second to stat all files for me (and this
> on a several year-old machine). And of course many operations don't
> require stat'ing at all (like looking at logs, or diffs that don't
> involve the working tree).

No worries.  No solution can meet everyone's needs.

I'm talking about 40-50,000 files, on multi-user production linux, which means the cache is never warm, except when I'm benchmarking.  Specifically RHEL 4 with the files on NFS mount.  Cold cache "svn st" takes ~10 mins.  Warm cache 20-30 sec.  Surprisingly to me, performance was approx the same for files on local disk versus NFS.  Probably the best solution for us is perforce, we just don't like the pricetag.

Out of curiosity, what are they talking about, when they say "git is fast?"  Just the fact that it's all local disk, or is there more to it than that?  I could see - git would probably outperform perforce for versioning of large files (let's say iso files) to benefit from sustained local disk IO, while perforce would probably outperform anything I can think of, operating on thousands of tiny files, because it will never walk the tree.


* Re: git performance
  2008-10-22 20:17 git performance Edward Ned Harvey
  2008-10-22 20:36 ` Jeff King
@ 2008-10-22 22:42 ` Jakub Narebski
  2008-10-23  7:43   ` Andreas Ericsson
  1 sibling, 1 reply; 22+ messages in thread
From: Jakub Narebski @ 2008-10-22 22:42 UTC (permalink / raw)
  To: Edward Ned Harvey; +Cc: git

"Edward Ned Harvey" <git@nedharvey.com> writes:

> I see things all over the Internet saying git is fast.  I'm
> currently struggling with poor svn performance and poor attitude of
> svn developers, so I'd like to consider switching to git.  A quick
> question first.
> 
> The core of the performance problem I'm facing is the need to "walk
> the tree" for many thousand files.  Every time I do "svn update" or
> "svn status" the svn client must stat every file to check for local
> modifications (a coffee cup or a beer worth of stats).  In essence,
> this is unavoidable if there is no mechanism to constantly monitor
> filesystem activity during normal operations.  Analogous to
> filesystem journaling.
> 
> So - I didn't see anything out there saying "git is fast because it
> uses inotify" or anything like that.  Perhaps git would not help me
> at all?  Because git still needs to stat all the files in the tree?

http://git.or.cz/gitwiki/GitBenchmarks

While it should be possible to use the 'assume unchanged' bit together
with inotify / incron, it is not something that is done; IIRC Mercurial
had a Linux-only InotifyPlugin...
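
For reference, the bit is set per file with "git update-index --assume-unchanged". The sketch below shows its effect in a throwaway repository; the file name and the demo identity are placeholders, and no inotify hookup is attempted here.

```shell
#!/bin/sh
# Demo in a throwaway repository; nothing here touches an existing checkout.
repo=$(mktemp -d)
git init -q "$repo"
echo "v1" > "$repo/a.txt"
git -C "$repo" add a.txt
git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "initial"

# Ask git to assume a.txt is unchanged instead of checking it.
git -C "$repo" update-index --assume-unchanged a.txt

# Modify the file behind git's back and ask for status.
echo "v2" >> "$repo/a.txt"
status_out=$(git -C "$repo" status --porcelain)

# "ls-files -v" marks assume-unchanged entries with a lowercase letter.
flag=$(git -C "$repo" ls-files -v a.txt | cut -c1)

echo "status: '${status_out}' flag: ${flag}"
```

An inotify-driven setup would have to clear the bit again (--no-assume-unchanged) whenever a watched file is written, otherwise real changes go unnoticed, as the empty status here shows.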

-- 
Jakub Narebski
Poland
ShadeHawk on #git


* Re: git performance
  2008-10-22 21:55   ` Edward Ned Harvey
@ 2008-10-23  7:11     ` Andreas Ericsson
  2008-10-23  7:11     ` Andreas Ericsson
                       ` (6 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Andreas Ericsson @ 2008-10-23  7:11 UTC (permalink / raw)
  To: Edward Ned Harvey; +Cc: git

Edward Ned Harvey wrote:
>> Yes, it does stat all the files. How many files are you talking about,
>> and what platform?  From a warm cache on Linux, the 23,000 files kernel
>> repo takes about a tenth of a second to stat all files for me (and this
>> on a several year-old machine). And of course many operations don't
>> require stat'ing at all (like looking at logs, or diffs that don't
>> involve the working tree).
> 
> No worries.  No solution can meet everyone's needs.
> 
> I'm talking about 40-50,000 files, on multi-user production linux, which means the cache is never warm, except when I'm benchmarking.  Specifically RHEL 4 with the files on NFS mount.  Cold cache "svn st" takes ~10 mins.  Warm cache 20-30 sec.  Surprisingly to me, performance was approx the same for files on local disk versus NFS.  Probably the best solution for us is perforce, we just don't like the pricetag.
> 
> Out of curiosity, what are they talking about, when they say "git is fast?"  Just the fact that it's all local disk, or is there more to it than that?  I could see - git would probably outperform perforce for versioning of large files (let's say iso files) to benefit from sustained local disk IO, while perforce would probably outperform anything I can think of, operating on thousands of tiny files, because it will never walk the tree.
> 





-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231


* Re: git performance
  2008-10-22 21:55   ` Edward Ned Harvey
  2008-10-23  7:11     ` Andreas Ericsson
  2008-10-23  7:11     ` Andreas Ericsson
@ 2008-10-23  7:41     ` Andreas Ericsson
  2008-10-23 12:16     ` Matthieu Moy
                       ` (4 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Andreas Ericsson @ 2008-10-23  7:41 UTC (permalink / raw)
  To: Edward Ned Harvey; +Cc: git

Edward Ned Harvey wrote:
>> Yes, it does stat all the files. How many files are you talking
>> about, and what platform?  From a warm cache on Linux, the 23,000
>> files kernel repo takes about a tenth of a second to stat all files
>> for me (and this on a several year-old machine). And of course many
>> operations don't require stat'ing at all (like looking at logs, or
>> diffs that don't involve the working tree).
> 
> No worries.  No solution can meet everyone's needs.
> 
> I'm talking about 40-50,000 files, on multi-user production linux,

Umm... using git to track a production server? I think there's something
in your specific use-case that eluded pretty much everyone here the
first time you asked about it.

git was built to maintain the linux kernel with its patch-and-merge based
workflow, 117k commits and 25k files. It's *good* at that sort of thing,
but a lot of features are "source-code management" specific. It sounds to
me you're asking for something that will keep a backup of most of your
entire system (apart from /home), which it's not really suited for. For
instance, it doesn't keep track of mode-bits on files (apart from
"executable or not").

> which means the cache is never warm, except when I'm benchmarking.
> Specifically RHEL 4 with the files on NFS mount.  Cold cache "svn st"
> takes ~10 mins.  Warm cache 20-30 sec.  Surprisingly to me,
> performance was approx the same for files on local disk versus NFS.
> Probably the best solution for us is perforce, we just don't like the
> pricetag.
> 
> Out of curiosity, what are they talking about, when they say "git is
> fast?"

Merges, patch application, committing, history walking and data
transfers are all extremely quick operations under git.

Actually, history walking isn't extremely quick, but several neat
tricks are in place that make it *seem* quick. Running
"git log drivers/net/wireless" on the linux kernel with a cold
cache starts spitting out output after about 1 second on my measly
laptop (where the kernel has 117k commits on 25k files).

>  Just the fact that it's all local disk, or is there more to
> it than that?  I could see - git would probably outperform perforce
> for versioning of large files (let's say iso files) to benefit from
> sustained local disk IO, while perforce would probably outperform
> anything I can think of, operating on thousands of tiny files,
> because it will never walk the tree.
> 

Git doesn't *have* to walk the tree either. "git status" obviously
has to do that, since you're asking "what files have changed in this
tree since I last added stuff to the index", but you can use git just
fine without ever issuing "git status" (assuming you're the one
controlling the changes, that is).

"git rm" and "git add" won't walk the tree. They're just interested in
the paths you give them and won't touch anything else.

"git commit path1 path2" won't walk the tree. It has to walk the paths
(which can be entire subdirectories, or all of them), but not more than
that.

"git push" (ie, send your changes upstream) won't walk the tree. It'll
just look at the history and how they differ.

"git merge" (and therefore also "git pull") doesn't walk the tree. It
only makes sure paths that are touched by the merge are up-to-date.

Apart from that, it would be trivial to hack up some inotify config
and scripts that stages changes in a separate index-file and then
add a simple wrapper that operates on the separate index-file rather
than the "regular" one.

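Sample "giti" wrapper:
--%<--%<--%<--
#!/bin/sh
# giti - inotify driven git wrapper
# git reads the alternate index location from GIT_INDEX_FILE
GIT_INDEX_FILE=.git/inotify-index
export GIT_INDEX_FILE
# dispatch on the subcommand only, not on the whole argument list
case "$1" in
	status)
		# list what the inotify hook has staged so far
		git diff --name-only --cached
		exit $?
		;;
esac

git "$@"
--%<--%<--%<--

Sample inotify script:
--%<--%<--%<--
#!/bin/sh
# stage the changed file (passed as $1) in the separate index
GIT_INDEX_FILE=.git/inotify-index git add -- "$1"
--%<--%<--%<--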

Sample incrontab(5) entry:
--%<--%<--%<--
/watched/path IN_CLOSE_WRITE inotify.git $@/$#
--%<--%<--%<--

Totally untested of course, so it probably needs tweaking. It should
work rather well though, assuming you're somewhat careful what
arguments you send to the "giti" wrapper and make sure to never
use any git-commands that *have* to walk the entire tree (such as
"git commit -a").

Let us know how it pans out.
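
A rough way to try the separate-index idea without incron is sketched below, in a throwaway repository. Two assumptions are mine, not part of the scripts above: the alternate index is seeded by copying .git/index (a fresh, empty one would make every committed file look deleted), and its location is passed as an absolute path through GIT_INDEX_FILE, the variable git reads for this.

```shell
#!/bin/sh
repo=$(mktemp -d)
git init -q "$repo"
echo one > "$repo/a.txt"
echo two > "$repo/b.txt"
git -C "$repo" add a.txt b.txt
git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "initial"

# Seed the alternate index from the regular one, so that it starts out
# matching HEAD.
alt="$repo/.git/inotify-index"
cp "$repo/.git/index" "$alt"

# This is the step the inotify hook would perform for a changed file.
echo changed >> "$repo/b.txt"
GIT_INDEX_FILE="$alt" git -C "$repo" add -- b.txt

# What the wrapper's "status" would print, versus the regular index.
alt_staged=$(GIT_INDEX_FILE="$alt" git -C "$repo" diff --name-only --cached)
regular_staged=$(git -C "$repo" diff --name-only --cached)
echo "alternate index: ${alt_staged:-none}; regular index: ${regular_staged:-none}"
```

Only the alternate index sees b.txt as staged; the regular index is left alone, which is what lets the wrapper answer "status" without walking the tree.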

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231


* Re: git performance
  2008-10-22 22:42 ` Jakub Narebski
@ 2008-10-23  7:43   ` Andreas Ericsson
  2008-10-23 13:04     ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 22+ messages in thread
From: Andreas Ericsson @ 2008-10-23  7:43 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Edward Ned Harvey, git

Jakub Narebski wrote:
> "Edward Ned Harvey" <git@nedharvey.com> writes:
> 
>> I see things all over the Internet saying git is fast.  I'm
>> currently struggling with poor svn performance and poor attitude of
>> svn developers, so I'd like to consider switching to git.  A quick
>> question first.
>>
>> The core of the performance problem I'm facing is the need to "walk
>> the tree" for many thousand files.  Every time I do "svn update" or
>> "svn status" the svn client must stat every file to check for local
>> modifications (a coffee cup or a beer worth of stats).  In essence,
>> this is unavoidable if there is no mechanism to constantly monitor
>> filesystem activity during normal operations.  Analogous to
>> filesystem journaling.
>>
>> So - I didn't see anything out there saying "git is fast because it
>> uses inotify" or anything like that.  Perhaps git would not help me
>> at all?  Because git still needs to stat all the files in the tree?
> 
> http://git.or.cz/gitwiki/GitBenchmarks
> 
> While it should be possible to use the 'assume unchanged' bit together
> with inotify / incron, it is not something that is done; IIRC Mercurial
> had a Linux-only InotifyPlugin...
> 

Well, inotify() is Linux specific, so it'd be quite hard to support on
another platform. Emulating it with a billion stat() calls feels rather
like a disk (and I/O performance) killer.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231


* Re: git performance
  2008-10-22 21:55   ` Edward Ned Harvey
                       ` (2 preceding siblings ...)
  2008-10-23  7:41     ` Andreas Ericsson
@ 2008-10-23 12:16     ` Matthieu Moy
  2008-10-23 16:39     ` Jeff King
                       ` (3 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Matthieu Moy @ 2008-10-23 12:16 UTC (permalink / raw)
  To: Edward Ned Harvey; +Cc: git

"Edward Ned Harvey" <git@nedharvey.com> writes:

>> Yes, it does stat all the files. How many files are you talking about,
>> and what platform?  From a warm cache on Linux, the 23,000 files kernel
>> repo takes about a tenth of a second to stat all files for me (and this
>> on a several year-old machine). And of course many operations don't
>> require stat'ing at all (like looking at logs, or diffs that don't
>> involve the working tree).
>
> No worries.  No solution can meet everyone's needs.
>
> I'm talking about 40-50,000 files, on multi-user production linux,
> which means the cache is never warm, except when I'm benchmarking.
> Specifically RHEL 4 with the files on NFS mount. Cold cache "svn st"
> takes ~10 mins. Warm cache 20-30 sec.

SVN does not only have to stat the files. It also has to read the
stat-cache information, which is split into one .svn/ per directory in
the working tree. Not sure which operation dominates the performance,
though. Best is just to try.

> Out of curiosity, what are they talking about, when they say "git is
> fast?" Just the fact that it's all local disk, or is there more to
> it than that?

Not just local disk: bzr also works locally, and git is much faster on
most operations (bzr status can now compete with git, but "git log"
and "git commit" can be instantaneous where bzr takes 1 minute, for
example).

For sure, doing most operations locally is the key to being fast, but
Git has also been written so that the complexity of its algorithms is
as low as possible.

> I could see - git would probably outperform perforce for versioning
> of large files (let's say iso files) to benefit from sustained local
> disk IO, while perforce would probably outperform anything I can
> think of, operating on thousands of tiny files, because it will
> never walk the tree.

Mercurial has an extension called "inotify" that avoids walking the
disk too. AFAIK there is no equivalent in Git (mostly because most
people interested find git fast enough).

-- 
Matthieu


* Re: git performance
  2008-10-23  7:43   ` Andreas Ericsson
@ 2008-10-23 13:04     ` Nguyen Thai Ngoc Duy
  0 siblings, 0 replies; 22+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2008-10-23 13:04 UTC (permalink / raw)
  To: Andreas Ericsson; +Cc: Jakub Narebski, Edward Ned Harvey, git

On 10/23/08, Andreas Ericsson <ae@op5.se> wrote:
> Jakub Narebski wrote:
>
> > "Edward Ned Harvey" <git@nedharvey.com> writes:
> >
> >
> > > I see things all over the Internet saying git is fast.  I'm
> > > currently struggling with poor svn performance and poor attitude of
> > > svn developers, so I'd like to consider switching to git.  A quick
> > > question first.
> > >
> > > The core of the performance problem I'm facing is the need to "walk
> > > the tree" for many thousand files.  Every time I do "svn update" or
> > > "svn status" the svn client must stat every file to check for local
> > > modifications (a coffee cup or a beer worth of stats).  In essence,
> > > this is unavoidable if there is no mechanism to constantly monitor
> > > filesystem activity during normal operations.  Analogous to
> > > filesystem journaling.
> > >
> > > So - I didn't see anything out there saying "git is fast because it
> > > uses inotify" or anything like that.  Perhaps git would not help me
> > > at all?  Because git still needs to stat all the files in the tree?
> > >
> >
> > http://git.or.cz/gitwiki/GitBenchmarks
> >
> > While it should be possible to use the 'assume unchanged' bit together
> > with inotify / incron, it is not something that is done; IIRC Mercurial
> > had a Linux-only InotifyPlugin...
> >
> >
>
>  Well, inotify() is Linux specific, so it'd be quite hard to support on
>  another platform. Emulating it with a billion stat() calls feels rather
>  like a disk (and I/O performance) killer.

There is "filemon" on Windows, which monitors file access. I don't
know how it impacts performance though. A quick search revealed kqueue
for FreeBSD/Mac OS X.
-- 
Duy


* Re: git performance
  2008-10-22 21:55   ` Edward Ned Harvey
                       ` (3 preceding siblings ...)
  2008-10-23 12:16     ` Matthieu Moy
@ 2008-10-23 16:39     ` Jeff King
       [not found]       ` <000001c9358f$232bac70$69830550$@com>
  2008-10-23 18:31     ` Daniel Barkalow
                       ` (2 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Jeff King @ 2008-10-23 16:39 UTC (permalink / raw)
  To: Edward Ned Harvey; +Cc: git

On Wed, Oct 22, 2008 at 05:55:14PM -0400, Edward Ned Harvey wrote:

> I'm talking about 40-50,000 files, on multi-user production linux,
> which means the cache is never warm, except when I'm benchmarking.

Well, if you have a cold cache it's going to take longer. :) You should
probably benchmark if you want to know exactly how long.

> Specifically RHEL 4 with the files on NFS mount.  Cold cache "svn st"
> takes ~10 mins.  Warm cache 20-30 sec.  Surprisingly to me,

Wow, that is awful. For comparison, "git status" from a cold cache on
the kernel repo takes me 17 seconds. From a warm cache, less than half
a second.

Yes, the cold cache case would probably be better with inotify, but
compared to svn, that's screaming fast. I haven't used perforce. If your
bottleneck really is stat'ing the tree, then yes, something that avoided
that might perform better (but weigh that particular optimization
against other things which might be slower).

> Out of curiosity, what are they talking about, when they say "git is
> fast?"

Well, there are the numbers above. When comparing to SVN or (god forbid)
CVS, there are order of magnitude speedups for most common operations.

>  Just the fact that it's all local disk, or is there more to it
> than that?  I could see - git would probably outperform perforce for

The things that generally make git fast are:

  - using a compact on-disk structure (including zlib and aggressive
    delta-finding) to keep your cache warm (and when it's not warm, to
    get data off the disk as quickly as possible)

  - the content-addressable nature of objects means we can just look at
    the data we need to solve a problem. For example,
    getting the history between point A and point B is "O(the number of
    commits between A and B)", _not_ "O(the size of the repo)".
    Viewing a log without generating diffs is "O(the number of
    commits)", not "O(some combination of the number of commits and the
    number of files in each commit)". Diffing two points in history is
    "O(the size of the differences between the two points)" and is
    totally independent of the number of commits between the two points.

  - most operations are streamable. "git log >/dev/null" on the kernel
    repo (about 90,000 commits) takes 8.5 seconds on my box. But it
    starts generating output immediately, so it _feels_ instant, and the
    rest of the data is generated while I read the first commit in my
    pager.
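
The content-addressable point can be seen directly with "git hash-object", which prints the id git would assign to a piece of content. A minimal sketch, with made-up sample strings:

```shell
#!/bin/sh
# The object id is computed from the content alone, so the same bytes
# always get the same id, whatever the file is called or wherever it lives.
id1=$(printf 'hello\n' | git hash-object --stdin)
id2=$(printf 'hello\n' | git hash-object --stdin)
id3=$(printf 'hello!\n' | git hash-object --stdin)
echo "same content:      $id1 $id2"
echo "different content: $id3"
```

Because equal ids mean equal content, comparing two points in history can skip any file or directory whose id matches on both sides without opening it, which is why the cost follows the size of the differences.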

-Peff


* RE: git performance
  2008-10-22 21:55   ` Edward Ned Harvey
                       ` (4 preceding siblings ...)
  2008-10-23 16:39     ` Jeff King
@ 2008-10-23 18:31     ` Daniel Barkalow
  2008-10-23 22:24     ` Nanako Shiraishi
  2008-10-24  7:55     ` Pete Harlan
  7 siblings, 0 replies; 22+ messages in thread
From: Daniel Barkalow @ 2008-10-23 18:31 UTC (permalink / raw)
  To: Edward Ned Harvey; +Cc: git

On Wed, 22 Oct 2008, Edward Ned Harvey wrote:

> Out of curiosity, what are they talking about, when they say "git is 
> fast?"  Just the fact that it's all local disk, or is there more to it 
> than that?  I could see - git would probably outperform perforce for 
> versioning of large files (let's say iso files) to benefit from 
> sustained local disk IO, while perforce would probably outperform 
> anything I can think of, operating on thousands of tiny files, because 
> it will never walk the tree. 

It shouldn't be too hard to make git work like perforce with respect to 
walking the tree. git keeps an index of the stat() info it saw when it 
last looked at files, and only looks at the contents of files whose stat() 
info has changed. In order to have it work like perforce, it would just 
need to have a flag in the stat() info index for "don't even bother", 
which it would use for files that aren't "open"; for files with this flag, 
the check for index freshness would always say it's fresh without looking 
at the filesystem. Then you'd just have a config option to check out files 
as "not open" (and not writeable), and have a "git open" program that 
would chmod files and get their real stat info.
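
As a rough illustration of that idea using pieces that exist today: git_close and git_open below are hypothetical shell helpers, not git commands, and they lean on the per-file assume-unchanged bit of "git update-index" as a stand-in for the "don't even bother" flag. The config option for checking files out as "not open" is not sketched.

```shell
#!/bin/sh
# git_close FILE / git_open FILE: hypothetical helpers, not real git
# commands. They expect to be run from inside the working tree.
git_close() {
    git update-index --assume-unchanged -- "$1" && chmod a-w -- "$1"
}
git_open() {
    chmod u+w -- "$1" && git update-index --no-assume-unchanged -- "$1"
}

# Throwaway repository for the demo.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo" || exit 1
echo "v1" > a.txt
git add a.txt
git -c user.name=demo -c user.email=demo@example.com commit -q -m "initial"

git_close a.txt
closed_flag=$(git ls-files -v a.txt | cut -c1)   # lowercase: not checked
git_open a.txt
open_flag=$(git ls-files -v a.txt | cut -c1)     # uppercase: checked again
echo "closed=$closed_flag open=$open_flag"
```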

Of course, git is tuned for cases where the modify/build/test cycle 
requires stat() (or worse) on every file.

	-Daniel
*This .sig left intentionally blank*


* Re: git performance
  2008-10-22 21:55   ` Edward Ned Harvey
                       ` (5 preceding siblings ...)
  2008-10-23 18:31     ` Daniel Barkalow
@ 2008-10-23 22:24     ` Nanako Shiraishi
  2008-10-24  3:56       ` Daniel Barkalow
  2008-10-24  7:55     ` Pete Harlan
  7 siblings, 1 reply; 22+ messages in thread
From: Nanako Shiraishi @ 2008-10-23 22:24 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: Edward Ned Harvey, git

Quoting Daniel Barkalow <barkalow@iabervon.org>:

> On Wed, 22 Oct 2008, Edward Ned Harvey wrote:
>
>> Out of curiosity, what are they talking about, when they say "git is 
>> fast?"  Just the fact that it's all local disk, or is there more to it 
>> than that?  I could see - git would probably outperform perforce for 
>> versioning of large files (let's say iso files) to benefit from 
>> sustained local disk IO, while perforce would probably outperform 
>> anything I can think of, operating on thousands of tiny files, because 
>> it will never walk the tree. 
>
> It shouldn't be too hard to make git work like perforce with respect to 
> walking the tree. git keeps an index of the stat() info it saw when it 
> last looked at files, and only looks at the contents of files whose stat() 
> info has changed. In order to have it work like perforce, it would just 
> need to have a flag in the stat() info index for "don't even bother", 

Are you describing the "assume unchanged bit"?

-- 
Nanako Shiraishi
http://ivory.ap.teacup.com/nanako3/


* Re: git performance
  2008-10-23 22:24     ` Nanako Shiraishi
@ 2008-10-24  3:56       ` Daniel Barkalow
  0 siblings, 0 replies; 22+ messages in thread
From: Daniel Barkalow @ 2008-10-24  3:56 UTC (permalink / raw)
  To: Nanako Shiraishi; +Cc: Edward Ned Harvey, git

On Fri, 24 Oct 2008, Nanako Shiraishi wrote:

> Quoting Daniel Barkalow <barkalow@iabervon.org>:
> 
> > On Wed, 22 Oct 2008, Edward Ned Harvey wrote:
> >
> >> Out of curiosity, what are they talking about, when they say "git is 
> >> fast?"  Just the fact that it's all local disk, or is there more to it 
> >> than that?  I could see - git would probably outperform perforce for 
> >> versioning of large files (let's say iso files) to benefit from 
> >> sustained local disk IO, while perforce would probably outperform 
> >> anything I can think of, operating on thousands of tiny files, because 
> >> it will never walk the tree. 
> >
> > It shouldn't be too hard to make git work like perforce with respect to 
> > walking the tree. git keeps an index of the stat() info it saw when it 
> > last looked at files, and only looks at the contents of files whose stat() 
> > info has changed. In order to have it work like perforce, it would just 
> > need to have a flag in the stat() info index for "don't even bother", 
> 
> Are you describing the "assume unchanged bit"?

Yes, but with the user write mode bit in the filesystem set to 
no-assume-unchanged, which is how Perforce users cope with it. I hadn't 
realized it had been implemented to get set on a per-file basis, rather 
than just as a global setting that caused it to not stat() anything except 
right when it was told to update.

	-Daniel
*This .sig left intentionally blank*


* Re: git performance
  2008-10-22 21:55   ` Edward Ned Harvey
                       ` (6 preceding siblings ...)
  2008-10-23 22:24     ` Nanako Shiraishi
@ 2008-10-24  7:55     ` Pete Harlan
  2008-10-24 23:10       ` Pete Harlan
  7 siblings, 1 reply; 22+ messages in thread
From: Pete Harlan @ 2008-10-24  7:55 UTC (permalink / raw)
  To: Edward Ned Harvey; +Cc: git

Edward Ned Harvey wrote:
> > Yes, it does stat all the files. How many files are you talking about,
> > and what platform?  From a warm cache on Linux, the 23,000 files kernel
> > repo takes about a tenth of a second to stat all files for me (and this
>
> I'm talking about 40-50,000 files, on multi-user production linux,
> which means the cache is never warm, except when I'm benchmarking.
> Specifically RHEL 4 with the files on NFS mount.  Cold cache "svn
> st" takes ~10 mins.  Warm cache 20-30 sec.  Surprisingly to me,

I did some tests with a repo with ~32k files, and git was slightly
slower than svn with a cold cache (10.2s vs 8.4s), and around twice as
fast with a warm cache (.5s vs 1s).

Git 1.6.0.2, svn 1.4.6. Cache made cold with
"echo 1 >/proc/sys/vm/drop_caches".  Timings best of 5 runs.

(I did various benchmarks with svn 1.5.3 also, but there's something
awfully wrong with svn 1.5.x's merging, which takes pathologically
long compared with 1.4 (minutes instead of seconds), and it wasn't
noticeably faster than 1.4 at anything I tested.)

> performance was approx the same for files on local disk versus NFS.

10 minutes seems like a crazy amount of time for 40-50k files.  If you
hadn't said you'd tested it on local disks, it would really sound like
a bad NFS interaction more than an svn problem.

> Out of curiosity, what are they talking about, when they say "git is
> fast?"

In my comparisons between svn and git, the operation "checkout
revision N of the tree" (i.e., "svn update -r 40000" vs "git checkout
302c7476") took five minutes on subversion and ten seconds using git.
The tests were all local, so git wasn't benefiting from being a DVCS,
it was just eerily fast on some things.  Svn was even that slow when
the revisions were 1 commit different, if it was a large enough
commit.

I don't check out whole revisions like that very often, but switching
between branches is a similar operation.  It doesn't usually take five
minutes in svn but it's an interruption, and with git it isn't.

For almost everything I tried git was faster, but status wasn't really
one of them.  The compelling cases were the number of things that were
faster _enough_ to no longer be an interruption, and being a DVCS, and
rebase, and rebase -i, and gitk, and a smarter blame, and
branching/merging support like it's something you'd do all day long,
not just when you were forced to.

HTH,

--Pete

* Re: git performance
       [not found]       ` <000001c9358f$232bac70$69830550$@com>
@ 2008-10-24 14:29         ` Jeff King
  2008-10-24 17:42           ` George Shammas
  2008-10-24 17:53           ` Linus Torvalds
  0 siblings, 2 replies; 22+ messages in thread
From: Jeff King @ 2008-10-24 14:29 UTC (permalink / raw)
  To: Edward Ned Harvey; +Cc: git

On Fri, Oct 24, 2008 at 12:15:19AM -0400, Edward Ned Harvey wrote:

> Feel free to forward to the list, if anyone's still talking about it.
> I already un-subscribed.

Posting is not limited to subscribers, so you can happily continue the
conversation there by cc'ing the list (and I am cc'ing the list here).

> I did my benchmarking at least two months ago, so I forgot the exact
> results now, so I ran the benchmark once just now.  I also downloaded
> git, and did "git status" for comparison.  I rebooted the system in
> between each trial run, to clear the cache.  Here's the results:

Side note: on Linux, it is much easier to clear the cache via

  echo 1 >/proc/sys/vm/drop_caches

than to reboot for each benchmark.

> Local disk mirror "time git status" on the same tree. 17,468 versioned files, so the whole tree is 30,647 including .git files
> 	0m 25s	cold cache
> 	0m 0.2s	warm cache trial 1
> 	0m 0.2s	warm cache trial 2

Hmm. That's a big increase in the number of files for .git. Did you try
repacking and then running your test?

> I questioned whether svn and git were causing unnecessary overhead.

Sure, they are doing more than just walking. So there is overhead, but
it's hard to say how much is unnecessary. However, if you were working
with an unpacked git repo, then it may have had to open() a lot of files
in the object db (keep in mind that status doesn't just show the
difference between the working tree and the index; it also shows the
difference between the index and the last commit. So maybe "git diff"
would be a more accurate comparison).

> Conclusions:  
> * For "status" operations on cold cache, large file count, Neither the
> performance of git or svn approaches the ideal.  Both are an order of
> magnitude slower than ideal, which is still assuming "ideal" requires
> walking the tree.  A better ideal avoids the need to walk the tree,
> and has near-zero total cost.

Try your git benchmark again with a packed repo, and I think you will
find it approaches the time it takes to walk the tree.
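
To get a number for "the time it takes to walk the tree", something
like the following Python sketch could serve as a baseline. It is only
an illustration: walk_and_stat is a made-up helper, and it skips the
.git directory on the assumption that only working-tree files should be
counted.

```python
import os
import time


def walk_and_stat(root):
    """lstat() every file under root; return (file_count, elapsed_seconds)."""
    start = time.perf_counter()
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune git's own metadata so only working-tree files are visited.
        if ".git" in dirnames:
            dirnames.remove(".git")
        for name in filenames:
            os.lstat(os.path.join(dirpath, name))
            count += 1
    return count, time.perf_counter() - start
```

Calling walk_and_stat(".") at the top of the working tree, once with a
cold cache and once warm, gives figures to set beside "time git status".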

That being said, if walking the tree is unacceptable to you, then no,
current git won't work. You would need to patch it to use inotify (once
upon a time there was some discussion of this, but it never went
anywhere -- I guess most people work on machines where they can keep the
cache relatively warm).

-Peff

* Re: git performance
  2008-10-24 14:29         ` Jeff King
@ 2008-10-24 17:42           ` George Shammas
  2008-10-24 19:06             ` Jakub Narebski
  2008-10-24 17:53           ` Linus Torvalds
  1 sibling, 1 reply; 22+ messages in thread
From: George Shammas @ 2008-10-24 17:42 UTC (permalink / raw)
  To: git

If you are really trying to back up a filesystem, you may want to look
at a filesystem that can do snapshots, such as NILFS or ZFS. It would be
a lot more efficient than a version control system.

http://en.wikipedia.org/wiki/NILFS
http://en.wikipedia.org/wiki/ZFS

Both of these will allow you to look at changed files over time. NILFS is
slightly different in that it doesn't take snapshots; because it never
deletes, you can roll back every change to a file. They both also
allow each user to roll back their own files if they want to, so if
this is your goal, source code version control is not for you, and a
good file system is.

-G

* Re: git performance
  2008-10-24 14:29         ` Jeff King
  2008-10-24 17:42           ` George Shammas
@ 2008-10-24 17:53           ` Linus Torvalds
  2008-10-24 18:20             ` Jeff King
  1 sibling, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2008-10-24 17:53 UTC (permalink / raw)
  To: Jeff King; +Cc: Edward Ned Harvey, git

On Fri, 24 Oct 2008, Jeff King wrote:
> 
> Side note: on Linux, it is much easier to clear the cache via
> 
>   echo 1 >/proc/sys/vm/drop_caches

Use "echo 3" instead of "1".

It's actually a bitmask, with bit 0 being "data" (pagecache) and bit 1 
being "metadata" (inodes and directory caches).

And since git (or any SCM) is very metadata-intensive, you really should 
make sure to drop metadata too, otherwise your caches won't be really very 
cold at all.

(But it obviously depends on the operation you're testing - some are more 
about the inodes and directories, others are about file data access).
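
As a rough sketch of that bitmask in Python (the helper names describe
and drop_caches are made up for illustration; only the
/proc/sys/vm/drop_caches path and the meaning of bits 0 and 1 come from
the kernel):

```python
import os

DROP_CACHES = "/proc/sys/vm/drop_caches"  # Linux sysctl file, writable by root

PAGECACHE = 1  # bit 0: file data (page cache)
METADATA = 2   # bit 1: dentries and inodes


def describe(mask):
    """Return the kinds of cache that a drop_caches value would discard."""
    kinds = []
    if mask & PAGECACHE:
        kinds.append("pagecache")
    if mask & METADATA:
        kinds.append("dentries and inodes")
    return kinds


def drop_caches(mask=PAGECACHE | METADATA):
    """Flush dirty data, then ask the kernel to drop the selected caches."""
    os.sync()  # only clean cache entries can be dropped, so write back first
    with open(DROP_CACHES, "w") as f:
        f.write(f"{mask}\n")
```

The default mask of 3 drops both kinds, which is what a cold-cache
"status" benchmark wants; drop_caches() has to run as root.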

			Linus

* Re: git performance
  2008-10-24 17:53           ` Linus Torvalds
@ 2008-10-24 18:20             ` Jeff King
  0 siblings, 0 replies; 22+ messages in thread
From: Jeff King @ 2008-10-24 18:20 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Edward Ned Harvey, git

On Fri, Oct 24, 2008 at 10:53:20AM -0700, Linus Torvalds wrote:

> >   echo 1 >/proc/sys/vm/drop_caches
> 
> Use "echo 3" instead of "1".
> 
> It's actually a bitmask, with bit 0 being "data" (pagecache) and bit 1 
> being "metadata" (inodes and directory caches).
> 
> And since git (or any SCM) is very metadata-intensive, you really should 
> make sure to drop metadata too, otherwise your caches won't be really very 
> cold at all.
> 
> (But it obviously depends on the operation you're testing - some are more 
> about the inodes and directories, others are about file data access).

Ah, thanks. In this case, he was interested in walking the directory
tree, so the metadata caching was indeed very important.

-Peff

* Re: git performance
  2008-10-24 17:42           ` George Shammas
@ 2008-10-24 19:06             ` Jakub Narebski
  0 siblings, 0 replies; 22+ messages in thread
From: Jakub Narebski @ 2008-10-24 19:06 UTC (permalink / raw)
  To: George Shammas; +Cc: git

"George Shammas" <georgyo@gmail.com> writes:

> If you are really trying to back up a filesystem, you may want to look
> at a filesystem that can do snapshots, it would be a lot more
> efficient than a version control system.  Such as NILFS and ZFS.
> 
> http://en.wikipedia.org/wiki/NILFS
> http://en.wikipedia.org/wiki/ZFS

Or ext3cow, or (currently in early stages of development) Tux3

  http://en.wikipedia.org/wiki/Ext3cow
  http://en.wikipedia.org/wiki/Tux3

-- 
Jakub Narebski
Poland
ShadeHawk on #git

* Re: git performance
  2008-10-24  7:55     ` Pete Harlan
@ 2008-10-24 23:10       ` Pete Harlan
  0 siblings, 0 replies; 22+ messages in thread
From: Pete Harlan @ 2008-10-24 23:10 UTC (permalink / raw)
  To: Edward Ned Harvey; +Cc: git

Pete Harlan wrote:
> Edward Ned Harvey wrote:
>>> Yes, it does stat all the files. How many files are you talking about,
>>> and what platform?  From a warm cache on Linux, the 23,000 files kernel
>>> repo takes about a tenth of a second to stat all files for me (and this
>> I'm talking about 40-50,000 files, on multi-user production linux,
>> which means the cache is never warm, except when I'm benchmarking.
>> Specifically RHEL 4 with the files on NFS mount.  Cold cache "svn
>> st" takes ~10 mins.  Warm cache 20-30 sec.  Surprisingly to me,
> 
> I did some tests with a repo with ~32k files, and git was slightly
> slower than svn with a cold cache (10.2s vs 8.4s), and around twice as
> fast with a warm cache (.5s vs 1s).
> 
> Git 1.6.0.2, svn 1.4.6. Cache made cold with
> "echo 1 >/proc/sys/vm/drop_caches".  Timings best of 5 runs.

After redoing this test with "echo 3 >/proc/sys/vm/drop_caches" (which
also discards metadata, as pointed out by Linus), the cold-cache
timings are:

	svn 12.65 seconds
	git 10.3  seconds

So no earth-shattering difference, but now git is somewhat quicker
than Subversion at everything I tested.
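
A "best of N runs" measurement of this kind can be scripted. The Python
sketch below is an assumption about how such a harness might look, not
the script actually used for the numbers above; best_of and before_each
are hypothetical names.

```python
import subprocess
import time


def best_of(cmd, runs=5, before_each=None):
    """Run cmd `runs` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        if before_each is not None:
            before_each()  # e.g. a function that drops the caches
        start = time.perf_counter()
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return min(timings)
```

For a cold-cache figure, before_each would be a function that writes 3
to /proc/sys/vm/drop_caches (as root); leaving it out gives the
warm-cache figure, e.g. best_of(["git", "status"]).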

--Pete

end of thread, other threads:[~2008-10-24 23:11 UTC | newest]

Thread overview: 22+ messages
2008-10-22 20:17 git performance Edward Ned Harvey
2008-10-22 20:36 ` Jeff King
2008-10-22 21:13   ` Peter Harris
2008-10-22 21:55   ` Edward Ned Harvey
2008-10-23  7:11     ` Andreas Ericsson
2008-10-23  7:11     ` Andreas Ericsson
2008-10-23  7:41     ` Andreas Ericsson
2008-10-23 12:16     ` Matthieu Moy
2008-10-23 16:39     ` Jeff King
     [not found]       ` <000001c9358f$232bac70$69830550$@com>
2008-10-24 14:29         ` Jeff King
2008-10-24 17:42           ` George Shammas
2008-10-24 19:06             ` Jakub Narebski
2008-10-24 17:53           ` Linus Torvalds
2008-10-24 18:20             ` Jeff King
2008-10-23 18:31     ` Daniel Barkalow
2008-10-23 22:24     ` Nanako Shiraishi
2008-10-24  3:56       ` Daniel Barkalow
2008-10-24  7:55     ` Pete Harlan
2008-10-24 23:10       ` Pete Harlan
2008-10-22 22:42 ` Jakub Narebski
2008-10-23  7:43   ` Andreas Ericsson
2008-10-23 13:04     ` Nguyen Thai Ngoc Duy
