* Re: git performance
2008-10-22 21:55 ` Edward Ned Harvey
@ 2008-10-23 7:11 ` Andreas Ericsson
2008-10-23 7:11 ` Andreas Ericsson
` (6 subsequent siblings)
7 siblings, 0 replies; 22+ messages in thread
From: Andreas Ericsson @ 2008-10-23 7:11 UTC (permalink / raw)
To: Edward Ned Harvey; +Cc: git
Edward Ned Harvey wrote:
>> Yes, it does stat all the files. How many files are you talking about,
>> and what platform? From a warm cache on Linux, the 23,000 files kernel
>> repo takes about a tenth of a second to stat all files for me (and this
>> on a several year-old machine). And of course many operations don't
>> require stat'ing at all (like looking at logs, or diffs that don't
>> involve the working tree).
>
> No worries. No solution can meet everyone's needs.
>
> I'm talking about 40-50,000 files, on multi-user production linux, which means the cache is never warm, except when I'm benchmarking. Specifically RHEL 4 with the files on NFS mount. Cold cache "svn st" takes ~10 mins. Warm cache 20-30 sec. Surprisingly to me, performance was approx the same for files on local disk versus NFS. Probably the best solution for us is perforce, we just don't like the pricetag.
>
> Out of curiosity, what are they talking about, when they say "git is fast?" Just the fact that it's all local disk, or is there more to it than that? I could see - git would probably outperform perforce for versioning of large files (let's say iso files) to benefit from sustained local disk IO, while perforce would probably outperform anything I can think of, operating on thousands of tiny files, because it will never walk the tree.
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: git performance
2008-10-22 21:55 ` Edward Ned Harvey
2008-10-23 7:11 ` Andreas Ericsson
2008-10-23 7:11 ` Andreas Ericsson
@ 2008-10-23 7:41 ` Andreas Ericsson
2008-10-23 12:16 ` Matthieu Moy
` (4 subsequent siblings)
7 siblings, 0 replies; 22+ messages in thread
From: Andreas Ericsson @ 2008-10-23 7:41 UTC (permalink / raw)
To: Edward Ned Harvey; +Cc: git
Edward Ned Harvey wrote:
>> Yes, it does stat all the files. How many files are you talking
>> about, and what platform? From a warm cache on Linux, the 23,000
>> files kernel repo takes about a tenth of a second to stat all files
>> for me (and this on a several year-old machine). And of course many
>> operations don't require stat'ing at all (like looking at logs, or
>> diffs that don't involve the working tree).
>
> No worries. No solution can meet everyone's needs.
>
> I'm talking about 40-50,000 files, on multi-user production linux,
Umm... using git to track a production server? I think there's something
in your specific use-case that eluded pretty much everyone here the
first time you asked about it.
git was built to maintain the linux kernel with its patch-and-merge based
workflow, 117k commits and 25k files. It's *good* at that sort of thing,
but a lot of features are "source-code management" specific. It sounds to
me like you're asking for something that will keep a backup of most of your
entire system (apart from /home), which it's not really suited for. For
instance, it doesn't keep track of mode-bits on files (apart from
"executable or not").
> which means the cache is never warm, except when I'm benchmarking.
> Specifically RHEL 4 with the files on NFS mount. Cold cache "svn st"
> takes ~10 mins. Warm cache 20-30 sec. Surprisingly to me,
> performance was approx the same for files on local disk versus NFS.
> Probably the best solution for us is perforce, we just don't like the
> pricetag.
>
> Out of curiosity, what are they talking about, when they say "git is
> fast?"
Merges, patch application, committing, history walking and data
transfers are all extremely quick operations under git.
Actually, history walking isn't extremely quick, but several neat
tricks are in place that make it *seem* quick. Running
"git log drivers/net/wireless" on the linux kernel with a cold
cache starts spitting out output after about 1 second on my measly
laptop (where the kernel has 117k commits on 25k files).
> Just the fact that it's all local disk, or is there more to
> it than that? I could see - git would probably outperform perforce
> for versioning of large files (let's say iso files) to benefit from
> sustained local disk IO, while perforce would probably outperform
> anything I can think of, operating on thousands of tiny files,
> because it will never walk the tree.
>
Git doesn't *have* to walk the tree either. "git status" obviously
has to do that, since you're asking "what files have changed in this
tree since I last added stuff to the index", but you can use git just
fine without ever issuing "git status" (assuming you're the one
controlling the changes, that is).
"git rm" and "git add" won't walk the tree. They're just interested in
the paths you give them and won't touch anything else.
"git commit path1 path2" won't walk the tree. It has to walk the paths
(which can be entire subdirectories, or all of them), but not more than
that.
"git push" (ie, send your changes upstream) won't walk the tree. It'll
just look at the two histories and how they differ.
"git merge" (and therefore also "git pull") doesn't walk the tree. It
only makes sure paths that are touched by the merge are up-to-date.
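A minimal sketch of that path-limited workflow, assuming a POSIX shell.
commit_paths is a hypothetical helper name, not a git command; it only
chains the "git add" and "git commit path1 path2" forms described above,
so git is handed the named paths and nothing else.

```shell
#!/bin/sh
# commit_paths MESSAGE PATH...
# Hypothetical helper: stage and commit only the named paths, so no
# whole-tree "git status" or "git commit -a" is ever issued.
commit_paths() {
    msg=$1
    shift
    # Both commands are given explicit paths after "--".
    git add -- "$@" &&
    git commit -q -m "$msg" -- "$@"
}

# Usage (from inside a work tree):
#   commit_paths "fix typo" docs/README src/main.c
```

Anything not named on the command line is left alone, including
untracked files elsewhere in the tree.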
Apart from that, it would be trivial to hack up some inotify config
and scripts that stage changes in a separate index file, and then
add a simple wrapper that operates on the separate index file rather
than the "regular" one.
Sample "giti" wrapper:
--%<--%<--%<--
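#!/bin/sh
# giti - inotify driven git wrapper
# Point git at the index that the inotify script keeps up to date.
# GIT_INDEX_FILE is the variable git reads; the relative path assumes
# giti is run from the top of the working tree.
GIT_INDEX_FILE=.git/inotify-index
export GIT_INDEX_FILE
case "$1" in
status)
    # List staged paths without stat'ing the working tree.
    git diff --name-only --cached
    exit $?
    ;;
esac
git "$@"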
--%<--%<--%<--
Sample inotify script:
--%<--%<--%<--
#!/bin/sh
# Stage the changed file in the separate index (run from the repo top).
GIT_INDEX_FILE=.git/inotify-index git add -- "$1"
--%<--%<--%<--
Sample incrontab(5) entry:
--%<--%<--%<--
/watched/path IN_CLOSE_WRITE inotify.git $@/$#
--%<--%<--%<--
Totally untested, of course, so it probably needs tweaking. It should
work rather well though, assuming you're somewhat careful what
arguments you send to the "giti" wrapper and make sure to never
use any git-commands that *have* to walk the entire tree (such as
"git commit -a").
Let us know how it pans out.
--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
* Re: git performance
2008-10-22 21:55 ` Edward Ned Harvey
` (2 preceding siblings ...)
2008-10-23 7:41 ` Andreas Ericsson
@ 2008-10-23 12:16 ` Matthieu Moy
2008-10-23 16:39 ` Jeff King
` (3 subsequent siblings)
7 siblings, 0 replies; 22+ messages in thread
From: Matthieu Moy @ 2008-10-23 12:16 UTC (permalink / raw)
To: Edward Ned Harvey; +Cc: git
"Edward Ned Harvey" <git@nedharvey.com> writes:
>> Yes, it does stat all the files. How many files are you talking about,
>> and what platform? From a warm cache on Linux, the 23,000 files kernel
>> repo takes about a tenth of a second to stat all files for me (and this
>> on a several year-old machine). And of course many operations don't
>> require stat'ing at all (like looking at logs, or diffs that don't
>> involve the working tree).
>
> No worries. No solution can meet everyone's needs.
>
> I'm talking about 40-50,000 files, on multi-user production linux,
> which means the cache is never warm, except when I'm benchmarking.
> Specifically RHEL 4 with the files on NFS mount. Cold cache "svn st"
> takes ~10 mins. Warm cache 20-30 sec.
SVN doesn't only have to stat the files. It also has to read the
stat-cache information, which is split into one .svn/ per directory in
the working tree. I'm not sure which operation dominates the
performance, though. Best is just to try.
> Out of curiosity, what are they talking about, when they say "git is
> fast?" Just the fact that it's all local disk, or is there more to
> it than that?
Not just local disk: bzr also works locally, and git is much faster on
most operations (bzr status can now compete with git, but "git log"
and "git commit" can be instantaneous where bzr takes 1 minute, for
example).
For sure, doing most operations locally is the key to being fast, but
Git has also been written so that the complexity of its algorithms is as
low as possible.
> I could see - git would probably outperform perforce for versioning
> of large files (let's say iso files) to benefit from sustained local
> disk IO, while perforce would probably outperform anything I can
> think of, operating on thousands of tiny files, because it will
> never walk the tree.
Mercurial has an extension called "inotify" that avoids walking the
disk too. AFAIK there is no equivalent in Git (mostly because most
interested people find git fast enough).
--
Matthieu
* Re: git performance
2008-10-22 21:55 ` Edward Ned Harvey
` (3 preceding siblings ...)
2008-10-23 12:16 ` Matthieu Moy
@ 2008-10-23 16:39 ` Jeff King
[not found] ` <000001c9358f$232bac70$69830550$@com>
2008-10-23 18:31 ` Daniel Barkalow
` (2 subsequent siblings)
7 siblings, 1 reply; 22+ messages in thread
From: Jeff King @ 2008-10-23 16:39 UTC (permalink / raw)
To: Edward Ned Harvey; +Cc: git
On Wed, Oct 22, 2008 at 05:55:14PM -0400, Edward Ned Harvey wrote:
> I'm talking about 40-50,000 files, on multi-user production linux,
> which means the cache is never warm, except when I'm benchmarking.
Well, if you have a cold cache it's going to take longer. :) You should
probably benchmark if you want to know exactly how long.
> Specifically RHEL 4 with the files on NFS mount. Cold cache "svn st"
> takes ~10 mins. Warm cache 20-30 sec. Surprisingly to me,
Wow, that is awful. For comparison, "git status" from a cold cache on
the kernel repo takes me 17 seconds. From a warm cache, less than half
a second.
Yes, the cold cache case would probably be better with inotify, but
compared to svn, that's screaming fast. I haven't used perforce. If your
bottleneck really is stat'ing the tree, then yes, something that avoided
that might perform better (but weigh that particular optimization
against other things which might be slower).
> Out of curiosity, what are they talking about, when they say "git is
> fast?"
Well, there are the numbers above. When comparing to SVN or (god forbid)
CVS, there are order of magnitude speedups for most common operations.
> Just the fact that it's all local disk, or is there more to it
> than that? I could see - git would probably outperform perforce for
The things that generally make git fast are:
- using a compact on-disk structure (including zlib and aggressive
delta-finding) to keep your cache warm (and when it's not warm, to
get data off the disk as quickly as possible)
- the content-addressable nature of objects means we can just look at
the data we need to solve a problem. For example,
getting the history between point A and point B is "O(the number of
commits between A and B)", _not_ "O(the size of the repo)".
Viewing a log without generating diffs is "O(the number of
commits)", not "O(some combination of the number of commits and the
number of files in each commit)". Diffing two points in history is
"O(the size of the differences between the two points)" and is
totally independent of the number of commits between the two points.
- most operations are streamable. "git log >/dev/null" on the kernel
repo (about 90,000 commits) takes 8.5 seconds on my box. But it
starts generating output immediately, so it _feels_ instant, and the
rest of the data is generated while I read the first commit in my
pager.
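The complexity claims in the list above can be poked at from the command
line. A rough sketch, assuming a POSIX shell and a git recent enough to
have "rev-list --count"; commits_between and changed_paths are
hypothetical helper names, and A and B stand for any two revisions.

```shell
#!/bin/sh
# commits_between A B
# Count the commits reachable from B but not from A. Only the history
# between the two points is walked, not the whole repository.
commits_between() {
    git rev-list --count "$1..$2"
}

# changed_paths A B
# Name the paths that differ between two points in history. The two
# trees are compared directly; the commits in between are not visited.
changed_paths() {
    git diff --name-only "$1" "$2"
}

# Usage:
#   commits_between v1 HEAD
#   changed_paths v1 HEAD
#   git log | head -n 20    # streaming: output starts before the walk ends
```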
-Peff
* RE: git performance
2008-10-22 21:55 ` Edward Ned Harvey
` (4 preceding siblings ...)
2008-10-23 16:39 ` Jeff King
@ 2008-10-23 18:31 ` Daniel Barkalow
2008-10-23 22:24 ` Nanako Shiraishi
2008-10-24 7:55 ` Pete Harlan
7 siblings, 0 replies; 22+ messages in thread
From: Daniel Barkalow @ 2008-10-23 18:31 UTC (permalink / raw)
To: Edward Ned Harvey; +Cc: git
On Wed, 22 Oct 2008, Edward Ned Harvey wrote:
> Out of curiosity, what are they talking about, when they say "git is
> fast?" Just the fact that it's all local disk, or is there more to it
> than that? I could see - git would probably outperform perforce for
> versioning of large files (let's say iso files) to benefit from
> sustained local disk IO, while perforce would probably outperform
> anything I can think of, operating on thousands of tiny files, because
> it will never walk the tree.
It shouldn't be too hard to make git work like perforce with respect to
walking the tree. git keeps an index of the stat() info it saw when it
last looked at files, and only looks at the contents of files whose stat()
info has changed. In order to have it work like perforce, it would just
need to have a flag in the stat() info index for "don't even bother",
which it would use for files that aren't "open"; for files with this flag,
the check for index freshness would always say it's fresh without looking
at the filesystem. Then you'd just have a config option to check out files
as "not open" (and not writeable), and have a "git open" program that
would chmod files and get their real stat info.
Of course, git is tuned for cases where the modify/build/test cycle
requires stat() (or worse) on every file.
-Daniel
*This .sig left intentionally blank*
* Re: git performance
2008-10-22 21:55 ` Edward Ned Harvey
` (5 preceding siblings ...)
2008-10-23 18:31 ` Daniel Barkalow
@ 2008-10-23 22:24 ` Nanako Shiraishi
2008-10-24 3:56 ` Daniel Barkalow
2008-10-24 7:55 ` Pete Harlan
7 siblings, 1 reply; 22+ messages in thread
From: Nanako Shiraishi @ 2008-10-23 22:24 UTC (permalink / raw)
To: Daniel Barkalow; +Cc: Edward Ned Harvey, git
Quoting Daniel Barkalow <barkalow@iabervon.org>:
> On Wed, 22 Oct 2008, Edward Ned Harvey wrote:
>
>> Out of curiosity, what are they talking about, when they say "git is
>> fast?" Just the fact that it's all local disk, or is there more to it
>> than that? I could see - git would probably outperform perforce for
>> versioning of large files (let's say iso files) to benefit from
>> sustained local disk IO, while perforce would probably outperform
>> anything I can think of, operating on thousands of tiny files, because
>> it will never walk the tree.
>
> It shouldn't be too hard to make git work like perforce with respect to
> walking the tree. git keeps an index of the stat() info it saw when it
> last looked at files, and only looks at the contents of files whose stat()
> info has changed. In order to have it work like perforce, it would just
> need to have a flag in the stat() info index for "don't even bother",
Are you describing the "assume unchanged bit"?
--
Nanako Shiraishi
http://ivory.ap.teacup.com/nanako3/
* Re: git performance
2008-10-23 22:24 ` Nanako Shiraishi
@ 2008-10-24 3:56 ` Daniel Barkalow
0 siblings, 0 replies; 22+ messages in thread
From: Daniel Barkalow @ 2008-10-24 3:56 UTC (permalink / raw)
To: Nanako Shiraishi; +Cc: Edward Ned Harvey, git
On Fri, 24 Oct 2008, Nanako Shiraishi wrote:
> Quoting Daniel Barkalow <barkalow@iabervon.org>:
>
> > On Wed, 22 Oct 2008, Edward Ned Harvey wrote:
> >
> >> Out of curiosity, what are they talking about, when they say "git is
> >> fast?" Just the fact that it's all local disk, or is there more to it
> >> than that? I could see - git would probably outperform perforce for
> >> versioning of large files (let's say iso files) to benefit from
> >> sustained local disk IO, while perforce would probably outperform
> >> anything I can think of, operating on thousands of tiny files, because
> >> it will never walk the tree.
> >
> > It shouldn't be too hard to make git work like perforce with respect to
> > walking the tree. git keeps an index of the stat() info it saw when it
> > last looked at files, and only looks at the contents of files whose stat()
> > info has changed. In order to have it work like perforce, it would just
> > need to have a flag in the stat() info index for "don't even bother",
>
> Are you describing the "assume unchanged bit"?
Yes, but with the user-write mode bit in the filesystem set only on
files that are not assume-unchanged, which is how Perforce users cope
with it. I hadn't realized it had been implemented to get set on a
per-file basis, rather than just as a global setting that caused it to
not stat() anything except right when it was told to update.
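A sketch of that Perforce-like arrangement using the existing per-file
bit, assuming a POSIX shell. mark_closed and mark_open are hypothetical
helper names (there is no "git open" command); the only git feature used
is "git update-index --assume-unchanged" and its inverse.

```shell
#!/bin/sh
# mark_closed PATH...
# Tell git to skip the stat() check for these paths and make them
# read-only, so an accidental edit fails instead of going unnoticed.
mark_closed() {
    git update-index --assume-unchanged -- "$@" &&
    chmod a-w -- "$@"
}

# mark_open PATH...
# Make the paths writable again and have git resume checking them.
mark_open() {
    chmod u+w -- "$@" &&
    git update-index --no-assume-unchanged -- "$@"
}

# Usage:
#   mark_closed src/big_table.c
#   git ls-files -v src/big_table.c   # a lowercase tag marks assume-unchanged
#   mark_open src/big_table.c
```

The read-only bit is only a guard rail for the user; git itself looks at
the index flag, not at the file's write permission.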
-Daniel
*This .sig left intentionally blank*
* Re: git performance
2008-10-22 21:55 ` Edward Ned Harvey
` (6 preceding siblings ...)
2008-10-23 22:24 ` Nanako Shiraishi
@ 2008-10-24 7:55 ` Pete Harlan
2008-10-24 23:10 ` Pete Harlan
7 siblings, 1 reply; 22+ messages in thread
From: Pete Harlan @ 2008-10-24 7:55 UTC (permalink / raw)
To: Edward Ned Harvey; +Cc: git
Edward Ned Harvey wrote:
> > Yes, it does stat all the files. How many files are you talking about,
> > and what platform? From a warm cache on Linux, the 23,000 files kernel
> > repo takes about a tenth of a second to stat all files for me (and this
>
> I'm talking about 40-50,000 files, on multi-user production linux,
> which means the cache is never warm, except when I'm benchmarking.
> Specifically RHEL 4 with the files on NFS mount. Cold cache "svn
> st" takes ~10 mins. Warm cache 20-30 sec. Surprisingly to me,
I did some tests with a repo with ~32k files, and git was slightly
slower than svn with a cold cache (10.2s vs 8.4s), and around twice as
fast with a warm cache (.5s vs 1s).
Git 1.6.0.2, svn 1.4.6. Cache made cold with
"echo 1 >/proc/sys/vm/drop_caches". Timings best of 5 runs.
(I did various benchmarks with svn 1.5.3 also, but there's something
awfully wrong with svn 1.5.x's merging, which takes pathologically
long compared with 1.4 (minutes instead of seconds), and it wasn't
noticeably faster than 1.4 at anything I tested.)
> performance was approx the same for files on local disk versus NFS.
10 minutes seems like a crazy amount of time for 40-50k files. If you
hadn't said you'd tested it on local disks, it would really sound like
a bad NFS interaction more than an svn problem.
> Out of curiosity, what are they talking about, when they say "git is
> fast?"
In my comparisons between svn and git, the operation "checkout
revision N of the tree" (i.e., "svn update -r 40000" vs "git checkout
302c7476") took five minutes on subversion and ten seconds using git.
The tests were all local, so git wasn't benefiting from being a DVCS,
it was just eerily fast on some things. Svn was even that slow when
the revisions were 1 commit different, if it was a large enough
commit.
I don't check out whole revisions like that very often, but switching
between branches is a similar operation. It doesn't usually take five
minutes in svn but it's an interruption, and with git it isn't.
For almost everything I tried git was faster, but status wasn't really
one of them. The compelling cases were the number of things that were
faster _enough_ to no longer be an interruption, and being a DVCS, and
rebase, and rebase -i, and gitk, and a smarter blame, and
branching/merging support like it's something you'd do all day long,
not just when you were forced to.
HTH,
--Pete
* Re: git performance
2008-10-24 7:55 ` Pete Harlan
@ 2008-10-24 23:10 ` Pete Harlan
0 siblings, 0 replies; 22+ messages in thread
From: Pete Harlan @ 2008-10-24 23:10 UTC (permalink / raw)
To: Edward Ned Harvey; +Cc: git
Pete Harlan wrote:
> Edward Ned Harvey wrote:
>>> Yes, it does stat all the files. How many files are you talking about,
>>> and what platform? From a warm cache on Linux, the 23,000 files kernel
>>> repo takes about a tenth of a second to stat all files for me (and this
>> I'm talking about 40-50,000 files, on multi-user production linux,
>> which means the cache is never warm, except when I'm benchmarking.
>> Specifically RHEL 4 with the files on NFS mount. Cold cache "svn
>> st" takes ~10 mins. Warm cache 20-30 sec. Surprisingly to me,
>
> I did some tests with a repo with ~32k files, and git was slightly
> slower than svn with a cold cache (10.2s vs 8.4s), and around twice as
> fast with a warm cache (.5s vs 1s).
>
> Git 1.6.0.2, svn 1.4.6. Cache made cold with
> "echo 1 >/proc/sys/vm/drop_caches". Timings best of 5 runs.
After redoing this test with "echo 3 >/proc/sys/vm/drop_caches" (which
also discards metadata, as pointed out by Linus), the cold-cache
timings are:
svn 12.65 seconds
git 10.3 seconds
So no Earth-shattering difference, but now git is somewhat quicker
than Subversion at everything I tested.
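For anyone wanting to repeat the cold-cache measurement, a rough sketch
assuming Linux, a POSIX shell, and root access for the cache drop.
drop_caches and timed are hypothetical helper names; "date +%s" only has
one-second resolution, so this suits the ~10 second cold runs better
than the sub-second warm ones (use your shell's "time" for those).

```shell
#!/bin/sh
# drop_caches
# Flush dirty pages, then ask the kernel to discard the page cache plus
# dentries and inodes ("3"), so file metadata is cold too. Needs root.
drop_caches() {
    sync
    echo 3 > /proc/sys/vm/drop_caches
}

# timed COMMAND [ARG...]
# Run the command, report whole seconds of wall-clock time on stderr,
# and pass the command's exit status through.
timed() {
    start=$(date +%s)
    status=0
    "$@" || status=$?
    end=$(date +%s)
    echo "elapsed: $((end - start))s" >&2
    return "$status"
}

# Usage (as root, inside the work tree):
#   drop_caches && timed git status
#   drop_caches && timed svn st
```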
--Pete