* Bad git status performance
@ 2008-11-21 0:28 Jean-Luc Herren
2008-11-21 0:42 ` David Bryson
[not found] ` <c9e534200811201711y887ddd2t33013ec4a7db3c9a@mail.gmail.com>
0 siblings, 2 replies; 5+ messages in thread
From: Jean-Luc Herren @ 2008-11-21 0:28 UTC (permalink / raw)
To: Git Mailing List
Hi list!
I'm getting bad performance on 'git status' when I have staged
many changes to big files. For example, consider this:
$ git init
Initialized empty Git repository in $HOME/test/.git/
$ for X in $(seq 100); do dd if=/dev/zero of=$X bs=1M count=1 2> /dev/null; done
$ git add .
$ git commit -m 'Lots of zeroes'
Created initial commit ed54346: Lots of zeroes
100 files changed, 0 insertions(+), 0 deletions(-)
create mode 100644 1
create mode 100644 10
...
create mode 100644 98
create mode 100644 99
$ for X in $(seq 100); do echo > $X; done
$ time git status
# On branch master
# Changed but not updated:
# (use "git add <file>..." to update what will be committed)
#
# modified: 1
# modified: 10
...
# modified: 98
# modified: 99
#
no changes added to commit (use "git add" and/or "git commit -a")
real 0m0.003s
user 0m0.001s
sys 0m0.002s
$ git add -u
$ time git status
# On branch master
# Changes to be committed:
# (use "git reset HEAD <file>..." to unstage)
#
# modified: 1
# modified: 10
...
# modified: 98
# modified: 99
#
real 0m16.291s
user 0m16.054s
sys 0m0.221s
The first 'git status' shows the same difference as the second,
just the second time it's staged instead of unstaged. Why does it
take 16 seconds the second time when it's instant the first time?
(Side note: There once was a discussion about adding natural
ordering of branch names, but it seems that never made it into
git. The same ordering would make sense for the file lists in
'git status' too.)
Cheers,
jlh
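[Editorial note: the natural ordering mentioned in the side note did eventually land in later git releases as the version:refname sort key for refs; the exact option names below are an assumption if your git predates that feature. A minimal sketch:]

```shell
# Sketch: natural ("version") ordering of branch names via for-each-ref.
# Assumes a git new enough to know the version:refname sort key.
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m 'root'
git -C "$repo" branch v1.2
git -C "$repo" branch v1.10
# Default (byte-wise) order puts v1.10 before v1.2 ...
plain=$(git -C "$repo" for-each-ref --format='%(refname:short)' 'refs/heads/v*')
# ... while version sort orders them naturally.
natural=$(git -C "$repo" for-each-ref --sort=version:refname \
          --format='%(refname:short)' 'refs/heads/v*')
printf 'plain:\n%s\nnatural:\n%s\n' "$plain" "$natural"
```

[The same sort key works for `git branch --sort=` and `git tag --sort=` in those releases.]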
^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Bad git status performance
  2008-11-21 0:28 Bad git status performance Jean-Luc Herren
@ 2008-11-21 0:42 ` David Bryson
  [not found] ` <c9e534200811201711y887ddd2t33013ec4a7db3c9a@mail.gmail.com>
  1 sibling, 0 replies; 5+ messages in thread
From: David Bryson @ 2008-11-21 0:42 UTC (permalink / raw)
  To: Jean-Luc Herren; +Cc: Git Mailing List

Hi,

On Fri, Nov 21, 2008 at 01:28:14AM +0100 or thereabouts, Jean-Luc Herren wrote:
> Hi list!
>
> I'm getting bad performance on 'git status' when I have staged
> many changes to big files. For example, consider this:
> [snip]
> $ time git status
> # On branch master
> # Changes to be committed:
> #   (use "git reset HEAD <file>..." to unstage)
> #
> #	modified:   1
> #	modified:   10
> ...
> #	modified:   98
> #	modified:   99
> #
>
> real	0m16.291s
> user	0m16.054s
> sys	0m0.221s
>
> The first 'git status' shows the same difference as the second,
> just the second time it's staged instead of unstaged. Why does it
> take 16 seconds the second time when it's instant the first time?

I had similar problems with a repository that contained several
tarballs of gcc and the linux kernel (don't ask me why; it was not
my repository). Some weeks ago I mentioned this on IRC, and the
problem really was not necessarily git.

The way it was explained to me (and please correct or clarify where
I am wrong) is that git asked Linux for the status of those files
and, since they are so large, they had been swapped out of memory.
The result is the kernel reading those large files back in to see
if they have changed at all. My impression is that this is not a
git bug but a cache-tuning problem.

Dave

^ permalink raw reply	[flat|nested] 5+ messages in thread
[parent not found: <c9e534200811201711y887ddd2t33013ec4a7db3c9a@mail.gmail.com>]
* Re: Bad git status performance
  [not found] ` <c9e534200811201711y887ddd2t33013ec4a7db3c9a@mail.gmail.com>
@ 2008-11-21 12:46 ` Jean-Luc Herren
  2008-11-21 15:19 ` Michael J Gruber
  0 siblings, 1 reply; 5+ messages in thread
From: Jean-Luc Herren @ 2008-11-21 12:46 UTC (permalink / raw)
  To: Glenn Griffin, Git Mailing List

Glenn Griffin wrote:
> On Thu, Nov 20, 2008 at 4:28 PM, Jean-Luc Herren <jlh@gmx.ch> wrote:
>> The first 'git status' shows the same difference as the second,
>> just the second time it's staged instead of unstaged. Why does it
>> take 16 seconds the second time when it's instant the first time?
>
> I believe the two runs of git status need to do very different things.
> When run the first time, git knows the files in your working
> directory are not in the index so it can easily say those files are
> 'Changed but not updated' just from their existence.

I might be mistaken about how the index works, but those paths
*are* in the index at that time. They just have the old content,
i.e. the same content as in HEAD. When HEAD == index, then
nothing is staged.

But the presence of those files alone doesn't tell you that they
have changed. You have to look at the content and compare it to
the index (== HEAD in this situation) to see whether they have
changed or not, and for some reason git can do this very quickly.

> The second run
> those files do exist in both the index and the working directory, so
> git status first shows the files that are 'Changes to be committed'
> and that should be fast, but additionally git status will check to see
> if those files in your working directory have changed since you added
> them to the index.

Which is basically the same comparison as above; it just turns
out that they have not changed. But even then, we're talking
about comparing a 1-byte file in the index to a 1-byte file in
the work tree. That doesn't take 16 seconds, even for 100 files.

So this makes me believe it's the first step (comparing HEAD to
the index to show staged changes) that is slow. And when you
compare a 1MB file to a 1-byte file, you don't need to read all of
the big file; you can tell they're not the same right after the
first byte. (Even doing a stat() is enough, since the size is
not the same.)

Another thing that came to my mind is that maybe rename detection
kicks in, even though no path vanished and none is new. I believe
rename detection doesn't happen for unstaged changes, which might
explain the difference in speed.

btw, I forgot to mention that I get this with branches maint,
master, next and pu. (And I hope you don't mind I take this back
to the list.)

jlh

^ permalink raw reply	[flat|nested] 5+ messages in thread
* Re: Bad git status performance
  2008-11-21 12:46 ` Jean-Luc Herren
@ 2008-11-21 15:19 ` Michael J Gruber
  2008-11-21 20:07 ` Jean-Luc Herren
  0 siblings, 1 reply; 5+ messages in thread
From: Michael J Gruber @ 2008-11-21 15:19 UTC (permalink / raw)
  To: Jean-Luc Herren; +Cc: Glenn Griffin, Git Mailing List, Junio C Hamano

Jean-Luc Herren venit, vidit, dixit 21.11.2008 13:46:
> Glenn Griffin wrote:
>> On Thu, Nov 20, 2008 at 4:28 PM, Jean-Luc Herren <jlh@gmx.ch> wrote:
>>> The first 'git status' shows the same difference as the second,
>>> just the second time it's staged instead of unstaged. Why does it
>>> take 16 seconds the second time when it's instant the first time?
>> I believe the two runs of git status need to do very different things.
>> When run the first time, git knows the files in your working
>> directory are not in the index so it can easily say those files are
>> 'Changed but not updated' just from their existence.
>
> I might be mistaken about how the index works, but those paths
> *are* in the index at that time. They just have the old content,
> i.e. the same content as in HEAD. When HEAD == index, then
> nothing is staged.
>
> But the presence of those files alone doesn't tell you that they
> have changed. You have to look at the content and compare it to
> the index (== HEAD in this situation) to see whether they have
> changed or not, and for some reason git can do this very quickly.
>
>> The second run
>> those files do exist in both the index and the working directory, so
>> git status first shows the files that are 'Changes to be committed'
>> and that should be fast, but additionally git status will check to see
>> if those files in your working directory have changed since you added
>> them to the index.
>
> Which is basically the same comparison as above; it just turns
> out that they have not changed. But even then, we're talking
> about comparing a 1-byte file in the index to a 1-byte file in
> the work tree. That doesn't take 16 seconds, even for 100 files.
>
> So this makes me believe it's the first step (comparing HEAD to
> the index to show staged changes) that is slow. And when you
> compare a 1MB file to a 1-byte file, you don't need to read all of
> the big file; you can tell they're not the same right after the
> first byte. (Even doing a stat() is enough, since the size is
> not the same.)
>
> Another thing that came to my mind is that maybe rename detection
> kicks in, even though no path vanished and none is new. I believe
> rename detection doesn't happen for unstaged changes, which might
> explain the difference in speed.
>
> btw, I forgot to mention that I get this with branches maint,
> master, next and pu.

Interestingly, all of

    git diff --stat
    git diff --stat --cached
    git diff --stat HEAD

are "fast" (0.2s or so), i.e. diffing index-wtree, HEAD-index,
HEAD-wtree.

Linus' threaded stat doesn't help either for status, btw (20s).

Experimenting further: Using 10 files with 10MB each (rather than
100 times 1MB) brings down the time by a factor of 10, roughly -
and so does using 100 files with 100k each. Huh? The latter may be
expected (10MB total), but the former (100MB total)?

Now it's getting funny: Changing your "echo >" to "echo >>" (in
your 100 files 1MB case) makes things "almost fast" again (1.3s).

OK, it's "use the source, Luke" time... Actually, the part you
don't see takes the most time: wt_status_print_updated()

And in fact I can confirm your suspicion: wt_status_print_updated()
enforces rename detection (ignoring any config). Forcing it off
(rev.diffopt.detect_rename = 0;) cuts down the 20s to 0.75s.

How about a config option status.renames (or something like -M)
for status?

Michael

^ permalink raw reply	[flat|nested] 5+ messages in thread
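[Editorial note: Michael's suggestion did eventually land — later git releases grew a status.renames config and a --no-renames option for git status; treat those exact names as assumptions on older versions. A sketch of the original reproduction with rename detection switched off:]

```shell
# Reproduce the reported case, then run status with rename detection off.
# status.renames / --no-renames are assumed to exist (later git releases).
repo=$(mktemp -d)
cd "$repo"
git init -q
for X in $(seq 100); do dd if=/dev/zero of=$X bs=1M count=1 2>/dev/null; done
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m 'Lots of zeroes'
for X in $(seq 100); do echo > $X; done
git add -u
time git status                          # rename detection on (slow on 2008 git)
time git -c status.renames=false status  # rename detection off
```

[On a 2008-era git the first invocation is the 16-20s case discussed above; with detection disabled, status only has to notice that sizes differ between HEAD and index.]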
* Re: Bad git status performance
  2008-11-21 15:19 ` Michael J Gruber
@ 2008-11-21 20:07 ` Jean-Luc Herren
  0 siblings, 0 replies; 5+ messages in thread
From: Jean-Luc Herren @ 2008-11-21 20:07 UTC (permalink / raw)
  To: Michael J Gruber; +Cc: Glenn Griffin, Git Mailing List, Junio C Hamano

Michael J Gruber wrote:
> Experimenting further: Using 10 files with 10MB each (rather than
> 100 times 1MB) brings down the time by a factor of 10, roughly -
> and so does using 100 files with 100k each. Huh? The latter may be
> expected (10MB total), but the former (100MB total)?

100 files of 100k each gives me 1.73s, so about a 10x speed-up.
So it seems git indeed looks at the content of the files, and
having a tenth of the content means it's ten times as fast.

Interestingly, using only a single file of 100MB gives me 0.6s.
Which is still very slow for the job of telling that a 100MB file
is not equal to a 1-byte file. And certainly there's no renaming
going on with a single file.

> Now it's getting funny: Changing your "echo >" to "echo >>" (in
> your 100 files 1MB case) makes things "almost fast" again (1.3s).

Same here, and that's pretty interesting, because in this
situation I can understand the slowdown: Comparing two 1MB files
that differ only at their ends is expected to take some time, as
you have to go through the entire file until you notice they're
not the same.

jlh

^ permalink raw reply	[flat|nested] 5+ messages in thread
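[Editorial note: the truncate-vs-append effect can be seen outside git with cmp — a hypothetical illustration of the comparison cost, not git's actual code path:]

```shell
# Why "echo >" (truncate) is cheap to detect and "echo >>" (append) is not.
dir=$(mktemp -d)
cd "$dir"
dd if=/dev/zero of=orig bs=1M count=1 2>/dev/null
cp orig truncated && echo >  truncated   # now 1 byte: sizes already differ
cp orig appended  && echo >> appended    # 1MB + 1 byte, same 1MB prefix
# A size check (one stat) already proves 'truncated' differs from 'orig' ...
test $(wc -c < orig) -ne $(wc -c < truncated) && echo "sizes differ"
# ... but 'appended' vs 'orig' forces a scan of the whole common prefix.
cmp -s orig appended || echo "contents differ (found only after ~1MB read)"
```

[This is the asymmetry Jean-Luc describes: the truncated case is decidable from metadata, while the appended case degenerates to a full content comparison.]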
end of thread, other threads:[~2008-11-21 20:08 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-11-21 0:28 Bad git status performance Jean-Luc Herren
2008-11-21 0:42 ` David Bryson
[not found] ` <c9e534200811201711y887ddd2t33013ec4a7db3c9a@mail.gmail.com>
2008-11-21 12:46 ` Jean-Luc Herren
2008-11-21 15:19 ` Michael J Gruber
2008-11-21 20:07 ` Jean-Luc Herren