* Bad git status performance
From: Jean-Luc Herren @ 2008-11-21 0:28 UTC
To: Git Mailing List
Hi list!
I'm getting bad performance on 'git status' when I have staged
many changes to big files. For example, consider this:
$ git init
Initialized empty Git repository in $HOME/test/.git/
$ for X in $(seq 100); do dd if=/dev/zero of=$X bs=1M count=1 2> /dev/null; done
$ git add .
$ git commit -m 'Lots of zeroes'
Created initial commit ed54346: Lots of zeroes
100 files changed, 0 insertions(+), 0 deletions(-)
create mode 100644 1
create mode 100644 10
...
create mode 100644 98
create mode 100644 99
$ for X in $(seq 100); do echo > $X; done
$ time git status
# On branch master
# Changed but not updated:
# (use "git add <file>..." to update what will be committed)
#
# modified: 1
# modified: 10
...
# modified: 98
# modified: 99
#
no changes added to commit (use "git add" and/or "git commit -a")
real 0m0.003s
user 0m0.001s
sys 0m0.002s
$ git add -u
$ time git status
# On branch master
# Changes to be committed:
# (use "git reset HEAD <file>..." to unstage)
#
# modified: 1
# modified: 10
...
# modified: 98
# modified: 99
#
real 0m16.291s
user 0m16.054s
sys 0m0.221s
The first 'git status' shows the same difference as the second,
just the second time it's staged instead of unstaged. Why does it
take 16 seconds the second time when it's instant the first time?
(Side note: There once was a discussion about adding natural
ordering of branch names, but it seems it never made it into git.
The same would make sense for the file lists in 'git status' too.)
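To illustrate what I mean by natural order - just a sketch, assuming
a GNU sort that supports -V (version sort):
$ printf '%s\n' 1 2 10 11 100 | sort     # lexical, as in the listings above
1
10
100
11
2
$ printf '%s\n' 1 2 10 11 100 | sort -V  # natural order
1
2
10
11
100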
Cheers,
jlh
* Re: Bad git status performance
From: David Bryson @ 2008-11-21 0:42 UTC
To: Jean-Luc Herren; +Cc: Git Mailing List
Hi,
On Fri, Nov 21, 2008 at 01:28:14AM +0100 or thereabouts, Jean-Luc Herren wrote:
> Hi list!
>
> I'm getting bad performance on 'git status' when I have staged
> many changes to big files. For example, consider this:
>
[snip]
> $ time git status
> # On branch master
> # Changes to be committed:
> # (use "git reset HEAD <file>..." to unstage)
> #
> # modified: 1
> # modified: 10
> ...
> # modified: 98
> # modified: 99
> #
>
> real 0m16.291s
> user 0m16.054s
> sys 0m0.221s
>
> The first 'git status' shows the same difference as the second,
> just the second time it's staged instead of unstaged. Why does it
> take 16 seconds the second time when it's instant the first time?
I had similar problems with a repository that contained several
tarballs of gcc and the linux kernel (don't ask me why; it was not my
repository). Some weeks ago I mentioned this on IRC, and the problem
really was not necessarily git. The way it was explained to me (and
please correct or clarify where I am wrong) is that git asked Linux
for the status of those files, and since they are so large they had
been swapped out of memory.
The result is the kernel reading those large files back in to see
whether they have changed at all. My impression is that this is not
a git bug but a cache-tuning problem.
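A rough way to check that theory on Linux (needs root, and
/proc/sys/vm/drop_caches is Linux-specific; just a sketch, I haven't
timed this here) would be to compare a warm and a cold run:
$ time git status                              # warm page cache
$ sync
$ echo 3 | sudo tee /proc/sys/vm/drop_caches   # drop page cache, dentries and inodes
$ time git status                              # cold cache: big work-tree files must be re-read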
Dave
* Re: Bad git status performance
From: Jean-Luc Herren @ 2008-11-21 12:46 UTC
To: Glenn Griffin, Git Mailing List
Glenn Griffin wrote:
> On Thu, Nov 20, 2008 at 4:28 PM, Jean-Luc Herren <jlh@gmx.ch> wrote:
>> The first 'git status' shows the same difference as the second,
>> just the second time it's staged instead of unstaged. Why does it
>> take 16 seconds the second time when it's instant the first time?
>
> I believe the two runs of git status need to do very different things.
> When run the first time, git knows the files in your working
> directory are not in the index so it can easily say those files are
> 'Changed but not updated' just from their existence.
I might be mistaken about how the index works, but those paths
*are* in the index at that time. They just have the old content,
i.e. the same content as in HEAD. When HEAD == index, then
nothing is staged.
But the presence of those files alone doesn't tell you that they
have changed. You have to look at the content and compare it to
the index (== HEAD in this situation) to see whether they have
changed or not and for some reason git can do this very quickly.
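If I understand it right, that's easy to check in your example before
the 'git add -u': both of these report the same blob id (just a sketch
on one of the files):
$ git ls-files --stage -- 1    # index entry for "1": mode, blob id, stage
$ git rev-parse HEAD:1         # blob id of "1" in HEAD -- same as in the index here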
> The second run
> those files do exist in both the index and the working directory, so
> git status first shows the files that are 'Changes to be committed'
> and that should be fast, but additionally git status will check to see
> if those files in your working directory have changed since you added
> them to the index.
Which is basically the same comparison as above, just it turns
out that they have not changed. But even then, we're talking
about comparing a 1 byte file in the index to a 1 byte file in the
work tree. That doesn't take 16 seconds, even for 100 files.
So this makes me believe it's the first step (comparing HEAD to
the index to show staged changes) that is slow. And when you
compare a 1MB file to a 1 byte file, you don't need to read all of
the big file, you can tell they're not the same right after the
first byte. (Even doing a stat() is enough, since the size is
not the same.)
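For example, after the 'git add -u' the sizes alone already tell the
blobs apart (a sketch; the sizes follow from the setup above):
$ git cat-file -s HEAD:1    # old blob from HEAD, 1MB of zeroes
$ git cat-file -s :0:1      # blob staged in the index, 1 byte
$ wc -c < 1                 # work-tree file, also 1 byte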
Another thing that came to my mind is maybe rename detection kicks
in, even though no path vanished and none is new. I believe
rename detection doesn't happen for unstaged changes, which might
explain the difference in speed.
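One way to poke at that guess from the command line might be to time
the cached diff with rename (and break) detection switched on
explicitly - just a probe sketch, I haven't verified what status
actually enables internally:
$ time git diff --cached --stat        # no rename detection by default
$ time git diff --cached --stat -M     # with rename detection
$ time git diff --cached --stat -M -B  # plus rewrite ("break") detection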
btw, I forgot to mention that I get this with branches maint,
master, next and pu.
(And I hope you don't mind that I'm taking this back to the list.)
jlh
* Re: Bad git status performance
From: Michael J Gruber @ 2008-11-21 15:19 UTC
To: Jean-Luc Herren; +Cc: Glenn Griffin, Git Mailing List, Junio C Hamano
Jean-Luc Herren venit, vidit, dixit 21.11.2008 13:46:
> Glenn Griffin wrote:
>> On Thu, Nov 20, 2008 at 4:28 PM, Jean-Luc Herren <jlh@gmx.ch> wrote:
>>> The first 'git status' shows the same difference as the second,
>>> just the second time it's staged instead of unstaged. Why does it
>>> take 16 seconds the second time when it's instant the first time?
>> I believe the two runs of git status need to do very different things.
>> When run the first time, git knows the files in your working
>> directory are not in the index so it can easily say those files are
>> 'Changed but not updated' just from their existence.
>
> I might be mistaken about how the index works, but those paths
> *are* in the index at that time. They just have the old content,
> i.e. the same content as in HEAD. When HEAD == index, then
> nothing is staged.
>
> But the presence of those files alone doesn't tell you that they
> have changed. You have to look at the content and compare it to
> the index (== HEAD in this situation) to see whether they have
> changed or not and for some reason git can do this very quickly.
>
>> The second run
>> those files do exist in both the index and the working directory, so
>> git status first shows the files that are 'Changes to be committed'
>> and that should be fast, but additionally git status will check to see
>> if those files in your working directory have changed since you added
>> them to the index.
>
> Which is basically the same comparison as above, just it turns
> out that they have not changed. But even then, we're talking
> about comparing a 1 byte file in the index to a 1 byte file in the
> work tree. That doesn't take 16 seconds, even for 100 files.
>
> So this makes me believe it's the first step (comparing HEAD to
> the index to show staged changes) that is slow. And when you
> compare a 1MB file to a 1 byte file, you don't need to read all of
> the big file, you can tell they're not the same right after the
> first byte. (Even doing a stat() is enough, since the size is
> not the same.)
>
> Another thing that came to my mind is maybe rename detection kicks
> in, even though no path vanished and none is new. I believe
> rename detection doesn't happen for unstaged changes, which might
> explain the difference in speed.
>
> btw, I forgot to mention that I get this with branches maint,
> master, next and pu.
Interestingly, all of
git diff --stat
git diff --stat --cached
git diff --stat HEAD
are "fast" (0.2s or so), i.e. diffing index-wtree, HEAD-index,
HEAD-wtree. Linus' threaded stat doesn't help either for status, btw (20s).
Experimenting further: using 10 files of 10MB each (rather than 100
times 1MB) brings the time down by roughly a factor of 10 - and so
does using 100 files of 100k each. Huh? The latter may be expected
(10MB total), but the former (still 100MB total)?
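Setup for the variants, in the style of the original loop (roughly):
$ for X in $(seq 10);  do dd if=/dev/zero of=$X bs=1M count=10 2> /dev/null; done    # 10 x 10MB
$ for X in $(seq 100); do dd if=/dev/zero of=$X bs=100k count=1 2> /dev/null; done   # 100 x 100k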
Now it's getting funny: Changing your "echo >" to "echo >>" (in your
100 files 1MB case) makes things "almost fast" again (1.3s).
OK, it's "use the source, Luke" time... Actually the part you don't see
takes the most time:
wt_status_print_updated()
And in fact I can confirm your suspicion: wt_status_print_updated()
enforces rename detection (ignoring any config). Forcing it off
(rev.diffopt.detect_rename = 0;) cuts down the 20s to 0.75s.
How about a config option status.renames (or something like -M) for status?
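If we had that, usage could look something like this - purely
hypothetical at this point, neither the option nor the flag exists:
$ git config status.renames false   # hypothetical: never do rename detection in status
$ git status -M                     # hypothetical: ask for it explicitly when wanted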
Michael
* Re: Bad git status performance
From: Jean-Luc Herren @ 2008-11-21 20:07 UTC
To: Michael J Gruber; +Cc: Glenn Griffin, Git Mailing List, Junio C Hamano
Michael J Gruber wrote:
> Experimenting further: using 10 files of 10MB each (rather than 100
> times 1MB) brings the time down by roughly a factor of 10 - and so
> does using 100 files of 100k each. Huh? The latter may be expected
> (10MB total), but the former (still 100MB total)?
100 files of 100k each gives me 1.73s, so about a 10x speed-up. So
it seems git indeed looks at the content of the files, and having a
tenth of the content means it's ten times as fast.
Interestingly, using only a single file of 100MB gives me 0.6s.
Which is still very slow for the job of telling that a 100MB file
is not equal to a 1 byte file. And certainly there's no renaming
going on with a single file.
> Now it's getting funny: Changing your "echo >" to "echo >>" (in your
> 100 files 1MB case) makes things "almost fast" again (1.3s).
Same here, and that's pretty interesting, because in this situation
I can understand the slowdown: comparing two 1MB files that differ
only at their ends is expected to take some time, as you have to go
through the entire file before you notice they're not the same.
jlh