From: Jeff King <peff@peff.net>
To: Bryan Turner <bturner@atlassian.com>
Cc: Pol Online <info@pol-online.net>, Git Users <git@vger.kernel.org>
Subject: Re: git status / git diff -C not detecting file copy
Date: Tue, 2 Dec 2014 01:55:50 -0500
Message-ID: <20141202065550.GB1948@peff.net>
In-Reply-To: <CAGyf7-F9twCEUY-LN=xEf4=gfNW8oLEHJmTjHRQ2MncHZ2emZQ@mail.gmail.com>

On Sun, Nov 30, 2014 at 12:54:53PM +1100, Bryan Turner wrote:

> I'll let someone a little more intimately familiar with the internals
> of git status comment on why the documentation for that mentions
> copies.

I don't think there is a good reason. git-status has used renames since
mid-2005. The documentation mentioning copies was added much later,
along with the short and porcelain formats. That code handles whatever
the diff engine throws at it.  I don't think anybody considered at that
time the fact that you cannot actually provoke status to look for
copies.

Interestingly, the rename behavior dates all the way back to:

  commit 753fd78458b6d7d0e65ce0ebe7b62e1bc55f3992
  Author: Linus Torvalds <torvalds@ppc970.osdl.org>
  Date:   Fri Jun 17 15:34:19 2005 -0700

  Use "-M" instead of "-C" for "git diff" and "git status"
  
  The "C" in "-C" may stand for "Cool", but it's also pretty slow, since
  right now it leaves all unmodified files to be tested even if there are
  no new files at all.  That just ends up being unacceptably slow for big
  projects, especially if it's not all in the cache.

I suspect that the copy code may be much faster these days (it sounds
like we did not even have the find-copies-harder distinction then, and
these days we certainly take the quick return if there are no copy
destination candidates).

To get a rough sense of how much effort is entailed in the various
options, here are "git log --raw" timings for git.git (all timings are
warm cache, best-of-five, wall clock time):

  log --raw:       0m2.311s
  log --raw -M:    0m2.362s
  log --raw -C:    0m2.576s
  log --raw -C -C: 1m4.462s

You can see that rename detection adds a little, and copy detection
adds a little more.  That makes sense; it's rare for new files to appear
at the same time that old files are going away (renames), so most of the time
it does nothing. Copies introduce a bit more work; we have to compare
against any changed files, and there are typically several in each
commit. find-copies-harder is...well, very expensive.
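
As a rough illustration of why the three modes differ so much in cost,
here is a toy model of which files each mode treats as possible origins
of a newly added file. It is a sketch only, not git's implementation;
the function names and the example file counts are hypothetical.

```python
def source_candidates(deleted, modified, unchanged, mode):
    """Return the files considered as possible origins of a new file.

    mode is "-M" (renames), "-C" (copies), or "-C -C" (find-copies-harder).
    """
    if mode == "-M":
        # Renames: only files that went away can be the origin.
        return set(deleted)
    if mode == "-C":
        # Copies: files changed in the same diff are candidates too.
        return set(deleted) | set(modified)
    if mode == "-C -C":
        # find-copies-harder: every file, even untouched ones.
        return set(deleted) | set(modified) | set(unchanged)
    raise ValueError(f"unknown mode: {mode}")


def pairs_to_compare(added, deleted, modified, unchanged, mode):
    """Count (source, destination) pairs a detector would have to look at."""
    if not added:
        # Quick return: with no new files there is no copy destination.
        return 0
    sources = source_candidates(deleted, modified, unchanged, mode)
    return len(added) * len(sources)


# Hypothetical commit: one new file, three modified files, 2000 untouched.
added = ["new.c"]
deleted = []
modified = ["a.c", "b.c", "Makefile"]
unchanged = [f"file{i}.c" for i in range(2000)]

for mode in ("-M", "-C", "-C -C"):
    print(mode, pairs_to_compare(added, deleted, modified, unchanged, mode))
```

In this made-up commit the rename pass has nothing to do, the copy pass
looks at a handful of pairs, and find-copies-harder looks at every file
in the tree.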

These timings are of diffs between commits and their parents, of course.
But if we assume that "git status" will show diffs roughly similar to
what gets committed, then this should be comparable. There are about 30K
non-merge commits we traversed there, so adding 200ms is an average of
roughly 7 microseconds per commit. Of course the cost is
disproportionately borne by diffs which have an actual file come into
being. There are ~2000
commits that introduce a file, so it's probably accurate to say that it
either adds nothing in most cases, or ~1/10th of a millisecond in
others.
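
The arithmetic behind those averages can be checked with a few lines.
The timings are taken from the table above; the 200ms, 30K, and 2000
values are the round figures used in the text, so the results are
estimates rather than measurements.

```python
# Wall-clock timings from the table above, in seconds.
baseline_s = 2.311
with_renames_s = 2.362
with_copies_s = 2.576

# Extra time each option adds over plain "log --raw", in milliseconds.
added_by_renames_ms = (with_renames_s - baseline_s) * 1000
added_by_copies_ms = (with_copies_s - baseline_s) * 1000

# Round figures from the text (approximations, not exact counts).
extra_ms = 200
commits = 30_000
file_adding_commits = 2_000

# Average over every commit, in microseconds.
per_commit_us = extra_ms / commits * 1000
# Average over only the commits that introduce a file, in milliseconds.
per_file_adding_commit_ms = extra_ms / file_adding_commits

print(f"-M adds {added_by_renames_ms:.0f} ms, -C adds {added_by_copies_ms:.0f} ms")
print(f"{per_commit_us:.1f} us per commit overall")
print(f"{per_file_adding_commit_ms:.2f} ms per file-adding commit")
```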

Note this is also doing inexact detection, which involves actually
looking at the contents of candidate blobs (whereas exact detection can
be done by comparing sha1s, which is very fast). If you set
diff.renamelimit to "1", then we do only exact detections. Here are
timings there:

  log --raw:       0m02.311s    (for reference)
  log --raw -M:    0m02.337s
  log --raw -C:    0m02.347s
  log --raw -C -C: 0m24.419s

That speeds things up a fair bit, even for "-C" (we don't have to access
the blobs anymore, so I suspect the time is going to just accessing all
of the trees; normally diff does not descend into subtrees that have the
same sha1). Of course, you probably wouldn't want to turn off inexact
renames completely. I suspect what you'd want is a --find-copies-moderately
where we look for cheap copies using "-C", and then follow up with "-C
-C" only using exact renames.
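
To make the exact/inexact distinction concrete, here is a minimal sketch
of exact copy detection by object ID. It assumes SHA-1 blob IDs, computed
over a "blob <size>\0" header followed by the content. The exact_copies
helper and the sample paths are hypothetical, and real git compares the
IDs already recorded in trees rather than rehashing file contents.

```python
import hashlib


def blob_id(data: bytes) -> str:
    """SHA-1 blob ID: hash of a "blob <size>\\0" header plus the content."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def exact_copies(sources: dict, added: dict) -> dict:
    """Map each added path to a source path with byte-identical content."""
    by_id = {}
    for path, data in sources.items():
        # Keep the first path seen for each distinct content.
        by_id.setdefault(blob_id(data), path)

    matches = {}
    for path, data in added.items():
        origin = by_id.get(blob_id(data))
        if origin is not None:
            matches[path] = origin
    return matches


sources = {"a.txt": b"hello\n"}
added = {"b.txt": b"hello\n", "c.txt": b"hello!\n"}
print(exact_copies(sources, added))
```

Only b.txt is matched here. Finding that c.txt is a near copy of a.txt
would require reading and comparing the contents, which is the expensive
inexact step.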

So from these timings, I'd conclude that:

  1. It's probably fine to turn on copies for "git status".

  2. It's probably even OK to use "-C -C" for some projects. Even though
     the extra 22s looks scary there, that's only 11ms per commit for
     git.git (remember, spread across 2000 commits). For linux.git, it's
     much, much worse.
     I killed my "-C -C" run after 10 minutes, and it had only gone
     through 1/20th of the commits. Extrapolating, you're looking at
     500ms or so added to a "git status" run.

     So you'd almost certainly want this to be configurable.
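
The 11ms figure for git.git follows from the exact-only timings, as this
short check shows. It uses the rough count of 2000 file-adding commits
from the text, so treat the result as an estimate.

```python
# Exact-only timings (diff.renamelimit=1), in seconds.
baseline_s = 2.311
harder_exact_s = 24.419

# Rough count from the text of commits that introduce a file.
file_adding_commits = 2_000

# Time "-C -C" adds over plain "log --raw".
extra_s = harder_exact_s - baseline_s
# Spread that extra time over the commits that can trigger copy detection.
per_commit_ms = extra_s / file_adding_commits * 1000

print(f"extra: {extra_s:.1f} s, about {per_commit_ms:.0f} ms per file-adding commit")
```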

Does either of you want to try your hand at a patch? Just enabling
copies should be a one-liner. Making it configurable is more involved,
but should also be pretty straightforward.

-Peff
