From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: Bryan Turner <bturner@atlassian.com>,
	Pol Online <info@pol-online.net>, Git Users <git@vger.kernel.org>
Subject: Re: git status / git diff -C not detecting file copy
Date: Tue, 2 Dec 2014 15:09:11 -0500
Message-ID: <20141202200910.GB23461@peff.net>
In-Reply-To: <xmqqvblukk98.fsf@gitster.dls.corp.google.com>

On Tue, Dec 02, 2014 at 09:57:07AM -0800, Junio C Hamano wrote:

> > To get a rough sense of how much effort is entailed in the various
> > options, here are "git log --raw" timings for git.git (all timings are
> > warm cache, best-of-five, wall clock time):
> 
> The rationale of the change talks about "big projects" and your
> analysis and argument to advocate reversion of that change is based
> on "git.git"?  What is going on here?

I find that git.git is often a useful and easy thing to time, and the
results extrapolate to other projects. It's 1/10th to 1/20th the size of
the kernel (both in tree size and commit depth), and the kernel is
something I do consider a "big project" (and, I have a feeling, is what
Linus was talking about).

Of course, performance numbers do not always scale linearly with repo
size. I didn't show the full numbers for the kernel, but they are:

  log --raw:       0m53.587s
  log --raw -M:    0m55.424s
  log --raw -C:    1m02.733s
  log --raw -C -C: <killed after 10 minutes>

There are ~20K commits that introduce files in the kernel (about 10x
what git.git had). So renames add well under a millisecond to each of
those diffs, and a single "-C" adds a third of a millisecond.
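
To make the arithmetic explicit, here is a rough sketch that divides the
extra wall-clock time by the number of file-adding commits. The 20,000
figure is the "~20K" estimate above, and the choice of which run to
subtract from which is my assumption about how the per-commit numbers
were derived:

```python
# Wall-clock timings for linux.git, in seconds, taken from the table above.
TIMINGS = {
    "raw": 53.587,        # log --raw
    "-M": 55.424,         # log --raw -M
    "-C": 60 + 2.733,     # log --raw -C (1m02.733s)
}

# "~20K commits that introduce files"; treated as exactly 20,000 here.
ADD_COMMITS = 20_000


def per_commit_ms(slower: str, faster: str, commits: int = ADD_COMMITS) -> float:
    """Extra milliseconds per file-adding commit, attributing all of the
    added time to those commits (i.e., zero cost for every other diff)."""
    return (TIMINGS[slower] - TIMINGS[faster]) / commits * 1000.0


if __name__ == "__main__":
    print(f"-M over plain --raw: {per_commit_ms('-M', 'raw'):.3f} ms")
    print(f"-C over -M:          {per_commit_ms('-C', '-M'):.3f} ms")
    print(f"-C over plain --raw: {per_commit_ms('-C', 'raw'):.3f} ms")
```

Under those assumptions, "-M" comes out at about a tenth of a
millisecond per commit, and "-C" at roughly a third of a millisecond on
top of that.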

Which is pretty much in line with the git.git findings: the per-commit
cost is not linear in the size of the repository, but actually fairly
constant. This makes sense, as it scales with the size of the commit,
not the size of the tree.

And as I noted, "-C -C" is rather expensive (I gave some estimated
timings earlier; you could probably come up with something more accurate
by doing smarter sampling).

> Also our history is fairly unusual in that we do not have a lot of
> renames (there was one large "s|builtin-|builtin/|" rename event,
> but that is about it), which may not be a good example to base such
> a design decision on.

I think the work scales not with the number of actual renames, but with
the number of commits where we even bother to look at renames at all
(i.e., ones with an 'A' diff-status). And my estimates assume that we
pay zero cost for other diffs, and attribute all of the extra time to
those diffs. So I think frequency of rename (or 'A') events does not
impact the estimate of the impact on a single "git status" run.

What does impact it is the _size_ of each commit. If you quite
frequently add a new file while touching tens of thousands of other
files, then the performance hit will be a lot more noticeable. And both
git.git and linux.git are probably better than some other projects about
having small commits.

Still, I stand by my earlier conclusions. Even with commits ten times as
large as the kernel's, you are talking about 3ms added to a "status" run
(and again, only when you hit such a gigantic commit _and_ it has an 'A'
file).
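
As a back-of-the-envelope check on that figure, the sketch below scales
the per-commit "-C" cost linearly with commit size. The linear model,
the function name, and the use of the "-C" minus "-M" kernel timings
(62.733s and 55.424s) as the baseline are all assumptions of mine, not
measurements:

```python
def added_status_cost_ms(extra_seconds: float,
                         add_commits: int,
                         size_factor: float = 1.0) -> float:
    """Estimated extra milliseconds for one diff with copy detection.

    extra_seconds: total extra wall-clock time over the whole history.
    add_commits:   number of commits with an 'A' diff-status entry.
    size_factor:   how much larger the commit is than a typical one
                   (assumed to scale the cost linearly).
    """
    per_commit_ms = extra_seconds / add_commits * 1000.0
    return per_commit_ms * size_factor


if __name__ == "__main__":
    extra = 62.733 - 55.424  # "-C" minus "-M" on the kernel, in seconds
    for factor in (1, 10):
        cost = added_status_cost_ms(extra, 20_000, factor)
        print(f"commit {factor:>2}x kernel-sized: ~{cost:.2f} ms")
```

With a factor of 10 this lands between 3 and 4 milliseconds, the same
order of magnitude as the estimate above.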

> It is a fine idea to make this configurable ("status.renames = -C"
> or something, perhaps?), though.

I think it would be OK to move to "-C" as a default, but I agree it is
nicer if it is configurable, as it gives an escape hatch for people in
pathological repos to drop back to the old behavior (or even turn off
renames altogether).

-Peff
