git blame performance

* git blame performance
@ 2015-11-06 13:37 Jan Smets
  2015-11-06 14:52 ` Michael Haggerty
  0 siblings, 1 reply; 3+ messages in thread
From: Jan Smets @ 2015-11-06 13:37 UTC (permalink / raw)
  To: git

Hi

I have recently migrated a fairly large project from CVS to Git.
One of the issues we're having is the blame/annotate performance.

The repository contains over 650k commits in total, of which ~300k are
on master (raw size: ~8 GB).

Running blame on one of the oldest files takes over 30 seconds.
This is on a fairly beefy (server) machine with lots of RAM and the
repository on a ramdisk. We are running Git 2.5.2.

cvs annotate of the same file (over the network) is ready in 0.8 seconds.
blame/annotate is a frequently used operation, with anywhere from 5 to
20 uses a day per developer.

I have two questions:

  1) Is there a way to speed this up (in Git)? E.g., can it run
multi-threaded, or build pre-cached blame views?
  2) I was thinking of working around the issue by using gitweb's
blame_incremental and pre-populating the cache.

If you can think of any other (short-term) solutions, I would really
like to hear them.

Thank you

- Jan

* Re: git blame performance
  2015-11-06 13:37 git blame performance Jan Smets
@ 2015-11-06 14:52 ` Michael Haggerty
  2015-11-06 17:53   ` Junio C Hamano
  0 siblings, 1 reply; 3+ messages in thread
From: Michael Haggerty @ 2015-11-06 14:52 UTC (permalink / raw)
  To: Jan Smets, git

On 11/06/2015 02:37 PM, Jan Smets wrote:
> I have recently migrated a fairly large project from CVS to Git.
> One of the issues we're having is the blame/annotate performance.
> [...]
> cvs annotate of the same file (over the network) is ready in 0.8 seconds.
> blame/annotate is a frequently used operation, ranging between 5 to 20
> usages a day per developer.

cvs annotate and git blame both have to follow history back until they
find the commit that introduced the oldest line that is still in the
current version of the file. So for a really old file, a lot of history
has to be walked through.

The reason that cvs annotate is so much faster than git blame is that
CVS stores revisions filewise, with all of the modifications to file
$FILE being stored in a single $FILE,v file. So in the worst case, CVS
only has to read this one file.

Git, on the other hand, stores revisions treewise. It has no way of
knowing, ab initio, which revisions touched a given file. (In fact, this
concept is not even well-defined because the answer depends on things
like whether copy (-C) and move (-M) detection are turned on and what
parameters they were given.) This means that git blame has to traverse
most of history to find the commits that touched $FILE.
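
To see the gap between "commits in history" and "commits that touched
$FILE" on a small scale, here is a minimal sketch. It assumes a throwaway
repository in a temporary directory; the file names and commit messages
are made up for illustration. Path-limited rev-list still has to walk the
history to decide which commits qualify.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"      # hypothetical identity
git config user.email "author@example.com"

# Three commits; only the first and the last touch old.txt.
echo one > old.txt;   git add old.txt;   git commit -q -m "add old.txt"
echo x > other.txt;   git add other.txt; git commit -q -m "add other.txt"
echo two >> old.txt;  git add old.txt;   git commit -q -m "extend old.txt"

# Every commit reachable from HEAD versus those that changed old.txt.
total=$(git rev-list --count HEAD)
touching=$(git rev-list --count HEAD -- old.txt)
echo "commits in history: $total, commits touching old.txt: $touching"
```

In a repository with 300k commits on master, the first number is what
has to be walked even when the second one is small.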

Slow git blame is thus a relatively unavoidable consequence of Git's
data model. That's not to say that it can't be sped up somewhat, but it
will never reach CVS speeds.

But git blame does have some options that can reduce the work:

-L <start>,<end>, -L :<funcname> -- Annotate only the given line range.
This option can speed things up (1) if the range of lines does not
include the oldest lines, (2) by limiting which parents of merge commits
have to be followed.
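
A minimal sketch of -L on a throwaway repository (the file contents and
identity below are invented for illustration). It only shows that the
output is restricted to the requested range; any speedup depends on the
real history.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"      # hypothetical identity
git config user.email "author@example.com"

printf 'alpha\nbeta\ngamma\n' > file.txt
git add file.txt
git commit -q -m "initial"
printf 'alpha\nbeta\nGAMMA\n' > file.txt
git commit -q -a -m "change line 3"

# --line-porcelain emits one "author" header per annotated line,
# so counting them gives the number of lines that were blamed.
ranged=$(git blame -L 3,3 --line-porcelain file.txt | grep -c '^author ')
full=$(git blame --line-porcelain file.txt | grep -c '^author ')
echo "lines annotated with -L 3,3: $ranged, without -L: $full"
```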

--incremental -- If you are using this command to build tooling, this
option allows partial results to be returned early, reducing the wait
until the user sees something.
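
As a rough sketch of how tooling might consume --incremental output: each
blame entry starts with a header line of the form "<commit> <original
line> <final line> <number of lines>", which a reader can act on as soon
as it arrives. The repository and the awk report format below are
assumptions made for illustration.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"      # hypothetical identity
git config user.email "author@example.com"

printf 'alpha\nbeta\ngamma\n' > file.txt
git add file.txt
git commit -q -m "initial"
printf 'alpha\nbeta\nGAMMA\n' > file.txt
git commit -q -a -m "change line 3"

# Report each entry header as it is read and total the lines covered.
out=$(git blame --incremental file.txt |
  awk '/^[0-9a-f]+ [0-9]+ [0-9]+ [0-9]+$/ {
         printf "commit %.8s covers %d line(s) starting at line %d\n", $1, $4, $3
         covered += $4
       }
       END { print "lines covered:", covered }')
echo "$out"
```

A real tool would update its display inside the loop instead of
collecting the output first.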

If you are not interested in changes older than a certain date or
revision, you can limit the amount of history that git blame traverses.
See SPECIFYING RANGES in the manpage.
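
A minimal sketch of limiting the traversal with a revision range, using a
throwaway repository and a made-up tag name ("old"). Lines that were
already present at the boundary are attributed to the boundary commit
rather than traced further back.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"      # hypothetical identity
git config user.email "author@example.com"

printf 'a\nb\nc\n' > file.txt
git add file.txt
git commit -q -m "initial"
printf 'a\nB\nc\n' > file.txt
git commit -q -a -m "change line 2"
git tag old                                 # hypothetical cut-off point
printf 'a\nB\nC\n' > file.txt
git commit -q -a -m "change line 3"

# Only walk commits in old..HEAD; porcelain output marks lines that
# stop at the cut-off with a "boundary" header.
boundary=$(git blame --line-porcelain old..HEAD -- file.txt | grep -c '^boundary$')
echo "lines attributed to the boundary commit: $boundary"
```

Here line 1 really dates from the initial commit, but blame stops at
"old" and reports it as a boundary line.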

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu

* Re: git blame performance
  2015-11-06 14:52 ` Michael Haggerty
@ 2015-11-06 17:53   ` Junio C Hamano
  0 siblings, 0 replies; 3+ messages in thread
From: Junio C Hamano @ 2015-11-06 17:53 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Jan Smets, git

Michael Haggerty <mhagger@alum.mit.edu> writes:

> The reason that cvs annotate is so much faster than git blame is that
> CVS stores revisions filewise, with all of the modifications to file
> $FILE being stored in a single $FILE,v file. So in the worst case, CVS
> only has to read this one file.
>
> Git, on the other hand, stores revisions treewise. It has no way of
> knowing, ab initio, which revisions touched a given file. (In fact, this
> concept is not even well-defined because the answer depends on things
> like whether copy (-C) and move (-M) detection are turned on and what
> parameters they were given.) This means that git blame has to traverse
> most of history to find the commits that touched $FILE.
>
> Slow git blame is thus a relatively unavoidable consequence of Git's
> data model. That's not to say that it can't be sped up somewhat, but it
> will never reach CVS speeds.

Another thing to consider for a converted repository is that mass
converters tend either not to make a pack at all or to make a pack that
is horribly inefficient to access.  Running "git repack -a -d -f"
with a small value of "--depth" may be worth trying, in case poor
packing is contributing to the slow performance.
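
A minimal sketch of such a repack on a throwaway repository. The depth of
10 is an assumed example value, not a recommendation from this thread;
the right number for a large converted repository would need measuring.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"      # hypothetical identity
git config user.email "author@example.com"

# A few commits, which start out as loose objects.
for i in 1 2 3; do
  echo "revision $i" >> file.txt
  git add file.txt
  git commit -q -m "revision $i"
done

# -a: pack everything, -d: drop redundant packs and loose copies,
# -f: recompute deltas from scratch, --depth: cap delta chain length.
git repack -a -d -f --depth=10 -q

packs=$(git count-objects -v | awk '/^packs:/ { print $2 }')
loose=$(git count-objects -v | awk '/^count:/ { print $2 }')
echo "packs: $packs, loose objects: $loose"
```

Timing the same git blame invocation before and after the repack would
show whether packing was part of the problem.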
