How to efficiently blame an entire repo?

* How to efficiently blame an entire repo?
@ 2010-04-29 23:12 Jay Soffian
  2010-04-30 19:45 ` Avery Pennarun
  2010-04-30 21:21 ` Jeff King
  0 siblings, 2 replies; 4+ messages in thread
From: Jay Soffian @ 2010-04-29 23:12 UTC
  To: git

Let's say you've got a repo with ~ 40K files and 35K commits.
Well-packed .git is about 800MB.

You want to find out how many lines of code a particular group of
individuals has contributed to HEAD.

The naive solution is to run git blame on all 40K files, grep'ing for
just the authors you want.

Possibly a step up from that is first using log --name-status
--author=... to find just the files which have been touched by those
authors and then blaming only those files.

I guess the next step up would be parsing the diff hunks output by log
-p, but then you're basically re-implementing blame I think.

Am I missing a clever solution?

j.

* Re: How to efficiently blame an entire repo?
  2010-04-29 23:12 How to efficiently blame an entire repo? Jay Soffian
@ 2010-04-30 19:45 ` Avery Pennarun
  2010-04-30 20:16   ` Jay Soffian
  2010-04-30 21:21 ` Jeff King
  1 sibling, 1 reply; 4+ messages in thread
From: Avery Pennarun @ 2010-04-30 19:45 UTC
  To: Jay Soffian; +Cc: git

On Thu, Apr 29, 2010 at 7:12 PM, Jay Soffian <jaysoffian@gmail.com> wrote:
> Let's say you've got a repo with ~ 40K files and 35K commits.
> Well-packed .git is about 800MB.
>
> You want to find out how many lines of code a particular group of
> individuals has contributed to HEAD.
>[...]
> Am I missing a clever solution?

How often do you need to do this?  If it's just once in your life,
then the brute force solution of just letting 'git blame' grind
through it for a few hours is probably the cleverest :)

Have fun,

Avery

* Re: How to efficiently blame an entire repo?
  2010-04-30 19:45 ` Avery Pennarun
@ 2010-04-30 20:16   ` Jay Soffian
  0 siblings, 0 replies; 4+ messages in thread
From: Jay Soffian @ 2010-04-30 20:16 UTC
  To: Avery Pennarun; +Cc: git

On Fri, Apr 30, 2010 at 3:45 PM, Avery Pennarun <apenwarr@gmail.com> wrote:
> On Thu, Apr 29, 2010 at 7:12 PM, Jay Soffian <jaysoffian@gmail.com> wrote:
>> Let's say you've got a repo with ~ 40K files and 35K commits.
>> Well-packed .git is about 800MB.
>>
>> You want to find out how many lines of code a particular group of
>> individuals has contributed to HEAD.
>>[...]
>> Am I missing a clever solution?
>
> How often do you need to do this?  If it's just once in your life,
> then the brute force solution of just letting 'git blame' grind
> through it for a few hours is probably the cleverest :)

Yeah, I ended up doing this basically.

Set up a .mailmap mapping the authors I was interested in to domain.com. Then:

$ git log --pretty='%H %aE' HEAD | grep domain.com |
  awk '{print $1}' |
  git log --no-walk --stdin --name-only --pretty=%n |
  grep -v '^$' | sort -u > files1
$ git ls-files | sort > files2
$ comm -12 files1 files2 > files
$ xargs < files -n1 git annotate | grep domain.com

I didn't use --author=domain.com w/the first log invocation because I
wasn't sure if it respected .mailmap and was too lazy to look it up.

I probably could've used --diff-filter in the second log invocation, but, meh.

So that worked. Took about 12 minutes to run on a recent Macbook Pro.
Aside, blame's --porcelain switch is rather poorly documented and
annotate seemed to have the right output for the job.
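
For comparison, a rough sketch of the same tally done with blame's
machine-readable output rather than annotate. It assumes a Git recent
enough to provide --line-porcelain (which repeats the full header for
every line), and the helper name blame_totals is made up for
illustration; it is not something from this thread. Whether the emitted
author-mail reflects .mailmap is not something this sketch relies on.

```shell
#!/bin/sh
# blame_totals (hypothetical helper): read file names, one per line, on
# stdin and print "<count> <author email>" totals for lines at HEAD.
blame_totals() {
    while IFS= read -r f; do
        # --line-porcelain emits an "author-mail <...>" header per line.
        # stdin is redirected so blame cannot consume the file list.
        git blame --line-porcelain HEAD -- "$f" </dev/null
    done |
        # Keep only the email inside the angle brackets. Content lines
        # start with a tab, so they cannot match this anchored pattern.
        sed -n 's/^author-mail <\(.*\)>$/\1/p' |
        sort | uniq -c | sort -rn
}
```

Something like "blame_totals < files" would then feed it the file list
built by the pipeline above, and the result can be grep'ed for
domain.com as before.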

j.

* Re: How to efficiently blame an entire repo?
  2010-04-29 23:12 How to efficiently blame an entire repo? Jay Soffian
  2010-04-30 19:45 ` Avery Pennarun
@ 2010-04-30 21:21 ` Jeff King
  1 sibling, 0 replies; 4+ messages in thread
From: Jeff King @ 2010-04-30 21:21 UTC
  To: Jay Soffian; +Cc: git

On Thu, Apr 29, 2010 at 07:12:27PM -0400, Jay Soffian wrote:

> Let's say you've got a repo with ~ 40K files and 35K commits.
> Well-packed .git is about 800MB.
> 
> You want to find out how many lines of code a particular group of
> individuals has contributed to HEAD.
> 
> The naive solution is to run git blame on all 40K files, grep'ing for
> just the authors you want.

With the exception of your "blame only those files that you know your
authors have touched" optimization, I think you pretty much have to do
this. Anything else will just be reimplementing blame. You can't throw
away most content prematurely, because it may end up blaming to your
authors of interest eventually.
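
The "blame only those files your authors have touched" optimization
could be sketched roughly as below. The helper name candidate_files is
hypothetical, and the sketch assumes that a plain --author pattern is
enough to select the commits; whether --author takes .mailmap into
account is left open in this thread, so treat that as an assumption.

```shell
#!/bin/sh
# candidate_files (hypothetical helper): list files that still exist in
# the index and were touched by at least one commit whose author matches
# the pattern given as $1.
candidate_files() {
    touched=$(mktemp) || return 1
    # Every path named by a matching commit, de-duplicated. The empty
    # format leaves only blank separator lines, which grep drops.
    git log --author="$1" --name-only --pretty=format: HEAD |
        grep -v '^$' | LC_ALL=C sort -u > "$touched"
    # Intersect with the currently tracked files; this drops paths that
    # were later deleted or renamed away.
    git ls-files | LC_ALL=C sort | LC_ALL=C comm -12 "$touched" -
    rm -f "$touched"
}
```

The output is only a candidate list: each file on it still has to go
through blame, since touching a file does not mean any of those lines
survive in HEAD.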

I think this is also what Junio ended up doing when presenting at
GitTogether '08:

  http://userweb.kernel.org/~junio/200810-Chron.pdf

In theory you might be able to do multi-file blame faster.  I would be
curious to see the performance difference between:

  $ git blame file1 file2 ;# not actually implemented

and

  $ for i in file1 file2; do git blame "$i"; done

Much of the work is O(content), but there is some overlap in walking the
history and generating diffs.

-Peff
