git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [GitStats] Bling bling or some statistics on the git.git repository
       [not found] <bd6139dc0807090621n308b0159n92d946c165d3a5dd@mail.gmail.com>
@ 2008-07-11 21:04 ` Sverre Rabbelier
  2008-07-11 21:22   ` Johannes Schindelin
  0 siblings, 1 reply; 21+ messages in thread
From: Sverre Rabbelier @ 2008-07-11 21:04 UTC (permalink / raw)
  To: Git Mailinglist; +Cc: David Symonds

[I sent this mail earlier, but I think vger rejected it due to the
size of the attachments, I have uploaded them instead now, they can be
found at:
http://alturin.googlepages.com/activity_per_author.txt
http://alturin.googlepages.com/full_activity.txt ]

Heya,

Today I sat down and finished the activity aggregation code. Now it is
possible to generate the attached files with the following commands:
$ stats.py author -e --id=an > activity_per_author.txt
$ stats.py author -a  > full_activity.txt

The first one calculates the activity of all developers on a per-file
basis and dumps it into the file. The "--id=an" switch sets the
grouping field to "%an" (see man git-log), since the default (%ae) is
not that helpful for git.git (I don't know people by their e-mail, I
know them by their name). This was already possible (with "author
-d"), but before one had to pick a specific developer, now it will
show for _all_ developers.This is interesting stuff, although for a
huge project like git it's a bit much to take in. What is probably
more interesting is the second command, it shows how much change a
file has had in it's existence.
I temporarily modified the code to output %04d instead of %4d so that
I could do the following:
$ stats.py author -a  > full_activity_sortable.txt

A few highlights from the sorted file:

$ cat full_activity_sortable.txt | sort | tail -n 20
0170:  2721+  1060- = refs.c
0172:  4369+  2004- = builtin-pack-objects.c
0177:   345+   233- = GIT-VERSION-GEN
0178:  2855+  2121- = commit.c
0178:  4779+  2227- = fast-import.c
0179:  2677+  1400- = read-cache.c
0185:  5661+  2056- = builtin-apply.c
0186:  3269+  1255- = revision.c
0213:  1884+   460- = Documentation/config.txt
0232:  2257+  1621- = Documentation/git.txt
0236:  3990+  1991- = contrib/fast-import/git-p4
0281:  2753+  2220- = git.c
0333: 10259+  7150- = git-gui.sh
0338: 11337+  6187- = git-svn.perl
0338:  5755+  3159- = sha1_file.c
0397: 10230+  9599- = diff.c
0412: 23248+ 20257- = gitk
0432: 10580+  4502- = gitweb/gitweb.perl
0490:  1412+   619- = cache.h
0977:  4703+  2705- = Makefile

$ cat Makefile | wc -l
1482

For some reason you people can't seem to make up your mind about a
file that's not even 1500 lines in size ;). With almost a thousand
edits so far, it's been edited so many times it could've been written
from scratch three times (except that the amount of lines deleted
doesn't match). Also interesting to note is that the "external" files
such as gitweb, gitk, git-gui and git-svn make up the bulk of all
changes. The two contenders from the native git camp are diff.c and
sha1_file.c which both have a lot of LOC. This information is
interesting for GitStats as it might help determine which files have
had a lot of change, and which files are not touched a lot.

A note is in order here, this data was mined with "git log --num-stat"
so things like moving files and copying files are not accounted for. I
thought about using git-blame to gather this info before, but it is
not the right tool for the job. If anyone else has any idea's on what
would be better please let me know and I'll happily dig into it :).

--
Cheers,

Sverre Rabbelier

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [GitStats] Bling bling or some statistics on the git.git repository
  2008-07-11 21:04 ` [GitStats] Bling bling or some statistics on the git.git repository Sverre Rabbelier
@ 2008-07-11 21:22   ` Johannes Schindelin
  2008-07-11 21:39     ` Johannes Schindelin
  2008-07-11 21:52     ` [GitStats] Bling bling or some statistics on the git.git repository Sverre Rabbelier
  0 siblings, 2 replies; 21+ messages in thread
From: Johannes Schindelin @ 2008-07-11 21:22 UTC (permalink / raw)
  To: Sverre Rabbelier; +Cc: Git Mailinglist, David Symonds

Hi,

On Fri, 11 Jul 2008, Sverre Rabbelier wrote:

> I temporarily modified the code to output %04d instead of %4d so that I 
> could do the following:
>
>	 $ stats.py author -a > full_activity_sortable.txt

You might be delighted to read up on the "-n" switch to sort(1).

> A few highlights from the sorted file:
> 
> $ cat full_activity_sortable.txt | sort | tail -n 20

More intuitive would have been "sort -r | head -n 20", I guess.

> 0170:  2721+  1060- = refs.c

I guess that 170 is the total number of commit touching that file, the "+" 
and "-" numbers the changes respectively?

I think quite a lot of our changes do code moves; this should be accounted 
for differently.

> 0172:  4369+  2004- = builtin-pack-objects.c
> 0177:   345+   233- = GIT-VERSION-GEN
> 0178:  2855+  2121- = commit.c
> 0178:  4779+  2227- = fast-import.c
> 0179:  2677+  1400- = read-cache.c
> 0185:  5661+  2056- = builtin-apply.c
> 0186:  3269+  1255- = revision.c
> 0213:  1884+   460- = Documentation/config.txt
> 0232:  2257+  1621- = Documentation/git.txt
> 0236:  3990+  1991- = contrib/fast-import/git-p4
> 0281:  2753+  2220- = git.c
> 0333: 10259+  7150- = git-gui.sh
> 0338: 11337+  6187- = git-svn.perl
> 0338:  5755+  3159- = sha1_file.c
> 0397: 10230+  9599- = diff.c
> 0412: 23248+ 20257- = gitk
> 0432: 10580+  4502- = gitweb/gitweb.perl
> 0490:  1412+   619- = cache.h
> 0977:  4703+  2705- = Makefile
> 
> $ cat Makefile | wc -l
> 1482
> 
> For some reason you people can't seem to make up your mind about a
> file that's not even 1500 lines in size ;).

Heh.  We might need to change it once or twice, in the future.

> A note is in order here, this data was mined with "git log --num-stat"
> so things like moving files and copying files are not accounted for.

In my opinion it would be even more interesting to see code moves (i.e. 
not whole files).  For example, we moved some stuff from builtins into the 
library.  The real change here is not in the lines added and deleted.

> I thought about using git-blame to gather this info before, but it is 
> not the right tool for the job. If anyone else has any idea's on what 
> would be better please let me know and I'll happily dig into it :).

I think that you need to analyze the diff directly.  One possible (quick 
'n dirty) way would be to cut out long consecutive "+" parts of the hunks, 
replace the "-" by "+", and use "git diff --no-index" to do the hard part 
of searching for that code in the "-" part of the original diff.

If that turns out to be useful, we can still think about a proper API 
using xdiff.

Just an idea,
Dscho

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [GitStats] Bling bling or some statistics on the git.git repository
  2008-07-11 21:22   ` Johannes Schindelin
@ 2008-07-11 21:39     ` Johannes Schindelin
  2008-07-11 21:55       ` Johannes Schindelin
  2008-07-11 21:55       ` Sverre Rabbelier
  2008-07-11 21:52     ` [GitStats] Bling bling or some statistics on the git.git repository Sverre Rabbelier
  1 sibling, 2 replies; 21+ messages in thread
From: Johannes Schindelin @ 2008-07-11 21:39 UTC (permalink / raw)
  To: Sverre Rabbelier; +Cc: Git Mailinglist, David Symonds

Hi,

something else I just realized: you might want to use .mailmap, e.g. to 
coalesce the changes of Shawn "O." Pearce correctly.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [GitStats] Bling bling or some statistics on the git.git repository
  2008-07-11 21:22   ` Johannes Schindelin
  2008-07-11 21:39     ` Johannes Schindelin
@ 2008-07-11 21:52     ` Sverre Rabbelier
  2008-07-11 22:07       ` Johannes Schindelin
  1 sibling, 1 reply; 21+ messages in thread
From: Sverre Rabbelier @ 2008-07-11 21:52 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Git Mailinglist, David Symonds

On Fri, Jul 11, 2008 at 11:22 PM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> Hi,
>
> On Fri, 11 Jul 2008, Sverre Rabbelier wrote:
>
>> I temporarily modified the code to output %04d instead of %4d so that I
>> could do the following:
>>
>>        $ stats.py author -a > full_activity_sortable.txt
>
> You might be delighted to read up on the "-n" switch to sort(1).

Heh, yes, very much so :). I probably shouldof known there is such an
option, but having the source at hand the change to '%4d' was the
first thing that came to mind.

>> A few highlights from the sorted file:
>>
>> $ cat full_activity_sortable.txt | sort | tail -n 20
>
> More intuitive would have been "sort -r | head -n 20", I guess.

Since that wouldof put the 'number one' at the top? Yeah, I guess it
wouldof, nice one.

>> 0170:  2721+  1060- = refs.c
>
> I guess that 170 is the total number of commit touching that file, the "+"
> and "-" numbers the changes respectively?

Correct, I probably should have explain that. The +es are how many
lines were added and the -es are the total amount of lines that were
deleted, yup.

> I think quite a lot of our changes do code moves; this should be accounted
> for differently.

Yeah, I wish 'git log -C -C -M --numstat --sacrifice-chicken
--pretty=format:%ae --' would take care of that... That is, a
git-blame like mechanism that would detect such moves on a per-commit
basis and report them would be very useful to me.

>> For some reason you people can't seem to make up your mind about a
>> file that's not even 1500 lines in size ;).
>
> Heh.  We might need to change it once or twice, in the future.

*chuckles*, I'm curious why the Makefile is such a hard file to get right :).

>> A note is in order here, this data was mined with "git log --num-stat"
>> so things like moving files and copying files are not accounted for.
>
> In my opinion it would be even more interesting to see code moves (i.e.
> not whole files).  For example, we moved some stuff from builtins into the
> library.  The real change here is not in the lines added and deleted.

Very much so, but the former I figure can be easily done with 'git log
-C -C -M' I discovered (I need to parse it's output though, and also
determine what to do with moves statistics wise. Should changes made
due to moves just be ignored?)

>> I thought about using git-blame to gather this info before, but it is
>> not the right tool for the job. If anyone else has any idea's on what
>> would be better please let me know and I'll happily dig into it :).
>
> I think that you need to analyze the diff directly.  One possible (quick
> 'n dirty) way would be to cut out long consecutive "+" parts of the hunks,
> replace the "-" by "+", and use "git diff --no-index" to do the hard part
> of searching for that code in the "-" part of the original diff.

That sounds interesting, I won't need to actually do that though, I
already have a diff parser that gives me the lines added VS lines
deleted on a hunk-by-hunk basis. If it is a true move (e.g., code
removed in file X and added in file Y) it should be trivial to detect
that.
Something along the lines of:
for hunk in added:
  if hunk in deleted:
    print("Over here!!")

> Just an idea,

Much appreciated! I will look into this.

-- 
Cheers,

Sverre Rabbelier

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [GitStats] Bling bling or some statistics on the git.git repository
  2008-07-11 21:39     ` Johannes Schindelin
@ 2008-07-11 21:55       ` Johannes Schindelin
  2008-07-11 22:05         ` Sverre Rabbelier
  2008-07-11 21:55       ` Sverre Rabbelier
  1 sibling, 1 reply; 21+ messages in thread
From: Johannes Schindelin @ 2008-07-11 21:55 UTC (permalink / raw)
  To: Sverre Rabbelier; +Cc: Git Mailinglist, David Symonds

Hi,

On Fri, 11 Jul 2008, Johannes Schindelin wrote:

> something else I just realized: you might want to use .mailmap, e.g. to 
> coalesce the changes of Shawn "O." Pearce correctly.

Yet another thing: while it is true that git-gui is usually pulled in 
(with the subtree strategy), some parts were changed in git.git directly, 
so you will need to cope with the wholesale rename with every merge.

Besides, it is slightly distracting to see the file names differently from 
what they are in HEAD^{tree}.  But that may be just me.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [GitStats] Bling bling or some statistics on the git.git repository
  2008-07-11 21:39     ` Johannes Schindelin
  2008-07-11 21:55       ` Johannes Schindelin
@ 2008-07-11 21:55       ` Sverre Rabbelier
  2008-07-11 22:11         ` Johannes Schindelin
  1 sibling, 1 reply; 21+ messages in thread
From: Sverre Rabbelier @ 2008-07-11 21:55 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Git Mailinglist, David Symonds

Heya,

On Fri, Jul 11, 2008 at 11:39 PM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> something else I just realized: you might want to use .mailmap, e.g. to
> coalesce the changes of Shawn "O." Pearce correctly.

Ah, hmmm, but I'm not sure how, as not nearly every developer is in
the .mailmap file and many devs use different e-mails while most use
the same name. A similar file containing developer aliases maybe?
E.g.:
"Shawn Pearce = Shawn O. Pearce"?


-- 
Cheers,

Sverre Rabbelier

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [GitStats] Bling bling or some statistics on the git.git repository
  2008-07-11 21:55       ` Johannes Schindelin
@ 2008-07-11 22:05         ` Sverre Rabbelier
  2008-07-11 22:10           ` Johannes Schindelin
  0 siblings, 1 reply; 21+ messages in thread
From: Sverre Rabbelier @ 2008-07-11 22:05 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Git Mailinglist, David Symonds

On Fri, Jul 11, 2008 at 11:55 PM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> Hi,
>
> On Fri, 11 Jul 2008, Johannes Schindelin wrote:
>
>> something else I just realized: you might want to use .mailmap, e.g. to
>> coalesce the changes of Shawn "O." Pearce correctly.
>
> Yet another thing: while it is true that git-gui is usually pulled in
> (with the subtree strategy), some parts were changed in git.git directly,
> so you will need to cope with the wholesale rename with every merge.

I think I understand the cause, but I'm not sure I understand the
consequences? Will some files at some times appear to be located at
/path/to/file.txt and at other times at /subdir/path/to/file.txt? If
so, how could I possibly handle that? How do I know that when it says
/path/to/file.txt it means /subdir/path/to/file.txt instead?

> Besides, it is slightly distracting to see the file names differently from
> what they are in HEAD^{tree}.  But that may be just me.

Different how? In that it shows the contents of subdirs? For at least
my repo this is a Very Good Thing (tm) as all my source-code is in a
/src directory. If all changes to subdirs were aggregated I wouldn't
get any useful metrics at all. Or am I misunderstanding the difference
between the current output and that of HEAD^{tree}?

-- 
Cheers,

Sverre Rabbelier

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [GitStats] Bling bling or some statistics on the git.git repository
  2008-07-11 21:52     ` [GitStats] Bling bling or some statistics on the git.git repository Sverre Rabbelier
@ 2008-07-11 22:07       ` Johannes Schindelin
  2008-07-11 22:50         ` Sverre Rabbelier
  0 siblings, 1 reply; 21+ messages in thread
From: Johannes Schindelin @ 2008-07-11 22:07 UTC (permalink / raw)
  To: Sverre Rabbelier; +Cc: Git Mailinglist, David Symonds

Hi,

On Fri, 11 Jul 2008, Sverre Rabbelier wrote:

> On Fri, Jul 11, 2008 at 11:22 PM, Johannes Schindelin
> <Johannes.Schindelin@gmx.de> wrote:
> >
> > On Fri, 11 Jul 2008, Sverre Rabbelier wrote:
> >
> > I think quite a lot of our changes do code moves; this should be 
> > accounted for differently.
> 
> Yeah, I wish 'git log -C -C -M --numstat --sacrifice-chicken 
> --pretty=format:%ae --' would take care of that... That is, a git-blame 
> like mechanism that would detect such moves on a per-commit basis and 
> report them would be very useful to me.

Well, the chicken (or better, a goat) should be sacrificed by you...  The 
option I would call "--code-moves".

But the semantics of that need to be sorted out in a shell script first; 
maybe like I outlined (if that was not coherent, please say so).

> >> For some reason you people can't seem to make up your mind about a 
> >> file that's not even 1500 lines in size ;).
> >
> > Heh.  We might need to change it once or twice, in the future.
> 
> *chuckles*, I'm curious why the Makefile is such a hard file to get 
> right :).

Well, it is not a matter of getting it right, but it is a matter of 
changes.  For example, everytime we move code from one program into the 
library, and create a file for that, code changes.

Everytime we realize that we need to set some special settings for a 
platform, the Makefile needs to be changed.

Everytime a shell script is added, or converted to a builtin, the Makefile 
needs to be changed.

We are lucky that our maintainer is such a good shell hacker, otherwise 
every version number change, you guessed it, would need the Makefile to be 
changed.

> >> A note is in order here, this data was mined with "git log 
> >> --num-stat" so things like moving files and copying files are not 
> >> accounted for.
> >
> > In my opinion it would be even more interesting to see code moves 
> > (i.e. not whole files).  For example, we moved some stuff from 
> > builtins into the library.  The real change here is not in the lines 
> > added and deleted.
> 
> Very much so, but the former I figure can be easily done with 'git log
> -C -C -M' I discovered (I need to parse it's output though, and also
> determine what to do with moves statistics wise. Should changes made
> due to moves just be ignored?)

That is not very interesting, as we often move so small parts (think "one 
function") that -C -C -M does not trigger.

> >> I thought about using git-blame to gather this info before, but it is 
> >> not the right tool for the job. If anyone else has any idea's on what 
> >> would be better please let me know and I'll happily dig into it :).
> >
> > I think that you need to analyze the diff directly.  One possible (quick
> > 'n dirty) way would be to cut out long consecutive "+" parts of the hunks,
> > replace the "-" by "+", and use "git diff --no-index" to do the hard part
> > of searching for that code in the "-" part of the original diff.
> 
> That sounds interesting, I won't need to actually do that though, I
> already have a diff parser that gives me the lines added VS lines
> deleted on a hunk-by-hunk basis. If it is a true move (e.g., code
> removed in file X and added in file Y) it should be trivial to detect
> that.
> Something along the lines of:
> for hunk in added:
>   if hunk in deleted:
>     print("Over here!!")

I think that is not enough, as a code move can mean that part of a 
function was refactored into a function.  The consequence is often a 
reindent, and possibly rewrapping.

And it can mean that some lines have to be inserted here and there.  I 
still would count that as a code move "with touch-ups".

So I'd like to see something like

<number-of-commits>: <lines-added> <lines-removed> \
	<lines-moved-from> <lines-moved-to> <filename>

BTW I realized something else: your 
http://alturin.googlepages.com/full_activity.txt lists only 
"gitk-git/po/es.po" under git-git/po/.  And it has as many added as 
deleted lines.

So I suspect that "po/*" really lists both gitk's as well as git-gui's .po 
files, but merged together.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [GitStats] Bling bling or some statistics on the git.git repository
  2008-07-11 22:05         ` Sverre Rabbelier
@ 2008-07-11 22:10           ` Johannes Schindelin
  0 siblings, 0 replies; 21+ messages in thread
From: Johannes Schindelin @ 2008-07-11 22:10 UTC (permalink / raw)
  To: Sverre Rabbelier; +Cc: Git Mailinglist, David Symonds

Hi,

On Sat, 12 Jul 2008, Sverre Rabbelier wrote:

> On Fri, Jul 11, 2008 at 11:55 PM, Johannes Schindelin
> <Johannes.Schindelin@gmx.de> wrote:
>
> > On Fri, 11 Jul 2008, Johannes Schindelin wrote:
> >
> >> something else I just realized: you might want to use .mailmap, e.g. 
> >> to coalesce the changes of Shawn "O." Pearce correctly.
> >
> > Yet another thing: while it is true that git-gui is usually pulled in 
> > (with the subtree strategy), some parts were changed in git.git 
> > directly, so you will need to cope with the wholesale rename with 
> > every merge.
> 
> I think I understand the cause, but I'm not sure I understand the
> consequences? Will some files at some times appear to be located at
> /path/to/file.txt and at other times at /subdir/path/to/file.txt? If
> so, how could I possibly handle that? How do I know that when it says
> /path/to/file.txt it means /subdir/path/to/file.txt instead?
>
> > Besides, it is slightly distracting to see the file names differently 
> > from what they are in HEAD^{tree}.  But that may be just me.
> 
> Different how? In that it shows the contents of subdirs? For at least
> my repo this is a Very Good Thing (tm) as all my source-code is in a
> /src directory. If all changes to subdirs were aggregated I wouldn't
> get any useful metrics at all. Or am I misunderstanding the difference
> between the current output and that of HEAD^{tree}?

What I meant is this: you list "git-gui.sh", while I would have expected 
"git-gui/git-gui.sh".

So you might need to rename the keys of your aggregating dict (I haven't 
looked at the code, but suspect that's how you do it) with renaming 
commits (including the merge commits).

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [GitStats] Bling bling or some statistics on the git.git repository
  2008-07-11 21:55       ` Sverre Rabbelier
@ 2008-07-11 22:11         ` Johannes Schindelin
  2008-07-11 22:14           ` Sverre Rabbelier
  0 siblings, 1 reply; 21+ messages in thread
From: Johannes Schindelin @ 2008-07-11 22:11 UTC (permalink / raw)
  To: Sverre Rabbelier; +Cc: Git Mailinglist, David Symonds

Hi,

On Fri, 11 Jul 2008, Sverre Rabbelier wrote:

> On Fri, Jul 11, 2008 at 11:39 PM, Johannes Schindelin
> <Johannes.Schindelin@gmx.de> wrote:
> > something else I just realized: you might want to use .mailmap, e.g. 
> > to coalesce the changes of Shawn "O." Pearce correctly.
> 
> Ah, hmmm, but I'm not sure how, as not nearly every developer is in
> the .mailmap file and many devs use different e-mails while most use
> the same name. A similar file containing developer aliases maybe?
> E.g.:
> "Shawn Pearce = Shawn O. Pearce"?

The mechanism is this: you look up the email in .mailmap (actually you 
parse that once, but the idea stays the same), and if there is a name for 
it, you use that _instead of_ the given author name.

Otherwise you use the given author name, and typos be damned.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [GitStats] Bling bling or some statistics on the git.git repository
  2008-07-11 22:11         ` Johannes Schindelin
@ 2008-07-11 22:14           ` Sverre Rabbelier
  2008-07-11 23:02             ` Johannes Schindelin
  0 siblings, 1 reply; 21+ messages in thread
From: Sverre Rabbelier @ 2008-07-11 22:14 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Git Mailinglist, David Symonds

On Sat, Jul 12, 2008 at 12:11 AM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> The mechanism is this: you look up the email in .mailmap (actually you
> parse that once, but the idea stays the same), and if there is a name for
> it, you use that _instead of_ the given author name.

Ah, so you suggest changing "format:%ae" to "format:%ae %an" and
falling back to the latter id specified on that line if the former is
not in .mailmap? That would work I guess, I'll put it on my TODO list
:).

> Otherwise you use the given author name, and typos be damned.

Fair enough :).

-- 
Cheers,

Sverre Rabbelier

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [GitStats] Bling bling or some statistics on the git.git repository
  2008-07-11 22:07       ` Johannes Schindelin
@ 2008-07-11 22:50         ` Sverre Rabbelier
  2008-07-11 23:33           ` Johannes Schindelin
  0 siblings, 1 reply; 21+ messages in thread
From: Sverre Rabbelier @ 2008-07-11 22:50 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Git Mailinglist, David Symonds

On Sat, Jul 12, 2008 at 12:07 AM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
>> On Fri, Jul 11, 2008 at 11:22 PM, Johannes Schindelin
>> Yeah, I wish 'git log -C -C -M --numstat --sacrifice-chicken
>> --pretty=format:%ae --' would take care of that... That is, a git-blame
>> like mechanism that would detect such moves on a per-commit basis and
>> report them would be very useful to me.
>
> Well, the chicken (or better, a goat) should be sacrificed by you...  The
> option I would call "--code-moves".

If you suggest I write up a patch to 'git log' I am afraid that would
require quite a bit more skill && knowledge of 'git log' than I have
(which is about Null :P).

> But the semantics of that need to be sorted out in a shell script first;
> maybe like I outlined (if that was not coherent, please say so).

Python is one big shell script :P, so if you meant that it should be
part of GitStats (instead of part of 'git log', which I commented on
above), python would be just fine :). The concept was clear enough
though, I think I understand what you mean.

> Well, it is not a matter of getting it right, but it is a matter of
> changes.  For example, everytime we move code from one program into the
> library, and create a file for that, code changes.

<snip>

Yes, that's true, with what you described it makes sense :).

>> Very much so, but the former I figure can be easily done with 'git log
>> -C -C -M' I discovered (I need to parse it's output though, and also
>> determine what to do with moves statistics wise. Should changes made
>> due to moves just be ignored?)
>
> That is not very interesting, as we often move so small parts (think "one
> function") that -C -C -M does not trigger.

Right, why aim for the stuff when there's much more interesting fun
out there? If there was a --code-moves I agree with you that it would
be a lot more interesting to have than going with the current approach
and throwing in '-C -C -M'.

>> That sounds interesting, I won't need to actually do that though, I
>> already have a diff parser that gives me the lines added VS lines
>> deleted on a hunk-by-hunk basis. If it is a true move (e.g., code
>> removed in file X and added in file Y) it should be trivial to detect
>> that.
>> Something along the lines of:
>> for hunk in added:
>>   if hunk in deleted:
>>     print("Over here!!")
>
> I think that is not enough, as a code move can mean that part of a
> function was refactored into a function.  The consequence is often a
> reindent, and possibly rewrapping.

Mhhh, such would be beyond the scope of implementing manually indeed,
and should be left to the likes of a diff tool instead in order to
prevent reinventing the wheel :).

> And it can mean that some lines have to be inserted here and there.  I
> still would count that as a code move "with touch-ups".

True, true, so it turns out that the most interesting data is the most
difficult to mine, how typical.

> So I'd like to see something like
>
> <number-of-commits>: <lines-added> <lines-removed> \
>        <lines-moved-from> <lines-moved-to> <filename>

Ah, I like the idea of recording moved-from and moved-to seperately
instead of ignoring it, why throw away such a perfectly useful
statistic. It would be really nice if I could get this data from 'git
log' (e.g., the lines-moved-from and lines-moved-to) instead of having
to calculate it myself.

> BTW I realized something else: your
> http://alturin.googlepages.com/full_activity.txt lists only
> "gitk-git/po/es.po" under git-git/po/.  And it has as many added as
> deleted lines.

Correct, that's because that is what 'git log' tells me. Have a look at:
$ git log --pretty=format:%ae --numstat HEAD --
And grep for "\.po", you'll see that it lists the other po files under
"/po/de.po"

> So I suspect that "po/*" really lists both gitk's as well as git-gui's .po
> files, but merged together.

Feasible, if I use '-C -C -M' then the behavior on a directory rename
should be to take the found statistics under that directory and move
them too. That could be expensive though, what with having to search
all the keys whether they are affected and so.

-- 
Cheers,

Sverre Rabbelier

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [GitStats] Bling bling or some statistics on the git.git repository
  2008-07-11 22:14           ` Sverre Rabbelier
@ 2008-07-11 23:02             ` Johannes Schindelin
  2008-07-11 23:28               ` [PATCH] Add pretty format %aN which gives the author name, respecting .mailmap Johannes Schindelin
  0 siblings, 1 reply; 21+ messages in thread
From: Johannes Schindelin @ 2008-07-11 23:02 UTC (permalink / raw)
  To: Sverre Rabbelier; +Cc: Git Mailinglist, David Symonds

Hi,

On Sat, 12 Jul 2008, Sverre Rabbelier wrote:

> On Sat, Jul 12, 2008 at 12:11 AM, Johannes Schindelin
> <Johannes.Schindelin@gmx.de> wrote:
> > The mechanism is this: you look up the email in .mailmap (actually you
> > parse that once, but the idea stays the same), and if there is a name for
> > it, you use that _instead of_ the given author name.
> 
> Ah, so you suggest changing "format:%ae" to "format:%ae %an" and
> falling back to the latter id specified on that line if the former is
> not in .mailmap? That would work I guess, I'll put it on my TODO list
> :).

Hmm.  I missed the fact that you used pretty formats.  Seems like %an does 
not respect .mailmap; I'm on it.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH] Add pretty format %aN which gives the author name, respecting .mailmap
  2008-07-11 23:02             ` Johannes Schindelin
@ 2008-07-11 23:28               ` Johannes Schindelin
  2008-07-11 23:30                 ` Sverre Rabbelier
  0 siblings, 1 reply; 21+ messages in thread
From: Johannes Schindelin @ 2008-07-11 23:28 UTC (permalink / raw)
  To: Sverre Rabbelier, gitster; +Cc: Git Mailinglist, David Symonds


The pretty format %an does not respect .mailmap, but gives the exact
author name recorded in the commit.  Sometimes it is more desirable,
however, to look if the email has another name mapped to it in .mailmap.

This commit adds %aN (and %cN for the committer name) to do exactly that.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---

	On Sat, 12 Jul 2008, Johannes Schindelin wrote:

	> On Sat, 12 Jul 2008, Sverre Rabbelier wrote:
	> 
	> > On Sat, Jul 12, 2008 at 12:11 AM, Johannes Schindelin
	> > <Johannes.Schindelin@gmx.de> wrote:
	> > > The mechanism is this: you look up the email in .mailmap 
	> > > (actually you parse that once, but the idea stays the same), and if 
	> > > there is a name for it, you use that _instead of_ the given author 
	> > > name.
	> > 
	> > Ah, so you suggest changing "format:%ae" to "format:%ae %an" 
	> > and falling back to the latter id specified on that line if the 
	> > former is not in .mailmap? That would work I guess, I'll put it on my 
	> > TODO list :).
	> 
	> Hmm.  I missed the fact that you used pretty formats.  Seems 
	> like %an does not respect .mailmap; I'm on it.

 Documentation/pretty-formats.txt |    2 ++
 pretty.c                         |   27 +++++++++++++++++++++++++--
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/Documentation/pretty-formats.txt b/Documentation/pretty-formats.txt
index 69e6d2f..c11d495 100644
--- a/Documentation/pretty-formats.txt
+++ b/Documentation/pretty-formats.txt
@@ -101,6 +101,7 @@ The placeholders are:
 - '%P': parent hashes
 - '%p': abbreviated parent hashes
 - '%an': author name
+- '%aN': author name (respecting .mailmap)
 - '%ae': author email
 - '%ad': author date
 - '%aD': author date, RFC2822 style
@@ -108,6 +109,7 @@ The placeholders are:
 - '%at': author date, UNIX timestamp
 - '%ai': author date, ISO 8601 format
 - '%cn': committer name
+- '%cN': committer name (respecting .mailmap)
 - '%ce': committer email
 - '%cd': committer date
 - '%cD': committer date, RFC2822 style
diff --git a/pretty.c b/pretty.c
index 8eb39e9..862364b 100644
--- a/pretty.c
+++ b/pretty.c
@@ -3,6 +3,8 @@
 #include "utf8.h"
 #include "diff.h"
 #include "revision.h"
+#include "path-list.h"
+#include "mailmap.h"
 
 static char *user_format;
 
@@ -288,6 +290,25 @@ static char *logmsg_reencode(const struct commit *commit,
 	return out;
 }
 
+static int mailmap_name(struct strbuf *sb, const char *email)
+{
+	static struct path_list *mail_map;
+	char buffer[1024];
+
+	if (!mail_map) {
+		mail_map = xcalloc(1, sizeof(*mail_map));
+		read_mailmap(mail_map, ".mailmap", NULL);
+	}
+
+	if (!mail_map->nr)
+		return -1;
+
+	if (!map_email(mail_map, email, buffer, sizeof(buffer)))
+		return -1;
+	strbuf_addstr(sb, buffer);
+	return 0;
+}
+
 static size_t format_person_part(struct strbuf *sb, char part,
                                const char *msg, int len)
 {
@@ -309,10 +330,12 @@ static size_t format_person_part(struct strbuf *sb, char part,
 	if (end >= len - 2)
 		goto skip;
 
-	if (part == 'n') {	/* name */
+	if (part == 'n' || part == 'N') {	/* name */
 		while (end > 0 && isspace(msg[end - 1]))
 			end--;
-		strbuf_add(sb, msg, end);
+		if (part != 'N' || !msg[end] || !msg[end + 1] ||
+				mailmap_name(sb, msg + end + 2) < 0)
+			strbuf_add(sb, msg, end);
 		return placeholder_len;
 	}
 	start = ++end; /* save email start position */
-- 
1.5.6.2.509.g109edf

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] Add pretty format %aN which gives the author name, respecting .mailmap
  2008-07-11 23:28               ` [PATCH] Add pretty format %aN which gives the author name, respecting .mailmap Johannes Schindelin
@ 2008-07-11 23:30                 ` Sverre Rabbelier
  2008-07-11 23:42                   ` Johannes Schindelin
  0 siblings, 1 reply; 21+ messages in thread
From: Sverre Rabbelier @ 2008-07-11 23:30 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: gitster, Git Mailinglist, David Symonds

On Sat, Jul 12, 2008 at 1:28 AM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
>
> The pretty format %an does not respect .mailmap, but gives the exact
> author name recorded in the commit.  Sometimes it is more desirable,
> however, to look if the email has another name mapped to it in .mailmap.
>
> This commit adds %aN (and %cN for the committer name) to do exactly that.
>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---

Whoah, that's fast ;).
I'm not sure what to do though, if I use this new %cN GitStats will
only work with the latest git version... :(

-- 
Cheers,

Sverre Rabbelier

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [GitStats] Bling bling or some statistics on the git.git repository
  2008-07-11 22:50         ` Sverre Rabbelier
@ 2008-07-11 23:33           ` Johannes Schindelin
  2008-07-12  7:39             ` Sverre Rabbelier
  2008-07-12 22:36             ` Sverre Rabbelier
  0 siblings, 2 replies; 21+ messages in thread
From: Johannes Schindelin @ 2008-07-11 23:33 UTC (permalink / raw)
  To: Sverre Rabbelier; +Cc: Git Mailinglist, David Symonds

Hi,

On Sat, 12 Jul 2008, Sverre Rabbelier wrote:

> On Sat, Jul 12, 2008 at 12:07 AM, Johannes Schindelin
> <Johannes.Schindelin@gmx.de> wrote:
> >> On Fri, Jul 11, 2008 at 11:22 PM, Johannes Schindelin Yeah, I wish 
> >> 'git log -C -C -M --numstat --sacrifice-chicken --pretty=format:%ae 
> >> --' would take care of that... That is, a git-blame like mechanism 
> >> that would detect such moves on a per-commit basis and report them 
> >> would be very useful to me.
> >
> > Well, the chicken (or better, a goat) should be sacrificed by you...  
> > The option I would call "--code-moves".
> 
> If you suggest I write up a patch to 'git log' I am afraid that would 
> require quite a bit more skill && knowledge of 'git log' than I have 
> (which is about Null :P).

If you were suspecting that I would write the patch once the semantics are 
finalized, you would be absolutely correct.

> > But the semantics of that need to be sorted out in a shell script 
> > first; maybe like I outlined (if that was not coherent, please say 
> > so).
> 
> Python is one big shell script :P, so if you meant that it should be 
> part of GitStats (instead of part of 'git log', which I commented on 
> above), python would be just fine :). The concept was clear enough 
> though, I think I understand what you mean.

Fair enough.  As long as you use a language that is easy to prototype 
quickly and dirtily in.  Such as Python.

> >> Very much so, but the former I figure can be easily done with 'git 
> >> log -C -C -M' I discovered (I need to parse it's output though, and 
> >> also determine what to do with moves statistics wise. Should changes 
> >> made due to moves just be ignored?)
> >
> > That is not very interesting, as we often move so small parts (think 
> > "one function") that -C -C -M does not trigger.
> 
> Right, why aim for the stuff when there's much more interesting fun out 
> there? If there was a --code-moves I agree with you that it would be a 
> lot more interesting to have than going with the current approach and 
> throwing in '-C -C -M'.

Let's go for it, then!

> >> That sounds interesting, I won't need to actually do that though, I 
> >> already have a diff parser that gives me the lines added VS lines 
> >> deleted on a hunk-by-hunk basis. If it is a true move (e.g., code 
> >> removed in file X and added in file Y) it should be trivial to detect 
> >> that.
> >>
> >> Something along the lines of:
> >>  for hunk in added:
> >>   if hunk in deleted:
> >>     print("Over here!!")
> >
> > I think that is not enough, as a code move can mean that part of a
> > function was refactored into a function.  The consequence is often a
> > reindent, and possibly rewrapping.
> 
> Mhhh, such would be beyond the scope of implementing manually indeed,
> and should be left to the likes of a diff tool instead in order to
> prevent reinventing the wheel :).

That is why I was suggesting using the diff tool with munged input to find 
out what works best.

When that is done, I'll turn it into C.

> > BTW I realized something else: your 
> > http://alturin.googlepages.com/full_activity.txt lists only 
> > "gitk-git/po/es.po" under git-git/po/.  And it has as many added as 
> > deleted lines.
> 
> Correct, that's because that is what 'git log' tells me.

I suspect that one big "git log" will not tell you enough.  You probably 
need to make your tool aware (at least a little) about merges, just as you 
probably made it aware about parent/child relationships (to track the 
changes along renames)...

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Add pretty format %aN which gives the author name, respecting .mailmap
  2008-07-11 23:30                 ` Sverre Rabbelier
@ 2008-07-11 23:42                   ` Johannes Schindelin
  2008-07-12  8:44                     ` Sverre Rabbelier
  0 siblings, 1 reply; 21+ messages in thread
From: Johannes Schindelin @ 2008-07-11 23:42 UTC (permalink / raw)
  To: Sverre Rabbelier; +Cc: gitster, Git Mailinglist, David Symonds

Hi,

On Sat, 12 Jul 2008, Sverre Rabbelier wrote:

> On Sat, Jul 12, 2008 at 1:28 AM, Johannes Schindelin
> <Johannes.Schindelin@gmx.de> wrote:
> >
> > The pretty format %an does not respect .mailmap, but gives the exact 
> > author name recorded in the commit.  Sometimes it is more desirable, 
> > however, to look if the email has another name mapped to it in 
> > .mailmap.
> >
> > This commit adds %aN (and %cN for the committer name) to do exactly 
> > that.
> >
> > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> > ---
> 
> Whoah, that's fast ;).

Heh.  The price is that my patches are usually more buggy than other 
contributors' patches.

> I'm not sure what to do though, if I use this new %cN GitStats will
> only work with the latest git version... :(

Yes, that is correct.  But my impression was that GitStats was never meant 
as a pure add-on, but rather some integral part of Git, no?  IOW at least 
contrib/ stuff.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [GitStats] Bling bling or some statistics on the git.git repository
  2008-07-11 23:33           ` Johannes Schindelin
@ 2008-07-12  7:39             ` Sverre Rabbelier
  2008-07-12 22:36             ` Sverre Rabbelier
  1 sibling, 0 replies; 21+ messages in thread
From: Sverre Rabbelier @ 2008-07-12  7:39 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Git Mailinglist, David Symonds

On Sat, Jul 12, 2008 at 1:33 AM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> If you were suspecting that I would write the patch once the semantics are
> finalized, you would be absolutely correct.

Awesome!

>> Mhhh, such would be beyond the scope of implementing manually indeed,
>> and should be left to the likes of a diff tool instead in order to
>> prevent reinventing the wheel :).
>
> That is why I was suggesting using the diff tool with munged input to find
> out what works best.

This makes sense, put the responsibility where it belongs, that way
someone else may use it as well.

> When that is done, I'll turn it into C.

Very much appreciated. I will start playing around with munged diffs
today and keep you posted of the result.

>> Correct, that's because that is what 'git log' tells me.
>
> I suspect that one big "git log" will not tell you enough.  You probably
> need to make your tool aware (at least a little) about merges, just as you
> probably made it aware about parent/child relationships (to track the
> changes along renames)...

Atm it is not aware of parent/child relationships in the activity
analysis. (This is not the case in the 'branch contains' metric, in
which I use 'git rev-list --parents' to extract and analyse that.) I
thought of tracking renames by honoring what "git log -C -C -M" tells
me (e.g., whenever it says {foo/bar.c => bar.c} I move all the metrics
under key "foo/bar.c" to "bar.c").

-- 
Cheers,

Sverre Rabbelier

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Add pretty format %aN which gives the author name, respecting .mailmap
  2008-07-11 23:42                   ` Johannes Schindelin
@ 2008-07-12  8:44                     ` Sverre Rabbelier
  0 siblings, 0 replies; 21+ messages in thread
From: Sverre Rabbelier @ 2008-07-12  8:44 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: gitster, Git Mailinglist, David Symonds

On Sat, Jul 12, 2008 at 1:42 AM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> Heh.  The price is that my patches are usually more buggy than other
> contributors' patches.

Pick Two: Good, Fast, and Cheap...? ;)

>> I'm not sure what to do though, if I use this new %cN GitStats will
>> only work with the latest git version... :(
>
> Yes, that is correct.  But my impression was that GitStats was never meant
> as a pure add-on, but rather some integral part of Git, no?  IOW at least
> contrib/ stuff.

Aye, that is true. Currently it should run with pretty much any
version, but if it does end up in git.git there is no reason not to
benefit from the most recent changes.

-- 
Cheers,

Sverre Rabbelier

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [GitStats] Bling bling or some statistics on the git.git repository
  2008-07-11 23:33           ` Johannes Schindelin
  2008-07-12  7:39             ` Sverre Rabbelier
@ 2008-07-12 22:36             ` Sverre Rabbelier
  2008-07-13  0:29               ` Johannes Schindelin
  1 sibling, 1 reply; 21+ messages in thread
From: Sverre Rabbelier @ 2008-07-12 22:36 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Git Mailinglist, David Symonds

On Sat, Jul 12, 2008 at 1:33 AM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> Fair enough.  As long as you use a language that is easy to prototype
> quickly and dirtily in.  Such as Python.

It turned out to work really nicely, especially since I already had a
parser for diffs ;).

> Let's go for it, then!

My first attempt at something usable is at [0].

> That is why I was suggesting using the diff tool with munged input to find
> out what works best.

I used python's difflib so that I didn't have to write out to a file.

> When that is done, I'll turn it into C.

Goodluckwiththat ;D.

[0] http://repo.or.cz/w/git-stats.git?a=commit;h=ac2a192f1dcec387e96f2a7eb5d66a38401b0722

-- 
Cheers,

Sverre Rabbelier

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [GitStats] Bling bling or some statistics on the git.git repository
  2008-07-12 22:36             ` Sverre Rabbelier
@ 2008-07-13  0:29               ` Johannes Schindelin
  0 siblings, 0 replies; 21+ messages in thread
From: Johannes Schindelin @ 2008-07-13  0:29 UTC (permalink / raw)
  To: sverre; +Cc: Git Mailinglist, David Symonds

Hi,

On Sun, 13 Jul 2008, Sverre Rabbelier wrote:

> On Sat, Jul 12, 2008 at 1:33 AM, Johannes Schindelin
> <Johannes.Schindelin@gmx.de> wrote:
> 
> > When that is done, I'll turn it into C.
> 
> Goodluckwiththat ;D.
> 
> [0] 
> http://repo.or.cz/w/git-stats.git?a=commit;h=ac2a192f1dcec387e96f2a7eb5d66a38401b0722

Can you back that up with examples before I go and spend my time on 
turning that into C, so I can see that it is what I want, first?

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2008-07-13  0:29 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <bd6139dc0807090621n308b0159n92d946c165d3a5dd@mail.gmail.com>
2008-07-11 21:04 ` [GitStats] Bling bling or some statistics on the git.git repository Sverre Rabbelier
2008-07-11 21:22   ` Johannes Schindelin
2008-07-11 21:39     ` Johannes Schindelin
2008-07-11 21:55       ` Johannes Schindelin
2008-07-11 22:05         ` Sverre Rabbelier
2008-07-11 22:10           ` Johannes Schindelin
2008-07-11 21:55       ` Sverre Rabbelier
2008-07-11 22:11         ` Johannes Schindelin
2008-07-11 22:14           ` Sverre Rabbelier
2008-07-11 23:02             ` Johannes Schindelin
2008-07-11 23:28               ` [PATCH] Add pretty format %aN which gives the author name, respecting .mailmap Johannes Schindelin
2008-07-11 23:30                 ` Sverre Rabbelier
2008-07-11 23:42                   ` Johannes Schindelin
2008-07-12  8:44                     ` Sverre Rabbelier
2008-07-11 21:52     ` [GitStats] Bling bling or some statistics on the git.git repository Sverre Rabbelier
2008-07-11 22:07       ` Johannes Schindelin
2008-07-11 22:50         ` Sverre Rabbelier
2008-07-11 23:33           ` Johannes Schindelin
2008-07-12  7:39             ` Sverre Rabbelier
2008-07-12 22:36             ` Sverre Rabbelier
2008-07-13  0:29               ` Johannes Schindelin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).