A generalization of git blame

* A generalization of git blame
@ 2012-09-25 18:14 xmeng
  2012-09-25 22:19 ` Philip Oakley
  0 siblings, 1 reply; 7+ messages in thread
From: xmeng @ 2012-09-25 18:14 UTC
  To: git

Hi,

I have been developing a git tool (based on the git internal API) that
finds all the commits that have changed a line, in order to attribute
authorship more accurately.

The motivation is my research on binary code authorship, where I use
machine learning to classify code authorship. To produce training data, I
start with a source code repository with well-known author labels for each
line and then compile the project into a binary. This lets me know the
authorship of the binary code and then apply some machine learning
techniques.

To get the ground truth of authorship for each line, I started with
git-blame. But I later found that this is not sufficient, because the last
commit may only add comments or change a small part of the line, in which
case I shouldn't attribute the line of code to the last author. Of course,
it is debatable who should be the representative of a line of code. So
what I would like to do is find all the commits that have ever changed a
line; then I can try different approaches to summarize over all these
commits to produce my final authorship label (or even a tuple).
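
As an illustration of what "summarizing over all these commits" could look
like, here is a minimal Python sketch of three possible labelling rules. It
is not taken from the tool described in this message. The history layout
(oldest commit first) and the `changed_fraction` field are assumptions made
only for this example.

```python
from collections import Counter

# Hypothetical per-line history, oldest commit first. Each entry is
# (commit_id, author, changed_fraction), where changed_fraction is an
# assumed estimate of how much of the line that commit rewrote.
history = [
    ("c1", "alice", 1.0),   # wrote the line
    ("c2", "bob", 0.1),     # renamed one identifier
    ("c3", "carol", 0.05),  # fixed whitespace
]

def label_last(history):
    """The author plain git-blame would report: the most recent one."""
    return history[-1][1]

def label_first(history):
    """The author who introduced the line."""
    return history[0][1]

def label_weighted(history):
    """The author with the largest total changed fraction."""
    weights = Counter()
    for _commit, author, fraction in history:
        weights[author] += fraction
    return max(weights, key=weights.get)

print(label_last(history), label_first(history), label_weighted(history))
```

Which rule is appropriate is exactly the open question raised above; the
sketch only shows that the rules can disagree on the same history.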

I was wondering whether there have been similar debates over accurate
authorship in this community before and whether there may be other people
interested in this work.

Thanks

--Xiaozhu


* Re: A generalization of git blame
  2012-09-25 18:14 A generalization of git blame xmeng
@ 2012-09-25 22:19 ` Philip Oakley
  2012-09-25 23:05   ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Philip Oakley @ 2012-09-25 22:19 UTC
  To: xmeng, git

From: <xmeng@cs.wisc.edu>
> I have been developing my git tool (based on the git internal API) that
> can find out all the commits that have changed a line for better
> authorship.
>
> The reason is for my binary code authorship research, I use machine
> learning to classify code authorship. To produce training data, I start
> with a source code repository with well-known author labels for each line
> and then compiling the project into binary. So, I am able to know the
> authorship for binary code and then apply some machine learning
> techniques.
>
> To get ground truth of authorship for each line, I start with git-blame.
> But later I find this is not sufficient because the last commit may only
> add comments or may only change a small part of the line, so that I
> shouldn't attribute the line of code to the last author.

I would suggest there are:
- White space adjustment
- Comment or documentation (assumes you can parse the 'code' to decide
  that it isn't executable code)
- Word changes within expressions
- Complete replacement of line (whole statement?)

Custom & practice is the likely decider.
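
A rough Python sketch of how the four categories above might be told apart
for a single old/new line pair. The comment stripping is deliberately naive
(C-style comments only, string literals ignored), and the 0.5 similarity
threshold is an arbitrary assumption, not an established cut-off.

```python
import re
from difflib import SequenceMatcher

def tokens(line):
    """Split a line into identifiers/numbers and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", line)

def strip_comments(line):
    """Naively drop /* ... */ and // comments (ignores string literals)."""
    line = re.sub(r"/\*.*?\*/", "", line)
    return re.sub(r"//.*", "", line)

def classify_change(old, new, threshold=0.5):
    if old == new:
        return "unchanged"
    if tokens(old) == tokens(new):
        return "whitespace"
    old_code = tokens(strip_comments(old))
    new_code = tokens(strip_comments(new))
    if old_code == new_code:
        return "comment"
    # Token-level similarity decides between an in-line edit and a rewrite.
    similarity = SequenceMatcher(None, old_code, new_code).ratio()
    return "word change" if similarity >= threshold else "replacement"

print(classify_change("x = 1;", "x  =  1;"))
print(classify_change("x = 1;", "x = 1; /* init */"))
print(classify_change("fn(a, b, c);", "fn(a, b, d);"))
print(classify_change("return foo * 2;", "free(ptr);"))
```

Where to draw the line between a "word change" and a "complete
replacement" is a policy decision, which is the point of the remark about
custom and practice.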

> Of course, there must be some debates on who can be the representative
> of a line of code. So what I would like to do is find out all the
> commits that have ever changed a line, then I can try different
> approaches to summarize over all these commits to produce my final
> authorship label (or even tuple).
>
> I was wondering whether there have been similar debates over accurate
> authorship in this community before and whether there may be other
> people interested in this work.

I'd suggest looking at the various 'diff' formats, such as character 
diff, word diff, and line diff for discussions.

>
> Thanks
>
> --Xiaozhu
>
Philip 


* Re: A generalization of git blame
  2012-09-25 22:19 ` Philip Oakley
@ 2012-09-25 23:05   ` Junio C Hamano
  2012-09-26 15:36     ` xmeng
  0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2012-09-25 23:05 UTC
  To: Philip Oakley; +Cc: xmeng, git

"Philip Oakley" <philipoakley@iee.org> writes:

>> To get ground truth of authorship for each line, I start with
>> git-blame.
>> But later I find this is not sufficient because the last commit may
>> only
>> add comments or may only change a small part of the line, so that I
>> shouldn't attribute the line of code to the last author.
>
> I would suggest there is:
> - White space adjustment
> - Comment or documentation (assumes you can parse the 'code' to decide
> that it isn't executable code)
> - word changes within expressions
> - complete replacement of line (whole statement?)

You are being generous by listing easier cases ;-) I'd add a couple
more that are more problematic if your approach does not consider
semantics.

 - A function gained a new parameter, to which pretty much everybody
   passes the same default value.

	-void fn(int a, int b, int c)
	+void fn(int a, int b, int c, int d)
	 {
	+	if (d) {
	+		...
	+		return;
	+	}
	 	...
	 }

	 void frotz(void)
	 {
	 	...
	-	fn(a, b, c);
	+	fn(a, b, c, 0);
	 	...
	-	fn(a, b, d);
	+	fn(a, b, d, 1);
	 	...

   The same commit that changed the above call sites must have
   changed the definition of function "fn" and defined what the new
   fourth parameter means.  It is likely that, when the default
   value most everybody passes (perhaps "0") is given, "fn" does
   what it used to do, and a different value may trigger a new
   behaviour of "fn".  It could be argued that the former call
   site should not be blamed on this commit, while the latter call
   site should.

 - A variable was renamed, and the meaning of a line suddenly
   changed, even though the text of that line did not change at all.

	 static int foo;
	 ...
	-int xyzzy(int foo)
	+int xyzzy(int bar)
	 {
	 	... some complex computation that
	 	... involves foo and bar, resulting in
	 	... updating of foo comes here ...
	 	return foo * 2;
	 }

   Whom to blame for the behaviour of (i.e. the value returned from)
   the function?  The "return foo * 2" never changed with this patch,
   but the patch _is_ responsible for changing the behaviour.

   As the OP is interested in tracking the origin of the _binary_,
   this case is even more interesting, as the generated machine code
   to compute the foo * 2 would likely be very different before
   and after the patch.
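
The rename case can be reproduced with a small Python sketch. The two file
versions below are a reduced stand-in for the example above (the elided
function body is left out), and `changed_lines` is a hypothetical helper
built on the standard difflib module. It shows that a purely line-based
comparison reports only the signature as changed.

```python
import difflib

# Reduced stand-in for the example: only the parameter name differs.
before = [
    "static int foo;",
    "int xyzzy(int foo)",
    "{",
    "\treturn foo * 2;",
    "}",
]
after = [
    "static int foo;",
    "int xyzzy(int bar)",
    "{",
    "\treturn foo * 2;",
    "}",
]

def changed_lines(old, new):
    """Lines of `new` that a line-based diff marks as added."""
    return [line[2:] for line in difflib.ndiff(old, new)
            if line.startswith("+ ")]

print(changed_lines(before, after))
```

The "return foo * 2;" line is not in the result, even though it now reads
the file-scope variable instead of the parameter. Catching that would need
some knowledge of scoping, which a textual diff does not have.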


* Re: A generalization of git blame
  2012-09-25 23:05   ` Junio C Hamano
@ 2012-09-26 15:36     ` xmeng
  2012-09-26 19:11       ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: xmeng @ 2012-09-26 15:36 UTC
  To: Junio C Hamano; +Cc: Philip Oakley, xmeng, git


> "Philip Oakley" <philipoakley@iee.org> writes:
>
>>> To get ground truth of authorship for each line, I start with
>>> git-blame.
>>> But later I find this is not sufficient because the last commit may
>>> only
>>> add comments or may only change a small part of the line, so that I
>>> shouldn't attribute the line of code to the last author.
>>
>> I would suggest there is:
>> - White space adjustment
>> - Comment or documentation (assumes you can parse the 'code' to decide
>> that it isn't executable code)
>> - word changes within expressions
>> - complete replacement of line (whole statement?)
>
> You are being generous by listing easier cases ;-) I'd add a couple
> more that are more problematic if your approach does not consider
> semantics.
>
>  - A function gained a new parameter, to which pretty much everybody
>    passes the same default value.
>
> 	-void fn(int a, int b, int c)
> 	+void fn(int a, int b, int c, int d)
> 	 {
> 	+	if (d) {
> 	+		...
> 	+		return;
> 	+	}
> 		...
> 	 }
>
>          void frotz(void)
> 	 {
> 		...
>         -	fn(a, b, c);
>         +	fn(a, b, c, 0);
>         	...
>         -	fn(a, b, d);
>         +	fn(a, b, d, 1);
>         	...
>
>    The same commit that changed the above call site must have
>    changed the definition of function "fn" and defined what the new
>    fourth parameter means.  It is likely that, when the default
>    value most everybody passes (perhaps "0") is given, "fn" does
>    what it used to do, and a different value may trigger a new
>    behaviour of "fn".  It could be argued that the former call
>    should not be blamed for this commit, while the latter callsite
>    should.
>
>  - A variable was renamed, and the meaning of a line suddenly
>    changed, even though the text of that line did not change at all.
>
> 	 static int foo;
>          ...
>         -int xyzzy(int foo)
> 	+int xyzzy(int bar)
> 	 {
> 		... some complex computation that
>                 ... involves foo and bar, resulting in
>                 ... updating of foo comes here ...
> 		return foo * 2;
>  	 }
>
>    Whom to blame for the behaviour of (i.e. the value returned from)
>    the function?  The "return foo * 2" never changed with this patch,
>    but the patch _is_ responsible for changing the behaviour.
>
>    As the OP is interested in tracking the origin of the _binary_,
>    this case is even more interesting, as the generated machine code
>    to compute the foo * 2 would likely be very different before
>    and after the patch.
>
>

Thanks to both of you for the great suggestions. Currently my approach
doesn't consider semantics, and that would be an interesting thing to do.

Another question: would it be possible to include my tool as a git
built-in tool in the future? I know that my tool is not yet good enough
for any release. But I would like to share my work if other people are
interested. And if inclusion is possible, I think I will have a
stronger motivation to make my tool more robust and useful.

Thanks


* Re: A generalization of git blame
  2012-09-26 15:36     ` xmeng
@ 2012-09-26 19:11       ` Junio C Hamano
  2012-09-27  4:18         ` xmeng
  0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2012-09-26 19:11 UTC
  To: xmeng; +Cc: Philip Oakley, git

xmeng@cs.wisc.edu writes:

> Another question is that is it possible to include my tool as a git
> built-in tool in the future?

It largely depends on how the user would interact with your program,
which is totally unclear as we haven't seen any part of it.  I do
not think we have enough information to answer the question at this
point.

> I know that my tool is still not good for any release. But I would
> like to share my work with other people if other people are
> interested.

If it is a trivial script that largely depends on what we already
ship, I would not mind carrying it in contrib/.  If it is anything
substantial and substantially useful, however, I would suspect that
you are better off not being in my tree, but rather want to be
independent.  Finishing it to be useful for your purpose, publishing
it somewhere people can take a look at and adding a pointer to
https://git.wiki.kernel.org/index.php/InterfacesFrontendsAndTools is
probably where you would want to start.

> And if it is possible, I think I will have a stronger motivation
> to make my tool more robust and useful.

I've seen from time to time people ask "I am thinking of doing this;
will a patch be accepted?  If so, I'll work on it." before showing
any work, and my response always has been:

 (1) We won't know how useful and interesting your contribution will be
     for our audience, until we see it; and

 (2) If you truly believe in your work (find it useful, find writing
     it fun, etc.), that would be incentive enough for you to work
     on it, whether or not the result will land in my tree.  You
     should instead aim for something so brilliant that we would
     come to you begging for your permission to include it in our
     project.

I think it applies to your inquiry as well.


* Re: A generalization of git blame
  2012-09-26 19:11       ` Junio C Hamano
@ 2012-09-27  4:18         ` xmeng
  2012-09-27  6:38           ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: xmeng @ 2012-09-27  4:18 UTC
  To: Junio C Hamano; +Cc: Philip Oakley, git

> It largely depends on how the user would interact with your program,
> which is totally unclear as we haven't seen any part of it.  I do
> not think we have enough information to answer the question at this
> point.

Do you mean it largely depends on the variety of options for input and
output formats? Currently I just take a path and output lists like:

53: (1da177e4c3f41524e886b7f1b8a0c1fc7321cac2,Linus Torvalds)
(201b6264ff3865090747f58f48e087c3a35e0dbc,Christoph Lameter)
(e2bdb933ab8b7db71c318a4ddcf78a9fffd61ecb,Hugh Dickins)
54: (7cf9c2c76c1a17b32f2da85b50cd4fe468ed44b5,Nick Piggin)
(e2bdb933ab8b7db71c318a4ddcf78a9fffd61ecb,Hugh Dickins)
55: (e2bdb933ab8b7db71c318a4ddcf78a9fffd61ecb,Hugh Dickins)
56: (1da177e4c3f41524e886b7f1b8a0c1fc7321cac2,Linus Torvalds)

For every line in the specified file, this shows all the commits that
have changed that line.
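
For anyone who wants to post-process listings like the one above, a small
Python sketch of a parser follows. It assumes the format is exactly what
the sample shows: a line number and colon, followed by one or more
`(sha,author)` pairs that may continue on following lines. `parse_history`
is a hypothetical name, and author names containing ")" are not handled.

```python
import re

LINE_NO = re.compile(r"^(\d+):")
ENTRY = re.compile(r"\(([0-9a-f]{40}),([^)]*)\)")

def parse_history(text):
    """Map each line number to its list of (commit_id, author) pairs."""
    result = {}
    current = None
    for raw in text.splitlines():
        match = LINE_NO.match(raw)
        if match:
            current = int(match.group(1))
            result[current] = []
        if current is not None:
            # Continuation lines carry more pairs for the same line number.
            result[current].extend(ENTRY.findall(raw))
    return result

sample = """\
53: (1da177e4c3f41524e886b7f1b8a0c1fc7321cac2,Linus Torvalds)
(201b6264ff3865090747f58f48e087c3a35e0dbc,Christoph Lameter)
55: (e2bdb933ab8b7db71c318a4ddcf78a9fffd61ecb,Hugh Dickins)
"""

for line_no, commits in parse_history(sample).items():
    print(line_no, [author for _sha, author in commits])
```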

I know this is not enough for a tool. So in this case, does "how the user
would interact with your program" mean that I should add more options, such
as which revision to start from, parameters for the algorithm, a choice of
diff algorithms, and options for customizing the contents of the
output?

> If it is a trivial script that largely depends on what we already
> ship, I would not mind carrying it in contrib/.  If it is anything
> substantial and substantially useful, however, I would suspect that
> you are better off not being in my tree, but rather want to be
> independent.  Finishing it to be useful for your purpose, publishing
> it somewhere people can take a look at and adding a pointer to
> https://git.wiki.kernel.org/index.php/InterfacesFrontendsAndTools is
> probably where you would want to start.

I think it would be helpful for our discussion if I first briefly
introduce my approach. It starts at the current head and tracks all the
lines in the specified file. It then walks the commit graph in
topological order. For any two commits A and B connected by an edge from A
to B (commit B is the parent of commit A), if A is not a merge commit, it
first calls the tree-diff interface to get the added lines in A and the
deleted lines in B, and to detect renaming. Then it applies the ldiff
algorithm (http://sourceforge.net/projects/ldiff/ . It is written in Perl;
I adapted it to C in my program) to match the added lines against the
deleted lines. The result is that we now have some added lines, some
deleted lines and some changed lines. For added lines, I know that this
commit is the last one to change them. For changed lines, I add this
commit to the final result and keep tracking them. Currently I ignore the
deleted lines. If A is a merge commit, it doesn't change authorship except
for merge conflicts.
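
The matching step can be sketched in Python as follows. This is not ldiff
and not the C code of the tool: `difflib.SequenceMatcher` stands in as the
similarity measure, the pairing is a simple greedy one, and the 0.6
threshold is an assumption chosen for the example.

```python
from difflib import SequenceMatcher

def match_lines(deleted, added, threshold=0.6):
    """Pair deleted lines with similar added lines.

    Returns (changed, added_only, deleted_only), where `changed` is a
    list of (old_line, new_line) pairs.
    """
    candidates = []
    for i, old in enumerate(deleted):
        for j, new in enumerate(added):
            score = SequenceMatcher(None, old, new).ratio()
            if score >= threshold:
                candidates.append((score, i, j))
    # Best-scoring pairs first; each line takes part in at most one pair.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    used_old, used_new, changed = set(), set(), []
    for _score, i, j in candidates:
        if i not in used_old and j not in used_new:
            used_old.add(i)
            used_new.add(j)
            changed.append((deleted[i], added[j]))
    added_only = [l for j, l in enumerate(added) if j not in used_new]
    deleted_only = [l for i, l in enumerate(deleted) if i not in used_old]
    return changed, added_only, deleted_only

deleted = ["fn(a, b, c);", "old_helper();"]
added = ["fn(a, b, c, 0);", "int brand_new = 42;"]
changed, added_only, deleted_only = match_lines(deleted, added)
print(changed, added_only, deleted_only)
```

In terms of the description above, a line in `changed` gets this commit
recorded and stays tracked into the parent, a line in `added_only` gets
this commit recorded and tracking stops, and `deleted_only` is ignored.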

I started by implementing my approach in Python. The script repeatedly
called git-log to get the diffs and then applied ldiff. The problem was
that I could not hold the whole git-log result in memory; even the diff
contents for a single file could exhaust it. So I queried only a small
number of commits at a time, which made my program extremely slow. It took
about half an hour to analyze a file in a project with 15000 commits.
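
For comparison, one way a script can avoid holding the whole log in memory
is to read the output of a single `git log -p` process lazily, one commit
at a time. The Python sketch below is not the script described above, and
whether it would have been fast enough for this use case is untested.
`split_commits` and `stream_log` are hypothetical names.

```python
import subprocess

def split_commits(lines):
    """Lazily group `git log -p` output into (commit_id, body_lines).

    Relies on header lines starting with "commit " at column 0; commit
    message text is indented and diff lines start with ' ', '+' or '-'.
    """
    commit_id, body = None, []
    for line in lines:
        if line.startswith("commit "):
            if commit_id is not None:
                yield commit_id, body
            commit_id, body = line.split()[1], []
        elif commit_id is not None:
            body.append(line.rstrip("\n"))
    if commit_id is not None:
        yield commit_id, body

def stream_log(path, repo="."):
    """Yield one commit at a time for `path` without buffering the log."""
    proc = subprocess.Popen(
        ["git", "log", "-p", "--follow", "--", path],
        cwd=repo, stdout=subprocess.PIPE, text=True, errors="replace",
    )
    try:
        yield from split_commits(proc.stdout)
    finally:
        proc.stdout.close()
        proc.wait()

# Demonstration on canned input, so no repository is needed.
fake = ["commit aaa\n", "Author: A\n", "+x\n", "commit bbb\n", "-y\n"]
print(list(split_commits(fake)))
```

Only one commit's body is held at a time, so memory use is bounded by the
largest single diff rather than by the length of the history.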

Then I rewrote my code on top of the git internal API. This turned out to
work very well. Now the program finishes in 10 seconds instead of half an
hour, and I can also apply my tool to larger projects like the Linux kernel.

So if I want other people to use my tool, I should first add an entry on
the webpage, and then interested people can get my tool by cloning from
my git branch?

>  (1) We won't know how useful and interesting your contribution will be
>      for our audience, until we see it; and
>
>  (2) If you truly believe in your work (find it useful, find writing
>      it fun, etc.), that would be incentive enough for you to work
>      on it, whether or not the result will land in my tree.  You
>      should instead aim for something so brilliant that we would
>      come to you begging for your permission to include it in our
>      project.
>
> I think it applies to your inquiry as well.

You are definitely right. Actually I have spent a lot of time on my tool.
I just want to know whether there are any issues I should be aware of that
would prevent me from publishing it or prevent it from becoming a
standard git built-in tool.

Thanks


* Re: A generalization of git blame
  2012-09-27  4:18         ` xmeng
@ 2012-09-27  6:38           ` Junio C Hamano
  0 siblings, 0 replies; 7+ messages in thread
From: Junio C Hamano @ 2012-09-27  6:38 UTC
  To: xmeng; +Cc: Philip Oakley, git

xmeng@cs.wisc.edu writes:

>> It largely depends on how the user would interact with your program,
>> which is totally unclear as we haven't seen any part of it.  I do
>> not think we have enough information to answer the question at this
>> point.
>
> Do you mean it largely depends on the diversity of options on input and
> output formats? ...
> ... I know this is not enough for a tool. So in this case, does "how the user
> would interact with your program" mean that I should add ...

I am not saying anything about what you should or should not do. It
is your program, and we haven't seen anything about it, other than
handwaving, what good it will do to its users, so I am not qualified
to make such a comment on it yet.  What I meant by "how the users
would interact with..." are things like this:

 - Why would users want to use it in the first place?  What is
   missing from the existing tool set?

 - What kind of questions do users ask the program and how do
   they ask them?

 - How are the answers to these questions presented by the program
   to the users?

 - How do users interpret and use these answers?

Notice that I didn't include "How do you compute the answers?"  When
we are initially evaluating a feature at "how do they interact" level,
we are not interested in the implementation at all.
