* On blame/pickaxe
@ 2006-10-13 1:43 Junio C Hamano
2006-10-13 7:54 ` Junio C Hamano
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Junio C Hamano @ 2006-10-13 1:43 UTC (permalink / raw)
To: Luben Tuikov; +Cc: git
I have something that is tentatively called pickaxe, but that is
different from "diff -S<pickaxe>". The name will need to
change, perhaps to git-blame.
It's quite a while since the original algorithm sketch that led
to the current git-blame was posted here; I suspect that most of
the people recently joined the list haven't seen it. The idea
in there still applies, though...
Here is a re-drawn sketch of how the 'pickaxe' edition of the
algorithm works.
0. Outline.
We track range of lines from a path in one revision. That's
what "-L n,m", "path", and "commit" parameters to the command
are about.
When the program starts, the commit given on the command line is
suspected to be responsible for all lines in the given range.
But being a suspect and determined to be the guilty party are
different things.
The name of the game is "not taking responsibility but passing
the blame to your parents". In other words, "that's not a new
problem I introduced -- it was there before my patch was
applied". You are exonerated for those lines that you can
successfully pass blame to your parent, making that parent a new
suspect for the lines. There will remain lines you cannot pass
blame to anybody else -- you will be determined to be the guilty
party for them. When nobody can pass any more blame to anybody
else, the dance stops and we will know who is guilty for every
line of the input.
1. Data structure.
We keep track of who the current suspect for each line is in the
structure called 'scoreboard'. It is a collection of ranges,
called 'blame_entry'.
Each blame_entry describes which lines (in the final image) it
is about, who the current suspect for the range is, if the
suspect is already known to be guilty, where in the suspected
commit the lines came from (path and line number).
In the 'blame' implementation, a suspect is represented by a
commit object with one path hanging to its util field.
'pickaxe' edition introduces the concept of 'origin', which is a
pair of <commit,path>. We currently do not use more than one
path per commit, but hopefully we will soon.
2. Operation.
First the scoreboard is initialized with a single blame_entry
that makes the path in the commit on the command line the
suspect for all lines.
Then, we repeatedly pick one entry from the scoreboard, and give
a chance to the suspect to exonerate himself, by calling
pass_blame(). The suspect passes the blame to its parents by
updating the scoreboard, removing his name from ranges of lines
he can prove that came from his parent and instead writing that
suspected parent's name. Inside pass_blame(), just before it
returns, if the suspect cannot pass blame for some lines to
anybody else, it becomes guilty for them.
When there is no blame_entry in the scoreboard whose truly
guilty party is still undetermined, we have fully blamed the
input, and the loop finishes.
3. Passing the blame.
You (a <commit,path> tuple) are suspected for introducing
certain lines, and you would want to pass blame to your parent.
How would you do that?
First, you find if your parent has the same path; if not, you
find if between your parent and you there was a rename and find
the original path in the parent. If you are a merge, you do so
for all your parents. The path in the parent and your path may
have many common lines, and if the lines you are the suspect are
the same as the ones in the parent, you can pass the blame,
because these lines were there before you touched them.
How would you find what's common? We run "diff" (diff -u0).
Unified context-0 diff would give something like:
@@ -n,m +l,k @@
-removed
+added
...
In this application, we are not interested in the diff text
itself at all. In fact, we are interested in what's outside the
diff. If the above is output from parent to you, what it tells
us is that lines before your line "l" are the same as lines
before parent's "n", so if the above is the first hunk, you can
say your lines 1..(l-1) came from your parent and you are not
responsible for them. The above hunk also tells us that in your
path, lines after (l+k) are the same as the lines in parent's
lines (n+m); we will know how many such same lines there are
(could be zero) by inspecting the next hunk.
For efficiency reasons, inside pass_blame(), we run just one
diff between the parent and you, and scan the blame_entry for
the same <commit,path> in the scoreboard, and check overlap with
each of them. Suppose we are suspected for these lines:
<---- e ----->
and by looking the diff, we determined the range from one parent
is like this:
<--p--->
Then we can split the blame_entry into three parts:
<-><--p---><->
You are still suspected for the first and the last part of the
original entry, but for the mid-part you successfully passed the
blame to that parent. Depending on how the ranges overlap, this
may split into two, or parent may take blame for the whole range
(i.e. no split). This is computed in blame_overlap().
When done with one parent, if you are a merge, you will then try
to pass the blame on the remaining part that you are still
suspected for to other parents.
The classic 'blame' algorithm stops here, and the current
pickaxe does the same; you take responsibility for the
remainder.
4. Passing more blame.
Instead of taking responsibility for the remainder, there are
other ways to find other people to pass blame on. That's what
the "NEEDSWORK" comment in pass_blame() is about.
A typical example is a change that moves one static C function
from lower part of the file to upper part of the same file,
because you added a new caller in the middle. The path in your
parent and the path in you would look like this:
parent you
A static foo() {
B }
C A
D B
E C
F D
G ... call foo();
static foo() { E
} F
I G
J I
With the common part finding code with diff, you will be able to
pass blames for lines A B C D E F G I J to your parent. You are
truly guilty for introducing "... call foo();". The problem
here is that in addition, you will be blamed for the lines that
implement "static foo() { ... }" at the beginning of your file.
To blame the implementation of foo() to the parent, we could do
something like this:
for each blame_entry that you are still suspected for,
diff those lines (and only those lines) with the parent,
to see if you find a copy. If there is a copy, you can
pass the blame to the parent.
This needs to be limited for non-trivial changes only, though.
For example, one of the typical things you would do is to add one
"else if" clause to a sequence of already existing "if .. else if
.." chain, like this:
if (foo) {
...
}
else if (bar) {
...
+ }
+ else if (baz) {
+ ...
}
You are suspected for these three lines; if the "find copies
again" is allowed to find any insignificant copy, the first line
(closing brace) you introduced could easily be blamed to an
unrelated line anywhere in the parent that match /^\t\}$/. To
prevent such nonsense from happening, we need to limit such
"find copies" attempt to say minimum of N lines (and minimum of
M tokens that consist of non-whitespace, non-punctuation
letters). Also we probably would not want to find a match in
the whole parent but only in the part of the parent that is
removed in the parent-to-you diff.
Another typical example is a code restructuring to move one
function from a file to another file. The same "find copies"
principle, and caveat for small and insignificant copies, apply
to this situation as well. If it is a code movement, the file
that the function moved from needs to be identified, which can
be done by running diff-tree between parent and you -- anything
that was modified or removed is a good candidate to look for
such copies. Again, we probably need to limit the copy-finding
source to the part that was removed from the parent in the diff
to you.
After the operation 3 (Passing the blame) runs, if movement of
lines from another file is detected as in the above sketch, we
will pass blame to an unrelated path in the parent. Some lines
are blamed on the file we started tracking in the parent, and
some other lines are blamed on a different file in the parent.
This is why 'pickaxe' uses <commit,path> tuple (aka 'origin') to
represent a suspect.
All of this sounds quite simple and not so difficult to code,
although rejecting insignificant copies would involve fair
amount of heuristic tweaking like I needed to do while working
on the rename detection.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: On blame/pickaxe
2006-10-13 1:43 On blame/pickaxe Junio C Hamano
@ 2006-10-13 7:54 ` Junio C Hamano
2006-10-14 10:26 ` Junio C Hamano
2006-10-16 23:45 ` Luben Tuikov
2 siblings, 0 replies; 12+ messages in thread
From: Junio C Hamano @ 2006-10-13 7:54 UTC (permalink / raw)
To: git
In an earlier message, I talked about pickaxe data structure
that has potential for blaming more than one paths in a single
revision.
I was talking about the future enhancements, but it turns out
that there is a corner case that the feature is already needed
using the classic blame algorithm, and running blame shows its
limitation through.
The scenario is like this:
- initial has two files A and B
- branch left renames A to C and adds stuff at the top.
- branch right renames B to C and adds stuff at the bottom.
- left and right is merged and leaves one file, C, by taking
what both branches did.
- annotate C from the top.
The parts that came from A and B in the initial revision should
be assigned to these files in the initial commit. If you look
at the blame output, you will see it fails to do so. pickaxe
gives what is expected.
-- >8 --
#!/bin/sh
if test -d .git
then
echo 'please run this in a fresh empty directory'
exit
fi
git init-db
for i in 1 2 3 4 5 6 7 8 9 ; do echo line line line $i ; done >A
for i in A B C D E F G H I ; do echo line line line $i ; done >B
git add A B
git commit --author='Initial <initial@author>' -m initial
git branch right
git branch left
# Left
git checkout left
for i in 1 2 3; do echo added by left; done >C
cat A >>C
rm -f A B
git update-index --add --remove A B C
git commit --author='Left <left@branch>' -m Left
# Right
git checkout right
cat B >C
for i in 1 2 3; do echo added by right; done >>C
rm -f A B
git update-index --add --remove A B C
git commit --author='Right <right@branch>' -m Right
# Merge them; this should fail which is expected
git pull . left
{
git cat-file blob :3:C
git cat-file blob :2:C
} >C
git update-index C
git commit --author='Merge <merge@branch>' -m 'Changes are merged.'
rm -f C~*
echo blame
git blame -f -n -t C
echo pickaxe
git pickaxe -f -n -t C
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: On blame/pickaxe
2006-10-13 1:43 On blame/pickaxe Junio C Hamano
2006-10-13 7:54 ` Junio C Hamano
@ 2006-10-14 10:26 ` Junio C Hamano
2006-10-14 23:43 ` Junio C Hamano
2006-10-16 23:45 ` Luben Tuikov
2 siblings, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2006-10-14 10:26 UTC (permalink / raw)
To: git
Junio C Hamano <junkio@cox.net> writes:
> 4. Passing more blame.
>
> Instead of taking responsibility for the remainder, there are
> other ways to find other people to pass blame on. That's what
> the "NEEDSWORK" comment in pass_blame() is about.
I've spent a few hours tonight to further work (eh, "have fun")
on this. The version at the tip of "pu" implements detection of
a case like this:
> A typical example is a change that moves one static C function
> from lower part of the file to upper part of the same file,
> because you added a new caller in the middle. The path in your
> parent and the path in you would look like this:
>
> parent you
>
> A static foo() {
> B }
> C A
> D B
> E C
> F D
> G ... call foo();
> static foo() { E
> } F
> I G
> J I
>
> With the common part finding code with diff, you will be able to
> pass blames for lines A B C D E F G I J to your parent. You are
> truly guilty for introducing "... call foo();". The problem
> here is that in addition, you will be blamed for the lines that
> implement "static foo() { ... }" at the beginning of your file.
You can use the pickaxe on its source itself, like this:
git pickaxe -n master..pu builtin-pickaxe.c
If you compare this with output from:
git log --pretty=short -p master..pu builtin-pickaxe.c
you will notice the line-movement detection in action.
During the course of development, I had to move quite a few
static functions around so that they are defined before their
first call site. This is partly because I am very bad at
initial planning (who is?) and this still being in experimental
stage I did not bother declaring static functions upfront as
forward declarations.
For example, commit db3f0f2 introduces find_last_in_target()
function, but it was moved up by commit b5c0e4f (near the tip of
"pu"). pickaxe blames the implementation of it to db3f0f20, and
also notices that the bulk of its logic was actually copied from
the implementation of pass_blame_to_parent() function in commit
b14dc9ef (the initial commit that introduced builtin-pickaxe.c).
What _ought_ to come next is to detect line movement across
files, but I'll go to bed for now.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: On blame/pickaxe
2006-10-14 10:26 ` Junio C Hamano
@ 2006-10-14 23:43 ` Junio C Hamano
2006-10-16 2:21 ` Petr Baudis
0 siblings, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2006-10-14 23:43 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
Junio C Hamano <junkio@cox.net> writes:
> Junio C Hamano <junkio@cox.net> writes:
>
>> 4. Passing more blame.
>>
>> Instead of taking responsibility for the remainder, there are
>> other ways to find other people to pass blame on. That's what
>> the "NEEDSWORK" comment in pass_blame() is about.
>
> I've spent a few hours tonight to further work (eh, "have fun")
> on this. The version at the tip of "pu" implements detection of
> a case like this:
> ...
> You can use the pickaxe on its source itself, like this:
>
> git pickaxe -n master..pu builtin-pickaxe.c
>
> If you compare this with output from:
>
> git log --pretty=short -p master..pu builtin-pickaxe.c
>
> you will notice the line-movement detection in action.
>
> During the course of development, I had to move quite a few
> static functions around so that they are defined before their
> first call site. This is partly because I am very bad at
> initial planning (who is?) and this still being in experimental
> stage I did not bother declaring static functions upfront as
> forward declarations.
>
> For example, commit db3f0f2 introduces find_last_in_target()
> function, but it was moved up by commit b5c0e4f (near the tip of
> "pu"). pickaxe blames the implementation of it to db3f0f20, and
> also notices that the bulk of its logic was actually copied from
> the implementation of pass_blame_to_parent() function in commit
> b14dc9ef (the initial commit that introduced builtin-pickaxe.c).
>
> What _ought_ to come next is to detect line movement across
> files, but I'll go to bed for now.
And this implements it. After testing it some more, I'll have
it near the tip of "pu" sometime tonight.
I have to admit that I am becoming quite fond of the name
"pickaxe", by the way.
I consider that the "following the changed lines and see where
it came from" you described in the ancient message:
http://thread.gmane.org/gmane.comp.version-control.git/27/focus=217
the holy grail of software archaeology. Once we have it, I
would feel I am pretty much _done_ with git. It will have
everything I want it to do.
My "diff -S" (where the "pickaxe" name started) was a crude
first step to achieve that goal. The way I envisioned it was
that you would feed the lines 50-89 as a single -S parameter to
"git whatchanged" and the tool finds the commit that touched
that area, and the Porcelain would either figure out what the
corresponding lines in the previous commit look like or ask the
user to highlight what to dig deeper interactively, and restart
another "whatchanged" with adjusted value for -S. Unfortunately
nobody wrote such a Porcelain command.
I think the command with this patch gets us closer to that goal;
although it still does not have a way to detect "oh these five
copies were consolidated into one" (I'd imagine we _could_ do
that in find_copy_in_parent() function if we really wanted to --
but that would be a tad expensive). In any case, I think this
is going in the right direction.
Before anybody asks, the example Daniel Barkalow gave on date.c
is far trickier and even this version would not catch it. It
requires digging random, distant, ancestors in order to look for
lines that were lost by an earlier removal.
-- >8 --
git-pickaxe: blame cut-and-paste lines.
This completes the initial round of git-pickaxe. In addition to
the line movement the previous round implements, this finds new
lines that were created by moving or cutting-and-pasting lines
from different path in the parent.
With this,
git pickaxe -f -n v1.4.0 -- revision.c
finds that major part of that file actually came from rev-list.c
when Linus split the latter at commit ae563642 and blames them
to earlier commits that touches rev-list.c.
Signed-off-by: Junio C Hamano <junkio@cox.net>
---
builtin-pickaxe.c | 71 +++++++++++++++++++++++++++++++++++------------------
1 files changed, 47 insertions(+), 24 deletions(-)
diff --git a/builtin-pickaxe.c b/builtin-pickaxe.c
index 3f9f679..1124919 100644
--- a/builtin-pickaxe.c
+++ b/builtin-pickaxe.c
@@ -638,39 +638,16 @@ static int find_copy_in_blob(struct scor
return 0;
}
-static int find_copy_in_parent(struct scoreboard *sb,
- struct origin *target,
- struct commit *parent,
- struct origin *porigin)
-{
- int last_in_target;
-
- last_in_target = find_last_in_target(sb, target);
- if (last_in_target < 0)
- return 1; /* nothing remains for this target */
-
- /* NEEDSWORK: run diff-tree between target->commit with parent
- * to find the paths that were modified from parent, and find
- * copies in the parent; if we find any we can pass blame to it.
- * We have checked path in porigin so do not bother checking it
- * again (but porigin might be NULL).
- */
-
- return 0;
-}
-
static int find_move_in_parent(struct scoreboard *sb,
struct origin *target,
struct origin *parent)
{
- int last_in_target;
struct blame_entry *ent;
mmfile_t file_p;
char type[10];
char *blob_p;
- last_in_target = find_last_in_target(sb, target);
- if (last_in_target < 0)
+ if (find_last_in_target(sb, target) < 0)
return 1; /* nothing remains for this target */
blob_p = read_sha1_file(parent->blob_sha1, type,
@@ -690,6 +667,52 @@ static int find_move_in_parent(struct sc
return 0;
}
+static int find_copy_in_parent(struct scoreboard *sb,
+ struct origin *target,
+ struct commit *parent,
+ struct origin *porigin)
+{
+ struct diff_options diff_opts;
+ const char *paths[1];
+ int no_more, i;
+
+ if (find_last_in_target(sb, target) < 0)
+ return 1; /* nothing remains for this target */
+
+ diff_setup(&diff_opts);
+ diff_opts.recursive = 1;
+ diff_opts.output_format = DIFF_FORMAT_NO_OUTPUT;
+
+ /* Try "find copies harder" on new path */
+ if (!porigin || strcmp(target->path, porigin->path)) {
+ diff_opts.detect_rename = DIFF_DETECT_COPY;
+ diff_opts.find_copies_harder = 1;
+ }
+ paths[0] = NULL;
+ diff_tree_setup_paths(paths, &diff_opts);
+ if (diff_setup_done(&diff_opts) < 0)
+ die("diff-setup");
+ diff_tree_sha1(parent->tree->object.sha1,
+ target->commit->tree->object.sha1,
+ "", &diff_opts);
+ diffcore_std(&diff_opts);
+
+ for (i = no_more = 0; !no_more && i < diff_queued_diff.nr; i++) {
+ struct diff_filepair *p = diff_queued_diff.queue[i];
+ struct origin *new_porigin;
+ if (!DIFF_FILE_VALID(p->one))
+ continue; /* does not exist in parent */
+ if (porigin && !strcmp(p->one->path, porigin->path))
+ continue; /* find_move already checked this path */
+
+ new_porigin = find_origin(sb, parent, p->one->path);
+ no_more = find_move_in_parent(sb, target, new_porigin);
+ }
+ diff_flush(&diff_opts);
+
+ return no_more;
+}
+
#define MAXPARENT 16
static void pass_blame(struct scoreboard *sb, struct origin *origin)
{
--
1.4.3.rc2.g195ab
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: On blame/pickaxe
2006-10-14 23:43 ` Junio C Hamano
@ 2006-10-16 2:21 ` Petr Baudis
2006-10-16 6:43 ` Junio C Hamano
0 siblings, 1 reply; 12+ messages in thread
From: Petr Baudis @ 2006-10-16 2:21 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Luben Tuikov, git, Linus Torvalds
Thanks for the nice writeup!
Dear diary, on Fri, Oct 13, 2006 at 03:43:46AM CEST, I got a letter
where Junio C Hamano <junkio@cox.net> said that...
> When done with one parent, if you are a merge, you will then try
> to pass the blame on the remaining part that you are still
> suspected for to other parents.
(This got me nervous but I guess it actually makes sense - if only one
parent modified a line, it's right to pass the blame up; else if you
took one parent's version verbatim, it's right to pass the blame up as
well (I think); else you've already got the blame assigned to the merge
commit and everything is all right.)
> To blame the implementation of foo() to the parent, we could do
> something like this:
>
> for each blame_entry that you are still suspected for,
> diff those lines (and only those lines) with the parent,
> to see if you find a copy. If there is a copy, you can
> pass the blame to the parent.
Now, this is very nifty and so for moving functions around, but I
think it is very dangerous for anything where ordering matters - large
arrays definitions, patch series files, etc. In that case, you've
completely ommitted the fact that the movement occured, which can be
crucial and based on the current behaviour of the tools, I think people
expect this now. To put it shortly, "who wrote this line" vs "who put
this line here".
I would prefer to stay more conservative here by default. Graphical
tools could solve this by some nice presentation like
3fea1 int moved() -.
3fea1 { |
28abc watch(more + anime); | 17d50
17d50 return !hypno; |
3fea1 } -'
Dear diary, on Sun, Oct 15, 2006 at 01:43:53AM CEST, I got a letter
where Junio C Hamano <junkio@cox.net> said that...
> I have to admit that I am becoming quite fond of the name
> "pickaxe", by the way.
Gaah, please don't. ;-)
--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
#!/bin/perl -sp0777i<X+d*lMLa^*lN%0]dsXx++lMlN/dsM0<j]dsj
$/=unpack('H*',$_);$_=`echo 16dio\U$k"SK$/SM$n\EsN0p[lN*1
lK[d2%Sa2/d0$^Ixp"|dc`;s/\W//g;$_=pack('H*',/((..)*)$/)
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: On blame/pickaxe
2006-10-16 2:21 ` Petr Baudis
@ 2006-10-16 6:43 ` Junio C Hamano
2006-10-16 14:02 ` Josef Weidendorfer
0 siblings, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2006-10-16 6:43 UTC (permalink / raw)
To: Petr Baudis; +Cc: Luben Tuikov, git, Linus Torvalds
Petr Baudis <pasky@suse.cz> writes:
> Thanks for the nice writeup!
>
> Dear diary, on Fri, Oct 13, 2006 at 03:43:46AM CEST, I got a letter
> where Junio C Hamano <junkio@cox.net> said that...
>> When done with one parent, if you are a merge, you will then try
>> to pass the blame on the remaining part that you are still
>> suspected for to other parents.
>
> (This got me nervous but I guess it actually makes sense - if only one
> parent modified a line, it's right to pass the blame up; else if you
> took one parent's version verbatim, it's right to pass the blame up as
> well (I think); else you've already got the blame assigned to the merge
> commit and everything is all right.)
Well, this part is the classic blame algorithm. The beauty of
it is that a merge case falls out as a natural consequence of
the one-parent case and you do not have to do anything special.
You either inherited a line from your parent (or one or more of
your parents) or you created the line yourself. If you subtract
what you can blame your parents for, the remainder is what you
introduced. The number of parents you have does not have any
effect on that logic.
> Now, this is very nifty and so for moving functions around, but I
> think it is very dangerous for anything where ordering matters - large
> arrays definitions, patch series files, etc. In that case, you've
> completely ommitted the fact that the movement occured, which can be
> crucial and based on the current behaviour of the tools, I think people
> expect this now. To put it shortly, "who wrote this line" vs "who put
> this line here".
That's exactly why we have -f and -n options, so that the
program that reads from blame output can tell where things came
from. It is not about "who wrote it" vs "who put it here";
pickaxe gives a lot more than that: "where did this originally
come from", i.e. "by whom in which file at what line number was
the line created".
If the user is not prepared to see code movement, pickaxe can be
run without -M nor -C to get the classic blame output.
By the way, We would not want to do the code movement (or
copying from unrelated file) for very trivial lines. E.g. you
would not want to blame the following three lines:
+ }
+}
+#endif
to a random file that happens to have the exact copy of the
above that is not related at all. Something like the above can
happen almost anywhere. The current implementation of -M/-C
does not do this very well. find_copy_in_blob() currently
passes blame to the parent when it finds nontrivial copies, but
instead it should inspect the patch and return a score, and the
caller should take the parent with the best match and assign
blame to it.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: On blame/pickaxe
2006-10-16 6:43 ` Junio C Hamano
@ 2006-10-16 14:02 ` Josef Weidendorfer
2006-10-16 14:15 ` Andy Whitcroft
0 siblings, 1 reply; 12+ messages in thread
From: Josef Weidendorfer @ 2006-10-16 14:02 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Petr Baudis, Luben Tuikov, git, Linus Torvalds
Hi,
this blame-passing thing really looks very promising and powerful.
On Monday 16 October 2006 08:43, you wrote:
> If the user is not prepared to see code movement, pickaxe can be
> run without -M nor -C to get the classic blame output.
Another blame-passing heuristic would be very interesting for code:
"Ignore white-space changes".
This way, commits which only do some reindentations simply are skipped.
It looks like such a thing would just be a matter of passing "-b" to
executions of "diff" in the blame-passing algorithm.
Josef
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: On blame/pickaxe
2006-10-16 14:02 ` Josef Weidendorfer
@ 2006-10-16 14:15 ` Andy Whitcroft
2006-10-17 0:44 ` Petr Baudis
0 siblings, 1 reply; 12+ messages in thread
From: Andy Whitcroft @ 2006-10-16 14:15 UTC (permalink / raw)
To: Josef Weidendorfer
Cc: Junio C Hamano, Petr Baudis, Luben Tuikov, git, Linus Torvalds
Josef Weidendorfer wrote:
> Hi,
>
> this blame-passing thing really looks very promising and powerful.
>
> On Monday 16 October 2006 08:43, you wrote:
>> If the user is not prepared to see code movement, pickaxe can be
>> run without -M nor -C to get the classic blame output.
>
> Another blame-passing heuristic would be very interesting for code:
> "Ignore white-space changes".
> This way, commits which only do some reindentations simply are skipped.
>
> It looks like such a thing would just be a matter of passing "-b" to
> executions of "diff" in the blame-passing algorithm.
I am thinking that that is probabally going to need to be optional, for
example python the indentation is everything to the meaning of the code.
-apw
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: On blame/pickaxe
2006-10-13 1:43 On blame/pickaxe Junio C Hamano
2006-10-13 7:54 ` Junio C Hamano
2006-10-14 10:26 ` Junio C Hamano
@ 2006-10-16 23:45 ` Luben Tuikov
2006-10-17 9:03 ` Junio C Hamano
2 siblings, 1 reply; 12+ messages in thread
From: Luben Tuikov @ 2006-10-16 23:45 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
Junio,
Excellent write up. Comments inline:
--- Junio C Hamano <junkio@cox.net> wrote:
> I have something that is tentatively called pickaxe, but that is
> different from "diff -S<pickaxe>". The name will need to
> change, perhaps to git-blame.
>
> It's quite a while since the original algorithm sketch that led
> to the current git-blame was posted here; I suspect that most of
> the people recently joined the list haven't seen it. The idea
> in there still applies, though...
>
> Here is a re-drawn sketch of how the 'pickaxe' edition of the
> algorithm works.
>
> 0. Outline.
>
> We track range of lines from a path in one revision. That's
> what "-L n,m", "path", and "commit" parameters to the command
> are about.
>
> When the program starts, the commit given on the command line is
> suspected to be responsible for all lines in the given range.
> But being a suspect and determined to be the guilty party are
> different things.
>
> The name of the game is "not taking responsibility but passing
> the blame to your parents". In other words, "that's not a new
> problem I introduced -- it was there before my patch was
> applied". You are exonerated for those lines that you can
> successfully pass blame to your parent, making that parent a new
> suspect for the lines. There will remain lines you cannot pass
> blame to anybody else -- you will be determined to be the guilty
> party for them. When nobody can pass any more blame to anybody
> else, the dance stops and we will know who is guilty for every
> line of the input.
Stopping early -- that's good.
> 1. Data structure.
>
> We keep track of who the current suspect for each line is in the
> structure called 'scoreboard'. It is a collection of ranges,
> called 'blame_entry'.
>
> Each blame_entry describes which lines (in the final image) it
> is about, who the current suspect for the range is, if the
> suspect is already known to be guilty, where in the suspected
> commit the lines came from (path and line number).
>
> In the 'blame' implementation, a suspect is represented by a
> commit object with one path hanging to its util field.
> 'pickaxe' edition introduces the concept of 'origin', which is a
> pair of <commit,path>. We currently do not use more than one
> path per commit, but hopefully we will soon.
>
>
> 2. Operation.
>
> First the scoreboard is initialized with a single blame_entry
> that makes the path in the commit on the command line the
> suspect for all lines.
>
> Then, we repeatedly pick one entry from the scoreboard, and give
> a chance to the suspect to exonerate himself, by calling
> pass_blame(). The suspect passes the blame to its parents by
> updating the scoreboard, removing his name from ranges of lines
> he can prove that came from his parent and instead writing that
> suspected parent's name. Inside pass_blame(), just before it
> returns, if the suspect cannot pass blame for some lines to
> anybody else, it becomes guilty for them.
>
> When there is no blame_entry in the scoreboard whose truly
> guilty party is still undetermined, we have fully blamed the
> input, and the loop finishes.
>
>
> 3. Passing the blame.
>
> You (a <commit,path> tuple) are suspected for introducing
> certain lines, and you would want to pass blame to your parent.
> How would you do that?
>
> First, you find if your parent has the same path; if not, you
> find if between your parent and you there was a rename and find
> the original path in the parent. If you are a merge, you do so
> for all your parents. The path in the parent and your path may
> have many common lines, and if the lines you are the suspect are
> the same as the ones in the parent, you can pass the blame,
> because these lines were there before you touched them.
Do you handle the case where a merge had a conflict and
the user changed the code (resolved) and then committed?
In this case some lines will have to be blamed on the
merge commit itself.
> How would you find what's common? We run "diff" (diff -u0).
> Unified context-0 diff would give something like:
>
> @@ -n,m +l,k @@
> -removed
> +added
> ...
>
> In this application, we are not interested in the diff text
> itself at all. In fact, we are interested in what's outside the
> diff. If the above is output from parent to you, what it tells
> us is that lines before your line "l" are the same as lines
> before parent's "n", so if the above is the first hunk, you can
> say your lines 1..(l-1) came from your parent and you are not
> responsible for them. The above hunk also tells us that in your
> path, lines after (l+k) are the same as the lines in parent's
> lines (n+m); we will know how many such same lines there are
> (could be zero) by inspecting the next hunk.
>
> For efficiency reasons, inside pass_blame(), we run just one
> diff between the parent and you, and scan the blame_entry for
> the same <commit,path> in the scoreboard, and check overlap with
> each of them. Suppose we are suspected for these lines:
>
> <---- e ----->
>
> and by looking the diff, we determined the range from one parent
> is like this:
>
> <--p--->
>
> Then we can split the blame_entry into three parts:
>
> <-><--p---><->
>
> You are still suspected for the first and the last part of the
> original entry, but for the mid-part you successfully passed the
> blame to that parent. Depending on how the ranges overlap, this
> may split into two, or parent may take blame for the whole range
> (i.e. no split). This is computed in blame_overlap().
>
> When done with one parent, if you are a merge, you will then try
> to pass the blame on the remaining part that you are still
> suspected for to other parents.
>
> The classic 'blame' algorithm stops here, and the current
> pickaxe does the same; you take responsibility for the
> remainder.
>
>
> 4. Passing more blame.
>
> Instead of taking responsibility for the remainder, there are
> other ways to find other people to pass blame on. That's what
> the "NEEDSWORK" comment in pass_blame() is about.
>
> A typical example is a change that moves one static C function
> from lower part of the file to upper part of the same file,
> because you added a new caller in the middle. The path in your
> parent and the path in you would look like this:
>
> parent you
>
> A static foo() {
> B }
> C A
> D B
> E C
> F D
> G ... call foo();
> static foo() { E
> } F
> I G
> J I
>
> With the common part finding code with diff, you will be able to
> pass blames for lines A B C D E F G I J to your parent. You are
> truly guilty for introducing "... call foo();". The problem
> here is that in addition, you will be blamed for the lines that
> implement "static foo() { ... }" at the beginning of your file.
How about move-and-edit scenario? Wouldn't that make the algorithm
somewhat complicated if we wanted to properly describe (blame)
move-and-edit? Or are you going to detect only simple hunk-moves?
> To blame the implementation of foo() to the parent, we could do
> something like this:
>
> for each blame_entry that you are still suspected for,
> diff those lines (and only those lines) with the parent,
> to see if you find a copy. If there is a copy, you can
> pass the blame to the parent.
>
> This needs to be limited for non-trivial changes only, though.
> For example, one of the typical things you would do is to add one
> "else if" clause to a sequence of already existing "if .. else if
> .." chain, like this:
>
> if (foo) {
> ...
> }
> else if (bar) {
> ...
> + }
> + else if (baz) {
> + ...
> }
>
> You are suspected for these three lines; if the "find copies
> again" is allowed to find any insignificant copy, the first line
> (closing brace) you introduced could easily be blamed to an
> unrelated line anywhere in the parent that match /^\t\}$/. To
> prevent such nonsense from happening, we need to limit such
> "find copies" attempt to say minimum of N lines (and minimum of
> M tokens that consist of non-whitespace, non-punctuation
> letters). Also we probably would not want to find a match in
> the whole parent but only in the part of the parent that is
> removed in the parent-to-you diff.
>
> Another typical example is a code restructuring to move one
> function from a file to another file. The same "find copies"
> principle, and caveat for small and insignificant copies, apply
> to this situation as well. If it is a code movement, the file
> that the function moved from needs to be identified, which can
> be done by running diff-tree between parent and you -- anything
> that was modified or removed is a good candidate to look for
> such copies. Again, we probably need to limit the copy-finding
> source to the part that was removed from the parent in the diff
> to you.
Here I see a lot of _arbitration_ in the form of "choose a good
value for N and a good value for M".
Is it possible that we do away with such "user" arbitration,
and instead find an algorithm that solves every case...? Even
if we have to go back to a simpler "blame".
I'm concerned that with "user" arbitration, there would always
be at least one corner case for which the "setting-on-the-knobs"
would fail.
> After the operation 3 (Passing the blame) runs, if movement of
> lines from another file is detected as in the above sketch, we
> will pass blame to an unrelated path in the parent. Some lines
> are blamed on the file we started tracking in the parent, and
> some other lines are blamed on a different file in the parent.
> This is why 'pickaxe' uses <commit,path> tuple (aka 'origin') to
> represent a suspect.
>
> All of this sounds quite simple and not so difficult to code,
> although rejecting insignificant copies would involve fair
> amount of heuristic tweaking like I needed to do while working
> on the rename detection.
Indeed.
Luben
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: On blame/pickaxe
2006-10-16 14:15 ` Andy Whitcroft
@ 2006-10-17 0:44 ` Petr Baudis
2006-10-21 3:20 ` Junio C Hamano
0 siblings, 1 reply; 12+ messages in thread
From: Petr Baudis @ 2006-10-17 0:44 UTC (permalink / raw)
To: Andy Whitcroft
Cc: Josef Weidendorfer, Junio C Hamano, Luben Tuikov, git,
Linus Torvalds
Dear diary, on Mon, Oct 16, 2006 at 04:15:01PM CEST, I got a letter
where Andy Whitcroft <apw@shadowen.org> said that...
> Josef Weidendorfer wrote:
> > Hi,
> >
> > this blame-passing thing really looks very promising and powerful.
> >
> > On Monday 16 October 2006 08:43, you wrote:
> >> If the user is not prepared to see code movement, pickaxe can be
> >> run without -M nor -C to get the classic blame output.
Ok, so in this case -M and -C does not mean just looking for
copies/movements in other files but inside the same file as well.
Perhaps we might want to differentiate those two cases since searching
in all files might be significantly slower.
> > Another blame-passing heuristic would be very interesting for code:
> > "Ignore white-space changes".
> > This way, commits which only do some reindentations simply are skipped.
> >
> > It looks like such a thing would just be a matter of passing "-b" to
> > executions of "diff" in the blame-passing algorithm.
>
> I am thinking that that is probabally going to need to be optional, for
> example python the indentation is everything to the meaning of the code.
(OTOH, just today I was retrieving some code from deep inside a script
to a common function, which of course caused massive indentation shift.
So it is very desirable in order to catch these. But more we get
involved in this, the more we will probably want to know about the
syntax of the content we are digging in.)
--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
#!/bin/perl -sp0777i<X+d*lMLa^*lN%0]dsXx++lMlN/dsM0<j]dsj
$/=unpack('H*',$_);$_=`echo 16dio\U$k"SK$/SM$n\EsN0p[lN*1
lK[d2%Sa2/d0$^Ixp"|dc`;s/\W//g;$_=pack('H*',/((..)*)$/)
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: On blame/pickaxe
2006-10-16 23:45 ` Luben Tuikov
@ 2006-10-17 9:03 ` Junio C Hamano
0 siblings, 0 replies; 12+ messages in thread
From: Junio C Hamano @ 2006-10-17 9:03 UTC (permalink / raw)
To: Luben Tuikov; +Cc: git
Luben Tuikov <ltuikov@yahoo.com> writes:
>> 3. Passing the blame.
>>
>> You (a <commit,path> tuple) are suspected for introducing
>> certain lines, and you would want to pass blame to your parent.
>> How would you do that?
>>
>> First, you find if your parent has the same path; if not, you
>> find if between your parent and you there was a rename and find
>> the original path in the parent. If you are a merge, you do so
>> for all your parents. The path in the parent and your path may
>> have many common lines, and if the lines you are the suspect are
>> the same as the ones in the parent, you can pass the blame,
>> because these lines were there before you touched them.
>
> Do you handle the case where a merge had a conflict and
> the user changed the code (resolved) and then committed?
> In this case some lines will have to be blamed on the
> merge commit itself.
Working on a small example by hand is a good way to convince
yourself. The whole point of "try to pass the blame, and take
the blame yourself only when you can't pass to anybody" is
precisely to handle the merges sanely. The answer to your later
question also would become crystal clear with that exercise.
Suppose that we are looking at a merge that would give this in
its "git show" output:
diff --cc hello.c
index 3c27792,db3fdef..cec80d2
--- a/hello.c
+++ b/hello.c
@@@ -1,4 -1,6 +1,6 @@@
int main(int ac, char **av)
{
- printf("hello, world.\n");
- const char *msg = "hello, world";
++ const char *msg = "hello, world.";
+
+ printf("%s\n", msg);
}
First, we inspect the diff from the first parent:
diff --git a/hello.c b/hello.c
index 3c27792..cec80d2 100644
--- a/hello.c
+++ b/hello.c
@@ -1,4 +1,6 @@
1 int main(int ac, char **av)
2 {
- printf("hello, world.\n");
3 + const char *msg = "hello, world.";
4 +
5 + printf("%s\n", msg);
6 }
That would find that lines 1, 2 and 6 came from the first parent
(line numbers are of the postimage; e.g. line 6 is the closing
brace).
We are still left with lines 3, 4, and 5. So we will see the
difference from the second parent:
diff --git a/hello.c b/hello.c
index db3fdef..cec80d2 100644
--- a/hello.c
+++ b/hello.c
@@ -1,6 +1,6 @@
1 int main(int ac, char **av)
2 {
- const char *msg = "hello, world";
3 + const char *msg = "hello, world.";
4
5 printf("%s\n", msg);
6 }
It shows that lines 1, 2, 4, 5 and 6 are the same as the second
parent (again, line numbers are of the postimage). This means
that we _could_ attribute line 1, 2 and 6 to the second parent
if we wanted to, but we have already passed blame for 1, 2 and 6
to the first parent [*1*] and only lines 4 and 5 are assigned to
the second parent.
At this point, we have no more parents to pass blame on and are
still left with line 3. So we end up taking the blame for that
line ourselves. The final blame output reflects that.
If you are interested, prepare an example repository using the
attached script, and try annotating E, like this:
git pickaxe --not right~2 left -- E
This demonstrates the example in this message (first parent is
Right and the second is Left).
Annotating C with blame and pickaxe (use "-n -f" for clarity)
shows the limitation of the original 'blame' that can use only
one path per commit. This is a corner case where two files
originally different in the common ancestor were later merged
into one. pickaxe handles this case without -M.
git blame -n -f C
git pickaxe -n -f C
Annotating D with pickaxe with and without -M illustrates how
line movement is handled.
git pickaxe -M D
Have fun.
[*1*] The really core part of git does not have any preference
among parents, but typically a merge commit is made with the
current branch head as its first parent and the other branch as
its second parent, so favoring the earlier parent over the later
ones makes a lot of sense in practice. This is in line with
other parts of git, including the merge simplification done by
git-log.
-- 8< --
#!/bin/sh
test -d .git || {
echo Run me in an empty directory.
exit
}
git init-db
for i in 1 2 3 4 5 6 7 8 9 ; do echo line from initial $i ; done >A
for i in A B C D E F G H I ; do echo line from initial $i ; done >B
cp A D
cat >E <<EOF
int main(int ac, char **av)
{
printf("hello, world\n");
}
EOF
git add A B D E
git commit --author='Initial <initial@author>' -m initial
git branch right
git branch left
# Left
git checkout left
for i in 1 2 3; do echo added by left; done >C
cat A >>C
rm -f A B
cat >E <<EOF
int main(int ac, char **av)
{
const char *msg = "hello, world";
printf("%s\n", msg);
}
EOF
git update-index --add --remove A B C E
git commit --author='Left <left@branch>' -m Left
# Right
git checkout right
cat B >C
for i in 1 2 3; do echo added by right; done >>C
rm -f A B
cat >E <<EOF
int main(int ac, char **av)
{
printf("hello, world.\n");
}
EOF
git update-index --add --remove A B C E
git commit --author='Right <right@branch>' -m Right
echo "Merge -- this should fail which is expected and scripts fixes it up"
echo "Do not get alarmed with the error message."
git pull . left
echo "Fixing up..."
{
git cat-file blob :3:C
echo line by evil merge
git cat-file blob :2:C
} >C
cat >E <<EOF
int main(int ac, char **av)
{
const char *msg = "hello, world.";
printf("%s\n", msg);
}
EOF
git update-index C E
git commit --author='Merge <merge@branch>' -m 'Changes are merged.'
rm -f C~*
{
for i in 5 6 7; do echo line from initial $i ; done
echo line modified while swapping 8
for i in 9 1 2 3 4 ; do echo line from initial $i ; done
} >D
git update-index D
git commit --author='Swap <swap@branch>' -m 'Lines are swapped.'
echo "Now try annotating C, D and E with various options."
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: On blame/pickaxe
2006-10-17 0:44 ` Petr Baudis
@ 2006-10-21 3:20 ` Junio C Hamano
0 siblings, 0 replies; 12+ messages in thread
From: Junio C Hamano @ 2006-10-21 3:20 UTC (permalink / raw)
To: Petr Baudis; +Cc: git
Petr Baudis <pasky@suse.cz> writes:
>> >> If the user is not prepared to see code movement, pickaxe can be
>> >> run without -M nor -C to get the classic blame output.
>
> Ok, so in this case -M and -C does not mean just looking for
> copies/movements in other files but inside the same file as well.
>
> Perhaps we might want to differentiate those two cases since searching
> in all files might be significantly slower.
I do not personally worry about people confusing -M/-C to
pickaxe with -M/-C given to diff (to pickaxe, use of diff is
purely an internal implementation issue), and reused -M and -C
to mean quite different things. -M is to detect line movement
inside a file (it is not strictly limited to "line movement",
though. It _is_ about "copies and moves within the same file").
On the other hand, -C (and its stronger form -C -C) is to detect
copies and moves across file boundaries (but wholesale "file
rename" comes as part of the basic algorithm so you do not have
to ask for it with -M nor -C). So in that sense pronouncing M
"move" and C "copy" is not accurate. Their differences is
already what you said "we might want".
However they match the cost expectation people are used to in
diff options pretty well. -M is not so expensive and -C is
somewhat. -C -C is like find-copies-harder and is quite
expensive.
Also currently the code does not do "move" detection at all.
Contrary to intuition, move detection is more expensive than
copy detection in this case.
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2006-10-21 3:20 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-10-13 1:43 On blame/pickaxe Junio C Hamano
2006-10-13 7:54 ` Junio C Hamano
2006-10-14 10:26 ` Junio C Hamano
2006-10-14 23:43 ` Junio C Hamano
2006-10-16 2:21 ` Petr Baudis
2006-10-16 6:43 ` Junio C Hamano
2006-10-16 14:02 ` Josef Weidendorfer
2006-10-16 14:15 ` Andy Whitcroft
2006-10-17 0:44 ` Petr Baudis
2006-10-21 3:20 ` Junio C Hamano
2006-10-16 23:45 ` Luben Tuikov
2006-10-17 9:03 ` Junio C Hamano
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).