[RFC v2] blame: new option --prefer-first to better handle merged cherry-picks

All of lore.kernel.org
 help / color / mirror / Atom feed

* [RFC v2] blame: new option --prefer-first to better handle merged cherry-picks
@ 2014-01-13  6:30 Bernhard R. Link
  2014-01-13 22:26 ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Bernhard R. Link @ 2014-01-13  6:30 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano

Allows to disable the git blame optimization of assuming that if there is a
parent of a merge commit that has the exactly same file content, then
only this parent is to be looked at.

This optimization, while being faster in the usual case, means that in
the case of cherry-picks the blamed commit depends on which other commits
touched a file.

If for example one commit A modified both files b and c. And there are
commits B and C, B only modifies file b and C only modifies file c
(so that no conflicts happen), and assume A is cherry-picked as A'
and the two branches then merged:

--o-----B---A
   \         \
    ---C---A'--M---

Then without this new option git blame blames the A|A' changes of
file b to A while blaming the changes of c to A'.
With the new option --prefer-first it blames both changes to the
same commit and to the one more on the "left" side of the graph.

Signed-off-by: Bernhard R. Link <brlink@debian.org>
---
 Documentation/blame-options.txt | 6 ++++++
 builtin/blame.c                 | 7 +++++--
 2 files changed, 11 insertions(+), 2 deletions(-)

 Differences to first round: rename option and describe the effect
 instead of the implementation in documentation.

diff --git a/Documentation/blame-options.txt b/Documentation/blame-options.txt
index 0cebc4f..b2e7fb8 100644
--- a/Documentation/blame-options.txt
+++ b/Documentation/blame-options.txt
@@ -48,6 +48,12 @@ include::line-range-format.txt[]
 	Show the result incrementally in a format designed for
 	machine consumption.
 
+--prefer-first::
+	If a line was introduced by two commits (for example via
+	a merged cherry-pick), prefer the commit that was
+	first merged in the history of always following the
+	first parent.
+
 --encoding=<encoding>::
 	Specifies the encoding used to output author names
 	and commit summaries. Setting it to `none` makes blame
diff --git a/builtin/blame.c b/builtin/blame.c
index 4916eb2..8ea34cf 100644
--- a/builtin/blame.c
+++ b/builtin/blame.c
@@ -45,6 +45,7 @@ static int incremental;
 static int xdl_opts;
 static int abbrev = -1;
 static int no_whole_file_rename;
+static int prefer_first;
 
 static enum date_mode blame_date_mode = DATE_ISO8601;
 static size_t blame_date_width;
@@ -1248,7 +1249,8 @@ static void pass_blame(struct scoreboard *sb, struct origin *origin, int opt)
 			porigin = find(sb, p, origin);
 			if (!porigin)
 				continue;
-			if (!hashcmp(porigin->blob_sha1, origin->blob_sha1)) {
+			if (!prefer_first &&
+			    !hashcmp(porigin->blob_sha1, origin->blob_sha1)) {
 				pass_whole_blame(sb, origin, porigin);
 				origin_decref(porigin);
 				goto finish;
@@ -2247,7 +2249,8 @@ int cmd_blame(int argc, const char **argv, const char *prefix)
 	static const char *contents_from = NULL;
 	static const struct option options[] = {
 		OPT_BOOL(0, "incremental", &incremental, N_("Show blame entries as we find them, incrementally")),
-		OPT_BOOL('b', NULL, &blank_boundary, N_("Show blank SHA-1 for boundary commits (Default: off)")),
+		OPT_BOOL(0, "prefer-first", &prefer_first, N_("Prefer blaming commits merged earlier")),
+		OPT_BOOL('b', NULL, &blank_boundary, N_("Show blank SHA-1 for boundary commits (Default: ff)")),
 		OPT_BOOL(0, "root", &show_root, N_("Do not treat root commits as boundaries (Default: off)")),
 		OPT_BOOL(0, "show-stats", &show_stats, N_("Show work cost statistics")),
 		OPT_BIT(0, "score-debug", &output_option, N_("Show output score for blame entries"), OUTPUT_SHOW_SCORE),
-- 
1.8.5.1

	Bernhard R. Link
-- 
F8AC 04D5 0B9B 064B 3383  C3DA AFFC 96D1 151D FFDC

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [RFC v2] blame: new option --prefer-first to better handle merged cherry-picks
  2014-01-13  6:30 [RFC v2] blame: new option --prefer-first to better handle merged cherry-picks Bernhard R. Link
@ 2014-01-13 22:26 ` Junio C Hamano
  2014-01-13 22:52   ` Bernhard R. Link
  0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2014-01-13 22:26 UTC (permalink / raw)
  To: Bernhard R. Link; +Cc: git

"Bernhard R. Link" <brlink@debian.org> writes:

> Allows to disable the git blame optimization of assuming that if there is a
> parent of a merge commit that has the exactly same file content, then
> only this parent is to be looked at.
>
> This optimization, while being faster in the usual case, means that in
> the case of cherry-picks the blamed commit depends on which other commits
> touched a file.
>
> If for example one commit A modified both files b and c. And there are
> commits B and C, B only modifies file b and C only modifies file c
> (so that no conflicts happen), and assume A is cherry-picked as A'
> and the two branches then merged:
>
> --o-----B---A
>    \         \
>     ---C---A'--M---
>
> Then without this new option git blame blames the A|A' changes of
> file b to A while blaming the changes of c to A'.
> With the new option --prefer-first it blames both changes to the
> same commit and to the one more on the "left" side of the graph.
>
> Signed-off-by: Bernhard R. Link <brlink@debian.org>
> ---
>  Documentation/blame-options.txt | 6 ++++++
>  builtin/blame.c                 | 7 +++++--
>  2 files changed, 11 insertions(+), 2 deletions(-)
>
>  Differences to first round: rename option and describe the effect
>  instead of the implementation in documentation.

I read the updated documentation three times but it still does not
answer any of my questions I had in $gmane/239888, the most
important part of which was:

    Yeah, the cherry-picked one will introduce the same change as
    the one that was cherry-picked, so if you look at the end result
    and ask "where did _this_ line come from?", there are two
    equally plausible candidates, as "blame" output can give only
    one answer to each line.  I still do not see why the one that is
    picked with the new option is better.  At best, it looks to me
    that it is saying "running with this option may (or may not)
    give a different answer, so run the command with and without it
    and see which one you like", which does not sound too useful to
    the end users.

To put it another way, why/when would an end user choose to use this
option?  If the result of using this option is always better than
without, why/when would an end user choose not to use this option?

> diff --git a/Documentation/blame-options.txt b/Documentation/blame-options.txt
> index 0cebc4f..b2e7fb8 100644
> --- a/Documentation/blame-options.txt
> +++ b/Documentation/blame-options.txt
> @@ -48,6 +48,12 @@ include::line-range-format.txt[]
>  	Show the result incrementally in a format designed for
>  	machine consumption.
>  
> +--prefer-first::
> +	If a line was introduced by two commits (for example via
> +	a merged cherry-pick), prefer the commit that was
> +	first merged in the history of always following the
> +	first parent.
> +
>  --encoding=<encoding>::
>  	Specifies the encoding used to output author names
>  	and commit summaries. Setting it to `none` makes blame

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC v2] blame: new option --prefer-first to better handle merged cherry-picks
  2014-01-13 22:26 ` Junio C Hamano
@ 2014-01-13 22:52   ` Bernhard R. Link
  2014-01-13 23:01     ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Bernhard R. Link @ 2014-01-13 22:52 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

* Junio C Hamano <gitster@pobox.com> [140113 23:31]:
> I read the updated documentation three times but it still does not
> answer any of my questions I had in $gmane/239888, the most
> important part of which was:
>
>     Yeah, the cherry-picked one will introduce the same change as
>     the one that was cherry-picked, so if you look at the end result
>     and ask "where did _this_ line come from?", there are two
>     equally plausible candidates, as "blame" output can give only
>     one answer to each line.  I still do not see why the one that is
>     picked with the new option is better.

Because:
  - it will blame the modifications of merged cherry-picked commit
    to only one commit. Without the option parts of the modification
    will be reported as coming from the one, parts will be reported
    to be from the other. With the option only one of those two commits
    is reported as the origin at the same time and not both.
  - it is more predictable which commit is blamed, so if one is
    interested in where some commit was introduced first into a
    "mainline", one gets this information, and not somtimes a different
    one due to unrelated reasons.

> To put it another way, why/when would an end user choose to use this
> option?  If the result of using this option is always better than
> without, why/when would an end user choose not to use this option?

While the result is more consistent and more predictable in the case
of merged cherry picks, it is also slower in every case. Usually speed
will be more important than this exactness, especially as the result
will not differ for the common case (if there are no cherry-picked
commits merged or when those commits do not touch any files that are
otherwise only modified in the merged branch).

	Bernhard R. Link
-- 
F8AC 04D5 0B9B 064B 3383  C3DA AFFC 96D1 151D FFDC

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC v2] blame: new option --prefer-first to better handle merged cherry-picks
  2014-01-13 22:52   ` Bernhard R. Link
@ 2014-01-13 23:01     ` Junio C Hamano
  2014-01-14  1:00       ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2014-01-13 23:01 UTC (permalink / raw)
  To: Bernhard R. Link; +Cc: git

"Bernhard R. Link" <brl+git@mail.brlink.eu> writes:

> * Junio C Hamano <gitster@pobox.com> [140113 23:31]:
>> I read the updated documentation three times but it still does not
>> answer any of my questions I had in $gmane/239888, the most
>> important part of which was:
>>
>>     Yeah, the cherry-picked one will introduce the same change as
>>     the one that was cherry-picked, so if you look at the end result
>>     and ask "where did _this_ line come from?", there are two
>>     equally plausible candidates, as "blame" output can give only
>>     one answer to each line.  I still do not see why the one that is
>>     picked with the new option is better.
>
> Because:
>   - it will blame the modifications of merged cherry-picked commit
>     to only one commit. Without the option parts of the modification
>     will be reported as coming from the one, parts will be reported
>     to be from the other. With the option only one of those two commits
>     is reported as the origin at the same time and not both.
>   - it is more predictable which commit is blamed, so if one is
>     interested in where some commit was introduced first into a
>     "mainline", one gets this information, and not somtimes a different
>     one due to unrelated reasons.
>
>> To put it another way, why/when would an end user choose to use this
>> option?  If the result of using this option is always better than
>> without, why/when would an end user choose not to use this option?
>
> While the result is more consistent and more predictable in the case
> of merged cherry picks, it is also slower in every case.

Consistent and predictable, perhaps, but I am not sure "exact" would
be a good word.

Wouldn't the result depend on which way the cherry pick went, and
then the later merge went?  In the particular topology you depicted
in the log message, the end result may happen to point at the same
commit for these two paths, but I am not sure how the change
guarantees that "we always point at the same original commit not the
cherry-picked one", which was implied by the log message, if your
cherry-pick and merge went in different direction in similar
topologies.

And that is why I said:

    At best, it looks to me that it is saying "running with this
    option may (or may not) give a different answer, so run the
    command with and without it and see which one you like"

With the stress on "different" answer; it the change were "with the
option the result is always better, albeit you will have to wait
longer", I would not have this much trouble accepting the change,
though.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC v2] blame: new option --prefer-first to better handle merged cherry-picks
  2014-01-13 23:01     ` Junio C Hamano
@ 2014-01-14  1:00       ` Junio C Hamano
  2014-01-14  8:37         ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2014-01-14  1:00 UTC (permalink / raw)
  To: Bernhard R. Link; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:

>> While the result is more consistent and more predictable in the case
>> of merged cherry picks, it is also slower in every case.
>
> Consistent and predictable, perhaps, but I am not sure "exact" would
> be a good word.

Another thing I am not enthusiasitc about this change is that I am
afraid that this may make "git blame -- path" and "git log -- path"
work inconsistenly.  The both cull side branches whenever one of the
parents gave the resulting blob, even that parent is not the first
one.  But "git blame --prefer-first -- path", afaict, behaves quite
differently from "git log --first-parent -- path", even though they
share similar option names, adding more to the confusion.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC v2] blame: new option --prefer-first to better handle merged cherry-picks
  2014-01-14  1:00       ` Junio C Hamano
@ 2014-01-14  8:37         ` Junio C Hamano
  2014-01-14 19:12           ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2014-01-14  8:37 UTC (permalink / raw)
  To: Bernhard R. Link; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:

> Junio C Hamano <gitster@pobox.com> writes:
>
>>> While the result is more consistent and more predictable in the case
>>> of merged cherry picks, it is also slower in every case.
>>
>> Consistent and predictable, perhaps, but I am not sure "exact" would
>> be a good word.
>
> Another thing I am not enthusiasitc about this change is that I am
> afraid that this may make "git blame -- path" and "git log -- path"
> work inconsistenly.  The both cull side branches whenever one of the
> parents gave the resulting blob, even that parent is not the first
> one.  But "git blame --prefer-first -- path", afaict, behaves quite
> differently from "git log --first-parent -- path", even though they
> share similar option names, adding more to the confusion.

I think I am starting to understand why this patch felt wrong to me.
This wasn't about "--first-parent" at all, and you are correct that
you didn't call the option as such), but I somehow thought that they
were related; perhaps the fact that both disable the "if the result
exactly matches one parent, all the other parents can be culled to
simplify the history" logic blinded me.

In reality, the new flag is a lot closer in spirit to the total
opposite of "--first-parent", i.e. "--full-history".  That option
also disables that "if same to one parent, other parents do not
matter" logic, but its effect is quite different.  It makes the
other histories that did not have to have contributed the end result
shown in the output.

Now, when we step back and think about how the normal "git blame"
logic apportions the blame to multiple parents when there is no
exact match, it does so in a pretty arbitrary way.  It lets earlier
parents to claim the responsibility and later parents only get
leftover contents that weren't claimed by the earlier ones.  We can
call that "favouring earler ones", i.e. "--prefer-first".

It was implemented this way, not because this order makes any sense,
but primarily because no order is particularly better than any
other, and the designer (me) happened to have picked the easiest one
at random.

The "pick the one that exactly matches if exists" can be thought of
an easy hack to hide the problems that come from this arbitrary
choice.  Without it, if the result matches the second parent (i.e. a
typical merge of a work done on the topic branch while the mainline
has been quiescent in the same area), the "give earlier parents a
chance to claim responsiblity before later ones" rule would have
split the blame for parts that weren't changed in the side branch
topic to the mainline and blame would have been passed to the side
branch only for the portion that were changed by the side branch.
Instead, "pass the whole blame to the one that exactly matches" hack
keeps larger blocks of text unsplit, clumping related contents
together as long as possible while we traverse the history.

It is an "easy hack", because we only need to compare the object
name, but a logical extension to it would have been to compute the
similarity scores between the result and each of the parents, sort
the parents by that similarity score order, and give more similar
ones a chance to claim responsibility before less similar ones.
We could call it "favouring similar ones", i.e. "--prefer-similar"
or something.

That would have made the result more stable.  Imagine that in one
history, a merge's result matchs exactly the second parent, and in
another history, a merge's result almost matches exactly the second
parent but the difference is the result adds one blank line at the
end of the file relative to what the second parent has.

With the current code, blaming the file will get quite a different
result.

In the former history, the sub-history leading to the second parent
of the merge will get all the blame, but in the latter history, the
sub-history leading to the first parent of the merge will have a
chance to claim the responsibility for the shared part before the
second parent has a say in the output.

If we sorted the parents in the similarity order and gave the first
refusal right to more similar parents before less similar ones, then
the resulting output from "git blame" would be very similar in these
two histories, which would be a very desirable property.  If the
only difference between the results of the merge in the former and
the latter histories is one blank line at the end of the file in
question, blames for the remaining part of the file should be
assigned the same between the two histories, but the "pass the
entire blame to the second parent only when the second parent
exactly matches" hack gets in the way for that ideal, and "sort the
parents in similarity order" will fix that.

Of course, it would make the computation a lot more costly, but it
would make the behaviour more predictable and understandable.

But that is a different tangent.

I think the new feature introduced by your change can be explained
as "'git blame' uses the same history simplification as the commands
in the 'git log' family that culls other side branches when the
merge result exactly matches one parent, and in all other cases, it
lets earlier parents claim responsibility before the later ones.
This option disables the culling of the irrelevant side branches, in
a way similar to how '--full-history' option to the commands in the
'git log' family works, and lets earlier parents claim
responsibility to the merge result (even when the later parents
contributed a lot more to the result) before the later parents.".

And if it were sold that way, I think I could at least understand it
(I do not necessarily buy it as a useful feature, though---at least
not yet).

In any case, "--prefer-first" is not particularly a good name, as
that is the default mode of operation for "git blame".  If we were
ever going to implement the "sort parents by similarity", that would
be triggered with "--prefer-similar" and "--prefer-first" would
becomeq a way to choose the current algorithm (i.e. not sort the
parents by similarity but go from earlier to later parents).  We
would regret if we gave that option name to the feature proposed by
the patch under discussion.  How about calling it "--full-history",
which is a way to tell Git not to cull side branches when the result
matches one of the parents?  It is even plausible that we may later
come up with "--prefer-<something>" (sort the parents not in the
original parent order nor in the similarity order but with some
other heuristics), and I suspect "--full-history" would be an
orthogonal axis to the order in which the parents are given a chance
to claim responsiblity.

Thanks; I'll queue the patch on 'pu' and wait for others to comment.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC v2] blame: new option --prefer-first to better handle merged cherry-picks
  2014-01-14  8:37         ` Junio C Hamano
@ 2014-01-14 19:12           ` Junio C Hamano
  0 siblings, 0 replies; 7+ messages in thread
From: Junio C Hamano @ 2014-01-14 19:12 UTC (permalink / raw)
  To: Bernhard R. Link; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:

> The "pick the one that exactly matches if exists" can be thought of
> an easy hack to hide the problems that come from this arbitrary
> choice.  ...
> Instead, "pass the whole blame to the one that exactly matches" hack
> keeps larger blocks of text unsplit, clumping related contents
> together as long as possible while we traverse the history.
>
> It is an "easy hack", because we only need to compare the object
> name, but a logical extension to it would have been to compute the
> similarity scores between the result and each of the parents, sort
> the parents by that similarity score order, and give more similar
> ones a chance to claim responsibility before less similar ones.
> We could call it "favouring similar ones", i.e. "--prefer-similar"
> or something.

Extending along the tangent further.

Another thing that I found the argument in the proposed log message
of the patch weak was that the claim that changed code will assign
the blame to the "same" commit for both path b and c.  There are two
reasons why.  One is that we do not look at b while chasing the
ancestry of c, so if a different traverse order assigns the blame to
the same commit for them, it is a mere happenstance.  But a more
important reason is that the changed code will still assign the
blame for "different" commits if the final merge were made in the
opposite direction.  In your original topology, we skip over the
first parent and give the whole blame to the second parent without
the change, and with the change, we stop doing so and instead give
some blame to the first parent and then allow the second parent a
chance to claim the blame for the remainder.  But in a history where
the final merge went in the opposite direction, even with the
change, we compare with the "first" parent (which was the "second"
one in your original topology) with the result, find out that the
contents exactly match, and that parent grabs the whole blame.  So
in that sense, the updated code that "consistently" gives earlier
parents chance to claim the blame before later ones does not behave
consitently on the same history with different merge parent order.

That makes me think that the reason why the result you got with the
change is better (assuming it is better) is _not_ because the
updated code lets earlier parents give chance to claim the blame; it
could be an indication that the "keep larger blocks of text unsplit,
clumping related contents together as long as possible" heuristics
is what prevents us from having a better result.

If that is really the case, that would mean that letting the blame
split early would give us a better result.  I alluded to "give more
similar parents first chance to claim responsibility before less
similar ones" in the previous message, but perhaps this is
indicating that we might get a better result if we did the
opposite---instead of assigning blames to earlier parents and then
to later ones, compare the result with each parent, order the
parents by how few lines of blame they could claim if each of them
were allowed to go first, and then actually compute and assign the
blame in that order, "favouring dissimilar ones".  That may produce
the result you are after in a more consistent way, regardless of the
merge order.

I think I've done thinking about this issue, at least for now.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2014-01-14 19:12 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-01-13  6:30 [RFC v2] blame: new option --prefer-first to better handle merged cherry-picks Bernhard R. Link
2014-01-13 22:26 ` Junio C Hamano
2014-01-13 22:52   ` Bernhard R. Link
2014-01-13 23:01     ` Junio C Hamano
2014-01-14  1:00       ` Junio C Hamano
2014-01-14  8:37         ` Junio C Hamano
2014-01-14 19:12           ` Junio C Hamano

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.