Our merge bases sometimes suck

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Our merge bases sometimes suck
@ 2014-06-12 22:12 Michael Haggerty
  2014-06-13  9:38 ` Michael J Gruber
  2014-06-17 15:08 ` Junio C Hamano
  0 siblings, 2 replies; 12+ messages in thread
From: Michael Haggerty @ 2014-06-12 22:12 UTC (permalink / raw)
  To: git discussion list

[-- Attachment #1: Type: text/plain, Size: 9556 bytes --]

I've been thinking a lot about merge bases lately and think I have
discovered something interesting.

tl;dr:

When two branches have multiple merge bases,

    git merge-base $master $branch

picks one merge base more or less arbitrarily.  Here I describe a
criterion for picking a "best" merge base.  When the best merge base
is used, the diff output by

    git diff $master...$branch

becomes shorter and simpler--sometimes dramatically so.  I have
quantified the improvement by analyzing historical merges from the Git
project (see attached image).  Best merge bases might also help reduce
conflicts when conducting merges.

Background
==========

A merge base of two or more branches is a common ancestor of those
branches that has no descendant that is also a common ancestor (see
`git-merge-base(1)`).  Merge bases are important when merging
branches, and are also used internally to some other commands like
"git diff $master...$branch".

Sometimes branches have multiple merge bases, such as in the case of a
criss-cross merge:

    o--A--o--o--X--o--Y--o--Z    <- master
        \        \   /
         \        \ /
          \        ·
           \      / \
            \    /   \
             B--C--D--E--F       <- branch

Here, commits X and C are both merge bases for branches master and
branch.  You can list all merge bases by running

    git merge-base --all $master $branch

But if you run

    git merge-base $master $branch

only one merge base is returned.  *Which* one it returns is not
documented and appears to be pretty arbitrary (probably a side effect
of the traversal order?)

The "best" merge base
=====================

But not all merge bases are created equal.  It is possible to define a
"best" merge base that has some nice properties.

Let's focus on the command

    git diff $master...$branch

which is equivalent to

    git diff $(git merge-base $master $branch)..$branch

We want such a diff to have two properties:

1. It must include *all* changes that have been made on the branch but
   are not yet contained in master.

2. It should contain *as few* changes as possible that are already in
   master.

The first property is satisfied automatically if the left end of the
"diff" range is any merge base.  Because a merge base is an ancestor
of master, it cannot possibly include any changes that were made on
the branch but have not yet been merged to master [1].

The second property requires as a *necessary* condition that the left
end of the diff is a merge base.  But the property also helps us pick
the best among multiple merge bases.  We just need to make the idea of
"contains as few changes as possible" more precise.

I propose that the best merge base is the merge base "candidate" that
minimizes the number of non-merge commits that are in

    git rev-list --no-merges $candidate..$branch

but are already in master:

    git rev-list --no-merges $master

Since a non-merge commit should embody a single logical change,
counting non-merge commits is in some sense counting changes [2].

We can put this criterion in simpler form.  Because every candidate is
a merge-base,

    git rev-list --no-merges $candidate..$branch

necessarily includes *all* of the non-merge commits that are on branch
but not on master.  This is a fixed number of commits, the same for
every candidate.  It *additionally* includes the commits that are on
master but not yet in branch.  This second number varies from one
candidate to another.  So if we minimize the number of commits in this
output, is is the same as minimizing the number of unwanted commits.
Therefore, to get the best merge base, all we have to do is pick the
merge base that minimizes

    git rev-list --count --no-merges $candidate..$branch

There can be ties, but in practice they are rare enough that it is
probably not worth worrying about them.

Symmetry; generalization to more than two branches
==================================================

Interestingly, minimizing the last criterion is the same as maximizing

    git rev-list --count --no-merges $candidate

because there is a fixed number of commits in

    git rev-list --no-merges $branch

, and each of those commits is in exactly one of the two counts above.
This formulation shows that the best merge base for computing

    git diff $master...$branch

is also the best merge base for computing

    git diff $branch...$master

; i.e., the best merge base is symmetric in its arguments.  It also
shows that the concept of "best merge base" can trivially be
generalized to more than two branches.

git-best-merge-base script
===========================

The best merge base can be computed, for example, using the following
script:

    #! /bin/sh
    # Usage: git-best-merge-base MASTER BRANCH

    set -e

    master="$1"
    branch="$2"

    count() {
        git rev-list --no-merges --count "$1..$branch"
    }

    # Note that if $master and $branch have no common ancestors, `git
    # merge-base` fails with retcode=1, causing the script to exit
    # with the same error code.
    merge_bases=$(git merge-base --all "$master" "$branch")

    case "${#merge_bases}" in
    40)
        # One merge base -> use it.
        echo $merge_bases
        ;;
    *)
        # Multiple merge bases -> pick the one for which $base..$branch
        # has the fewest commits that are already in $master.  To ensure
        # that the result is repeatable, if there is a tie we choose the
        # merge base that comes first when sorted by SHA-1:
        for merge_base in $merge_bases
        do
            echo $(count $merge_base) $merge_base
        done | sort -k1,1n -k2 | sed -ne '1s/^.* //p'
        ;;
    esac

Do best merge bases really help?
================================

I analyzed all of the 2-parent merges made in the history of the Git
project.  (I limited the analysis to 2-parent merges just for
simplicity.)  I computed the size of the asymmetric diffs

    git diff $commit^1...$commit^2

using the current code and using git-best-merge-base, with the
following script:

    #! /bin/sh

    old_diff_len() {
        git diff $1...$2 | wc -l
    }

    new_diff_len() {
        git diff $(git-best-merge-base $1 $2)..$2 | wc -l
    }

    git rev-list --min-parents=2 --max-parents=2 --parents "$@" |
    while read commit p1 p2
    do
        echo "$commit $(git merge-base --all $p1 $p2 | wc -l)
$(old_diff_len $p1 $p2) $(new_diff_len $p1 $p2)"
    done

Results:

* Total number of merges: 8263

* Total number of merges with exactly two parents: 8229

  Of which:

  * Number with zero merge bases: 6 (0.07%)

  * Number with one merge base: 7823 (95.1%)

  * Number with multiple merge bases: 400 (4.8%)

    Of which:

    * Number whose diffs shrink: 71 (17.8%)

    * Number whose diffs remain the same length: 323 (80.1%)

    * Number whose diffs grow: 6 (1.5%)

I have attached a graph plotting the diff sizes against each other on
a log-log scale.  The points *under* the red line are diffs that have
shrunk; the points *over* the red line are the diffs that have grown.
As you can see, far more diffs shrank than grew, and by larger
factors.  Some of the diffs have shrunk dramatically.  In the most
drastic case, the diff shrank from 466602 lines to 81 lines.

A relatively small fraction of all merges are affected, but of merges
that have multiple merge bases (which are the most difficult merges to
handle), more than one in six would be simplified.

I should mention that I have done a similar analysis of a large
commercial software project's history, with broadly similar results.

I wouldn't be surprised if selecting merge bases more intelligently in
"git merge" also helps make more merges go through without conflicts.

Suggested next steps
====================

I don't have time right now to work more on best merge bases, and I am
not very familiar with the parts of the code that would be involved.
So I'm hoping that somebody else finds this a worthwhile improvement
and tries implementing best merge bases in git core.  I have the
following suggestions:

* Add an option

      git merge-base --best <commit> <commit>...

  that computes the best merge base as described above.  Its output
  can be tested against the git-best-merge-base script that I
  provided.

* Use the best merge base when computing

      git diff <commit>...<commit>

  This should give shorter differences in many cases.

* Benchmark "merge-base" with the "--best" option and, if it is not
  too expensive, perhaps make it the default behavior of "merge-base".

* See whether best merge bases can be used to improve other parts of
  the code, especially the code for the recursive merge strategy.

Michael

[1] I assume that commits make independent changes.  If some commits
    in fact revert changes made by other commits, or some commits have
    been cherry-picked across branches, then the correlation between
    number of commits and size of diff will be a bit weaker but should
    still be a useful approximation.

[2] There are other ways that "fewest changes" could be defined.  We
    could include merge commits in the count of extra commits.  (This
    variant could be implemented by omitting `--no-merges` in the
    `rev-list` commands above.)  Or, when diffing only text files, we
    could try to minimize the overall textual length of the diff.  The
    definition used in the main text is chosen for being reasonable,
    general, symmetric, and not too expensive to compute.

-- 
Michael Haggerty
mhagger@alum.mit.edu

[-- Attachment #2: diff-sizes.png --]
[-- Type: image/png, Size: 8157 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Our merge bases sometimes suck
  2014-06-12 22:12 Our merge bases sometimes suck Michael Haggerty
@ 2014-06-13  9:38 ` Michael J Gruber
  2014-06-13 10:13   ` Michael Haggerty
  2014-06-13 15:52   ` Jakub Narębski
  2014-06-17 15:08 ` Junio C Hamano
  1 sibling, 2 replies; 12+ messages in thread
From: Michael J Gruber @ 2014-06-13  9:38 UTC (permalink / raw)
  To: Michael Haggerty, git discussion list

Michael Haggerty venit, vidit, dixit 13.06.2014 00:12:
> I've been thinking a lot about merge bases lately and think I have
> discovered something interesting.
> 
> tl;dr:
> 
> When two branches have multiple merge bases,
> 
>     git merge-base $master $branch
> 
> picks one merge base more or less arbitrarily.  Here I describe a
> criterion for picking a "best" merge base.  When the best merge base
> is used, the diff output by
> 
>     git diff $master...$branch
> 
> becomes shorter and simpler--sometimes dramatically so.  I have
> quantified the improvement by analyzing historical merges from the Git
> project (see attached image).  Best merge bases might also help reduce
> conflicts when conducting merges.
> 

Everytime I looked at our merge base code, my head started spinning. So
it's admirable you even started this endeavor :)

We use merge bases for several things:

- merging
- resolving "A...B" rev notation (rev-list and friends), aka symmetric
difference
- left/right selection/annotation of commits (rev-list/log)
- resolving "diff A...B", which may be a handy notation, but is horribly
misleading because it is a very unsymmetric form of diff

Out of these four, you seemingly picked the one as pivoting example
which is most different from any symmetric notion of "...". But in fact,
"merging" is quite unsymmetric, too, because merging branch into master
is quite different from the other way round.

This is certainly obvious to you, bit I thought I point it out for the
convenience of other readers: You are after optimizing the merge base
choice for an unsymmetric use case, which covers "diff A...B" as well as
"merge B into A".

In these two cases, choosing different merge bases can lead to different
results. The other two basically use all merge bases anyways.

(I seem to recall that our cherry code could profit from improvements, too.)

> Background
> ==========
> 
> A merge base of two or more branches is a common ancestor of those
> branches that has no descendant that is also a common ancestor (see
> `git-merge-base(1)`).  Merge bases are important when merging
> branches, and are also used internally to some other commands like
> "git diff $master...$branch".
> 
> Sometimes branches have multiple merge bases, such as in the case of a
> criss-cross merge:
> 
>     o--A--o--o--X--o--Y--o--Z    <- master
>         \        \   /
>          \        \ /
>           \        ·
>            \      / \
>             \    /   \
>              B--C--D--E--F       <- branch
> 
> Here, commits X and C are both merge bases for branches master and
> branch.  You can list all merge bases by running
> 
>     git merge-base --all $master $branch
> 
> But if you run
> 
>     git merge-base $master $branch
> 
> only one merge base is returned.  *Which* one it returns is not
> documented and appears to be pretty arbitrary (probably a side effect
> of the traversal order?)
> 
> 
> The "best" merge base
> =====================
> 
> But not all merge bases are created equal.  It is possible to define a
> "best" merge base that has some nice properties.
> 
> Let's focus on the command
> 
>     git diff $master...$branch
> 
> which is equivalent to
> 
>     git diff $(git merge-base $master $branch)..$branch
> 
> We want such a diff to have two properties:
> 
> 1. It must include *all* changes that have been made on the branch but
>    are not yet contained in master.
> 
> 2. It should contain *as few* changes as possible that are already in
>    master.
> 
> The first property is satisfied automatically if the left end of the
> "diff" range is any merge base.  Because a merge base is an ancestor
> of master, it cannot possibly include any changes that were made on
> the branch but have not yet been merged to master [1].
> 
> The second property requires as a *necessary* condition that the left
> end of the diff is a merge base.  But the property also helps us pick
> the best among multiple merge bases.  We just need to make the idea of
> "contains as few changes as possible" more precise.
> 
> I propose that the best merge base is the merge base "candidate" that
> minimizes the number of non-merge commits that are in
> 
>     git rev-list --no-merges $candidate..$branch
> 
> but are already in master:
> 
>     git rev-list --no-merges $master
> 
> Since a non-merge commit should embody a single logical change,
> counting non-merge commits is in some sense counting changes [2].
> 
> We can put this criterion in simpler form.  Because every candidate is
> a merge-base,
> 
>     git rev-list --no-merges $candidate..$branch
> 
> necessarily includes *all* of the non-merge commits that are on branch
> but not on master. 

This is actually true for every merge base candidate (i.e. every commit
which is both on master and branch).

[These things are really much easier in set language, A..B being B
\setminus A, being B intersected with the complement of A etc.]

> This is a fixed number of commits, the same for
> every candidate.

"This" refers to "branch minus master" (commits on branch but not on
master), which is defined indepently of candidate.

>  It *additionally* includes the commits that are on
> master but not yet in branch.  This second number varies from one
> candidate to another.

"master minus branch" is independent of candidate also.

The second number is rather those commits on candidate..branch which are
not "branch minus master", i.e. "branch minus candidate" minus "branch
minus master", which is "branch intersected with master minus
candidate", i.e. those commits on candidate..branch which are also on
master. [Again this is true for every merge base candidate, not just
every merge base.]

>  So if we minimize the number of commits in this
> output, is is the same as minimizing the number of unwanted commits.

Now I understand why this statement is true :)

> Therefore, to get the best merge base, all we have to do is pick the
> merge base that minimizes
> 
>     git rev-list --count --no-merges $candidate..$branch
> 
> There can be ties, but in practice they are rare enough that it is
> probably not worth worrying about them.
> 
> 
> Symmetry; generalization to more than two branches
> ==================================================
>
> Interestingly, minimizing the last criterion is the same as maximizing
> 
>     git rev-list --count --no-merges $candidate
> 
> because there is a fixed number of commits in
> 
>     git rev-list --no-merges $branch
> 
> , and each of those commits is in exactly one of the two counts above.

That's a cute observation, which in the example above is wrong on first
glance, but true on second: Visually/subjectively, one easily misses
commits B and C when counting "X..master", which is not just the commits
"extending master beyond X", etc.

> This formulation shows that the best merge base for computing
> 
>     git diff $master...$branch
> 
> is also the best merge base for computing
> 
>     git diff $branch...$master
> 
> ; i.e., the best merge base is symmetric in its arguments.  It also
> shows that the concept of "best merge base" can trivially be
> generalized to more than two branches.

That symmetry is really surprising, but the argument is convincingly
correct. Alas, that count back to the roots can be really expensive.

Given that the arguments in the previous argument show that everything
applies to merge base candidates as well, not just merge bases, I'm
wondering whether we can weave in the weighing (which one is best) with
our existing merge base algorithm or maybe an alternative algorithm
using generation numbers or such.

> git-best-merge-base script
> ===========================
> 
> The best merge base can be computed, for example, using the following
> script:
> 
>     #! /bin/sh
>     # Usage: git-best-merge-base MASTER BRANCH
> 
>     set -e
> 
>     master="$1"
>     branch="$2"
> 
>     count() {
>         git rev-list --no-merges --count "$1..$branch"
>     }
> 
>     # Note that if $master and $branch have no common ancestors, `git
>     # merge-base` fails with retcode=1, causing the script to exit
>     # with the same error code.
>     merge_bases=$(git merge-base --all "$master" "$branch")
> 
>     case "${#merge_bases}" in
>     40)
>         # One merge base -> use it.
>         echo $merge_bases
>         ;;
>     *)
>         # Multiple merge bases -> pick the one for which $base..$branch
>         # has the fewest commits that are already in $master.  To ensure
>         # that the result is repeatable, if there is a tie we choose the
>         # merge base that comes first when sorted by SHA-1:
>         for merge_base in $merge_bases
>         do
>             echo $(count $merge_base) $merge_base
>         done | sort -k1,1n -k2 | sed -ne '1s/^.* //p'
>         ;;
>     esac
> 
> 
> Do best merge bases really help?
> ================================
> 
> I analyzed all of the 2-parent merges made in the history of the Git
> project.  (I limited the analysis to 2-parent merges just for
> simplicity.)  I computed the size of the asymmetric diffs
> 
>     git diff $commit^1...$commit^2
> 
> using the current code and using git-best-merge-base, with the
> following script:
> 
>     #! /bin/sh
> 
>     old_diff_len() {
>         git diff $1...$2 | wc -l
>     }
> 
>     new_diff_len() {
>         git diff $(git-best-merge-base $1 $2)..$2 | wc -l
>     }
> 
>     git rev-list --min-parents=2 --max-parents=2 --parents "$@" |
>     while read commit p1 p2
>     do
>         echo "$commit $(git merge-base --all $p1 $p2 | wc -l)
> $(old_diff_len $p1 $p2) $(new_diff_len $p1 $p2)"
>     done
> 
> 
> Results:
> 
> * Total number of merges: 8263
> 
> * Total number of merges with exactly two parents: 8229
> 
>   Of which:
> 
>   * Number with zero merge bases: 6 (0.07%)
> 
>   * Number with one merge base: 7823 (95.1%)
> 
>   * Number with multiple merge bases: 400 (4.8%)

I guess that shows how well thought out our maintainer's branch and
merge workflow is: practically no criss-crossing, everything merged
downwards, for almost constant definition of "down" ;)

>     Of which:
> 
>     * Number whose diffs shrink: 71 (17.8%)
> 
>     * Number whose diffs remain the same length: 323 (80.1%)
> 
>     * Number whose diffs grow: 6 (1.5%)
> 
> I have attached a graph plotting the diff sizes against each other on
> a log-log scale.  The points *under* the red line are diffs that have
> shrunk; the points *over* the red line are the diffs that have grown.
> As you can see, far more diffs shrank than grew, and by larger
> factors.  Some of the diffs have shrunk dramatically.  In the most
> drastic case, the diff shrank from 466602 lines to 81 lines.
> 
> A relatively small fraction of all merges are affected, but of merges
> that have multiple merge bases (which are the most difficult merges to
> handle), more than one in six would be simplified.
> 
> I should mention that I have done a similar analysis of a large
> commercial software project's history, with broadly similar results.
> 
> I wouldn't be surprised if selecting merge bases more intelligently in
> "git merge" also helps make more merges go through without conflicts.
> 
> 
> Suggested next steps
> ====================
> 
> I don't have time right now to work more on best merge bases, and I am
> not very familiar with the parts of the code that would be involved.
> So I'm hoping that somebody else finds this a worthwhile improvement
> and tries implementing best merge bases in git core.  I have the
> following suggestions:
> 
> * Add an option
> 
>       git merge-base --best <commit> <commit>...
> 
>   that computes the best merge base as described above.  Its output
>   can be tested against the git-best-merge-base script that I
>   provided.
> 
> * Use the best merge base when computing
> 
>       git diff <commit>...<commit>
> 
>   This should give shorter differences in many cases.
> 
> * Benchmark "merge-base" with the "--best" option and, if it is not
>   too expensive, perhaps make it the default behavior of "merge-base".
> 
> * See whether best merge bases can be used to improve other parts of
>   the code, especially the code for the recursive merge strategy.
> 
> Michael
> 
> [1] I assume that commits make independent changes.  If some commits
>     in fact revert changes made by other commits, or some commits have
>     been cherry-picked across branches, then the correlation between
>     number of commits and size of diff will be a bit weaker but should
>     still be a useful approximation.
> 
> [2] There are other ways that "fewest changes" could be defined.  We
>     could include merge commits in the count of extra commits.  (This
>     variant could be implemented by omitting `--no-merges` in the
>     `rev-list` commands above.)  Or, when diffing only text files, we
>     could try to minimize the overall textual length of the diff.  The
>     definition used in the main text is chosen for being reasonable,
>     general, symmetric, and not too expensive to compute.
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Our merge bases sometimes suck
  2014-06-13  9:38 ` Michael J Gruber
@ 2014-06-13 10:13   ` Michael Haggerty
  2014-06-13 15:52   ` Jakub Narębski
  1 sibling, 0 replies; 12+ messages in thread
From: Michael Haggerty @ 2014-06-13 10:13 UTC (permalink / raw)
  To: Michael J Gruber, git discussion list

On 06/13/2014 11:38 AM, Michael J Gruber wrote:
> Michael Haggerty venit, vidit, dixit 13.06.2014 00:12:
>> I've been thinking a lot about merge bases lately and think I have
>> discovered something interesting.
>>
>> tl;dr:
>>
>> When two branches have multiple merge bases,
>>
>>     git merge-base $master $branch
>>
>> picks one merge base more or less arbitrarily.  Here I describe a
>> criterion for picking a "best" merge base.  When the best merge base
>> is used, the diff output by
>>
>>     git diff $master...$branch
>>
>> becomes shorter and simpler--sometimes dramatically so.  I have
>> quantified the improvement by analyzing historical merges from the Git
>> project (see attached image).  Best merge bases might also help reduce
>> conflicts when conducting merges.
>>
> 
> Everytime I looked at our merge base code, my head started spinning. So
> it's admirable you even started this endeavor :)

I haven't looked at the code :-)  My email came only out of reading docs
and experimenting.

> We use merge bases for several things:
> 
> - merging
> - resolving "A...B" rev notation (rev-list and friends), aka symmetric
> difference
> - left/right selection/annotation of commits (rev-list/log)
> - resolving "diff A...B", which may be a handy notation, but is horribly
> misleading because it is a very unsymmetric form of diff
> 
> Out of these four, you seemingly picked the one as pivoting example
> which is most different from any symmetric notion of "...". But in fact,
> "merging" is quite unsymmetric, too, because merging branch into master
> is quite different from the other way round.
> 
> This is certainly obvious to you, bit I thought I point it out for the
> convenience of other readers: You are after optimizing the merge base
> choice for an unsymmetric use case, which covers "diff A...B" as well as
> "merge B into A".
> 
> In these two cases, choosing different merge bases can lead to different
> results. The other two basically use all merge bases anyways.

Correct, the scope of my proposal is for cases when there are multiple
merge bases but one has to be selected.

> [...]
>> I propose that the best merge base is the merge base "candidate" that
>> minimizes the number of non-merge commits that are in
>>
>>     git rev-list --no-merges $candidate..$branch
>>
>> but are already in master:
>>
>>     git rev-list --no-merges $master
>>
>> Since a non-merge commit should embody a single logical change,
>> counting non-merge commits is in some sense counting changes [2].
>>
>> We can put this criterion in simpler form.  Because every candidate is
>> a merge-base,
>>
>>     git rev-list --no-merges $candidate..$branch
>>
>> necessarily includes *all* of the non-merge commits that are on branch
>> but not on master. 
> 
> This is actually true for every merge base candidate (i.e. every commit
> which is both on master and branch).
> 
> [These things are really much easier in set language, A..B being B
> \setminus A, being B intersected with the complement of A etc.]
> 
>> This is a fixed number of commits, the same for
>> every candidate.
> 
> "This" refers to "branch minus master" (commits on branch but not on
> master), which is defined indepently of candidate.
> 
>>  It *additionally* includes the commits that are on
>> master but not yet in branch.  This second number varies from one
>> candidate to another.
> 
> "master minus branch" is independent of candidate also.
> 
> The second number is rather those commits on candidate..branch which are
> not "branch minus master", i.e. "branch minus candidate" minus "branch
> minus master", which is "branch intersected with master minus
> candidate", i.e. those commits on candidate..branch which are also on
> master. [Again this is true for every merge base candidate, not just
> every merge base.]
> 
>>  So if we minimize the number of commits in this
>> output, is is the same as minimizing the number of unwanted commits.
> 
> Now I understand why this statement is true :)

Thanks for the translation to set notation.  It is indeed easier to work
with.

>> Therefore, to get the best merge base, all we have to do is pick the
>> merge base that minimizes
>>
>>     git rev-list --count --no-merges $candidate..$branch
>>
>> There can be ties, but in practice they are rare enough that it is
>> probably not worth worrying about them.
>>
>>
>> Symmetry; generalization to more than two branches
>> ==================================================
>>
>> Interestingly, minimizing the last criterion is the same as maximizing
>>
>>     git rev-list --count --no-merges $candidate
>>
>> because there is a fixed number of commits in
>>
>>     git rev-list --no-merges $branch
>>
>> , and each of those commits is in exactly one of the two counts above.
> 
> That's a cute observation, which in the example above is wrong on first
> glance, but true on second: Visually/subjectively, one easily misses
> commits B and C when counting "X..master", which is not just the commits
> "extending master beyond X", etc.
> 
>> This formulation shows that the best merge base for computing
>>
>>     git diff $master...$branch
>>
>> is also the best merge base for computing
>>
>>     git diff $branch...$master
>>
>> ; i.e., the best merge base is symmetric in its arguments.  It also
>> shows that the concept of "best merge base" can trivially be
>> generalized to more than two branches.
> 
> That symmetry is really surprising, but the argument is convincingly
> correct. Alas, that count back to the roots can be really expensive.

I mentioned the formulation

    git rev-list --count --no-merges $candidate

not because it is necessarily the most efficient way to find the best
merge base, but because it is theoretically useful.  For the
implementation, any of the formulations can be used, because they are
all equivalent.

> Given that the arguments in the previous argument show that everything
> applies to merge base candidates as well, not just merge bases, I'm
> wondering whether we can weave in the weighing (which one is best) with
> our existing merge base algorithm or maybe an alternative algorithm
> using generation numbers or such.
> [...]

That's an interesting idea.  I haven't looked at that code at all, so if
you want to get involved it would be awesome!

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Our merge bases sometimes suck
  2014-06-13  9:38 ` Michael J Gruber
  2014-06-13 10:13   ` Michael Haggerty
@ 2014-06-13 15:52   ` Jakub Narębski
  2014-06-13 22:14     ` Michael Haggerty
  2014-06-13 22:35     ` Junio C Hamano
  1 sibling, 2 replies; 12+ messages in thread
From: Jakub Narębski @ 2014-06-13 15:52 UTC (permalink / raw)
  To: Michael J Gruber, Michael Haggerty, git discussion list

W dniu 2014-06-13 11:38, Michael J Gruber pisze:
> Michael Haggerty venit, vidit, dixit 13.06.2014 00:12:
>> I've been thinking a lot about merge bases lately and think I have
>> discovered something interesting.
>>
>> tl;dr:
>>
>> When two branches have multiple merge bases,
>>
>>      git merge-base $master $branch
>>
>> picks one merge base more or less arbitrarily.  Here I describe a
>> criterion for picking a "best" merge base.  When the best merge base
>> is used, the diff output by
>>
>>      git diff $master...$branch
>>
>> becomes shorter and simpler--sometimes dramatically so.  I have
>> quantified the improvement by analyzing historical merges from the Git
>> project (see attached image).  Best merge bases might also help reduce
>> conflicts when conducting merges.
>>
>
> Everytime I looked at our merge base code, my head started spinning. So
> it's admirable you even started this endeavor :)
>
> We use merge bases for several things:
>
> - merging
> - resolving "A...B" rev notation (rev-list and friends), aka symmetric
> difference
> - left/right selection/annotation of commits (rev-list/log)
> - resolving "diff A...B", which may be a handy notation, but is horribly
> misleading because it is a very unsymmetric form of diff

I don't know if it has been fixed, but there is a difference
between "git diff A...B" when A and B have one merge base, and
"git diff A...B" when there are more than one merge base.

When there is one merge base, "git diff A...B" returns simple
unified diff equivalent to "git diff $(git merge-base A B) B".
It is unsymmetric.

But where there are more than one merge base, by design or by
accident for "git diff A...B" git 1.9.2 / 1.7.4 returns

    git diff --cc $(git merge-base --all A B) A B

which is *symmetric*, and is combined not unified diff.

-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Our merge bases sometimes suck
  2014-06-13 15:52   ` Jakub Narębski
@ 2014-06-13 22:14     ` Michael Haggerty
  2014-06-13 22:35     ` Junio C Hamano
  1 sibling, 0 replies; 12+ messages in thread
From: Michael Haggerty @ 2014-06-13 22:14 UTC (permalink / raw)
  To: Jakub Narębski, Michael J Gruber, git discussion list

On 06/13/2014 05:52 PM, Jakub Narębski wrote:
> I don't know if it has been fixed, but there is a difference
> between "git diff A...B" when A and B have one merge base, and
> "git diff A...B" when there are more than one merge base.
> 
> When there is one merge base, "git diff A...B" returns simple
> unified diff equivalent to "git diff $(git merge-base A B) B".
> It is unsymmetric.
> 
> But where there are more than one merge base, by design or by
> accident for "git diff A...B" git 1.9.2 / 1.7.4 returns
> 
>    git diff --cc $(git merge-base --all A B) A B
> 
> which is *symmetric*, and is combined not unified diff.

Thanks for the information.  This doesn't seem to be the case in git
2.0.0.  For example, commit 8c0db2f519 in the git project is a merge
with two merge bases:

$ git --version
git version 2.0.0
$ c=8c0db2f5193153ea8a51bb45b0512c5a3889023b
$ git merge-base --all $c^1 $c^2
a15f43312f6962959e8daa33df0cf5357852a4b8
9a0e6731c632c841cd2de9dec0b9091b2f10c6fd
$ git diff $c^1...$c^2 | head -20
diff --git a/Documentation/git-rev-parse.txt
b/Documentation/git-rev-parse.txt
index d638bfc..29b5789 100644
--- a/Documentation/git-rev-parse.txt
+++ b/Documentation/git-rev-parse.txt
@@ -77,6 +77,14 @@ OPTIONS
 	path of the top-level directory relative to the current
 	directory (typically a sequence of "../", or an empty string).

+--git-dir::
+	Show `$GIT_DIR` if defined else show the path to the .git directory.
+
+--short, --short=number::
+	Instead of outputting the full SHA1 values of object names try to
+	abbriviate them to a shorter unique name. When no length is specified
+	7 is used. The minimum length is 4.
+
 --since=datestring, --after=datestring::
 	Parses the date string, and outputs corresponding
 	--max-age= parameter for git-rev-list command.
diff --git a/git-archimport.perl b/git-archimport.perl

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: Our merge bases sometimes suck
  2014-06-13 15:52   ` Jakub Narębski
  2014-06-13 22:14     ` Michael Haggerty
@ 2014-06-13 22:35     ` Junio C Hamano
  1 sibling, 0 replies; 12+ messages in thread
From: Junio C Hamano @ 2014-06-13 22:35 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Michael J Gruber, Michael Haggerty, git discussion list

Jakub Narębski <jnareb@gmail.com> writes:

> I don't know if it has been fixed, but there is a difference
> between "git diff A...B" when A and B have one merge base, and
> "git diff A...B" when there are more than one merge base.
>
> When there is one merge base, "git diff A...B" returns simple
> unified diff equivalent to "git diff $(git merge-base A B) B".
> It is unsymmetric.
>
> But where there are more than one merge base, by design or by
> accident for "git diff A...B" git 1.9.2 / 1.7.4 returns
>
>    git diff --cc $(git merge-base --all A B) A B
>
> which is *symmetric*, and is combined not unified diff.

Hmph, the relevant code has not changed very much since c008c0ff
(diff A...B: give one possible diff when there are more than one
merge-base, 2010-07-12) which has been with us since v1.7.2.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Our merge bases sometimes suck
  2014-06-12 22:12 Our merge bases sometimes suck Michael Haggerty
  2014-06-13  9:38 ` Michael J Gruber
@ 2014-06-17 15:08 ` Junio C Hamano
  2014-06-17 15:44   ` Michael Haggerty
  1 sibling, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2014-06-17 15:08 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git discussion list

Michael Haggerty <mhagger@alum.mit.edu> writes:

> The "best" merge base
> =====================
>
> But not all merge bases are created equal.  It is possible to define a
> "best" merge base that has some nice properties.
>
> Let's focus on the command
>
>     git diff $master...$branch
>
> which is equivalent to
>
>     git diff $(git merge-base $master $branch)..$branch
> ...
> I propose that the best merge base is the merge base "candidate" that
> minimizes the number of non-merge commits that are in
>
>     git rev-list --no-merges $candidate..$branch
>
> but are already in master:
>
>     git rev-list --no-merges $master

I welcome this line of thought very much.

There is one niggle I find somewhat curious but am either too lazy
or too stupid to think it through myself ;-)

The "merge-base" is a symmetric operation, because the three-way
merge, which is the primary customer of its result, fundamentally
is.  From your description, it sounds like the "best" merge base
however may not be symmetric at all.  The merge-base between A and B
that makes "git diff A...B" the easiest to read by minimizing the
distance between it and B may be different from the merge-base
between A and B that makes the other diff "git diff B...A" the
easiest to read.

Or it may not be assymmetric---that is why I said I didn't think it
through.  I am not saying that it is bad if the "best" merge-base is
an asymmetric concept; I am curious if it is asymmetric, and if so
if that is fundamental.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Our merge bases sometimes suck
  2014-06-17 15:08 ` Junio C Hamano
@ 2014-06-17 15:44   ` Michael Haggerty
  2014-06-20  6:53     ` Junio C Hamano
  0 siblings, 1 reply; 12+ messages in thread
From: Michael Haggerty @ 2014-06-17 15:44 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git discussion list

On 06/17/2014 05:08 PM, Junio C Hamano wrote:
> Michael Haggerty <mhagger@alum.mit.edu> writes:
> 
>> The "best" merge base
>> =====================
>>
>> But not all merge bases are created equal.  It is possible to define a
>> "best" merge base that has some nice properties.
>>
>> Let's focus on the command
>>
>>     git diff $master...$branch
>>
>> which is equivalent to
>>
>>     git diff $(git merge-base $master $branch)..$branch
>> ...
>> I propose that the best merge base is the merge base "candidate" that
>> minimizes the number of non-merge commits that are in
>>
>>     git rev-list --no-merges $candidate..$branch
>>
>> but are already in master:
>>
>>     git rev-list --no-merges $master
> 
> I welcome this line of thought very much.
> 
> There is one niggle I find somewhat curious but am either too lazy
> or too stupid to think it through myself ;-)
> 
> The "merge-base" is a symmetric operation, because the three-way
> merge, which is the primary customer of its result, fundamentally
> is.  From your description, it sounds like the "best" merge base
> however may not be symmetric at all.  The merge-base between A and B
> that makes "git diff A...B" the easiest to read by minimizing the
> distance between it and B may be different from the merge-base
> between A and B that makes the other diff "git diff B...A" the
> easiest to read.
> 
> Or it may not be assymmetric---that is why I said I didn't think it
> through.  I am not saying that it is bad if the "best" merge-base is
> an asymmetric concept; I am curious if it is asymmetric, and if so
> if that is fundamental.

It just looks asymmetric, but actually it is symmetric, which was kindof
surprising when I realized it.  The argument is in the next section
"Symmetry; generalization to more than two branches".  Michael Gruber
showed the same thing upthread using set notation, which is easier to
follow.  Here is his argument in symbolic notation.  We want to minimize

    N = |(branch - candidate) ∧ master|

where "branch" represents the set of all commits in "branch" etc, "|x|"
represents the number of elements in set "x", and "∧" is set
intersection, and candidate is a merge base of branch and master.

    N = |(branch ∧ ∼candidate) ∧ master|
      = |(branch ∧ master) ∧ ∼candidate|

Since candidate is a common ancestor of branch and master,

    candidate ⊆ branch ∧ master

so we have

    N = |branch ∧ master| - |candidate|

Since "|branch ∧ master|" is the same for all candidates, minimizing N
is the same as maximizing |candidate|, which is the same as

    git rev-list --count --no-merges $candidate

.  This is clearly symmetric in master vs. base.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Our merge bases sometimes suck
  2014-06-17 15:44   ` Michael Haggerty
@ 2014-06-20  6:53     ` Junio C Hamano
  2014-06-20  8:53       ` Michael Haggerty
  2014-06-20 21:17       ` Nico Williams
  0 siblings, 2 replies; 12+ messages in thread
From: Junio C Hamano @ 2014-06-20  6:53 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git discussion list

Michael Haggerty <mhagger@alum.mit.edu> writes:

> It just looks asymmetric, but actually it is symmetric, which was kindof
> surprising when I realized it....
>
> Since "|branch ∧ master|" is the same for all candidates, minimizing N
> is the same as maximizing |candidate|, which is the same as
>
>     git rev-list --count --no-merges $candidate
>
> This is clearly symmetric in master vs. base.

Hmph, but that obviously will become very expensive to compute as
project grows.

When we (potentially) have multiple merge-bases, after finding all
the candidates by traversing from the two commits to be merged, we
already make another set of traversals, starting from the candidates
and painting the ancestors down to their common ancestors.  This is
done to discover if each candidate is reachable from any other
candidate (in which case the reachable one is not a merge-base).

The resulting graph of this traversal is currently used only to cull
non-merge-bases out of the candidates, but I wonder if you can
*count* the nodes in it in each color and use that number (which is
essentially the number of commits that can be reached only from one
candidate and not from other candidates) to derive a score for each
candidate, and use it to assess the goodness of merge-bases, just
like the number you are counting in the above full traversal.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Our merge bases sometimes suck
  2014-06-20  6:53     ` Junio C Hamano
@ 2014-06-20  8:53       ` Michael Haggerty
  2014-06-20 21:17       ` Nico Williams
  1 sibling, 0 replies; 12+ messages in thread
From: Michael Haggerty @ 2014-06-20  8:53 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git discussion list

On 06/20/2014 08:53 AM, Junio C Hamano wrote:
> Michael Haggerty <mhagger@alum.mit.edu> writes:
> 
>> It just looks asymmetric, but actually it is symmetric, which was kindof
>> surprising when I realized it....
>>
>> Since "|branch ∧ master|" is the same for all candidates, minimizing N
>> is the same as maximizing |candidate|, which is the same as
>>
>>     git rev-list --count --no-merges $candidate
>>
>> This is clearly symmetric in master vs. base.
> 
> Hmph, but that obviously will become very expensive to compute as
> project grows.

This formulation is theoretically interesting because it shows the
symmetry of the criterion, but that doesn't mean it is the most
practical to use.  Given that multiple formulations are equivalent, any
of them can be used.

> When we (potentially) have multiple merge-bases, after finding all
> the candidates by traversing from the two commits to be merged, we
> already make another set of traversals, starting from the candidates
> and painting the ancestors down to their common ancestors.  This is
> done to discover if each candidate is reachable from any other
> candidate (in which case the reachable one is not a merge-base).
> 
> The resulting graph of this traversal is currently used only to cull
> non-merge-bases out of the candidates, but I wonder if you can
> *count* the nodes in it in each color and use that number (which is
> essentially the number of commits that can be reached only from one
> candidate and not from other candidates) to derive a score for each
> candidate, and use it to assess the goodness of merge-bases, just
> like the number you are counting in the above full traversal.

It sounds promising.  Let's see if I correctly understand the algorithm
that you described:

"common_ancestors" = intersection of all candidates' histories
"painting_from($candidate)" = $candidate - common_ancestors
discard $candidate if $candidate is in painting_from($other_candidate)
(i.e., it is not a merge base)

If that is correct, then the candidate with the most commits in
painting_from($candidate) should be the best merge base, because
common_ancestors is a subset of the candidate's history, so

    |painting_from($candidate)| = |$candidate| - |common_ancestors|

Since |common_ancestors| is the same for all candidates, minimizing
|painting_from($candidate)| is the same as minimizing |$candidate|.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Our merge bases sometimes suck
  2014-06-20  6:53     ` Junio C Hamano
  2014-06-20  8:53       ` Michael Haggerty
@ 2014-06-20 21:17       ` Nico Williams
  2014-06-23 11:43         ` Jakub Narębski
  1 sibling, 1 reply; 12+ messages in thread
From: Nico Williams @ 2014-06-20 21:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Michael Haggerty, git discussion list

On Fri, Jun 20, 2014 at 1:53 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Michael Haggerty <mhagger@alum.mit.edu> writes:
>> [...]
>
> Hmph, but that obviously will become very expensive to compute as
> project grows.

That's the main reason to like Fossil's approach (namely, the use of
SQL, specifically SQLite3): you can write declarative queries and let
the query planner optimize.

(That's also about the only thing I like about Fossil: the use of
SQL/SQLite3, and all that that implies.  Fossil is otherwise an
anti-git, and that makes it useless for me.)

Pardon the intrusion,

Nico
--

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Our merge bases sometimes suck
  2014-06-20 21:17       ` Nico Williams
@ 2014-06-23 11:43         ` Jakub Narębski
  0 siblings, 0 replies; 12+ messages in thread
From: Jakub Narębski @ 2014-06-23 11:43 UTC (permalink / raw)
  To: Nico Williams, Junio C Hamano; +Cc: Michael Haggerty, git discussion list

W dniu 2014-06-20 23:17, Nico Williams pisze:
> On Fri, Jun 20, 2014 at 1:53 AM, Junio C Hamano <gitster@pobox.com> wrote:
>> Michael Haggerty <mhagger@alum.mit.edu> writes:
>>> [...]
>>
>> Hmph, but that obviously will become very expensive to compute as
>> project grows.
>
> That's the main reason to like Fossil's approach (namely, the use of
> SQL, specifically SQLite3): you can write declarative queries and let
> the query planner optimize.

I do wonder if using *generic* RDBMS / key-value store would really
make Git faster than current filesystem-like approach.

BTW. perhaps caching generation numbers would help here; the idea
of adding performance-helper generation number header to commit
object was rejected.

> (That's also about the only thing I like about Fossil: the use of
> SQL/SQLite3, and all that that implies.  Fossil is otherwise an
> anti-git, and that makes it useless for me.)

There are Bazaar and Veracity that are supposed to have pluggable
backends...

-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2014-06-23 11:43 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-06-12 22:12 Our merge bases sometimes suck Michael Haggerty
2014-06-13  9:38 ` Michael J Gruber
2014-06-13 10:13   ` Michael Haggerty
2014-06-13 15:52   ` Jakub Narębski
2014-06-13 22:14     ` Michael Haggerty
2014-06-13 22:35     ` Junio C Hamano
2014-06-17 15:08 ` Junio C Hamano
2014-06-17 15:44   ` Michael Haggerty
2014-06-20  6:53     ` Junio C Hamano
2014-06-20  8:53       ` Michael Haggerty
2014-06-20 21:17       ` Nico Williams
2014-06-23 11:43         ` Jakub Narębski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).