The criss-cross merge case

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* The criss-cross merge case
@ 2005-04-27 20:25 Bram Cohen
  2005-04-27 23:32 ` Daniel Barkalow
  0 siblings, 1 reply; 12+ messages in thread
From: Bram Cohen @ 2005-04-27 20:25 UTC (permalink / raw)
  To: git

Here's an example of where simple three-way merge can't do the right
thing. Each letter represents a snapshot of the history, and time goes
downwards. The numbers after some letters refer to which line number was
modified at that time.

A
|\
| \
|  \
|   \
|    \
|     \
|      \
B8      C3
|\     /|
| \   / |
|  \ /  |
|   X   |
|  / \  |
| /   \ |
|/     \|
D8      E3
 \      |
  \     |
   \    |
    \   |
     \  |
      \ |
       \|
        ?

In this case the ? should have a clean merge with the D vesion of line 8
(because it was made with the B version already in the history) and the E
version of line 3 (because it was made with the C version already in the
history).

The problem is that there's no single ancestor for the three-way merge
which does the right thing. If one picks B, then there will be an
unnecessary merge conflict at line 3, because D will have the C version
and E will have the E version but B will have neither. Likewise if one
picks C, there will be an unnecessary conflict at line 8 because D will
have the D version and E will have the B version but C will have neither.
Picking A will cause unnecessary conflicts on *both* lines.

The problem can actually be much worse than a simple unnecesary conflict,
because if the later updates were strict undos of the earlier updates,
then picking either B or C will merge something *wrong*. Using A as the
ancestor will keep that from happening, but it also maximizes unnecessary
conflicts.

Note that the above criss-cross case only involves two branches, using the
methodology of each one modifying their own section and pulling in old
versions of the other one from time to time. Cogito's interface encourages
exactly this work flow, which is not a bad thing from a work flow
perspective, but does make it hit this case regularly.

The way Git handles this currently is very bad, because it forces the
common ancestor to be from the same snapshot across all files, so this
problem will happen if the modifications are made even in different files,
not just different lines within the same file. That could be improved
greatly by finding an LCA for each file individually, which is what
Monotone does. Darcs, Codeville, and all the Arch descendants have better
merge algorithms which don't have to pick a single common ancestor.

-Bram

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: The criss-cross merge case
  2005-04-27 20:25 The criss-cross merge case Bram Cohen
@ 2005-04-27 23:32 ` Daniel Barkalow
  2005-04-28  0:43   ` Tupshin Harper
  2005-04-28  3:41   ` suffix array/tree deltas (Was: The criss-cross merge case) Zed A. Shaw
  0 siblings, 2 replies; 12+ messages in thread
From: Daniel Barkalow @ 2005-04-27 23:32 UTC (permalink / raw)
  To: Bram Cohen; +Cc: git

On Wed, 27 Apr 2005, Bram Cohen wrote:

> The way Git handles this currently is very bad, because it forces the
> common ancestor to be from the same snapshot across all files, so this
> problem will happen if the modifications are made even in different files,
> not just different lines within the same file. That could be improved
> greatly by finding an LCA for each file individually, which is what
> Monotone does.

The git core is perfectly sufficient for getting all LCAs or
per-file best LCAs; merge-base doesn't bother, currently, because the
deficiencies of "merge" (a.k.a. diff3) are worse than the issues with
chosing a suboptimal LCA.

My plan is to implement multi-file diff and merge with a suffix tree-based
algorithm, and then revisit the history stuff once we have a merger that
can do sensible things with this information.

Note that the present very bad merger is actually seems to be sufficient
for the Linux kernel, where patches from different sides of a merge are
generally either unrelated or are identical, and, otherwise, they tend to
be true conflicts where people fixed the same bug independantly in
different ways.

> Darcs, Codeville, and all the Arch descendants have better merge
> algorithms which don't have to pick a single common ancestor.

I've been looking at Darcs (which seems to have a good method, although I
think the underlying diff isn't great), and Codeville still doesn't have
any documentation. Arch's method is strictly weaker than 3-way merge, and
generates more rejects (not even conflicts) in my experience than even
CVS. 

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: The criss-cross merge case
  2005-04-27 23:32 ` Daniel Barkalow
@ 2005-04-28  0:43   ` Tupshin Harper
  2005-04-28  1:16     ` Daniel Barkalow
  2005-04-28  3:41   ` suffix array/tree deltas (Was: The criss-cross merge case) Zed A. Shaw
  1 sibling, 1 reply; 12+ messages in thread
From: Tupshin Harper @ 2005-04-28  0:43 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: Bram Cohen, git

Daniel Barkalow wrote:

>I've been looking at Darcs (which seems to have a good method, although I
>think the underlying diff isn't great), and Codeville still doesn't have
>any documentation. Arch's method is strictly weaker than 3-way merge, and
>generates more rejects (not even conflicts) in my experience than even
>CVS. 
>
>	-Daniel
>  
>
Can you clarify what you mean by darcs' underlying diff not being that
great? It seems to function pretty much identically to gnu diff. In what
way would you want the underlying diff to be improved?

-Tupshin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: The criss-cross merge case
  2005-04-28  0:43   ` Tupshin Harper
@ 2005-04-28  1:16     ` Daniel Barkalow
  2005-04-28  2:15       ` Benedikt Schmidt
  0 siblings, 1 reply; 12+ messages in thread
From: Daniel Barkalow @ 2005-04-28  1:16 UTC (permalink / raw)
  To: Tupshin Harper; +Cc: Bram Cohen, git

On Wed, 27 Apr 2005, Tupshin Harper wrote:

> Can you clarify what you mean by darcs' underlying diff not being that
> great? It seems to function pretty much identically to gnu diff. In what
> way would you want the underlying diff to be improved?

GNU diff uses an algorithm which is tuned to handle finding the shortest
diff among a large set of similar-length alternatives while comparing
files which have a lot of repeated lines. The author of the paper it cites
is really thinking about diffing DNA sequences or similar things. It also
can't detect content moves, which are a common thing to have, and which
will be important in the long run, when we're trying to track
modifications to content which also moved from place to place.

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: The criss-cross merge case
  2005-04-28  1:16     ` Daniel Barkalow
@ 2005-04-28  2:15       ` Benedikt Schmidt
  2005-04-28  2:19         ` Daniel Barkalow
  0 siblings, 1 reply; 12+ messages in thread
From: Benedikt Schmidt @ 2005-04-28  2:15 UTC (permalink / raw)
  To: git

Daniel Barkalow <barkalow@iabervon.org> writes:

> On Wed, 27 Apr 2005, Tupshin Harper wrote:
>
>> Can you clarify what you mean by darcs' underlying diff not being that
>> great? It seems to function pretty much identically to gnu diff. In what
>> way would you want the underlying diff to be improved?
>
> GNU diff uses an algorithm which is tuned to handle finding the shortest
> diff among a large set of similar-length alternatives while comparing
> files which have a lot of repeated lines. The author of the paper it cites
> is really thinking about diffing DNA sequences or similar things.

AFAIK the paper mentioned in the GNU diff sources [1] is an improvement
to an earlier paper by the same author titled
"A File Comparison Program" - Miller, Myers - 1985.

Can you be more specific why the algorithm is a bad choice (performance,
quality of diff output)?

> It also can't detect content moves, which are a common thing to have, and
> which will be important in the long run, when we're trying to track
> modifications to content which also moved from place to place.

Ok, darcs doesn't handle block moves, so there is no need for an algorithm that
supports them (yet). Is there any free SCM that has support for block moves at
the moment? It seems like clearcase detects them, but I don't know where it
takes advantage of it.

Benedikt

[1] http://citeseer.ist.psu.edu/myers86ond.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: The criss-cross merge case
  2005-04-28  2:15       ` Benedikt Schmidt
@ 2005-04-28  2:19         ` Daniel Barkalow
  2005-04-28 11:16           ` David Roundy
  0 siblings, 1 reply; 12+ messages in thread
From: Daniel Barkalow @ 2005-04-28  2:19 UTC (permalink / raw)
  To: Benedikt Schmidt; +Cc: git

On Thu, 28 Apr 2005, Benedikt Schmidt wrote:

> AFAIK the paper mentioned in the GNU diff sources [1] is an improvement
> to an earlier paper by the same author titled
> "A File Comparison Program" - Miller, Myers - 1985.

GNU diff is based on a better algorithm than traditional diff, reportly,
but there are better algorithms still, developed since, at least according
to a brief literature search on Google Scholar. (bdiff and vdelta, for
example, which can identify block moves as well.)

> Can you be more specific why the algorithm is a bad choice (performance,
> quality of diff output)?

I suspect that the speed is suboptimal (for the cases under which it is
actually used). The quality of the output is about ideal, lacking a
representation for block moves, but I'm hoping to have a diff/merge set
that handles block moves effectively, even if it can't report them in diff
format. I'm also hoping for an annotate function that could use block
moves.

> Ok, darcs doesn't handle block moves, so there is no need for an algorithm that
> supports them (yet). Is there any free SCM that has support for block moves at
> the moment? It seems like clearcase detects them, but I don't know where it
> takes advantage of it.

I would think that darcs would be able to do neat things in its merger if
it knew about block moves. Obviously, it only makes sense to add support
for identifying them and using them at the same time.

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: The criss-cross merge case
  2005-04-28  2:19         ` Daniel Barkalow
@ 2005-04-28 11:16           ` David Roundy
  0 siblings, 0 replies; 12+ messages in thread
From: David Roundy @ 2005-04-28 11:16 UTC (permalink / raw)
  To: git

On Wed, Apr 27, 2005 at 10:19:17PM -0400, Daniel Barkalow wrote:
> On Thu, 28 Apr 2005, Benedikt Schmidt wrote:
> > Ok, darcs doesn't handle block moves, so there is no need for an
> > algorithm that supports them (yet). Is there any free SCM that has
> > support for block moves at the moment? It seems like clearcase detects
> > them, but I don't know whqere it takes advantage of it.
> 
> I would think that darcs would be able to do neat things in its merger if
> it knew about block moves. Obviously, it only makes sense to add support
> for identifying them and using them at the same time.

Indeed, handling block moves would definitely be *very* nice.  An ancient
version of darcs actually did this (it's not in the current darcs history,
since it was so ancient and buggy), although it had a terrible diff
algorithm.  But I really didn't understand the theory back then, and when I
rewrote everything, I never added the block moves back in.  They complicate
conflict situations a bit, and once I found that darcs was actually
useable, I started focusing on other issues.  (Most recently efficiency
issues.)
-- 
David Roundy
http://www.darcs.net

^ permalink raw reply	[flat|nested] 12+ messages in thread

* suffix array/tree deltas (Was: The criss-cross merge case)
  2005-04-27 23:32 ` Daniel Barkalow
  2005-04-28  0:43   ` Tupshin Harper
@ 2005-04-28  3:41   ` Zed A. Shaw
  2005-04-28  4:30     ` Daniel Barkalow
  1 sibling, 1 reply; 12+ messages in thread
From: Zed A. Shaw @ 2005-04-28  3:41 UTC (permalink / raw)
  To: git

On Wed, 2005-04-27 at 19:32 -0400, Daniel Barkalow wrote:
> On Wed, 27 Apr 2005, Bram Cohen wrote:
> 

> My plan is to implement multi-file diff and merge with a suffix tree-based
> algorithm, and then revisit the history stuff once we have a merger that
> can do sensible things with this information.

Hey, that's neat.  I've already implemented two versions of this very
thing with FastCST.  The original used suffix trees, but I found that
there were plenty of pathological cases which chewed memory and
processor.  Most of these cases were large (>1MB) PDF files.  Don't ask
me why PDF drove suffix tree algorithms insane, but they just did.

I recently switched to a suffix array based algorithm which actually
ends up being faster than the suffix tree alternative.  I'm not using
the most recent fastest algorithm and it still compares favorably with
xdelta.

There's tons of weird things about doing a delta based on suffix
arrays/trees, so feel free to pick my brain or the FastCST code if you
attempt it.  The difficult parts turn out to be making the suffix array
and searching for the matching/non-matching regions.  Once you do that
the actual delta algorithm is a simple while loop that keeps doing the
match/non-match detection.

Zed

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: suffix array/tree deltas (Was: The criss-cross merge case)
  2005-04-28  3:41   ` suffix array/tree deltas (Was: The criss-cross merge case) Zed A. Shaw
@ 2005-04-28  4:30     ` Daniel Barkalow
  0 siblings, 0 replies; 12+ messages in thread
From: Daniel Barkalow @ 2005-04-28  4:30 UTC (permalink / raw)
  To: Zed A. Shaw; +Cc: git

On Wed, 27 Apr 2005, Zed A. Shaw wrote:

> On Wed, 2005-04-27 at 19:32 -0400, Daniel Barkalow wrote:
> 
> > My plan is to implement multi-file diff and merge with a suffix tree-based
> > algorithm, and then revisit the history stuff once we have a merger that
> > can do sensible things with this information.
> 
> Hey, that's neat.  I've already implemented two versions of this very
> thing with FastCST.  The original used suffix trees, but I found that
> there were plenty of pathological cases which chewed memory and
> processor.  Most of these cases were large (>1MB) PDF files.  Don't ask
> me why PDF drove suffix tree algorithms insane, but they just did.

I'm not too surprised; but can you hope to merge or compare PDFs
anyway? I'd think that you'd just screw up alignment or something. (Note
that we aren't using deltas for history storage, so we're not interested
in the "compressing multiple versions" aspect of diffs.) I think I want to
punt anything too binary-like and try to find an unambiguous history-based
difference (i.e., there was some commit in the past that replaced one of
the versions with the other; therefore, we want the replacing one).

I'm thinking of line-based compressed suffix trees, with the obvious delta
algorithm: make the trees, find the longest prefix of the file, find the
longest prefix of the rest of the file, add an insertion for a line
that doesn't match, repeat. I probably need a few extra things to
stabilize the process (prefer that the next chunk come from the same file,
prefer that it come from next in the file, ignore copied lines without
enough content).

I haven't actually started yet; I'm waiting for a weekend when I'm feeling
inspired and not too fried.

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: The criss-cross merge case
@ 2005-04-28 14:25 Adam J. Richter
  2005-04-29 12:19 ` Wayne Scott
  0 siblings, 1 reply; 12+ messages in thread
From: Adam J. Richter @ 2005-04-28 14:25 UTC (permalink / raw)
  To: ry102; +Cc: barkalow, bram, droundry, git, tupshin

On 2005-04-28, Benedikt Schmidt wrote:
>AFAIK the paper mentioned in the GNU diff sources [1] is an improvement
>to an earlier paper by the same author titled
>"A File Comparison Program" - Miller, Myers - 1985.
[...]
>[1] http://citeseer.ist.psu.edu/myers86ond.html

	Monotone apparently uses a futher acceleration of that algorithm
from the 1989 paper, also co-authored by the Myers, "An O(NP) Sequence
Comparison Algorithm" by Sun Wu, Udi Manber, and Gene Myers.
http://www.eecs.berkeley.edu/~gene/Papers/np_diff.pdf .  The Monotone
implementation was apparently a port of an implementation originally
written in Scheme by Aubrey Jaffer.

	I don't fully understand the 1989 paper, but I get the
general impression that is a small change to the previous algorithm
(the one in GNU diff) that might be a 30 line patch if someone
got around to submitting it, and seems to make the code run more
than twice as fast in practice.  One of these days, I will probably get
around to coding up a patch to GNU diff if nobody beats me to it.

	Making diff run faster may have at least one potentially useful
benefit for merging.  A faster diff makes it more practical run diff
on smaller units of comparison.  I posted a note here before about
converting the input files to diff3 to have just one character per
line, and then undoing that transformation of the result to produce
a character based merge that seemed to work pretty well in the
couple of tests that I tried.

                    __     ______________ 
Adam J. Richter        \ /
adam@yggdrasil.com      | g g d r a s i l

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: The criss-cross merge case
  2005-04-28 14:25 The criss-cross merge case Adam J. Richter
@ 2005-04-29 12:19 ` Wayne Scott
  0 siblings, 0 replies; 12+ messages in thread
From: Wayne Scott @ 2005-04-29 12:19 UTC (permalink / raw)
  To: Adam J. Richter; +Cc: ry102, barkalow, bram, droundry, git, tupshin

On 4/28/05, Adam J. Richter <adam@yggdrasil.com> wrote:
> On 2005-04-28, Benedikt Schmidt wrote:
> >AFAIK the paper mentioned in the GNU diff sources [1] is an improvement
> >to an earlier paper by the same author titled
> >"A File Comparison Program" - Miller, Myers - 1985.
> [...]
> >[1] http://citeseer.ist.psu.edu/myers86ond.html
> 
>         Monotone apparently uses a futher acceleration of that algorithm
> from the 1989 paper, also co-authored by the Myers, "An O(NP) Sequence
> Comparison Algorithm" by Sun Wu, Udi Manber, and Gene Myers.
> http://www.eecs.berkeley.edu/~gene/Papers/np_diff.pdf .  The Monotone
> implementation was apparently a port of an implementation originally
> written in Scheme by Aubrey Jaffer.
> 
>         I don't fully understand the 1989 paper, but I get the
> general impression that is a small change to the previous algorithm
> (the one in GNU diff) that might be a 30 line patch if someone
> got around to submitting it, and seems to make the code run more
> than twice as fast in practice.  One of these days, I will probably get
> around to coding up a patch to GNU diff if nobody beats me to it.
> 
>         Making diff run faster may have at least one potentially useful
> benefit for merging.  A faster diff makes it more practical run diff
> on smaller units of comparison.  I posted a note here before about
> converting the input files to diff3 to have just one character per
> line, and then undoing that transformation of the result to produce
> a character based merge that seemed to work pretty well in the
> couple of tests that I tried.

I just read that paper and unless I am mistaken, it already describes
the basis for how GNU diff works.  I don't think anything in that
paper would make it faster.

I also don't find anything to suggest the Monotone guys have rewritten
diff.  Just some notes from graydon that notes python's difflib uses a
non-optimal diff that is faster in some cases.

-Wayne

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: The criss-cross merge case
@ 2005-04-30 12:32 Adam J. Richter
  0 siblings, 0 replies; 12+ messages in thread
From: Adam J. Richter @ 2005-04-30 12:32 UTC (permalink / raw)
  To: wsc9tt; +Cc: barkalow, bram, droundry, git, ry102, tupshin

On Fri, 29 Apr 2005 07:19:18 -0500, Wayne Scott wrote:
>On 4/28/05, Adam J. Richter <adam@yggdrasil.com> wrote:
>> On 2005-04-28, Benedikt Schmidt wrote:
>> >AFAIK the paper mentioned in the GNU diff sources [1] is an improvement
>> >to an earlier paper by the same author titled
>> >"A File Comparison Program" - Miller, Myers - 1985.
>> [...]
>> >[1] http://citeseer.ist.psu.edu/myers86ond.html
>> 
>>         Monotone apparently uses a futher acceleration of that algorithm
>> from the 1989 paper, also co-authored by the Myers, "An O(NP) Sequence
>> Comparison Algorithm" by Sun Wu, Udi Manber, and Gene Myers.
>> http://www.eecs.berkeley.edu/~gene/Papers/np_diff.pdf .  The Monotone
>> implementation was apparently a port of an implementation originally
>> written in Scheme by Aubrey Jaffer.
>> 
>>         I don't fully understand the 1989 paper, but I get the
>> general impression that is a small change to the previous algorithm
>> (the one in GNU diff) that might be a 30 line patch if someone
>> got around to submitting it, and seems to make the code run more
>> than twice as fast in practice.  One of these days, I will probably get
>> around to coding up a patch to GNU diff if nobody beats me to it.
>> 
>>         Making diff run faster may have at least one potentially useful
>> benefit for merging.  A faster diff makes it more practical run diff
>> on smaller units of comparison.  I posted a note here before about
>> converting the input files to diff3 to have just one character per
>> line, and then undoing that transformation of the result to produce
>> a character based merge that seemed to work pretty well in the
>> couple of tests that I tried.

>I just read that paper and unless I am mistaken, it already describes
>the basis for how GNU diff works.  I don't think anything in that
>paper would make it faster.
>
>I also don't find anything to suggest the Monotone guys have rewritten
>diff.  Just some notes from graydon that notes python's difflib uses a
>non-optimal diff that is faster in some cases.

	In terminology that can only be understood by reading
the 1985 paper, the 1989 paper describes a possible reduction
in the number of diagonals in the edit graph that iterations of the
1989 algorithm have to consider.  I say "possible reduction" because
the reduction can be zero in the worse case, although I get the
impression that it should be a reduction of 50% or better
typically, and it makes the case where the changes is just
a bunch of inserts run in linear time.

	I believe that the longest common subsequence finder
at the core of GNU diff does not currently perform this optimization,
but the one in monotone-0.18/lcs.{cc,hh} does.

                    __     ______________
Adam J. Richter        \ /
adam@yggdrasil.com      | g g d r a s i l

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2005-04-30 13:36 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-04-27 20:25 The criss-cross merge case Bram Cohen
2005-04-27 23:32 ` Daniel Barkalow
2005-04-28  0:43   ` Tupshin Harper
2005-04-28  1:16     ` Daniel Barkalow
2005-04-28  2:15       ` Benedikt Schmidt
2005-04-28  2:19         ` Daniel Barkalow
2005-04-28 11:16           ` David Roundy
2005-04-28  3:41   ` suffix array/tree deltas (Was: The criss-cross merge case) Zed A. Shaw
2005-04-28  4:30     ` Daniel Barkalow
  -- strict thread matches above, loose matches on Subject: below --
2005-04-28 14:25 The criss-cross merge case Adam J. Richter
2005-04-29 12:19 ` Wayne Scott
2005-04-30 12:32 Adam J. Richter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).