The criss-cross merge case

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* The criss-cross merge case
@ 2005-04-27 20:25 Bram Cohen
  2005-04-27 23:32 ` Daniel Barkalow
  0 siblings, 1 reply; 9+ messages in thread
From: Bram Cohen @ 2005-04-27 20:25 UTC (permalink / raw)
  To: git

Here's an example of where simple three-way merge can't do the right
thing. Each letter represents a snapshot of the history, and time goes
downwards. The numbers after some letters refer to which line number was
modified at that time.

A
|\
| \
|  \
|   \
|    \
|     \
|      \
B8      C3
|\     /|
| \   / |
|  \ /  |
|   X   |
|  / \  |
| /   \ |
|/     \|
D8      E3
 \      |
  \     |
   \    |
    \   |
     \  |
      \ |
       \|
        ?

In this case the ? should have a clean merge with the D vesion of line 8
(because it was made with the B version already in the history) and the E
version of line 3 (because it was made with the C version already in the
history).

The problem is that there's no single ancestor for the three-way merge
which does the right thing. If one picks B, then there will be an
unnecessary merge conflict at line 3, because D will have the C version
and E will have the E version but B will have neither. Likewise if one
picks C, there will be an unnecessary conflict at line 8 because D will
have the D version and E will have the B version but C will have neither.
Picking A will cause unnecessary conflicts on *both* lines.

The problem can actually be much worse than a simple unnecesary conflict,
because if the later updates were strict undos of the earlier updates,
then picking either B or C will merge something *wrong*. Using A as the
ancestor will keep that from happening, but it also maximizes unnecessary
conflicts.

Note that the above criss-cross case only involves two branches, using the
methodology of each one modifying their own section and pulling in old
versions of the other one from time to time. Cogito's interface encourages
exactly this work flow, which is not a bad thing from a work flow
perspective, but does make it hit this case regularly.

The way Git handles this currently is very bad, because it forces the
common ancestor to be from the same snapshot across all files, so this
problem will happen if the modifications are made even in different files,
not just different lines within the same file. That could be improved
greatly by finding an LCA for each file individually, which is what
Monotone does. Darcs, Codeville, and all the Arch descendants have better
merge algorithms which don't have to pick a single common ancestor.

-Bram

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: The criss-cross merge case
  2005-04-27 20:25 The criss-cross merge case Bram Cohen
@ 2005-04-27 23:32 ` Daniel Barkalow
  2005-04-28  0:43   ` Tupshin Harper
  2005-04-28  3:41   ` suffix array/tree deltas (Was: The criss-cross merge case) Zed A. Shaw
  0 siblings, 2 replies; 9+ messages in thread
From: Daniel Barkalow @ 2005-04-27 23:32 UTC (permalink / raw)
  To: Bram Cohen; +Cc: git

On Wed, 27 Apr 2005, Bram Cohen wrote:

> The way Git handles this currently is very bad, because it forces the
> common ancestor to be from the same snapshot across all files, so this
> problem will happen if the modifications are made even in different files,
> not just different lines within the same file. That could be improved
> greatly by finding an LCA for each file individually, which is what
> Monotone does.

The git core is perfectly sufficient for getting all LCAs or
per-file best LCAs; merge-base doesn't bother, currently, because the
deficiencies of "merge" (a.k.a. diff3) are worse than the issues with
chosing a suboptimal LCA.

My plan is to implement multi-file diff and merge with a suffix tree-based
algorithm, and then revisit the history stuff once we have a merger that
can do sensible things with this information.

Note that the present very bad merger is actually seems to be sufficient
for the Linux kernel, where patches from different sides of a merge are
generally either unrelated or are identical, and, otherwise, they tend to
be true conflicts where people fixed the same bug independantly in
different ways.

> Darcs, Codeville, and all the Arch descendants have better merge
> algorithms which don't have to pick a single common ancestor.

I've been looking at Darcs (which seems to have a good method, although I
think the underlying diff isn't great), and Codeville still doesn't have
any documentation. Arch's method is strictly weaker than 3-way merge, and
generates more rejects (not even conflicts) in my experience than even
CVS. 

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: The criss-cross merge case
  2005-04-27 23:32 ` Daniel Barkalow
@ 2005-04-28  0:43   ` Tupshin Harper
  2005-04-28  1:16     ` Daniel Barkalow
  2005-04-28  3:41   ` suffix array/tree deltas (Was: The criss-cross merge case) Zed A. Shaw
  1 sibling, 1 reply; 9+ messages in thread
From: Tupshin Harper @ 2005-04-28  0:43 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: Bram Cohen, git

Daniel Barkalow wrote:

>I've been looking at Darcs (which seems to have a good method, although I
>think the underlying diff isn't great), and Codeville still doesn't have
>any documentation. Arch's method is strictly weaker than 3-way merge, and
>generates more rejects (not even conflicts) in my experience than even
>CVS. 
>
>	-Daniel
>  
>
Can you clarify what you mean by darcs' underlying diff not being that
great? It seems to function pretty much identically to gnu diff. In what
way would you want the underlying diff to be improved?

-Tupshin

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: The criss-cross merge case
  2005-04-28  0:43   ` Tupshin Harper
@ 2005-04-28  1:16     ` Daniel Barkalow
  2005-04-28  2:15       ` Benedikt Schmidt
  0 siblings, 1 reply; 9+ messages in thread
From: Daniel Barkalow @ 2005-04-28  1:16 UTC (permalink / raw)
  To: Tupshin Harper; +Cc: Bram Cohen, git

On Wed, 27 Apr 2005, Tupshin Harper wrote:

> Can you clarify what you mean by darcs' underlying diff not being that
> great? It seems to function pretty much identically to gnu diff. In what
> way would you want the underlying diff to be improved?

GNU diff uses an algorithm which is tuned to handle finding the shortest
diff among a large set of similar-length alternatives while comparing
files which have a lot of repeated lines. The author of the paper it cites
is really thinking about diffing DNA sequences or similar things. It also
can't detect content moves, which are a common thing to have, and which
will be important in the long run, when we're trying to track
modifications to content which also moved from place to place.

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: The criss-cross merge case
  2005-04-28  1:16     ` Daniel Barkalow
@ 2005-04-28  2:15       ` Benedikt Schmidt
  2005-04-28  2:19         ` Daniel Barkalow
  0 siblings, 1 reply; 9+ messages in thread
From: Benedikt Schmidt @ 2005-04-28  2:15 UTC (permalink / raw)
  To: git

Daniel Barkalow <barkalow@iabervon.org> writes:

> On Wed, 27 Apr 2005, Tupshin Harper wrote:
>
>> Can you clarify what you mean by darcs' underlying diff not being that
>> great? It seems to function pretty much identically to gnu diff. In what
>> way would you want the underlying diff to be improved?
>
> GNU diff uses an algorithm which is tuned to handle finding the shortest
> diff among a large set of similar-length alternatives while comparing
> files which have a lot of repeated lines. The author of the paper it cites
> is really thinking about diffing DNA sequences or similar things.

AFAIK the paper mentioned in the GNU diff sources [1] is an improvement
to an earlier paper by the same author titled
"A File Comparison Program" - Miller, Myers - 1985.

Can you be more specific why the algorithm is a bad choice (performance,
quality of diff output)?

> It also can't detect content moves, which are a common thing to have, and
> which will be important in the long run, when we're trying to track
> modifications to content which also moved from place to place.

Ok, darcs doesn't handle block moves, so there is no need for an algorithm that
supports them (yet). Is there any free SCM that has support for block moves at
the moment? It seems like clearcase detects them, but I don't know where it
takes advantage of it.

Benedikt

[1] http://citeseer.ist.psu.edu/myers86ond.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: The criss-cross merge case
  2005-04-28  2:15       ` Benedikt Schmidt
@ 2005-04-28  2:19         ` Daniel Barkalow
  2005-04-28 11:16           ` David Roundy
  0 siblings, 1 reply; 9+ messages in thread
From: Daniel Barkalow @ 2005-04-28  2:19 UTC (permalink / raw)
  To: Benedikt Schmidt; +Cc: git

On Thu, 28 Apr 2005, Benedikt Schmidt wrote:

> AFAIK the paper mentioned in the GNU diff sources [1] is an improvement
> to an earlier paper by the same author titled
> "A File Comparison Program" - Miller, Myers - 1985.

GNU diff is based on a better algorithm than traditional diff, reportly,
but there are better algorithms still, developed since, at least according
to a brief literature search on Google Scholar. (bdiff and vdelta, for
example, which can identify block moves as well.)

> Can you be more specific why the algorithm is a bad choice (performance,
> quality of diff output)?

I suspect that the speed is suboptimal (for the cases under which it is
actually used). The quality of the output is about ideal, lacking a
representation for block moves, but I'm hoping to have a diff/merge set
that handles block moves effectively, even if it can't report them in diff
format. I'm also hoping for an annotate function that could use block
moves.

> Ok, darcs doesn't handle block moves, so there is no need for an algorithm that
> supports them (yet). Is there any free SCM that has support for block moves at
> the moment? It seems like clearcase detects them, but I don't know where it
> takes advantage of it.

I would think that darcs would be able to do neat things in its merger if
it knew about block moves. Obviously, it only makes sense to add support
for identifying them and using them at the same time.

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 9+ messages in thread

* suffix array/tree deltas (Was: The criss-cross merge case)
  2005-04-27 23:32 ` Daniel Barkalow
  2005-04-28  0:43   ` Tupshin Harper
@ 2005-04-28  3:41   ` Zed A. Shaw
  2005-04-28  4:30     ` Daniel Barkalow
  1 sibling, 1 reply; 9+ messages in thread
From: Zed A. Shaw @ 2005-04-28  3:41 UTC (permalink / raw)
  To: git

On Wed, 2005-04-27 at 19:32 -0400, Daniel Barkalow wrote:
> On Wed, 27 Apr 2005, Bram Cohen wrote:
> 

> My plan is to implement multi-file diff and merge with a suffix tree-based
> algorithm, and then revisit the history stuff once we have a merger that
> can do sensible things with this information.

Hey, that's neat.  I've already implemented two versions of this very
thing with FastCST.  The original used suffix trees, but I found that
there were plenty of pathological cases which chewed memory and
processor.  Most of these cases were large (>1MB) PDF files.  Don't ask
me why PDF drove suffix tree algorithms insane, but they just did.

I recently switched to a suffix array based algorithm which actually
ends up being faster than the suffix tree alternative.  I'm not using
the most recent fastest algorithm and it still compares favorably with
xdelta.

There's tons of weird things about doing a delta based on suffix
arrays/trees, so feel free to pick my brain or the FastCST code if you
attempt it.  The difficult parts turn out to be making the suffix array
and searching for the matching/non-matching regions.  Once you do that
the actual delta algorithm is a simple while loop that keeps doing the
match/non-match detection.

Zed

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: suffix array/tree deltas (Was: The criss-cross merge case)
  2005-04-28  3:41   ` suffix array/tree deltas (Was: The criss-cross merge case) Zed A. Shaw
@ 2005-04-28  4:30     ` Daniel Barkalow
  0 siblings, 0 replies; 9+ messages in thread
From: Daniel Barkalow @ 2005-04-28  4:30 UTC (permalink / raw)
  To: Zed A. Shaw; +Cc: git

On Wed, 27 Apr 2005, Zed A. Shaw wrote:

> On Wed, 2005-04-27 at 19:32 -0400, Daniel Barkalow wrote:
> 
> > My plan is to implement multi-file diff and merge with a suffix tree-based
> > algorithm, and then revisit the history stuff once we have a merger that
> > can do sensible things with this information.
> 
> Hey, that's neat.  I've already implemented two versions of this very
> thing with FastCST.  The original used suffix trees, but I found that
> there were plenty of pathological cases which chewed memory and
> processor.  Most of these cases were large (>1MB) PDF files.  Don't ask
> me why PDF drove suffix tree algorithms insane, but they just did.

I'm not too surprised; but can you hope to merge or compare PDFs
anyway? I'd think that you'd just screw up alignment or something. (Note
that we aren't using deltas for history storage, so we're not interested
in the "compressing multiple versions" aspect of diffs.) I think I want to
punt anything too binary-like and try to find an unambiguous history-based
difference (i.e., there was some commit in the past that replaced one of
the versions with the other; therefore, we want the replacing one).

I'm thinking of line-based compressed suffix trees, with the obvious delta
algorithm: make the trees, find the longest prefix of the file, find the
longest prefix of the rest of the file, add an insertion for a line
that doesn't match, repeat. I probably need a few extra things to
stabilize the process (prefer that the next chunk come from the same file,
prefer that it come from next in the file, ignore copied lines without
enough content).

I haven't actually started yet; I'm waiting for a weekend when I'm feeling
inspired and not too fried.

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: The criss-cross merge case
  2005-04-28  2:19         ` Daniel Barkalow
@ 2005-04-28 11:16           ` David Roundy
  0 siblings, 0 replies; 9+ messages in thread
From: David Roundy @ 2005-04-28 11:16 UTC (permalink / raw)
  To: git

On Wed, Apr 27, 2005 at 10:19:17PM -0400, Daniel Barkalow wrote:
> On Thu, 28 Apr 2005, Benedikt Schmidt wrote:
> > Ok, darcs doesn't handle block moves, so there is no need for an
> > algorithm that supports them (yet). Is there any free SCM that has
> > support for block moves at the moment? It seems like clearcase detects
> > them, but I don't know whqere it takes advantage of it.
> 
> I would think that darcs would be able to do neat things in its merger if
> it knew about block moves. Obviously, it only makes sense to add support
> for identifying them and using them at the same time.

Indeed, handling block moves would definitely be *very* nice.  An ancient
version of darcs actually did this (it's not in the current darcs history,
since it was so ancient and buggy), although it had a terrible diff
algorithm.  But I really didn't understand the theory back then, and when I
rewrote everything, I never added the block moves back in.  They complicate
conflict situations a bit, and once I found that darcs was actually
useable, I started focusing on other issues.  (Most recently efficiency
issues.)
-- 
David Roundy
http://www.darcs.net

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2005-04-28 11:16 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-04-27 20:25 The criss-cross merge case Bram Cohen
2005-04-27 23:32 ` Daniel Barkalow
2005-04-28  0:43   ` Tupshin Harper
2005-04-28  1:16     ` Daniel Barkalow
2005-04-28  2:15       ` Benedikt Schmidt
2005-04-28  2:19         ` Daniel Barkalow
2005-04-28 11:16           ` David Roundy
2005-04-28  3:41   ` suffix array/tree deltas (Was: The criss-cross merge case) Zed A. Shaw
2005-04-28  4:30     ` Daniel Barkalow

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).