libxdiff and patience diff

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* libxdiff and patience diff
@ 2008-11-04  0:40 Pierre Habouzit
  2008-11-04  3:17 ` Davide Libenzi
  2008-11-04  5:39 ` Johannes Schindelin
  0 siblings, 2 replies; 67+ messages in thread
From: Pierre Habouzit @ 2008-11-04  0:40 UTC (permalink / raw)
  To: davidel; +Cc: Git ML

[-- Attachment #1: Type: text/plain, Size: 2679 bytes --]

Hi Davide,

I've been working tonight, trying to make libxdiff support the patience
diff algorithm, but I've totally failed, because I _thought_ I
understood what xdl_split was doing, but it appears I don't.

[ For the readers playing at home, the patience diff algorithm is
  explained after my sig. ]

What I did is to:
(1) add a flag to the xenv in xdl_split that says that I want a
    patience "split".
(2) Then in xdl_split, if that bit is set, I compute the longest common
    subsequence from the patience diff.
(3) for each split it computes I call xdl_recs_cmp on that interval.

What I thought it would achieve is that I force some boundaries at which
libxdiff _must_ resync. Though, it seems that for some reason it doesn't
work, probably because the "ha" stuff and the boundaries absolutely
don't work the way I thought it did.

So where is the place I should do that ? I suspect it should be
partly in xprepare.c but I'm a bit stuck right now.

Any pointer on how the stuff in xprepare.c and xdiffi.c work would help
greatly, it's really not self-evident to me :)

-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org

Patience diff
=============

Basically, it's an algorithm that considers line from the left file A
and the right file B.

It searches for lines that are unique *IN* A and unique *IN* B as well.
Then using a clever O(n log log n) where n is that number of lines, you
extract the longer common sequence of those lines.

This defines two splits of file A and B. For each corresponding chunks,
you then reduce the lines matching at the beginning and the end, and
reiterate the algorithm on the interior of that space, until there are
_no_ unique lines at all, and then you apply a usual diff algorithm to
generate the diff in that final section.

http://alfedenzo.livejournal.com/170301.html has a nice visual
explanation of the fact, even if it forgets about the "zones" trimming
that helps for the efficiency. It's also often wrong to generate the
stacks in the order shown on the blog, because you recurse from the max
line to the min line, which is not the natural order to work on a diff.
But that's just a matter of "inverting" all the comparisons which is
pretty obvious to do.

The difference of output can be seen on http://glandium.org/blog/?p=120
where the patience diff picks the line
   "include $(topsrcdir)/config/rules.mk"
as being unique on the left and the right file, hence will use it as
sync point rather than using it in a diff.

[-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: libxdiff and patience diff
  2008-11-04  0:40 libxdiff and patience diff Pierre Habouzit
@ 2008-11-04  3:17 ` Davide Libenzi
  2008-11-04  8:33   ` Pierre Habouzit
  2008-11-04  5:39 ` Johannes Schindelin
  1 sibling, 1 reply; 67+ messages in thread
From: Davide Libenzi @ 2008-11-04  3:17 UTC (permalink / raw)
  To: Pierre Habouzit; +Cc: Git ML

On Tue, 4 Nov 2008, Pierre Habouzit wrote:

> Hi Davide,
> 
> I've been working tonight, trying to make libxdiff support the patience
> diff algorithm, but I've totally failed, because I _thought_ I
> understood what xdl_split was doing, but it appears I don't.
> 
> 
> [ For the readers playing at home, the patience diff algorithm is
>   explained after my sig. ]
> 
> 
> What I did is to:
> (1) add a flag to the xenv in xdl_split that says that I want a
>     patience "split".
> (2) Then in xdl_split, if that bit is set, I compute the longest common
>     subsequence from the patience diff.
> (3) for each split it computes I call xdl_recs_cmp on that interval.
> 
> 
> What I thought it would achieve is that I force some boundaries at which
> libxdiff _must_ resync. Though, it seems that for some reason it doesn't
> work, probably because the "ha" stuff and the boundaries absolutely
> don't work the way I thought it did.
> 
> So where is the place I should do that ? I suspect it should be
> partly in xprepare.c but I'm a bit stuck right now.
> 
> 
> Any pointer on how the stuff in xprepare.c and xdiffi.c work would help
> greatly, it's really not self-evident to me :)

What makes you think it'd self-evident to me? :)
Seriously, I forgot stuff I wrote the last month, this is way beyond my 
memory limits.
You definitely need to look at xprepare.c, especially at xdl_trim_ends() 
and xdl_cleanup_records(). Lines are re-arranged in indexes, and this 
might screw up your logic if you're not prepared for it.
What xdl_split() does, is find the start of an LCS and return the 
coordinate. Then xdl_recs_cmp() does the box reducing (first two "for" 
loops) and different-lines marking (first and second "if").



- Davide

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: libxdiff and patience diff
  2008-11-04  0:40 libxdiff and patience diff Pierre Habouzit
  2008-11-04  3:17 ` Davide Libenzi
@ 2008-11-04  5:39 ` Johannes Schindelin
  2008-11-04  8:30   ` Pierre Habouzit
  1 sibling, 1 reply; 67+ messages in thread
From: Johannes Schindelin @ 2008-11-04  5:39 UTC (permalink / raw)
  To: Pierre Habouzit; +Cc: davidel, Git ML

Hi,

On Tue, 4 Nov 2008, Pierre Habouzit wrote:

> I've been working tonight, trying to make libxdiff support the patience 
> diff algorithm, but I've totally failed, because I _thought_ I 
> understood what xdl_split was doing, but it appears I don't.

I thought about it, too, at the GitTogether, although I want to finish my 
jGit diff first.

The main idea I had about patience diff is that you can reuse a lot of 
functions in libxdiff.

But the requirement of taking just unique lines/hashes into account, and 
_after_ splitting up, _again_ determine uniqueness _just_ in the 
between-hunk part made me think that it may be wiser to have a separate 
function for the patience diff stuff.

Basically, you would start to have libxdiff split the lines and hash them 
as normal, but then determine the unique hashes (I'd start with the 
smaller text, just to have a better chance to end up with unique hashes).

Once they are determined, you can search for those exact lines (hash 
first) in the post-document.

Once that is done, you'd have to find the longest common subsequence, 
which you could do using the existing infrastructure, but that would 
require more work (as we already know the lines are unique).

After that, you would have to recurse to the same algorithm _between_ 
known chunks.  Eventually, that would have to resort to classical libxdiff 
(if there are no, or not enough, unique lines).

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: libxdiff and patience diff
  2008-11-04  5:39 ` Johannes Schindelin
@ 2008-11-04  8:30   ` Pierre Habouzit
  2008-11-04 14:34     ` Johannes Schindelin
  2009-01-06 11:17     ` Pierre Habouzit
  0 siblings, 2 replies; 67+ messages in thread
From: Pierre Habouzit @ 2008-11-04  8:30 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: davidel, Git ML

[-- Attachment #1: Type: text/plain, Size: 3639 bytes --]

On Tue, Nov 04, 2008 at 05:39:48AM +0000, Johannes Schindelin wrote:
> Hi,
> 
> On Tue, 4 Nov 2008, Pierre Habouzit wrote:
> 
> > I've been working tonight, trying to make libxdiff support the patience 
> > diff algorithm, but I've totally failed, because I _thought_ I 
> > understood what xdl_split was doing, but it appears I don't.
> 
> I thought about it, too, at the GitTogether, although I want to finish my 
> jGit diff first.
> 
> The main idea I had about patience diff is that you can reuse a lot of 
> functions in libxdiff.
> 
> But the requirement of taking just unique lines/hashes into account, and 
> _after_ splitting up, _again_ determine uniqueness _just_ in the 
> between-hunk part made me think that it may be wiser to have a separate 
> function for the patience diff stuff.
> 
> Basically, you would start to have libxdiff split the lines and hash them 
> as normal, but then determine the unique hashes (I'd start with the 
> smaller text, just to have a better chance to end up with unique hashes).
> 
> Once they are determined, you can search for those exact lines (hash 
> first) in the post-document.

Actually my current implementation just puts all the hashes into an
array, sorts them by hash (which is O(n log(n)) with the position from
left or right file it's in, ordered by increasing right position. (and
the struct is filled with the left pos to -1 if the right pos is set,
and vice versa).

The I scan the array to find patterns of two consecutive hashes exactly,
and collapse it into the proper {left pos, right pos} tuple if it was
indeed a unique line in both files.

This results into an array I sort again by right pos then, and we can
work on that for the stack sorting, and I do it, and then I have my LCS.

This is the complete brute-force algorithm which requires a temporary
array of the size of the number of lines on the left + the right, and a
temporary array for the stacks which _may_ end up being as large as the
smallest number of lines between the left or right file in the worst
case I'd say (roughly).

Then I just remember a list of split points, and I recurse in all the
sub splits again. It has a fixed point which may or may not need
libxdiff recursion in it.

This code is actually written, naive and unoptimized but it doesn't work
probably because I didn't plug that in the proper place :)

> Once that is done, you'd have to find the longest common subsequence, 
> which you could do using the existing infrastructure, but that would 
> require more work (as we already know the lines are unique).

Patience diff gives you the algorithm to do that, it's pretty simple,
and is more efficient than the current infrastructure (in time, I don't
know for space though).

> After that, you would have to recurse to the same algorithm _between_ 
> known chunks.  Eventually, that would have to resort to classical libxdiff 
> (if there are no, or not enough, unique lines).

Yeah, that's the point, the problem is, I believe more and more that I
should prepare the LCS from patience diff in xprepare.c, but I grok
absolutely nothing at what the chastore_t and similar stuff is. I
understand it's about hashing, but the exact stuff it does eludes me. In
fact when I look at the records I have in xdiffi.c I had the impression
they were already somehow collapsed, which makes it a too late point to
apply the patience diff ...

-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org

[-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: libxdiff and patience diff
  2008-11-04  3:17 ` Davide Libenzi
@ 2008-11-04  8:33   ` Pierre Habouzit
  0 siblings, 0 replies; 67+ messages in thread
From: Pierre Habouzit @ 2008-11-04  8:33 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Git ML

[-- Attachment #1: Type: text/plain, Size: 2593 bytes --]

On Tue, Nov 04, 2008 at 03:17:37AM +0000, Davide Libenzi wrote:
> On Tue, 4 Nov 2008, Pierre Habouzit wrote:
> 
> > Hi Davide,
> > 
> > I've been working tonight, trying to make libxdiff support the patience
> > diff algorithm, but I've totally failed, because I _thought_ I
> > understood what xdl_split was doing, but it appears I don't.
> > 
> > 
> > [ For the readers playing at home, the patience diff algorithm is
> >   explained after my sig. ]
> > 
> > 
> > What I did is to:
> > (1) add a flag to the xenv in xdl_split that says that I want a
> >     patience "split".
> > (2) Then in xdl_split, if that bit is set, I compute the longest common
> >     subsequence from the patience diff.
> > (3) for each split it computes I call xdl_recs_cmp on that interval.
> > 
> > 
> > What I thought it would achieve is that I force some boundaries at which
> > libxdiff _must_ resync. Though, it seems that for some reason it doesn't
> > work, probably because the "ha" stuff and the boundaries absolutely
> > don't work the way I thought it did.
> > 
> > So where is the place I should do that ? I suspect it should be
> > partly in xprepare.c but I'm a bit stuck right now.
> > 
> > 
> > Any pointer on how the stuff in xprepare.c and xdiffi.c work would help
> > greatly, it's really not self-evident to me :)
> 
> What makes you think it'd self-evident to me? :)
> Seriously, I forgot stuff I wrote the last month, this is way beyond my 
> memory limits.
> You definitely need to look at xprepare.c, especially at xdl_trim_ends() 
> and xdl_cleanup_records(). Lines are re-arranged in indexes, and this 
> might screw up your logic if you're not prepared for it.
> What xdl_split() does, is find the start of an LCS and return the 
> coordinate. Then xdl_recs_cmp() does the box reducing (first two "for" 
> loops) and different-lines marking (first and second "if").

Well it's what I thought it did indeed, that's why before recursing into
that bit, I tried to extract a full LCS from the patience diff algorithm
and recurse into that for each interval it gives, which _should_ work,
but doesn't at all :/

Okay maybe I should re-read my algorithm slowly and check that I've not
made something silly (like chaining the list in the reverse order or god
knows what), but my debug functions let me believe that I did that fine.

I'll look into it tonight.

-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org

[-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: libxdiff and patience diff
  2008-11-04  8:30   ` Pierre Habouzit
@ 2008-11-04 14:34     ` Johannes Schindelin
  2008-11-04 15:23       ` Pierre Habouzit
  2009-01-06 11:17     ` Pierre Habouzit
  1 sibling, 1 reply; 67+ messages in thread
From: Johannes Schindelin @ 2008-11-04 14:34 UTC (permalink / raw)
  To: Pierre Habouzit; +Cc: davidel, Git ML

Hi,

On Tue, 4 Nov 2008, Pierre Habouzit wrote:

> On Tue, Nov 04, 2008 at 05:39:48AM +0000, Johannes Schindelin wrote:
> 
> > On Tue, 4 Nov 2008, Pierre Habouzit wrote:
> > 
> > > I've been working tonight, trying to make libxdiff support the 
> > > patience diff algorithm, but I've totally failed, because I 
> > > _thought_ I understood what xdl_split was doing, but it appears I 
> > > don't.
> > 
> > I thought about it, too, at the GitTogether, although I want to finish 
> > my jGit diff first.
> > 
> > The main idea I had about patience diff is that you can reuse a lot of 
> > functions in libxdiff.
> > 
> > But the requirement of taking just unique lines/hashes into account, 
> > and _after_ splitting up, _again_ determine uniqueness _just_ in the 
> > between-hunk part made me think that it may be wiser to have a 
> > separate function for the patience diff stuff.
> > 
> > Basically, you would start to have libxdiff split the lines and hash 
> > them as normal, but then determine the unique hashes (I'd start with 
> > the smaller text, just to have a better chance to end up with unique 
> > hashes).
> > 
> > Once they are determined, you can search for those exact lines (hash 
> > first) in the post-document.
> 
> Actually my current implementation just puts all the hashes into an 
> array, sorts them by hash (which is O(n log(n)) with the position from 
> left or right file it's in, ordered by increasing right position. (and 
> the struct is filled with the left pos to -1 if the right pos is set, 
> and vice versa).

Yeah, that would be much more efficient using a hash-multiset.  Still 
linear size (albeit twice as much).  And you can already discard double 
entries early (although you have to keep one pointer in the hash-multiset 
to prevent other identical lines from being misdetected as "unique").

> The I scan the array to find patterns of two consecutive hashes exactly, 
> and collapse it into the proper {left pos, right pos} tuple if it was 
> indeed a unique line in both files.
> 
> This results into an array I sort again by right pos then, and we can 
> work on that for the stack sorting, and I do it, and then I have my LCS.
> 
> 
> This is the complete brute-force algorithm which requires a temporary 
> array of the size of the number of lines on the left + the right, and a 
> temporary array for the stacks which _may_ end up being as large as the 
> smallest number of lines between the left or right file in the worst 
> case I'd say (roughly).
> 
> Then I just remember a list of split points, and I recurse in all the
> sub splits again. It has a fixed point which may or may not need
> libxdiff recursion in it.

I am not sure that you really end up with patience diff that way.  I 
_think_ that you need to determine the longest sequence of unique lines 
which has the property of being ordered in both texts first, and only 
_then_ recurse into the not-yet-handled lines.

> > Once that is done, you'd have to find the longest common subsequence, 
> > which you could do using the existing infrastructure, but that would 
> > require more work (as we already know the lines are unique).
> 
> Patience diff gives you the algorithm to do that, it's pretty simple,
> and is more efficient than the current infrastructure (in time, I don't
> know for space though).

Actually, IIRC it is pretty easy to see that the time complexity is linear 
(and therefore, the space complexity, too).

> > After that, you would have to recurse to the same algorithm _between_ 
> > known chunks.  Eventually, that would have to resort to classical 
> > libxdiff (if there are no, or not enough, unique lines).
> 
> Yeah, that's the point, the problem is, I believe more and more that I 
> should prepare the LCS from patience diff in xprepare.c, but I grok 
> absolutely nothing at what the chastore_t and similar stuff is. I 
> understand it's about hashing, but the exact stuff it does eludes me.

Yes, I do not like the short and unintuitive names either.

AFAIU chastore_t is just a generic extensible array of elements that have 
size "isize", and initially there are "icount" of them.

> In fact when I look at the records I have in xdiffi.c I had the 
> impression they were already somehow collapsed, which makes it a too 
> late point to apply the patience diff ...

AFAICS xdiffi.c contains the classical diff algorithm (incidentally, I the 
inventor of that algorithm is about 50 meters away from me at this very 
moment).  It should not have anything of interest to you, except for the 
fall-back case.

So I think that you should add a new file xpatience.c.

In that, I'd implement that hash multi-set, and use a prepared xdfenv_t to 
fill it (smaller file first, then you can traverse the other file, 
checking for uniqueness in that file and for a match in the other file at 
the same time).

You _could_ build the longest list of ordered pairs at the same time, too, 
but that may make the code a bit too complex.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: libxdiff and patience diff
  2008-11-04 14:34     ` Johannes Schindelin
@ 2008-11-04 15:23       ` Pierre Habouzit
  2008-11-04 15:57         ` Johannes Schindelin
  2009-01-01 16:38         ` [PATCH 0/3] Teach Git about the patience diff algorithm Johannes Schindelin
  0 siblings, 2 replies; 67+ messages in thread
From: Pierre Habouzit @ 2008-11-04 15:23 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: davidel, Git ML


[-- Attachment #1.1: Type: text/plain, Size: 5442 bytes --]

On Tue, Nov 04, 2008 at 02:34:37PM +0000, Johannes Schindelin wrote:
> Hi,
> 
> On Tue, 4 Nov 2008, Pierre Habouzit wrote:
> 
> > On Tue, Nov 04, 2008 at 05:39:48AM +0000, Johannes Schindelin wrote:
> > 
> > > On Tue, 4 Nov 2008, Pierre Habouzit wrote:
> > > 
> > > > I've been working tonight, trying to make libxdiff support the 
> > > > patience diff algorithm, but I've totally failed, because I 
> > > > _thought_ I understood what xdl_split was doing, but it appears I 
> > > > don't.
> > > 
> > > I thought about it, too, at the GitTogether, although I want to finish 
> > > my jGit diff first.
> > > 
> > > The main idea I had about patience diff is that you can reuse a lot of 
> > > functions in libxdiff.
> > > 
> > > But the requirement of taking just unique lines/hashes into account, 
> > > and _after_ splitting up, _again_ determine uniqueness _just_ in the 
> > > between-hunk part made me think that it may be wiser to have a 
> > > separate function for the patience diff stuff.
> > > 
> > > Basically, you would start to have libxdiff split the lines and hash 
> > > them as normal, but then determine the unique hashes (I'd start with 
> > > the smaller text, just to have a better chance to end up with unique 
> > > hashes).
> > > 
> > > Once they are determined, you can search for those exact lines (hash 
> > > first) in the post-document.
> > 
> > Actually my current implementation just puts all the hashes into an 
> > array, sorts them by hash (which is O(n log(n)) with the position from 
> > left or right file it's in, ordered by increasing right position. (and 
> > the struct is filled with the left pos to -1 if the right pos is set, 
> > and vice versa).
> 
> Yeah, that would be much more efficient using a hash-multiset.  Still 
> linear size (albeit twice as much).  And you can already discard double 
> entries early (although you have to keep one pointer in the hash-multiset 
> to prevent other identical lines from being misdetected as "unique").

Probably, I just wanted something to work first, and then optimize it
using the proper structure.

> I am not sure that you really end up with patience diff that way.  I 
> _think_ that you need to determine the longest sequence of unique lines 
> which has the property of being ordered in both texts first, and only 
> _then_ recurse into the not-yet-handled lines.

I must have been using really bad english because it's what I do ;)

> > > Once that is done, you'd have to find the longest common subsequence, 
> > > which you could do using the existing infrastructure, but that would 
> > > require more work (as we already know the lines are unique).
> > 
> > Patience diff gives you the algorithm to do that, it's pretty simple,
> > and is more efficient than the current infrastructure (in time, I don't
> > know for space though).
> 
> Actually, IIRC it is pretty easy to see that the time complexity is linear 
> (and therefore, the space complexity, too).

Well the space complexity of the patience diff would be linear too for
me, though I'm afraid that the current state of my implementation has a
big factor front ;)


> > In fact when I look at the records I have in xdiffi.c I had the 
> > impression they were already somehow collapsed, which makes it a too 
> > late point to apply the patience diff ...
> 
> AFAICS xdiffi.c contains the classical diff algorithm (incidentally, I the 
> inventor of that algorithm is about 50 meters away from me at this very 
> moment).  It should not have anything of interest to you, except for the 
> fall-back case.
> 
> So I think that you should add a new file xpatience.c.
> 
> In that, I'd implement that hash multi-set, and use a prepared xdfenv_t to 
> fill it (smaller file first, then you can traverse the other file, 
> checking for uniqueness in that file and for a match in the other file at 
> the same time).
> 
> You _could_ build the longest list of ordered pairs at the same time, too, 
> but that may make the code a bit too complex.

Well, technically, if I'm correct, xdiffi.c already has a pruned list of
hashed (I suppose ->ha is a hash value somehow ?) lines (pruned from
lines without a match left or right), and I was massaging that. Though
it absolutely fails so I fear I didn't understand what the ha stuff is
for real.

Attached is my current work (YES it is horrible, it's WIP), probably
most of it can be put in an xpatience.c file, but still, I absolutely
don't understand what gets wrong. On the simple example from
glandium.org it _looks_ like it catches the "unique" include line fine,
but somehow not.

The nasty thing about the patience diff is that it still needs the usual
diff algorithm once it has split the file into chunks separated by
"unique lines". So you cannot make really independant stuff. What I
could do is put most of the xpatience diff into xpatience.c but it would
still have to use some functions from xdiffi.c that are currently
private, so it messes somehow the files more than it's worth IMHO.


Anyways, I'm waiting for a client to finally setup his network for the
test I'm supposed to do right now, I'll try to have a closer look
tonight...

-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org

[-- Attachment #1.2: wip.patch --]
[-- Type: text/plain, Size: 8642 bytes --]

diff --git a/diff.c b/diff.c
index f644947..0901cdc 100644
--- a/diff.c
+++ b/diff.c
@@ -2447,6 +2447,8 @@ int diff_opt_parse(struct diff_options *options, const char **av, int ac)
 		options->xdl_opts |= XDF_IGNORE_WHITESPACE_CHANGE;
 	else if (!strcmp(arg, "--ignore-space-at-eol"))
 		options->xdl_opts |= XDF_IGNORE_WHITESPACE_AT_EOL;
+	else if (!strcmp(arg, "--patience"))
+		options->xdl_opts |= XDF_USE_PATIENCE;
 
 	/* flags options */
 	else if (!strcmp(arg, "--binary")) {
diff --git a/xdiff/xdiff.h b/xdiff/xdiff.h
index 84fff58..bba915c 100644
--- a/xdiff/xdiff.h
+++ b/xdiff/xdiff.h
@@ -32,6 +32,7 @@ extern "C" {
 #define XDF_IGNORE_WHITESPACE (1 << 2)
 #define XDF_IGNORE_WHITESPACE_CHANGE (1 << 3)
 #define XDF_IGNORE_WHITESPACE_AT_EOL (1 << 4)
+#define XDF_USE_PATIENCE (1 << 5)
 #define XDF_WHITESPACE_FLAGS (XDF_IGNORE_WHITESPACE | XDF_IGNORE_WHITESPACE_CHANGE | XDF_IGNORE_WHITESPACE_AT_EOL)
 
 #define XDL_PATCH_NORMAL '-'
diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 9d0324a..1a5b13a 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -37,8 +37,235 @@ typedef struct s_xdpsplit {
 	int min_lo, min_hi;
 } xdpsplit_t;
 
+typedef struct s_xduniqmatch {
+	struct s_xduniqmatch *next;
+	long i1, i2, len;
+} xduniqmatch_t;
+
+typedef struct s_xdp_ha {
+	struct s_xdp_ha *up, *down;
+	struct s_xdp_ha *left;
+	long ha;
+	long off1;
+	long off2;
+} xdp_ha_t;
+
+
+static xduniqmatch_t *xdl_uniq_match_alloc(long i1, long i2, long len)
+{
+	xduniqmatch_t *match = xdl_malloc(sizeof(xduniqmatch_t));
+	if (match) {
+		match->next = NULL;
+		match->i1 = i1;
+		match->i2 = i2;
+		match->len = len;
+	}
+	return match;
+}
+
+static xduniqmatch_t *xdl_uniq_popfree(xduniqmatch_t *lst)
+{
+	xduniqmatch_t *next = lst->next;
+	xdl_free(lst);
+	return next;
+}
+
+static int xdp_ha_cmp_by_ha(const void *a1, const void *a2)
+{
+	const xdp_ha_t *ha1 = a1;
+	const xdp_ha_t *ha2 = a2;
+	if (ha1->ha == ha2->ha) {
+		return ha1->off2 - ha2->off2;
+	}
+	return ha1->ha - ha2->ha;
+}
+
+static int xdp_ha_cmp_by_off2(const void *a1, const void *a2)
+{
+	const xdp_ha_t *ha1 = a1;
+	const xdp_ha_t *ha2 = a2;
+	return ha2->off2 - ha1->off2;
+}
+
+static int patience_bisect_stack(xdp_ha_t **stacks, long len, long off1)
+{
+	long l = 0, r = len;
+
+	while (l < r) {
+		long i = (r + l) / 2;
+
+		if (off1 < stacks[i]->off1) {
+			l = i + 1;
+		} else {
+			r = i;
+		}
+	}
+	return l;
+}
+
+static int xdl_patience_split_aux(xduniqmatch_t *lcs, xdp_ha_t *ha)
+{
+	xduniqmatch_t *tmp;
+
+	while (ha) {
+		tmp = xdl_uniq_match_alloc(ha->off1, ha->off2, 1);
+		if (!tmp)
+			return -1;
+		tmp->next = lcs->next;
+		lcs->next = tmp;
+		lcs = tmp;
+
+		while ((ha = ha->down ? ha->down : ha->left)) {
+			if (lcs->i1 + lcs->len + 1 == ha->off1 && lcs->i2 + lcs->len + 1 == ha->off2) {
+				lcs->len++;
+				continue;
+			}
+			break;
+		}
+	}
+	return 0;
+}
+
+static int xdl_patience_split(unsigned long const *ha1, unsigned long const *ha2,
+			      xduniqmatch_t *begin, xduniqmatch_t *end)
+{
+	xdp_ha_t *recs;
+	long off1 = begin->i1 + begin->len, lim1 = end->i1;
+	long off2 = begin->i2 + begin->len, lim2 = end->i2;
+	long len, i, j, uniq;
+
+	len = lim1 - off1 + lim2 - off2;
+	recs = (xdp_ha_t *)xdl_malloc(sizeof(xdp_ha_t) * len);
+	if (recs == NULL)
+		return -1;
+
+	for (i = 0, j = off1; j < lim1 - off1; j++, i++) {
+		recs[i].ha = ha1[j];
+		recs[i].off1 = j;
+		recs[i].off2 = -1;
+		recs[i].up = recs[i].down = recs[i].left = NULL;
+	}
+	for (j = off2; j < lim2; j++, i++) {
+		recs[i].ha = ha2[j];
+		recs[i].off1 = -1;
+		recs[i].off2 = j;
+		recs[i].up = recs[i].down = recs[i].left = NULL;
+	}
+
+	qsort(recs, len, sizeof(xdp_ha_t), xdp_ha_cmp_by_ha);
+
+	uniq = 0;
+	for (i = 0; i < len - 1; ) {
+		long ha = recs[i].ha;
+
+		if (ha != recs[i + 1].ha) {
+			i++;
+			continue;
+		}
+
+		if (i < len - 2 && ha == recs[i + 2].ha) {
+			i += 3;
+			while (i < len - 1 && recs[i].ha == ha && i < len - 1) {
+				i++;
+			}
+			continue;
+		}
+
+		if (recs[i].off2 < 0 && recs[i + 1].off1 < 0) {
+			long a, b;
+			recs[uniq].ha   = ha;
+			a = recs[uniq].off1 = recs[i].off1;
+			b = recs[uniq].off2 = recs[i + 1].off2;
+			uniq++;
+		}
+		i += 2;
+	}
+
+	if (uniq) {
+		xdp_ha_t **stacks;
+		long alloc, len;
+
+		qsort(recs, uniq, sizeof(xdp_ha_t), xdp_ha_cmp_by_off2);
+
+		alloc = xdl_bogosqrt(uniq);
+		stacks = xdl_malloc(sizeof(xdp_ha_t *) * alloc);
 
+		if (stacks == NULL)
+			goto error;
 
+		len = 1;
+		stacks[0] = recs;
+
+		for (i = 1; i < uniq; i++) {
+			long off1 = recs[i].off1;
+			long k;
+
+			if (off1 < stacks[len - 1]->off1) {
+				if (len >= alloc) {
+					alloc *= 2;
+					stacks = xdl_realloc(stacks, sizeof(xdp_ha_t *) * alloc);
+					if (!stacks)
+						goto error;
+				}
+				stacks[k = len++] = NULL;
+			} else {
+				k = patience_bisect_stack(stacks, len - 1, off1);
+			}
+
+			if (k > 0) {
+				recs[i].left = stacks[k - 1];
+			}
+			if (stacks[k]) {
+				stacks[k]->down = &recs[i];
+				recs[i].up = stacks[k];
+			}
+			stacks[k] = &recs[i];
+		}
+
+		if (xdl_patience_split_aux(begin, stacks[len - 1]) < 0) {
+			xdl_free(stacks);
+			goto error;
+		}
+
+		xdl_free(stacks);
+	}
+
+	xdl_free(recs);
+	return 0;
+
+error:
+	xdl_free(recs);
+	return -1;
+}
+
+static int xdl_patience_lcs(xdfenv_t *xe, xduniqmatch_t *begin, xduniqmatch_t *end)
+{
+	unsigned long const *ha1 = xe->xdf1.ha, *ha2 = xe->xdf2.ha;
+	long off1 = begin->i1 + begin->len, lim1 = end->i1;
+	long off2 = begin->i2 + begin->len, lim2 = end->i2;
+	xduniqmatch_t *next;
+
+	for (; off1 < lim1 && off2 < lim2 && ha1[off1] == ha2[off2]; off1++, off2++);
+	for (; off1 < lim1 && off2 < lim2 && ha1[lim1 - 1] == ha2[lim2 - 1]; lim1--, lim2--);
+
+	begin->len += off1 - begin->i1;
+	end->len   += end->i1 - lim1;
+	end->i1     = lim1;
+	end->i2     = lim2;
+
+	if (off1 == lim1 || off2 == lim2)
+		return 0;
+
+	if (xdl_patience_split(ha1, ha2, begin, end))
+		return -1;
+
+	for (next = begin->next; next != end; begin = next, next = begin->next) {
+		if (xdl_patience_lcs(xe, begin, next) < 0)
+			return -1;
+	}
+
+	return 0;
+}
 
 static long xdl_split(unsigned long const *ha1, long off1, long lim1,
 		      unsigned long const *ha2, long off2, long lim2,
@@ -321,13 +548,13 @@ int xdl_recs_cmp(diffdata_t *dd1, long off1, long lim1,
 	return 0;
 }
 
-
 int xdl_do_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
 		xdfenv_t *xe) {
 	long ndiags;
 	long *kvd, *kvdf, *kvdb;
 	xdalgoenv_t xenv;
 	diffdata_t dd1, dd2;
+	int need_min = (xpp->flags & XDF_NEED_MINIMAL) != 0;
 
 	if (xdl_prepare_env(mf1, mf2, xpp, xe) < 0) {
 
@@ -364,12 +591,54 @@ int xdl_do_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
 	dd2.rchg = xe->xdf2.rchg;
 	dd2.rindex = xe->xdf2.rindex;
 
-	if (xdl_recs_cmp(&dd1, 0, dd1.nrec, &dd2, 0, dd2.nrec,
-			 kvdf, kvdb, (xpp->flags & XDF_NEED_MINIMAL) != 0, &xenv) < 0) {
+	if (xpp->flags & XDF_USE_PATIENCE) {
+		xduniqmatch_t *lcs;
+		long i1, i2;
+
+		lcs = xdl_uniq_match_alloc(0, 0, 0);
+		if (!lcs)
+			goto error;
+		lcs->next = xdl_uniq_match_alloc(xe->xdf1.nreff, xe->xdf2.nreff, 0);
+		if (!lcs->next || xdl_patience_lcs(xe, lcs, lcs->next) < 0) {
+			while ((lcs = xdl_uniq_popfree(lcs)));
+			goto error;
+		}
 
-		xdl_free(kvd);
-		xdl_free_env(xe);
-		return -1;
+		i1 = i2 = lcs->len;
+		if (lcs->len) {
+			fprintf(stderr, "skip  %ld:%ld -> %ld:%ld\n",
+				lcs->i1, i1, lcs->i2, i2);
+		}
+
+		while ((lcs = xdl_uniq_popfree(lcs))) {
+			fprintf(stderr, "usual %ld:%ld -> %ld:%ld\n",
+				i1, lcs->i1, i2, lcs->i2);
+			fprintf(stderr, "l/r: %ld / %ld\n",
+				xe->xdf1.rindex[lcs->i1],
+				xe->xdf2.rindex[lcs->i2]);
+			if (xdl_recs_cmp(&dd1, i1, lcs->i1, &dd2, i2, lcs->i2,
+					 kvdf, kvdb, need_min, &xenv) < 0) {
+				while ((lcs = xdl_uniq_popfree(lcs)));
+				goto error;
+			}
+			i1 = lcs->i1 + lcs->len;
+			i2 = lcs->i2 + lcs->len;
+			if (lcs->len) {
+				fprintf(stderr, "skip  %ld:%ld -> %ld:%ld (len %ld)\n",
+					lcs->i1, i1, lcs->i2, i2, lcs->len);
+				fprintf(stderr, "l/r: %ld / %ld\n",
+					xe->xdf1.rindex[lcs->i1 + lcs->len],
+					xe->xdf2.rindex[lcs->i2 + lcs->len]);
+			}
+		}
+	} else {
+		if (xdl_recs_cmp(&dd1, 0, dd1.nrec, &dd2, 0, dd2.nrec,
+				 kvdf, kvdb, need_min, &xenv) < 0) {
+error:
+			xdl_free(kvd);
+			xdl_free_env(xe);
+			return -1;
+		}
 	}
 
 	xdl_free(kvd);

[-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: libxdiff and patience diff
  2008-11-04 15:23       ` Pierre Habouzit
@ 2008-11-04 15:57         ` Johannes Schindelin
  2008-11-04 16:15           ` Pierre Habouzit
  2009-01-01 16:38         ` [PATCH 0/3] Teach Git about the patience diff algorithm Johannes Schindelin
  1 sibling, 1 reply; 67+ messages in thread
From: Johannes Schindelin @ 2008-11-04 15:57 UTC (permalink / raw)
  To: Pierre Habouzit; +Cc: davidel, Git ML

Hi,

On Tue, 4 Nov 2008, Pierre Habouzit wrote:

> The nasty thing about the patience diff is that it still needs the usual
> diff algorithm once it has split the file into chunks separated by
> "unique lines".

Actually, it should try to apply patience diff again in those chunks, 
separately.

> So you cannot make really independant stuff. What I could do is put most 
> of the xpatience diff into xpatience.c but it would still have to use 
> some functions from xdiffi.c that are currently private, so it messes 
> somehow the files more than it's worth IMHO.

I think it is better that you use the stuff from xdiffi.c through a well 
defined interface, i.e. _not_ mess up the code by mingling it together 
with the code in xdiffi.c.  The code is hard enough to read already.

Oh, BTW, "ha" is a hash of the lines which is used to make the line 
matching more performant.  You will see a lot of "ha" comparisons before 
actually calling xdl_recmatch() for that reason.  Incidentally, this is 
also the hash that I'd use for the hash multi-set I was referring to.

Oh, and I am not sure that it is worth your time trying to get it to run 
with the linear list, since you cannot reuse that code afterwards, and 
have to spend the same amount of time to redo it with the hash set.

I am awfully short on time, so it will take some days until I can review 
what you have already, unfortunately.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: libxdiff and patience diff
  2008-11-04 15:57         ` Johannes Schindelin
@ 2008-11-04 16:15           ` Pierre Habouzit
  0 siblings, 0 replies; 67+ messages in thread
From: Pierre Habouzit @ 2008-11-04 16:15 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: davidel, Git ML

[-- Attachment #1: Type: text/plain, Size: 2571 bytes --]

On Tue, Nov 04, 2008 at 03:57:44PM +0000, Johannes Schindelin wrote:
> Hi,
> 
> On Tue, 4 Nov 2008, Pierre Habouzit wrote:
> 
> > The nasty thing about the patience diff is that it still needs the usual
> > diff algorithm once it has split the file into chunks separated by
> > "unique lines".
> 
> Actually, it should try to apply patience diff again in those chunks, 
> separately.

yes it's what I do, but this has a fixed point as soon as you don't find
unique lines between the new found ones, or that that space is "empty".
E.g. you could have the two following hunks:

File A       File B
1            2
2            1
1            2
2            1
1            2
2
1

The simple leading/trailing reduction will do nothing, and you don't
have any shared unique lines, on that you must apply the usual diff
algorithm.

> > So you cannot make really independant stuff. What I could do is put most 
> > of the xpatience diff into xpatience.c but it would still have to use 
> > some functions from xdiffi.c that are currently private, so it messes 
> > somehow the files more than it's worth IMHO.
> 
> I think it is better that you use the stuff from xdiffi.c through a well 
> defined interface, i.e. _not_ mess up the code by mingling it together 
> with the code in xdiffi.c.  The code is hard enough to read already.

Hmmm. I'll see to that later, once I have something that works.

> Oh, BTW, "ha" is a hash of the lines which is used to make the line 
> matching more performant.  You will see a lot of "ha" comparisons before 
> actually calling xdl_recmatch() for that reason.  Incidentally, this is 
> also the hash that I'd use for the hash multi-set I was referring to.

Yeah, that's what I assumed it would be.

> Oh, and I am not sure that it is worth your time trying to get it to run 
> with the linear list, since you cannot reuse that code afterwards, and 
> have to spend the same amount of time to redo it with the hash set.

Having the linear list (actually an array) work would show me I hook at
the proper place. Replacing a data structure doesn't makes me afraid
because I've split the functions properly.

> I am awfully short on time, so it will take some days until I can review 
> what you have already, unfortunately.

NP, it was just in case, because I'm horribly stuck with that code right
now ;)

-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org

[-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 0/3] Teach Git about the patience diff algorithm
  2008-11-04 15:23       ` Pierre Habouzit
  2008-11-04 15:57         ` Johannes Schindelin
@ 2009-01-01 16:38         ` Johannes Schindelin
  2009-01-01 16:38           ` [PATCH 1/3] Implement " Johannes Schindelin
                             ` (3 more replies)
  1 sibling, 4 replies; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-01 16:38 UTC (permalink / raw)
  To: Pierre Habouzit; +Cc: davidel, Francis Galiegue, Git ML

Nothing fancy, really, just a straight-forward implementation of the
heavily under-documented and under-analyzed paience diff algorithm.

One thing is a bit ugly: the libxdiff interface does not allow to
calculate diffs of ranges of lines.  So instead of reusing an initialized
environment (with line hashes and all), the simple structure of mmfile_t
is exploited to fake an mmfile_t of a file _part_, reusing the file's
buffer.

And this mmfile_t pair gets a new environment, recalculating hashes and 
all.

Davide, I think it would be easier to refactor xdl_do_diff() to take line
ranges, and use that interface both in xpatience.c and in xmerge.c.  
(Although I do not know if you took xmerge.c at all; you seemed a bit 
reluctant about it back when I sent it to you.)

For those interested in studying the code, I suggest starting with the 
short comment at the beginning of xpatience.c and then working yourself up 
from the end (i.e. xdl_do_patience_diff()).

It might be a good idea to think about using this code in our merging code
once it is well reviewed and tested, as it might help a substantial number
of otherwise non-trivial merge conflicts.

Oh, and the bash completions are so trivial I did not even bother to test
them.

Happy new year.

Johannes Schindelin (3):
  Implement the patience diff algorithm
  Introduce the diff option '--patience'
  bash completions: Add the --patience option

 Documentation/diff-options.txt         |    3 +
 Makefile                               |    2 +-
 contrib/completion/git-completion.bash |    2 +
 diff.c                                 |    2 +
 t/t4033-diff-patience.sh               |  168 ++++++++++++++
 xdiff/xdiff.h                          |    1 +
 xdiff/xdiffi.c                         |    3 +
 xdiff/xdiffi.h                         |    2 +
 xdiff/xpatience.c                      |  374 ++++++++++++++++++++++++++++++++
 9 files changed, 556 insertions(+), 1 deletions(-)
 create mode 100755 t/t4033-diff-patience.sh
 create mode 100644 xdiff/xpatience.c

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 1/3] Implement the patience diff algorithm
  2009-01-01 16:38         ` [PATCH 0/3] Teach Git about the patience diff algorithm Johannes Schindelin
@ 2009-01-01 16:38           ` Johannes Schindelin
  2009-01-01 16:39           ` [PATCH 2/3] Introduce the diff option '--patience' Johannes Schindelin
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-01 16:38 UTC (permalink / raw)
  To: Pierre Habouzit; +Cc: davidel, Francis Galiegue, Git ML


The patience diff algorithm produces slightly more intuitive output
than the classic Myers algorithm, as it does not try to minimize the
number of +/- lines first, but tries to preserve the lines that are
unique.

To this end, it first determines lines that are unique in both files,
then the maximal sequence which preserves the order (relative to both
files) is extracted.

Starting from this initial set of common lines, the rest of the lines
is handled recursively, with Myers' algorithm as a fallback when
the patience algorithm fails (due to no common unique lines).

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 xdiff/xdiff.h     |    1 +
 xdiff/xdiffi.c    |    3 +
 xdiff/xdiffi.h    |    2 +
 xdiff/xpatience.c |  374 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 380 insertions(+), 0 deletions(-)
 create mode 100644 xdiff/xpatience.c

diff --git a/xdiff/xdiff.h b/xdiff/xdiff.h
index 361f802..4da052a 100644
--- a/xdiff/xdiff.h
+++ b/xdiff/xdiff.h
@@ -32,6 +32,7 @@ extern "C" {
 #define XDF_IGNORE_WHITESPACE (1 << 2)
 #define XDF_IGNORE_WHITESPACE_CHANGE (1 << 3)
 #define XDF_IGNORE_WHITESPACE_AT_EOL (1 << 4)
+#define XDF_PATIENCE_DIFF (1 << 5)
 #define XDF_WHITESPACE_FLAGS (XDF_IGNORE_WHITESPACE | XDF_IGNORE_WHITESPACE_CHANGE | XDF_IGNORE_WHITESPACE_AT_EOL)
 
 #define XDL_PATCH_NORMAL '-'
diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 9d0324a..3e97462 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -329,6 +329,9 @@ int xdl_do_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
 	xdalgoenv_t xenv;
 	diffdata_t dd1, dd2;
 
+	if (xpp->flags & XDF_PATIENCE_DIFF)
+		return xdl_do_patience_diff(mf1, mf2, xpp, xe);
+
 	if (xdl_prepare_env(mf1, mf2, xpp, xe) < 0) {
 
 		return -1;
diff --git a/xdiff/xdiffi.h b/xdiff/xdiffi.h
index 3e099dc..ad033a8 100644
--- a/xdiff/xdiffi.h
+++ b/xdiff/xdiffi.h
@@ -55,5 +55,7 @@ int xdl_build_script(xdfenv_t *xe, xdchange_t **xscr);
 void xdl_free_script(xdchange_t *xscr);
 int xdl_emit_diff(xdfenv_t *xe, xdchange_t *xscr, xdemitcb_t *ecb,
 		  xdemitconf_t const *xecfg);
+int xdl_do_patience_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
+		xdfenv_t *env);
 
 #endif /* #if !defined(XDIFFI_H) */
diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
new file mode 100644
index 0000000..6687940
--- /dev/null
+++ b/xdiff/xpatience.c
@@ -0,0 +1,374 @@
+/*
+ *  LibXDiff by Davide Libenzi ( File Differential Library )
+ *  Copyright (C) 2003-2009 Davide Libenzi, Johannes E. Schindelin
+ *
+ *  This library is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU Lesser General Public
+ *  License as published by the Free Software Foundation; either
+ *  version 2.1 of the License, or (at your option) any later version.
+ *
+ *  This library is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ *  Lesser General Public License for more details.
+ *
+ *  You should have received a copy of the GNU Lesser General Public
+ *  License along with this library; if not, write to the Free Software
+ *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ *  Davide Libenzi <davidel@xmailserver.org>
+ *
+ */
+#include "xinclude.h"
+#include "xtypes.h"
+#include "xdiff.h"
+
+/*
+ * The basic idea of patience diff is to find lines that are unique in
+ * both files.  These are intuitively the ones that we want to see as
+ * common lines.
+ *
+ * The maximal ordered sequence of such line pairs (where ordered means
+ * that the order in the sequence agrees with the order of the lines in
+ * both files) naturally defines an initial set of common lines.
+ *
+ * Now, the algorithm tries to extend the set of common lines by growing
+ * the line ranges where the files have identical lines.
+ *
+ * Between those common lines, the patience diff algorithm is applied
+ * recursively, until no unique line pairs can be found; these line ranges
+ * are handled by the well-known Myers algorithm.
+ */
+
+#define NON_UNIQUE ULONG_MAX
+
+/*
+ * This is a hash mapping from line hash to line numbers in the first and
+ * second file.
+ */
+struct hashmap {
+	int nr, alloc;
+	struct entry {
+		unsigned long hash;
+		/*
+		 * 0 = unused entry, 1 = first line, 2 = second, etc.
+		 * line2 is NON_UNIQUE if the line is not unique
+		 * in either the first or the second file.
+		 */
+		unsigned long line1, line2;
+		/*
+		 * "next" & "previous" are used for the longest common
+		 * sequence;
+		 * initially, "next" reflects only the order in file1.
+		 */
+		struct entry *next, *previous;
+	} *entries, *first, *last;
+	/* were common records found? */
+	unsigned long has_matches;
+	mmfile_t *file1, *file2;
+	xdfenv_t *env;
+	xpparam_t const *xpp;
+};
+
+/* The argument "pass" is 1 for the first file, 2 for the second. */
+static void insert_record(int line, struct hashmap *map, int pass)
+{
+	xrecord_t **records = pass == 1 ?
+		map->env->xdf1.recs : map->env->xdf2.recs;
+	xrecord_t *record = records[line - 1], *other;
+	/*
+	 * After xdl_prepare_env() (or more precisely, due to
+	 * xdl_classify_record()), the "ha" member of the records (AKA lines)
+	 * is _not_ the hash anymore, but a linearized version of it.  In
+	 * other words, the "ha" member is guaranteed to start with 0 and
+	 * the second record's ha can only be 0 or 1, etc.
+	 *
+	 * So we multiply ha by 2 in the hope that the hashing was
+	 * "unique enough".
+	 */
+	int index = (int)((record->ha << 1) % map->alloc);
+
+	while (map->entries[index].line1) {
+		other = map->env->xdf1.recs[map->entries[index].line1 - 1];
+		if (map->entries[index].hash != record->ha ||
+				!xdl_recmatch(record->ptr, record->size,
+					other->ptr, other->size,
+					map->xpp->flags)) {
+			if (++index >= map->alloc)
+				index = 0;
+			continue;
+		}
+		if (pass == 2)
+			map->has_matches = 1;
+		if (pass == 1 || map->entries[index].line2)
+			map->entries[index].line2 = NON_UNIQUE;
+		else
+			map->entries[index].line2 = line;
+		return;
+	}
+	if (pass == 2)
+		return;
+	map->entries[index].line1 = line;
+	map->entries[index].hash = record->ha;
+	if (!map->first)
+		map->first = map->entries + index;
+	if (map->last) {
+		map->last->next = map->entries + index;
+		map->entries[index].previous = map->last;
+	}
+	map->last = map->entries + index;
+	map->nr++;
+}
+
+/*
+ * This function has to be called for each recursion into the inter-hunk
+ * parts, as previously non-unique lines can become unique when being
+ * restricted to a smaller part of the files.
+ *
+ * It is assumed that env has been prepared using xdl_prepare().
+ */
+static int fill_hashmap(mmfile_t *file1, mmfile_t *file2,
+		xpparam_t const *xpp, xdfenv_t *env,
+		struct hashmap *result,
+		int line1, int count1, int line2, int count2)
+{
+	result->file1 = file1;
+	result->file2 = file2;
+	result->xpp = xpp;
+	result->env = env;
+
+	/* We know exactly how large we want the hash map */
+	result->alloc = count1 * 2;
+	result->entries = (struct entry *)
+		xdl_malloc(result->alloc * sizeof(struct entry));
+	if (!result->entries)
+		return -1;
+	memset(result->entries, 0, result->alloc * sizeof(struct entry));
+
+	/* First, fill with entries from the first file */
+	while (count1--)
+		insert_record(line1++, result, 1);
+
+	/* Then search for matches in the second file */
+	while (count2--)
+		insert_record(line2++, result, 2);
+
+	return 0;
+}
+
+/*
+ * Find the longest sequence with a smaller last element (meaning a smaller
+ * line2, as we construct the sequence with entries ordered by line1).
+ */
+static int binary_search(struct entry **sequence, int longest,
+		struct entry *entry)
+{
+	int left = -1, right = longest;
+
+	while (left + 1 < right) {
+		int middle = (left + right) / 2;
+		/* by construction, no two entries can be equal */
+		if (sequence[middle]->line2 > entry->line2)
+			right = middle;
+		else
+			left = middle;
+	}
+	/* return the index in "sequence", _not_ the sequence length */
+	return left;
+}
+
+/*
+ * The idea is to start with the list of common unique lines sorted by
+ * the order in file1.  For each of these pairs, the longest (partial)
+ * sequence whose last element's line2 is smaller is determined.
+ *
+ * For efficiency, the sequences are kept in a list containing exactly one
+ * item per sequence length: the sequence with the smallest last
+ * element (in terms of line2).
+ */
+static struct entry *find_longest_common_sequence(struct hashmap *map)
+{
+	struct entry **sequence = xdl_malloc(map->nr * sizeof(struct entry *));
+	int longest = 0, i;
+	struct entry *entry;
+
+	for (entry = map->first; entry; entry = entry->next) {
+		if (!entry->line2 || entry->line2 == NON_UNIQUE)
+			continue;
+		i = binary_search(sequence, longest, entry);
+		entry->previous = i < 0 ? NULL : sequence[i];
+		sequence[++i] = entry;
+		if (i == longest)
+			longest++;
+	}
+
+	/* No common unique lines were found */
+	if (!longest)
+		return NULL;
+
+	/* Iterate starting at the last element, adjusting the "next" members */
+	entry = sequence[longest - 1];
+	entry->next = NULL;
+	while (entry->previous) {
+		entry->previous->next = entry;
+		entry = entry->previous;
+	}
+	return entry;
+}
+
+static int match(struct hashmap *map, int line1, int line2)
+{
+	xrecord_t *record1 = map->env->xdf1.recs[line1 - 1];
+	xrecord_t *record2 = map->env->xdf2.recs[line2 - 1];
+	return xdl_recmatch(record1->ptr, record1->size,
+		record2->ptr, record2->size, map->xpp->flags);
+}
+
+static int patience_diff(mmfile_t *file1, mmfile_t *file2,
+		xpparam_t const *xpp, xdfenv_t *env,
+		int line1, int count1, int line2, int count2);
+
+static int walk_common_sequence(struct hashmap *map, struct entry *first,
+		int line1, int count1, int line2, int count2)
+{
+	int end1 = line1 + count1, end2 = line2 + count2;
+	int next1, next2;
+
+	for (;;) {
+		/* Try to grow the line ranges of common lines */
+		if (first) {
+			next1 = first->line1;
+			next2 = first->line2;
+			while (next1 > line1 && next2 > line2 &&
+					match(map, next1 - 1, next2 - 1)) {
+				next1--;
+				next2--;
+			}
+		} else {
+			next1 = end1;
+			next2 = end2;
+		}
+		while (line1 < next1 && line2 < next2 &&
+				match(map, line1, line2)) {
+			line1++;
+			line2++;
+		}
+
+		/* Recurse */
+		if (next1 > line1 || next2 > line2) {
+			struct hashmap submap;
+
+			memset(&submap, 0, sizeof(submap));
+			if (patience_diff(map->file1, map->file2,
+					map->xpp, map->env,
+					line1, next1 - line1,
+					line2, next2 - line2))
+				return -1;
+		}
+
+		if (!first)
+			return 0;
+
+		while (first->next &&
+				first->next->line1 == first->line1 + 1 &&
+				first->next->line2 == first->line2 + 1)
+			first = first->next;
+
+		line1 = first->line1 + 1;
+		line2 = first->line2 + 1;
+
+		first = first->next;
+	}
+}
+
+static int fall_back_to_classic_diff(struct hashmap *map,
+		int line1, int count1, int line2, int count2)
+{
+	/*
+	 * This probably does not work outside Git, since
+	 * we have a very simple mmfile structure.
+	 *
+	 * Note: ideally, we would reuse the prepared environment, but
+	 * the libxdiff interface does not (yet) allow for diffing only
+	 * ranges of lines instead of the whole files.
+	 */
+	mmfile_t subfile1, subfile2;
+	xpparam_t xpp;
+	xdfenv_t env;
+
+	subfile1.ptr = (char *)map->env->xdf1.recs[line1 - 1]->ptr;
+	subfile1.size = map->env->xdf1.recs[line1 + count1 - 2]->ptr +
+		map->env->xdf1.recs[line1 + count1 - 2]->size - subfile1.ptr;
+	subfile2.ptr = (char *)map->env->xdf2.recs[line2 - 1]->ptr;
+	subfile2.size = map->env->xdf2.recs[line2 + count2 - 2]->ptr +
+		map->env->xdf2.recs[line2 + count2 - 2]->size - subfile2.ptr;
+	xpp.flags = map->xpp->flags & ~XDF_PATIENCE_DIFF;
+	if (xdl_do_diff(&subfile1, &subfile2, &xpp, &env) < 0)
+		return -1;
+
+	memcpy(map->env->xdf1.rchg + line1 - 1, env.xdf1.rchg, count1);
+	memcpy(map->env->xdf2.rchg + line2 - 1, env.xdf2.rchg, count2);
+
+	return 0;
+}
+
+/*
+ * Recursively find the longest common sequence of unique lines,
+ * and if none was found, ask xdl_do_diff() to do the job.
+ *
+ * This function assumes that env was prepared with xdl_prepare_env().
+ */
+static int patience_diff(mmfile_t *file1, mmfile_t *file2,
+		xpparam_t const *xpp, xdfenv_t *env,
+		int line1, int count1, int line2, int count2)
+{
+	struct hashmap map;
+	struct entry *first;
+	int result = 0;
+
+	/* trivial case: one side is empty */
+	if (!count1) {
+		while(count2--)
+			env->xdf2.rchg[line2++ - 1] = 1;
+		return 0;
+	} else if (!count2) {
+		while(count1--)
+			env->xdf1.rchg[line1++ - 1] = 1;
+		return 0;
+	}
+
+	memset(&map, 0, sizeof(map));
+	if (fill_hashmap(file1, file2, xpp, env, &map,
+			line1, count1, line2, count2))
+		return -1;
+
+	/* are there any matching lines at all? */
+	if (!map.has_matches) {
+		while(count1--)
+			env->xdf1.rchg[line1++ - 1] = 1;
+		while(count2--)
+			env->xdf2.rchg[line2++ - 1] = 1;
+		return 0;
+	}
+
+	first = find_longest_common_sequence(&map);
+	if (first)
+		result = walk_common_sequence(&map, first,
+			line1, count1, line2, count2);
+	else
+		result = fall_back_to_classic_diff(&map,
+			line1, count1, line2, count2);
+
+	return result;
+}
+
+int xdl_do_patience_diff(mmfile_t *file1, mmfile_t *file2,
+		xpparam_t const *xpp, xdfenv_t *env)
+{
+	if (xdl_prepare_env(file1, file2, xpp, env) < 0)
+		return -1;
+
+	/* environment is cleaned up in xdl_diff() */
+	return patience_diff(file1, file2, xpp, env,
+			1, env->xdf1.nrec, 1, env->xdf2.nrec);
+}
-- 
1.6.1.rc3.412.ga72b

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 2/3] Introduce the diff option '--patience'
  2009-01-01 16:38         ` [PATCH 0/3] Teach Git about the patience diff algorithm Johannes Schindelin
  2009-01-01 16:38           ` [PATCH 1/3] Implement " Johannes Schindelin
@ 2009-01-01 16:39           ` Johannes Schindelin
  2009-01-01 16:39           ` [PATCH 3/3] bash completions: Add the --patience option Johannes Schindelin
  2009-01-01 19:45           ` [PATCH 0/3] Teach Git about the patience diff algorithm Linus Torvalds
  3 siblings, 0 replies; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-01 16:39 UTC (permalink / raw)
  To: Pierre Habouzit; +Cc: davidel, Francis Galiegue, Git ML


This commit teaches Git to produce diff output using the patience diff
algorithm with the diff option '--patience'.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---

	git log --check complains about this patch, for obvious reasons.

 Documentation/diff-options.txt |    3 +
 Makefile                       |    2 +-
 diff.c                         |    2 +
 t/t4033-diff-patience.sh       |  168 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 174 insertions(+), 1 deletions(-)
 create mode 100755 t/t4033-diff-patience.sh

diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
index 671f533..15ef35a 100644
--- a/Documentation/diff-options.txt
+++ b/Documentation/diff-options.txt
@@ -36,6 +36,9 @@ endif::git-format-patch[]
 --patch-with-raw::
 	Synonym for "-p --raw".
 
+--patience:
+	Generate a diff using the "patience diff" algorithm.
+
 --stat[=width[,name-width]]::
 	Generate a diffstat.  You can override the default
 	output width for 80-column terminal by "--stat=width".
diff --git a/Makefile b/Makefile
index 154cf34..2217873 100644
--- a/Makefile
+++ b/Makefile
@@ -1287,7 +1287,7 @@ $(LIB_FILE): $(LIB_OBJS)
 	$(QUIET_AR)$(RM) $@ && $(AR) rcs $@ $(LIB_OBJS)
 
 XDIFF_OBJS=xdiff/xdiffi.o xdiff/xprepare.o xdiff/xutils.o xdiff/xemit.o \
-	xdiff/xmerge.o
+	xdiff/xmerge.o xdiff/xpatience.o
 $(XDIFF_OBJS): xdiff/xinclude.h xdiff/xmacros.h xdiff/xdiff.h xdiff/xtypes.h \
 	xdiff/xutils.h xdiff/xprepare.h xdiff/xdiffi.h xdiff/xemit.h
 
diff --git a/diff.c b/diff.c
index 56b80f9..67718b7 100644
--- a/diff.c
+++ b/diff.c
@@ -2472,6 +2472,8 @@ int diff_opt_parse(struct diff_options *options, const char **av, int ac)
 		options->xdl_opts |= XDF_IGNORE_WHITESPACE_CHANGE;
 	else if (!strcmp(arg, "--ignore-space-at-eol"))
 		options->xdl_opts |= XDF_IGNORE_WHITESPACE_AT_EOL;
+	else if (!strcmp(arg, "--patience"))
+		options->xdl_opts |= XDF_PATIENCE_DIFF;
 
 	/* flags options */
 	else if (!strcmp(arg, "--binary")) {
diff --git a/t/t4033-diff-patience.sh b/t/t4033-diff-patience.sh
new file mode 100755
index 0000000..63c1b00
--- /dev/null
+++ b/t/t4033-diff-patience.sh
@@ -0,0 +1,168 @@
+#!/bin/sh
+
+test_description='patience diff algorithm'
+
+. ./test-lib.sh
+
+cat > file1 << EOF
+#include <stdio.h>
+
+// Frobs foo heartily
+int frobnitz(int foo)
+{
+    int i;
+    for(i = 0; i < 10; i++)
+    {
+        printf("Your answer is: ");
+        printf("%d\n", foo);
+    }
+}
+
+int fact(int n)
+{
+    if(n > 1)
+    {
+        return fact(n-1) * n;
+    }
+    return 1;
+}
+
+int main(int argc, char **argv)
+{
+    frobnitz(fact(10));
+}
+EOF
+
+cat > file2 << EOF
+#include <stdio.h>
+
+int fib(int n)
+{
+    if(n > 2)
+    {
+        return fib(n-1) + fib(n-2);
+    }
+    return 1;
+}
+
+// Frobs foo heartily
+int frobnitz(int foo)
+{
+    int i;
+    for(i = 0; i < 10; i++)
+    {
+        printf("%d\n", foo);
+    }
+}
+
+int main(int argc, char **argv)
+{
+    frobnitz(fib(10));
+}
+EOF
+
+cat > expect << EOF
+diff --git a/file1 b/file2
+index 6faa5a3..e3af329 100644
+--- a/file1
++++ b/file2
+@@ -1,26 +1,25 @@
+ #include <stdio.h>
+ 
++int fib(int n)
++{
++    if(n > 2)
++    {
++        return fib(n-1) + fib(n-2);
++    }
++    return 1;
++}
++
+ // Frobs foo heartily
+ int frobnitz(int foo)
+ {
+     int i;
+     for(i = 0; i < 10; i++)
+     {
+-        printf("Your answer is: ");
+         printf("%d\n", foo);
+     }
+ }
+ 
+-int fact(int n)
+-{
+-    if(n > 1)
+-    {
+-        return fact(n-1) * n;
+-    }
+-    return 1;
+-}
+-
+ int main(int argc, char **argv)
+ {
+-    frobnitz(fact(10));
++    frobnitz(fib(10));
+ }
+EOF
+
+test_expect_success 'patience diff' '
+
+	test_must_fail git diff --no-index --patience file1 file2 > output &&
+	test_cmp expect output
+
+'
+
+test_expect_success 'patience diff output is valid' '
+
+	mv file2 expect &&
+	git apply < output &&
+	test_cmp expect file2
+
+'
+
+cat > uniq1 << EOF
+1
+2
+3
+4
+5
+6
+EOF
+
+cat > uniq2 << EOF
+a
+b
+c
+d
+e
+f
+EOF
+
+cat > expect << EOF
+diff --git a/uniq1 b/uniq2
+index b414108..0fdf397 100644
+--- a/uniq1
++++ b/uniq2
+@@ -1,6 +1,6 @@
+-1
+-2
+-3
+-4
+-5
+-6
++a
++b
++c
++d
++e
++f
+EOF
+
+test_expect_success 'completely different files' '
+
+	test_must_fail git diff --no-index --patience uniq1 uniq2 > output &&
+	test_cmp expect output
+
+'
+
+test_done
-- 
1.6.1.rc3.412.ga72b

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 3/3] bash completions: Add the --patience option
  2009-01-01 16:38         ` [PATCH 0/3] Teach Git about the patience diff algorithm Johannes Schindelin
  2009-01-01 16:38           ` [PATCH 1/3] Implement " Johannes Schindelin
  2009-01-01 16:39           ` [PATCH 2/3] Introduce the diff option '--patience' Johannes Schindelin
@ 2009-01-01 16:39           ` Johannes Schindelin
  2009-01-01 19:45           ` [PATCH 0/3] Teach Git about the patience diff algorithm Linus Torvalds
  3 siblings, 0 replies; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-01 16:39 UTC (permalink / raw)
  To: Pierre Habouzit; +Cc: davidel, Francis Galiegue, Git ML


Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 contrib/completion/git-completion.bash |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/contrib/completion/git-completion.bash b/contrib/completion/git-completion.bash
index a046441..b3e1e22 100755
--- a/contrib/completion/git-completion.bash
+++ b/contrib/completion/git-completion.bash
@@ -777,6 +777,7 @@ _git_diff ()
 			--no-prefix --src-prefix= --dst-prefix=
 			--base --ours --theirs
 			--inter-hunk-context=
+			--patience
 			"
 		return
 		;;
@@ -969,6 +970,7 @@ _git_log ()
 			--parents --children --full-history
 			--merge
 			--inter-hunk-context=
+			--patience
 			"
 		return
 		;;
-- 
1.6.1.rc3.412.ga72b

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-01 16:38         ` [PATCH 0/3] Teach Git about the patience diff algorithm Johannes Schindelin
                             ` (2 preceding siblings ...)
  2009-01-01 16:39           ` [PATCH 3/3] bash completions: Add the --patience option Johannes Schindelin
@ 2009-01-01 19:45           ` Linus Torvalds
  2009-01-01 20:00             ` Linus Torvalds
  2009-01-01 20:46             ` [PATCH 0/3] Teach Git about " Adeodato Simó
  3 siblings, 2 replies; 67+ messages in thread
From: Linus Torvalds @ 2009-01-01 19:45 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Pierre Habouzit, davidel, Francis Galiegue, Git ML



On Thu, 1 Jan 2009, Johannes Schindelin wrote:
> 
> Nothing fancy, really, just a straight-forward implementation of the
> heavily under-documented and under-analyzed paience diff algorithm.

Exactly because the patience diff is so under-documented, could you 
perhaps give a few examples of how it differs in the result, and why it's 
so wonderful? Yes, yes, I can google, and no, no, nothing useful shows up 
except for *totally* content-free fanboisms. 

So could we have some actual real data on it?

			Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-01 19:45           ` [PATCH 0/3] Teach Git about the patience diff algorithm Linus Torvalds
@ 2009-01-01 20:00             ` Linus Torvalds
  2009-01-02 18:17               ` Johannes Schindelin
  2009-01-02 21:59               ` [PATCH 1/3 v2] Implement " Johannes Schindelin
  2009-01-01 20:46             ` [PATCH 0/3] Teach Git about " Adeodato Simó
  1 sibling, 2 replies; 67+ messages in thread
From: Linus Torvalds @ 2009-01-01 20:00 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Pierre Habouzit, davidel, Francis Galiegue, Git ML



On Thu, 1 Jan 2009, Linus Torvalds wrote:
> 
> So could we have some actual real data on it?

.. and some testing. I tried to get some limited data for the kernel 
myself, by doing

	git log --patience -p v2.6.28.. > ~/patience

but I just got a core-dump instead.

Pinpointing it to a specific commit shows a smaller failure case:

	git show -p --patience 05d564fe00c05bf8ff93948057ca1acb5bc68e10

which might help you debug this.

			Linus

---
#0  0x00000000004cce73 in xdl_get_rec (xdf=0x7fffcb1e08d0, ri=-9, rec=0x7fffcb1e0778) at xdiff/xemit.c:36
#1  0x00000000004cced1 in xdl_emit_record (xdf=0x7fffcb1e08d0, ri=-9, pre=0x4eef79 "-", ecb=0x7fffcb1e0bc0)
    at xdiff/xemit.c:46
#2  0x00000000004cd4e6 in xdl_emit_diff (xe=0x7fffcb1e08d0, xscr=0x1111daf0, ecb=0x7fffcb1e0bc0, 
    xecfg=0x7fffcb1e0b80) at xdiff/xemit.c:179
#3  0x00000000004caa2c in xdl_diff (mf1=0x7fffcb1e0a40, mf2=0x7fffcb1e0a30, xpp=0x7fffcb1e0bd0, 
    xecfg=0x7fffcb1e0b80, ecb=0x7fffcb1e0bc0) at xdiff/xdiffi.c:559
#4  0x00000000004c088d in xdi_diff (mf1=0x7fffcb1e0c00, mf2=0x7fffcb1e0bf0, xpp=0x7fffcb1e0bd0, 
    xecfg=0x7fffcb1e0b80, xecb=0x7fffcb1e0bc0) at xdiff-interface.c:137
#5  0x00000000004c0914 in xdi_diff_outf (mf1=0x7fffcb1e0c00, mf2=0x7fffcb1e0bf0, fn=0x475448 <fn_out_consume>, 
    consume_callback_data=0x7fffcb1e0b40, xpp=0x7fffcb1e0bd0, xecfg=0x7fffcb1e0b80, xecb=0x7fffcb1e0bc0)
    at xdiff-interface.c:154
#6  0x00000000004780dc in builtin_diff (name_a=0x25cf6f0 "fs/nfs/nfs4xdr.c", name_b=0x25cf6f0 "fs/nfs/nfs4xdr.c", 
    one=0x25cf690, two=0x26ae110, xfrm_msg=0xf659900 "index 7dde309..29656c5 100644", o=0x7fffcb1e1088, 
    complete_rewrite=0) at diff.c:1486
#7  0x00000000004796e4 in run_diff_cmd (pgm=0x0, name=0x25cf6f0 "fs/nfs/nfs4xdr.c", other=0x0, 
    attr_path=0x25cf6f0 "fs/nfs/nfs4xdr.c", one=0x25cf690, two=0x26ae110, 
    xfrm_msg=0xf659900 "index 7dde309..29656c5 100644", o=0x7fffcb1e1088, complete_rewrite=0) at diff.c:2024
#8  0x0000000000479e2e in run_diff (p=0xaffece0, o=0x7fffcb1e1088) at diff.c:2158
#9  0x000000000047b959 in diff_flush_patch (p=0xaffece0, o=0x7fffcb1e1088) at diff.c:2743
#10 0x000000000047c942 in diff_flush (options=0x7fffcb1e1088) at diff.c:3184
#11 0x0000000000488b75 in log_tree_diff_flush (opt=0x7fffcb1e0f40) at log-tree.c:451
#12 0x0000000000488d17 in log_tree_diff (opt=0x7fffcb1e0f40, commit=0x2673198, log=0x7fffcb1e0ec0) at log-tree.c:503
#13 0x0000000000488da4 in log_tree_commit (opt=0x7fffcb1e0f40, commit=0x2673198) at log-tree.c:526
#14 0x000000000043218d in cmd_log_walk (rev=0x7fffcb1e0f40) at builtin-log.c:201
#15 0x0000000000432bae in cmd_log (argc=4, argv=0x7fffcb1e14b0, prefix=0x0) at builtin-log.c:423
#16 0x000000000040486b in run_command (p=0x70c7b0, argc=4, argv=0x7fffcb1e14b0) at git.c:243
#17 0x0000000000404a1c in handle_internal_command (argc=4, argv=0x7fffcb1e14b0) at git.c:387
#18 0x0000000000404c6e in main (argc=4, argv=0x7fffcb1e14b0) at git.c:484

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-01 19:45           ` [PATCH 0/3] Teach Git about the patience diff algorithm Linus Torvalds
  2009-01-01 20:00             ` Linus Torvalds
@ 2009-01-01 20:46             ` Adeodato Simó
  2009-01-02  1:56               ` Linus Torvalds
  1 sibling, 1 reply; 67+ messages in thread
From: Adeodato Simó @ 2009-01-01 20:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Schindelin, Pierre Habouzit, davidel, Francis Galiegue,
	Git ML

[-- Attachment #1: Type: text/plain, Size: 4390 bytes --]

* Linus Torvalds [Thu, 01 Jan 2009 11:45:21 -0800]:

> On Thu, 1 Jan 2009, Johannes Schindelin wrote:

> > Nothing fancy, really, just a straight-forward implementation of the
> > heavily under-documented and under-analyzed paience diff algorithm.

> Exactly because the patience diff is so under-documented, could you 
> perhaps give a few examples of how it differs in the result, and why it's 
> so wonderful? Yes, yes, I can google, and no, no, nothing useful shows up 
> except for *totally* content-free fanboisms. 

> So could we have some actual real data on it?

For me, the cases where I find patience output to be of substantial
higher readability are those involving a rewrite of several consecutive
paragraphs (i.e., lines of code separated by blank lines). Compare:

-8<- git -8<-
@@ -51,29 +51,30 @@ def mbox_update(bug):
             f.close()
     else:
         # make a list of Message-Id we have
-        fp1 = file(path, 'ab+')
-        ids1 = [ x.get('Message-Id') for x in mailbox.UnixMailbox(fp1) ]
+        msgids = { x.get('Message-Id') for x in mailbox.mbox(path) }
 
-        # get remote mbox again
-        fp2 = tempfile.TemporaryFile()
-        retrieve_mbox(bug, fp2)
+        with tempfile.NamedTemporaryFile() as tmpfd:
+            # retrieve the remote mbox again
+            retrieve_mbox(bug, tmpfd)
 
-        # parse its messages
-        fp2.seek(0)
-        parser = email.Parser.Parser()
-        msgs2 = dict((x['Message-Id'], x)
-                    for x in mailbox.UnixMailbox(fp2, parser.parse))
+            # parse its messages
+            parser = email.parser.Parser()
+            new_msgids = { x['Message-Id']: x
+                            for x in mailbox.mbox(tmpfd.name, parser.parse) }
 
-        # now append the new ones
-        for msgid in set(msgs2.keys()) - set(ids1):
-            fp1.write('\n' + msgs2[msgid].as_string(unixfrom=True))
+        with open(path, 'a+') as fd:
+            # now append the new messages
+            for msgid in new_msgids.keys() - msgids:
+                fd.write('\n' + new_msgids[msgid].as_string(unixfrom=True))
 
     return path
->8- git ->8-

with:

-8<- bzr patience -8<-
@@ -51,29 +51,30 @@
             f.close()
     else:
         # make a list of Message-Id we have
-        fp1 = file(path, 'ab+')
-        ids1 = [ x.get('Message-Id') for x in mailbox.UnixMailbox(fp1) ]
-
-        # get remote mbox again
-        fp2 = tempfile.TemporaryFile()
-        retrieve_mbox(bug, fp2)
-
-        # parse its messages
-        fp2.seek(0)
-        parser = email.Parser.Parser()
-        msgs2 = dict((x['Message-Id'], x)
-                    for x in mailbox.UnixMailbox(fp2, parser.parse))
-
-        # now append the new ones
-        for msgid in set(msgs2.keys()) - set(ids1):
-            fp1.write('\n' + msgs2[msgid].as_string(unixfrom=True))
+        msgids = { x.get('Message-Id') for x in mailbox.mbox(path) }
+
+        with tempfile.NamedTemporaryFile() as tmpfd:
+            # retrieve the remote mbox again
+            retrieve_mbox(bug, tmpfd)
+
+            # parse its messages
+            parser = email.parser.Parser()
+            new_msgids = { x['Message-Id']: x
+                            for x in mailbox.mbox(tmpfd.name, parser.parse) }
+
+        with open(path, 'a+') as fd:
+            # now append the new messages
+            for msgid in new_msgids.keys() - msgids:
+                fd.write('\n' + new_msgids[msgid].as_string(unixfrom=True))
 
     return path
->8- bzr patience ->8-

I don't know about you, but I find the latter much easier to read,
because the whole context of each version is always available.

As you see, in (at least) this case is just a matter of considering the
blank lines worthy of presented as common, or not.

I'll note that in this particular case, `git diff` yielded the very same
results with or without --patience. I don't know why that is, Johannes?
I'll also note that /usr/bin/diff produces (in this case) something
closer to patience than to git.

I'm attaching both versions of the file in case they are useful to
anybody.

-- 
Adeodato Simó                                     dato at net.com.org.es
Debian Developer                                  adeodato at debian.org
 
I promise you. Once I enter into an exclusive relationship, I sleep with
very few people.
                -- Denny Crane

[-- Attachment #2: bdo0 --]
[-- Type: text/plain, Size: 2358 bytes --]

#! /usr/bin/python
## vim: fileencoding=utf-8

"""Open Debian BTS mboxes with Mutt, à la /usr/bin/bts show --mbox.

A cache of mboxes is kept, and changed mboxes will be merged with existing
files instead of replacing them, so that e.g. read-status is preserved for each
message.
"""

import os
import re
import sys
import urllib
import mailbox
import tempfile
import email.Parser

MBOX_DIR = os.path.expanduser('~/.mail/y.bug-cache')

##

def main():
    if len(sys.argv) != 2:
        print >>sys.stderr, 'Usage: %s <bugnumber>' % (sys.argv[0],)
        sys.exit(1)

    bug = re.sub(r'[^0-9]', '', sys.argv[1])
    if not re.match(r'\d{4,}$', bug):
        print >>sys.stderr, \
            'E: %s does not seem a valid number' % (sys.argv[1],)
        sys.exit(1)

    path = mbox_update(bug)
    invoke_mailer(path)

##

def mbox_update(bug):
    """Return a path with an up-to-date copy of the mbox for bug."""
    path = os.path.join(MBOX_DIR, bug + '.mbox')

    if not os.path.exists(path):
        f = file(path, 'wb')
        try:
            retrieve_mbox(bug, f)
        except:
            os.unlink(path)
            raise
        else:
            f.close()
    else:
        # make a list of Message-Id we have
        fp1 = file(path, 'ab+')
        ids1 = [ x.get('Message-Id') for x in mailbox.UnixMailbox(fp1) ]

        # get remote mbox again
        fp2 = tempfile.TemporaryFile()
        retrieve_mbox(bug, fp2)

        # parse its messages
        fp2.seek(0)
        parser = email.Parser.Parser()
        msgs2 = dict((x['Message-Id'], x)
                    for x in mailbox.UnixMailbox(fp2, parser.parse))

        # now append the new ones
        for msgid in set(msgs2.keys()) - set(ids1):
            fp1.write('\n' + msgs2[msgid].as_string(unixfrom=True))

    return path

def retrieve_mbox(bug, fileobj):
    """Retrieve mbox for bug from bugs.debian.org, writing it to fileobj."""
    for line in urllib.urlopen(
            'http://bugs.debian.org/cgi-bin/bugreport.cgi?mboxstatus=yes;mboxmaint=yes;mbox=yes;bug=%s' % (bug,)):
        fileobj.write(line)

def invoke_mailer(path):
    """Exec mutt, opening path."""
    os.execlp('mutt', 'mutt', '-f', path)

##

if __name__ == '__main__':
    try:
        sys.exit(main())
    except KeyboardInterrupt:
        print >>sys.stderr, '\nCancelled.'
        sys.exit(1)

[-- Attachment #3: bdo1 --]
[-- Type: text/plain, Size: 2501 bytes --]

#! /usr/bin/python3

"""Open Debian BTS mboxes with Mutt, à la /usr/bin/bts show --mbox.

A cache of mboxes is kept, and changed mboxes will be merged with existing
files instead of replacing them, so that e.g. read-status is preserved for each
message.
"""

import os
import re
import sys
import mailbox
import tempfile
import email.parser
import urllib.request

MBOX_DIR = os.path.expanduser('~/.mail/y.bug-cache')

##

def main():
    if len(sys.argv) != 2:
        print('Usage: {0} <bugnumber>'.format(sys.argv[0]), file=sys.stderr)
        return 1
    else:
        bug = re.sub(r'[^0-9]', '', sys.argv[1])

    if not re.search(r'^\d{4,}$', bug):
        print('E: {0} does not seem a valid number'.format(sys.argv[1]),
              file=sys.stderr)
        return 1

    path = mbox_update(bug)
    invoke_mailer(path)

##

def mbox_update(bug):
    """Return a path with an up-to-date copy of the mbox for bug."""
    path = os.path.join(MBOX_DIR, bug + '.mbox')

    if not os.path.exists(path):
        f = open(path, 'wb')
        try:
            retrieve_mbox(bug, f)
        except:
            os.unlink(path)
            raise
        else:
            f.close()
    else:
        # make a list of Message-Id we have
        msgids = { x.get('Message-Id') for x in mailbox.mbox(path) }

        with tempfile.NamedTemporaryFile() as tmpfd:
            # retrieve the remote mbox again
            retrieve_mbox(bug, tmpfd)

            # parse its messages
            parser = email.parser.Parser()
            new_msgids = { x['Message-Id']: x
                            for x in mailbox.mbox(tmpfd.name, parser.parse) }

        with open(path, 'a+') as fd:
            # now append the new messages
            for msgid in new_msgids.keys() - msgids:
                fd.write('\n' + new_msgids[msgid].as_string(unixfrom=True))

    return path

def retrieve_mbox(bug, fileobj):
    """Retrieve mbox for bug from bugs.debian.org, writing it to fileobj."""
    url = urllib.request.urlopen(
            'http://bugs.debian.org/cgi-bin/bugreport.cgi?'
            'mboxstatus=yes;mboxmaint=yes;mbox=yes;bug={0}'.format(bug))
    for line in url.fp: # http://bugs.python.org/issue4608
        fileobj.write(line)

def invoke_mailer(path):
    """Exec mutt, opening path."""
    os.execlp('mutt', 'mutt', '-f', path)

##

if __name__ == '__main__':
    try:
        sys.exit(main())
    except KeyboardInterrupt:
        print('\nCancelled.', file=sys.stderr)
        sys.exit(1)

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-01 20:46             ` [PATCH 0/3] Teach Git about " Adeodato Simó
@ 2009-01-02  1:56               ` Linus Torvalds
  2009-01-02 10:55                 ` Clemens Buchacher
  2009-01-02 18:50                 ` Adeodato Simó
  0 siblings, 2 replies; 67+ messages in thread
From: Linus Torvalds @ 2009-01-02  1:56 UTC (permalink / raw)
  To: Adeodato Simó
  Cc: Johannes Schindelin, Pierre Habouzit, davidel, Francis Galiegue,
	Git ML

On Thu, 1 Jan 2009, Adeodato Simó wrote:
> 
> For me, the cases where I find patience output to be of substantial
> higher readability are those involving a rewrite of several consecutive
> paragraphs (i.e., lines of code separated by blank lines). Compare:

I don't think that's a "patience diff" issue.

That's simply an issue of merging consecutive diff fragments together if 
they are close-by, and that's independent of the actual diff algorithm 
itself.

> I'll note that in this particular case, `git diff` yielded the very same
> results with or without --patience. I don't know why that is, Johannes?
> I'll also note that /usr/bin/diff produces (in this case) something
> closer to patience than to git.

See above - I really don't think this has anything to do with "patience vs 
non-patience". It's more akin to the things we do for our merge conflict 
markers: if we have two merge conflicts next to each other, with just a 
couple of lines in between, we coalesce the merge conflicts into one 
larger one instead.

We don't do that for regular diffs - they're always kept minimal (ok, not 
really minimal, but as close to minimal as the algorithm finds them).

See commit f407f14deaa14ebddd0d27238523ced8eca74393 for the git merge 
conflict merging. We _could_ do similar things for regular diffs. It's 
sometimes useful, sometimes not.

		Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-02  1:56               ` Linus Torvalds
@ 2009-01-02 10:55                 ` Clemens Buchacher
  2009-01-02 10:58                   ` Clemens Buchacher
  2009-01-02 11:03                   ` Junio C Hamano
  2009-01-02 18:50                 ` Adeodato Simó
  1 sibling, 2 replies; 67+ messages in thread
From: Clemens Buchacher @ 2009-01-02 10:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Adeodato Simó, Johannes Schindelin, Pierre Habouzit, davidel,
	Francis Galiegue, Git ML

On Thu, Jan 01, 2009 at 05:56:13PM -0800, Linus Torvalds wrote:
> See above - I really don't think this has anything to do with "patience vs 
> non-patience". It's more akin to the things we do for our merge conflict 
> markers: if we have two merge conflicts next to each other, with just a 
> couple of lines in between, we coalesce the merge conflicts into one 
> larger one instead.

But patience diff does more than that. Take a look at "git diff" and "git
diff --patience" output below (taken from [1]). You can see that patience
diff produces two separate hunks, one removing a function and one adding a
function. Merging consecutive diff fragments would have produced one big
hunk instead.

*** git diff ****************************** git diff --patience **********
 #include <stdio.h>                   | #include <stdio.h>
                                      |
+int fib(int n)                       |-// Frobs foo heartily
+{                                    |-int frobnitz(int foo)
+    if(n > 2)                        |+int fib(int n)
+    {                                | {
+        return fib(n-1) + fib(n-2);  |-    int i;
+    }                                |-    for(i = 0; i < 10; i++)
+    return 1;                        |+    if(n > 2)
+}                                    |     {
+                                     |-        printf("Your answer is: ");
 // Frobs foo heartily                |-        printf("%d\n", foo);
 int frobnitz(int foo)                |+        return fib(n-1) + fib(n-2);
 {                                    |     }
     int i;                           |+    return 1;
     for(i = 0; i < 10; i++)          | }
     {                                |
-        printf("Your answer is: ");  |-int fact(int n)
         printf("%d\n", foo);         |+// Frobs foo heartily
     }                                |+int frobnitz(int foo)
 }                                    | {
                                      |-    if(n > 1)
-int fact(int n)                      |+    int i;
-{                                    |+    for(i = 0; i < 10; i++)
-    if(n > 1)                        |     {
-    {                                |-        return fact(n-1) * n;
-        return fact(n-1) * n;        |+        printf("%d\n", foo);
-    }                                |     }
-    return 1;                        |-    return 1;
-}                                    | }
-                                     |
 int main(int argc, char **argv)      | int main(int argc, char **argv)
 {                                    | {
-    frobnitz(fact(10));              |-    frobnitz(fact(10));
+    frobnitz(fib(10));               |+    frobnitz(fib(10));
 }                                    | }

[1] http://alfedenzo.livejournal.com/170301.html

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-02 10:55                 ` Clemens Buchacher
@ 2009-01-02 10:58                   ` Clemens Buchacher
  2009-01-02 16:42                     ` Linus Torvalds
  2009-01-02 11:03                   ` Junio C Hamano
  1 sibling, 1 reply; 67+ messages in thread
From: Clemens Buchacher @ 2009-01-02 10:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Adeodato Simó, Johannes Schindelin, Pierre Habouzit, davidel,
	Francis Galiegue, Git ML

Only two choices, and I still get it wrong. The diffs should be labeled the
other way around, of course.

*** git diff --patience ******************** git diff ********************
 #include <stdio.h>                   | #include <stdio.h>
                                      |
+int fib(int n)                       |-// Frobs foo heartily
+{                                    |-int frobnitz(int foo)
+    if(n > 2)                        |+int fib(int n)
+    {                                | {
+        return fib(n-1) + fib(n-2);  |-    int i;
+    }                                |-    for(i = 0; i < 10; i++)
+    return 1;                        |+    if(n > 2)
+}                                    |     {
+                                     |-        printf("Your answer is: ");
 // Frobs foo heartily                |-        printf("%d\n", foo);
 int frobnitz(int foo)                |+        return fib(n-1) + fib(n-2);
 {                                    |     }
     int i;                           |+    return 1;
     for(i = 0; i < 10; i++)          | }
     {                                |
-        printf("Your answer is: ");  |-int fact(int n)
         printf("%d\n", foo);         |+// Frobs foo heartily
     }                                |+int frobnitz(int foo)
 }                                    | {
                                      |-    if(n > 1)
-int fact(int n)                      |+    int i;
-{                                    |+    for(i = 0; i < 10; i++)
-    if(n > 1)                        |     {
-    {                                |-        return fact(n-1) * n;
-        return fact(n-1) * n;        |+        printf("%d\n", foo);
-    }                                |     }
-    return 1;                        |-    return 1;
-}                                    | }
-                                     |
 int main(int argc, char **argv)      | int main(int argc, char **argv)
 {                                    | {
-    frobnitz(fact(10));              |-    frobnitz(fact(10));
+    frobnitz(fib(10));               |+    frobnitz(fib(10));
 }                                    | }

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-02 10:55                 ` Clemens Buchacher
  2009-01-02 10:58                   ` Clemens Buchacher
@ 2009-01-02 11:03                   ` Junio C Hamano
  1 sibling, 0 replies; 67+ messages in thread
From: Junio C Hamano @ 2009-01-02 11:03 UTC (permalink / raw)
  To: Clemens Buchacher
  Cc: Linus Torvalds, Adeodato Simó, Johannes Schindelin,
	Pierre Habouzit, davidel, Francis Galiegue, Git ML

Clemens Buchacher <drizzd@aon.at> writes:

> But patience diff does more than that. Take a look at "git diff" and "git
> diff --patience" output below (taken from [1]).
>
> *** git diff ****************************** git diff --patience **********
>  #include <stdio.h>                   | #include <stdio.h>

I suspect you have the above labels backwards...

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-02 10:58                   ` Clemens Buchacher
@ 2009-01-02 16:42                     ` Linus Torvalds
  2009-01-02 18:46                       ` Johannes Schindelin
                                         ` (2 more replies)
  0 siblings, 3 replies; 67+ messages in thread
From: Linus Torvalds @ 2009-01-02 16:42 UTC (permalink / raw)
  To: Clemens Buchacher
  Cc: Adeodato Simó, Johannes Schindelin, Pierre Habouzit, davidel,
	Francis Galiegue, Git ML

On Fri, 2 Jan 2009, Clemens Buchacher wrote:
>
> Only two choices, and I still get it wrong. The diffs should be labeled the
> other way around, of course.

Yes, this one is a real patience diff change, but it's also the same one 
that I've seen in the google fanboi findings. What google did _not_ show 
was any real-life examples, or anybody doing any critical analysis.

So I was hoping for something else than a single "in this case patience 
diff works really well". I was hoping to see what it does in real life. 
But when I tried it on the kernel archive, I get a core dump.

For example, in real life, files are bigger, and unique lines are not 
necessarily always common (generated files, whatever). Depending on unique 
line ordering may work fine in 95% of all cases, but do you know that it 
works fine in general? Does it work when 50% of lines are unique? I 
believe it does. Does ti work when just 1% of lines are unique? I just 
don't know.

And I haven't seen _any_ real critical analysis of it. Anywhere.

			Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-01 20:00             ` Linus Torvalds
@ 2009-01-02 18:17               ` Johannes Schindelin
  2009-01-02 18:49                 ` Linus Torvalds
  2009-01-02 18:51                 ` Jeff King
  2009-01-02 21:59               ` [PATCH 1/3 v2] Implement " Johannes Schindelin
  1 sibling, 2 replies; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-02 18:17 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Pierre Habouzit, davidel, Francis Galiegue, Git ML

Hi,

On Thu, 1 Jan 2009, Linus Torvalds wrote:

> On Thu, 1 Jan 2009, Linus Torvalds wrote:
> > 
> > So could we have some actual real data on it?
> 
> .. and some testing. I tried to get some limited data for the kernel 
> myself, by doing
> 
> 	git log --patience -p v2.6.28.. > ~/patience
> 
> but I just got a core-dump instead.
> 
> Pinpointing it to a specific commit shows a smaller failure case:
> 
> 	git show -p --patience 05d564fe00c05bf8ff93948057ca1acb5bc68e10
> 
> which might help you debug this.

Thanks.  I am on it.  valgrind finds an earlier place in 
xdl_change_compact() which I think is rather more sensible, but at the 
same time a bit worrisome, too, as I did not expect any errors _that_ 
late in the game (I did not touch that code).

BTW the "-p" is not necessary with "show", indeed, you cannot even switch 
it off.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-02 16:42                     ` Linus Torvalds
@ 2009-01-02 18:46                       ` Johannes Schindelin
  2009-01-02 19:03                         ` Linus Torvalds
  2009-01-02 21:59                       ` [PATCH 0/3] Teach Git about the patience diff algorithm Johannes Schindelin
  2009-01-08 19:55                       ` Adeodato Simó
  2 siblings, 1 reply; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-02 18:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Clemens Buchacher, Adeodato Simó, Pierre Habouzit, davidel,
	Francis Galiegue, Git ML

Hi,

On Fri, 2 Jan 2009, Linus Torvalds wrote:

> On Fri, 2 Jan 2009, Clemens Buchacher wrote:
> >
> > Only two choices, and I still get it wrong. The diffs should be 
> > labeled the other way around, of course.
> 
> Yes, this one is a real patience diff change, but it's also the same one 
> that I've seen in the google fanboi findings. What google did _not_ show 
> was any real-life examples, or anybody doing any critical analysis.

FWIW it's the test case in the commit introducing the --patience option.

> So I was hoping for something else than a single "in this case patience 
> diff works really well". I was hoping to see what it does in real life.

I will dig out a real-world example where I _know_ patience diff would 
have helped.  (I remember that I rewrote a pretty large diff which was 
sent on this list, only to understand what it actually does, and I am 
pretty certain this is a good real-world showcase.)

But yes, I agree, the thing does not matter _all_ that much in reality.

The case where I expect a real improvement is when you modify a function 
and insert a function just before it, and Myers' algorithm matches mainly 
empty lines and lines ending in curly brackets.

In other words: something I tried to tackle with ee95ec5d(xdl_merge(): 
introduce XDL_MERGE_ZEALOUS_ALNUM) for merges.

The typical look of such a diff is something like this:

-<... some function header ...>
+<... a completely different function header ...>
 {
-	<... variables ...>
+	<... other variables ...>

 	for (i = 0; i < 10; i++) {
-		<... some code ...>
+		<... some code doing something completely different ...>
	}

 	return 0;
 }

+<... the function header which was removed earlier ...>
+{
 	<... refactored _and also reindented_ code ...>

> And I haven't seen _any_ real critical analysis of it. Anywhere.

Neither have I.  Let alone something close to documentation.

For example, when the "patience diff algorithm" is explained, it looks 
more like a longest common sequence algorithm when the input is already 
sorted in the first item.

Further, there is no rigorous analysis of the runtime (I figured that the 
original runtime is O(nm) where "n" is the number of lines and "m" is the 
length of the maximal ordered sequence of common unique lines, and my 
implementation can only improve that to O(n log(m))).

This could be improved, I think, for the most common case where you have 
pretty long common _continuous_ sequences of unique lines, i.e. large 
ranges of lines that are identical.

The runtime is especially horrible in the light of the runtime of Myers' 
algorithm, which uses O(n d), where d is the edit distance, i.e. the 
number of lines added + number of lines removed.  (Note: in the real 
world, there are substantial speed ups for consecutive edits, i.e. line 
ranges where there are no common lines at all.)

Also, I am less than thrilled by the job the fanbois did coming up with 
"convincing" evidence: exactly as you pointed out, there are _no_ 
real-world examples where it helps.

And the worst part: one can only _guess_ what motivated patience diff.  I 
imagine it came from the observation that function headers are unique, and 
that you usually want to preserve as much context around them.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-02 18:17               ` Johannes Schindelin
@ 2009-01-02 18:49                 ` Linus Torvalds
  2009-01-02 19:07                   ` Johannes Schindelin
  2009-01-02 18:51                 ` Jeff King
  1 sibling, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2009-01-02 18:49 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Pierre Habouzit, davidel, Francis Galiegue, Git ML



On Fri, 2 Jan 2009, Johannes Schindelin wrote:
> 
> BTW the "-p" is not necessary with "show", indeed, you cannot even switch 
> it off.

I was just switching back-and-forth between "git log" and "git show" so 
the -p came from just that, and is not necessary.

And you _can_ suppress the patch generation - use "-s".

			Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-02  1:56               ` Linus Torvalds
  2009-01-02 10:55                 ` Clemens Buchacher
@ 2009-01-02 18:50                 ` Adeodato Simó
  1 sibling, 0 replies; 67+ messages in thread
From: Adeodato Simó @ 2009-01-02 18:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Schindelin, Pierre Habouzit, davidel, Francis Galiegue,
	Git ML

* Linus Torvalds [Thu, 01 Jan 2009 17:56:13 -0800]:

> On Thu, 1 Jan 2009, Adeodato Simó wrote:

> > For me, the cases where I find patience output to be of substantial
> > higher readability are those involving a rewrite of several consecutive
> > paragraphs (i.e., lines of code separated by blank lines). Compare:

> I don't think that's a "patience diff" issue.

Ah, I see.

> That's simply an issue of merging consecutive diff fragments together if 
> they are close-by, and that's independent of the actual diff algorithm 
> itself.

> > I'll note that in this particular case, `git diff` yielded the very same
> > results with or without --patience. I don't know why that is, Johannes?
> > I'll also note that /usr/bin/diff produces (in this case) something
> > closer to patience than to git.

> See above - I really don't think this has anything to do with "patience vs 
> non-patience". It's more akin to the things we do for our merge conflict 
> markers: if we have two merge conflicts next to each other, with just a 
> couple of lines in between, we coalesce the merge conflicts into one 
> larger one instead.

> We don't do that for regular diffs - they're always kept minimal (ok, not 
> really minimal, but as close to minimal as the algorithm finds them).

> See commit f407f14deaa14ebddd0d27238523ced8eca74393 for the git merge 
> conflict merging. We _could_ do similar things for regular diffs. It's 
> sometimes useful, sometimes not.

Independently of patience diff, then, I'd very much support changes to
improve the diff I pasted.

-- 
Adeodato Simó                                     dato at net.com.org.es
Debian Developer                                  adeodato at debian.org
 
Debugging is twice as hard as writing the code in the first place. Therefore,
if you write the code as cleverly as possible, you are, by definition, not
smart enough to debug it.
                -- Brian W. Kernighan

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-02 18:17               ` Johannes Schindelin
  2009-01-02 18:49                 ` Linus Torvalds
@ 2009-01-02 18:51                 ` Jeff King
  1 sibling, 0 replies; 67+ messages in thread
From: Jeff King @ 2009-01-02 18:51 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git

On Fri, Jan 02, 2009 at 07:17:34PM +0100, Johannes Schindelin wrote:

> BTW the "-p" is not necessary with "show", indeed, you cannot even switch 
> it off.

Half true:

  $ git grep -A1 '"-s"' diff.c
  diff.c: else if (!strcmp(arg, "-s"))
  diff.c-         options->output_format |= DIFF_FORMAT_NO_OUTPUT;

-Peff

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-02 18:46                       ` Johannes Schindelin
@ 2009-01-02 19:03                         ` Linus Torvalds
  2009-01-02 19:22                           ` Johannes Schindelin
  2009-01-02 19:39                           ` Jeff King
  0 siblings, 2 replies; 67+ messages in thread
From: Linus Torvalds @ 2009-01-02 19:03 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Clemens Buchacher, Adeodato Simó, Pierre Habouzit, davidel,
	Francis Galiegue, Git ML

On Fri, 2 Jan 2009, Johannes Schindelin wrote:
> 
> FWIW it's the test case in the commit introducing the --patience option.

Well, it's also the test-case in the very first hit on google for 
"patience diff" (with the quotes).

In fact, it's the _only_ one I ever found ;)

> And the worst part: one can only _guess_ what motivated patience diff.  I 
> imagine it came from the observation that function headers are unique, and 
> that you usually want to preserve as much context around them.

Well, I do like the notion of giving more weight to unique lines - I think 
it makes sense. That said, I suspect it would make almost as much sense to 
give more weight simply to _longer_ lines, and I suspect the standard 
Myers' algorithm could possibly be simply extended to take line size into 
account when calculating the weights.

Because the problem with diffs for C doesn't really tend to be as much 
about non-unique lines as about just _trivial_ lines that are mostly empty 
or contain just braces etc. Those are quite arguably almost totally 
worthless for equality testing.

And btw, don't get me wrong - I don't think there is anything wrong with 
the patience diff. I think it's a worthy thing to try, and I'm not at all 
arguing against it. However, I do think that the people arguing for it 
often do so based on less-than-very-logical arguments, and it's entirely 
possible that other approaches are better (eg the "weight by size" thing 
rather than "weight by uniqueness").

The thing about unique lines is that there are no guarantees at all that 
they even exist, so a uniqueness-based thing will always have to fall back 
on anything else. That, to me, implies that the whole notion is somewhat 
mis-designed: it's clearly not a generic concept.

(In contrast, taking the length of the matching lines into account would 
not have that kind of bad special case)

			Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-02 18:49                 ` Linus Torvalds
@ 2009-01-02 19:07                   ` Johannes Schindelin
  0 siblings, 0 replies; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-02 19:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Pierre Habouzit, davidel, Francis Galiegue, Git ML

Hi,

On Fri, 2 Jan 2009, Linus Torvalds wrote:

> On Fri, 2 Jan 2009, Johannes Schindelin wrote:
> > 
> > BTW the "-p" is not necessary with "show", indeed, you cannot even 
> > switch it off.
> 
> I was just switching back-and-forth between "git log" and "git show" so 
> the -p came from just that, and is not necessary.
> 
> And you _can_ suppress the patch generation - use "-s".

Ah, another thing learnt.

Thanks,
Dscho

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-02 19:03                         ` Linus Torvalds
@ 2009-01-02 19:22                           ` Johannes Schindelin
  2009-01-02 19:39                           ` Jeff King
  1 sibling, 0 replies; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-02 19:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Clemens Buchacher, Adeodato Simó, Pierre Habouzit, davidel,
	Francis Galiegue, Git ML

Hi,

On Fri, 2 Jan 2009, Linus Torvalds wrote:

> On Fri, 2 Jan 2009, Johannes Schindelin wrote:
> 
> > And the worst part: one can only _guess_ what motivated patience diff.  
> > I imagine it came from the observation that function headers are 
> > unique, and that you usually want to preserve as much context around 
> > them.
> 
> Well, I do like the notion of giving more weight to unique lines - I think 
> it makes sense. That said, I suspect it would make almost as much sense to 
> give more weight simply to _longer_ lines, and I suspect the standard 
> Myers' algorithm could possibly be simply extended to take line size into 
> account when calculating the weights.

I think that it makes more sense with the common unique lines, because 
they are _unambiguously_ common.  That is, if possible, we would like to 
keep them as common lines.

BTW this also opens the door for more automatic code movement detection.

As for the longer lines: what exactly would you want to put "weight" on?  
The edit distance is the number of plusses and minusses, and this is the 
thing that actually is pretty critical for the performance: the larger the 
distance, the longer it takes.  So if you want to put a different "weight" 
on a line, i.e. something else than a "1", you are risking a substantially 
worse performance.

And I am still not convinced that longer lines should get more weight. A 
line starting with "exit:" can be much more important than 8 tabs followed 
by a curly bracket.

> Because the problem with diffs for C doesn't really tend to be as much 
> about non-unique lines as about just _trivial_ lines that are mostly empty 
> or contain just braces etc. Those are quite arguably almost totally 
> worthless for equality testing.

Oh, but then we get very C specific.  Let's not go there.

> And btw, don't get me wrong - I don't think there is anything wrong with 
> the patience diff. I think it's a worthy thing to try, and I'm not at 
> all arguing against it.

I never took you to be opposed.  I myself mainly implemented it because I 
wanted to play with it, and have something more useful than what I found 
in the whole wide web.

> However, I do think that the people arguing for it often do so based on 
> less-than-very-logical arguments, and it's entirely possible that other 
> approaches are better (eg the "weight by size" thing rather than "weight 
> by uniqueness").

Actually, I think it is very possible that merging hunks if there are not 
enough alnums between them could make a whole lot of sense.

But IIRC it was shot down and replaced by the not-so-specific-to-C 
algorithm to merge hunks when the output is shorter than having separate 
hunks.

> The thing about unique lines is that there are no guarantees at all that 
> they even exist, so a uniqueness-based thing will always have to fall 
> back on anything else. That, to me, implies that the whole notion is 
> somewhat mis-designed: it's clearly not a generic concept.
> 
> (In contrast, taking the length of the matching lines into account would 
> not have that kind of bad special case)

However, we know that humans like to start from the unique features they 
see, and continue from there.

And while it is quite possible that it is the wrong approach (see all that 
security theater at the airport, and some nice rational analyses about the 
cost/benefit equations there), imitating humans is still the thing that 
will convince most humans.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-02 19:03                         ` Linus Torvalds
  2009-01-02 19:22                           ` Johannes Schindelin
@ 2009-01-02 19:39                           ` Jeff King
  2009-01-02 19:50                             ` Jeff King
  2009-01-03 16:24                             ` Bazaar's patience diff as GIT_EXTERNAL_DIFF Adeodato Simó
  1 sibling, 2 replies; 67+ messages in thread
From: Jeff King @ 2009-01-02 19:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Schindelin, Clemens Buchacher, Adeodato Simó,
	Pierre Habouzit, davidel, Francis Galiegue, Git ML

On Fri, Jan 02, 2009 at 11:03:07AM -0800, Linus Torvalds wrote:

> Well, it's also the test-case in the very first hit on google for 
> "patience diff" (with the quotes).
> 
> In fact, it's the _only_ one I ever found ;)

If you just want to see the results on some real-world cases (and don't
care about measuring performance), try installing bzr and using their
patiencediff test program as a GIT_EXTERNAL_DIFF.

On Debian, it's:

  $ sudo apt-get install bzr
  $ cat >$HOME/patience <<'EOF'
    #!/bin/sh
    exec python /usr/share/pyshared/bzrlib/patiencediff.py "$2" "$5"
    EOF
  $ chmod 755 patience
  $ GIT_EXTERNAL_DIFF=$HOME/patience git diff

Other distributions presumably install patiencediff.py somewhere
similar (or you can maybe even pull it from the source, but presumably
you have to install some other bzr modules, too).

-Peff

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-02 19:39                           ` Jeff King
@ 2009-01-02 19:50                             ` Jeff King
  2009-01-02 20:52                               ` Jeff King
  2009-01-03 16:24                             ` Bazaar's patience diff as GIT_EXTERNAL_DIFF Adeodato Simó
  1 sibling, 1 reply; 67+ messages in thread
From: Jeff King @ 2009-01-02 19:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Schindelin, Clemens Buchacher, Adeodato Simó,
	Pierre Habouzit, davidel, Francis Galiegue, Git ML

On Fri, Jan 02, 2009 at 02:39:04PM -0500, Jeff King wrote:

> If you just want to see the results on some real-world cases (and don't
> care about measuring performance), try installing bzr and using their
> patiencediff test program as a GIT_EXTERNAL_DIFF.
> 
> On Debian, it's:
> 
>   $ sudo apt-get install bzr
>   $ cat >$HOME/patience <<'EOF'
>     #!/bin/sh
>     exec python /usr/share/pyshared/bzrlib/patiencediff.py "$2" "$5"
>     EOF
>   $ chmod 755 patience
>   $ GIT_EXTERNAL_DIFF=$HOME/patience git diff

For added fun, try this:

-- >8 --
#!/bin/sh

canonical() {
  grep '^[ +-]' | egrep -v '^(---|\+\+\+)'
}

git rev-list --no-merges HEAD | while read rev; do
  git diff-tree -p $rev^ $rev | canonical >git.out
  GIT_EXTERNAL_DIFF=$HOME/patience git diff $rev^ $rev | canonical >bzr.out
  echo "`diff -U0 git.out bzr.out | wc -l` $rev"
done
-- 8< --

I'm running it on git.git now. It looks like both algorithms return the
same patch for most cases. Some of the differences are interesting,
though.

For example, f83b9ba209's commit message indicates that it moves the
"--format-patch" paragraph. Which is what "git diff" shows. Patience
diff shows it as moving other text _around_ that paragraph.

-Peff

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-02 19:50                             ` Jeff King
@ 2009-01-02 20:52                               ` Jeff King
  2009-01-02 23:05                                 ` Linus Torvalds
  0 siblings, 1 reply; 67+ messages in thread
From: Jeff King @ 2009-01-02 20:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Schindelin, Clemens Buchacher, Adeodato Simó,
	Pierre Habouzit, davidel, Francis Galiegue, Git ML

On Fri, Jan 02, 2009 at 02:50:53PM -0500, Jeff King wrote:

> For example, f83b9ba209's commit message indicates that it moves the
> "--format-patch" paragraph. Which is what "git diff" shows. Patience
> diff shows it as moving other text _around_ that paragraph.

Here's another interesting one: d592b315. The commit removes dashes
from git commands in test scripts. Git says:

        echo "tag-one-line" >expect &&
-       git-tag -l | grep "^tag-one-line" >actual &&
+       git tag -l | grep "^tag-one-line" >actual &&
        test_cmp expect actual &&
-       git-tag -n0 -l | grep "^tag-one-line" >actual &&
+       git tag -n0 -l | grep "^tag-one-line" >actual &&
        test_cmp expect actual &&
-       git-tag -n0 -l tag-one-line >actual &&
+       git tag -n0 -l tag-one-line >actual &&
        test_cmp expect actual &&

whereas patience says:

        echo "tag-one-line" >expect &&
-       git-tag -l | grep "^tag-one-line" >actual &&
-       test_cmp expect actual &&
-       git-tag -n0 -l | grep "^tag-one-line" >actual &&
-       test_cmp expect actual &&
-       git-tag -n0 -l tag-one-line >actual &&
+       git tag -l | grep "^tag-one-line" >actual &&
+       test_cmp expect actual &&
+       git tag -n0 -l | grep "^tag-one-line" >actual &&
+       test_cmp expect actual &&
+       git tag -n0 -l tag-one-line >actual &&
        test_cmp expect actual &&

which is exactly what patience is advertised to do: it's treating the
non-unique lines as uninteresting markers. But in this case they _are_
interesting, and I think the git output is more readable. And this is a
case where your "weight lines by length instead of uniqueness"
suggestion would perform better, I think.

-Peff

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-02 16:42                     ` Linus Torvalds
  2009-01-02 18:46                       ` Johannes Schindelin
@ 2009-01-02 21:59                       ` Johannes Schindelin
  2009-01-08 19:55                       ` Adeodato Simó
  2 siblings, 0 replies; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-02 21:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Clemens Buchacher, Adeodato Simó, Pierre Habouzit, davidel,
	Francis Galiegue, Git ML

Hi,

On Fri, 2 Jan 2009, Linus Torvalds wrote:

> So I was hoping for something else than a single "in this case patience 
> diff works really well". I was hoping to see what it does in real life. 

Funnily, I think the test case you sent me is a pretty good example.  Look 
at this hunk (without patience diff):

@@ -4205,25 +4205,25 @@ out:
  */
 static int nfs4_xdr_dec_setattr(struct rpc_rqst *rqstp, __be32 *p, struct nfs_se
 {
-        struct xdr_stream xdr;
-        struct compound_hdr hdr;
-        int status;
-
-        xdr_init_decode(&xdr, &rqstp->rq_rcv_buf, p);
-        status = decode_compound_hdr(&xdr, &hdr);
-        if (status)
-                goto out;
-        status = decode_putfh(&xdr);
-        if (status)
-                goto out;
-        status = decode_setattr(&xdr, res);
-        if (status)
-                goto out;
+       struct xdr_stream xdr;
+       struct compound_hdr hdr;
+       int status;
+
+       xdr_init_decode(&xdr, &rqstp->rq_rcv_buf, p);

... and then it goes on with the whole reindented function.  Compare this 
to the same hunk _with_ patience diff:

@@ -4205,25 +4205,25 @@ out:
  */
 static int nfs4_xdr_dec_setattr(struct rpc_rqst *rqstp, __be32 *p, struct nfs_se
 {
-        struct xdr_stream xdr;
-        struct compound_hdr hdr;
-        int status;
+       struct xdr_stream xdr;
+       struct compound_hdr hdr;
+       int status;

-        xdr_init_decode(&xdr, &rqstp->rq_rcv_buf, p);
-        status = decode_compound_hdr(&xdr, &hdr);
-        if (status)
-                goto out;
-        status = decode_putfh(&xdr);
-        if (status)
-                goto out;
-        status = decode_setattr(&xdr, res);
-        if (status)
-                goto out;
+       xdr_init_decode(&xdr, &rqstp->rq_rcv_buf, p);
+       status = decode_compound_hdr(&xdr, &hdr);

... and again the rest is reindented code.

The difference?  The common empty line.  I actually find it more readable 
to have the separation between the declarations and the code also in the 
diff.

This is just a very feeble example, but you get the idea from there.

Oh, you might object that the empty line is not unique.  But actually it 
is, because the patience diff recurses into ever smaller line ranges until 
it finally comes to such a small range that the empty line _is_ unique.

And in my analysis of the complexity, I stupidly left out that recursion 
part.  So: patience diff is _substantially_ more expensive than Myers'.

> But when I tried it on the kernel archive, I get a core dump.

I also got this trying on the git.git repository, with commit 
be3cfa85([PATCH] Diff-tree-helper take two.)  Funnily, I almost got there 
trying the same before sending the first revision, but I got impatient and 
stopped early.  Tsk, tsk.

Now I tried with my complete clone of git.git together with all of my 
topic branches, and it runs through without segmentation fault.  Patch 
follows.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 1/3 v2] Implement the patience diff algorithm
  2009-01-01 20:00             ` Linus Torvalds
  2009-01-02 18:17               ` Johannes Schindelin
@ 2009-01-02 21:59               ` Johannes Schindelin
  2009-01-02 21:59                 ` Johannes Schindelin
  1 sibling, 1 reply; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-02 21:59 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Pierre Habouzit, davidel, Francis Galiegue, Git ML


The patience diff algorithm produces slightly more intuitive output
than the classic Myers algorithm, as it does not try to minimize the
number of +/- lines first, but tries to preserve the lines that are
unique.

To this end, it first determines lines that are unique in both files,
then the maximal sequence which preserves the order (relative to both
files) is extracted.

Starting from this initial set of common lines, the rest of the lines
is handled recursively, with Myers' algorithm as a fallback when
the patience algorithm fails (due to no common unique lines).

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---

	I did not realize that xdl_prepare_env() initializes the arrays in
	rchg (which tell which lines are not common).  Unfortunately, there
	are ambiguities, e.g. with empty lines, and my implementation wanted
	to take other common lines, disagreeing with the previous
	initialization.

	Interdiff follows.

 xdiff/xdiff.h     |    1 +
 xdiff/xdiffi.c    |    3 +
 xdiff/xdiffi.h    |    2 +
 xdiff/xpatience.c |  385 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 391 insertions(+), 0 deletions(-)
 create mode 100644 xdiff/xpatience.c

diff --git a/xdiff/xdiff.h b/xdiff/xdiff.h
index 361f802..4da052a 100644
--- a/xdiff/xdiff.h
+++ b/xdiff/xdiff.h
@@ -32,6 +32,7 @@ extern "C" {
 #define XDF_IGNORE_WHITESPACE (1 << 2)
 #define XDF_IGNORE_WHITESPACE_CHANGE (1 << 3)
 #define XDF_IGNORE_WHITESPACE_AT_EOL (1 << 4)
+#define XDF_PATIENCE_DIFF (1 << 5)
 #define XDF_WHITESPACE_FLAGS (XDF_IGNORE_WHITESPACE | XDF_IGNORE_WHITESPACE_CHANGE | XDF_IGNORE_WHITESPACE_AT_EOL)
 
 #define XDL_PATCH_NORMAL '-'
diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 9d0324a..3e97462 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -329,6 +329,9 @@ int xdl_do_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
 	xdalgoenv_t xenv;
 	diffdata_t dd1, dd2;
 
+	if (xpp->flags & XDF_PATIENCE_DIFF)
+		return xdl_do_patience_diff(mf1, mf2, xpp, xe);
+
 	if (xdl_prepare_env(mf1, mf2, xpp, xe) < 0) {
 
 		return -1;
diff --git a/xdiff/xdiffi.h b/xdiff/xdiffi.h
index 3e099dc..ad033a8 100644
--- a/xdiff/xdiffi.h
+++ b/xdiff/xdiffi.h
@@ -55,5 +55,7 @@ int xdl_build_script(xdfenv_t *xe, xdchange_t **xscr);
 void xdl_free_script(xdchange_t *xscr);
 int xdl_emit_diff(xdfenv_t *xe, xdchange_t *xscr, xdemitcb_t *ecb,
 		  xdemitconf_t const *xecfg);
+int xdl_do_patience_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
+		xdfenv_t *env);
 
 #endif /* #if !defined(XDIFFI_H) */
diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
new file mode 100644
index 0000000..d01cbdd
--- /dev/null
+++ b/xdiff/xpatience.c
@@ -0,0 +1,385 @@
+/*
+ *  LibXDiff by Davide Libenzi ( File Differential Library )
+ *  Copyright (C) 2003-2009 Davide Libenzi, Johannes E. Schindelin
+ *
+ *  This library is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU Lesser General Public
+ *  License as published by the Free Software Foundation; either
+ *  version 2.1 of the License, or (at your option) any later version.
+ *
+ *  This library is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ *  Lesser General Public License for more details.
+ *
+ *  You should have received a copy of the GNU Lesser General Public
+ *  License along with this library; if not, write to the Free Software
+ *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ *  Davide Libenzi <davidel@xmailserver.org>
+ *
+ */
+#include "xinclude.h"
+#include "xtypes.h"
+#include "xdiff.h"
+
+/*
+ * The basic idea of patience diff is to find lines that are unique in
+ * both files.  These are intuitively the ones that we want to see as
+ * common lines.
+ *
+ * The maximal ordered sequence of such line pairs (where ordered means
+ * that the order in the sequence agrees with the order of the lines in
+ * both files) naturally defines an initial set of common lines.
+ *
+ * Now, the algorithm tries to extend the set of common lines by growing
+ * the line ranges where the files have identical lines.
+ *
+ * Between those common lines, the patience diff algorithm is applied
+ * recursively, until no unique line pairs can be found; these line ranges
+ * are handled by the well-known Myers algorithm.
+ */
+
+#define NON_UNIQUE ULONG_MAX
+
+/*
+ * This is a hash mapping from line hash to line numbers in the first and
+ * second file.
+ */
+struct hashmap {
+	int nr, alloc;
+	struct entry {
+		unsigned long hash;
+		/*
+		 * 0 = unused entry, 1 = first line, 2 = second, etc.
+		 * line2 is NON_UNIQUE if the line is not unique
+		 * in either the first or the second file.
+		 */
+		unsigned long line1, line2;
+		/*
+		 * "next" & "previous" are used for the longest common
+		 * sequence;
+		 * initially, "next" reflects only the order in file1.
+		 */
+		struct entry *next, *previous;
+	} *entries, *first, *last;
+	/* were common records found? */
+	unsigned long has_matches;
+	mmfile_t *file1, *file2;
+	xdfenv_t *env;
+	xpparam_t const *xpp;
+};
+
+/* The argument "pass" is 1 for the first file, 2 for the second. */
+static void insert_record(int line, struct hashmap *map, int pass)
+{
+	xrecord_t **records = pass == 1 ?
+		map->env->xdf1.recs : map->env->xdf2.recs;
+	xrecord_t *record = records[line - 1], *other;
+	/*
+	 * After xdl_prepare_env() (or more precisely, due to
+	 * xdl_classify_record()), the "ha" member of the records (AKA lines)
+	 * is _not_ the hash anymore, but a linearized version of it.  In
+	 * other words, the "ha" member is guaranteed to start with 0 and
+	 * the second record's ha can only be 0 or 1, etc.
+	 *
+	 * So we multiply ha by 2 in the hope that the hashing was
+	 * "unique enough".
+	 */
+	int index = (int)((record->ha << 1) % map->alloc);
+
+	while (map->entries[index].line1) {
+		other = map->env->xdf1.recs[map->entries[index].line1 - 1];
+		if (map->entries[index].hash != record->ha ||
+				!xdl_recmatch(record->ptr, record->size,
+					other->ptr, other->size,
+					map->xpp->flags)) {
+			if (++index >= map->alloc)
+				index = 0;
+			continue;
+		}
+		if (pass == 2)
+			map->has_matches = 1;
+		if (pass == 1 || map->entries[index].line2)
+			map->entries[index].line2 = NON_UNIQUE;
+		else
+			map->entries[index].line2 = line;
+		return;
+	}
+	if (pass == 2)
+		return;
+	map->entries[index].line1 = line;
+	map->entries[index].hash = record->ha;
+	if (!map->first)
+		map->first = map->entries + index;
+	if (map->last) {
+		map->last->next = map->entries + index;
+		map->entries[index].previous = map->last;
+	}
+	map->last = map->entries + index;
+	map->nr++;
+}
+
+/*
+ * This function has to be called for each recursion into the inter-hunk
+ * parts, as previously non-unique lines can become unique when being
+ * restricted to a smaller part of the files.
+ *
+ * It is assumed that env has been prepared using xdl_prepare().
+ */
+static int fill_hashmap(mmfile_t *file1, mmfile_t *file2,
+		xpparam_t const *xpp, xdfenv_t *env,
+		struct hashmap *result,
+		int line1, int count1, int line2, int count2)
+{
+	result->file1 = file1;
+	result->file2 = file2;
+	result->xpp = xpp;
+	result->env = env;
+
+	/* We know exactly how large we want the hash map */
+	result->alloc = count1 * 2;
+	result->entries = (struct entry *)
+		xdl_malloc(result->alloc * sizeof(struct entry));
+	if (!result->entries)
+		return -1;
+	memset(result->entries, 0, result->alloc * sizeof(struct entry));
+
+	/* First, fill with entries from the first file */
+	while (count1--)
+		insert_record(line1++, result, 1);
+
+	/* Then search for matches in the second file */
+	while (count2--)
+		insert_record(line2++, result, 2);
+
+	return 0;
+}
+
+/*
+ * Find the longest sequence with a smaller last element (meaning a smaller
+ * line2, as we construct the sequence with entries ordered by line1).
+ */
+static int binary_search(struct entry **sequence, int longest,
+		struct entry *entry)
+{
+	int left = -1, right = longest;
+
+	while (left + 1 < right) {
+		int middle = (left + right) / 2;
+		/* by construction, no two entries can be equal */
+		if (sequence[middle]->line2 > entry->line2)
+			right = middle;
+		else
+			left = middle;
+	}
+	/* return the index in "sequence", _not_ the sequence length */
+	return left;
+}
+
+/*
+ * The idea is to start with the list of common unique lines sorted by
+ * the order in file1.  For each of these pairs, the longest (partial)
+ * sequence whose last element's line2 is smaller is determined.
+ *
+ * For efficiency, the sequences are kept in a list containing exactly one
+ * item per sequence length: the sequence with the smallest last
+ * element (in terms of line2).
+ */
+static struct entry *find_longest_common_sequence(struct hashmap *map)
+{
+	struct entry **sequence = xdl_malloc(map->nr * sizeof(struct entry *));
+	int longest = 0, i;
+	struct entry *entry;
+
+	for (entry = map->first; entry; entry = entry->next) {
+		if (!entry->line2 || entry->line2 == NON_UNIQUE)
+			continue;
+		i = binary_search(sequence, longest, entry);
+		entry->previous = i < 0 ? NULL : sequence[i];
+		sequence[++i] = entry;
+		if (i == longest)
+			longest++;
+	}
+
+	/* No common unique lines were found */
+	if (!longest)
+		return NULL;
+
+	/* Iterate starting at the last element, adjusting the "next" members */
+	entry = sequence[longest - 1];
+	entry->next = NULL;
+	while (entry->previous) {
+		entry->previous->next = entry;
+		entry = entry->previous;
+	}
+	return entry;
+}
+
+static int match(struct hashmap *map, int line1, int line2)
+{
+	xrecord_t *record1 = map->env->xdf1.recs[line1 - 1];
+	xrecord_t *record2 = map->env->xdf2.recs[line2 - 1];
+	return xdl_recmatch(record1->ptr, record1->size,
+		record2->ptr, record2->size, map->xpp->flags);
+}
+
+static int patience_diff(mmfile_t *file1, mmfile_t *file2,
+		xpparam_t const *xpp, xdfenv_t *env,
+		int line1, int count1, int line2, int count2);
+
+static int walk_common_sequence(struct hashmap *map, struct entry *first,
+		int line1, int count1, int line2, int count2)
+{
+	int end1 = line1 + count1, end2 = line2 + count2;
+	int next1, next2;
+
+	for (;;) {
+		/* Try to grow the line ranges of common lines */
+		if (first) {
+			next1 = first->line1;
+			next2 = first->line2;
+			while (next1 > line1 && next2 > line2 &&
+					match(map, next1 - 1, next2 - 1)) {
+				next1--;
+				next2--;
+			}
+		} else {
+			next1 = end1;
+			next2 = end2;
+		}
+		while (line1 < next1 && line2 < next2 &&
+				match(map, line1, line2)) {
+			line1++;
+			line2++;
+		}
+
+		/* Recurse */
+		if (next1 > line1 || next2 > line2) {
+			struct hashmap submap;
+
+			memset(&submap, 0, sizeof(submap));
+			if (patience_diff(map->file1, map->file2,
+					map->xpp, map->env,
+					line1, next1 - line1,
+					line2, next2 - line2))
+				return -1;
+		}
+
+		if (!first)
+			return 0;
+
+		while (first->next &&
+				first->next->line1 == first->line1 + 1 &&
+				first->next->line2 == first->line2 + 1)
+			first = first->next;
+
+		line1 = first->line1 + 1;
+		line2 = first->line2 + 1;
+
+		first = first->next;
+	}
+}
+
+static int fall_back_to_classic_diff(struct hashmap *map,
+		int line1, int count1, int line2, int count2)
+{
+	/*
+	 * This probably does not work outside Git, since
+	 * we have a very simple mmfile structure.
+	 *
+	 * Note: ideally, we would reuse the prepared environment, but
+	 * the libxdiff interface does not (yet) allow for diffing only
+	 * ranges of lines instead of the whole files.
+	 */
+	mmfile_t subfile1, subfile2;
+	xpparam_t xpp;
+	xdfenv_t env;
+
+	subfile1.ptr = (char *)map->env->xdf1.recs[line1 - 1]->ptr;
+	subfile1.size = map->env->xdf1.recs[line1 + count1 - 2]->ptr +
+		map->env->xdf1.recs[line1 + count1 - 2]->size - subfile1.ptr;
+	subfile2.ptr = (char *)map->env->xdf2.recs[line2 - 1]->ptr;
+	subfile2.size = map->env->xdf2.recs[line2 + count2 - 2]->ptr +
+		map->env->xdf2.recs[line2 + count2 - 2]->size - subfile2.ptr;
+	xpp.flags = map->xpp->flags & ~XDF_PATIENCE_DIFF;
+	if (xdl_do_diff(&subfile1, &subfile2, &xpp, &env) < 0)
+		return -1;
+
+	memcpy(map->env->xdf1.rchg + line1 - 1, env.xdf1.rchg, count1);
+	memcpy(map->env->xdf2.rchg + line2 - 1, env.xdf2.rchg, count2);
+
+	xdl_free_env(&env);
+
+	return 0;
+}
+
+/*
+ * Recursively find the longest common sequence of unique lines,
+ * and if none was found, ask xdl_do_diff() to do the job.
+ *
+ * This function assumes that env was prepared with xdl_prepare_env().
+ */
+static int patience_diff(mmfile_t *file1, mmfile_t *file2,
+		xpparam_t const *xpp, xdfenv_t *env,
+		int line1, int count1, int line2, int count2)
+{
+	struct hashmap map;
+	struct entry *first;
+	int result = 0;
+
+	/* trivial case: one side is empty */
+	if (!count1) {
+		while(count2--)
+			env->xdf2.rchg[line2++ - 1] = 1;
+		return 0;
+	} else if (!count2) {
+		while(count1--)
+			env->xdf1.rchg[line1++ - 1] = 1;
+		return 0;
+	}
+
+	memset(&map, 0, sizeof(map));
+	if (fill_hashmap(file1, file2, xpp, env, &map,
+			line1, count1, line2, count2))
+		return -1;
+
+	/* are there any matching lines at all? */
+	if (!map.has_matches) {
+		while(count1--)
+			env->xdf1.rchg[line1++ - 1] = 1;
+		while(count2--)
+			env->xdf2.rchg[line2++ - 1] = 1;
+		return 0;
+	}
+
+	first = find_longest_common_sequence(&map);
+	if (first)
+		result = walk_common_sequence(&map, first,
+			line1, count1, line2, count2);
+	else
+		result = fall_back_to_classic_diff(&map,
+			line1, count1, line2, count2);
+
+	return result;
+}
+
+int xdl_do_patience_diff(mmfile_t *file1, mmfile_t *file2,
+		xpparam_t const *xpp, xdfenv_t *env)
+{
+	if (xdl_prepare_env(file1, file2, xpp, env) < 0)
+		return -1;
+
+	/*
+	 * It is a pity that xdl_cleanup_records() not only marks those
+	 * lines as changes that are only present in one file, but also
+	 * lines that have multiple matches and happen to be in a "run
+	 * of discardable lines" that patience diff happens to split
+	 * differently.
+	 */
+	memset(env->xdf1.rchg, 0, env->xdf1.nrec);
+	memset(env->xdf2.rchg, 0, env->xdf2.nrec);
+	/* environment is cleaned up in xdl_diff() */
+	return patience_diff(file1, file2, xpp, env,
+			1, env->xdf1.nrec, 1, env->xdf2.nrec);
+}
-- 
1.6.1.rc3.224.g95ac9

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH 1/3 v2] Implement the patience diff algorithm
  2009-01-02 21:59               ` [PATCH 1/3 v2] Implement " Johannes Schindelin
@ 2009-01-02 21:59                 ` Johannes Schindelin
  0 siblings, 0 replies; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-02 21:59 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Pierre Habouzit, davidel, Francis Galiegue, Git ML

Hi,

The interdiff between v1 and v2 of PATCH 1/3.  As you can see, I also 
added a cleanup of an intermediate xdlenv.

Ciao,
Dscho

 xdiff/xpatience.c |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
index 6687940..d01cbdd 100644
--- a/xdiff/xpatience.c
+++ b/xdiff/xpatience.c
@@ -309,6 +309,8 @@ static int fall_back_to_classic_diff(struct hashmap *map,
 	memcpy(map->env->xdf1.rchg + line1 - 1, env.xdf1.rchg, count1);
 	memcpy(map->env->xdf2.rchg + line2 - 1, env.xdf2.rchg, count2);
 
+	xdl_free_env(&env);
+
 	return 0;
 }
 
@@ -368,6 +370,15 @@ int xdl_do_patience_diff(mmfile_t *file1, mmfile_t *file2,
 	if (xdl_prepare_env(file1, file2, xpp, env) < 0)
 		return -1;
 
+	/*
+	 * It is a pity that xdl_cleanup_records() not only marks those
+	 * lines as changes that are only present in one file, but also
+	 * lines that have multiple matches and happen to be in a "run
+	 * of discardable lines" that patience diff happens to split
+	 * differently.
+	 */
+	memset(env->xdf1.rchg, 0, env->xdf1.nrec);
+	memset(env->xdf2.rchg, 0, env->xdf2.nrec);
 	/* environment is cleaned up in xdl_diff() */
 	return patience_diff(file1, file2, xpp, env,
 			1, env->xdf1.nrec, 1, env->xdf2.nrec);
-- 
1.6.1.rc3.224.g95ac9

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-02 20:52                               ` Jeff King
@ 2009-01-02 23:05                                 ` Linus Torvalds
  0 siblings, 0 replies; 67+ messages in thread
From: Linus Torvalds @ 2009-01-02 23:05 UTC (permalink / raw)
  To: Jeff King
  Cc: Johannes Schindelin, Clemens Buchacher, Adeodato Simó,
	Pierre Habouzit, davidel, Francis Galiegue, Git ML

On Fri, 2 Jan 2009, Jeff King wrote:
> 
> Here's another interesting one: d592b315. The commit removes dashes
> from git commands in test scripts. Git says:
> 
>         echo "tag-one-line" >expect &&
> -       git-tag -l | grep "^tag-one-line" >actual &&
> +       git tag -l | grep "^tag-one-line" >actual &&
>         test_cmp expect actual &&
> -       git-tag -n0 -l | grep "^tag-one-line" >actual &&
> +       git tag -n0 -l | grep "^tag-one-line" >actual &&
>         test_cmp expect actual &&
> -       git-tag -n0 -l tag-one-line >actual &&
> +       git tag -n0 -l tag-one-line >actual &&
>         test_cmp expect actual &&
> 
> whereas patience says:
> 
>         echo "tag-one-line" >expect &&
> -       git-tag -l | grep "^tag-one-line" >actual &&
> -       test_cmp expect actual &&
> -       git-tag -n0 -l | grep "^tag-one-line" >actual &&
> -       test_cmp expect actual &&
> -       git-tag -n0 -l tag-one-line >actual &&
> +       git tag -l | grep "^tag-one-line" >actual &&
> +       test_cmp expect actual &&
> +       git tag -n0 -l | grep "^tag-one-line" >actual &&
> +       test_cmp expect actual &&
> +       git tag -n0 -l tag-one-line >actual &&
>         test_cmp expect actual &&

Yeah, the bazaar version clearly is inferior here. But again, I don't 
think that's actually a patience diff issue, I think it's because the 
bazaar diff has merged consecutive diff lines too aggressively.

I suspect both patience and the straight Mayers diff (that git uses) 
actually finds exactly the same differences, and then bazaar has a "merge 
closeby -/ pairs together if there is just a single unmodified line in 
between them". 

And _that_ is where it should care whether the unmodified line is complex 
or not. If it's complex, you shouldn't merge it into the -/+ region.

(but yes, it's possible that the bazaar diff uses "uniqueness" as a 
complexity analysis marker, and while that is somewhat valid, it is _not_ 
valid enough to be useful. Unique lines tend to be complex, but complex 
lines are _not_ always unique)

		Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Bazaar's patience diff as GIT_EXTERNAL_DIFF
  2009-01-02 19:39                           ` Jeff King
  2009-01-02 19:50                             ` Jeff King
@ 2009-01-03 16:24                             ` Adeodato Simó
  1 sibling, 0 replies; 67+ messages in thread
From: Adeodato Simó @ 2009-01-03 16:24 UTC (permalink / raw)
  To: Jeff King; +Cc: Git ML

[-- Attachment #1: Type: text/plain, Size: 974 bytes --]

* Jeff King [Fri, 02 Jan 2009 14:39:04 -0500]:

> If you just want to see the results on some real-world cases (and don't
> care about measuring performance), try installing bzr and using their
> patiencediff test program as a GIT_EXTERNAL_DIFF.

> On Debian, it's:

>   $ sudo apt-get install bzr
>   $ cat >$HOME/patience <<'EOF'
>     #!/bin/sh
>     exec python /usr/share/pyshared/bzrlib/patiencediff.py "$2" "$5"
>     EOF
>   $ chmod 755 patience
>   $ GIT_EXTERNAL_DIFF=$HOME/patience git diff

In case somebody's interested, I have this script lying around that does
that, and knows how to colorize the output using bzrtools, and honoring
color.{diff,ui}. No support for color.diff.<slot>, though.

-- 
Adeodato Simó                                     dato at net.com.org.es
Debian Developer                                  adeodato at debian.org
 
- Oh my God, you're pimping me out for a new roof?
- And windows!
                -- Andrew and Bree Van De Kamp

[-- Attachment #2: git_bzr_patience_diff --]
[-- Type: text/plain, Size: 1457 bytes --]

#! /usr/bin/python

import os
import sys
import stat
import subprocess

from bzrlib.patiencediff import unified_diff, PatienceSequenceMatcher
try:
    from bzrlib.plugins.bzrtools.colordiff import DiffWriter
except ImportError:
    _have_colordiff = False
else:
    _have_colordiff = True

##

def main():
    path = sys.argv[1]
    file1 = open(sys.argv[2], 'rb')
    file2 = open(sys.argv[5], 'rb')

    if use_color():
        writer = DiffWriter(sys.stdout, check_style=True)
    else:
        writer = sys.stdout

    for line in unified_diff(
            file1.readlines(), file2.readlines(),
            path, path, sequencematcher=PatienceSequenceMatcher):
        writer.write(line)

##

def use_color():
    if not _have_colordiff:
        return False

    for c in ['color.diff', 'color.ui']:
        p = subprocess.Popen(
                ['git', 'config', '--get', c], stdout=subprocess.PIPE)
        if p.wait() == 0:
            when = p.stdout.readline().strip()
            break
    else:
        return False

    if when == 'always':
        return True
    elif when in ['false', 'never']:
        return False
    elif when in ['true', 'auto']:
        stdout = sys.stdout.fileno()
        return (os.isatty(stdout) or
                stat.S_ISFIFO(os.fstat(sys.stdout.fileno()).st_mode))
    else:
        return False

##

if __name__ == '__main__':
    sys.exit(main())

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2008-11-04  8:30   ` Pierre Habouzit
  2008-11-04 14:34     ` Johannes Schindelin
@ 2009-01-06 11:17     ` Pierre Habouzit
  2009-01-06 11:39       ` Pierre Habouzit
  2009-01-06 19:40       ` Johannes Schindelin
  1 sibling, 2 replies; 67+ messages in thread
From: Pierre Habouzit @ 2009-01-06 11:17 UTC (permalink / raw)
  To: Johannes Schindelin, Linus Torvalds; +Cc: davidel, Francis Galiegue, Git ML

[-- Attachment #1: Type: text/plain, Size: 3580 bytes --]

On jeu, jan 01, 2009 at 04:38:09 +0000, Johannes Schindelin wrote:
> 
> Nothing fancy, really, just a straight-forward implementation of the
> heavily under-documented and under-analyzed paience diff algorithm.

Btw, what is the status of this series ? I see it neither in pu nor in
next. And I would gladly see it included in git.

On jeu, jan 01, 2009 at 07:45:21 +0000, Linus Torvalds wrote:
> 
> 
> On Thu, 1 Jan 2009, Johannes Schindelin wrote:
> > 
> > Nothing fancy, really, just a straight-forward implementation of the
> > heavily under-documented and under-analyzed paience diff algorithm.
> 
> Exactly because the patience diff is so under-documented, could you 
> perhaps give a few examples of how it differs in the result, and why it's 
> so wonderful? Yes, yes, I can google, and no, no, nothing useful shows up 
> except for *totally* content-free fanboisms. 
> 
> So could we have some actual real data on it?

For example, look at the following (shamelessly stolen from the real
life example http://glandium.org/blog/?p=120).

With orig.mk reading:
  ] include $(DEPTH)/config/autoconf.mk
  ] 
  ] include $(topsrcdir)/config/rules.mk
  ] 
  ] libs::
  ]        $(INSTALL) $(srcdir)/nsKillAll.js $(DIST)/bin/components
  ] 
  ] clean::
  ]        rm -f $(DIST)/bin/components/nsKillAll.js

With patched.mk reading:
  ] include $(DEPTH)/config/autoconf.mk
  ] 
  ] EXTRA_COMPONENTS = nsKillAll.js
  ] 
  ] include $(topsrcdir)/config/rules.mk
  ] 
  ] clean::
  ]        rm -f $(DIST)/bin/components/nsKillAll.js

You get:

Normal git diff                                    | patience diff
---------------------------------------------------+------------------------------------------
 include $(DEPTH)/config/autoconf.mk               |  include $(DEPTH)/config/autoconf.mk
                                                   |
-include $(topsrcdir)/config/rules.mk              | +EXTRA_COMPONENTS = nsKillAll.js
+EXTRA_COMPONENTS = nsKillAll.js                   | +
                                                   |  include $(topsrcdir)/config/rules.mk
-libs::                                            |
-       $(INSTALL) $(srcdir)/...                   | -libs::
+include $(topsrcdir)/config/rules.mk              | -       $(INSTALL) $(srcdir)/...
                                                   | -
 clean::                                           |  clean::
        rm -f $(DIST)/bin/components/nsKillAll.js  |         rm -f $(DIST)/bin/components/nsKillAll.js



I've checked in many projects I have under git, the differences between
git log -p and git log -p --patience. The patience algorithm is really
really more readable with it involves code moves with changes in the
moved sections. If the section you move across is smaller than the moved
ones, the patience algorithm will show the moved code as removed where
it was and added where it now is, changes included. The current diff
will rather move the smaller invariend section you move across and
present mangled diffs involving the function prototypes making it less
than readable.

Of course, moving code _and_ modifying it at the same time is rather bad
style. BUt (1) it happens and the fact it's bad style is a bad argument
to not show a decent diff of it (2) if you construct diffs of a range of
commits it happens a lot.
-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org

[-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-06 11:17     ` Pierre Habouzit
@ 2009-01-06 11:39       ` Pierre Habouzit
  2009-01-06 19:40       ` Johannes Schindelin
  1 sibling, 0 replies; 67+ messages in thread
From: Pierre Habouzit @ 2009-01-06 11:39 UTC (permalink / raw)
  To: Johannes Schindelin, Linus Torvalds; +Cc: davidel, Francis Galiegue, Git ML


[-- Attachment #1.1: Type: text/plain, Size: 2784 bytes --]

On Tue, Jan 06, 2009 at 11:17:12AM +0000, Pierre Habouzit wrote:
> On jeu, jan 01, 2009 at 04:38:09 +0000, Johannes Schindelin wrote:
> > 
> > Nothing fancy, really, just a straight-forward implementation of the
> > heavily under-documented and under-analyzed paience diff algorithm.
> 
> Btw, what is the status of this series ? I see it neither in pu nor in
> next. And I would gladly see it included in git.

Johannes: I've not had time to investigate, but when finding what I
present in the end of this mail, I ran:

    git log -p > log-normal
    git log -p --patience > log-patience

in git.git.

I saw that patience diff is slower than normal diff, which is expected,
but I had to kill the latter command because it leaks like hell. I've
not investigated yet (And yes I'm running your latest posted series).

> On jeu, jan 01, 2009 at 07:45:21 +0000, Linus Torvalds wrote:
> > On Thu, 1 Jan 2009, Johannes Schindelin wrote:
> > > 
> > > Nothing fancy, really, just a straight-forward implementation of the
> > > heavily under-documented and under-analyzed paience diff algorithm.
> > 
> > Exactly because the patience diff is so under-documented, could you 
> > perhaps give a few examples of how it differs in the result, and why it's 
> > so wonderful? Yes, yes, I can google, and no, no, nothing useful shows up 
> > except for *totally* content-free fanboisms. 
> > 
> > So could we have some actual real data on it?

> I've checked in many projects I have under git, the differences between
> git log -p and git log -p --patience. The patience algorithm is really
> really more readable with it involves code moves with changes in the
> moved sections. If the section you move across is smaller than the moved
> ones, the patience algorithm will show the moved code as removed where
> it was and added where it now is, changes included. The current diff
> will rather move the smaller invariend section you move across and
> present mangled diffs involving the function prototypes making it less
> than readable.

Actually git.git has a canonical example of this in 214a34d22.

For those not having the --patience diff applied locally, attached are
the two patches git show / git show --patience give. It's of course a
matter of taste, but I like the patience version a lot more.

I'm also curious to see what a merge conflict with such a move would
look like (e.g. inverting some of the added arguments to the factorized
function of 214a34d22 or something similar). I'm somehow convinced that
it would generate a nicer conflict somehow.

-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org

[-- Attachment #1.2: 214a34d22.normal --]
[-- Type: text/plain, Size: 2323 bytes --]

commit 214a34d22ec59ec7e1166772f06ecf8799f96c96
Author: Florian Weimer <fw@deneb.enyo.de>
Date:   Sun Aug 31 17:45:04 2008 +0200

    git-svn: Introduce SVN::Git::Editor::_chg_file_get_blob
    
    Signed-off-by: Florian Weimer <fw@deneb.enyo.de>
    Acked-by: Eric Wong <normalperson@yhbt.net>

diff --git a/git-svn.perl b/git-svn.perl
index 0479f41..2c3e13f 100755
--- a/git-svn.perl
+++ b/git-svn.perl
@@ -3663,28 +3663,35 @@ sub change_file_prop {
 	$self->SUPER::change_file_prop($fbat, $pname, $pval, $self->{pool});
 }
 
-sub chg_file {
-	my ($self, $fbat, $m) = @_;
-	if ($m->{mode_b} =~ /755$/ && $m->{mode_a} !~ /755$/) {
-		$self->change_file_prop($fbat,'svn:executable','*');
-	} elsif ($m->{mode_b} !~ /755$/ && $m->{mode_a} =~ /755$/) {
-		$self->change_file_prop($fbat,'svn:executable',undef);
-	}
-	my $fh = Git::temp_acquire('git_blob');
-	if ($m->{mode_b} =~ /^120/) {
+sub _chg_file_get_blob ($$$$) {
+	my ($self, $fbat, $m, $which) = @_;
+	my $fh = Git::temp_acquire("git_blob_$which");
+	if ($m->{"mode_$which"} =~ /^120/) {
 		print $fh 'link ' or croak $!;
 		$self->change_file_prop($fbat,'svn:special','*');
-	} elsif ($m->{mode_a} =~ /^120/ && $m->{mode_b} !~ /^120/) {
+	} elsif ($m->{mode_a} =~ /^120/ && $m->{"mode_$which"} !~ /^120/) {
 		$self->change_file_prop($fbat,'svn:special',undef);
 	}
-	my $size = $::_repository->cat_blob($m->{sha1_b}, $fh);
-	croak "Failed to read object $m->{sha1_b}" if ($size < 0);
+	my $blob = $m->{"sha1_$which"};
+	return ($fh,) if ($blob =~ /^0{40}$/);
+	my $size = $::_repository->cat_blob($blob, $fh);
+	croak "Failed to read object $blob" if ($size < 0);
 	$fh->flush == 0 or croak $!;
 	seek $fh, 0, 0 or croak $!;
 
 	my $exp = ::md5sum($fh);
 	seek $fh, 0, 0 or croak $!;
+	return ($fh, $exp);
+}
 
+sub chg_file {
+	my ($self, $fbat, $m) = @_;
+	if ($m->{mode_b} =~ /755$/ && $m->{mode_a} !~ /755$/) {
+		$self->change_file_prop($fbat,'svn:executable','*');
+	} elsif ($m->{mode_b} !~ /755$/ && $m->{mode_a} =~ /755$/) {
+		$self->change_file_prop($fbat,'svn:executable',undef);
+	}
+	my ($fh, $exp) = _chg_file_get_blob $self, $fbat, $m, 'b';
 	my $pool = SVN::Pool->new;
 	my $atd = $self->apply_textdelta($fbat, undef, $pool);
 	my $got = SVN::TxDelta::send_stream($fh, @$atd, $pool);

[-- Attachment #1.3: 214a34d22.patience --]
[-- Type: text/plain, Size: 2290 bytes --]

commit 214a34d22ec59ec7e1166772f06ecf8799f96c96
Author: Florian Weimer <fw@deneb.enyo.de>
Date:   Sun Aug 31 17:45:04 2008 +0200

    git-svn: Introduce SVN::Git::Editor::_chg_file_get_blob
    
    Signed-off-by: Florian Weimer <fw@deneb.enyo.de>
    Acked-by: Eric Wong <normalperson@yhbt.net>

diff --git a/git-svn.perl b/git-svn.perl
index 0479f41..2c3e13f 100755
--- a/git-svn.perl
+++ b/git-svn.perl
@@ -3663,6 +3663,27 @@ sub change_file_prop {
 	$self->SUPER::change_file_prop($fbat, $pname, $pval, $self->{pool});
 }
 
+sub _chg_file_get_blob ($$$$) {
+	my ($self, $fbat, $m, $which) = @_;
+	my $fh = Git::temp_acquire("git_blob_$which");
+	if ($m->{"mode_$which"} =~ /^120/) {
+		print $fh 'link ' or croak $!;
+		$self->change_file_prop($fbat,'svn:special','*');
+	} elsif ($m->{mode_a} =~ /^120/ && $m->{"mode_$which"} !~ /^120/) {
+		$self->change_file_prop($fbat,'svn:special',undef);
+	}
+	my $blob = $m->{"sha1_$which"};
+	return ($fh,) if ($blob =~ /^0{40}$/);
+	my $size = $::_repository->cat_blob($blob, $fh);
+	croak "Failed to read object $blob" if ($size < 0);
+	$fh->flush == 0 or croak $!;
+	seek $fh, 0, 0 or croak $!;
+
+	my $exp = ::md5sum($fh);
+	seek $fh, 0, 0 or croak $!;
+	return ($fh, $exp);
+}
+
 sub chg_file {
 	my ($self, $fbat, $m) = @_;
 	if ($m->{mode_b} =~ /755$/ && $m->{mode_a} !~ /755$/) {
@@ -3670,21 +3691,7 @@ sub chg_file {
 	} elsif ($m->{mode_b} !~ /755$/ && $m->{mode_a} =~ /755$/) {
 		$self->change_file_prop($fbat,'svn:executable',undef);
 	}
-	my $fh = Git::temp_acquire('git_blob');
-	if ($m->{mode_b} =~ /^120/) {
-		print $fh 'link ' or croak $!;
-		$self->change_file_prop($fbat,'svn:special','*');
-	} elsif ($m->{mode_a} =~ /^120/ && $m->{mode_b} !~ /^120/) {
-		$self->change_file_prop($fbat,'svn:special',undef);
-	}
-	my $size = $::_repository->cat_blob($m->{sha1_b}, $fh);
-	croak "Failed to read object $m->{sha1_b}" if ($size < 0);
-	$fh->flush == 0 or croak $!;
-	seek $fh, 0, 0 or croak $!;
-
-	my $exp = ::md5sum($fh);
-	seek $fh, 0, 0 or croak $!;
-
+	my ($fh, $exp) = _chg_file_get_blob $self, $fbat, $m, 'b';
 	my $pool = SVN::Pool->new;
 	my $atd = $self->apply_textdelta($fbat, undef, $pool);
 	my $got = SVN::TxDelta::send_stream($fh, @$atd, $pool);

[-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-06 11:17     ` Pierre Habouzit
  2009-01-06 11:39       ` Pierre Habouzit
@ 2009-01-06 19:40       ` Johannes Schindelin
  2009-01-07 14:39         ` Pierre Habouzit
  2009-01-07 20:15         ` [PATCH 0/3] Teach Git about " Sam Vilain
  1 sibling, 2 replies; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-06 19:40 UTC (permalink / raw)
  To: Pierre Habouzit; +Cc: Linus Torvalds, davidel, Francis Galiegue, Git ML

Hi,

On Tue, 6 Jan 2009, Pierre Habouzit wrote:

> On jeu, jan 01, 2009 at 04:38:09 +0000, Johannes Schindelin wrote:
> > 
> > Nothing fancy, really, just a straight-forward implementation of the
> > heavily under-documented and under-analyzed paience diff algorithm.
> 
> Btw, what is the status of this series ? I see it neither in pu nor in 
> next. And I would gladly see it included in git.

AFAICT people wanted to be reasonably sure that it is worth the effort.

Although I would like to see it in once it is fleshed out -- even if it 
does not meet our usefulness standard -- because people said Git is 
inferior for not providing a patience diff.  If we have --patience, we can 
say "but we have it, it's just not useful, check for yourself".

> On jeu, jan 01, 2009 at 07:45:21 +0000, Linus Torvalds wrote:
> > 
> > So could we have some actual real data on it?
> 
> For example, look at the following (shamelessly stolen from the real
> life example http://glandium.org/blog/?p=120).

Due to the lines being much longer than 80 characters, this example was 
not useful to me.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-06 19:40       ` Johannes Schindelin
@ 2009-01-07 14:39         ` Pierre Habouzit
  2009-01-07 17:01           ` Johannes Schindelin
  2009-01-07 20:15         ` [PATCH 0/3] Teach Git about " Sam Vilain
  1 sibling, 1 reply; 67+ messages in thread
From: Pierre Habouzit @ 2009-01-07 14:39 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Linus Torvalds, davidel, Francis Galiegue, Git ML


[-- Attachment #1.1: Type: text/plain, Size: 1832 bytes --]

On Tue, Jan 06, 2009 at 07:40:02PM +0000, Johannes Schindelin wrote:
> Hi,
> 
> On Tue, 6 Jan 2009, Pierre Habouzit wrote:
> 
> > On jeu, jan 01, 2009 at 04:38:09 +0000, Johannes Schindelin wrote:
> > > 
> > > Nothing fancy, really, just a straight-forward implementation of the
> > > heavily under-documented and under-analyzed paience diff algorithm.
> > 
> > Btw, what is the status of this series ? I see it neither in pu nor in 
> > next. And I would gladly see it included in git.
> 
> AFAICT people wanted to be reasonably sure that it is worth the effort.

Fair enough.

> Although I would like to see it in once it is fleshed out -- even if it 
> does not meet our usefulness standard -- because people said Git is 
> inferior for not providing a patience diff.  If we have --patience, we can 
> say "but we have it, it's just not useful, check for yourself".

Well I believe it's useful, but maybe the standard algorithm could be
tweaked the way Linus proposes to make the "long" lines weight louder or
so.

> Due to the lines being much longer than 80 characters, this example was 
> not useful to me.

I hope the other one was ;)

WRT the leaks, you want to squash the attached patch on the proper
patches of your series (maybe the xdl_free on map.entries could be put
in a hasmap_destroy or similar btw, but valgrind reports no more leaks
in xdiff now). As of timings, with the current implementation, patience
diff is around 30% slower than the default implementation, which is a
bit worse than what I would expect, probably due to the fact that you
had to work around the libxdiff interface I guess.

-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org

[-- Attachment #1.2: memory-leaks.diff --]
[-- Type: text/plain, Size: 1120 bytes --]

diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
index d01cbdd..c2ffb03 100644
--- a/xdiff/xpatience.c
+++ b/xdiff/xpatience.c
@@ -203,8 +203,10 @@ static struct entry *find_longest_common_sequence(struct hashmap *map)
 	}
 
 	/* No common unique lines were found */
-	if (!longest)
+	if (!longest) {
+		xdl_free(sequence);
 		return NULL;
+	}
 
 	/* Iterate starting at the last element, adjusting the "next" members */
 	entry = sequence[longest - 1];
@@ -213,6 +215,7 @@ static struct entry *find_longest_common_sequence(struct hashmap *map)
 		entry->previous->next = entry;
 		entry = entry->previous;
 	}
+	xdl_free(sequence);
 	return entry;
 }
 
@@ -350,6 +353,7 @@ static int patience_diff(mmfile_t *file1, mmfile_t *file2,
 			env->xdf1.rchg[line1++ - 1] = 1;
 		while(count2--)
 			env->xdf2.rchg[line2++ - 1] = 1;
+		xdl_free(map.entries);
 		return 0;
 	}
 
@@ -361,6 +365,7 @@ static int patience_diff(mmfile_t *file1, mmfile_t *file2,
 		result = fall_back_to_classic_diff(&map,
 			line1, count1, line2, count2);
 
+	xdl_free(map.entries);
 	return result;
 }
 

[-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-07 14:39         ` Pierre Habouzit
@ 2009-01-07 17:01           ` Johannes Schindelin
  2009-01-07 17:04             ` [PATCH v3 1/3] Implement " Johannes Schindelin
  0 siblings, 1 reply; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-07 17:01 UTC (permalink / raw)
  To: Pierre Habouzit; +Cc: Linus Torvalds, davidel, Francis Galiegue, Git ML

Hi,

On Wed, 7 Jan 2009, Pierre Habouzit wrote:

> On Tue, Jan 06, 2009 at 07:40:02PM +0000, Johannes Schindelin wrote:
> 
> > Although I would like to see it in once it is fleshed out -- even if 
> > it does not meet our usefulness standard -- because people said Git is 
> > inferior for not providing a patience diff.  If we have --patience, we 
> > can say "but we have it, it's just not useful, check for yourself".
> 
> Well I believe it's useful, but maybe the standard algorithm could be 
> tweaked the way Linus proposes to make the "long" lines weight louder or 
> so.

I think this "weighting idea" is a bit too much of handwaving to start 
anything close to a design; as I pointed out, anything that has something 
different than a 1 for a deleted/added line affects performance 
negatively.

> WRT the leaks, you want to squash the attached patch on the proper
> patches of your series (maybe the xdl_free on map.entries could be put
> in a hasmap_destroy or similar btw, but valgrind reports no more leaks
> in xdiff now).

Thanks!

I also squashed in a patch that avoids calling xdl_cleanup_records() and 
then memset()ing the rchg array to 0 (which worked around the segmentation 
fault).

Patch 1/3 v3 follows,
Dscho

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v3 1/3] Implement the patience diff algorithm
  2009-01-07 17:01           ` Johannes Schindelin
@ 2009-01-07 17:04             ` Johannes Schindelin
  2009-01-07 18:10               ` Davide Libenzi
  0 siblings, 1 reply; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-07 17:04 UTC (permalink / raw)
  To: Pierre Habouzit; +Cc: Linus Torvalds, davidel, Francis Galiegue, Git ML


The patience diff algorithm produces slightly more intuitive output
than the classic Myers algorithm, as it does not try to minimize the
number of +/- lines first, but tries to preserve the lines that are
unique.

To this end, it first determines lines that are unique in both files,
then the maximal sequence which preserves the order (relative to both
files) is extracted.

Starting from this initial set of common lines, the rest of the lines
is handled recursively, with Myers' algorithm as a fallback when
the patience algorithm fails (due to no common unique lines).

This patch includes memory leak fixes by Pierre Habouzit.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---

	This comes close to 'next' quality, IMHO.

	Changes vs v2: Pierre's memory leak fixes, and it now avoids 
	calling xdl_cleanup_records() and then throwing away the 
	results for patience diff.

	Pierre, could you do the performance tests again?  I _think_ it 
	might be somewhat faster now because of the xdl_cleanup_records()
	avoidance.

 xdiff/xdiff.h     |    1 +
 xdiff/xdiffi.c    |    3 +
 xdiff/xdiffi.h    |    2 +
 xdiff/xpatience.c |  381 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 xdiff/xprepare.c  |    3 +-
 5 files changed, 389 insertions(+), 1 deletions(-)
 create mode 100644 xdiff/xpatience.c

diff --git a/xdiff/xdiff.h b/xdiff/xdiff.h
index 361f802..4da052a 100644
--- a/xdiff/xdiff.h
+++ b/xdiff/xdiff.h
@@ -32,6 +32,7 @@ extern "C" {
 #define XDF_IGNORE_WHITESPACE (1 << 2)
 #define XDF_IGNORE_WHITESPACE_CHANGE (1 << 3)
 #define XDF_IGNORE_WHITESPACE_AT_EOL (1 << 4)
+#define XDF_PATIENCE_DIFF (1 << 5)
 #define XDF_WHITESPACE_FLAGS (XDF_IGNORE_WHITESPACE | XDF_IGNORE_WHITESPACE_CHANGE | XDF_IGNORE_WHITESPACE_AT_EOL)
 
 #define XDL_PATCH_NORMAL '-'
diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 9d0324a..3e97462 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -329,6 +329,9 @@ int xdl_do_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
 	xdalgoenv_t xenv;
 	diffdata_t dd1, dd2;
 
+	if (xpp->flags & XDF_PATIENCE_DIFF)
+		return xdl_do_patience_diff(mf1, mf2, xpp, xe);
+
 	if (xdl_prepare_env(mf1, mf2, xpp, xe) < 0) {
 
 		return -1;
diff --git a/xdiff/xdiffi.h b/xdiff/xdiffi.h
index 3e099dc..ad033a8 100644
--- a/xdiff/xdiffi.h
+++ b/xdiff/xdiffi.h
@@ -55,5 +55,7 @@ int xdl_build_script(xdfenv_t *xe, xdchange_t **xscr);
 void xdl_free_script(xdchange_t *xscr);
 int xdl_emit_diff(xdfenv_t *xe, xdchange_t *xscr, xdemitcb_t *ecb,
 		  xdemitconf_t const *xecfg);
+int xdl_do_patience_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
+		xdfenv_t *env);
 
 #endif /* #if !defined(XDIFFI_H) */
diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
new file mode 100644
index 0000000..e42c16a
--- /dev/null
+++ b/xdiff/xpatience.c
@@ -0,0 +1,381 @@
+/*
+ *  LibXDiff by Davide Libenzi ( File Differential Library )
+ *  Copyright (C) 2003-2009 Davide Libenzi, Johannes E. Schindelin
+ *
+ *  This library is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU Lesser General Public
+ *  License as published by the Free Software Foundation; either
+ *  version 2.1 of the License, or (at your option) any later version.
+ *
+ *  This library is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ *  Lesser General Public License for more details.
+ *
+ *  You should have received a copy of the GNU Lesser General Public
+ *  License along with this library; if not, write to the Free Software
+ *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ *  Davide Libenzi <davidel@xmailserver.org>
+ *
+ */
+#include "xinclude.h"
+#include "xtypes.h"
+#include "xdiff.h"
+
+/*
+ * The basic idea of patience diff is to find lines that are unique in
+ * both files.  These are intuitively the ones that we want to see as
+ * common lines.
+ *
+ * The maximal ordered sequence of such line pairs (where ordered means
+ * that the order in the sequence agrees with the order of the lines in
+ * both files) naturally defines an initial set of common lines.
+ *
+ * Now, the algorithm tries to extend the set of common lines by growing
+ * the line ranges where the files have identical lines.
+ *
+ * Between those common lines, the patience diff algorithm is applied
+ * recursively, until no unique line pairs can be found; these line ranges
+ * are handled by the well-known Myers algorithm.
+ */
+
+#define NON_UNIQUE ULONG_MAX
+
+/*
+ * This is a hash mapping from line hash to line numbers in the first and
+ * second file.
+ */
+struct hashmap {
+	int nr, alloc;
+	struct entry {
+		unsigned long hash;
+		/*
+		 * 0 = unused entry, 1 = first line, 2 = second, etc.
+		 * line2 is NON_UNIQUE if the line is not unique
+		 * in either the first or the second file.
+		 */
+		unsigned long line1, line2;
+		/*
+		 * "next" & "previous" are used for the longest common
+		 * sequence;
+		 * initially, "next" reflects only the order in file1.
+		 */
+		struct entry *next, *previous;
+	} *entries, *first, *last;
+	/* were common records found? */
+	unsigned long has_matches;
+	mmfile_t *file1, *file2;
+	xdfenv_t *env;
+	xpparam_t const *xpp;
+};
+
+/* The argument "pass" is 1 for the first file, 2 for the second. */
+static void insert_record(int line, struct hashmap *map, int pass)
+{
+	xrecord_t **records = pass == 1 ?
+		map->env->xdf1.recs : map->env->xdf2.recs;
+	xrecord_t *record = records[line - 1], *other;
+	/*
+	 * After xdl_prepare_env() (or more precisely, due to
+	 * xdl_classify_record()), the "ha" member of the records (AKA lines)
+	 * is _not_ the hash anymore, but a linearized version of it.  In
+	 * other words, the "ha" member is guaranteed to start with 0 and
+	 * the second record's ha can only be 0 or 1, etc.
+	 *
+	 * So we multiply ha by 2 in the hope that the hashing was
+	 * "unique enough".
+	 */
+	int index = (int)((record->ha << 1) % map->alloc);
+
+	while (map->entries[index].line1) {
+		other = map->env->xdf1.recs[map->entries[index].line1 - 1];
+		if (map->entries[index].hash != record->ha ||
+				!xdl_recmatch(record->ptr, record->size,
+					other->ptr, other->size,
+					map->xpp->flags)) {
+			if (++index >= map->alloc)
+				index = 0;
+			continue;
+		}
+		if (pass == 2)
+			map->has_matches = 1;
+		if (pass == 1 || map->entries[index].line2)
+			map->entries[index].line2 = NON_UNIQUE;
+		else
+			map->entries[index].line2 = line;
+		return;
+	}
+	if (pass == 2)
+		return;
+	map->entries[index].line1 = line;
+	map->entries[index].hash = record->ha;
+	if (!map->first)
+		map->first = map->entries + index;
+	if (map->last) {
+		map->last->next = map->entries + index;
+		map->entries[index].previous = map->last;
+	}
+	map->last = map->entries + index;
+	map->nr++;
+}
+
+/*
+ * This function has to be called for each recursion into the inter-hunk
+ * parts, as previously non-unique lines can become unique when being
+ * restricted to a smaller part of the files.
+ *
+ * It is assumed that env has been prepared using xdl_prepare().
+ */
+static int fill_hashmap(mmfile_t *file1, mmfile_t *file2,
+		xpparam_t const *xpp, xdfenv_t *env,
+		struct hashmap *result,
+		int line1, int count1, int line2, int count2)
+{
+	result->file1 = file1;
+	result->file2 = file2;
+	result->xpp = xpp;
+	result->env = env;
+
+	/* We know exactly how large we want the hash map */
+	result->alloc = count1 * 2;
+	result->entries = (struct entry *)
+		xdl_malloc(result->alloc * sizeof(struct entry));
+	if (!result->entries)
+		return -1;
+	memset(result->entries, 0, result->alloc * sizeof(struct entry));
+
+	/* First, fill with entries from the first file */
+	while (count1--)
+		insert_record(line1++, result, 1);
+
+	/* Then search for matches in the second file */
+	while (count2--)
+		insert_record(line2++, result, 2);
+
+	return 0;
+}
+
+/*
+ * Find the longest sequence with a smaller last element (meaning a smaller
+ * line2, as we construct the sequence with entries ordered by line1).
+ */
+static int binary_search(struct entry **sequence, int longest,
+		struct entry *entry)
+{
+	int left = -1, right = longest;
+
+	while (left + 1 < right) {
+		int middle = (left + right) / 2;
+		/* by construction, no two entries can be equal */
+		if (sequence[middle]->line2 > entry->line2)
+			right = middle;
+		else
+			left = middle;
+	}
+	/* return the index in "sequence", _not_ the sequence length */
+	return left;
+}
+
+/*
+ * The idea is to start with the list of common unique lines sorted by
+ * the order in file1.  For each of these pairs, the longest (partial)
+ * sequence whose last element's line2 is smaller is determined.
+ *
+ * For efficiency, the sequences are kept in a list containing exactly one
+ * item per sequence length: the sequence with the smallest last
+ * element (in terms of line2).
+ */
+static struct entry *find_longest_common_sequence(struct hashmap *map)
+{
+	struct entry **sequence = xdl_malloc(map->nr * sizeof(struct entry *));
+	int longest = 0, i;
+	struct entry *entry;
+
+	for (entry = map->first; entry; entry = entry->next) {
+		if (!entry->line2 || entry->line2 == NON_UNIQUE)
+			continue;
+		i = binary_search(sequence, longest, entry);
+		entry->previous = i < 0 ? NULL : sequence[i];
+		sequence[++i] = entry;
+		if (i == longest)
+			longest++;
+	}
+
+	/* No common unique lines were found */
+	if (!longest) {
+		xdl_free(sequence);
+		return NULL;
+	}
+
+	/* Iterate starting at the last element, adjusting the "next" members */
+	entry = sequence[longest - 1];
+	entry->next = NULL;
+	while (entry->previous) {
+		entry->previous->next = entry;
+		entry = entry->previous;
+	}
+	xdl_free(sequence);
+	return entry;
+}
+
+static int match(struct hashmap *map, int line1, int line2)
+{
+	xrecord_t *record1 = map->env->xdf1.recs[line1 - 1];
+	xrecord_t *record2 = map->env->xdf2.recs[line2 - 1];
+	return xdl_recmatch(record1->ptr, record1->size,
+		record2->ptr, record2->size, map->xpp->flags);
+}
+
+static int patience_diff(mmfile_t *file1, mmfile_t *file2,
+		xpparam_t const *xpp, xdfenv_t *env,
+		int line1, int count1, int line2, int count2);
+
+static int walk_common_sequence(struct hashmap *map, struct entry *first,
+		int line1, int count1, int line2, int count2)
+{
+	int end1 = line1 + count1, end2 = line2 + count2;
+	int next1, next2;
+
+	for (;;) {
+		/* Try to grow the line ranges of common lines */
+		if (first) {
+			next1 = first->line1;
+			next2 = first->line2;
+			while (next1 > line1 && next2 > line2 &&
+					match(map, next1 - 1, next2 - 1)) {
+				next1--;
+				next2--;
+			}
+		} else {
+			next1 = end1;
+			next2 = end2;
+		}
+		while (line1 < next1 && line2 < next2 &&
+				match(map, line1, line2)) {
+			line1++;
+			line2++;
+		}
+
+		/* Recurse */
+		if (next1 > line1 || next2 > line2) {
+			struct hashmap submap;
+
+			memset(&submap, 0, sizeof(submap));
+			if (patience_diff(map->file1, map->file2,
+					map->xpp, map->env,
+					line1, next1 - line1,
+					line2, next2 - line2))
+				return -1;
+		}
+
+		if (!first)
+			return 0;
+
+		while (first->next &&
+				first->next->line1 == first->line1 + 1 &&
+				first->next->line2 == first->line2 + 1)
+			first = first->next;
+
+		line1 = first->line1 + 1;
+		line2 = first->line2 + 1;
+
+		first = first->next;
+	}
+}
+
+static int fall_back_to_classic_diff(struct hashmap *map,
+		int line1, int count1, int line2, int count2)
+{
+	/*
+	 * This probably does not work outside Git, since
+	 * we have a very simple mmfile structure.
+	 *
+	 * Note: ideally, we would reuse the prepared environment, but
+	 * the libxdiff interface does not (yet) allow for diffing only
+	 * ranges of lines instead of the whole files.
+	 */
+	mmfile_t subfile1, subfile2;
+	xpparam_t xpp;
+	xdfenv_t env;
+
+	subfile1.ptr = (char *)map->env->xdf1.recs[line1 - 1]->ptr;
+	subfile1.size = map->env->xdf1.recs[line1 + count1 - 2]->ptr +
+		map->env->xdf1.recs[line1 + count1 - 2]->size - subfile1.ptr;
+	subfile2.ptr = (char *)map->env->xdf2.recs[line2 - 1]->ptr;
+	subfile2.size = map->env->xdf2.recs[line2 + count2 - 2]->ptr +
+		map->env->xdf2.recs[line2 + count2 - 2]->size - subfile2.ptr;
+	xpp.flags = map->xpp->flags & ~XDF_PATIENCE_DIFF;
+	if (xdl_do_diff(&subfile1, &subfile2, &xpp, &env) < 0)
+		return -1;
+
+	memcpy(map->env->xdf1.rchg + line1 - 1, env.xdf1.rchg, count1);
+	memcpy(map->env->xdf2.rchg + line2 - 1, env.xdf2.rchg, count2);
+
+	xdl_free_env(&env);
+
+	return 0;
+}
+
+/*
+ * Recursively find the longest common sequence of unique lines,
+ * and if none was found, ask xdl_do_diff() to do the job.
+ *
+ * This function assumes that env was prepared with xdl_prepare_env().
+ */
+static int patience_diff(mmfile_t *file1, mmfile_t *file2,
+		xpparam_t const *xpp, xdfenv_t *env,
+		int line1, int count1, int line2, int count2)
+{
+	struct hashmap map;
+	struct entry *first;
+	int result = 0;
+
+	/* trivial case: one side is empty */
+	if (!count1) {
+		while(count2--)
+			env->xdf2.rchg[line2++ - 1] = 1;
+		return 0;
+	} else if (!count2) {
+		while(count1--)
+			env->xdf1.rchg[line1++ - 1] = 1;
+		return 0;
+	}
+
+	memset(&map, 0, sizeof(map));
+	if (fill_hashmap(file1, file2, xpp, env, &map,
+			line1, count1, line2, count2))
+		return -1;
+
+	/* are there any matching lines at all? */
+	if (!map.has_matches) {
+		while(count1--)
+			env->xdf1.rchg[line1++ - 1] = 1;
+		while(count2--)
+			env->xdf2.rchg[line2++ - 1] = 1;
+		xdl_free(map.entries);
+		return 0;
+	}
+
+	first = find_longest_common_sequence(&map);
+	if (first)
+		result = walk_common_sequence(&map, first,
+			line1, count1, line2, count2);
+	else
+		result = fall_back_to_classic_diff(&map,
+			line1, count1, line2, count2);
+
+	xdl_free(map.entries);
+	return result;
+}
+
+int xdl_do_patience_diff(mmfile_t *file1, mmfile_t *file2,
+		xpparam_t const *xpp, xdfenv_t *env)
+{
+	if (xdl_prepare_env(file1, file2, xpp, env) < 0)
+		return -1;
+
+	/* environment is cleaned up in xdl_diff() */
+	return patience_diff(file1, file2, xpp, env,
+			1, env->xdf1.nrec, 1, env->xdf2.nrec);
+}
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index a43aa72..1689085 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -290,7 +290,8 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
 
 	xdl_free_classifier(&cf);
 
-	if (xdl_optimize_ctxs(&xe->xdf1, &xe->xdf2) < 0) {
+	if (!(xpp->flags & XDF_PATIENCE_DIFF) &&
+			xdl_optimize_ctxs(&xe->xdf1, &xe->xdf2) < 0) {
 
 		xdl_free_ctx(&xe->xdf2);
 		xdl_free_ctx(&xe->xdf1);
-- 
1.6.0.2.GIT

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH v3 1/3] Implement the patience diff algorithm
  2009-01-07 17:04             ` [PATCH v3 1/3] Implement " Johannes Schindelin
@ 2009-01-07 18:10               ` Davide Libenzi
  2009-01-07 18:32                 ` Johannes Schindelin
  2009-01-07 18:59                 ` Linus Torvalds
  0 siblings, 2 replies; 67+ messages in thread
From: Davide Libenzi @ 2009-01-07 18:10 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Pierre Habouzit, Linus Torvalds, Francis Galiegue, Git ML

On Wed, 7 Jan 2009, Johannes Schindelin wrote:

> 
> The patience diff algorithm produces slightly more intuitive output
> than the classic Myers algorithm, as it does not try to minimize the
> number of +/- lines first, but tries to preserve the lines that are
> unique.

Johannes, sorry I had not time to follow this one. A couple of minor 
comments that arose just at glancing at the patch.



> +/*
> + *  LibXDiff by Davide Libenzi ( File Differential Library )
> + *  Copyright (C) 2003-2009 Davide Libenzi, Johannes E. Schindelin
> + *
> + *  This library is free software; you can redistribute it and/or
> + *  modify it under the terms of the GNU Lesser General Public
> + *  License as published by the Free Software Foundation; either
> + *  version 2.1 of the License, or (at your option) any later version.
> + *
> + *  This library is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + *  Lesser General Public License for more details.
> + *
> + *  You should have received a copy of the GNU Lesser General Public
> + *  License along with this library; if not, write to the Free Software
> + *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
> + *
> + *  Davide Libenzi <davidel@xmailserver.org>

You do not need to give me credit for something I don't even know how it 
works ;)



> +static int fall_back_to_classic_diff(struct hashmap *map,
> +		int line1, int count1, int line2, int count2)
> +{
> +	/*
> +	 * This probably does not work outside Git, since
> +	 * we have a very simple mmfile structure.
> +	 *
> +	 * Note: ideally, we would reuse the prepared environment, but
> +	 * the libxdiff interface does not (yet) allow for diffing only
> +	 * ranges of lines instead of the whole files.
> +	 */
> +	mmfile_t subfile1, subfile2;
> +	xpparam_t xpp;
> +	xdfenv_t env;
> +
> +	subfile1.ptr = (char *)map->env->xdf1.recs[line1 - 1]->ptr;
> +	subfile1.size = map->env->xdf1.recs[line1 + count1 - 2]->ptr +
> +		map->env->xdf1.recs[line1 + count1 - 2]->size - subfile1.ptr;
> +	subfile2.ptr = (char *)map->env->xdf2.recs[line2 - 1]->ptr;
> +	subfile2.size = map->env->xdf2.recs[line2 + count2 - 2]->ptr +
> +		map->env->xdf2.recs[line2 + count2 - 2]->size - subfile2.ptr;
> +	xpp.flags = map->xpp->flags & ~XDF_PATIENCE_DIFF;
> +	if (xdl_do_diff(&subfile1, &subfile2, &xpp, &env) < 0)
> +		return -1;

xdiff allows for diffing ranges, and the most efficent method is exactly 
how you did ;) Once you know the lines pointers, there's no need to pass 
it the whole file and have it scan it whole to find the lines range it has 
to diff. Just pass the limited view like you did.


- Davide

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v3 1/3] Implement the patience diff algorithm
  2009-01-07 18:10               ` Davide Libenzi
@ 2009-01-07 18:32                 ` Johannes Schindelin
  2009-01-07 20:09                   ` Davide Libenzi
  2009-01-07 18:59                 ` Linus Torvalds
  1 sibling, 1 reply; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-07 18:32 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Pierre Habouzit, Linus Torvalds, Francis Galiegue, Git ML

Hi,

On Wed, 7 Jan 2009, Davide Libenzi wrote:

> On Wed, 7 Jan 2009, Johannes Schindelin wrote:
> 
> > The patience diff algorithm produces slightly more intuitive output 
> > than the classic Myers algorithm, as it does not try to minimize the 
> > number of +/- lines first, but tries to preserve the lines that are 
> > unique.
> 
> Johannes, sorry I had not time to follow this one.

What?  You mean you actually took some time off around Christmas???

:-)

> A couple of minor comments that arose just at glancing at the patch.

Thanks.

> > +/*
> > + *  LibXDiff by Davide Libenzi ( File Differential Library )
> > + *  Copyright (C) 2003-2009 Davide Libenzi, Johannes E. Schindelin
> > + *
> > + *  This library is free software; you can redistribute it and/or
> > + *  modify it under the terms of the GNU Lesser General Public
> > + *  License as published by the Free Software Foundation; either
> > + *  version 2.1 of the License, or (at your option) any later version.
> > + *
> > + *  This library is distributed in the hope that it will be useful,
> > + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > + *  Lesser General Public License for more details.
> > + *
> > + *  You should have received a copy of the GNU Lesser General Public
> > + *  License along with this library; if not, write to the Free Software
> > + *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
> > + *
> > + *  Davide Libenzi <davidel@xmailserver.org>
> 
> You do not need to give me credit for something I don't even know how it 
> works ;)

Well, I meant to pass the copyright to you, or at least share it.

> > +static int fall_back_to_classic_diff(struct hashmap *map,
> > +		int line1, int count1, int line2, int count2)
> > +{
> > +	/*
> > +	 * This probably does not work outside Git, since
> > +	 * we have a very simple mmfile structure.
> > +	 *
> > +	 * Note: ideally, we would reuse the prepared environment, but
> > +	 * the libxdiff interface does not (yet) allow for diffing only
> > +	 * ranges of lines instead of the whole files.
> > +	 */
> > +	mmfile_t subfile1, subfile2;
> > +	xpparam_t xpp;
> > +	xdfenv_t env;
> > +
> > +	subfile1.ptr = (char *)map->env->xdf1.recs[line1 - 1]->ptr;
> > +	subfile1.size = map->env->xdf1.recs[line1 + count1 - 2]->ptr +
> > +		map->env->xdf1.recs[line1 + count1 - 2]->size - subfile1.ptr;
> > +	subfile2.ptr = (char *)map->env->xdf2.recs[line2 - 1]->ptr;
> > +	subfile2.size = map->env->xdf2.recs[line2 + count2 - 2]->ptr +
> > +		map->env->xdf2.recs[line2 + count2 - 2]->size - subfile2.ptr;
> > +	xpp.flags = map->xpp->flags & ~XDF_PATIENCE_DIFF;
> > +	if (xdl_do_diff(&subfile1, &subfile2, &xpp, &env) < 0)
> > +		return -1;
> 
> xdiff allows for diffing ranges, and the most efficent method is exactly 
> how you did ;) Once you know the lines pointers, there's no need to pass 
> it the whole file and have it scan it whole to find the lines range it 
> has to diff. Just pass the limited view like you did.

Heh.

Could it be that you misread my patch, and assumed that I faked an 
xdfenv?

I did not, but instead faked two mmfiles, which is only as simple as I did 
it because in git.git, we only have contiguous mmfiles.  (I recall that 
libxdiff allows for ropes instead of arrays.)

The way I did it has one big shortcoming: I need to prepare an xdfenv for 
the subfiles even if I already prepared one for the complete files.  IOW 
the lines are rehashed all over again.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v3 1/3] Implement the patience diff algorithm
  2009-01-07 18:10               ` Davide Libenzi
  2009-01-07 18:32                 ` Johannes Schindelin
@ 2009-01-07 18:59                 ` Linus Torvalds
  2009-01-07 20:00                   ` Johannes Schindelin
  2009-01-07 20:11                   ` Davide Libenzi
  1 sibling, 2 replies; 67+ messages in thread
From: Linus Torvalds @ 2009-01-07 18:59 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Johannes Schindelin, Pierre Habouzit, Francis Galiegue, Git ML

On Wed, 7 Jan 2009, Davide Libenzi wrote:
> 
> xdiff allows for diffing ranges, and the most efficent method is exactly 
> how you did ;)

No, the performance problem is how it needs to re-hash everything. xdiff 
doesn't seem to have any way to use a "subset of the hash", so what the 
patience diff does is to use a subset of the mmfile, and then that will 
force all the rehashing to take place, which is kind of sad.

It would be nice (for patience diff) if it could partition the _hashes_ 
instead of partitioning the _data_. That way it wouldn't need to rehash. 
See?

		Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v3 1/3] Implement the patience diff algorithm
  2009-01-07 18:59                 ` Linus Torvalds
@ 2009-01-07 20:00                   ` Johannes Schindelin
  2009-01-07 20:11                   ` Davide Libenzi
  1 sibling, 0 replies; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-07 20:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Davide Libenzi, Pierre Habouzit, Francis Galiegue, Git ML

Hi,

On Wed, 7 Jan 2009, Linus Torvalds wrote:

> On Wed, 7 Jan 2009, Davide Libenzi wrote:
> > 
> > xdiff allows for diffing ranges, and the most efficent method is 
> > exactly how you did ;)
> 
> No, the performance problem is how it needs to re-hash everything. xdiff 
> doesn't seem to have any way to use a "subset of the hash", so what the 
> patience diff does is to use a subset of the mmfile, and then that will 
> force all the rehashing to take place, which is kind of sad.
> 
> It would be nice (for patience diff) if it could partition the _hashes_ 
> instead of partitioning the _data_. That way it wouldn't need to rehash. 
> See?

Actually, for libxdiff (non-patience), this is not possible, as 
xdl_cleanup_records() rewrites the hashes so that they are in a contiguous 
interval (0, ..., N-1), where N is the number of distinct lines found.

I am also pretty certain that at least the non-patience part needs the 
maximum of that "hash" to be as small as possible.

Having said that, if Linus likes patience diff enough to want it faster, 
Dscho will improve the speed by avoiding the rehashing for the patience 
diff part (although the lines for which patience has to fall back to Myers 
diff will _still_ rehash, but that does not hurt _that_ much in practice).

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v3 1/3] Implement the patience diff algorithm
  2009-01-07 18:32                 ` Johannes Schindelin
@ 2009-01-07 20:09                   ` Davide Libenzi
  2009-01-07 20:19                     ` Johannes Schindelin
  0 siblings, 1 reply; 67+ messages in thread
From: Davide Libenzi @ 2009-01-07 20:09 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Pierre Habouzit, Linus Torvalds, Francis Galiegue, Git ML

On Wed, 7 Jan 2009, Johannes Schindelin wrote:

> Heh.
> 
> Could it be that you misread my patch, and assumed that I faked an 
> xdfenv?
> 
> I did not, but instead faked two mmfiles, which is only as simple as I did 
> it because in git.git, we only have contiguous mmfiles.  (I recall that 
> libxdiff allows for ropes instead of arrays.)
> 
> The way I did it has one big shortcoming: I need to prepare an xdfenv for 
> the subfiles even if I already prepared one for the complete files.  IOW 
> the lines are rehashed all over again.

I told you I just glanced at the code :)
In that way, if you guys decide to merge this new algo, you'll need to 
split the prepare from the optimize, and feed it with an already prepared 
env.
Before going that way, have you ever tried to tweak xdl_cleanup_records 
and xdl_clean_mmatch to reduce the level of optimization, and see the 
results you get? It is possible that you won't need two different algos 
inside git.

- Davide

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v3 1/3] Implement the patience diff algorithm
  2009-01-07 18:59                 ` Linus Torvalds
  2009-01-07 20:00                   ` Johannes Schindelin
@ 2009-01-07 20:11                   ` Davide Libenzi
  1 sibling, 0 replies; 67+ messages in thread
From: Davide Libenzi @ 2009-01-07 20:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Schindelin, Pierre Habouzit, Francis Galiegue, Git ML

On Wed, 7 Jan 2009, Linus Torvalds wrote:

> On Wed, 7 Jan 2009, Davide Libenzi wrote:
> > 
> > xdiff allows for diffing ranges, and the most efficent method is exactly 
> > how you did ;)
> 
> No, the performance problem is how it needs to re-hash everything. xdiff 
> doesn't seem to have any way to use a "subset of the hash", so what the 
> patience diff does is to use a subset of the mmfile, and then that will 
> force all the rehashing to take place, which is kind of sad.
> 
> It would be nice (for patience diff) if it could partition the _hashes_ 
> instead of partitioning the _data_. That way it wouldn't need to rehash. 
> See?

Yeah, saw that afterwards ;)



- Davide

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-06 19:40       ` Johannes Schindelin
  2009-01-07 14:39         ` Pierre Habouzit
@ 2009-01-07 20:15         ` Sam Vilain
  2009-01-07 20:25           ` Linus Torvalds
  2009-01-07 20:38           ` Johannes Schindelin
  1 sibling, 2 replies; 67+ messages in thread
From: Sam Vilain @ 2009-01-07 20:15 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Pierre Habouzit, Linus Torvalds, davidel, Francis Galiegue,
	Git ML

On Tue, 2009-01-06 at 20:40 +0100, Johannes Schindelin wrote:
> Although I would like to see it in once it is fleshed out -- even if it 
> does not meet our usefulness standard -- because people said Git is 
> inferior for not providing a patience diff.  If we have --patience, we can 
> say "but we have it, it's just not useful, check for yourself".

Whatever happens, the current deterministic diff algorithm needs to stay
for generating patch-id's... those really can't be allowed to change.

Sam.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v3 1/3] Implement the patience diff algorithm
  2009-01-07 20:09                   ` Davide Libenzi
@ 2009-01-07 20:19                     ` Johannes Schindelin
  0 siblings, 0 replies; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-07 20:19 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Pierre Habouzit, Linus Torvalds, Francis Galiegue, Git ML

Hi,

On Wed, 7 Jan 2009, Davide Libenzi wrote:

> On Wed, 7 Jan 2009, Johannes Schindelin wrote:
> 
> > Could it be that you misread my patch, and assumed that I faked an 
> > xdfenv?
> > 
> > I did not, but instead faked two mmfiles, which is only as simple as I did 
> > it because in git.git, we only have contiguous mmfiles.  (I recall that 
> > libxdiff allows for ropes instead of arrays.)
> > 
> > The way I did it has one big shortcoming: I need to prepare an xdfenv for 
> > the subfiles even if I already prepared one for the complete files.  IOW 
> > the lines are rehashed all over again.
> 
> I told you I just glanced at the code :)
> In that way, if you guys decide to merge this new algo, you'll need to 
> split the prepare from the optimize, and feed it with an already prepared 
> env.

Right.

> Before going that way, have you ever tried to tweak xdl_cleanup_records 
> and xdl_clean_mmatch to reduce the level of optimization, and see the 
> results you get? It is possible that you won't need two different algos 
> inside git.

No, I hadn't thought that libxdiff already determines uniqueness before 
actually running xdl_do_diff().

I also have to admit that I am not as clever as other people, and had 
quite a hard time figuring out as much as I did (for example, that rchg[i] 
== 1 means that this line is to be added/deleted, and that i is in the 
range 0, ..., N - 1 rather than 1, ..., N).

So it is quite possible that something patience-like can be done earlier.

However, I do not see a way to implement the recursion necessary for the 
patience diff.  Remember:

	patience(line range):
		find unique lines
		if no unique lines found:
			resort to classical diff
			return
		extract the longest common sequence of unique common lines
		between those, recurse

When recursing, previously non-unique lines can turn unique, of course.  
And I do not see how that recursion could be done before 
xdl_clean_mmatch(), short of redoing the hashing and cleaning records up.

Of course, it might well be possible, but I am already out of my depth 
reading something like "rdis0" and "rpdis1", and being close to despair.

:')

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-07 20:15         ` [PATCH 0/3] Teach Git about " Sam Vilain
@ 2009-01-07 20:25           ` Linus Torvalds
  2009-01-08  2:31             ` Sam Vilain
  2009-01-07 20:38           ` Johannes Schindelin
  1 sibling, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2009-01-07 20:25 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Johannes Schindelin, Pierre Habouzit, davidel, Francis Galiegue,
	Git ML

On Thu, 8 Jan 2009, Sam Vilain wrote:
> 
> Whatever happens, the current deterministic diff algorithm needs to stay
> for generating patch-id's... those really can't be allowed to change.

Sure they can.

We never cache patch-id's over a long time. And we _have_ changed xdiff to 
modify the output of the patches before, quite regardless of any patience 
issues: see commit 9b28d55401a529ff08c709f42f66e765c93b0a20, which 
admittedly doesn't affect any _normal_ diffs, but can generate subtly 
different results for some cases.

It's true that we want the diff algorithm to be deterministic in the sense 
that over the run of a _single_ rebase operation, the diff between two 
files should give similar and deterministic results, but that's certainly 
true of patience diff too.

			Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-07 20:15         ` [PATCH 0/3] Teach Git about " Sam Vilain
  2009-01-07 20:25           ` Linus Torvalds
@ 2009-01-07 20:38           ` Johannes Schindelin
  2009-01-07 20:48             ` Junio C Hamano
  1 sibling, 1 reply; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-07 20:38 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Pierre Habouzit, Linus Torvalds, davidel, Francis Galiegue,
	Git ML

Hi,

On Thu, 8 Jan 2009, Sam Vilain wrote:

> On Tue, 2009-01-06 at 20:40 +0100, Johannes Schindelin wrote:
> > Although I would like to see it in once it is fleshed out -- even if it 
> > does not meet our usefulness standard -- because people said Git is 
> > inferior for not providing a patience diff.  If we have --patience, we can 
> > say "but we have it, it's just not useful, check for yourself".
> 
> Whatever happens, the current deterministic diff algorithm needs to stay
> for generating patch-id's... those really can't be allowed to change.

Oh, there is no discussion about replacing the diff algorithm we have 
right now.

Even if we never output patch-ids anywhere, so we always recalculate them, 
and therefore would be free to replace the diff algorithm with something 
else.

The thing why I do not want patience diff to replace the existing code 
is:

- patience diff is slower,
- patience diff is often not worth it (produces the same, maybe even 
  worse output), and
- patience diff needs the existing code as a fallback anyway.

Where it could possibly help to change existing behavior is with merging.

So maybe somebody has some time to play with, and can apply this patch:

-- snip --
diff --git a/ll-merge.c b/ll-merge.c
index fa2ca52..f731026 100644
--- a/ll-merge.c
+++ b/ll-merge.c
@@ -79,6 +79,8 @@ static int ll_xdl_merge(const struct ll_merge_driver *drv_unused,
 	memset(&xpp, 0, sizeof(xpp));
 	if (git_xmerge_style >= 0)
 		style = git_xmerge_style;
+	if (getenv("GIT_PATIENCE_MERGE"))
+		xpp.flags |= XDF_PATIENCE_DIFF;
 	return xdl_merge(orig,
 			 src1, name1,
 			 src2, name2,
-- snap --

After compiling and installing, something like this should be fun to 
watch:

	$ git rev-list --all --parents | \
	  grep " .* " | \
	  while read commit parent1 parent2 otherparents
	  do
		test -z "$otherparents" || continue
		git checkout $parent1 &&
		git merge $parent2 &&
		git diff > without-patience.txt &&
		git reset --hard $parent1 &&
		GIT_PATIENCE_MERGE=1 git merge $parent2 &&
		git diff > with-patience.txt &&
		if ! cmp without-patience.txt with-patience.txt
		then
			echo '==============================='
			echo "differences found in merge $commit"
			echo "without patience: $(wc -l < without-patience.txt)"
			echo "with patience: $(wc -l < with-patience.txt)"
			echo '-------------------------------'
			echo 'without patience:'
			cat without-patience.txt
			echo '-------------------------------'
			echo 'with patience:'
			cat with-patience.txt
		fi ||
		exit
	  done | tee patience-merge.out

Ciao,
Dscho

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-07 20:38           ` Johannes Schindelin
@ 2009-01-07 20:48             ` Junio C Hamano
  2009-01-07 22:00               ` Johannes Schindelin
  0 siblings, 1 reply; 67+ messages in thread
From: Junio C Hamano @ 2009-01-07 20:48 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Sam Vilain, Pierre Habouzit, Linus Torvalds, davidel,
	Francis Galiegue, Git ML

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> After compiling and installing, something like this should be fun to 
> watch:
>
> 	$ git rev-list --all --parents | \
> 	  grep " .* " | \
> 	  while read commit parent1 parent2 otherparents
> 	  do
> 		test -z "$otherparents" || continue
> 		git checkout $parent1 &&
> 		git merge $parent2 &&
> 		git diff > without-patience.txt &&
> ...
> 		if ! cmp without-patience.txt with-patience.txt
> 		then
> 			echo '==============================='
> 			echo "differences found in merge $commit"
> ...
> 			cat with-patience.txt
> 		fi ||
> 		exit
> 	  done | tee patience-merge.out

An even more interesting test would be possible by dropping "&&" from the
two "git merge" invocations.

 - Your sample will exit at the first conflicting merge otherwise.

 - You may find cases where one resolves cleanly while the other leaves
   conflicts.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-07 20:48             ` Junio C Hamano
@ 2009-01-07 22:00               ` Johannes Schindelin
  2009-01-07 22:45                 ` Pierre Habouzit
  0 siblings, 1 reply; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-07 22:00 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Sam Vilain, Pierre Habouzit, Linus Torvalds, davidel,
	Francis Galiegue, Git ML

Hi,

On Wed, 7 Jan 2009, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> 
> > After compiling and installing, something like this should be fun to 
> > watch:
> >
> > 	$ git rev-list --all --parents | \
> > 	  grep " .* " | \
> > 	  while read commit parent1 parent2 otherparents
> > 	  do
> > 		test -z "$otherparents" || continue
> > 		git checkout $parent1 &&
> > 		git merge $parent2 &&
> > 		git diff > without-patience.txt &&
> > ...
> > 		if ! cmp without-patience.txt with-patience.txt
> > 		then
> > 			echo '==============================='
> > 			echo "differences found in merge $commit"
> > ...
> > 			cat with-patience.txt
> > 		fi ||
> > 		exit
> > 	  done | tee patience-merge.out
> 
> An even more interesting test would be possible by dropping "&&" from the
> two "git merge" invocations.
> 
>  - Your sample will exit at the first conflicting merge otherwise.
> 
>  - You may find cases where one resolves cleanly while the other leaves
>    conflicts.

Yeah, that's why I always put "like" into that phrase "something like 
this"... :-)

Actually, I had to read something and did not want my box to sit idle 
while I was doing that, so...

The most interesting thing to me was: of the 4072 merges I have in my 
local git.git clone, only 66 show a difference.

The next interesting thing: none -- I repeat, none! -- resulted in only 
one of both methods having conflicts.  In all cases, if patience merge had 
conflicts, so had the classical merge, and vice versa.  I would have 
expected patience merge to handle some conflicts more gracefully.

So let's go on to the next metric: what are the differences in the --cc 
diffs' line counts?

On average, patience merge produced 1.03225806451613 more lines of --cc 
diff, and the standard deviation between the line counts is 
42.9823669772587.

So from the line counts' point of view, the difference is lost in the 
noise.

So let's look at a concrete example.  I take 
41a3e3aa9bdaede9ab7afed206428c1b071060d2, as it is one of the three merges 
with minimal --cc diff line counts (they all have 33 lines) and where 
patience merge makes a difference.

This is the --cc diff without patience merge:

-- snip --
diff --cc git-am.sh
index a391254,5a7695e..0000000
--- a/git-am.sh
+++ b/git-am.sh
@@@ -327,11 -327,20 +327,28 @@@ d
  			echo "Patch is empty.  Was it split wrong?"
  			stop_here $this
  		}
++<<<<<<< HEAD:git-am.sh
 +		SUBJECT="$(sed -n '/^Subject/ s/Subject: //p' "$dotest/info")"
 +		case "$keep_subject" in -k)  SUBJECT="[PATCH] $SUBJECT" ;; esac
 +
 +		(printf '%s\n\n' "$SUBJECT"; cat "$dotest/msg") |
 +			git stripspace > "$dotest/msg-clean"
++=======
+ 		if test -f "$dotest/rebasing" &&
+ 			commit=$(sed -e 's/^From \([0-9a-f]*\) .*/\1/' \
+ 				-e q "$dotest/$msgnum") &&
+ 			test "$(git cat-file -t "$commit")" = commit
+ 		then
+ 			git cat-file commit "$commit" |
+ 			sed -e '1,/^$/d' >"$dotest/msg-clean"
+ 		else
+ 			SUBJECT="$(sed -n '/^Subject/ s/Subject: //p' "$dotest/info")"
+ 			case "$keep_subject" in -k)  SUBJECT="[PATCH] $SUBJECT" ;; esac
+ 
+ 			(echo "$SUBJECT" ; echo ; cat "$dotest/msg") |
+ 				git stripspace > "$dotest/msg-clean"
+ 		fi
++>>>>>>> 5e835cac8622373724235d299f1331ac4cf81ccf:git-am.sh
  		;;
  	esac
  
-- snap --

Compare this with the --cc diff _with_ patience merge:

-- snip --
diff --cc git-am.sh
index a391254,5a7695e..0000000
--- a/git-am.sh
+++ b/git-am.sh
@@@ -327,11 -327,20 +327,25 @@@ d
  			echo "Patch is empty.  Was it split wrong?"
  			stop_here $this
  		}
- 		SUBJECT="$(sed -n '/^Subject/ s/Subject: //p' "$dotest/info")"
- 		case "$keep_subject" in -k)  SUBJECT="[PATCH] $SUBJECT" ;; esac
- 
+ 		if test -f "$dotest/rebasing" &&
+ 			commit=$(sed -e 's/^From \([0-9a-f]*\) .*/\1/' \
+ 				-e q "$dotest/$msgnum") &&
+ 			test "$(git cat-file -t "$commit")" = commit
+ 		then
+ 			git cat-file commit "$commit" |
+ 			sed -e '1,/^$/d' >"$dotest/msg-clean"
+ 		else
+ 			SUBJECT="$(sed -n '/^Subject/ s/Subject: //p' "$dotest/info")"
+ 			case "$keep_subject" in -k)  SUBJECT="[PATCH] $SUBJECT" ;; esac
+ 
++<<<<<<< HEAD:git-am.sh
 +		(printf '%s\n\n' "$SUBJECT"; cat "$dotest/msg") |
 +			git stripspace > "$dotest/msg-clean"
++=======
+ 			(echo "$SUBJECT" ; echo ; cat "$dotest/msg") |
+ 				git stripspace > "$dotest/msg-clean"
+ 		fi
++>>>>>>> 5e835cac8622373724235d299f1331ac4cf81ccf:git-am.sh
  		;;
  	esac
  
-- snap --

So, the patience merge resulted in a much smaller _conflict_.

However, another such merge is 276328ffb87cefdc515bee5f09916aea6e0244ed.  
This is the --cc diff without patience merge:

-- snip --
diff --cc diff.c
index 4e4e439,f91f256..0000000
--- a/diff.c
+++ b/diff.c
@@@ -1498,19 -1464,13 +1498,28 @@@ static void builtin_diff(const char *na
  	char *a_one, *b_two;
  	const char *set = diff_get_color_opt(o, DIFF_METAINFO);
  	const char *reset = diff_get_color_opt(o, DIFF_RESET);
 +	const char *a_prefix, *b_prefix;
 +
 +	diff_set_mnemonic_prefix(o, "a/", "b/");
 +	if (DIFF_OPT_TST(o, REVERSE_DIFF)) {
 +		a_prefix = o->b_prefix;
 +		b_prefix = o->a_prefix;
 +	} else {
 +		a_prefix = o->a_prefix;
 +		b_prefix = o->b_prefix;
 +	}
  
++<<<<<<< HEAD:diff.c
 +	a_one = quote_two(a_prefix, name_a + (*name_a == '/'));
 +	b_two = quote_two(b_prefix, name_b + (*name_b == '/'));
++=======
+ 	/* Never use a non-valid filename anywhere if at all possible */
+ 	name_a = DIFF_FILE_VALID(one) ? name_a : name_b;
+ 	name_b = DIFF_FILE_VALID(two) ? name_b : name_a;
+ 
+ 	a_one = quote_two(o->a_prefix, name_a + (*name_a == '/'));
+ 	b_two = quote_two(o->b_prefix, name_b + (*name_b == '/'));
++>>>>>>> e261cf94848d31868c21fb11cade51c30dfcdbe7:diff.c
  	lbl[0] = DIFF_FILE_VALID(one) ? a_one : "/dev/null";
  	lbl[1] = DIFF_FILE_VALID(two) ? b_two : "/dev/null";
  	fprintf(o->file, "%sdiff --git %s %s%s\n", set, a_one, b_two, reset);
-- snap --

And this is _with_ patience merge:

-- snip --
diff --cc diff.c
index 4e4e439,f91f256..0000000
--- a/diff.c
+++ b/diff.c
@@@ -1498,19 -1464,13 +1498,28 @@@ static void builtin_diff(const char *na
  	char *a_one, *b_two;
  	const char *set = diff_get_color_opt(o, DIFF_METAINFO);
  	const char *reset = diff_get_color_opt(o, DIFF_RESET);
 +	const char *a_prefix, *b_prefix;
 +
++<<<<<<< HEAD:diff.c
 +	diff_set_mnemonic_prefix(o, "a/", "b/");
 +	if (DIFF_OPT_TST(o, REVERSE_DIFF)) {
 +		a_prefix = o->b_prefix;
 +		b_prefix = o->a_prefix;
 +	} else {
 +		a_prefix = o->a_prefix;
 +		b_prefix = o->b_prefix;
 +	}
  
 +	a_one = quote_two(a_prefix, name_a + (*name_a == '/'));
 +	b_two = quote_two(b_prefix, name_b + (*name_b == '/'));
++=======
+ 	/* Never use a non-valid filename anywhere if at all possible */
+ 	name_a = DIFF_FILE_VALID(one) ? name_a : name_b;
+ 	name_b = DIFF_FILE_VALID(two) ? name_b : name_a;
+ 
+ 	a_one = quote_two(o->a_prefix, name_a + (*name_a == '/'));
+ 	b_two = quote_two(o->b_prefix, name_b + (*name_b == '/'));
++>>>>>>> e261cf94848d31868c21fb11cade51c30dfcdbe7:diff.c
  	lbl[0] = DIFF_FILE_VALID(one) ? a_one : "/dev/null";
  	lbl[1] = DIFF_FILE_VALID(two) ? b_two : "/dev/null";
  	fprintf(o->file, "%sdiff --git %s %s%s\n", set, a_one, b_two, reset);
-- snap --

So again, we have no clear winner.

Therefore I counted the lines between conflict markers (actually, a perl 
script did).  Of these 66 merges, on average patience merge produced 
4.46774193548387 _fewer_ lines between conflict markers.

Take that with a grain of salt, though: the standard deviation of this 
difference is a hefty 121.163046639509 lines.

The worst case for patience diff was the merge 
4698ef555a1706fe322a68a02a21fb1087940ac3, where the --cc diff line counts 
are 1300 (without) vs 1301 (with patience merge), but the lines between 
conflict markers are 197 vs a ridiculous 826 lines!

But you should take that also with a grain of salt: this merge is a 
_subtree_ merge, and my test redid it as a _non-subtree_ merge.

So I restricted the analysis to the non-subtree merges, and now 
non-patience merge comes out 6.97297297297297 conflict lines fewer than 
patience merge, with a standard deviation of 58.941106657867 (with a total 
count of 37 merges).

Note that ~7 lines difference with a standard deviation of ~59 lines is 
pretty close to ~0 lines difference.

In the end, the additional expense of patience merge might just not be 
worth it.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-07 22:00               ` Johannes Schindelin
@ 2009-01-07 22:45                 ` Pierre Habouzit
  2009-01-07 23:03                   ` Johannes Schindelin
  0 siblings, 1 reply; 67+ messages in thread
From: Pierre Habouzit @ 2009-01-07 22:45 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Junio C Hamano, Sam Vilain, Linus Torvalds, davidel,
	Francis Galiegue, Git ML

[-- Attachment #1: Type: text/plain, Size: 2168 bytes --]

On Wed, Jan 07, 2009 at 10:00:07PM +0000, Johannes Schindelin wrote:
> Therefore I counted the lines between conflict markers (actually, a perl 
> script did).  Of these 66 merges, on average patience merge produced 
> 4.46774193548387 _fewer_ lines between conflict markers.
> 
> Take that with a grain of salt, though: the standard deviation of this 
> difference is a hefty 121.163046639509 lines.
> 
> The worst case for patience diff was the merge 
> 4698ef555a1706fe322a68a02a21fb1087940ac3, where the --cc diff line counts 
> are 1300 (without) vs 1301 (with patience merge), but the lines between 
> conflict markers are 197 vs a ridiculous 826 lines!
> 
> But you should take that also with a grain of salt: this merge is a 
> _subtree_ merge, and my test redid it as a _non-subtree_ merge.
> 
> So I restricted the analysis to the non-subtree merges, and now 
> non-patience merge comes out 6.97297297297297 conflict lines fewer than 
> patience merge, with a standard deviation of 58.941106657867 (with a total 
> count of 37 merges).
> 
> Note that ~7 lines difference with a standard deviation of ~59 lines is 
> pretty close to ~0 lines difference.
> 
> In the end, the additional expense of patience merge might just not be 
> worth it.

Depends, if it can help generating nicer merges, it's good to have.

We could have an option to git-merge that tries hard to generate the
smallest conflict possible. _that_ would really really be worth it. I
mean, I've had really really tricky conflicts to work with where
git-merge genrated ridiculously big conflicts, and where I hard to
resort using UI tools to perform the merge (meld IIRC to name it), and
given how slow and crappy those tools are, I would gladly restart a
merge with a --generate-smallest-conflicts-as-possible if it can save me
from those merge tools.

YMMV though.

PS: I never thought the patience diff is a silver bullet, it's just yet
    another tool in the toolbox.
-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org

[-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-07 22:45                 ` Pierre Habouzit
@ 2009-01-07 23:03                   ` Johannes Schindelin
  0 siblings, 0 replies; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-07 23:03 UTC (permalink / raw)
  To: Pierre Habouzit
  Cc: Junio C Hamano, Sam Vilain, Linus Torvalds, davidel,
	Francis Galiegue, Git ML

Hi,

On Wed, 7 Jan 2009, Pierre Habouzit wrote:

> On Wed, Jan 07, 2009 at 10:00:07PM +0000, Johannes Schindelin wrote:
> > Therefore I counted the lines between conflict markers (actually, a perl 
> > script did).  Of these 66 merges, on average patience merge produced 
> > 4.46774193548387 _fewer_ lines between conflict markers.
> > 
> > Take that with a grain of salt, though: the standard deviation of this 
> > difference is a hefty 121.163046639509 lines.
> > 
> > The worst case for patience diff was the merge 
> > 4698ef555a1706fe322a68a02a21fb1087940ac3, where the --cc diff line counts 
> > are 1300 (without) vs 1301 (with patience merge), but the lines between 
> > conflict markers are 197 vs a ridiculous 826 lines!
> > 
> > But you should take that also with a grain of salt: this merge is a 
> > _subtree_ merge, and my test redid it as a _non-subtree_ merge.
> > 
> > So I restricted the analysis to the non-subtree merges, and now 
> > non-patience merge comes out 6.97297297297297 conflict lines fewer than 
> > patience merge, with a standard deviation of 58.941106657867 (with a total 
> > count of 37 merges).
> > 
> > Note that ~7 lines difference with a standard deviation of ~59 lines is 
> > pretty close to ~0 lines difference.
> > 
> > In the end, the additional expense of patience merge might just not be 
> > worth it.
> 
> Depends, if it can help generating nicer merges, it's good to have.
> 
> We could have an option to git-merge that tries hard to generate the
> smallest conflict possible.

I also showed you examples, in addition to numbers, exactly to point out 
that shorter conflicts do not necessarily mean nicer conflicts.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-07 20:25           ` Linus Torvalds
@ 2009-01-08  2:31             ` Sam Vilain
  0 siblings, 0 replies; 67+ messages in thread
From: Sam Vilain @ 2009-01-08  2:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Schindelin, Pierre Habouzit, davidel, Francis Galiegue,
	Git ML

Linus Torvalds wrote:
> On Thu, 8 Jan 2009, Sam Vilain wrote:
>   
>> Whatever happens, the current deterministic diff algorithm needs to stay
>> for generating patch-id's... those really can't be allowed to change.
>>     
>
> Sure they can.
>
> We never cache patch-id's over a long time. And we _have_ changed xdiff to 
> modify the output of the patches before, quite regardless of any patience 
> issues: see commit 9b28d55401a529ff08c709f42f66e765c93b0a20, which 
> admittedly doesn't affect any _normal_ diffs, but can generate subtly 
> different results for some cases.
>   

There's at least one person who thinks that they should be deterministic 
enough to be able to be placed in commit messages;

http://article.gmane.org/gmane.comp.version-control.git/95671

Now of course the git cherry-pick feature to add the old patch ID to the 
commit message isn't written yet; but unless there was a fall-back mode 
to produce a "stable" patch ID, these breadcrumbs would become (even 
more) worthless.

Sam

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
@ 2009-01-08 19:55                       ` Adeodato Simó
  2009-01-08 20:06                         ` Adeodato Simó
  2009-01-09  6:54                         ` Junio C Hamano
  0 siblings, 2 replies; 67+ messages in thread
From: Adeodato Simó @ 2009-01-08 19:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Clemens Buchacher, Johannes Schindelin, Pierre Habouzit, davidel,
	Francis Galiegue, Git ML

[-- Attachment #1: Type: text/plain, Size: 1062 bytes --]

* Linus Torvalds [Fri, 02 Jan 2009 08:42:04 -0800]:

> Yes, this one is a real patience diff change, but it's also the same one 
> that I've seen in the google fanboi findings. What google did _not_ show 
> was any real-life examples, or anybody doing any critical analysis.

This comes a bit late and maybe it's redundant, but somebody just sent
to a Debian mailing list a patch that was hard to read, and patience
improved it. (I realize it's quite similar in spirit to the "toy
patience example" that google returns, but this at list is a *real*
example where patience helped me read a patch.)

I'm also attaching bzr diff output, because it's still more readable
IMHO. (I realize that's independent of patience, as you explained, but
I'm making a point that it'd be nice to have this addressed by somebody
knowledgeable.)

Thanks,

-- 
Adeodato Simó                                     dato at net.com.org.es
Debian Developer                                  adeodato at debian.org
 
- Are you sure we're good?
- Always.
                -- Rory and Lorelai

[-- Attachment #2: git.diff --]
[-- Type: text/x-diff, Size: 3615 bytes --]

diff --git util_sock.c util_sock.c
index e20768e..f7b9145 100644
--- util_sock.c
+++ util_sock.c
@@ -1037,40 +1037,109 @@ NTSTATUS read_data(int fd, char *buffer, size_t N)
 }
 
 /****************************************************************************
- Write data to a fd.
+ Write all data from an iov array
 ****************************************************************************/
 
-ssize_t write_data(int fd, const char *buffer, size_t N)
+ssize_t write_data_iov(int fd, const struct iovec *orig_iov, int iovcnt)
 {
-	size_t total=0;
-	ssize_t ret;
-	char addr[INET6_ADDRSTRLEN];
+	int i;
+	size_t to_send;
+	ssize_t thistime;
+	size_t sent;
+	struct iovec *iov_copy, *iov;
 
-	while (total < N) {
-		ret = sys_write(fd,buffer + total,N - total);
+	to_send = 0;
+	for (i=0; i<iovcnt; i++) {
+		to_send += orig_iov[i].iov_len;
+	}
 
-		if (ret == -1) {
-			if (fd == get_client_fd()) {
-				/* Try and give an error message saying
-				 * what client failed. */
-				DEBUG(0,("write_data: write failure in "
-					"writing to client %s. Error %s\n",
-					get_peer_addr(fd,addr,sizeof(addr)),
-					strerror(errno) ));
-			} else {
-				DEBUG(0,("write_data: write failure. "
-					"Error = %s\n", strerror(errno) ));
+	thistime = sys_writev(fd, orig_iov, iovcnt);
+	if ((thistime <= 0) || (thistime == to_send)) {
+		return thistime;
+	}
+	sent = thistime;
+
+	/*
+	 * We could not send everything in one call. Make a copy of iov that
+	 * we can mess with. We keep a copy of the array start in iov_copy for
+	 * the TALLOC_FREE, because we're going to modify iov later on,
+	 * discarding elements.
+	 */
+
+	iov_copy = (struct iovec *)TALLOC_MEMDUP(
+		talloc_tos(), orig_iov, sizeof(struct iovec) * iovcnt);
+
+	if (iov_copy == NULL) {
+		errno = ENOMEM;
+		return -1;
+	}
+	iov = iov_copy;
+
+	while (sent < to_send) {
+		/*
+		 * We have to discard "thistime" bytes from the beginning
+		 * iov array, "thistime" contains the number of bytes sent
+		 * via writev last.
+		 */
+		while (thistime > 0) {
+			if (thistime < iov[0].iov_len) {
+				char *new_base =
+					(char *)iov[0].iov_base + thistime;
+				iov[0].iov_base = new_base;
+				iov[0].iov_len -= thistime;
+				break;
 			}
-			return -1;
+			thistime -= iov[0].iov_len;
+			iov += 1;
+			iovcnt -= 1;
 		}
 
-		if (ret == 0) {
-			return total;
+		thistime = sys_writev(fd, iov, iovcnt);
+		if (thistime <= 0) {
+			break;
 		}
+		sent += thistime;
+	}
+
+	TALLOC_FREE(iov_copy);
+	return sent;
+}
+
+/****************************************************************************
+ Write data to a fd.
+****************************************************************************/
+
+/****************************************************************************
+ Write data to a fd.
+****************************************************************************/
+
+ssize_t write_data(int fd, const char *buffer, size_t N)
+{
+	ssize_t ret;
+	struct iovec iov;
+
+	iov.iov_base = CONST_DISCARD(char *, buffer);
+	iov.iov_len = N;
+
+	ret = write_data_iov(fd, &iov, 1);
+	if (ret >= 0) {
+		return ret;
+	}
 
-		total += ret;
+	if (fd == get_client_fd()) {
+		char addr[INET6_ADDRSTRLEN];
+		/*
+		 * Try and give an error message saying what client failed.
+		 */
+		DEBUG(0, ("write_data: write failure in writing to client %s. "
+			  "Error %s\n", get_peer_addr(fd,addr,sizeof(addr)),
+			  strerror(errno)));
+	} else {
+		DEBUG(0,("write_data: write failure. Error = %s\n",
+			 strerror(errno) ));
 	}
-	return (ssize_t)total;
+
+	return -1;
 }
 
 /****************************************************************************

[-- Attachment #3: git_patience.diff --]
[-- Type: text/x-diff, Size: 3538 bytes --]

diff --git util_sock.c util_sock.c
index e20768e..f7b9145 100644
--- util_sock.c
+++ util_sock.c
@@ -1037,40 +1037,109 @@ NTSTATUS read_data(int fd, char *buffer, size_t N)
 }
 
 /****************************************************************************
+ Write all data from an iov array
+****************************************************************************/
+
+ssize_t write_data_iov(int fd, const struct iovec *orig_iov, int iovcnt)
+{
+	int i;
+	size_t to_send;
+	ssize_t thistime;
+	size_t sent;
+	struct iovec *iov_copy, *iov;
+
+	to_send = 0;
+	for (i=0; i<iovcnt; i++) {
+		to_send += orig_iov[i].iov_len;
+	}
+
+	thistime = sys_writev(fd, orig_iov, iovcnt);
+	if ((thistime <= 0) || (thistime == to_send)) {
+		return thistime;
+	}
+	sent = thistime;
+
+	/*
+	 * We could not send everything in one call. Make a copy of iov that
+	 * we can mess with. We keep a copy of the array start in iov_copy for
+	 * the TALLOC_FREE, because we're going to modify iov later on,
+	 * discarding elements.
+	 */
+
+	iov_copy = (struct iovec *)TALLOC_MEMDUP(
+		talloc_tos(), orig_iov, sizeof(struct iovec) * iovcnt);
+
+	if (iov_copy == NULL) {
+		errno = ENOMEM;
+		return -1;
+	}
+	iov = iov_copy;
+
+	while (sent < to_send) {
+		/*
+		 * We have to discard "thistime" bytes from the beginning
+		 * iov array, "thistime" contains the number of bytes sent
+		 * via writev last.
+		 */
+		while (thistime > 0) {
+			if (thistime < iov[0].iov_len) {
+				char *new_base =
+					(char *)iov[0].iov_base + thistime;
+				iov[0].iov_base = new_base;
+				iov[0].iov_len -= thistime;
+				break;
+			}
+			thistime -= iov[0].iov_len;
+			iov += 1;
+			iovcnt -= 1;
+		}
+
+		thistime = sys_writev(fd, iov, iovcnt);
+		if (thistime <= 0) {
+			break;
+		}
+		sent += thistime;
+	}
+
+	TALLOC_FREE(iov_copy);
+	return sent;
+}
+
+/****************************************************************************
+ Write data to a fd.
+****************************************************************************/
+
+/****************************************************************************
  Write data to a fd.
 ****************************************************************************/
 
 ssize_t write_data(int fd, const char *buffer, size_t N)
 {
-	size_t total=0;
 	ssize_t ret;
-	char addr[INET6_ADDRSTRLEN];
+	struct iovec iov;
 
-	while (total < N) {
-		ret = sys_write(fd,buffer + total,N - total);
+	iov.iov_base = CONST_DISCARD(char *, buffer);
+	iov.iov_len = N;
 
-		if (ret == -1) {
-			if (fd == get_client_fd()) {
-				/* Try and give an error message saying
-				 * what client failed. */
-				DEBUG(0,("write_data: write failure in "
-					"writing to client %s. Error %s\n",
-					get_peer_addr(fd,addr,sizeof(addr)),
-					strerror(errno) ));
-			} else {
-				DEBUG(0,("write_data: write failure. "
-					"Error = %s\n", strerror(errno) ));
-			}
-			return -1;
-		}
-
-		if (ret == 0) {
-			return total;
-		}
+	ret = write_data_iov(fd, &iov, 1);
+	if (ret >= 0) {
+		return ret;
+	}
 
-		total += ret;
+	if (fd == get_client_fd()) {
+		char addr[INET6_ADDRSTRLEN];
+		/*
+		 * Try and give an error message saying what client failed.
+		 */
+		DEBUG(0, ("write_data: write failure in writing to client %s. "
+			  "Error %s\n", get_peer_addr(fd,addr,sizeof(addr)),
+			  strerror(errno)));
+	} else {
+		DEBUG(0,("write_data: write failure. Error = %s\n",
+			 strerror(errno) ));
 	}
-	return (ssize_t)total;
+
+	return -1;
 }
 
 /****************************************************************************

[-- Attachment #4: bzr.diff --]
[-- Type: text/x-diff, Size: 3432 bytes --]

--- util_sock.c
+++ util_sock.c
@@ -1037,40 +1037,109 @@
 }
 
 /****************************************************************************
+ Write all data from an iov array
+****************************************************************************/
+
+ssize_t write_data_iov(int fd, const struct iovec *orig_iov, int iovcnt)
+{
+	int i;
+	size_t to_send;
+	ssize_t thistime;
+	size_t sent;
+	struct iovec *iov_copy, *iov;
+
+	to_send = 0;
+	for (i=0; i<iovcnt; i++) {
+		to_send += orig_iov[i].iov_len;
+	}
+
+	thistime = sys_writev(fd, orig_iov, iovcnt);
+	if ((thistime <= 0) || (thistime == to_send)) {
+		return thistime;
+	}
+	sent = thistime;
+
+	/*
+	 * We could not send everything in one call. Make a copy of iov that
+	 * we can mess with. We keep a copy of the array start in iov_copy for
+	 * the TALLOC_FREE, because we're going to modify iov later on,
+	 * discarding elements.
+	 */
+
+	iov_copy = (struct iovec *)TALLOC_MEMDUP(
+		talloc_tos(), orig_iov, sizeof(struct iovec) * iovcnt);
+
+	if (iov_copy == NULL) {
+		errno = ENOMEM;
+		return -1;
+	}
+	iov = iov_copy;
+
+	while (sent < to_send) {
+		/*
+		 * We have to discard "thistime" bytes from the beginning
+		 * iov array, "thistime" contains the number of bytes sent
+		 * via writev last.
+		 */
+		while (thistime > 0) {
+			if (thistime < iov[0].iov_len) {
+				char *new_base =
+					(char *)iov[0].iov_base + thistime;
+				iov[0].iov_base = new_base;
+				iov[0].iov_len -= thistime;
+				break;
+			}
+			thistime -= iov[0].iov_len;
+			iov += 1;
+			iovcnt -= 1;
+		}
+
+		thistime = sys_writev(fd, iov, iovcnt);
+		if (thistime <= 0) {
+			break;
+		}
+		sent += thistime;
+	}
+
+	TALLOC_FREE(iov_copy);
+	return sent;
+}
+
+/****************************************************************************
+ Write data to a fd.
+****************************************************************************/
+
+/****************************************************************************
  Write data to a fd.
 ****************************************************************************/
 
 ssize_t write_data(int fd, const char *buffer, size_t N)
 {
-	size_t total=0;
 	ssize_t ret;
-	char addr[INET6_ADDRSTRLEN];
-
-	while (total < N) {
-		ret = sys_write(fd,buffer + total,N - total);
-
-		if (ret == -1) {
-			if (fd == get_client_fd()) {
-				/* Try and give an error message saying
-				 * what client failed. */
-				DEBUG(0,("write_data: write failure in "
-					"writing to client %s. Error %s\n",
-					get_peer_addr(fd,addr,sizeof(addr)),
-					strerror(errno) ));
-			} else {
-				DEBUG(0,("write_data: write failure. "
-					"Error = %s\n", strerror(errno) ));
-			}
-			return -1;
-		}
-
-		if (ret == 0) {
-			return total;
-		}
-
-		total += ret;
-	}
-	return (ssize_t)total;
+	struct iovec iov;
+
+	iov.iov_base = CONST_DISCARD(char *, buffer);
+	iov.iov_len = N;
+
+	ret = write_data_iov(fd, &iov, 1);
+	if (ret >= 0) {
+		return ret;
+	}
+
+	if (fd == get_client_fd()) {
+		char addr[INET6_ADDRSTRLEN];
+		/*
+		 * Try and give an error message saying what client failed.
+		 */
+		DEBUG(0, ("write_data: write failure in writing to client %s. "
+			  "Error %s\n", get_peer_addr(fd,addr,sizeof(addr)),
+			  strerror(errno)));
+	} else {
+		DEBUG(0,("write_data: write failure. Error = %s\n",
+			 strerror(errno) ));
+	}
+
+	return -1;
 }
 
 /****************************************************************************

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-08 19:55                       ` Adeodato Simó
@ 2009-01-08 20:06                         ` Adeodato Simó
  2009-01-09  6:54                         ` Junio C Hamano
  1 sibling, 0 replies; 67+ messages in thread
From: Adeodato Simó @ 2009-01-08 20:06 UTC (permalink / raw)
  To: git

(My apologies for breaking the thread.)

-- 
Adeodato Simó                                     dato at net.com.org.es
Debian Developer                                  adeodato at debian.org
 
«¡Pero si es tan español que debe de tener el cerebro en forma de botijo,
con pitorro y todo!»
                -- Javier Cercas, “La velocidad de la luz”

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-08 19:55                       ` Adeodato Simó
  2009-01-08 20:06                         ` Adeodato Simó
@ 2009-01-09  6:54                         ` Junio C Hamano
  2009-01-09 13:07                           ` Johannes Schindelin
  1 sibling, 1 reply; 67+ messages in thread
From: Junio C Hamano @ 2009-01-09  6:54 UTC (permalink / raw)
  To: Adeodato Simó
  Cc: Linus Torvalds, Clemens Buchacher, Johannes Schindelin,
	Pierre Habouzit, davidel, Francis Galiegue, Git ML

Adeodato Simó <dato@net.com.org.es> writes:

> * Linus Torvalds [Fri, 02 Jan 2009 08:42:04 -0800]:
>
>> Yes, this one is a real patience diff change, but it's also the same one 
>> that I've seen in the google fanboi findings. What google did _not_ show 
>> was any real-life examples, or anybody doing any critical analysis.
>
> This comes a bit late and maybe it's redundant, but somebody just sent
> to a Debian mailing list a patch that was hard to read, and patience
> improved it. (I realize it's quite similar in spirit to the "toy
> patience example" that google returns, but this at list is a *real*
> example where patience helped me read a patch.)
>
> I'm also attaching bzr diff output, because it's still more readable
> IMHO. (I realize that's independent of patience, as you explained, but
> I'm making a point that it'd be nice to have this addressed by somebody
> knowledgeable.)

I found thd difference between the Bzr diff and Dscho diff somewhat
interesting.  It shows that sometimes it makes the results easier to read
to consider blank lines (and lines with only punctuation marks) that match
to be non-matching when they appear inside a block of other consecutive
non-matching lines, even though the result may become larger.

The part Bzr gives this result:

> +/****************************************************************************
>   Write data to a fd.
>  ****************************************************************************/
>  
>  ssize_t write_data(int fd, const char *buffer, size_t N)
>  {
> -	size_t total=0;
>  	ssize_t ret;
> -	char addr[INET6_ADDRSTRLEN];
> ... all "removed"
> -	while (total < N) {
> -		total += ret;
> -	}
> -	return (ssize_t)total;
> +	struct iovec iov;
> +
> +	iov.iov_base = CONST_DISCARD(char *, buffer);
> ... all "added"
> +
> +
> +	return -1;
>  }
>  
>  /****************************************************************************

is shown by the Dscho git-diff like this:

>   Write data to a fd.
>  ****************************************************************************/
>  
>  ssize_t write_data(int fd, const char *buffer, size_t N)
>  {
> -	size_t total=0;
>  	ssize_t ret;
> -	char addr[INET6_ADDRSTRLEN];
> +	struct iovec iov;
>  
> -	while (total < N) {
> -		ret = sys_write(fd,buffer + total,N - total);
> +	iov.iov_base = CONST_DISCARD(char *, buffer);
> +	iov.iov_len = N;
>  
> -		if (ret == -1) {
> -			if (fd == get_client_fd()) {
> ... all "removed"
> -
> -		if (ret == 0) {
> -			return total;
> -		}
> +	ret = write_data_iov(fd, &iov, 1);
> +	if (ret >= 0) {
> +		return ret;
> +	}
>  
> -		total += ret;
> +	if (fd == get_client_fd()) {
> +		char addr[INET6_ADDRSTRLEN];
> ... all "added"
> +		DEBUG(0,("write_data: write failure. Error = %s\n",
> +			 strerror(errno) ));
>  	}
> -	return (ssize_t)total;
> +
> +	return -1;
>  }

If we find the "common" context lines that have only blank and punctuation
letters in Dscho output, turn each of them into "-" and "+", and rearrange
them so that all "-" are together followed by "+", it will match Bzr
output.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-09  6:54                         ` Junio C Hamano
@ 2009-01-09 13:07                           ` Johannes Schindelin
  2009-01-09 15:59                             ` Adeodato Simó
                                               ` (2 more replies)
  0 siblings, 3 replies; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-09 13:07 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Adeodato Simó, Linus Torvalds, Clemens Buchacher,
	Pierre Habouzit, davidel, Francis Galiegue, Git ML

Hi,

On Thu, 8 Jan 2009, Junio C Hamano wrote:

> If we find the "common" context lines that have only blank and 
> punctuation letters in Dscho output, turn each of them into "-" and "+", 
> and rearrange them so that all "-" are together followed by "+", it will 
> match Bzr output.

So we'd need something like this (I still think we should treat curly 
brackets the same as punctuation, and for good measure I just handled 
everything that is not alphanumerical the same):

-- snipsnap --
[TOY PATCH] Add diff option '--collapse-non-alnums'

With the option --collapse-non-alnums, there will be no interhunks
consisting solely of non-alphanumerical letters.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 diff.c         |    2 ++
 xdiff/xdiff.h  |    1 +
 xdiff/xdiffi.c |   48 +++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 50 insertions(+), 1 deletions(-)

diff --git a/diff.c b/diff.c
index c7ddb60..4b387fb 100644
--- a/diff.c
+++ b/diff.c
@@ -2503,6 +2503,8 @@ int diff_opt_parse(struct diff_options *options, const char **av, int ac)
 		options->xdl_opts |= XDF_IGNORE_WHITESPACE_AT_EOL;
 	else if (!strcmp(arg, "--patience"))
 		options->xdl_opts |= XDF_PATIENCE_DIFF;
+	else if (!strcmp(arg, "--collapse-non-alnums"))
+		options->xdl_opts |= XDF_COLLAPSE_NON_ALNUMS;
 
 	/* flags options */
 	else if (!strcmp(arg, "--binary")) {
diff --git a/xdiff/xdiff.h b/xdiff/xdiff.h
index 4da052a..a444f9a 100644
--- a/xdiff/xdiff.h
+++ b/xdiff/xdiff.h
@@ -33,6 +33,7 @@ extern "C" {
 #define XDF_IGNORE_WHITESPACE_CHANGE (1 << 3)
 #define XDF_IGNORE_WHITESPACE_AT_EOL (1 << 4)
 #define XDF_PATIENCE_DIFF (1 << 5)
+#define XDF_COLLAPSE_NON_ALNUMS (1 << 6)
 #define XDF_WHITESPACE_FLAGS (XDF_IGNORE_WHITESPACE | XDF_IGNORE_WHITESPACE_CHANGE | XDF_IGNORE_WHITESPACE_AT_EOL)
 
 #define XDL_PATCH_NORMAL '-'
diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 3e97462..b8e7ee8 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -396,6 +396,50 @@ static xdchange_t *xdl_add_change(xdchange_t *xscr, long i1, long i2, long chg1,
 	return xch;
 }
 
+static int xdl_record_contains_alnum(xrecord_t *record)
+{
+	long i;
+	for (i = 0; i < record->size; i++)
+		if (isalnum(record->ptr[i]))
+			return 1;
+	return 0;
+}
+
+static int xdl_collapse_non_alnum(xdfile_t *xdf, xdfile_t *xdfo)
+{
+	long ix, ixo, len = 0;
+
+	/*
+	 * Collapse all interhunk parts consisting solely of non-alnum
+	 * characters into the hunks.
+	 */
+	for (ix = 0, ixo = 0; ix < xdf->nrec && ixo < xdfo->nrec; ix++, ixo++) {
+		if (xdf->rchg[ix] == 1 || xdfo->rchg[ixo] == 1) {
+			/* collapse non-alnum interhunks */
+			while (len > 0) {
+				xdf->rchg[ix - len] = 1;
+				xdfo->rchg[ixo - len] = 1;
+				len--;
+			}
+
+			/* look for end of hunk */
+			while (ix < xdf->nrec && xdf->rchg[ix] == 1)
+				ix++;
+			while (ixo < xdfo->nrec && xdfo->rchg[ixo] == 1)
+				ixo++;
+			if (ix >= xdf->nrec)
+				return 0;
+			len = !xdl_record_contains_alnum(xdf->recs[ix]);
+		}
+		else if (len > 0) {
+			if (xdl_record_contains_alnum(xdf->recs[ix]))
+				len = 0;
+			else
+				len++;
+		}
+	}
+	return 0;
+}
 
 int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 	long ix, ixo, ixs, ixref, grpsiz, nrec = xdf->nrec;
@@ -548,7 +592,9 @@ int xdl_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
 
 		return -1;
 	}
-	if (xdl_change_compact(&xe.xdf1, &xe.xdf2, xpp->flags) < 0 ||
+	if (((xpp->flags & XDF_COLLAPSE_NON_ALNUMS) &&
+	     xdl_collapse_non_alnum(&xe.xdf1, &xe.xdf2)) ||
+	    xdl_change_compact(&xe.xdf1, &xe.xdf2, xpp->flags) < 0 ||
 	    xdl_change_compact(&xe.xdf2, &xe.xdf1, xpp->flags) < 0 ||
 	    xdl_build_script(&xe, &xscr) < 0) {
 
-- 
1.6.1.203.gc8be3

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-09 13:07                           ` Johannes Schindelin
@ 2009-01-09 15:59                             ` Adeodato Simó
  2009-01-09 18:09                             ` Linus Torvalds
  2009-01-09 20:53                             ` Junio C Hamano
  2 siblings, 0 replies; 67+ messages in thread
From: Adeodato Simó @ 2009-01-09 15:59 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Junio C Hamano, Linus Torvalds, Clemens Buchacher,
	Pierre Habouzit, davidel, Francis Galiegue, Git ML

* Johannes Schindelin [Fri, 09 Jan 2009 14:07:28 +0100]:

> > If we find the "common" context lines that have only blank and 
> > punctuation letters in Dscho output, turn each of them into "-" and "+", 
> > and rearrange them so that all "-" are together followed by "+", it will 
> > match Bzr output.

> So we'd need something like this (I still think we should treat curly 
> brackets the same as punctuation, and for good measure I just handled 
> everything that is not alphanumerical the same):

Nice. With this patch of yours, --patience --collapse-non-alnums
produces the same output as bzr for this last test case (the util_sock.c
one). However, also for this last case, without --patience, diff
--collapse-non-alnums finds *no* common lines at all. Mentioning in case
you'd be interested in knowing.

Cheers,

-- 
Adeodato Simó                                     dato at net.com.org.es
Debian Developer                                  adeodato at debian.org
 
- You look beaten.
- I just caught Tara laughing with another man.
- Are you sure they weren't just... kissing or something?
- No, they were laughing.
                -- Denny Crane and Alan Shore

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-09 13:07                           ` Johannes Schindelin
  2009-01-09 15:59                             ` Adeodato Simó
@ 2009-01-09 18:09                             ` Linus Torvalds
  2009-01-09 18:13                               ` Linus Torvalds
  2009-01-09 20:53                             ` Junio C Hamano
  2 siblings, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2009-01-09 18:09 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Junio C Hamano, Adeodato Simó, Clemens Buchacher,
	Pierre Habouzit, davidel, Francis Galiegue, Git ML



On Fri, 9 Jan 2009, Johannes Schindelin wrote:
> 
> -- snipsnap --
> [TOY PATCH] Add diff option '--collapse-non-alnums'

I really don't think it should be about "alnum".

Think about languages that use "begin" and "end" instead of "{" "}".

I think we'd be better off just looking at the size, but _this_ is a 
really good area where "uniqueness" matters.

Don't combine unique lines, but lines that have the same hash as a 
thousand other lines? Go right ahead.

		Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-09 18:09                             ` Linus Torvalds
@ 2009-01-09 18:13                               ` Linus Torvalds
  0 siblings, 0 replies; 67+ messages in thread
From: Linus Torvalds @ 2009-01-09 18:13 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Junio C Hamano, Adeodato Simó, Clemens Buchacher,
	Pierre Habouzit, davidel, Francis Galiegue, Git ML



On Fri, 9 Jan 2009, Linus Torvalds wrote:
>
> Don't combine unique lines, but lines that have the same hash as a 
> thousand other lines? Go right ahead.

Oh, and even then, don't ever combine fragments unless it's a single line, 
and unless you can really stop with just one or two combines. Because we 
don't want to combine 100 one-liner changes (on every other line) into one 
300-line change.

		Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-09 13:07                           ` Johannes Schindelin
  2009-01-09 15:59                             ` Adeodato Simó
  2009-01-09 18:09                             ` Linus Torvalds
@ 2009-01-09 20:53                             ` Junio C Hamano
  2009-01-10 11:36                               ` Johannes Schindelin
  2 siblings, 1 reply; 67+ messages in thread
From: Junio C Hamano @ 2009-01-09 20:53 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Adeodato Simó, Linus Torvalds, Clemens Buchacher,
	Pierre Habouzit, davidel, Francis Galiegue, Git ML

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> On Thu, 8 Jan 2009, Junio C Hamano wrote:
>
>> If we find the "common" context lines that have only blank and 
>> punctuation letters in Dscho output, turn each of them into "-" and "+", 
>> and rearrange them so that all "-" are together followed by "+", it will 
>> match Bzr output.
>
> So we'd need something like this (I still think we should treat curly 
> brackets the same as punctuation, and for good measure I just handled 
> everything that is not alphanumerical the same):

I meant by punctuation to include curlies (my wording may have been wrong
but from the example with " }" line it should have been obvious).

But I agree with both points Linus raised.  The criteria to pick what to
pretend unmatching should be "small insignificant lines" (small goes for
both size and also number of consecutive "insignificant" lines), and the
coallescing should be done to join a block of consecutive changed lines of
a significant size (so you do not join two 1 or 2-line "changed line"
blocks by pretending that a 1-line unchanged insignificant line in between
them is unmatching).

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-09 20:53                             ` Junio C Hamano
@ 2009-01-10 11:36                               ` Johannes Schindelin
  0 siblings, 0 replies; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-10 11:36 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Adeodato Simó, Linus Torvalds, Clemens Buchacher,
	Pierre Habouzit, davidel, Francis Galiegue, Git ML

Hi,

On Fri, 9 Jan 2009, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> 
> > On Thu, 8 Jan 2009, Junio C Hamano wrote:
> >
> >> If we find the "common" context lines that have only blank and 
> >> punctuation letters in Dscho output, turn each of them into "-" and "+", 
> >> and rearrange them so that all "-" are together followed by "+", it will 
> >> match Bzr output.
> >
> > So we'd need something like this (I still think we should treat curly 
> > brackets the same as punctuation, and for good measure I just handled 
> > everything that is not alphanumerical the same):
> 
> I meant by punctuation to include curlies (my wording may have been wrong
> but from the example with " }" line it should have been obvious).
> 
> But I agree with both points Linus raised.  The criteria to pick what to
> pretend unmatching should be "small insignificant lines" (small goes for
> both size and also number of consecutive "insignificant" lines), and the
> coallescing should be done to join a block of consecutive changed lines of
> a significant size (so you do not join two 1 or 2-line "changed line"
> blocks by pretending that a 1-line unchanged insignificant line in between
> them is unmatching).

Of course, the number should be configurable as with the inter-hunk 
context.

However, I'll not have much time to work on this feature, and it would 
definitely need some experimenting to find the best parameters, e.g. 
maximal number of inter-hunk lines, maybe even an inter-hunk/hunk ratio, 
alnums, or as Linus suggests the length of the line (of which I am still 
not convinced).

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 67+ messages in thread

end of thread, other threads:[~2009-01-10 11:37 UTC | newest]

Thread overview: 67+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-11-04  0:40 libxdiff and patience diff Pierre Habouzit
2008-11-04  3:17 ` Davide Libenzi
2008-11-04  8:33   ` Pierre Habouzit
2008-11-04  5:39 ` Johannes Schindelin
2008-11-04  8:30   ` Pierre Habouzit
2008-11-04 14:34     ` Johannes Schindelin
2008-11-04 15:23       ` Pierre Habouzit
2008-11-04 15:57         ` Johannes Schindelin
2008-11-04 16:15           ` Pierre Habouzit
2009-01-01 16:38         ` [PATCH 0/3] Teach Git about the patience diff algorithm Johannes Schindelin
2009-01-01 16:38           ` [PATCH 1/3] Implement " Johannes Schindelin
2009-01-01 16:39           ` [PATCH 2/3] Introduce the diff option '--patience' Johannes Schindelin
2009-01-01 16:39           ` [PATCH 3/3] bash completions: Add the --patience option Johannes Schindelin
2009-01-01 19:45           ` [PATCH 0/3] Teach Git about the patience diff algorithm Linus Torvalds
2009-01-01 20:00             ` Linus Torvalds
2009-01-02 18:17               ` Johannes Schindelin
2009-01-02 18:49                 ` Linus Torvalds
2009-01-02 19:07                   ` Johannes Schindelin
2009-01-02 18:51                 ` Jeff King
2009-01-02 21:59               ` [PATCH 1/3 v2] Implement " Johannes Schindelin
2009-01-02 21:59                 ` Johannes Schindelin
2009-01-01 20:46             ` [PATCH 0/3] Teach Git about " Adeodato Simó
2009-01-02  1:56               ` Linus Torvalds
2009-01-02 10:55                 ` Clemens Buchacher
2009-01-02 10:58                   ` Clemens Buchacher
2009-01-02 16:42                     ` Linus Torvalds
2009-01-02 18:46                       ` Johannes Schindelin
2009-01-02 19:03                         ` Linus Torvalds
2009-01-02 19:22                           ` Johannes Schindelin
2009-01-02 19:39                           ` Jeff King
2009-01-02 19:50                             ` Jeff King
2009-01-02 20:52                               ` Jeff King
2009-01-02 23:05                                 ` Linus Torvalds
2009-01-03 16:24                             ` Bazaar's patience diff as GIT_EXTERNAL_DIFF Adeodato Simó
2009-01-02 21:59                       ` [PATCH 0/3] Teach Git about the patience diff algorithm Johannes Schindelin
2009-01-08 19:55                       ` Adeodato Simó
2009-01-08 20:06                         ` Adeodato Simó
2009-01-09  6:54                         ` Junio C Hamano
2009-01-09 13:07                           ` Johannes Schindelin
2009-01-09 15:59                             ` Adeodato Simó
2009-01-09 18:09                             ` Linus Torvalds
2009-01-09 18:13                               ` Linus Torvalds
2009-01-09 20:53                             ` Junio C Hamano
2009-01-10 11:36                               ` Johannes Schindelin
2009-01-02 11:03                   ` Junio C Hamano
2009-01-02 18:50                 ` Adeodato Simó
2009-01-06 11:17     ` Pierre Habouzit
2009-01-06 11:39       ` Pierre Habouzit
2009-01-06 19:40       ` Johannes Schindelin
2009-01-07 14:39         ` Pierre Habouzit
2009-01-07 17:01           ` Johannes Schindelin
2009-01-07 17:04             ` [PATCH v3 1/3] Implement " Johannes Schindelin
2009-01-07 18:10               ` Davide Libenzi
2009-01-07 18:32                 ` Johannes Schindelin
2009-01-07 20:09                   ` Davide Libenzi
2009-01-07 20:19                     ` Johannes Schindelin
2009-01-07 18:59                 ` Linus Torvalds
2009-01-07 20:00                   ` Johannes Schindelin
2009-01-07 20:11                   ` Davide Libenzi
2009-01-07 20:15         ` [PATCH 0/3] Teach Git about " Sam Vilain
2009-01-07 20:25           ` Linus Torvalds
2009-01-08  2:31             ` Sam Vilain
2009-01-07 20:38           ` Johannes Schindelin
2009-01-07 20:48             ` Junio C Hamano
2009-01-07 22:00               ` Johannes Schindelin
2009-01-07 22:45                 ` Pierre Habouzit
2009-01-07 23:03                   ` Johannes Schindelin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).