* libxdiff and patience diff
@ 2008-11-04  0:40 Pierre Habouzit
  2008-11-04  3:17 ` Davide Libenzi
  2008-11-04  5:39 ` Johannes Schindelin
  0 siblings, 2 replies; 67+ messages in thread
From: Pierre Habouzit @ 2008-11-04  0:40 UTC (permalink / raw)
To: davidel; +Cc: Git ML

[-- Attachment #1: Type: text/plain, Size: 2679 bytes --]

Hi Davide,

I've been working tonight, trying to make libxdiff support the patience
diff algorithm, but I've totally failed, because I _thought_ I understood
what xdl_split was doing, but it appears I don't.

[ For the readers playing at home, the patience diff algorithm is
  explained after my sig. ]

What I did is to:
  (1) add a flag to the xenv in xdl_split that says that I want a
      patience "split".
  (2) Then in xdl_split, if that bit is set, I compute the longest common
      subsequence from the patience diff.
  (3) For each split it computes, I call xdl_recs_cmp on that interval.

What I thought this would achieve is forcing some boundaries at which
libxdiff _must_ resync. But for some reason it doesn't work, probably
because the "ha" stuff and the boundaries don't work the way I thought
they did.

So where is the place I should do that? I suspect it should be partly in
xprepare.c, but I'm a bit stuck right now.

Any pointer on how the stuff in xprepare.c and xdiffi.c works would help
greatly; it's really not self-evident to me :)

--
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org

Patience diff
=============

Basically, it's an algorithm that considers lines from the left file A
and the right file B. It searches for lines that are unique *IN* A and
unique *IN* B as well. Then, using a clever O(n log log n) algorithm
(where n is the number of such lines), you extract the longest common
subsequence of those lines. This defines corresponding splits of files
A and B.
For each pair of corresponding chunks, you then trim the lines matching
at the beginning and the end, and reiterate the algorithm on the interior
of that space, until there are _no_ unique lines left at all; then you
apply a usual diff algorithm to generate the diff in that final section.

http://alfedenzo.livejournal.com/170301.html has a nice visual
explanation of the algorithm, even if it omits the trimming of matching
"zones" that helps efficiency. It is also usually wrong to generate the
stacks in the order shown on the blog, because that makes you recurse
from the max line to the min line, which is not the natural order to
work on a diff. But that's just a matter of "inverting" all the
comparisons, which is pretty obvious to do.

The difference in output can be seen at http://glandium.org/blog/?p=120
where patience diff picks the line "include $(topsrcdir)/config/rules.mk"
as being unique in the left and the right file, and hence uses it as a
sync point rather than including it in a diff hunk.

[-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread
* Re: libxdiff and patience diff
  2008-11-04  0:40 libxdiff and patience diff Pierre Habouzit
@ 2008-11-04  3:17 ` Davide Libenzi
  2008-11-04  8:33   ` Pierre Habouzit
  2008-11-04  5:39 ` Johannes Schindelin
  1 sibling, 1 reply; 67+ messages in thread
From: Davide Libenzi @ 2008-11-04  3:17 UTC (permalink / raw)
To: Pierre Habouzit; +Cc: Git ML

On Tue, 4 Nov 2008, Pierre Habouzit wrote:

> Hi Davide,
>
> I've been working tonight, trying to make libxdiff support the patience
> diff algorithm, but I've totally failed, because I _thought_ I
> understood what xdl_split was doing, but it appears I don't.
>
> [ For the readers playing at home, the patience diff algorithm is
>   explained after my sig. ]
>
> What I did is to:
>   (1) add a flag to the xenv in xdl_split that says that I want a
>       patience "split".
>   (2) Then in xdl_split, if that bit is set, I compute the longest common
>       subsequence from the patience diff.
>   (3) for each split it computes I call xdl_recs_cmp on that interval.
>
> What I thought it would achieve is that I force some boundaries at which
> libxdiff _must_ resync. Though, it seems that for some reason it doesn't
> work, probably because the "ha" stuff and the boundaries absolutely
> don't work the way I thought it did.
>
> So where is the place I should do that ? I suspect it should be
> partly in xprepare.c but I'm a bit stuck right now.
>
> Any pointer on how the stuff in xprepare.c and xdiffi.c work would help
> greatly, it's really not self-evident to me :)

What makes you think it'd be self-evident to me? :)
Seriously, I forget stuff I wrote last month; this is way beyond my
memory limits.

You definitely need to look at xprepare.c, especially at xdl_trim_ends()
and xdl_cleanup_records(). Lines are re-arranged in indexes, and this
might screw up your logic if you're not prepared for it.

What xdl_split() does is find the start of an LCS and return the
coordinate. Then xdl_recs_cmp() does the box reducing (first two "for"
loops) and different-lines marking (first and second "if").


- Davide

^ permalink raw reply	[flat|nested] 67+ messages in thread
* Re: libxdiff and patience diff 2008-11-04 3:17 ` Davide Libenzi @ 2008-11-04 8:33 ` Pierre Habouzit 0 siblings, 0 replies; 67+ messages in thread From: Pierre Habouzit @ 2008-11-04 8:33 UTC (permalink / raw) To: Davide Libenzi; +Cc: Git ML [-- Attachment #1: Type: text/plain, Size: 2593 bytes --] On Tue, Nov 04, 2008 at 03:17:37AM +0000, Davide Libenzi wrote: > On Tue, 4 Nov 2008, Pierre Habouzit wrote: > > > Hi Davide, > > > > I've been working tonight, trying to make libxdiff support the patience > > diff algorithm, but I've totally failed, because I _thought_ I > > understood what xdl_split was doing, but it appears I don't. > > > > > > [ For the readers playing at home, the patience diff algorithm is > > explained after my sig. ] > > > > > > What I did is to: > > (1) add a flag to the xenv in xdl_split that says that I want a > > patience "split". > > (2) Then in xdl_split, if that bit is set, I compute the longest common > > subsequence from the patience diff. > > (3) for each split it computes I call xdl_recs_cmp on that interval. > > > > > > What I thought it would achieve is that I force some boundaries at which > > libxdiff _must_ resync. Though, it seems that for some reason it doesn't > > work, probably because the "ha" stuff and the boundaries absolutely > > don't work the way I thought it did. > > > > So where is the place I should do that ? I suspect it should be > > partly in xprepare.c but I'm a bit stuck right now. > > > > > > Any pointer on how the stuff in xprepare.c and xdiffi.c work would help > > greatly, it's really not self-evident to me :) > > What makes you think it'd self-evident to me? :) > Seriously, I forgot stuff I wrote the last month, this is way beyond my > memory limits. > You definitely need to look at xprepare.c, especially at xdl_trim_ends() > and xdl_cleanup_records(). Lines are re-arranged in indexes, and this > might screw up your logic if you're not prepared for it. 
> What xdl_split() does, is find the start of an LCS and return the
> coordinate. Then xdl_recs_cmp() does the box reducing (first two "for"
> loops) and different-lines marking (first and second "if").

Well, that's what I thought it did, indeed. That's why, before recursing
into that bit, I tried to extract a full LCS from the patience diff
algorithm and recurse into each interval it gives, which _should_ work,
but doesn't at all :/

Okay, maybe I should re-read my algorithm slowly and check that I haven't
done something silly (like chaining the list in the reverse order or god
knows what), but my debug functions let me believe that I did that fine.
I'll look into it tonight.

--
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org

[-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread
* Re: libxdiff and patience diff
  2008-11-04  0:40 libxdiff and patience diff Pierre Habouzit
  2008-11-04  3:17 ` Davide Libenzi
@ 2008-11-04  5:39 ` Johannes Schindelin
  2008-11-04  8:30   ` Pierre Habouzit
  1 sibling, 1 reply; 67+ messages in thread
From: Johannes Schindelin @ 2008-11-04  5:39 UTC (permalink / raw)
To: Pierre Habouzit; +Cc: davidel, Git ML

Hi,

On Tue, 4 Nov 2008, Pierre Habouzit wrote:

> I've been working tonight, trying to make libxdiff support the patience
> diff algorithm, but I've totally failed, because I _thought_ I
> understood what xdl_split was doing, but it appears I don't.

I thought about it, too, at the GitTogether, although I want to finish my
jGit diff first.

The main idea I had about patience diff is that you can reuse a lot of
functions in libxdiff.

But the requirement of taking just unique lines/hashes into account, and,
_after_ splitting up, _again_ determining uniqueness _just_ in the
between-hunk part, made me think that it may be wiser to have a separate
function for the patience diff stuff.

Basically, you would start by having libxdiff split the lines and hash
them as normal, but then determine the unique hashes (I'd start with the
smaller text, just to have a better chance to end up with unique hashes).

Once they are determined, you can search for those exact lines (hash
first) in the post-document.

Once that is done, you'd have to find the longest common subsequence,
which you could do using the existing infrastructure, but that would
require more work (as we already know the lines are unique).

After that, you would have to recurse to the same algorithm _between_
known chunks. Eventually, that would have to resort to classical libxdiff
(if there are no, or not enough, unique lines).

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 67+ messages in thread
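[ Editor's note: Dscho's outline can be summarized as a recursion
skeleton -- illustrative pseudocode only; none of these helper names
exist in libxdiff. ]

```text
patience_diff(A, B):
    U = lines whose hash occurs exactly once in A and exactly once in B
    if U is empty:
        classical_xdiff(A, B)            # the fall-back case
        return
    S = longest subsequence of U that is ordered in both A and B
    for each gap between consecutive anchors of S (plus the two ends):
        trim leading/trailing lines that already match
        patience_diff(A[gap], B[gap])    # uniqueness is re-determined
                                         # inside each gap
```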
* Re: libxdiff and patience diff
  2008-11-04  5:39 ` Johannes Schindelin
@ 2008-11-04  8:30   ` Pierre Habouzit
  2008-11-04 14:34     ` Johannes Schindelin
  2009-01-06 11:17     ` Pierre Habouzit
  0 siblings, 2 replies; 67+ messages in thread
From: Pierre Habouzit @ 2008-11-04  8:30 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: davidel, Git ML

[-- Attachment #1: Type: text/plain, Size: 3639 bytes --]

On Tue, Nov 04, 2008 at 05:39:48AM +0000, Johannes Schindelin wrote:
> Hi,
>
> On Tue, 4 Nov 2008, Pierre Habouzit wrote:
>
> > I've been working tonight, trying to make libxdiff support the patience
> > diff algorithm, but I've totally failed, because I _thought_ I
> > understood what xdl_split was doing, but it appears I don't.
>
> I thought about it, too, at the GitTogether, although I want to finish my
> jGit diff first.
>
> The main idea I had about patience diff is that you can reuse a lot of
> functions in libxdiff.
>
> But the requirement of taking just unique lines/hashes into account, and
> _after_ splitting up, _again_ determine uniqueness _just_ in the
> between-hunk part made me think that it may be wiser to have a separate
> function for the patience diff stuff.
>
> Basically, you would start to have libxdiff split the lines and hash them
> as normal, but then determine the unique hashes (I'd start with the
> smaller text, just to have a better chance to end up with unique hashes).
>
> Once they are determined, you can search for those exact lines (hash
> first) in the post-document.

Actually, my current implementation just puts all the hashes into an
array, together with the position in the left or right file each comes
from (the struct has its left pos set to -1 if the right pos is set, and
vice versa), and sorts it by hash in O(n log n), ties ordered by
increasing right position.

Then I scan the array for runs of exactly two consecutive identical
hashes, and collapse each run into the proper {left pos, right pos}
tuple if it was indeed a unique line in both files.
This results in an array that I then sort again by right pos; we can
work on that for the stack sorting, which I do, and then I have my LCS.

This is the complete brute-force algorithm. It requires a temporary
array of the size of the number of lines on the left + the right, and a
temporary array for the stacks which _may_ end up being as large as the
smaller of the two files' line counts in the worst case, I'd say
(roughly).

Then I just remember a list of split points, and I recurse into all the
sub-splits again. It has a fixed point which may or may not need
libxdiff recursion in it.

This code is actually written, naive and unoptimized, but it doesn't
work, probably because I didn't plug it in at the proper place :)

> Once that is done, you'd have to find the longest common subsequence,
> which you could do using the existing infrastructure, but that would
> require more work (as we already know the lines are unique).

Patience diff gives you the algorithm to do that; it's pretty simple,
and is more efficient than the current infrastructure (in time -- I
don't know about space, though).

> After that, you would have to recurse to the same algorithm _between_
> known chunks. Eventually, that would have to resort to classical
> libxdiff (if there are no, or not enough, unique lines).

Yeah, that's the point. The problem is, I believe more and more that I
should prepare the LCS from patience diff in xprepare.c, but I grok
absolutely nothing of what the chastore_t and similar stuff is. I
understand it's about hashing, but the exact stuff it does eludes me.

In fact, when I look at the records I have in xdiffi.c, I had the
impression they were already somehow collapsed, which makes it too late
a point to apply the patience diff ...

--
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org

[-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread
* Re: libxdiff and patience diff 2008-11-04 8:30 ` Pierre Habouzit @ 2008-11-04 14:34 ` Johannes Schindelin 2008-11-04 15:23 ` Pierre Habouzit 2009-01-06 11:17 ` Pierre Habouzit 1 sibling, 1 reply; 67+ messages in thread From: Johannes Schindelin @ 2008-11-04 14:34 UTC (permalink / raw) To: Pierre Habouzit; +Cc: davidel, Git ML Hi, On Tue, 4 Nov 2008, Pierre Habouzit wrote: > On Tue, Nov 04, 2008 at 05:39:48AM +0000, Johannes Schindelin wrote: > > > On Tue, 4 Nov 2008, Pierre Habouzit wrote: > > > > > I've been working tonight, trying to make libxdiff support the > > > patience diff algorithm, but I've totally failed, because I > > > _thought_ I understood what xdl_split was doing, but it appears I > > > don't. > > > > I thought about it, too, at the GitTogether, although I want to finish > > my jGit diff first. > > > > The main idea I had about patience diff is that you can reuse a lot of > > functions in libxdiff. > > > > But the requirement of taking just unique lines/hashes into account, > > and _after_ splitting up, _again_ determine uniqueness _just_ in the > > between-hunk part made me think that it may be wiser to have a > > separate function for the patience diff stuff. > > > > Basically, you would start to have libxdiff split the lines and hash > > them as normal, but then determine the unique hashes (I'd start with > > the smaller text, just to have a better chance to end up with unique > > hashes). > > > > Once they are determined, you can search for those exact lines (hash > > first) in the post-document. > > Actually my current implementation just puts all the hashes into an > array, sorts them by hash (which is O(n log(n)) with the position from > left or right file it's in, ordered by increasing right position. (and > the struct is filled with the left pos to -1 if the right pos is set, > and vice versa). Yeah, that would be much more efficient using a hash-multiset. Still linear size (albeit twice as much). 
And you can already discard double entries early (although you have to keep one pointer in the hash-multiset to prevent other identical lines from being misdetected as "unique"). > The I scan the array to find patterns of two consecutive hashes exactly, > and collapse it into the proper {left pos, right pos} tuple if it was > indeed a unique line in both files. > > This results into an array I sort again by right pos then, and we can > work on that for the stack sorting, and I do it, and then I have my LCS. > > > This is the complete brute-force algorithm which requires a temporary > array of the size of the number of lines on the left + the right, and a > temporary array for the stacks which _may_ end up being as large as the > smallest number of lines between the left or right file in the worst > case I'd say (roughly). > > Then I just remember a list of split points, and I recurse in all the > sub splits again. It has a fixed point which may or may not need > libxdiff recursion in it. I am not sure that you really end up with patience diff that way. I _think_ that you need to determine the longest sequence of unique lines which has the property of being ordered in both texts first, and only _then_ recurse into the not-yet-handled lines. > > Once that is done, you'd have to find the longest common subsequence, > > which you could do using the existing infrastructure, but that would > > require more work (as we already know the lines are unique). > > Patience diff gives you the algorithm to do that, it's pretty simple, > and is more efficient than the current infrastructure (in time, I don't > know for space though). Actually, IIRC it is pretty easy to see that the time complexity is linear (and therefore, the space complexity, too). > > After that, you would have to recurse to the same algorithm _between_ > > known chunks. Eventually, that would have to resort to classical > > libxdiff (if there are no, or not enough, unique lines). 
>
> Yeah, that's the point, the problem is, I believe more and more that I
> should prepare the LCS from patience diff in xprepare.c, but I grok
> absolutely nothing at what the chastore_t and similar stuff is. I
> understand it's about hashing, but the exact stuff it does eludes me.

Yes, I do not like the short and unintuitive names either. AFAIU
chastore_t is just a generic extensible array of elements that have size
"isize", and initially there are "icount" of them.

> In fact when I look at the records I have in xdiffi.c I had the
> impression they were already somehow collapsed, which makes it a too
> late point to apply the patience diff ...

AFAICS xdiffi.c contains the classical diff algorithm (incidentally, the
inventor of that algorithm is about 50 meters away from me at this very
moment). It should not have anything of interest to you, except for the
fall-back case.

So I think that you should add a new file xpatience.c.

In that, I'd implement that hash multi-set, and use a prepared xdfenv_t
to fill it (smaller file first; then you can traverse the other file,
checking for uniqueness in that file and for a match in the other file
at the same time).

You _could_ build the longest list of ordered pairs at the same time,
too, but that may make the code a bit too complex.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 67+ messages in thread
* Re: libxdiff and patience diff 2008-11-04 14:34 ` Johannes Schindelin @ 2008-11-04 15:23 ` Pierre Habouzit 2008-11-04 15:57 ` Johannes Schindelin 2009-01-01 16:38 ` [PATCH 0/3] Teach Git about the patience diff algorithm Johannes Schindelin 0 siblings, 2 replies; 67+ messages in thread From: Pierre Habouzit @ 2008-11-04 15:23 UTC (permalink / raw) To: Johannes Schindelin; +Cc: davidel, Git ML [-- Attachment #1.1: Type: text/plain, Size: 5442 bytes --] On Tue, Nov 04, 2008 at 02:34:37PM +0000, Johannes Schindelin wrote: > Hi, > > On Tue, 4 Nov 2008, Pierre Habouzit wrote: > > > On Tue, Nov 04, 2008 at 05:39:48AM +0000, Johannes Schindelin wrote: > > > > > On Tue, 4 Nov 2008, Pierre Habouzit wrote: > > > > > > > I've been working tonight, trying to make libxdiff support the > > > > patience diff algorithm, but I've totally failed, because I > > > > _thought_ I understood what xdl_split was doing, but it appears I > > > > don't. > > > > > > I thought about it, too, at the GitTogether, although I want to finish > > > my jGit diff first. > > > > > > The main idea I had about patience diff is that you can reuse a lot of > > > functions in libxdiff. > > > > > > But the requirement of taking just unique lines/hashes into account, > > > and _after_ splitting up, _again_ determine uniqueness _just_ in the > > > between-hunk part made me think that it may be wiser to have a > > > separate function for the patience diff stuff. > > > > > > Basically, you would start to have libxdiff split the lines and hash > > > them as normal, but then determine the unique hashes (I'd start with > > > the smaller text, just to have a better chance to end up with unique > > > hashes). > > > > > > Once they are determined, you can search for those exact lines (hash > > > first) in the post-document. 
> > Actually my current implementation just puts all the hashes into an
> > array, sorts them by hash (which is O(n log(n)) with the position from
> > left or right file it's in, ordered by increasing right position. (and
> > the struct is filled with the left pos to -1 if the right pos is set,
> > and vice versa).
>
> Yeah, that would be much more efficient using a hash-multiset. Still
> linear size (albeit twice as much). And you can already discard double
> entries early (although you have to keep one pointer in the hash-multiset
> to prevent other identical lines from being misdetected as "unique").

Probably. I just wanted something to work first, and then optimize it
using the proper structure.

> I am not sure that you really end up with patience diff that way. I
> _think_ that you need to determine the longest sequence of unique lines
> which has the property of being ordered in both texts first, and only
> _then_ recurse into the not-yet-handled lines.

I must have been using really bad English, because that's what I do ;)

> > > Once that is done, you'd have to find the longest common subsequence,
> > > which you could do using the existing infrastructure, but that would
> > > require more work (as we already know the lines are unique).
> >
> > Patience diff gives you the algorithm to do that, it's pretty simple,
> > and is more efficient than the current infrastructure (in time, I don't
> > know for space though).
>
> Actually, IIRC it is pretty easy to see that the time complexity is linear
> (and therefore, the space complexity, too).

Well, the space complexity of the patience diff would be linear too for
me, though I'm afraid that the current state of my implementation has a
big constant factor in front ;)

> > In fact when I look at the records I have in xdiffi.c I had the
> > impression they were already somehow collapsed, which makes it a too
> > late point to apply the patience diff ...
>
> AFAICS xdiffi.c contains the classical diff algorithm (incidentally, the
> inventor of that algorithm is about 50 meters away from me at this very
> moment). It should not have anything of interest to you, except for the
> fall-back case.
>
> So I think that you should add a new file xpatience.c.
>
> In that, I'd implement that hash multi-set, and use a prepared xdfenv_t
> to fill it (smaller file first, then you can traverse the other file,
> checking for uniqueness in that file and for a match in the other file
> at the same time).
>
> You _could_ build the longest list of ordered pairs at the same time,
> too, but that may make the code a bit too complex.

Well, technically, if I'm correct, xdiffi.c already has a pruned list of
hashed lines (I suppose ->ha is a hash value somehow?), pruned from
lines without a match left or right, and I was massaging that. Though it
absolutely fails, so I fear I didn't understand what the ha stuff really
is.

Attached is my current work (YES it is horrible, it's WIP); probably
most of it can be put in an xpatience.c file, but still, I absolutely
don't understand what goes wrong. On the simple example from glandium.org
it _looks_ like it catches the "unique" include line fine, but somehow
it doesn't.

The nasty thing about the patience diff is that it still needs the usual
diff algorithm once it has split the file into chunks separated by
"unique lines". So you cannot make really independent stuff. What I
could do is put most of the patience diff into xpatience.c, but it would
still have to use some functions from xdiffi.c that are currently
private, so it messes up the files more than it's worth, IMHO.

Anyways, I'm waiting for a client to finally set up his network for the
test I'm supposed to do right now; I'll try to have a closer look
tonight...
-- ·O· Pierre Habouzit ··O madcoder@debian.org OOO http://www.madism.org [-- Attachment #1.2: wip.patch --] [-- Type: text/plain, Size: 8642 bytes --] diff --git a/diff.c b/diff.c index f644947..0901cdc 100644 --- a/diff.c +++ b/diff.c @@ -2447,6 +2447,8 @@ int diff_opt_parse(struct diff_options *options, const char **av, int ac) options->xdl_opts |= XDF_IGNORE_WHITESPACE_CHANGE; else if (!strcmp(arg, "--ignore-space-at-eol")) options->xdl_opts |= XDF_IGNORE_WHITESPACE_AT_EOL; + else if (!strcmp(arg, "--patience")) + options->xdl_opts |= XDF_USE_PATIENCE; /* flags options */ else if (!strcmp(arg, "--binary")) { diff --git a/xdiff/xdiff.h b/xdiff/xdiff.h index 84fff58..bba915c 100644 --- a/xdiff/xdiff.h +++ b/xdiff/xdiff.h @@ -32,6 +32,7 @@ extern "C" { #define XDF_IGNORE_WHITESPACE (1 << 2) #define XDF_IGNORE_WHITESPACE_CHANGE (1 << 3) #define XDF_IGNORE_WHITESPACE_AT_EOL (1 << 4) +#define XDF_USE_PATIENCE (1 << 5) #define XDF_WHITESPACE_FLAGS (XDF_IGNORE_WHITESPACE | XDF_IGNORE_WHITESPACE_CHANGE | XDF_IGNORE_WHITESPACE_AT_EOL) #define XDL_PATCH_NORMAL '-' diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c index 9d0324a..1a5b13a 100644 --- a/xdiff/xdiffi.c +++ b/xdiff/xdiffi.c @@ -37,8 +37,235 @@ typedef struct s_xdpsplit { int min_lo, min_hi; } xdpsplit_t; +typedef struct s_xduniqmatch { + struct s_xduniqmatch *next; + long i1, i2, len; +} xduniqmatch_t; + +typedef struct s_xdp_ha { + struct s_xdp_ha *up, *down; + struct s_xdp_ha *left; + long ha; + long off1; + long off2; +} xdp_ha_t; + + +static xduniqmatch_t *xdl_uniq_match_alloc(long i1, long i2, long len) +{ + xduniqmatch_t *match = xdl_malloc(sizeof(xduniqmatch_t)); + if (match) { + match->next = NULL; + match->i1 = i1; + match->i2 = i2; + match->len = len; + } + return match; +} + +static xduniqmatch_t *xdl_uniq_popfree(xduniqmatch_t *lst) +{ + xduniqmatch_t *next = lst->next; + xdl_free(lst); + return next; +} + +static int xdp_ha_cmp_by_ha(const void *a1, const void *a2) +{ + const xdp_ha_t *ha1 = a1; + const 
xdp_ha_t *ha2 = a2; + if (ha1->ha == ha2->ha) { + return ha1->off2 - ha2->off2; + } + return ha1->ha - ha2->ha; +} + +static int xdp_ha_cmp_by_off2(const void *a1, const void *a2) +{ + const xdp_ha_t *ha1 = a1; + const xdp_ha_t *ha2 = a2; + return ha2->off2 - ha1->off2; +} + +static int patience_bisect_stack(xdp_ha_t **stacks, long len, long off1) +{ + long l = 0, r = len; + + while (l < r) { + long i = (r + l) / 2; + + if (off1 < stacks[i]->off1) { + l = i + 1; + } else { + r = i; + } + } + return l; +} + +static int xdl_patience_split_aux(xduniqmatch_t *lcs, xdp_ha_t *ha) +{ + xduniqmatch_t *tmp; + + while (ha) { + tmp = xdl_uniq_match_alloc(ha->off1, ha->off2, 1); + if (!tmp) + return -1; + tmp->next = lcs->next; + lcs->next = tmp; + lcs = tmp; + + while ((ha = ha->down ? ha->down : ha->left)) { + if (lcs->i1 + lcs->len + 1 == ha->off1 && lcs->i2 + lcs->len + 1 == ha->off2) { + lcs->len++; + continue; + } + break; + } + } + return 0; +} + +static int xdl_patience_split(unsigned long const *ha1, unsigned long const *ha2, + xduniqmatch_t *begin, xduniqmatch_t *end) +{ + xdp_ha_t *recs; + long off1 = begin->i1 + begin->len, lim1 = end->i1; + long off2 = begin->i2 + begin->len, lim2 = end->i2; + long len, i, j, uniq; + + len = lim1 - off1 + lim2 - off2; + recs = (xdp_ha_t *)xdl_malloc(sizeof(xdp_ha_t) * len); + if (recs == NULL) + return -1; + + for (i = 0, j = off1; j < lim1 - off1; j++, i++) { + recs[i].ha = ha1[j]; + recs[i].off1 = j; + recs[i].off2 = -1; + recs[i].up = recs[i].down = recs[i].left = NULL; + } + for (j = off2; j < lim2; j++, i++) { + recs[i].ha = ha2[j]; + recs[i].off1 = -1; + recs[i].off2 = j; + recs[i].up = recs[i].down = recs[i].left = NULL; + } + + qsort(recs, len, sizeof(xdp_ha_t), xdp_ha_cmp_by_ha); + + uniq = 0; + for (i = 0; i < len - 1; ) { + long ha = recs[i].ha; + + if (ha != recs[i + 1].ha) { + i++; + continue; + } + + if (i < len - 2 && ha == recs[i + 2].ha) { + i += 3; + while (i < len - 1 && recs[i].ha == ha && i < len - 1) { + i++; 
+ } + continue; + } + + if (recs[i].off2 < 0 && recs[i + 1].off1 < 0) { + long a, b; + recs[uniq].ha = ha; + a = recs[uniq].off1 = recs[i].off1; + b = recs[uniq].off2 = recs[i + 1].off2; + uniq++; + } + i += 2; + } + + if (uniq) { + xdp_ha_t **stacks; + long alloc, len; + + qsort(recs, uniq, sizeof(xdp_ha_t), xdp_ha_cmp_by_off2); + + alloc = xdl_bogosqrt(uniq); + stacks = xdl_malloc(sizeof(xdp_ha_t *) * alloc); + if (stacks == NULL) + goto error; + len = 1; + stacks[0] = recs; + + for (i = 1; i < uniq; i++) { + long off1 = recs[i].off1; + long k; + + if (off1 < stacks[len - 1]->off1) { + if (len >= alloc) { + alloc *= 2; + stacks = xdl_realloc(stacks, sizeof(xdp_ha_t *) * alloc); + if (!stacks) + goto error; + } + stacks[k = len++] = NULL; + } else { + k = patience_bisect_stack(stacks, len - 1, off1); + } + + if (k > 0) { + recs[i].left = stacks[k - 1]; + } + if (stacks[k]) { + stacks[k]->down = &recs[i]; + recs[i].up = stacks[k]; + } + stacks[k] = &recs[i]; + } + + if (xdl_patience_split_aux(begin, stacks[len - 1]) < 0) { + xdl_free(stacks); + goto error; + } + + xdl_free(stacks); + } + + xdl_free(recs); + return 0; + +error: + xdl_free(recs); + return -1; +} + +static int xdl_patience_lcs(xdfenv_t *xe, xduniqmatch_t *begin, xduniqmatch_t *end) +{ + unsigned long const *ha1 = xe->xdf1.ha, *ha2 = xe->xdf2.ha; + long off1 = begin->i1 + begin->len, lim1 = end->i1; + long off2 = begin->i2 + begin->len, lim2 = end->i2; + xduniqmatch_t *next; + + for (; off1 < lim1 && off2 < lim2 && ha1[off1] == ha2[off2]; off1++, off2++); + for (; off1 < lim1 && off2 < lim2 && ha1[lim1 - 1] == ha2[lim2 - 1]; lim1--, lim2--); + + begin->len += off1 - begin->i1; + end->len += end->i1 - lim1; + end->i1 = lim1; + end->i2 = lim2; + + if (off1 == lim1 || off2 == lim2) + return 0; + + if (xdl_patience_split(ha1, ha2, begin, end)) + return -1; + + for (next = begin->next; next != end; begin = next, next = begin->next) { + if (xdl_patience_lcs(xe, begin, next) < 0) + return -1; + } + + return 
0;
+}
 
 static long xdl_split(unsigned long const *ha1, long off1, long lim1,
		       unsigned long const *ha2, long off2, long lim2,
@@ -321,13 +548,13 @@ int xdl_recs_cmp(diffdata_t *dd1, long off1, long lim1,
 	return 0;
 }
 
-
 int xdl_do_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
 		xdfenv_t *xe) {
 	long ndiags;
 	long *kvd, *kvdf, *kvdb;
 	xdalgoenv_t xenv;
 	diffdata_t dd1, dd2;
+	int need_min = (xpp->flags & XDF_NEED_MINIMAL) != 0;
 
 	if (xdl_prepare_env(mf1, mf2, xpp, xe) < 0) {
@@ -364,12 +591,54 @@ int xdl_do_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
 	dd2.rchg = xe->xdf2.rchg;
 	dd2.rindex = xe->xdf2.rindex;
 
-	if (xdl_recs_cmp(&dd1, 0, dd1.nrec, &dd2, 0, dd2.nrec,
-			 kvdf, kvdb, (xpp->flags & XDF_NEED_MINIMAL) != 0, &xenv) < 0) {
+	if (xpp->flags & XDF_USE_PATIENCE) {
+		xduniqmatch_t *lcs;
+		long i1, i2;
+
+		lcs = xdl_uniq_match_alloc(0, 0, 0);
+		if (!lcs)
+			goto error;
+		lcs->next = xdl_uniq_match_alloc(xe->xdf1.nreff, xe->xdf2.nreff, 0);
+		if (!lcs->next || xdl_patience_lcs(xe, lcs, lcs->next) < 0) {
+			while ((lcs = xdl_uniq_popfree(lcs)));
+			goto error;
+		}
 
-		xdl_free(kvd);
-		xdl_free_env(xe);
-		return -1;
+		i1 = i2 = lcs->len;
+		if (lcs->len) {
+			fprintf(stderr, "skip %ld:%ld -> %ld:%ld\n",
+				lcs->i1, i1, lcs->i2, i2);
+		}
+
+		while ((lcs = xdl_uniq_popfree(lcs))) {
+			fprintf(stderr, "usual %ld:%ld -> %ld:%ld\n",
+				i1, lcs->i1, i2, lcs->i2);
+			fprintf(stderr, "l/r: %ld / %ld\n",
+				xe->xdf1.rindex[lcs->i1],
+				xe->xdf2.rindex[lcs->i2]);
+			if (xdl_recs_cmp(&dd1, i1, lcs->i1, &dd2, i2, lcs->i2,
+					 kvdf, kvdb, need_min, &xenv) < 0) {
+				while ((lcs = xdl_uniq_popfree(lcs)));
+				goto error;
+			}
+			i1 = lcs->i1 + lcs->len;
+			i2 = lcs->i2 + lcs->len;
+			if (lcs->len) {
+				fprintf(stderr, "skip %ld:%ld -> %ld:%ld (len %ld)\n",
+					lcs->i1, i1, lcs->i2, i2, lcs->len);
+				fprintf(stderr, "l/r: %ld / %ld\n",
+					xe->xdf1.rindex[lcs->i1 + lcs->len],
+					xe->xdf2.rindex[lcs->i2 + lcs->len]);
+			}
+		}
+	} else {
+		if (xdl_recs_cmp(&dd1, 0, dd1.nrec, &dd2, 0, dd2.nrec,
+				 kvdf, kvdb, need_min, &xenv) < 0) {
+error:
+			xdl_free(kvd);
+			xdl_free_env(xe);
+			return -1;
+		}
 	}
 
 	xdl_free(kvd);

[-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply related	[flat|nested] 67+ messages in thread
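The control flow of the patch above — walk the list of unique-line matches, run the classic diff on each gap, then skip over the matched run — can be sketched independently of libxdiff. All names below (`struct anchor`, `struct range`, `split_by_anchors`) are made up for illustration; they are not the `xduniqmatch_t` API from the patch.

```c
#include <stddef.h>

/* A matched run of unique lines: A[i1..i1+len) pairs with B[i2..i2+len). */
struct anchor { long i1, i2, len; };

/* A gap between anchors, to be handed to a classic diff. */
struct range { long off1, lim1, off2, lim2; };

/* Split [0,lim1) x [0,lim2) at the anchors; collect the non-empty gaps
 * in `out` (caller provides n + 1 slots) and return how many there are. */
static size_t split_by_anchors(const struct anchor *a, size_t n,
                               long lim1, long lim2, struct range *out)
{
    long i1 = 0, i2 = 0;
    size_t k, m = 0;

    for (k = 0; k <= n; k++) {
        long next1 = (k < n) ? a[k].i1 : lim1;
        long next2 = (k < n) ? a[k].i2 : lim2;

        if (next1 > i1 || next2 > i2) {
            /* this gap needs the usual diff algorithm */
            out[m].off1 = i1;
            out[m].lim1 = next1;
            out[m].off2 = i2;
            out[m].lim2 = next2;
            m++;
        }
        if (k < n) {
            /* skip the matched run itself */
            i1 = a[k].i1 + a[k].len;
            i2 = a[k].i2 + a[k].len;
        }
    }
    return m;
}
```

With one anchor at A line 2 / B line 3 (length 1) in two 6-line files, this yields two gaps — lines 0..2 against 0..3, and 3..6 against 4..6 — which are the kind of intervals the patch feeds to xdl_recs_cmp().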
* Re: libxdiff and patience diff
  2008-11-04 15:23             ` Pierre Habouzit
@ 2008-11-04 15:57               ` Johannes Schindelin
  2008-11-04 16:15                 ` Pierre Habouzit
  2009-01-01 16:38                 ` [PATCH 0/3] Teach Git about the patience diff algorithm Johannes Schindelin
  1 sibling, 1 reply; 67+ messages in thread
From: Johannes Schindelin @ 2008-11-04 15:57 UTC (permalink / raw)
  To: Pierre Habouzit; +Cc: davidel, Git ML

Hi,

On Tue, 4 Nov 2008, Pierre Habouzit wrote:

> The nasty thing about the patience diff is that it still needs the usual
> diff algorithm once it has split the file into chunks separated by
> "unique lines".

Actually, it should try to apply patience diff again in those chunks,
separately.

> So you cannot make really independent stuff.  What I could do is put
> most of the xpatience diff into xpatience.c, but it would still have to
> use some functions from xdiffi.c that are currently private, so it
> messes up the files more than it's worth, IMHO.

I think it is better that you use the stuff from xdiffi.c through a well
defined interface, i.e. _not_ mess up the code by mingling it together
with the code in xdiffi.c.  The code is hard enough to read already.

Oh, BTW, "ha" is a hash of the lines which is used to make the line
matching faster.  You will see a lot of "ha" comparisons before actually
calling xdl_recmatch() for that reason.  Incidentally, this is also the
hash that I'd use for the hash multi-set I was referring to.

Oh, and I am not sure that it is worth your time trying to get it to run
with the linear list, since you cannot reuse that code afterwards, and
have to spend the same amount of time to redo it with the hash set.

I am awfully short on time, so it will take some days until I can review
what you have already, unfortunately.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 67+ messages in thread
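The "ha" trick described above — reject most line pairs on a precomputed hash before ever touching the bytes — looks roughly like this. The struct and function are an illustrative sketch, not libxdiff's actual xrecord_t or xdl_recmatch() signatures.

```c
#include <string.h>

struct rec {
    unsigned long ha;   /* precomputed hash of the line */
    const char *ptr;    /* line text */
    long size;          /* line length in bytes */
};

static int recs_match(const struct rec *a, const struct rec *b)
{
    if (a->ha != b->ha)             /* cheap reject: different hashes
                                       mean different lines, no memcmp */
        return 0;
    return a->size == b->size &&    /* guard against hash collisions */
           memcmp(a->ptr, b->ptr, (size_t)a->size) == 0;
}
```

Equal hashes still fall through to a byte comparison, so a collision can never produce a false match — only a wasted memcmp.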
* Re: libxdiff and patience diff
  2008-11-04 15:57                 ` Johannes Schindelin
@ 2008-11-04 16:15                   ` Pierre Habouzit
  0 siblings, 0 replies; 67+ messages in thread
From: Pierre Habouzit @ 2008-11-04 16:15 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: davidel, Git ML

[-- Attachment #1: Type: text/plain, Size: 2571 bytes --]

On Tue, Nov 04, 2008 at 03:57:44PM +0000, Johannes Schindelin wrote:
> Hi,
>
> On Tue, 4 Nov 2008, Pierre Habouzit wrote:
>
> > The nasty thing about the patience diff is that it still needs the usual
> > diff algorithm once it has split the file into chunks separated by
> > "unique lines".
>
> Actually, it should try to apply patience diff again in those chunks,
> separately.

Yes, that's what I do, but the recursion reaches a fixed point as soon
as you find no unique lines between the newly found ones, or when that
space is "empty".  E.g. you could have the two following hunks:

    File A    File B
      1         2
      2         1
      1         2
      2         1
      1         2
      2         1

The simple leading/trailing reduction will do nothing, and you don't
have any shared unique lines; on that you must apply the usual diff
algorithm.

> > So you cannot make really independent stuff.  What I could do is put
> > most of the xpatience diff into xpatience.c, but it would still have
> > to use some functions from xdiffi.c that are currently private, so it
> > messes up the files more than it's worth, IMHO.
>
> I think it is better that you use the stuff from xdiffi.c through a well
> defined interface, i.e. _not_ mess up the code by mingling it together
> with the code in xdiffi.c.  The code is hard enough to read already.

Hmmm.  I'll see to that later, once I have something that works.

> Oh, BTW, "ha" is a hash of the lines which is used to make the line
> matching more performant.  You will see a lot of "ha" comparisons before
> actually calling xdl_recmatch() for that reason.  Incidentally, this is
> also the hash that I'd use for the hash multi-set I was referring to.

Yeah, that's what I assumed it would be.
> Oh, and I am not sure that it is worth your time trying to get it to run
> with the linear list, since you cannot reuse that code afterwards, and
> have to spend the same amount of time to redo it with the hash set.

Getting the linear list (actually an array) to work would show me that
I'm hooking in at the proper place.  Replacing a data structure
afterwards doesn't scare me, because I've split the functions properly.

> I am awfully short on time, so it will take some days until I can review
> what you have already, unfortunately.

NP, it was just in case, because I'm horribly stuck with that code right
now ;)
--
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org

[-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread
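Pierre's File A/File B hunks (reading the two columns as A = 1 2 1 2 1 2 and B = 2 1 2 1 2 1) can be checked mechanically: no line is unique in either file, so patience diff has zero sync points to recurse on and the usual algorithm must finish the job. A hedged sketch, with made-up helper names, that counts the candidate anchors:

```c
#include <string.h>

/* How many times does `s` occur in `lines`? */
static int count_occ(const char **lines, int n, const char *s)
{
    int i, c = 0;
    for (i = 0; i < n; i++)
        if (strcmp(lines[i], s) == 0)
            c++;
    return c;
}

/* Count lines that occur exactly once in A and exactly once in B:
 * the only lines patience diff may use as sync points. */
static int count_patience_anchors(const char **a, int na,
                                  const char **b, int nb)
{
    int i, anchors = 0;
    for (i = 0; i < na; i++)
        if (count_occ(a, na, a[i]) == 1 && count_occ(b, nb, a[i]) == 1)
            anchors++;
    return anchors;
}
```

For Pierre's hunks this returns 0, which is exactly the fall-back case; a pair like {"x", "1"} against {"1", "y"} returns 1, because "1" is unique on both sides.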
* [PATCH 0/3] Teach Git about the patience diff algorithm
  2008-11-04 15:23             ` Pierre Habouzit
  2008-11-04 15:57               ` Johannes Schindelin
@ 2009-01-01 16:38               ` Johannes Schindelin
  2009-01-01 16:38                 ` [PATCH 1/3] Implement " Johannes Schindelin
                                   ` (3 more replies)
  1 sibling, 4 replies; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-01 16:38 UTC (permalink / raw)
  To: Pierre Habouzit; +Cc: davidel, Francis Galiegue, Git ML

Nothing fancy, really, just a straight-forward implementation of the
heavily under-documented and under-analyzed patience diff algorithm.

One thing is a bit ugly: the libxdiff interface does not allow
calculating diffs of ranges of lines.  So instead of reusing an
initialized environment (with line hashes and all), the simple structure
of mmfile_t is exploited to fake an mmfile_t of a file _part_, reusing
the file's buffer.  And this mmfile_t pair gets a new environment,
recalculating hashes and all.

Davide, I think it would be easier to refactor xdl_do_diff() to take
line ranges, and use that interface both in xpatience.c and in xmerge.c.
(Although I do not know if you took xmerge.c at all; you seemed a bit
reluctant about it back when I sent it to you.)

For those interested in studying the code, I suggest starting with the
short comment at the beginning of xpatience.c and then working yourself
up from the end (i.e. xdl_do_patience_diff()).

It might be a good idea to think about using this code in our merging
code once it is well reviewed and tested, as it might help with a
substantial number of otherwise non-trivial merge conflicts.

Oh, and the bash completions are so trivial I did not even bother to
test them.

Happy new year.
Johannes Schindelin (3):
  Implement the patience diff algorithm
  Introduce the diff option '--patience'
  bash completions: Add the --patience option

 Documentation/diff-options.txt         |    3 +
 Makefile                               |    2 +-
 contrib/completion/git-completion.bash |    2 +
 diff.c                                 |    2 +
 t/t4033-diff-patience.sh               |  168 ++++++++++++++
 xdiff/xdiff.h                          |    1 +
 xdiff/xdiffi.c                         |    3 +
 xdiff/xdiffi.h                         |    2 +
 xdiff/xpatience.c                      |  374 ++++++++++++++++++++++++++++++++
 9 files changed, 556 insertions(+), 1 deletions(-)
 create mode 100755 t/t4033-diff-patience.sh
 create mode 100644 xdiff/xpatience.c

^ permalink raw reply	[flat|nested] 67+ messages in thread
* [PATCH 1/3] Implement the patience diff algorithm
  2009-01-01 16:38 ` [PATCH 0/3] Teach Git about the patience diff algorithm Johannes Schindelin
@ 2009-01-01 16:38   ` Johannes Schindelin
  2009-01-01 16:39   ` [PATCH 2/3] Introduce the diff option '--patience' Johannes Schindelin
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-01 16:38 UTC (permalink / raw)
  To: Pierre Habouzit; +Cc: davidel, Francis Galiegue, Git ML

The patience diff algorithm produces slightly more intuitive output than
the classic Myers algorithm, as it does not try to minimize the number
of +/- lines first, but tries to preserve the lines that are unique.

To this end, it first determines lines that are unique in both files;
then the maximal sequence which preserves the order (relative to both
files) is extracted.

Starting from this initial set of common lines, the rest of the lines
are handled recursively, with Myers' algorithm as a fallback when the
patience algorithm fails (due to no common unique lines).
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> --- xdiff/xdiff.h | 1 + xdiff/xdiffi.c | 3 + xdiff/xdiffi.h | 2 + xdiff/xpatience.c | 374 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 380 insertions(+), 0 deletions(-) create mode 100644 xdiff/xpatience.c diff --git a/xdiff/xdiff.h b/xdiff/xdiff.h index 361f802..4da052a 100644 --- a/xdiff/xdiff.h +++ b/xdiff/xdiff.h @@ -32,6 +32,7 @@ extern "C" { #define XDF_IGNORE_WHITESPACE (1 << 2) #define XDF_IGNORE_WHITESPACE_CHANGE (1 << 3) #define XDF_IGNORE_WHITESPACE_AT_EOL (1 << 4) +#define XDF_PATIENCE_DIFF (1 << 5) #define XDF_WHITESPACE_FLAGS (XDF_IGNORE_WHITESPACE | XDF_IGNORE_WHITESPACE_CHANGE | XDF_IGNORE_WHITESPACE_AT_EOL) #define XDL_PATCH_NORMAL '-' diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c index 9d0324a..3e97462 100644 --- a/xdiff/xdiffi.c +++ b/xdiff/xdiffi.c @@ -329,6 +329,9 @@ int xdl_do_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, xdalgoenv_t xenv; diffdata_t dd1, dd2; + if (xpp->flags & XDF_PATIENCE_DIFF) + return xdl_do_patience_diff(mf1, mf2, xpp, xe); + if (xdl_prepare_env(mf1, mf2, xpp, xe) < 0) { return -1; diff --git a/xdiff/xdiffi.h b/xdiff/xdiffi.h index 3e099dc..ad033a8 100644 --- a/xdiff/xdiffi.h +++ b/xdiff/xdiffi.h @@ -55,5 +55,7 @@ int xdl_build_script(xdfenv_t *xe, xdchange_t **xscr); void xdl_free_script(xdchange_t *xscr); int xdl_emit_diff(xdfenv_t *xe, xdchange_t *xscr, xdemitcb_t *ecb, xdemitconf_t const *xecfg); +int xdl_do_patience_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, + xdfenv_t *env); #endif /* #if !defined(XDIFFI_H) */ diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c new file mode 100644 index 0000000..6687940 --- /dev/null +++ b/xdiff/xpatience.c @@ -0,0 +1,374 @@ +/* + * LibXDiff by Davide Libenzi ( File Differential Library ) + * Copyright (C) 2003-2009 Davide Libenzi, Johannes E. 
Schindelin + * + * This library is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * This library is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this library; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * Davide Libenzi <davidel@xmailserver.org> + * + */ +#include "xinclude.h" +#include "xtypes.h" +#include "xdiff.h" + +/* + * The basic idea of patience diff is to find lines that are unique in + * both files. These are intuitively the ones that we want to see as + * common lines. + * + * The maximal ordered sequence of such line pairs (where ordered means + * that the order in the sequence agrees with the order of the lines in + * both files) naturally defines an initial set of common lines. + * + * Now, the algorithm tries to extend the set of common lines by growing + * the line ranges where the files have identical lines. + * + * Between those common lines, the patience diff algorithm is applied + * recursively, until no unique line pairs can be found; these line ranges + * are handled by the well-known Myers algorithm. + */ + +#define NON_UNIQUE ULONG_MAX + +/* + * This is a hash mapping from line hash to line numbers in the first and + * second file. + */ +struct hashmap { + int nr, alloc; + struct entry { + unsigned long hash; + /* + * 0 = unused entry, 1 = first line, 2 = second, etc. + * line2 is NON_UNIQUE if the line is not unique + * in either the first or the second file. 
+ */ + unsigned long line1, line2; + /* + * "next" & "previous" are used for the longest common + * sequence; + * initially, "next" reflects only the order in file1. + */ + struct entry *next, *previous; + } *entries, *first, *last; + /* were common records found? */ + unsigned long has_matches; + mmfile_t *file1, *file2; + xdfenv_t *env; + xpparam_t const *xpp; +}; + +/* The argument "pass" is 1 for the first file, 2 for the second. */ +static void insert_record(int line, struct hashmap *map, int pass) +{ + xrecord_t **records = pass == 1 ? + map->env->xdf1.recs : map->env->xdf2.recs; + xrecord_t *record = records[line - 1], *other; + /* + * After xdl_prepare_env() (or more precisely, due to + * xdl_classify_record()), the "ha" member of the records (AKA lines) + * is _not_ the hash anymore, but a linearized version of it. In + * other words, the "ha" member is guaranteed to start with 0 and + * the second record's ha can only be 0 or 1, etc. + * + * So we multiply ha by 2 in the hope that the hashing was + * "unique enough". 
+ */ + int index = (int)((record->ha << 1) % map->alloc); + + while (map->entries[index].line1) { + other = map->env->xdf1.recs[map->entries[index].line1 - 1]; + if (map->entries[index].hash != record->ha || + !xdl_recmatch(record->ptr, record->size, + other->ptr, other->size, + map->xpp->flags)) { + if (++index >= map->alloc) + index = 0; + continue; + } + if (pass == 2) + map->has_matches = 1; + if (pass == 1 || map->entries[index].line2) + map->entries[index].line2 = NON_UNIQUE; + else + map->entries[index].line2 = line; + return; + } + if (pass == 2) + return; + map->entries[index].line1 = line; + map->entries[index].hash = record->ha; + if (!map->first) + map->first = map->entries + index; + if (map->last) { + map->last->next = map->entries + index; + map->entries[index].previous = map->last; + } + map->last = map->entries + index; + map->nr++; +} + +/* + * This function has to be called for each recursion into the inter-hunk + * parts, as previously non-unique lines can become unique when being + * restricted to a smaller part of the files. + * + * It is assumed that env has been prepared using xdl_prepare(). 
+ */ +static int fill_hashmap(mmfile_t *file1, mmfile_t *file2, + xpparam_t const *xpp, xdfenv_t *env, + struct hashmap *result, + int line1, int count1, int line2, int count2) +{ + result->file1 = file1; + result->file2 = file2; + result->xpp = xpp; + result->env = env; + + /* We know exactly how large we want the hash map */ + result->alloc = count1 * 2; + result->entries = (struct entry *) + xdl_malloc(result->alloc * sizeof(struct entry)); + if (!result->entries) + return -1; + memset(result->entries, 0, result->alloc * sizeof(struct entry)); + + /* First, fill with entries from the first file */ + while (count1--) + insert_record(line1++, result, 1); + + /* Then search for matches in the second file */ + while (count2--) + insert_record(line2++, result, 2); + + return 0; +} + +/* + * Find the longest sequence with a smaller last element (meaning a smaller + * line2, as we construct the sequence with entries ordered by line1). + */ +static int binary_search(struct entry **sequence, int longest, + struct entry *entry) +{ + int left = -1, right = longest; + + while (left + 1 < right) { + int middle = (left + right) / 2; + /* by construction, no two entries can be equal */ + if (sequence[middle]->line2 > entry->line2) + right = middle; + else + left = middle; + } + /* return the index in "sequence", _not_ the sequence length */ + return left; +} + +/* + * The idea is to start with the list of common unique lines sorted by + * the order in file1. For each of these pairs, the longest (partial) + * sequence whose last element's line2 is smaller is determined. + * + * For efficiency, the sequences are kept in a list containing exactly one + * item per sequence length: the sequence with the smallest last + * element (in terms of line2). 
+ */ +static struct entry *find_longest_common_sequence(struct hashmap *map) +{ + struct entry **sequence = xdl_malloc(map->nr * sizeof(struct entry *)); + int longest = 0, i; + struct entry *entry; + + for (entry = map->first; entry; entry = entry->next) { + if (!entry->line2 || entry->line2 == NON_UNIQUE) + continue; + i = binary_search(sequence, longest, entry); + entry->previous = i < 0 ? NULL : sequence[i]; + sequence[++i] = entry; + if (i == longest) + longest++; + } + + /* No common unique lines were found */ + if (!longest) + return NULL; + + /* Iterate starting at the last element, adjusting the "next" members */ + entry = sequence[longest - 1]; + entry->next = NULL; + while (entry->previous) { + entry->previous->next = entry; + entry = entry->previous; + } + return entry; +} + +static int match(struct hashmap *map, int line1, int line2) +{ + xrecord_t *record1 = map->env->xdf1.recs[line1 - 1]; + xrecord_t *record2 = map->env->xdf2.recs[line2 - 1]; + return xdl_recmatch(record1->ptr, record1->size, + record2->ptr, record2->size, map->xpp->flags); +} + +static int patience_diff(mmfile_t *file1, mmfile_t *file2, + xpparam_t const *xpp, xdfenv_t *env, + int line1, int count1, int line2, int count2); + +static int walk_common_sequence(struct hashmap *map, struct entry *first, + int line1, int count1, int line2, int count2) +{ + int end1 = line1 + count1, end2 = line2 + count2; + int next1, next2; + + for (;;) { + /* Try to grow the line ranges of common lines */ + if (first) { + next1 = first->line1; + next2 = first->line2; + while (next1 > line1 && next2 > line2 && + match(map, next1 - 1, next2 - 1)) { + next1--; + next2--; + } + } else { + next1 = end1; + next2 = end2; + } + while (line1 < next1 && line2 < next2 && + match(map, line1, line2)) { + line1++; + line2++; + } + + /* Recurse */ + if (next1 > line1 || next2 > line2) { + struct hashmap submap; + + memset(&submap, 0, sizeof(submap)); + if (patience_diff(map->file1, map->file2, + map->xpp, map->env, + 
line1, next1 - line1, + line2, next2 - line2)) + return -1; + } + + if (!first) + return 0; + + while (first->next && + first->next->line1 == first->line1 + 1 && + first->next->line2 == first->line2 + 1) + first = first->next; + + line1 = first->line1 + 1; + line2 = first->line2 + 1; + + first = first->next; + } +} + +static int fall_back_to_classic_diff(struct hashmap *map, + int line1, int count1, int line2, int count2) +{ + /* + * This probably does not work outside Git, since + * we have a very simple mmfile structure. + * + * Note: ideally, we would reuse the prepared environment, but + * the libxdiff interface does not (yet) allow for diffing only + * ranges of lines instead of the whole files. + */ + mmfile_t subfile1, subfile2; + xpparam_t xpp; + xdfenv_t env; + + subfile1.ptr = (char *)map->env->xdf1.recs[line1 - 1]->ptr; + subfile1.size = map->env->xdf1.recs[line1 + count1 - 2]->ptr + + map->env->xdf1.recs[line1 + count1 - 2]->size - subfile1.ptr; + subfile2.ptr = (char *)map->env->xdf2.recs[line2 - 1]->ptr; + subfile2.size = map->env->xdf2.recs[line2 + count2 - 2]->ptr + + map->env->xdf2.recs[line2 + count2 - 2]->size - subfile2.ptr; + xpp.flags = map->xpp->flags & ~XDF_PATIENCE_DIFF; + if (xdl_do_diff(&subfile1, &subfile2, &xpp, &env) < 0) + return -1; + + memcpy(map->env->xdf1.rchg + line1 - 1, env.xdf1.rchg, count1); + memcpy(map->env->xdf2.rchg + line2 - 1, env.xdf2.rchg, count2); + + return 0; +} + +/* + * Recursively find the longest common sequence of unique lines, + * and if none was found, ask xdl_do_diff() to do the job. + * + * This function assumes that env was prepared with xdl_prepare_env(). 
+ */ +static int patience_diff(mmfile_t *file1, mmfile_t *file2, + xpparam_t const *xpp, xdfenv_t *env, + int line1, int count1, int line2, int count2) +{ + struct hashmap map; + struct entry *first; + int result = 0; + + /* trivial case: one side is empty */ + if (!count1) { + while(count2--) + env->xdf2.rchg[line2++ - 1] = 1; + return 0; + } else if (!count2) { + while(count1--) + env->xdf1.rchg[line1++ - 1] = 1; + return 0; + } + + memset(&map, 0, sizeof(map)); + if (fill_hashmap(file1, file2, xpp, env, &map, + line1, count1, line2, count2)) + return -1; + + /* are there any matching lines at all? */ + if (!map.has_matches) { + while(count1--) + env->xdf1.rchg[line1++ - 1] = 1; + while(count2--) + env->xdf2.rchg[line2++ - 1] = 1; + return 0; + } + + first = find_longest_common_sequence(&map); + if (first) + result = walk_common_sequence(&map, first, + line1, count1, line2, count2); + else + result = fall_back_to_classic_diff(&map, + line1, count1, line2, count2); + + return result; +} + +int xdl_do_patience_diff(mmfile_t *file1, mmfile_t *file2, + xpparam_t const *xpp, xdfenv_t *env) +{ + if (xdl_prepare_env(file1, file2, xpp, env) < 0) + return -1; + + /* environment is cleaned up in xdl_diff() */ + return patience_diff(file1, file2, xpp, env, + 1, env->xdf1.nrec, 1, env->xdf2.nrec); +} -- 1.6.1.rc3.412.ga72b ^ permalink raw reply related [flat|nested] 67+ messages in thread
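The heart of find_longest_common_sequence() in the patch above is a patience sort: entries arrive ordered by line1, and for each one a binary search finds the longest partial sequence whose last line2 is smaller. Stripped of the entry structs, the same O(n log n) longest-increasing-subsequence kernel can be sketched as follows; `lis_length` and its flat `long` array are simplifications for illustration, not the patch's API.

```c
#include <stdlib.h>

/* Length of the longest strictly increasing subsequence of line2[0..n).
 * tops[k] holds the smallest attainable last element of an increasing
 * subsequence of length k + 1 -- one "stack top" per sequence length,
 * exactly as in the patch's sequence[] array. */
static int lis_length(const long *line2, int n)
{
    long *tops;
    int longest = 0, i;

    if (n <= 0)
        return 0;
    tops = malloc((size_t)n * sizeof(*tops));
    if (!tops)
        return -1;
    for (i = 0; i < n; i++) {
        /* binary search: leftmost stack whose top is >= line2[i] */
        int left = -1, right = longest;
        while (left + 1 < right) {
            int mid = (left + right) / 2;
            if (tops[mid] >= line2[i])
                right = mid;
            else
                left = mid;
        }
        tops[right] = line2[i];   /* replace that top, or start a new stack */
        if (right == longest)
            longest++;
    }
    free(tops);
    return longest;
}
```

For line2 values 3 1 4 2 5 the longest increasing subsequence (e.g. 1 2 5) has length 3; the patch additionally keeps "previous" pointers so that the winning sequence itself can be walked back afterwards.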
* [PATCH 2/3] Introduce the diff option '--patience' 2009-01-01 16:38 ` [PATCH 0/3] Teach Git about the patience diff algorithm Johannes Schindelin 2009-01-01 16:38 ` [PATCH 1/3] Implement " Johannes Schindelin @ 2009-01-01 16:39 ` Johannes Schindelin 2009-01-01 16:39 ` [PATCH 3/3] bash completions: Add the --patience option Johannes Schindelin 2009-01-01 19:45 ` [PATCH 0/3] Teach Git about the patience diff algorithm Linus Torvalds 3 siblings, 0 replies; 67+ messages in thread From: Johannes Schindelin @ 2009-01-01 16:39 UTC (permalink / raw) To: Pierre Habouzit; +Cc: davidel, Francis Galiegue, Git ML This commit teaches Git to produce diff output using the patience diff algorithm with the diff option '--patience'. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> --- git log --check complains about this patch, for obvious reasons. Documentation/diff-options.txt | 3 + Makefile | 2 +- diff.c | 2 + t/t4033-diff-patience.sh | 168 ++++++++++++++++++++++++++++++++++++++++ 4 files changed, 174 insertions(+), 1 deletions(-) create mode 100755 t/t4033-diff-patience.sh diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt index 671f533..15ef35a 100644 --- a/Documentation/diff-options.txt +++ b/Documentation/diff-options.txt @@ -36,6 +36,9 @@ endif::git-format-patch[] --patch-with-raw:: Synonym for "-p --raw". +--patience: + Generate a diff using the "patience diff" algorithm. + --stat[=width[,name-width]]:: Generate a diffstat. You can override the default output width for 80-column terminal by "--stat=width". 
diff --git a/Makefile b/Makefile index 154cf34..2217873 100644 --- a/Makefile +++ b/Makefile @@ -1287,7 +1287,7 @@ $(LIB_FILE): $(LIB_OBJS) $(QUIET_AR)$(RM) $@ && $(AR) rcs $@ $(LIB_OBJS) XDIFF_OBJS=xdiff/xdiffi.o xdiff/xprepare.o xdiff/xutils.o xdiff/xemit.o \ - xdiff/xmerge.o + xdiff/xmerge.o xdiff/xpatience.o $(XDIFF_OBJS): xdiff/xinclude.h xdiff/xmacros.h xdiff/xdiff.h xdiff/xtypes.h \ xdiff/xutils.h xdiff/xprepare.h xdiff/xdiffi.h xdiff/xemit.h diff --git a/diff.c b/diff.c index 56b80f9..67718b7 100644 --- a/diff.c +++ b/diff.c @@ -2472,6 +2472,8 @@ int diff_opt_parse(struct diff_options *options, const char **av, int ac) options->xdl_opts |= XDF_IGNORE_WHITESPACE_CHANGE; else if (!strcmp(arg, "--ignore-space-at-eol")) options->xdl_opts |= XDF_IGNORE_WHITESPACE_AT_EOL; + else if (!strcmp(arg, "--patience")) + options->xdl_opts |= XDF_PATIENCE_DIFF; /* flags options */ else if (!strcmp(arg, "--binary")) { diff --git a/t/t4033-diff-patience.sh b/t/t4033-diff-patience.sh new file mode 100755 index 0000000..63c1b00 --- /dev/null +++ b/t/t4033-diff-patience.sh @@ -0,0 +1,168 @@ +#!/bin/sh + +test_description='patience diff algorithm' + +. 
./test-lib.sh + +cat > file1 << EOF +#include <stdio.h> + +// Frobs foo heartily +int frobnitz(int foo) +{ + int i; + for(i = 0; i < 10; i++) + { + printf("Your answer is: "); + printf("%d\n", foo); + } +} + +int fact(int n) +{ + if(n > 1) + { + return fact(n-1) * n; + } + return 1; +} + +int main(int argc, char **argv) +{ + frobnitz(fact(10)); +} +EOF + +cat > file2 << EOF +#include <stdio.h> + +int fib(int n) +{ + if(n > 2) + { + return fib(n-1) + fib(n-2); + } + return 1; +} + +// Frobs foo heartily +int frobnitz(int foo) +{ + int i; + for(i = 0; i < 10; i++) + { + printf("%d\n", foo); + } +} + +int main(int argc, char **argv) +{ + frobnitz(fib(10)); +} +EOF + +cat > expect << EOF +diff --git a/file1 b/file2 +index 6faa5a3..e3af329 100644 +--- a/file1 ++++ b/file2 +@@ -1,26 +1,25 @@ + #include <stdio.h> + ++int fib(int n) ++{ ++ if(n > 2) ++ { ++ return fib(n-1) + fib(n-2); ++ } ++ return 1; ++} ++ + // Frobs foo heartily + int frobnitz(int foo) + { + int i; + for(i = 0; i < 10; i++) + { +- printf("Your answer is: "); + printf("%d\n", foo); + } + } + +-int fact(int n) +-{ +- if(n > 1) +- { +- return fact(n-1) * n; +- } +- return 1; +-} +- + int main(int argc, char **argv) + { +- frobnitz(fact(10)); ++ frobnitz(fib(10)); + } +EOF + +test_expect_success 'patience diff' ' + + test_must_fail git diff --no-index --patience file1 file2 > output && + test_cmp expect output + +' + +test_expect_success 'patience diff output is valid' ' + + mv file2 expect && + git apply < output && + test_cmp expect file2 + +' + +cat > uniq1 << EOF +1 +2 +3 +4 +5 +6 +EOF + +cat > uniq2 << EOF +a +b +c +d +e +f +EOF + +cat > expect << EOF +diff --git a/uniq1 b/uniq2 +index b414108..0fdf397 100644 +--- a/uniq1 ++++ b/uniq2 +@@ -1,6 +1,6 @@ +-1 +-2 +-3 +-4 +-5 +-6 ++a ++b ++c ++d ++e ++f +EOF + +test_expect_success 'completely different files' ' + + test_must_fail git diff --no-index --patience uniq1 uniq2 > output && + test_cmp expect output + +' + +test_done -- 1.6.1.rc3.412.ga72b ^ 
permalink raw reply related [flat|nested] 67+ messages in thread
* [PATCH 3/3] bash completions: Add the --patience option 2009-01-01 16:38 ` [PATCH 0/3] Teach Git about the patience diff algorithm Johannes Schindelin 2009-01-01 16:38 ` [PATCH 1/3] Implement " Johannes Schindelin 2009-01-01 16:39 ` [PATCH 2/3] Introduce the diff option '--patience' Johannes Schindelin @ 2009-01-01 16:39 ` Johannes Schindelin 2009-01-01 19:45 ` [PATCH 0/3] Teach Git about the patience diff algorithm Linus Torvalds 3 siblings, 0 replies; 67+ messages in thread From: Johannes Schindelin @ 2009-01-01 16:39 UTC (permalink / raw) To: Pierre Habouzit; +Cc: davidel, Francis Galiegue, Git ML Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> --- contrib/completion/git-completion.bash | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/contrib/completion/git-completion.bash b/contrib/completion/git-completion.bash index a046441..b3e1e22 100755 --- a/contrib/completion/git-completion.bash +++ b/contrib/completion/git-completion.bash @@ -777,6 +777,7 @@ _git_diff () --no-prefix --src-prefix= --dst-prefix= --base --ours --theirs --inter-hunk-context= + --patience " return ;; @@ -969,6 +970,7 @@ _git_log () --parents --children --full-history --merge --inter-hunk-context= + --patience " return ;; -- 1.6.1.rc3.412.ga72b ^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-01 16:38 ` [PATCH 0/3] Teach Git about the patience diff algorithm Johannes Schindelin
                     ` (2 preceding siblings ...)
  2009-01-01 16:39 ` [PATCH 3/3] bash completions: Add the --patience option Johannes Schindelin
@ 2009-01-01 19:45 ` Linus Torvalds
  2009-01-01 20:00   ` Linus Torvalds
  2009-01-01 20:46   ` [PATCH 0/3] Teach Git about " Adeodato Simó
  3 siblings, 2 replies; 67+ messages in thread
From: Linus Torvalds @ 2009-01-01 19:45 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Pierre Habouzit, davidel, Francis Galiegue, Git ML

On Thu, 1 Jan 2009, Johannes Schindelin wrote:
>
> Nothing fancy, really, just a straight-forward implementation of the
> heavily under-documented and under-analyzed patience diff algorithm.

Exactly because the patience diff is so under-documented, could you
perhaps give a few examples of how it differs in the result, and why
it's so wonderful?

Yes, yes, I can google, and no, no, nothing useful shows up except for
*totally* content-free fanboisms.

So could we have some actual real data on it?

		Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm 2009-01-01 19:45 ` [PATCH 0/3] Teach Git about the patience diff algorithm Linus Torvalds @ 2009-01-01 20:00 ` Linus Torvalds 2009-01-02 18:17 ` Johannes Schindelin 2009-01-02 21:59 ` [PATCH 1/3 v2] Implement " Johannes Schindelin 2009-01-01 20:46 ` [PATCH 0/3] Teach Git about " Adeodato Simó 1 sibling, 2 replies; 67+ messages in thread From: Linus Torvalds @ 2009-01-01 20:00 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Pierre Habouzit, davidel, Francis Galiegue, Git ML On Thu, 1 Jan 2009, Linus Torvalds wrote: > > So could we have some actual real data on it? .. and some testing. I tried to get some limited data for the kernel myself, by doing git log --patience -p v2.6.28.. > ~/patience but I just got a core-dump instead. Pinpointing it to a specific commit shows a smaller failure case: git show -p --patience 05d564fe00c05bf8ff93948057ca1acb5bc68e10 which might help you debug this. Linus --- #0 0x00000000004cce73 in xdl_get_rec (xdf=0x7fffcb1e08d0, ri=-9, rec=0x7fffcb1e0778) at xdiff/xemit.c:36 #1 0x00000000004cced1 in xdl_emit_record (xdf=0x7fffcb1e08d0, ri=-9, pre=0x4eef79 "-", ecb=0x7fffcb1e0bc0) at xdiff/xemit.c:46 #2 0x00000000004cd4e6 in xdl_emit_diff (xe=0x7fffcb1e08d0, xscr=0x1111daf0, ecb=0x7fffcb1e0bc0, xecfg=0x7fffcb1e0b80) at xdiff/xemit.c:179 #3 0x00000000004caa2c in xdl_diff (mf1=0x7fffcb1e0a40, mf2=0x7fffcb1e0a30, xpp=0x7fffcb1e0bd0, xecfg=0x7fffcb1e0b80, ecb=0x7fffcb1e0bc0) at xdiff/xdiffi.c:559 #4 0x00000000004c088d in xdi_diff (mf1=0x7fffcb1e0c00, mf2=0x7fffcb1e0bf0, xpp=0x7fffcb1e0bd0, xecfg=0x7fffcb1e0b80, xecb=0x7fffcb1e0bc0) at xdiff-interface.c:137 #5 0x00000000004c0914 in xdi_diff_outf (mf1=0x7fffcb1e0c00, mf2=0x7fffcb1e0bf0, fn=0x475448 <fn_out_consume>, consume_callback_data=0x7fffcb1e0b40, xpp=0x7fffcb1e0bd0, xecfg=0x7fffcb1e0b80, xecb=0x7fffcb1e0bc0) at xdiff-interface.c:154 #6 0x00000000004780dc in builtin_diff (name_a=0x25cf6f0 "fs/nfs/nfs4xdr.c", name_b=0x25cf6f0 
"fs/nfs/nfs4xdr.c", one=0x25cf690, two=0x26ae110, xfrm_msg=0xf659900 "index 7dde309..29656c5 100644", o=0x7fffcb1e1088, complete_rewrite=0) at diff.c:1486 #7 0x00000000004796e4 in run_diff_cmd (pgm=0x0, name=0x25cf6f0 "fs/nfs/nfs4xdr.c", other=0x0, attr_path=0x25cf6f0 "fs/nfs/nfs4xdr.c", one=0x25cf690, two=0x26ae110, xfrm_msg=0xf659900 "index 7dde309..29656c5 100644", o=0x7fffcb1e1088, complete_rewrite=0) at diff.c:2024 #8 0x0000000000479e2e in run_diff (p=0xaffece0, o=0x7fffcb1e1088) at diff.c:2158 #9 0x000000000047b959 in diff_flush_patch (p=0xaffece0, o=0x7fffcb1e1088) at diff.c:2743 #10 0x000000000047c942 in diff_flush (options=0x7fffcb1e1088) at diff.c:3184 #11 0x0000000000488b75 in log_tree_diff_flush (opt=0x7fffcb1e0f40) at log-tree.c:451 #12 0x0000000000488d17 in log_tree_diff (opt=0x7fffcb1e0f40, commit=0x2673198, log=0x7fffcb1e0ec0) at log-tree.c:503 #13 0x0000000000488da4 in log_tree_commit (opt=0x7fffcb1e0f40, commit=0x2673198) at log-tree.c:526 #14 0x000000000043218d in cmd_log_walk (rev=0x7fffcb1e0f40) at builtin-log.c:201 #15 0x0000000000432bae in cmd_log (argc=4, argv=0x7fffcb1e14b0, prefix=0x0) at builtin-log.c:423 #16 0x000000000040486b in run_command (p=0x70c7b0, argc=4, argv=0x7fffcb1e14b0) at git.c:243 #17 0x0000000000404a1c in handle_internal_command (argc=4, argv=0x7fffcb1e14b0) at git.c:387 #18 0x0000000000404c6e in main (argc=4, argv=0x7fffcb1e14b0) at git.c:484 ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm 2009-01-01 20:00 ` Linus Torvalds @ 2009-01-02 18:17 ` Johannes Schindelin 2009-01-02 18:49 ` Linus Torvalds 2009-01-02 18:51 ` Jeff King 2009-01-02 21:59 ` [PATCH 1/3 v2] Implement " Johannes Schindelin 1 sibling, 2 replies; 67+ messages in thread From: Johannes Schindelin @ 2009-01-02 18:17 UTC (permalink / raw) To: Linus Torvalds; +Cc: Pierre Habouzit, davidel, Francis Galiegue, Git ML Hi, On Thu, 1 Jan 2009, Linus Torvalds wrote: > On Thu, 1 Jan 2009, Linus Torvalds wrote: > > > > So could we have some actual real data on it? > > .. and some testing. I tried to get some limited data for the kernel > myself, by doing > > git log --patience -p v2.6.28.. > ~/patience > > but I just got a core-dump instead. > > Pinpointing it to a specific commit shows a smaller failure case: > > git show -p --patience 05d564fe00c05bf8ff93948057ca1acb5bc68e10 > > which might help you debug this. Thanks. I am on it. valgrind finds an earlier place in xdl_change_compact() which I think is rather more sensible, but at the same time a bit worrisome, too, as I did not expect any errors _that_ late in the game (I did not touch that code). BTW the "-p" is not necessary with "show", indeed, you cannot even switch it off. Ciao, Dscho ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm 2009-01-02 18:17 ` Johannes Schindelin @ 2009-01-02 18:49 ` Linus Torvalds 2009-01-02 19:07 ` Johannes Schindelin 2009-01-02 18:51 ` Jeff King 1 sibling, 1 reply; 67+ messages in thread From: Linus Torvalds @ 2009-01-02 18:49 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Pierre Habouzit, davidel, Francis Galiegue, Git ML On Fri, 2 Jan 2009, Johannes Schindelin wrote: > > BTW the "-p" is not necessary with "show", indeed, you cannot even switch > it off. I was just switching back-and-forth between "git log" and "git show" so the -p came from just that, and is not necessary. And you _can_ suppress the patch generation - use "-s". Linus ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm 2009-01-02 18:49 ` Linus Torvalds @ 2009-01-02 19:07 ` Johannes Schindelin 0 siblings, 0 replies; 67+ messages in thread From: Johannes Schindelin @ 2009-01-02 19:07 UTC (permalink / raw) To: Linus Torvalds; +Cc: Pierre Habouzit, davidel, Francis Galiegue, Git ML Hi, On Fri, 2 Jan 2009, Linus Torvalds wrote: > On Fri, 2 Jan 2009, Johannes Schindelin wrote: > > > > BTW the "-p" is not necessary with "show", indeed, you cannot even > > switch it off. > > I was just switching back-and-forth between "git log" and "git show" so > the -p came from just that, and is not necessary. > > And you _can_ suppress the patch generation - use "-s". Ah, another thing learnt. Thanks, Dscho ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm 2009-01-02 18:17 ` Johannes Schindelin 2009-01-02 18:49 ` Linus Torvalds @ 2009-01-02 18:51 ` Jeff King 1 sibling, 0 replies; 67+ messages in thread From: Jeff King @ 2009-01-02 18:51 UTC (permalink / raw) To: Johannes Schindelin; +Cc: git On Fri, Jan 02, 2009 at 07:17:34PM +0100, Johannes Schindelin wrote: > BTW the "-p" is not necessary with "show", indeed, you cannot even switch > it off. Half true: $ git grep -A1 '"-s"' diff.c diff.c: else if (!strcmp(arg, "-s")) diff.c- options->output_format |= DIFF_FORMAT_NO_OUTPUT; -Peff ^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH 1/3 v2] Implement the patience diff algorithm 2009-01-01 20:00 ` Linus Torvalds 2009-01-02 18:17 ` Johannes Schindelin @ 2009-01-02 21:59 ` Johannes Schindelin 2009-01-02 21:59 ` Johannes Schindelin 1 sibling, 1 reply; 67+ messages in thread From: Johannes Schindelin @ 2009-01-02 21:59 UTC (permalink / raw) To: Linus Torvalds; +Cc: Pierre Habouzit, davidel, Francis Galiegue, Git ML

The patience diff algorithm produces slightly more intuitive output than the
classic Myers algorithm, as it does not try to minimize the number of +/-
lines first, but tries to preserve the lines that are unique.

To this end, it first determines lines that are unique in both files, then
the maximal sequence which preserves the order (relative to both files) is
extracted.

Starting from this initial set of common lines, the rest of the lines are
handled recursively, with Myers' algorithm as a fallback when the patience
algorithm fails (due to no common unique lines).

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---

	I did not realize that xdl_prepare_env() initializes the arrays
	in rchg (which tell which lines are not common).  Unfortunately,
	there are ambiguities, e.g. with empty lines, and my
	implementation wanted to take other common lines, disagreeing
	with the previous initialization.

	Interdiff follows.
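For readers who want to experiment with the idea described in the commit message outside of xdiff, the core of the algorithm can be sketched in a few lines of Python. This is a simplified illustration, not the C implementation from the patch; the name `patience_core` and the pair-based representation are invented for this sketch:

```python
from collections import Counter
import bisect

def patience_core(a, b):
    """Return the longest order-preserving sequence of lines that are
    unique in both a and b, as (index_in_a, index_in_b) pairs."""
    count_a, count_b = Counter(a), Counter(b)
    pos_b = {line: i for i, line in enumerate(b)}
    # position pairs for lines unique in both files, ordered by file a
    pairs = [(i, pos_b[line]) for i, line in enumerate(a)
             if count_a[line] == 1 and count_b[line] == 1]
    # patience sorting: longest increasing subsequence on the b-positions,
    # keeping one "best" sequence tail per sequence length
    piles = []
    back = {}   # predecessor links to recover the sequence
    for pair in pairs:
        j = bisect.bisect_left([p[1] for p in piles], pair[1])
        back[pair] = piles[j - 1] if j > 0 else None
        if j == len(piles):
            piles.append(pair)
        else:
            piles[j] = pair
    # walk the predecessor chain back from the tail of the longest pile
    result = []
    pair = piles[-1] if piles else None
    while pair is not None:
        result.append(pair)
        pair = back[pair]
    return result[::-1]
```

The C code below does the same thing with a custom hash map instead of Python dictionaries, grows the matched pairs into line ranges, and recurses between them.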
 xdiff/xdiff.h     |    1 +
 xdiff/xdiffi.c    |    3 +
 xdiff/xdiffi.h    |    2 +
 xdiff/xpatience.c |  385 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 391 insertions(+), 0 deletions(-)
 create mode 100644 xdiff/xpatience.c

diff --git a/xdiff/xdiff.h b/xdiff/xdiff.h
index 361f802..4da052a 100644
--- a/xdiff/xdiff.h
+++ b/xdiff/xdiff.h
@@ -32,6 +32,7 @@ extern "C" {
 #define XDF_IGNORE_WHITESPACE (1 << 2)
 #define XDF_IGNORE_WHITESPACE_CHANGE (1 << 3)
 #define XDF_IGNORE_WHITESPACE_AT_EOL (1 << 4)
+#define XDF_PATIENCE_DIFF (1 << 5)
 #define XDF_WHITESPACE_FLAGS (XDF_IGNORE_WHITESPACE | XDF_IGNORE_WHITESPACE_CHANGE | XDF_IGNORE_WHITESPACE_AT_EOL)
 
 #define XDL_PATCH_NORMAL '-'
diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 9d0324a..3e97462 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -329,6 +329,9 @@ int xdl_do_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
 	xdalgoenv_t xenv;
 	diffdata_t dd1, dd2;
 
+	if (xpp->flags & XDF_PATIENCE_DIFF)
+		return xdl_do_patience_diff(mf1, mf2, xpp, xe);
+
 	if (xdl_prepare_env(mf1, mf2, xpp, xe) < 0) {
 
 		return -1;
diff --git a/xdiff/xdiffi.h b/xdiff/xdiffi.h
index 3e099dc..ad033a8 100644
--- a/xdiff/xdiffi.h
+++ b/xdiff/xdiffi.h
@@ -55,5 +55,7 @@ int xdl_build_script(xdfenv_t *xe, xdchange_t **xscr);
 void xdl_free_script(xdchange_t *xscr);
 int xdl_emit_diff(xdfenv_t *xe, xdchange_t *xscr, xdemitcb_t *ecb,
 		  xdemitconf_t const *xecfg);
+int xdl_do_patience_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
+		xdfenv_t *env);
 
 #endif /* #if !defined(XDIFFI_H) */
diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
new file mode 100644
index 0000000..d01cbdd
--- /dev/null
+++ b/xdiff/xpatience.c
@@ -0,0 +1,385 @@
+/*
+ *  LibXDiff by Davide Libenzi ( File Differential Library )
+ *  Copyright (C) 2003-2009 Davide Libenzi, Johannes E. Schindelin
+ *
+ *  This library is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU Lesser General Public
+ *  License as published by the Free Software Foundation; either
+ *  version 2.1 of the License, or (at your option) any later version.
+ *
+ *  This library is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ *  Lesser General Public License for more details.
+ *
+ *  You should have received a copy of the GNU Lesser General Public
+ *  License along with this library; if not, write to the Free Software
+ *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307  USA
+ *
+ *  Davide Libenzi <davidel@xmailserver.org>
+ *
+ */
+#include "xinclude.h"
+#include "xtypes.h"
+#include "xdiff.h"
+
+/*
+ * The basic idea of patience diff is to find lines that are unique in
+ * both files.  These are intuitively the ones that we want to see as
+ * common lines.
+ *
+ * The maximal ordered sequence of such line pairs (where ordered means
+ * that the order in the sequence agrees with the order of the lines in
+ * both files) naturally defines an initial set of common lines.
+ *
+ * Now, the algorithm tries to extend the set of common lines by growing
+ * the line ranges where the files have identical lines.
+ *
+ * Between those common lines, the patience diff algorithm is applied
+ * recursively, until no unique line pairs can be found; these line ranges
+ * are handled by the well-known Myers algorithm.
+ */
+
+#define NON_UNIQUE ULONG_MAX
+
+/*
+ * This is a hash mapping from line hash to line numbers in the first and
+ * second file.
+ */
+struct hashmap {
+	int nr, alloc;
+	struct entry {
+		unsigned long hash;
+		/*
+		 * 0 = unused entry, 1 = first line, 2 = second, etc.
+		 * line2 is NON_UNIQUE if the line is not unique
+		 * in either the first or the second file.
+		 */
+		unsigned long line1, line2;
+		/*
+		 * "next" & "previous" are used for the longest common
+		 * sequence;
+		 * initially, "next" reflects only the order in file1.
+		 */
+		struct entry *next, *previous;
+	} *entries, *first, *last;
+	/* were common records found? */
+	unsigned long has_matches;
+	mmfile_t *file1, *file2;
+	xdfenv_t *env;
+	xpparam_t const *xpp;
+};
+
+/* The argument "pass" is 1 for the first file, 2 for the second. */
+static void insert_record(int line, struct hashmap *map, int pass)
+{
+	xrecord_t **records = pass == 1 ?
+		map->env->xdf1.recs : map->env->xdf2.recs;
+	xrecord_t *record = records[line - 1], *other;
+	/*
+	 * After xdl_prepare_env() (or more precisely, due to
+	 * xdl_classify_record()), the "ha" member of the records (AKA lines)
+	 * is _not_ the hash anymore, but a linearized version of it.  In
+	 * other words, the "ha" member is guaranteed to start with 0 and
+	 * the second record's ha can only be 0 or 1, etc.
+	 *
+	 * So we multiply ha by 2 in the hope that the hashing was
+	 * "unique enough".
+	 */
+	int index = (int)((record->ha << 1) % map->alloc);
+
+	while (map->entries[index].line1) {
+		other = map->env->xdf1.recs[map->entries[index].line1 - 1];
+		if (map->entries[index].hash != record->ha ||
+				!xdl_recmatch(record->ptr, record->size,
+					other->ptr, other->size,
+					map->xpp->flags)) {
+			if (++index >= map->alloc)
+				index = 0;
+			continue;
+		}
+		if (pass == 2)
+			map->has_matches = 1;
+		if (pass == 1 || map->entries[index].line2)
+			map->entries[index].line2 = NON_UNIQUE;
+		else
+			map->entries[index].line2 = line;
+		return;
+	}
+	if (pass == 2)
+		return;
+	map->entries[index].line1 = line;
+	map->entries[index].hash = record->ha;
+	if (!map->first)
+		map->first = map->entries + index;
+	if (map->last) {
+		map->last->next = map->entries + index;
+		map->entries[index].previous = map->last;
+	}
+	map->last = map->entries + index;
+	map->nr++;
+}
+
+/*
+ * This function has to be called for each recursion into the inter-hunk
+ * parts, as previously non-unique lines can become unique when being
+ * restricted to a smaller part of the files.
+ *
+ * It is assumed that env has been prepared using xdl_prepare().
+ */
+static int fill_hashmap(mmfile_t *file1, mmfile_t *file2,
+		xpparam_t const *xpp, xdfenv_t *env,
+		struct hashmap *result,
+		int line1, int count1, int line2, int count2)
+{
+	result->file1 = file1;
+	result->file2 = file2;
+	result->xpp = xpp;
+	result->env = env;
+
+	/* We know exactly how large we want the hash map */
+	result->alloc = count1 * 2;
+	result->entries = (struct entry *)
+		xdl_malloc(result->alloc * sizeof(struct entry));
+	if (!result->entries)
+		return -1;
+	memset(result->entries, 0, result->alloc * sizeof(struct entry));
+
+	/* First, fill with entries from the first file */
+	while (count1--)
+		insert_record(line1++, result, 1);
+
+	/* Then search for matches in the second file */
+	while (count2--)
+		insert_record(line2++, result, 2);
+
+	return 0;
+}
+
+/*
+ * Find the longest sequence with a smaller last element (meaning a smaller
+ * line2, as we construct the sequence with entries ordered by line1).
+ */
+static int binary_search(struct entry **sequence, int longest,
+		struct entry *entry)
+{
+	int left = -1, right = longest;
+
+	while (left + 1 < right) {
+		int middle = (left + right) / 2;
+		/* by construction, no two entries can be equal */
+		if (sequence[middle]->line2 > entry->line2)
+			right = middle;
+		else
+			left = middle;
+	}
+	/* return the index in "sequence", _not_ the sequence length */
+	return left;
+}
+
+/*
+ * The idea is to start with the list of common unique lines sorted by
+ * the order in file1.  For each of these pairs, the longest (partial)
+ * sequence whose last element's line2 is smaller is determined.
+ *
+ * For efficiency, the sequences are kept in a list containing exactly one
+ * item per sequence length: the sequence with the smallest last
+ * element (in terms of line2).
+ */
+static struct entry *find_longest_common_sequence(struct hashmap *map)
+{
+	struct entry **sequence = xdl_malloc(map->nr * sizeof(struct entry *));
+	int longest = 0, i;
+	struct entry *entry;
+
+	for (entry = map->first; entry; entry = entry->next) {
+		if (!entry->line2 || entry->line2 == NON_UNIQUE)
+			continue;
+		i = binary_search(sequence, longest, entry);
+		entry->previous = i < 0 ? NULL : sequence[i];
+		sequence[++i] = entry;
+		if (i == longest)
+			longest++;
+	}
+
+	/* No common unique lines were found */
+	if (!longest)
+		return NULL;
+
+	/* Iterate starting at the last element, adjusting the "next" members */
+	entry = sequence[longest - 1];
+	entry->next = NULL;
+	while (entry->previous) {
+		entry->previous->next = entry;
+		entry = entry->previous;
+	}
+	return entry;
+}
+
+static int match(struct hashmap *map, int line1, int line2)
+{
+	xrecord_t *record1 = map->env->xdf1.recs[line1 - 1];
+	xrecord_t *record2 = map->env->xdf2.recs[line2 - 1];
+	return xdl_recmatch(record1->ptr, record1->size,
+		record2->ptr, record2->size, map->xpp->flags);
+}
+
+static int patience_diff(mmfile_t *file1, mmfile_t *file2,
+		xpparam_t const *xpp, xdfenv_t *env,
+		int line1, int count1, int line2, int count2);
+
+static int walk_common_sequence(struct hashmap *map, struct entry *first,
+		int line1, int count1, int line2, int count2)
+{
+	int end1 = line1 + count1, end2 = line2 + count2;
+	int next1, next2;
+
+	for (;;) {
+		/* Try to grow the line ranges of common lines */
+		if (first) {
+			next1 = first->line1;
+			next2 = first->line2;
+			while (next1 > line1 && next2 > line2 &&
+					match(map, next1 - 1, next2 - 1)) {
+				next1--;
+				next2--;
+			}
+		} else {
+			next1 = end1;
+			next2 = end2;
+		}
+		while (line1 < next1 && line2 < next2 &&
+				match(map, line1, line2)) {
+			line1++;
+			line2++;
+		}
+
+		/* Recurse */
+		if (next1 > line1 || next2 > line2) {
+			struct hashmap submap;
+
+			memset(&submap, 0, sizeof(submap));
+			if (patience_diff(map->file1, map->file2,
+					map->xpp, map->env,
+					line1, next1 - line1,
+					line2, next2 - line2))
+				return -1;
+		}
+
+		if (!first)
+			return 0;
+
+		while (first->next &&
+				first->next->line1 == first->line1 + 1 &&
+				first->next->line2 == first->line2 + 1)
+			first = first->next;
+
+		line1 = first->line1 + 1;
+		line2 = first->line2 + 1;
+
+		first = first->next;
+	}
+}
+
+static int fall_back_to_classic_diff(struct hashmap *map,
+		int line1, int count1, int line2, int count2)
+{
+	/*
+	 * This probably does not work outside Git, since
+	 * we have a very simple mmfile structure.
+	 *
+	 * Note: ideally, we would reuse the prepared environment, but
+	 * the libxdiff interface does not (yet) allow for diffing only
+	 * ranges of lines instead of the whole files.
+	 */
+	mmfile_t subfile1, subfile2;
+	xpparam_t xpp;
+	xdfenv_t env;
+
+	subfile1.ptr = (char *)map->env->xdf1.recs[line1 - 1]->ptr;
+	subfile1.size = map->env->xdf1.recs[line1 + count1 - 2]->ptr +
+		map->env->xdf1.recs[line1 + count1 - 2]->size - subfile1.ptr;
+	subfile2.ptr = (char *)map->env->xdf2.recs[line2 - 1]->ptr;
+	subfile2.size = map->env->xdf2.recs[line2 + count2 - 2]->ptr +
+		map->env->xdf2.recs[line2 + count2 - 2]->size - subfile2.ptr;
+	xpp.flags = map->xpp->flags & ~XDF_PATIENCE_DIFF;
+	if (xdl_do_diff(&subfile1, &subfile2, &xpp, &env) < 0)
+		return -1;
+
+	memcpy(map->env->xdf1.rchg + line1 - 1, env.xdf1.rchg, count1);
+	memcpy(map->env->xdf2.rchg + line2 - 1, env.xdf2.rchg, count2);
+
+	xdl_free_env(&env);
+
+	return 0;
+}
+
+/*
+ * Recursively find the longest common sequence of unique lines,
+ * and if none was found, ask xdl_do_diff() to do the job.
+ *
+ * This function assumes that env was prepared with xdl_prepare_env().
+ */
+static int patience_diff(mmfile_t *file1, mmfile_t *file2,
+		xpparam_t const *xpp, xdfenv_t *env,
+		int line1, int count1, int line2, int count2)
+{
+	struct hashmap map;
+	struct entry *first;
+	int result = 0;
+
+	/* trivial case: one side is empty */
+	if (!count1) {
+		while(count2--)
+			env->xdf2.rchg[line2++ - 1] = 1;
+		return 0;
+	} else if (!count2) {
+		while(count1--)
+			env->xdf1.rchg[line1++ - 1] = 1;
+		return 0;
+	}
+
+	memset(&map, 0, sizeof(map));
+	if (fill_hashmap(file1, file2, xpp, env, &map,
+			line1, count1, line2, count2))
+		return -1;
+
+	/* are there any matching lines at all? */
+	if (!map.has_matches) {
+		while(count1--)
+			env->xdf1.rchg[line1++ - 1] = 1;
+		while(count2--)
+			env->xdf2.rchg[line2++ - 1] = 1;
+		return 0;
+	}
+
+	first = find_longest_common_sequence(&map);
+	if (first)
+		result = walk_common_sequence(&map, first,
+			line1, count1, line2, count2);
+	else
+		result = fall_back_to_classic_diff(&map,
+			line1, count1, line2, count2);
+
+	return result;
+}
+
+int xdl_do_patience_diff(mmfile_t *file1, mmfile_t *file2,
+		xpparam_t const *xpp, xdfenv_t *env)
+{
+	if (xdl_prepare_env(file1, file2, xpp, env) < 0)
+		return -1;
+
+	/*
+	 * It is a pity that xdl_cleanup_records() not only marks those
+	 * lines as changes that are only present in one file, but also
+	 * lines that have multiple matches and happen to be in a "run
+	 * of discardable lines" that patience diff happens to split
+	 * differently.
+	 */
+	memset(env->xdf1.rchg, 0, env->xdf1.nrec);
+	memset(env->xdf2.rchg, 0, env->xdf2.nrec);
+	/* environment is cleaned up in xdl_diff() */
+	return patience_diff(file1, file2, xpp, env,
+		1, env->xdf1.nrec, 1, env->xdf2.nrec);
+}
-- 
1.6.1.rc3.224.g95ac9

^ permalink raw reply related	[flat|nested] 67+ messages in thread
* Re: [PATCH 1/3 v2] Implement the patience diff algorithm 2009-01-02 21:59 ` [PATCH 1/3 v2] Implement " Johannes Schindelin @ 2009-01-02 21:59 ` Johannes Schindelin 0 siblings, 0 replies; 67+ messages in thread From: Johannes Schindelin @ 2009-01-02 21:59 UTC (permalink / raw) To: Linus Torvalds; +Cc: Pierre Habouzit, davidel, Francis Galiegue, Git ML

Hi,

The interdiff between v1 and v2 of PATCH 1/3.  As you can see, I also
added a cleanup of an intermediate xdlenv.

Ciao,
Dscho

 xdiff/xpatience.c |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
index 6687940..d01cbdd 100644
--- a/xdiff/xpatience.c
+++ b/xdiff/xpatience.c
@@ -309,6 +309,8 @@ static int fall_back_to_classic_diff(struct hashmap *map,
 	memcpy(map->env->xdf1.rchg + line1 - 1, env.xdf1.rchg, count1);
 	memcpy(map->env->xdf2.rchg + line2 - 1, env.xdf2.rchg, count2);
 
+	xdl_free_env(&env);
+
 	return 0;
 }
 
@@ -368,6 +370,15 @@ int xdl_do_patience_diff(mmfile_t *file1, mmfile_t *file2,
 	if (xdl_prepare_env(file1, file2, xpp, env) < 0)
 		return -1;
 
+	/*
+	 * It is a pity that xdl_cleanup_records() not only marks those
+	 * lines as changes that are only present in one file, but also
+	 * lines that have multiple matches and happen to be in a "run
+	 * of discardable lines" that patience diff happens to split
+	 * differently.
+	 */
+	memset(env->xdf1.rchg, 0, env->xdf1.nrec);
+	memset(env->xdf2.rchg, 0, env->xdf2.nrec);
 	/* environment is cleaned up in xdl_diff() */
 	return patience_diff(file1, file2, xpp, env,
 		1, env->xdf1.nrec, 1, env->xdf2.nrec);
-- 
1.6.1.rc3.224.g95ac9

^ permalink raw reply related	[flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm 2009-01-01 19:45 ` [PATCH 0/3] Teach Git about the patience diff algorithm Linus Torvalds 2009-01-01 20:00 ` Linus Torvalds @ 2009-01-01 20:46 ` Adeodato Simó 2009-01-02 1:56 ` Linus Torvalds 1 sibling, 1 reply; 67+ messages in thread From: Adeodato Simó @ 2009-01-01 20:46 UTC (permalink / raw) To: Linus Torvalds Cc: Johannes Schindelin, Pierre Habouzit, davidel, Francis Galiegue, Git ML

[-- Attachment #1: Type: text/plain, Size: 4390 bytes --]

* Linus Torvalds [Thu, 01 Jan 2009 11:45:21 -0800]:

> On Thu, 1 Jan 2009, Johannes Schindelin wrote:

> > Nothing fancy, really, just a straight-forward implementation of the
> > heavily under-documented and under-analyzed patience diff algorithm.

> Exactly because the patience diff is so under-documented, could you
> perhaps give a few examples of how it differs in the result, and why it's
> so wonderful? Yes, yes, I can google, and no, no, nothing useful shows up
> except for *totally* content-free fanboisms.

> So could we have some actual real data on it?

For me, the cases where I find patience output to be of substantially
higher readability are those involving a rewrite of several consecutive
paragraphs (i.e., lines of code separated by blank lines).
Compare:

-8<- git -8<-
@@ -51,29 +51,30 @@ def mbox_update(bug):
             f.close()
     else:
         # make a list of Message-Id we have
-        fp1 = file(path, 'ab+')
-        ids1 = [ x.get('Message-Id') for x in mailbox.UnixMailbox(fp1) ]
+        msgids = { x.get('Message-Id') for x in mailbox.mbox(path) }
 
-        # get remote mbox again
-        fp2 = tempfile.TemporaryFile()
-        retrieve_mbox(bug, fp2)
+        with tempfile.NamedTemporaryFile() as tmpfd:
+            # retrieve the remote mbox again
+            retrieve_mbox(bug, tmpfd)
 
-        # parse its messages
-        fp2.seek(0)
-        parser = email.Parser.Parser()
-        msgs2 = dict((x['Message-Id'], x)
-                     for x in mailbox.UnixMailbox(fp2, parser.parse))
+            # parse its messages
+            parser = email.parser.Parser()
+            new_msgids = { x['Message-Id']: x
+                           for x in mailbox.mbox(tmpfd.name, parser.parse) }
 
-        # now append the new ones
-        for msgid in set(msgs2.keys()) - set(ids1):
-            fp1.write('\n' + msgs2[msgid].as_string(unixfrom=True))
+        with open(path, 'a+') as fd:
+            # now append the new messages
+            for msgid in new_msgids.keys() - msgids:
+                fd.write('\n' + new_msgids[msgid].as_string(unixfrom=True))
 
     return path
->8- git ->8-

with:

-8<- bzr patience -8<-
@@ -51,29 +51,30 @@
             f.close()
     else:
         # make a list of Message-Id we have
-        fp1 = file(path, 'ab+')
-        ids1 = [ x.get('Message-Id') for x in mailbox.UnixMailbox(fp1) ]
-
-        # get remote mbox again
-        fp2 = tempfile.TemporaryFile()
-        retrieve_mbox(bug, fp2)
-
-        # parse its messages
-        fp2.seek(0)
-        parser = email.Parser.Parser()
-        msgs2 = dict((x['Message-Id'], x)
-                     for x in mailbox.UnixMailbox(fp2, parser.parse))
-
-        # now append the new ones
-        for msgid in set(msgs2.keys()) - set(ids1):
-            fp1.write('\n' + msgs2[msgid].as_string(unixfrom=True))
+        msgids = { x.get('Message-Id') for x in mailbox.mbox(path) }
+
+        with tempfile.NamedTemporaryFile() as tmpfd:
+            # retrieve the remote mbox again
+            retrieve_mbox(bug, tmpfd)
+
+            # parse its messages
+            parser = email.parser.Parser()
+            new_msgids = { x['Message-Id']: x
+                           for x in mailbox.mbox(tmpfd.name, parser.parse) }
+
+        with open(path, 'a+') as fd:
+            # now append the new messages
+            for msgid in new_msgids.keys() - msgids:
+                fd.write('\n' + new_msgids[msgid].as_string(unixfrom=True))
 
     return path
->8- bzr patience ->8-

I don't know about you, but I find the latter much easier to read,
because the whole context of each version is always available. As you
see, in (at least) this case it is just a matter of considering the blank
lines worthy of being presented as common, or not.

I'll note that in this particular case, `git diff` yielded the very same
results with or without --patience. I don't know why that is, Johannes?
I'll also note that /usr/bin/diff produces (in this case) something
closer to patience than to git.

I'm attaching both versions of the file in case they are useful to
anybody.

-- 
Adeodato Simó                                     dato at net.com.org.es
Debian Developer                                  adeodato at debian.org

I promise you. Once I enter into an exclusive relationship, I sleep with
very few people.
                -- Denny Crane

[-- Attachment #2: bdo0 --]
[-- Type: text/plain, Size: 2358 bytes --]

#! /usr/bin/python
## vim: fileencoding=utf-8

"""Open Debian BTS mboxes with Mutt, à la /usr/bin/bts show --mbox.

A cache of mboxes is kept, and changed mboxes will be merged with existing
files instead of replacing them, so that e.g. read-status is preserved for
each message.
"""

import os
import re
import sys
import urllib
import mailbox
import tempfile
import email.Parser

MBOX_DIR = os.path.expanduser('~/.mail/y.bug-cache')

##

def main():
    if len(sys.argv) != 2:
        print >>sys.stderr, 'Usage: %s <bugnumber>' % (sys.argv[0],)
        sys.exit(1)

    bug = re.sub(r'[^0-9]', '', sys.argv[1])

    if not re.match(r'\d{4,}$', bug):
        print >>sys.stderr, \
            'E: %s does not seem a valid number' % (sys.argv[1],)
        sys.exit(1)

    path = mbox_update(bug)
    invoke_mailer(path)

##

def mbox_update(bug):
    """Return a path with an up-to-date copy of the mbox for bug."""
    path = os.path.join(MBOX_DIR, bug + '.mbox')

    if not os.path.exists(path):
        f = file(path, 'wb')
        try:
            retrieve_mbox(bug, f)
        except:
            os.unlink(path)
            raise
        else:
            f.close()
    else:
        # make a list of Message-Id we have
        fp1 = file(path, 'ab+')
        ids1 = [ x.get('Message-Id') for x in mailbox.UnixMailbox(fp1) ]

        # get remote mbox again
        fp2 = tempfile.TemporaryFile()
        retrieve_mbox(bug, fp2)

        # parse its messages
        fp2.seek(0)
        parser = email.Parser.Parser()
        msgs2 = dict((x['Message-Id'], x)
                     for x in mailbox.UnixMailbox(fp2, parser.parse))

        # now append the new ones
        for msgid in set(msgs2.keys()) - set(ids1):
            fp1.write('\n' + msgs2[msgid].as_string(unixfrom=True))

    return path

def retrieve_mbox(bug, fileobj):
    """Retrieve mbox for bug from bugs.debian.org, writing it to fileobj."""
    for line in urllib.urlopen(
            'http://bugs.debian.org/cgi-bin/bugreport.cgi?mboxstatus=yes;mboxmaint=yes;mbox=yes;bug=%s' % (bug,)):
        fileobj.write(line)

def invoke_mailer(path):
    """Exec mutt, opening path."""
    os.execlp('mutt', 'mutt', '-f', path)

##

if __name__ == '__main__':
    try:
        sys.exit(main())
    except KeyboardInterrupt:
        print >>sys.stderr, '\nCancelled.'
        sys.exit(1)

[-- Attachment #3: bdo1 --]
[-- Type: text/plain, Size: 2501 bytes --]

#! /usr/bin/python3

"""Open Debian BTS mboxes with Mutt, à la /usr/bin/bts show --mbox.

A cache of mboxes is kept, and changed mboxes will be merged with existing
files instead of replacing them, so that e.g. read-status is preserved for
each message.
"""

import os
import re
import sys
import mailbox
import tempfile
import email.parser
import urllib.request

MBOX_DIR = os.path.expanduser('~/.mail/y.bug-cache')

##

def main():
    if len(sys.argv) != 2:
        print('Usage: {0} <bugnumber>'.format(sys.argv[0]), file=sys.stderr)
        return 1
    else:
        bug = re.sub(r'[^0-9]', '', sys.argv[1])

    if not re.search(r'^\d{4,}$', bug):
        print('E: {0} does not seem a valid number'.format(sys.argv[1]),
              file=sys.stderr)
        return 1

    path = mbox_update(bug)
    invoke_mailer(path)

##

def mbox_update(bug):
    """Return a path with an up-to-date copy of the mbox for bug."""
    path = os.path.join(MBOX_DIR, bug + '.mbox')

    if not os.path.exists(path):
        f = open(path, 'wb')
        try:
            retrieve_mbox(bug, f)
        except:
            os.unlink(path)
            raise
        else:
            f.close()
    else:
        # make a list of Message-Id we have
        msgids = { x.get('Message-Id') for x in mailbox.mbox(path) }

        with tempfile.NamedTemporaryFile() as tmpfd:
            # retrieve the remote mbox again
            retrieve_mbox(bug, tmpfd)

            # parse its messages
            parser = email.parser.Parser()
            new_msgids = { x['Message-Id']: x
                           for x in mailbox.mbox(tmpfd.name, parser.parse) }

        with open(path, 'a+') as fd:
            # now append the new messages
            for msgid in new_msgids.keys() - msgids:
                fd.write('\n' + new_msgids[msgid].as_string(unixfrom=True))

    return path

def retrieve_mbox(bug, fileobj):
    """Retrieve mbox for bug from bugs.debian.org, writing it to fileobj."""
    url = urllib.request.urlopen(
        'http://bugs.debian.org/cgi-bin/bugreport.cgi?'
        'mboxstatus=yes;mboxmaint=yes;mbox=yes;bug={0}'.format(bug))
    for line in url.fp:  # http://bugs.python.org/issue4608
        fileobj.write(line)

def invoke_mailer(path):
    """Exec mutt, opening path."""
    os.execlp('mutt', 'mutt', '-f', path)

##

if __name__ == '__main__':
    try:
        sys.exit(main())
    except KeyboardInterrupt:
        print('\nCancelled.', file=sys.stderr)
        sys.exit(1)

^ permalink raw reply	[flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm 2009-01-01 20:46 ` [PATCH 0/3] Teach Git about " Adeodato Simó @ 2009-01-02 1:56 ` Linus Torvalds 2009-01-02 10:55 ` Clemens Buchacher 2009-01-02 18:50 ` Adeodato Simó 0 siblings, 2 replies; 67+ messages in thread From: Linus Torvalds @ 2009-01-02 1:56 UTC (permalink / raw) To: Adeodato Simó Cc: Johannes Schindelin, Pierre Habouzit, davidel, Francis Galiegue, Git ML On Thu, 1 Jan 2009, Adeodato Simó wrote: > > For me, the cases where I find patience output to be of substantial > higher readability are those involving a rewrite of several consecutive > paragraphs (i.e., lines of code separated by blank lines). Compare: I don't think that's a "patience diff" issue. That's simply an issue of merging consecutive diff fragments together if they are close-by, and that's independent of the actual diff algorithm itself. > I'll note that in this particular case, `git diff` yielded the very same > results with or without --patience. I don't know why that is, Johannes? > I'll also note that /usr/bin/diff produces (in this case) something > closer to patience than to git. See above - I really don't think this has anything to do with "patience vs non-patience". It's more akin to the things we do for our merge conflict markers: if we have two merge conflicts next to each other, with just a couple of lines in between, we coalesce the merge conflicts into one larger one instead. We don't do that for regular diffs - they're always kept minimal (ok, not really minimal, but as close to minimal as the algorithm finds them). See commit f407f14deaa14ebddd0d27238523ced8eca74393 for the git merge conflict merging. We _could_ do similar things for regular diffs. It's sometimes useful, sometimes not. Linus ^ permalink raw reply [flat|nested] 67+ messages in thread
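The coalescing Linus describes is a pure post-processing step, independent of which diff algorithm produced the hunks. A rough sketch of the idea (the helper name and hunk representation are invented here, not git's actual code):

```python
def coalesce_hunks(hunks, context=3):
    """Merge hunks whose surrounding context lines would touch or overlap.

    Each hunk is a (start, end) half-open line range in the old file;
    two hunks closer than 2*context lines are shown as one larger hunk.
    """
    merged = []
    for start, end in sorted(hunks):
        if merged and start - merged[-1][1] <= 2 * context:
            # gap fits inside the shared context: extend the previous hunk
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

With the default of three context lines, hunks separated by up to six unchanged lines would be folded together, which is exactly the "couple of lines in between" case from the merge-conflict coalescing.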
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm 2009-01-02 1:56 ` Linus Torvalds @ 2009-01-02 10:55 ` Clemens Buchacher 2009-01-02 10:58 ` Clemens Buchacher 2009-01-02 11:03 ` Junio C Hamano 2009-01-02 18:50 ` Adeodato Simó 1 sibling, 2 replies; 67+ messages in thread From: Clemens Buchacher @ 2009-01-02 10:55 UTC (permalink / raw) To: Linus Torvalds Cc: Adeodato Simó, Johannes Schindelin, Pierre Habouzit, davidel, Francis Galiegue, Git ML On Thu, Jan 01, 2009 at 05:56:13PM -0800, Linus Torvalds wrote: > See above - I really don't think this has anything to do with "patience vs > non-patience". It's more akin to the things we do for our merge conflict > markers: if we have two merge conflicts next to each other, with just a > couple of lines in between, we coalesce the merge conflicts into one > larger one instead. But patience diff does more than that. Take a look at "git diff" and "git diff --patience" output below (taken from [1]). You can see that patience diff produces two separate hunks, one removing a function and one adding a function. Merging consecutive diff fragments would have produced one big hunk instead. 
*** git diff ****************************** git diff --patience ********** #include <stdio.h> | #include <stdio.h> | +int fib(int n) |-// Frobs foo heartily +{ |-int frobnitz(int foo) + if(n > 2) |+int fib(int n) + { | { + return fib(n-1) + fib(n-2); |- int i; + } |- for(i = 0; i < 10; i++) + return 1; |+ if(n > 2) +} | { + |- printf("Your answer is: "); // Frobs foo heartily |- printf("%d\n", foo); int frobnitz(int foo) |+ return fib(n-1) + fib(n-2); { | } int i; |+ return 1; for(i = 0; i < 10; i++) | } { | - printf("Your answer is: "); |-int fact(int n) printf("%d\n", foo); |+// Frobs foo heartily } |+int frobnitz(int foo) } | { |- if(n > 1) -int fact(int n) |+ int i; -{ |+ for(i = 0; i < 10; i++) - if(n > 1) | { - { |- return fact(n-1) * n; - return fact(n-1) * n; |+ printf("%d\n", foo); - } | } - return 1; |- return 1; -} | } - | int main(int argc, char **argv) | int main(int argc, char **argv) { | { - frobnitz(fact(10)); |- frobnitz(fact(10)); + frobnitz(fib(10)); |+ frobnitz(fib(10)); } | } [1] http://alfedenzo.livejournal.com/170301.html ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm 2009-01-02 10:55 ` Clemens Buchacher @ 2009-01-02 10:58 ` Clemens Buchacher 2009-01-02 16:42 ` Linus Torvalds 2009-01-02 11:03 ` Junio C Hamano 1 sibling, 1 reply; 67+ messages in thread From: Clemens Buchacher @ 2009-01-02 10:58 UTC (permalink / raw) To: Linus Torvalds Cc: Adeodato Simó, Johannes Schindelin, Pierre Habouzit, davidel, Francis Galiegue, Git ML Only two choices, and I still get it wrong. The diffs should be labeled the other way around, of course. *** git diff --patience ******************** git diff ******************** #include <stdio.h> | #include <stdio.h> | +int fib(int n) |-// Frobs foo heartily +{ |-int frobnitz(int foo) + if(n > 2) |+int fib(int n) + { | { + return fib(n-1) + fib(n-2); |- int i; + } |- for(i = 0; i < 10; i++) + return 1; |+ if(n > 2) +} | { + |- printf("Your answer is: "); // Frobs foo heartily |- printf("%d\n", foo); int frobnitz(int foo) |+ return fib(n-1) + fib(n-2); { | } int i; |+ return 1; for(i = 0; i < 10; i++) | } { | - printf("Your answer is: "); |-int fact(int n) printf("%d\n", foo); |+// Frobs foo heartily } |+int frobnitz(int foo) } | { |- if(n > 1) -int fact(int n) |+ int i; -{ |+ for(i = 0; i < 10; i++) - if(n > 1) | { - { |- return fact(n-1) * n; - return fact(n-1) * n; |+ printf("%d\n", foo); - } | } - return 1; |- return 1; -} | } - | int main(int argc, char **argv) | int main(int argc, char **argv) { | { - frobnitz(fact(10)); |- frobnitz(fact(10)); + frobnitz(fib(10)); |+ frobnitz(fib(10)); } | } ^ permalink raw reply [flat|nested] 67+ messages in thread
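The reason the two algorithms split this example differently can be checked mechanically: patience diff only synchronizes on lines that occur exactly once on each side, so braces and blank lines are never used as anchors. A small sketch, using abridged versions of the example files (the helper name is invented for this illustration):

```python
from collections import Counter

def unique_common_lines(a, b):
    """Lines occurring exactly once in both files -- the only candidate
    sync points for patience diff."""
    ca, cb = Counter(a), Counter(b)
    return [line for line in a if ca[line] == 1 and cb[line] == 1]

# abridged old/new versions of the frobnitz example
old = ["// Frobs foo heartily", "int frobnitz(int foo)", "{",
       '    printf("%d\\n", foo);', "}", "int fact(int n)", "{", "}"]
new = ["int fib(int n)", "{", "}", "// Frobs foo heartily",
       "int frobnitz(int foo)", "{", '    printf("%d\\n", foo);', "}"]

anchors = unique_common_lines(old, new)
```

Here `anchors` contains the frobnitz lines but not the braces, so patience diff keeps frobnitz intact as a sync point and emits fib as a separate added hunk, while Myers is free to match the braces across the two functions.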
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-02 10:58 ` Clemens Buchacher
@ 2009-01-02 16:42 ` Linus Torvalds
  2009-01-02 18:46 ` Johannes Schindelin
  ` (2 more replies)
  0 siblings, 3 replies; 67+ messages in thread
From: Linus Torvalds @ 2009-01-02 16:42 UTC (permalink / raw)
To: Clemens Buchacher
Cc: Adeodato Simó, Johannes Schindelin, Pierre Habouzit, davidel, Francis Galiegue, Git ML

On Fri, 2 Jan 2009, Clemens Buchacher wrote:
>
> Only two choices, and I still get it wrong. The diffs should be labeled the
> other way around, of course.

Yes, this one is a real patience diff change, but it's also the same one
that I've seen in the google fanboi findings. What google did _not_ show
was any real-life examples, or anybody doing any critical analysis.

So I was hoping for something else than a single "in this case patience
diff works really well". I was hoping to see what it does in real life.
But when I tried it on the kernel archive, I get a core dump.

For example, in real life, files are bigger, and unique lines are not
necessarily always common (generated files, whatever). Depending on
unique line ordering may work fine in 95% of all cases, but do you know
that it works fine in general? Does it work when 50% of lines are
unique? I believe it does. Does it work when just 1% of lines are
unique? I just don't know.

And I haven't seen _any_ real critical analysis of it. Anywhere.

		Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-02 16:42 ` Linus Torvalds
@ 2009-01-02 18:46 ` Johannes Schindelin
  2009-01-02 19:03 ` Linus Torvalds
  2009-01-02 21:59 ` [PATCH 0/3] Teach Git about the patience diff algorithm Johannes Schindelin
  2009-01-08 19:55 ` Adeodato Simó
  2 siblings, 1 reply; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-02 18:46 UTC (permalink / raw)
To: Linus Torvalds
Cc: Clemens Buchacher, Adeodato Simó, Pierre Habouzit, davidel, Francis Galiegue, Git ML

Hi,

On Fri, 2 Jan 2009, Linus Torvalds wrote:

> On Fri, 2 Jan 2009, Clemens Buchacher wrote:
> >
> > Only two choices, and I still get it wrong. The diffs should be
> > labeled the other way around, of course.
>
> Yes, this one is a real patience diff change, but it's also the same one
> that I've seen in the google fanboi findings. What google did _not_ show
> was any real-life examples, or anybody doing any critical analysis.

FWIW it's the test case in the commit introducing the --patience option.

> So I was hoping for something else than a single "in this case patience
> diff works really well". I was hoping to see what it does in real life.

I will dig out a real-world example where I _know_ patience diff would
have helped. (I remember that I rewrote a pretty large diff which was
sent on this list, only to understand what it actually does, and I am
pretty certain this is a good real-world showcase.)

But yes, I agree, the thing does not matter _all_ that much in reality.

The case where I expect a real improvement is when you modify a function
and insert a function just before it, and Myers' algorithm matches mainly
empty lines and lines ending in curly brackets. In other words: something
I tried to tackle with ee95ec5d(xdl_merge(): introduce
XDL_MERGE_ZEALOUS_ALNUM) for merges.

The typical look of such a diff is something like this:

-<... some function header ...>
+<... a completely different function header ...>
 {
-        <... variables ...>
+        <... other variables ...>
         for (i = 0; i < 10; i++) {
-                <... some code ...>
+                <... some code doing something completely different ...>
         }
         return 0;
 }
+<... the function header which was removed earlier ...>
+{
+        <... refactored _and also reindented_ code ...>

> And I haven't seen _any_ real critical analysis of it. Anywhere.

Neither have I. Let alone something close to documentation. For example,
when the "patience diff algorithm" is explained, it looks more like a
longest common sequence algorithm when the input is already sorted in
the first item.

Further, there is no rigorous analysis of the runtime (I figured that
the original runtime is O(nm) where "n" is the number of lines and "m"
is the length of the maximal ordered sequence of common unique lines,
and my implementation can only improve that to O(n log(m))). This could
be improved, I think, for the most common case where you have pretty
long common _continuous_ sequences of unique lines, i.e. large ranges of
lines that are identical.

The runtime is especially horrible in the light of the runtime of Myers'
algorithm, which uses O(n d), where d is the edit distance, i.e. the
number of lines added + number of lines removed. (Note: in the real
world, there are substantial speed ups for consecutive edits, i.e. line
ranges where there are no common lines at all.)

Also, I am less than thrilled by the job the fanbois did coming up with
"convincing" evidence: exactly as you pointed out, there are _no_
real-world examples where it helps.

And the worst part: one can only _guess_ what motivated patience diff.
I imagine it came from the observation that function headers are unique,
and that you usually want to preserve as much context around them.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-02 18:46 ` Johannes Schindelin
@ 2009-01-02 19:03 ` Linus Torvalds
  2009-01-02 19:22 ` Johannes Schindelin
  2009-01-02 19:39 ` Jeff King
  0 siblings, 2 replies; 67+ messages in thread
From: Linus Torvalds @ 2009-01-02 19:03 UTC (permalink / raw)
To: Johannes Schindelin
Cc: Clemens Buchacher, Adeodato Simó, Pierre Habouzit, davidel, Francis Galiegue, Git ML

On Fri, 2 Jan 2009, Johannes Schindelin wrote:
>
> FWIW it's the test case in the commit introducing the --patience option.

Well, it's also the test-case in the very first hit on google for
"patience diff" (with the quotes).

In fact, it's the _only_ one I ever found ;)

> And the worst part: one can only _guess_ what motivated patience diff. I
> imagine it came from the observation that function headers are unique, and
> that you usually want to preserve as much context around them.

Well, I do like the notion of giving more weight to unique lines - I
think it makes sense. That said, I suspect it would make almost as much
sense to give more weight simply to _longer_ lines, and I suspect the
standard Myers' algorithm could possibly be simply extended to take line
size into account when calculating the weights.

Because the problem with diffs for C doesn't really tend to be as much
about non-unique lines as about just _trivial_ lines that are mostly
empty or contain just braces etc. Those are quite arguably almost
totally worthless for equality testing.

And btw, don't get me wrong - I don't think there is anything wrong with
the patience diff. I think it's a worthy thing to try, and I'm not at
all arguing against it. However, I do think that the people arguing for
it often do so based on less-than-very-logical arguments, and it's
entirely possible that other approaches are better (eg the "weight by
size" thing rather than "weight by uniqueness").

The thing about unique lines is that there are no guarantees at all that
they even exist, so a uniqueness-based thing will always have to fall
back on anything else. That, to me, implies that the whole notion is
somewhat mis-designed: it's clearly not a generic concept.

(In contrast, taking the length of the matching lines into account would
not have that kind of bad special case)

		Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread
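Linus's "weight by size" alternative can be made concrete with a toy dynamic program: a common-subsequence search where a matched line scores its character length instead of 1, so that long, information-rich lines win over braces and blank lines as sync points. This is a sketch of the proposal only — the function names are invented, and nothing like this exists in git:

```python
def weighted_lcs(a, b, weight=len):
    """Common subsequence of line lists a and b that maximizes the total
    weight of the matched lines (weight=len prefers longer lines)."""
    n, m = len(a), len(b)
    # score[i][j]: best achievable weight matching a[i:] against b[j:].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                # Matching equal lines is always optimal here, since equal
                # content has equal weight (exchange argument).
                score[i][j] = weight(a[i]) + score[i + 1][j + 1]
            else:
                score[i][j] = max(score[i + 1][j], score[i][j + 1])

    # Walk the table to recover the matched lines.
    out, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif score[i + 1][j] >= score[i][j + 1]:
            i += 1
        else:
            j += 1
    return out
```

With weight=len, a lone "{" (weight 1) loses to a long statement line, which is the behavior being argued for; note this toy is O(nm) in time and space, so it dodges none of the performance questions Dscho raises in his reply.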
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-02 19:03 ` Linus Torvalds
@ 2009-01-02 19:22 ` Johannes Schindelin
  2009-01-02 19:39 ` Jeff King
  1 sibling, 0 replies; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-02 19:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: Clemens Buchacher, Adeodato Simó, Pierre Habouzit, davidel, Francis Galiegue, Git ML

Hi,

On Fri, 2 Jan 2009, Linus Torvalds wrote:

> On Fri, 2 Jan 2009, Johannes Schindelin wrote:
>
> > And the worst part: one can only _guess_ what motivated patience diff.
> > I imagine it came from the observation that function headers are
> > unique, and that you usually want to preserve as much context around
> > them.
>
> Well, I do like the notion of giving more weight to unique lines - I think
> it makes sense. That said, I suspect it would make almost as much sense to
> give more weight simply to _longer_ lines, and I suspect the standard
> Myers' algorithm could possibly be simply extended to take line size into
> account when calculating the weights.

I think that it makes more sense with the common unique lines, because
they are _unambiguously_ common. That is, if possible, we would like to
keep them as common lines. BTW this also opens the door for more
automatic code movement detection.

As for the longer lines: what exactly would you want to put "weight" on?
The edit distance is the number of plusses and minusses, and this is the
thing that actually is pretty critical for the performance: the larger
the distance, the longer it takes. So if you want to put a different
"weight" on a line, i.e. something else than a "1", you are risking a
substantially worse performance.

And I am still not convinced that longer lines should get more weight. A
line starting with "exit:" can be much more important than 8 tabs
followed by a curly bracket.

> Because the problem with diffs for C doesn't really tend to be as much
> about non-unique lines as about just _trivial_ lines that are mostly empty
> or contain just braces etc. Those are quite arguably almost totally
> worthless for equality testing.

Oh, but then we get very C specific. Let's not go there.

> And btw, don't get me wrong - I don't think there is anything wrong with
> the patience diff. I think it's a worthy thing to try, and I'm not at
> all arguing against it.

I never took you to be opposed. I myself mainly implemented it because I
wanted to play with it, and have something more useful than what I found
in the whole wide web.

> However, I do think that the people arguing for it often do so based on
> less-than-very-logical arguments, and it's entirely possible that other
> approaches are better (eg the "weight by size" thing rather than "weight
> by uniqueness").

Actually, I think it is very possible that merging hunks if there are
not enough alnums between them could make a whole lot of sense. But IIRC
it was shot down and replaced by the not-so-specific-to-C algorithm to
merge hunks when the output is shorter than having separate hunks.

> The thing about unique lines is that there are no guarantees at all that
> they even exist, so a uniqueness-based thing will always have to fall
> back on anything else. That, to me, implies that the whole notion is
> somewhat mis-designed: it's clearly not a generic concept.
>
> (In contrast, taking the length of the matching lines into account would
> not have that kind of bad special case)

However, we know that humans like to start from the unique features they
see, and continue from there. And while it is quite possible that it is
the wrong approach (see all that security theater at the airport, and
some nice rational analyses about the cost/benefit equations there),
imitating humans is still the thing that will convince most humans.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-02 19:03 ` Linus Torvalds
  2009-01-02 19:22 ` Johannes Schindelin
@ 2009-01-02 19:39 ` Jeff King
  2009-01-02 19:50 ` Jeff King
  2009-01-03 16:24 ` Bazaar's patience diff as GIT_EXTERNAL_DIFF Adeodato Simó
  1 sibling, 2 replies; 67+ messages in thread
From: Jeff King @ 2009-01-02 19:39 UTC (permalink / raw)
To: Linus Torvalds
Cc: Johannes Schindelin, Clemens Buchacher, Adeodato Simó, Pierre Habouzit, davidel, Francis Galiegue, Git ML

On Fri, Jan 02, 2009 at 11:03:07AM -0800, Linus Torvalds wrote:

> Well, it's also the test-case in the very first hit on google for
> "patience diff" (with the quotes).
>
> In fact, it's the _only_ one I ever found ;)

If you just want to see the results on some real-world cases (and don't
care about measuring performance), try installing bzr and using their
patiencediff test program as a GIT_EXTERNAL_DIFF.

On Debian, it's:

  $ sudo apt-get install bzr
  $ cat >$HOME/patience <<'EOF'
  #!/bin/sh
  exec python /usr/share/pyshared/bzrlib/patiencediff.py "$2" "$5"
  EOF
  $ chmod 755 patience
  $ GIT_EXTERNAL_DIFF=$HOME/patience git diff

Other distributions presumably install patiencediff.py somewhere similar
(or you can maybe even pull it from the source, but presumably you have
to install some other bzr modules, too).

-Peff

^ permalink raw reply	[flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-02 19:39 ` Jeff King
@ 2009-01-02 19:50 ` Jeff King
  2009-01-02 20:52 ` Jeff King
  2009-01-03 16:24 ` Bazaar's patience diff as GIT_EXTERNAL_DIFF Adeodato Simó
  1 sibling, 1 reply; 67+ messages in thread
From: Jeff King @ 2009-01-02 19:50 UTC (permalink / raw)
To: Linus Torvalds
Cc: Johannes Schindelin, Clemens Buchacher, Adeodato Simó, Pierre Habouzit, davidel, Francis Galiegue, Git ML

On Fri, Jan 02, 2009 at 02:39:04PM -0500, Jeff King wrote:

> If you just want to see the results on some real-world cases (and don't
> care about measuring performance), try installing bzr and using their
> patiencediff test program as a GIT_EXTERNAL_DIFF.
>
> On Debian, it's:
>
>   $ sudo apt-get install bzr
>   $ cat >$HOME/patience <<'EOF'
>   #!/bin/sh
>   exec python /usr/share/pyshared/bzrlib/patiencediff.py "$2" "$5"
>   EOF
>   $ chmod 755 patience
>   $ GIT_EXTERNAL_DIFF=$HOME/patience git diff

For added fun, try this:

-- >8 --
#!/bin/sh

canonical() {
        grep '^[ +-]' | egrep -v '^(---|\+\+\+)'
}

git rev-list --no-merges HEAD |
while read rev; do
        git diff-tree -p $rev^ $rev | canonical >git.out
        GIT_EXTERNAL_DIFF=$HOME/patience git diff $rev^ $rev |
                canonical >bzr.out
        echo "`diff -U0 git.out bzr.out | wc -l` $rev"
done
-- 8< --

I'm running it on git.git now. It looks like both algorithms return the
same patch for most cases. Some of the differences are interesting,
though.

For example, f83b9ba209's commit message indicates that it moves the
"--format-patch" paragraph. Which is what "git diff" shows. Patience
diff shows it as moving other text _around_ that paragraph.

-Peff

^ permalink raw reply	[flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-02 19:50 ` Jeff King
@ 2009-01-02 20:52 ` Jeff King
  2009-01-02 23:05 ` Linus Torvalds
  0 siblings, 1 reply; 67+ messages in thread
From: Jeff King @ 2009-01-02 20:52 UTC (permalink / raw)
To: Linus Torvalds
Cc: Johannes Schindelin, Clemens Buchacher, Adeodato Simó, Pierre Habouzit, davidel, Francis Galiegue, Git ML

On Fri, Jan 02, 2009 at 02:50:53PM -0500, Jeff King wrote:

> For example, f83b9ba209's commit message indicates that it moves the
> "--format-patch" paragraph. Which is what "git diff" shows. Patience
> diff shows it as moving other text _around_ that paragraph.

Here's another interesting one: d592b315. The commit removes dashes from
git commands in test scripts. Git says:

         echo "tag-one-line" >expect &&
-        git-tag -l | grep "^tag-one-line" >actual &&
+        git tag -l | grep "^tag-one-line" >actual &&
         test_cmp expect actual &&
-        git-tag -n0 -l | grep "^tag-one-line" >actual &&
+        git tag -n0 -l | grep "^tag-one-line" >actual &&
         test_cmp expect actual &&
-        git-tag -n0 -l tag-one-line >actual &&
+        git tag -n0 -l tag-one-line >actual &&
         test_cmp expect actual &&

whereas patience says:

         echo "tag-one-line" >expect &&
-        git-tag -l | grep "^tag-one-line" >actual &&
-        test_cmp expect actual &&
-        git-tag -n0 -l | grep "^tag-one-line" >actual &&
-        test_cmp expect actual &&
-        git-tag -n0 -l tag-one-line >actual &&
+        git tag -l | grep "^tag-one-line" >actual &&
+        test_cmp expect actual &&
+        git tag -n0 -l | grep "^tag-one-line" >actual &&
+        test_cmp expect actual &&
+        git tag -n0 -l tag-one-line >actual &&
         test_cmp expect actual &&

which is exactly what patience is advertised to do: it's treating the
non-unique lines as uninteresting markers. But in this case they _are_
interesting, and I think the git output is more readable. And this is a
case where your "weight lines by length instead of uniqueness"
suggestion would perform better, I think.
-Peff ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-02 20:52 ` Jeff King
@ 2009-01-02 23:05 ` Linus Torvalds
  0 siblings, 0 replies; 67+ messages in thread
From: Linus Torvalds @ 2009-01-02 23:05 UTC (permalink / raw)
To: Jeff King
Cc: Johannes Schindelin, Clemens Buchacher, Adeodato Simó, Pierre Habouzit, davidel, Francis Galiegue, Git ML

On Fri, 2 Jan 2009, Jeff King wrote:
>
> Here's another interesting one: d592b315. The commit removes dashes
> from git commands in test scripts. Git says:
>
>          echo "tag-one-line" >expect &&
> -        git-tag -l | grep "^tag-one-line" >actual &&
> +        git tag -l | grep "^tag-one-line" >actual &&
>          test_cmp expect actual &&
> -        git-tag -n0 -l | grep "^tag-one-line" >actual &&
> +        git tag -n0 -l | grep "^tag-one-line" >actual &&
>          test_cmp expect actual &&
> -        git-tag -n0 -l tag-one-line >actual &&
> +        git tag -n0 -l tag-one-line >actual &&
>          test_cmp expect actual &&
>
> whereas patience says:
>
>          echo "tag-one-line" >expect &&
> -        git-tag -l | grep "^tag-one-line" >actual &&
> -        test_cmp expect actual &&
> -        git-tag -n0 -l | grep "^tag-one-line" >actual &&
> -        test_cmp expect actual &&
> -        git-tag -n0 -l tag-one-line >actual &&
> +        git tag -l | grep "^tag-one-line" >actual &&
> +        test_cmp expect actual &&
> +        git tag -n0 -l | grep "^tag-one-line" >actual &&
> +        test_cmp expect actual &&
> +        git tag -n0 -l tag-one-line >actual &&
>          test_cmp expect actual &&

Yeah, the bazaar version clearly is inferior here.

But again, I don't think that's actually a patience diff issue, I think
it's because the bazaar diff has merged consecutive diff lines too
aggressively. I suspect both patience and the straight Myers diff (that
git uses) actually find exactly the same differences, and then bazaar
has a "merge closeby -/+ pairs together if there is just a single
unmodified line in between them".

And _that_ is where it should care whether the unmodified line is
complex or not. If it's complex, you shouldn't merge it into the -/+
region.
(but yes, it's possible that the bazaar diff uses "uniqueness" as a complexity analysis marker, and while that is somewhat valid, it is _not_ valid enough to be useful. Unique lines tend to be complex, but complex lines are _not_ always unique) Linus ^ permalink raw reply [flat|nested] 67+ messages in thread
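The merging heuristic Linus sketches here — fold two adjacent -/+ regions together only when the unmodified line separating them is trivial, and keep them apart when it is complex — can be illustrated with a small sketch. Every name and the particular "trivial line" set below are invented for illustration; this is not bazaar's or git's actual logic:

```python
def is_trivial(line):
    """Context lines that carry almost no information for matching:
    blank lines, lone braces, a bare 'else'. (Illustrative set only;
    note that a label like 'exit:' deliberately does NOT count.)"""
    return line.strip() in ("", "{", "}", "};", "else")

def merge_hunks(hunks, lines):
    """Merge adjacent change hunks only when every context line between
    them is trivial; keep them apart when the separator is complex.

    hunks: list of (start, end) half-open index ranges into lines."""
    merged = [list(hunks[0])]
    for start, end in hunks[1:]:
        gap = lines[merged[-1][1]:start]
        if all(is_trivial(line) for line in gap):
            merged[-1][1] = end            # absorb the trivial separator
        else:
            merged.append([start, end])    # complex context: keep apart
    return [tuple(h) for h in merged]
```

In this sketch, two hunks separated only by a "}" collapse into one, while hunks separated by a real statement stay distinct — which is exactly the distinction the bazaar output in Jeff's d592b315 example fails to make.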
* Bazaar's patience diff as GIT_EXTERNAL_DIFF
  2009-01-02 19:39 ` Jeff King
  2009-01-02 19:50 ` Jeff King
@ 2009-01-03 16:24 ` Adeodato Simó
  1 sibling, 0 replies; 67+ messages in thread
From: Adeodato Simó @ 2009-01-03 16:24 UTC (permalink / raw)
To: Jeff King; +Cc: Git ML

[-- Attachment #1: Type: text/plain, Size: 974 bytes --]

* Jeff King [Fri, 02 Jan 2009 14:39:04 -0500]:

> If you just want to see the results on some real-world cases (and don't
> care about measuring performance), try installing bzr and using their
> patiencediff test program as a GIT_EXTERNAL_DIFF.

> On Debian, it's:

>   $ sudo apt-get install bzr
>   $ cat >$HOME/patience <<'EOF'
>   #!/bin/sh
>   exec python /usr/share/pyshared/bzrlib/patiencediff.py "$2" "$5"
>   EOF
>   $ chmod 755 patience
>   $ GIT_EXTERNAL_DIFF=$HOME/patience git diff

In case somebody's interested, I have this script lying around that does
that, and knows how to colorize the output using bzrtools, and honoring
color.{diff,ui}. No support for color.diff.<slot>, though.

-- 
Adeodato Simó                                     dato at net.com.org.es
Debian Developer                                  adeodato at debian.org

- Oh my God, you're pimping me out for a new roof?
- And windows!
        -- Andrew and Bree Van De Kamp

[-- Attachment #2: git_bzr_patience_diff --]
[-- Type: text/plain, Size: 1457 bytes --]

#! /usr/bin/python

import os
import sys
import stat
import subprocess

from bzrlib.patiencediff import unified_diff, PatienceSequenceMatcher

try:
    from bzrlib.plugins.bzrtools.colordiff import DiffWriter
except ImportError:
    _have_colordiff = False
else:
    _have_colordiff = True

##

def main():
    path = sys.argv[1]
    file1 = open(sys.argv[2], 'rb')
    file2 = open(sys.argv[5], 'rb')

    if use_color():
        writer = DiffWriter(sys.stdout, check_style=True)
    else:
        writer = sys.stdout

    for line in unified_diff(
            file1.readlines(), file2.readlines(), path, path,
            sequencematcher=PatienceSequenceMatcher):
        writer.write(line)

##

def use_color():
    if not _have_colordiff:
        return False

    for c in ['color.diff', 'color.ui']:
        p = subprocess.Popen(
            ['git', 'config', '--get', c], stdout=subprocess.PIPE)
        if p.wait() == 0:
            when = p.stdout.readline().strip()
            break
    else:
        return False

    if when == 'always':
        return True
    elif when in ['false', 'never']:
        return False
    elif when in ['true', 'auto']:
        stdout = sys.stdout.fileno()
        return (os.isatty(stdout)
                or stat.S_ISFIFO(os.fstat(sys.stdout.fileno()).st_mode))
    else:
        return False

##

if __name__ == '__main__':
    sys.exit(main())

^ permalink raw reply	[flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-02 16:42 ` Linus Torvalds
  2009-01-02 18:46 ` Johannes Schindelin
@ 2009-01-02 21:59 ` Johannes Schindelin
  2009-01-08 19:55 ` Adeodato Simó
  2 siblings, 0 replies; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-02 21:59 UTC (permalink / raw)
To: Linus Torvalds
Cc: Clemens Buchacher, Adeodato Simó, Pierre Habouzit, davidel, Francis Galiegue, Git ML

Hi,

On Fri, 2 Jan 2009, Linus Torvalds wrote:

> So I was hoping for something else than a single "in this case patience
> diff works really well". I was hoping to see what it does in real life.

Funnily, I think the test case you sent me is a pretty good example.
Look at this hunk (without patience diff):

@@ -4205,25 +4205,25 @@ out:
  */
 static int nfs4_xdr_dec_setattr(struct rpc_rqst *rqstp, __be32 *p, struct nfs_se
 {
-        struct xdr_stream xdr;
-        struct compound_hdr hdr;
-        int status;
-
-        xdr_init_decode(&xdr, &rqstp->rq_rcv_buf, p);
-        status = decode_compound_hdr(&xdr, &hdr);
-        if (status)
-                goto out;
-        status = decode_putfh(&xdr);
-        if (status)
-                goto out;
-        status = decode_setattr(&xdr, res);
-        if (status)
-                goto out;
+        struct xdr_stream xdr;
+        struct compound_hdr hdr;
+        int status;
+
+        xdr_init_decode(&xdr, &rqstp->rq_rcv_buf, p);

... and then it goes on with the whole reindented function. Compare this
to the same hunk _with_ patience diff:

@@ -4205,25 +4205,25 @@ out:
  */
 static int nfs4_xdr_dec_setattr(struct rpc_rqst *rqstp, __be32 *p, struct nfs_se
 {
-        struct xdr_stream xdr;
-        struct compound_hdr hdr;
-        int status;
+        struct xdr_stream xdr;
+        struct compound_hdr hdr;
+        int status;
 
-        xdr_init_decode(&xdr, &rqstp->rq_rcv_buf, p);
-        status = decode_compound_hdr(&xdr, &hdr);
-        if (status)
-                goto out;
-        status = decode_putfh(&xdr);
-        if (status)
-                goto out;
-        status = decode_setattr(&xdr, res);
-        if (status)
-                goto out;
+        xdr_init_decode(&xdr, &rqstp->rq_rcv_buf, p);
+        status = decode_compound_hdr(&xdr, &hdr);

... and again the rest is reindented code.

The difference? The common empty line. I actually find it more readable
to have the separation between the declarations and the code also in
the diff.

This is just a very feeble example, but you get the idea from there.

Oh, you might object that the empty line is not unique. But actually it
is, because the patience diff recurses into ever smaller line ranges
until it finally comes to such a small range that the empty line _is_
unique.

And in my analysis of the complexity, I stupidly left out that recursion
part. So: patience diff is _substantially_ more expensive than Myers'.

> But when I tried it on the kernel archive, I get a core dump.

I also got this trying on the git.git repository, with commit
be3cfa85([PATCH] Diff-tree-helper take two.)

Funnily, I almost got there trying the same before sending the first
revision, but I got impatient and stopped early. Tsk, tsk.

Now I tried with my complete clone of git.git together with all of my
topic branches, and it runs through without segmentation fault.

Patch follows.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm @ 2009-01-08 19:55 ` Adeodato Simó 2009-01-08 20:06 ` Adeodato Simó 2009-01-09 6:54 ` Junio C Hamano 0 siblings, 2 replies; 67+ messages in thread From: Adeodato Simó @ 2009-01-08 19:55 UTC (permalink / raw) To: Linus Torvalds Cc: Clemens Buchacher, Johannes Schindelin, Pierre Habouzit, davidel, Francis Galiegue, Git ML [-- Attachment #1: Type: text/plain, Size: 1062 bytes --] * Linus Torvalds [Fri, 02 Jan 2009 08:42:04 -0800]: > Yes, this one is a real patience diff change, but it's also the same one > that I've seen in the google fanboi findings. What google did _not_ show > was any real-life examples, or anybody doing any critical analysis. This comes a bit late and maybe it's redundant, but somebody just sent to a Debian mailing list a patch that was hard to read, and patience improved it. (I realize it's quite similar in spirit to the "toy patience example" that google returns, but this at list is a *real* example where patience helped me read a patch.) I'm also attaching bzr diff output, because it's still more readable IMHO. (I realize that's independent of patience, as you explained, but I'm making a point that it'd be nice to have this addressed by somebody knowledgeable.) Thanks, -- Adeodato Simó dato at net.com.org.es Debian Developer adeodato at debian.org - Are you sure we're good? - Always. -- Rory and Lorelai [-- Attachment #2: git.diff --] [-- Type: text/x-diff, Size: 3615 bytes --] diff --git util_sock.c util_sock.c index e20768e..f7b9145 100644 --- util_sock.c +++ util_sock.c @@ -1037,40 +1037,109 @@ NTSTATUS read_data(int fd, char *buffer, size_t N) } /**************************************************************************** - Write data to a fd. 
+ Write all data from an iov array ****************************************************************************/ -ssize_t write_data(int fd, const char *buffer, size_t N) +ssize_t write_data_iov(int fd, const struct iovec *orig_iov, int iovcnt) { - size_t total=0; - ssize_t ret; - char addr[INET6_ADDRSTRLEN]; + int i; + size_t to_send; + ssize_t thistime; + size_t sent; + struct iovec *iov_copy, *iov; - while (total < N) { - ret = sys_write(fd,buffer + total,N - total); + to_send = 0; + for (i=0; i<iovcnt; i++) { + to_send += orig_iov[i].iov_len; + } - if (ret == -1) { - if (fd == get_client_fd()) { - /* Try and give an error message saying - * what client failed. */ - DEBUG(0,("write_data: write failure in " - "writing to client %s. Error %s\n", - get_peer_addr(fd,addr,sizeof(addr)), - strerror(errno) )); - } else { - DEBUG(0,("write_data: write failure. " - "Error = %s\n", strerror(errno) )); + thistime = sys_writev(fd, orig_iov, iovcnt); + if ((thistime <= 0) || (thistime == to_send)) { + return thistime; + } + sent = thistime; + + /* + * We could not send everything in one call. Make a copy of iov that + * we can mess with. We keep a copy of the array start in iov_copy for + * the TALLOC_FREE, because we're going to modify iov later on, + * discarding elements. + */ + + iov_copy = (struct iovec *)TALLOC_MEMDUP( + talloc_tos(), orig_iov, sizeof(struct iovec) * iovcnt); + + if (iov_copy == NULL) { + errno = ENOMEM; + return -1; + } + iov = iov_copy; + + while (sent < to_send) { + /* + * We have to discard "thistime" bytes from the beginning + * iov array, "thistime" contains the number of bytes sent + * via writev last. 
+ */ + while (thistime > 0) { + if (thistime < iov[0].iov_len) { + char *new_base = + (char *)iov[0].iov_base + thistime; + iov[0].iov_base = new_base; + iov[0].iov_len -= thistime; + break; } - return -1; + thistime -= iov[0].iov_len; + iov += 1; + iovcnt -= 1; } - if (ret == 0) { - return total; + thistime = sys_writev(fd, iov, iovcnt); + if (thistime <= 0) { + break; } + sent += thistime; + } + + TALLOC_FREE(iov_copy); + return sent; +} + +/**************************************************************************** + Write data to a fd. +****************************************************************************/ + +/**************************************************************************** + Write data to a fd. +****************************************************************************/ + +ssize_t write_data(int fd, const char *buffer, size_t N) +{ + ssize_t ret; + struct iovec iov; + + iov.iov_base = CONST_DISCARD(char *, buffer); + iov.iov_len = N; + + ret = write_data_iov(fd, &iov, 1); + if (ret >= 0) { + return ret; + } - total += ret; + if (fd == get_client_fd()) { + char addr[INET6_ADDRSTRLEN]; + /* + * Try and give an error message saying what client failed. + */ + DEBUG(0, ("write_data: write failure in writing to client %s. " + "Error %s\n", get_peer_addr(fd,addr,sizeof(addr)), + strerror(errno))); + } else { + DEBUG(0,("write_data: write failure. 
Error = %s\n", + strerror(errno) )); } - return (ssize_t)total; + + return -1; } /**************************************************************************** [-- Attachment #3: git_patience.diff --] [-- Type: text/x-diff, Size: 3538 bytes --] diff --git util_sock.c util_sock.c index e20768e..f7b9145 100644 --- util_sock.c +++ util_sock.c @@ -1037,40 +1037,109 @@ NTSTATUS read_data(int fd, char *buffer, size_t N) } /**************************************************************************** + Write all data from an iov array +****************************************************************************/ + +ssize_t write_data_iov(int fd, const struct iovec *orig_iov, int iovcnt) +{ + int i; + size_t to_send; + ssize_t thistime; + size_t sent; + struct iovec *iov_copy, *iov; + + to_send = 0; + for (i=0; i<iovcnt; i++) { + to_send += orig_iov[i].iov_len; + } + + thistime = sys_writev(fd, orig_iov, iovcnt); + if ((thistime <= 0) || (thistime == to_send)) { + return thistime; + } + sent = thistime; + + /* + * We could not send everything in one call. Make a copy of iov that + * we can mess with. We keep a copy of the array start in iov_copy for + * the TALLOC_FREE, because we're going to modify iov later on, + * discarding elements. + */ + + iov_copy = (struct iovec *)TALLOC_MEMDUP( + talloc_tos(), orig_iov, sizeof(struct iovec) * iovcnt); + + if (iov_copy == NULL) { + errno = ENOMEM; + return -1; + } + iov = iov_copy; + + while (sent < to_send) { + /* + * We have to discard "thistime" bytes from the beginning + * iov array, "thistime" contains the number of bytes sent + * via writev last. 
+ */ + while (thistime > 0) { + if (thistime < iov[0].iov_len) { + char *new_base = + (char *)iov[0].iov_base + thistime; + iov[0].iov_base = new_base; + iov[0].iov_len -= thistime; + break; + } + thistime -= iov[0].iov_len; + iov += 1; + iovcnt -= 1; + } + + thistime = sys_writev(fd, iov, iovcnt); + if (thistime <= 0) { + break; + } + sent += thistime; + } + + TALLOC_FREE(iov_copy); + return sent; +} + +/**************************************************************************** + Write data to a fd. +****************************************************************************/ + +/**************************************************************************** Write data to a fd. ****************************************************************************/ ssize_t write_data(int fd, const char *buffer, size_t N) { - size_t total=0; ssize_t ret; - char addr[INET6_ADDRSTRLEN]; + struct iovec iov; - while (total < N) { - ret = sys_write(fd,buffer + total,N - total); + iov.iov_base = CONST_DISCARD(char *, buffer); + iov.iov_len = N; - if (ret == -1) { - if (fd == get_client_fd()) { - /* Try and give an error message saying - * what client failed. */ - DEBUG(0,("write_data: write failure in " - "writing to client %s. Error %s\n", - get_peer_addr(fd,addr,sizeof(addr)), - strerror(errno) )); - } else { - DEBUG(0,("write_data: write failure. " - "Error = %s\n", strerror(errno) )); - } - return -1; - } - - if (ret == 0) { - return total; - } + ret = write_data_iov(fd, &iov, 1); + if (ret >= 0) { + return ret; + } - total += ret; + if (fd == get_client_fd()) { + char addr[INET6_ADDRSTRLEN]; + /* + * Try and give an error message saying what client failed. + */ + DEBUG(0, ("write_data: write failure in writing to client %s. " + "Error %s\n", get_peer_addr(fd,addr,sizeof(addr)), + strerror(errno))); + } else { + DEBUG(0,("write_data: write failure. 
Error = %s\n", + strerror(errno) )); } - return (ssize_t)total; + + return -1; } /**************************************************************************** [-- Attachment #4: bzr.diff --] [-- Type: text/x-diff, Size: 3432 bytes --] --- util_sock.c +++ util_sock.c @@ -1037,40 +1037,109 @@ } /**************************************************************************** + Write all data from an iov array +****************************************************************************/ + +ssize_t write_data_iov(int fd, const struct iovec *orig_iov, int iovcnt) +{ + int i; + size_t to_send; + ssize_t thistime; + size_t sent; + struct iovec *iov_copy, *iov; + + to_send = 0; + for (i=0; i<iovcnt; i++) { + to_send += orig_iov[i].iov_len; + } + + thistime = sys_writev(fd, orig_iov, iovcnt); + if ((thistime <= 0) || (thistime == to_send)) { + return thistime; + } + sent = thistime; + + /* + * We could not send everything in one call. Make a copy of iov that + * we can mess with. We keep a copy of the array start in iov_copy for + * the TALLOC_FREE, because we're going to modify iov later on, + * discarding elements. + */ + + iov_copy = (struct iovec *)TALLOC_MEMDUP( + talloc_tos(), orig_iov, sizeof(struct iovec) * iovcnt); + + if (iov_copy == NULL) { + errno = ENOMEM; + return -1; + } + iov = iov_copy; + + while (sent < to_send) { + /* + * We have to discard "thistime" bytes from the beginning + * iov array, "thistime" contains the number of bytes sent + * via writev last. + */ + while (thistime > 0) { + if (thistime < iov[0].iov_len) { + char *new_base = + (char *)iov[0].iov_base + thistime; + iov[0].iov_base = new_base; + iov[0].iov_len -= thistime; + break; + } + thistime -= iov[0].iov_len; + iov += 1; + iovcnt -= 1; + } + + thistime = sys_writev(fd, iov, iovcnt); + if (thistime <= 0) { + break; + } + sent += thistime; + } + + TALLOC_FREE(iov_copy); + return sent; +} + +/**************************************************************************** + Write data to a fd. 
+****************************************************************************/ + +/**************************************************************************** Write data to a fd. ****************************************************************************/ ssize_t write_data(int fd, const char *buffer, size_t N) { - size_t total=0; ssize_t ret; - char addr[INET6_ADDRSTRLEN]; - - while (total < N) { - ret = sys_write(fd,buffer + total,N - total); - - if (ret == -1) { - if (fd == get_client_fd()) { - /* Try and give an error message saying - * what client failed. */ - DEBUG(0,("write_data: write failure in " - "writing to client %s. Error %s\n", - get_peer_addr(fd,addr,sizeof(addr)), - strerror(errno) )); - } else { - DEBUG(0,("write_data: write failure. " - "Error = %s\n", strerror(errno) )); - } - return -1; - } - - if (ret == 0) { - return total; - } - - total += ret; - } - return (ssize_t)total; + struct iovec iov; + + iov.iov_base = CONST_DISCARD(char *, buffer); + iov.iov_len = N; + + ret = write_data_iov(fd, &iov, 1); + if (ret >= 0) { + return ret; + } + + if (fd == get_client_fd()) { + char addr[INET6_ADDRSTRLEN]; + /* + * Try and give an error message saying what client failed. + */ + DEBUG(0, ("write_data: write failure in writing to client %s. " + "Error %s\n", get_peer_addr(fd,addr,sizeof(addr)), + strerror(errno))); + } else { + DEBUG(0,("write_data: write failure. Error = %s\n", + strerror(errno) )); + } + + return -1; } /**************************************************************************** ^ permalink raw reply [flat|nested] 67+ messages in thread
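The heart of write_data_iov() in the patches above is the bookkeeping after a short writev(): the bytes already sent must be discarded from the front of the iovec array before retrying. A minimal sketch of that inner loop, with Python bytes objects standing in for struct iovec entries (advance_iov is an invented name, not part of the patch):

```python
def advance_iov(iov, sent):
    """Discard 'sent' bytes from the front of an iovec list after a
    short writev(), mirroring the inner while-loop of the patch above.
    iov is a list of bytes objects; the list is modified in place and
    also returned."""
    while sent > 0:
        if sent < len(iov[0]):
            # Partial element: keep its unsent tail, stop discarding.
            iov[0] = iov[0][sent:]
            break
        # Whole element was sent: drop it and keep going.
        sent -= len(iov[0])
        iov.pop(0)
    return iov
```

The C version does the same thing in place, bumping iov_base and shrinking iov_len for a partially sent element instead of slicing.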
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm 2009-01-08 19:55 ` Adeodato Simó @ 2009-01-08 20:06 ` Adeodato Simó 2009-01-09 6:54 ` Junio C Hamano 1 sibling, 0 replies; 67+ messages in thread From: Adeodato Simó @ 2009-01-08 20:06 UTC (permalink / raw) To: git (My apologies for breaking the thread.) -- Adeodato Simó dato at net.com.org.es Debian Developer adeodato at debian.org «¡Pero si es tan español que debe de tener el cerebro en forma de botijo, con pitorro y todo!» -- Javier Cercas, “La velocidad de la luz” ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm 2009-01-08 19:55 ` Adeodato Simó 2009-01-08 20:06 ` Adeodato Simó @ 2009-01-09 6:54 ` Junio C Hamano 2009-01-09 13:07 ` Johannes Schindelin 1 sibling, 1 reply; 67+ messages in thread From: Junio C Hamano @ 2009-01-09 6:54 UTC (permalink / raw) To: Adeodato Simó Cc: Linus Torvalds, Clemens Buchacher, Johannes Schindelin, Pierre Habouzit, davidel, Francis Galiegue, Git ML Adeodato Simó <dato@net.com.org.es> writes: > * Linus Torvalds [Fri, 02 Jan 2009 08:42:04 -0800]: > >> Yes, this one is a real patience diff change, but it's also the same one >> that I've seen in the google fanboi findings. What google did _not_ show >> was any real-life examples, or anybody doing any critical analysis. > > This comes a bit late and maybe it's redundant, but somebody just sent > to a Debian mailing list a patch that was hard to read, and patience > improved it. (I realize it's quite similar in spirit to the "toy > patience example" that google returns, but this at least is a *real* > example where patience helped me read a patch.) > > I'm also attaching bzr diff output, because it's still more readable > IMHO. (I realize that's independent of patience, as you explained, but > I'm making a point that it'd be nice to have this addressed by somebody > knowledgeable.) I found the difference between the Bzr diff and the Dscho diff somewhat interesting. It shows that it sometimes makes the results easier to read to treat matching blank lines (and lines with only punctuation marks) as non-matching when they appear inside a block of other consecutive non-matching lines, even though the result may become larger. The part where Bzr gives this result: > +/**************************************************************************** > Write data to a fd.
> ****************************************************************************/ > > ssize_t write_data(int fd, const char *buffer, size_t N) > { > - size_t total=0; > ssize_t ret; > - char addr[INET6_ADDRSTRLEN]; > ... all "removed" > - while (total < N) { > - total += ret; > - } > - return (ssize_t)total; > + struct iovec iov; > + > + iov.iov_base = CONST_DISCARD(char *, buffer); > ... all "added" > + > + > + return -1; > } > > /**************************************************************************** is shown by the Dscho git-diff like this: > Write data to a fd. > ****************************************************************************/ > > ssize_t write_data(int fd, const char *buffer, size_t N) > { > - size_t total=0; > ssize_t ret; > - char addr[INET6_ADDRSTRLEN]; > + struct iovec iov; > > - while (total < N) { > - ret = sys_write(fd,buffer + total,N - total); > + iov.iov_base = CONST_DISCARD(char *, buffer); > + iov.iov_len = N; > > - if (ret == -1) { > - if (fd == get_client_fd()) { > ... all "removed" > - > - if (ret == 0) { > - return total; > - } > + ret = write_data_iov(fd, &iov, 1); > + if (ret >= 0) { > + return ret; > + } > > - total += ret; > + if (fd == get_client_fd()) { > + char addr[INET6_ADDRSTRLEN]; > ... all "added" > + DEBUG(0,("write_data: write failure. Error = %s\n", > + strerror(errno) )); > } > - return (ssize_t)total; > + > + return -1; > } If we find the "common" context lines that have only blank and punctuation letters in Dscho output, turn each of them into "-" and "+", and rearrange them so that all "-" are together followed by "+", it will match Bzr output. ^ permalink raw reply [flat|nested] 67+ messages in thread
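The transformation Junio describes above can be sketched as a toy post-processing pass over an already-computed diff: find runs of context lines that contain no alphanumeric character and sit between two hunks, then re-emit each such line as a removal plus an addition so the hunks fuse. This is only an illustration of the idea in Python, not the xdiff implementation; the final regrouping of all "-" lines before all "+" lines across the whole merged hunk is left out.

```python
import re

def collapse_non_alnum(ops):
    """Toy post-processing pass over a computed diff.  ops is a list
    of (tag, line) pairs with tag ' ' (context), '-' or '+'.  Context
    runs with no alphanumeric character that sit between two hunks are
    re-emitted as removals plus additions, fusing the hunks."""
    def insignificant(line):
        return re.search(r'[0-9A-Za-z]', line) is None

    out = list(ops)
    i = 0
    while i < len(out):
        if out[i][0] != ' ':
            i += 1
            continue
        # Find the maximal run of context lines starting at i.
        j = i
        while j < len(out) and out[j][0] == ' ':
            j += 1
        run = out[i:j]
        surrounded = i > 0 and j < len(out)   # hunks on both sides
        if surrounded and all(insignificant(line) for _, line in run):
            repl = ([('-', line) for _, line in run] +
                    [('+', line) for _, line in run])
            out[i:j] = repl
            i += len(repl)
        else:
            i = j
    return out
```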
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm 2009-01-09 6:54 ` Junio C Hamano @ 2009-01-09 13:07 ` Johannes Schindelin 2009-01-09 15:59 ` Adeodato Simó ` (2 more replies) 0 siblings, 3 replies; 67+ messages in thread From: Johannes Schindelin @ 2009-01-09 13:07 UTC (permalink / raw) To: Junio C Hamano Cc: Adeodato Simó, Linus Torvalds, Clemens Buchacher, Pierre Habouzit, davidel, Francis Galiegue, Git ML Hi, On Thu, 8 Jan 2009, Junio C Hamano wrote: > If we find the "common" context lines that have only blank and > punctuation letters in Dscho output, turn each of them into "-" and "+", > and rearrange them so that all "-" are together followed by "+", it will > match Bzr output. So we'd need something like this (I still think we should treat curly brackets the same as punctuation, and for good measure I just handled everything that is not alphanumerical the same): -- snipsnap -- [TOY PATCH] Add diff option '--collapse-non-alnums' With the option --collapse-non-alnums, there will be no interhunks consisting solely of non-alphanumerical letters. 
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> --- diff.c | 2 ++ xdiff/xdiff.h | 1 + xdiff/xdiffi.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++- 3 files changed, 50 insertions(+), 1 deletions(-) diff --git a/diff.c b/diff.c index c7ddb60..4b387fb 100644 --- a/diff.c +++ b/diff.c @@ -2503,6 +2503,8 @@ int diff_opt_parse(struct diff_options *options, const char **av, int ac) options->xdl_opts |= XDF_IGNORE_WHITESPACE_AT_EOL; else if (!strcmp(arg, "--patience")) options->xdl_opts |= XDF_PATIENCE_DIFF; + else if (!strcmp(arg, "--collapse-non-alnums")) + options->xdl_opts |= XDF_COLLAPSE_NON_ALNUMS; /* flags options */ else if (!strcmp(arg, "--binary")) { diff --git a/xdiff/xdiff.h b/xdiff/xdiff.h index 4da052a..a444f9a 100644 --- a/xdiff/xdiff.h +++ b/xdiff/xdiff.h @@ -33,6 +33,7 @@ extern "C" { #define XDF_IGNORE_WHITESPACE_CHANGE (1 << 3) #define XDF_IGNORE_WHITESPACE_AT_EOL (1 << 4) #define XDF_PATIENCE_DIFF (1 << 5) +#define XDF_COLLAPSE_NON_ALNUMS (1 << 6) #define XDF_WHITESPACE_FLAGS (XDF_IGNORE_WHITESPACE | XDF_IGNORE_WHITESPACE_CHANGE | XDF_IGNORE_WHITESPACE_AT_EOL) #define XDL_PATCH_NORMAL '-' diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c index 3e97462..b8e7ee8 100644 --- a/xdiff/xdiffi.c +++ b/xdiff/xdiffi.c @@ -396,6 +396,50 @@ static xdchange_t *xdl_add_change(xdchange_t *xscr, long i1, long i2, long chg1, return xch; } +static int xdl_record_contains_alnum(xrecord_t *record) +{ + long i; + for (i = 0; i < record->size; i++) + if (isalnum(record->ptr[i])) + return 1; + return 0; +} + +static int xdl_collapse_non_alnum(xdfile_t *xdf, xdfile_t *xdfo) +{ + long ix, ixo, len = 0; + + /* + * Collapse all interhunk parts consisting solely of non-alnum + * characters into the hunks. 
+ */ + for (ix = 0, ixo = 0; ix < xdf->nrec && ixo < xdfo->nrec; ix++, ixo++) { + if (xdf->rchg[ix] == 1 || xdfo->rchg[ixo] == 1) { + /* collapse non-alnum interhunks */ + while (len > 0) { + xdf->rchg[ix - len] = 1; + xdfo->rchg[ixo - len] = 1; + len--; + } + + /* look for end of hunk */ + while (ix < xdf->nrec && xdf->rchg[ix] == 1) + ix++; + while (ixo < xdfo->nrec && xdfo->rchg[ixo] == 1) + ixo++; + if (ix >= xdf->nrec) + return 0; + len = !xdl_record_contains_alnum(xdf->recs[ix]); + } + else if (len > 0) { + if (xdl_record_contains_alnum(xdf->recs[ix])) + len = 0; + else + len++; + } + } + return 0; +} int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) { long ix, ixo, ixs, ixref, grpsiz, nrec = xdf->nrec; @@ -548,7 +592,9 @@ int xdl_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, return -1; } - if (xdl_change_compact(&xe.xdf1, &xe.xdf2, xpp->flags) < 0 || + if (((xpp->flags & XDF_COLLAPSE_NON_ALNUMS) && + xdl_collapse_non_alnum(&xe.xdf1, &xe.xdf2)) || + xdl_change_compact(&xe.xdf1, &xe.xdf2, xpp->flags) < 0 || xdl_change_compact(&xe.xdf2, &xe.xdf1, xpp->flags) < 0 || xdl_build_script(&xe, &xscr) < 0) { -- 1.6.1.203.gc8be3 ^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm 2009-01-09 13:07 ` Johannes Schindelin @ 2009-01-09 15:59 ` Adeodato Simó 2009-01-09 18:09 ` Linus Torvalds 2009-01-09 20:53 ` Junio C Hamano 2 siblings, 0 replies; 67+ messages in thread From: Adeodato Simó @ 2009-01-09 15:59 UTC (permalink / raw) To: Johannes Schindelin Cc: Junio C Hamano, Linus Torvalds, Clemens Buchacher, Pierre Habouzit, davidel, Francis Galiegue, Git ML * Johannes Schindelin [Fri, 09 Jan 2009 14:07:28 +0100]: > > If we find the "common" context lines that have only blank and > > punctuation letters in Dscho output, turn each of them into "-" and "+", > > and rearrange them so that all "-" are together followed by "+", it will > > match Bzr output. > So we'd need something like this (I still think we should treat curly > brackets the same as punctuation, and for good measure I just handled > everything that is not alphanumerical the same): Nice. With this patch of yours, --patience --collapse-non-alnums produces the same output as bzr for this last test case (the util_sock.c one). However, also for this last case, without --patience, diff --collapse-non-alnums finds *no* common lines at all. Mentioning in case you'd be interested in knowing. Cheers, -- Adeodato Simó dato at net.com.org.es Debian Developer adeodato at debian.org - You look beaten. - I just caught Tara laughing with another man. - Are you sure they weren't just... kissing or something? - No, they were laughing. -- Denny Crane and Alan Shore ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm 2009-01-09 13:07 ` Johannes Schindelin 2009-01-09 15:59 ` Adeodato Simó @ 2009-01-09 18:09 ` Linus Torvalds 2009-01-09 18:13 ` Linus Torvalds 2009-01-09 20:53 ` Junio C Hamano 2 siblings, 1 reply; 67+ messages in thread From: Linus Torvalds @ 2009-01-09 18:09 UTC (permalink / raw) To: Johannes Schindelin Cc: Junio C Hamano, Adeodato Simó, Clemens Buchacher, Pierre Habouzit, davidel, Francis Galiegue, Git ML On Fri, 9 Jan 2009, Johannes Schindelin wrote: > > -- snipsnap -- > [TOY PATCH] Add diff option '--collapse-non-alnums' I really don't think it should be about "alnum". Think about languages that use "begin" and "end" instead of "{" "}". I think we'd be better off just looking at the size, but _this_ is a really good area where "uniqueness" matters. Don't combine unique lines, but lines that have the same hash as a thousand other lines? Go right ahead. Linus ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm 2009-01-09 18:09 ` Linus Torvalds @ 2009-01-09 18:13 ` Linus Torvalds 0 siblings, 0 replies; 67+ messages in thread From: Linus Torvalds @ 2009-01-09 18:13 UTC (permalink / raw) To: Johannes Schindelin Cc: Junio C Hamano, Adeodato Simó, Clemens Buchacher, Pierre Habouzit, davidel, Francis Galiegue, Git ML On Fri, 9 Jan 2009, Linus Torvalds wrote: > > Don't combine unique lines, but lines that have the same hash as a > thousand other lines? Go right ahead. Oh, and even then, don't ever combine fragments unless it's a single line, and unless you can really stop with just one or two combines. Because we don't want to combine 100 one-liner changes (on every other line) into one 300-line change. Linus ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm 2009-01-09 13:07 ` Johannes Schindelin 2009-01-09 15:59 ` Adeodato Simó 2009-01-09 18:09 ` Linus Torvalds @ 2009-01-09 20:53 ` Junio C Hamano 2009-01-10 11:36 ` Johannes Schindelin 2 siblings, 1 reply; 67+ messages in thread From: Junio C Hamano @ 2009-01-09 20:53 UTC (permalink / raw) To: Johannes Schindelin Cc: Adeodato Simó, Linus Torvalds, Clemens Buchacher, Pierre Habouzit, davidel, Francis Galiegue, Git ML Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: > On Thu, 8 Jan 2009, Junio C Hamano wrote: > >> If we find the "common" context lines that have only blank and >> punctuation letters in Dscho output, turn each of them into "-" and "+", >> and rearrange them so that all "-" are together followed by "+", it will >> match Bzr output. > > So we'd need something like this (I still think we should treat curly > brackets the same as punctuation, and for good measure I just handled > everything that is not alphanumerical the same): I meant by punctuation to include curlies (my wording may have been wrong but from the example with " }" line it should have been obvious). But I agree with both points Linus raised. The criteria to pick what to pretend is unmatching should be "small insignificant lines" (small goes for both size and also number of consecutive "insignificant" lines), and the coalescing should be done to join a block of consecutive changed lines of a significant size (so you do not join two 1- or 2-line "changed line" blocks by pretending that a 1-line unchanged insignificant line in between them is unmatching). ^ permalink raw reply [flat|nested] 67+ messages in thread
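Taken together, the criteria in this subthread (Junio's "small insignificant lines", and Linus's points that unique lines must never be coalesced and that many tiny hunks must not be chained together) could be captured in a predicate along the following lines. This is a hypothetical sketch: the name, the threshold, and the exact conditions are invented, and the "do not join two tiny hunks" rule is left to the caller.

```python
from collections import Counter

def may_coalesce(context_run, all_lines, max_run=2):
    """Return True when a run of context lines separating two hunks
    may be pretended non-matching: the run is short, and none of its
    lines is unique in the file (unique lines are exactly the anchors
    that patience diff relies on, so they must stay matched)."""
    counts = Counter(all_lines)
    return (0 < len(context_run) <= max_run and
            all(counts[line] > 1 for line in context_run))
```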
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm 2009-01-09 20:53 ` Junio C Hamano @ 2009-01-10 11:36 ` Johannes Schindelin 0 siblings, 0 replies; 67+ messages in thread From: Johannes Schindelin @ 2009-01-10 11:36 UTC (permalink / raw) To: Junio C Hamano Cc: Adeodato Simó, Linus Torvalds, Clemens Buchacher, Pierre Habouzit, davidel, Francis Galiegue, Git ML Hi, On Fri, 9 Jan 2009, Junio C Hamano wrote: > Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: > > > On Thu, 8 Jan 2009, Junio C Hamano wrote: > > > >> If we find the "common" context lines that have only blank and > >> punctuation letters in Dscho output, turn each of them into "-" and "+", > >> and rearrange them so that all "-" are together followed by "+", it will > >> match Bzr output. > > > > So we'd need something like this (I still think we should treat curly > > brackets the same as punctuation, and for good measure I just handled > > everything that is not alphanumerical the same): > > I meant by punctuation to include curlies (my wording may have been wrong > but from the example with " }" line it should have been obvious). > > But I agree with both points Linus raised. The criteria to pick what to > pretend unmatching should be "small insignificant lines" (small goes for > both size and also number of consecutive "insignificant" lines), and the > coallescing should be done to join a block of consecutive changed lines of > a significant size (so you do not join two 1 or 2-line "changed line" > blocks by pretending that a 1-line unchanged insignificant line in between > them is unmatching). Of course, the number should be configurable as with the inter-hunk context. However, I'll not have much time to work on this feature, and it would definitely need some experimenting to find the best parameters, e.g. maximal number of inter-hunk lines, maybe even an inter-hunk/hunk ratio, alnums, or as Linus suggests the length of the line (of which I am still not convinced). 
Ciao, Dscho ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm 2009-01-02 10:55 ` Clemens Buchacher 2009-01-02 10:58 ` Clemens Buchacher @ 2009-01-02 11:03 ` Junio C Hamano 1 sibling, 0 replies; 67+ messages in thread From: Junio C Hamano @ 2009-01-02 11:03 UTC (permalink / raw) To: Clemens Buchacher Cc: Linus Torvalds, Adeodato Simó, Johannes Schindelin, Pierre Habouzit, davidel, Francis Galiegue, Git ML Clemens Buchacher <drizzd@aon.at> writes: > But patience diff does more than that. Take a look at "git diff" and "git > diff --patience" output below (taken from [1]). > > *** git diff ****************************** git diff --patience ********** > #include <stdio.h> | #include <stdio.h> I suspect you have the above labels backwards... ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm 2009-01-02 1:56 ` Linus Torvalds 2009-01-02 10:55 ` Clemens Buchacher @ 2009-01-02 18:50 ` Adeodato Simó 1 sibling, 0 replies; 67+ messages in thread From: Adeodato Simó @ 2009-01-02 18:50 UTC (permalink / raw) To: Linus Torvalds Cc: Johannes Schindelin, Pierre Habouzit, davidel, Francis Galiegue, Git ML * Linus Torvalds [Thu, 01 Jan 2009 17:56:13 -0800]: > On Thu, 1 Jan 2009, Adeodato Simó wrote: > > For me, the cases where I find patience output to be of substantial > > higher readability are those involving a rewrite of several consecutive > > paragraphs (i.e., lines of code separated by blank lines). Compare: > I don't think that's a "patience diff" issue. Ah, I see. > That's simply an issue of merging consecutive diff fragments together if > they are close-by, and that's independent of the actual diff algorithm > itself. > > I'll note that in this particular case, `git diff` yielded the very same > > results with or without --patience. I don't know why that is, Johannes? > > I'll also note that /usr/bin/diff produces (in this case) something > > closer to patience than to git. > See above - I really don't think this has anything to do with "patience vs > non-patience". It's more akin to the things we do for our merge conflict > markers: if we have two merge conflicts next to each other, with just a > couple of lines in between, we coalesce the merge conflicts into one > larger one instead. > We don't do that for regular diffs - they're always kept minimal (ok, not > really minimal, but as close to minimal as the algorithm finds them). > See commit f407f14deaa14ebddd0d27238523ced8eca74393 for the git merge > conflict merging. We _could_ do similar things for regular diffs. It's > sometimes useful, sometimes not. Independently of patience diff, then, I'd very much support changes to improve the diff I pasted. 
-- Adeodato Simó dato at net.com.org.es Debian Developer adeodato at debian.org Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. -- Brian W. Kernighan ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm 2008-11-04 8:30 ` Pierre Habouzit 2008-11-04 14:34 ` Johannes Schindelin @ 2009-01-06 11:17 ` Pierre Habouzit 2009-01-06 11:39 ` Pierre Habouzit 2009-01-06 19:40 ` Johannes Schindelin 1 sibling, 2 replies; 67+ messages in thread From: Pierre Habouzit @ 2009-01-06 11:17 UTC (permalink / raw) To: Johannes Schindelin, Linus Torvalds; +Cc: davidel, Francis Galiegue, Git ML [-- Attachment #1: Type: text/plain, Size: 3580 bytes --] On jeu, jan 01, 2009 at 04:38:09 +0000, Johannes Schindelin wrote: > > Nothing fancy, really, just a straight-forward implementation of the > heavily under-documented and under-analyzed paience diff algorithm. Btw, what is the status of this series ? I see it neither in pu nor in next. And I would gladly see it included in git. On jeu, jan 01, 2009 at 07:45:21 +0000, Linus Torvalds wrote: > > > On Thu, 1 Jan 2009, Johannes Schindelin wrote: > > > > Nothing fancy, really, just a straight-forward implementation of the > > heavily under-documented and under-analyzed paience diff algorithm. > > Exactly because the patience diff is so under-documented, could you > perhaps give a few examples of how it differs in the result, and why it's > so wonderful? Yes, yes, I can google, and no, no, nothing useful shows up > except for *totally* content-free fanboisms. > > So could we have some actual real data on it? For example, look at the following (shamelessly stolen from the real life example http://glandium.org/blog/?p=120). 
With orig.mk reading:

] include $(DEPTH)/config/autoconf.mk
]
] include $(topsrcdir)/config/rules.mk
]
] libs::
] 	$(INSTALL) $(srcdir)/nsKillAll.js $(DIST)/bin/components
]
] clean::
] 	rm -f $(DIST)/bin/components/nsKillAll.js

With patched.mk reading:

] include $(DEPTH)/config/autoconf.mk
]
] EXTRA_COMPONENTS = nsKillAll.js
]
] include $(topsrcdir)/config/rules.mk
]
] clean::
] 	rm -f $(DIST)/bin/components/nsKillAll.js

You get:

Normal git diff                                    | patience diff
---------------------------------------------------+------------------------------------------
 include $(DEPTH)/config/autoconf.mk               |  include $(DEPTH)/config/autoconf.mk
                                                   |
                                                   |
-include $(topsrcdir)/config/rules.mk              | +EXTRA_COMPONENTS = nsKillAll.js
+EXTRA_COMPONENTS = nsKillAll.js                   | +
                                                   |  include $(topsrcdir)/config/rules.mk
-libs::                                            |
- $(INSTALL) $(srcdir)/...                         | -libs::
+include $(topsrcdir)/config/rules.mk              | - $(INSTALL) $(srcdir)/...
                                                   | -
 clean::                                           |  clean::
 rm -f $(DIST)/bin/components/nsKillAll.js         |  rm -f $(DIST)/bin/components/nsKillAll.js

I've checked, in many projects I have under git, the differences between git log -p and git log -p --patience. The patience algorithm really is much more readable when the diff involves code moves with changes in the moved sections. If the section you move across is smaller than the moved ones, the patience algorithm will show the moved code as removed where it was and added where it now is, changes included. The current diff will rather move the smaller invariant section you move across and present mangled diffs involving the function prototypes, making it less than readable. Of course, moving code _and_ modifying it at the same time is rather bad style. But (1) it happens, and the fact that it's bad style is a bad argument for not showing a decent diff of it; (2) if you construct diffs of a range of commits, it happens a lot.
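The core of the algorithm Pierre relies on here is small enough to sketch: keep only the lines that occur exactly once in each file, then take the longest common subsequence of those anchors via patience sorting (a longest-increasing-subsequence computation). The following is illustrative Python, not git's xdiff code; patience_lcs is an invented name.

```python
from bisect import bisect_left
from collections import Counter

def patience_lcs(a, b):
    """Return the longest common subsequence of the lines that are
    unique in both a and b, as a list of (index_in_a, index_in_b)
    pairs in increasing order."""
    count_a, count_b = Counter(a), Counter(b)
    # Lines unique in both files, mapped to their position in b.
    index_b = {line: i for i, line in enumerate(b)
               if count_a[line] == 1 and count_b[line] == 1}
    # Candidate pairs in a's order; their b-indices are all distinct.
    pairs = [(i, index_b[line]) for i, line in enumerate(a) if line in index_b]
    # Patience sorting: longest strictly increasing run of b-indices.
    piles, tails, back = [], [], {}
    for pair in pairs:
        k = bisect_left(piles, pair[1])
        back[pair] = tails[k - 1] if k > 0 else None
        if k == len(piles):
            piles.append(pair[1])
            tails.append(pair)
        else:
            piles[k] = pair[1]
            tails[k] = pair
    # Walk the back-pointers from the top of the last pile.
    result, p = [], tails[-1] if tails else None
    while p is not None:
        result.append(p)
        p = back[p]
    return result[::-1]
```

On the orig.mk/patched.mk example above it returns exactly four anchors: the autoconf include, the rules.mk include, clean::, and the rm line, which is why the patience output resyncs on the rules.mk include instead of rewriting it.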
-- ·O· Pierre Habouzit ··O madcoder@debian.org OOO http://www.madism.org [-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm 2009-01-06 11:17 ` Pierre Habouzit @ 2009-01-06 11:39 ` Pierre Habouzit 2009-01-06 19:40 ` Johannes Schindelin 1 sibling, 0 replies; 67+ messages in thread From: Pierre Habouzit @ 2009-01-06 11:39 UTC (permalink / raw) To: Johannes Schindelin, Linus Torvalds; +Cc: davidel, Francis Galiegue, Git ML [-- Attachment #1.1: Type: text/plain, Size: 2784 bytes --] On Tue, Jan 06, 2009 at 11:17:12AM +0000, Pierre Habouzit wrote: > On jeu, jan 01, 2009 at 04:38:09 +0000, Johannes Schindelin wrote: > > > > Nothing fancy, really, just a straight-forward implementation of the > > heavily under-documented and under-analyzed paience diff algorithm. > > Btw, what is the status of this series ? I see it neither in pu nor in > next. And I would gladly see it included in git. Johannes: I've not had time to investigate, but when finding what I present in the end of this mail, I ran: git log -p > log-normal git log -p --patience > log-patience in git.git. I saw that patience diff is slower than normal diff, which is expected, but I had to kill the latter command because it leaks like hell. I've not investigated yet (And yes I'm running your latest posted series). > On jeu, jan 01, 2009 at 07:45:21 +0000, Linus Torvalds wrote: > > On Thu, 1 Jan 2009, Johannes Schindelin wrote: > > > > > > Nothing fancy, really, just a straight-forward implementation of the > > > heavily under-documented and under-analyzed paience diff algorithm. > > > > Exactly because the patience diff is so under-documented, could you > > perhaps give a few examples of how it differs in the result, and why it's > > so wonderful? Yes, yes, I can google, and no, no, nothing useful shows up > > except for *totally* content-free fanboisms. > > > > So could we have some actual real data on it? > I've checked in many projects I have under git, the differences between > git log -p and git log -p --patience. 
The patience algorithm is really > really more readable with it involves code moves with changes in the > moved sections. If the section you move across is smaller than the moved > ones, the patience algorithm will show the moved code as removed where > it was and added where it now is, changes included. The current diff > will rather move the smaller invariend section you move across and > present mangled diffs involving the function prototypes making it less > than readable. Actually git.git has a canonical example of this in 214a34d22. For those not having the --patience diff applied locally, attached are the two patches git show / git show --patience give. It's of course a matter of taste, but I like the patience version a lot more. I'm also curious to see what a merge conflict with such a move would look like (e.g. inverting some of the added arguments to the factorized function of 214a34d22 or something similar). I'm somehow convinced that it would generate a nicer conflict somehow. 
-- ·O· Pierre Habouzit ··O madcoder@debian.org OOO http://www.madism.org [-- Attachment #1.2: 214a34d22.normal --] [-- Type: text/plain, Size: 2323 bytes --] commit 214a34d22ec59ec7e1166772f06ecf8799f96c96 Author: Florian Weimer <fw@deneb.enyo.de> Date: Sun Aug 31 17:45:04 2008 +0200 git-svn: Introduce SVN::Git::Editor::_chg_file_get_blob Signed-off-by: Florian Weimer <fw@deneb.enyo.de> Acked-by: Eric Wong <normalperson@yhbt.net> diff --git a/git-svn.perl b/git-svn.perl index 0479f41..2c3e13f 100755 --- a/git-svn.perl +++ b/git-svn.perl @@ -3663,28 +3663,35 @@ sub change_file_prop { $self->SUPER::change_file_prop($fbat, $pname, $pval, $self->{pool}); } -sub chg_file { - my ($self, $fbat, $m) = @_; - if ($m->{mode_b} =~ /755$/ && $m->{mode_a} !~ /755$/) { - $self->change_file_prop($fbat,'svn:executable','*'); - } elsif ($m->{mode_b} !~ /755$/ && $m->{mode_a} =~ /755$/) { - $self->change_file_prop($fbat,'svn:executable',undef); - } - my $fh = Git::temp_acquire('git_blob'); - if ($m->{mode_b} =~ /^120/) { +sub _chg_file_get_blob ($$$$) { + my ($self, $fbat, $m, $which) = @_; + my $fh = Git::temp_acquire("git_blob_$which"); + if ($m->{"mode_$which"} =~ /^120/) { print $fh 'link ' or croak $!; $self->change_file_prop($fbat,'svn:special','*'); - } elsif ($m->{mode_a} =~ /^120/ && $m->{mode_b} !~ /^120/) { + } elsif ($m->{mode_a} =~ /^120/ && $m->{"mode_$which"} !~ /^120/) { $self->change_file_prop($fbat,'svn:special',undef); } - my $size = $::_repository->cat_blob($m->{sha1_b}, $fh); - croak "Failed to read object $m->{sha1_b}" if ($size < 0); + my $blob = $m->{"sha1_$which"}; + return ($fh,) if ($blob =~ /^0{40}$/); + my $size = $::_repository->cat_blob($blob, $fh); + croak "Failed to read object $blob" if ($size < 0); $fh->flush == 0 or croak $!; seek $fh, 0, 0 or croak $!; my $exp = ::md5sum($fh); seek $fh, 0, 0 or croak $!; + return ($fh, $exp); +} +sub chg_file { + my ($self, $fbat, $m) = @_; + if ($m->{mode_b} =~ /755$/ && $m->{mode_a} !~ /755$/) { + 
$self->change_file_prop($fbat,'svn:executable','*'); + } elsif ($m->{mode_b} !~ /755$/ && $m->{mode_a} =~ /755$/) { + $self->change_file_prop($fbat,'svn:executable',undef); + } + my ($fh, $exp) = _chg_file_get_blob $self, $fbat, $m, 'b'; my $pool = SVN::Pool->new; my $atd = $self->apply_textdelta($fbat, undef, $pool); my $got = SVN::TxDelta::send_stream($fh, @$atd, $pool); [-- Attachment #1.3: 214a34d22.patience --] [-- Type: text/plain, Size: 2290 bytes --] commit 214a34d22ec59ec7e1166772f06ecf8799f96c96 Author: Florian Weimer <fw@deneb.enyo.de> Date: Sun Aug 31 17:45:04 2008 +0200 git-svn: Introduce SVN::Git::Editor::_chg_file_get_blob Signed-off-by: Florian Weimer <fw@deneb.enyo.de> Acked-by: Eric Wong <normalperson@yhbt.net> diff --git a/git-svn.perl b/git-svn.perl index 0479f41..2c3e13f 100755 --- a/git-svn.perl +++ b/git-svn.perl @@ -3663,6 +3663,27 @@ sub change_file_prop { $self->SUPER::change_file_prop($fbat, $pname, $pval, $self->{pool}); } +sub _chg_file_get_blob ($$$$) { + my ($self, $fbat, $m, $which) = @_; + my $fh = Git::temp_acquire("git_blob_$which"); + if ($m->{"mode_$which"} =~ /^120/) { + print $fh 'link ' or croak $!; + $self->change_file_prop($fbat,'svn:special','*'); + } elsif ($m->{mode_a} =~ /^120/ && $m->{"mode_$which"} !~ /^120/) { + $self->change_file_prop($fbat,'svn:special',undef); + } + my $blob = $m->{"sha1_$which"}; + return ($fh,) if ($blob =~ /^0{40}$/); + my $size = $::_repository->cat_blob($blob, $fh); + croak "Failed to read object $blob" if ($size < 0); + $fh->flush == 0 or croak $!; + seek $fh, 0, 0 or croak $!; + + my $exp = ::md5sum($fh); + seek $fh, 0, 0 or croak $!; + return ($fh, $exp); +} + sub chg_file { my ($self, $fbat, $m) = @_; if ($m->{mode_b} =~ /755$/ && $m->{mode_a} !~ /755$/) { @@ -3670,21 +3691,7 @@ sub chg_file { } elsif ($m->{mode_b} !~ /755$/ && $m->{mode_a} =~ /755$/) { $self->change_file_prop($fbat,'svn:executable',undef); } - my $fh = Git::temp_acquire('git_blob'); - if ($m->{mode_b} =~ /^120/) { - 
print $fh 'link ' or croak $!; - $self->change_file_prop($fbat,'svn:special','*'); - } elsif ($m->{mode_a} =~ /^120/ && $m->{mode_b} !~ /^120/) { - $self->change_file_prop($fbat,'svn:special',undef); - } - my $size = $::_repository->cat_blob($m->{sha1_b}, $fh); - croak "Failed to read object $m->{sha1_b}" if ($size < 0); - $fh->flush == 0 or croak $!; - seek $fh, 0, 0 or croak $!; - - my $exp = ::md5sum($fh); - seek $fh, 0, 0 or croak $!; - + my ($fh, $exp) = _chg_file_get_blob $self, $fbat, $m, 'b'; my $pool = SVN::Pool->new; my $atd = $self->apply_textdelta($fbat, undef, $pool); my $got = SVN::TxDelta::send_stream($fh, @$atd, $pool); [-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-06 11:17             ` Pierre Habouzit
  2009-01-06 11:39               ` Pierre Habouzit
@ 2009-01-06 19:40             ` Johannes Schindelin
  2009-01-07 14:39               ` Pierre Habouzit
  2009-01-07 20:15               ` [PATCH 0/3] Teach Git about " Sam Vilain
  1 sibling, 2 replies; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-06 19:40 UTC (permalink / raw)
  To: Pierre Habouzit; +Cc: Linus Torvalds, davidel, Francis Galiegue, Git ML

Hi,

On Tue, 6 Jan 2009, Pierre Habouzit wrote:

> On jeu, jan 01, 2009 at 04:38:09 +0000, Johannes Schindelin wrote:
> >
> > Nothing fancy, really, just a straight-forward implementation of the
> > heavily under-documented and under-analyzed paience diff algorithm.
>
> Btw, what is the status of this series ? I see it neither in pu nor in
> next. And I would gladly see it included in git.

AFAICT people wanted to be reasonably sure that it is worth the effort.

Although I would like to see it in once it is fleshed out -- even if it
does not meet our usefulness standard -- because people said Git is
inferior for not providing a patience diff. If we have --patience, we
can say "but we have it, it's just not useful, check for yourself".

> On jeu, jan 01, 2009 at 07:45:21 +0000, Linus Torvalds wrote:
> >
> > So could we have some actual real data on it?
>
> For example, look at the following (shamelessly stolen from the real
> life example http://glandium.org/blog/?p=120).

Due to the lines being much longer than 80 characters, this example was
not useful to me.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-06 19:40             ` Johannes Schindelin
@ 2009-01-07 14:39               ` Pierre Habouzit
  2009-01-07 17:01                 ` Johannes Schindelin
  2009-01-07 20:15               ` [PATCH 0/3] Teach Git about " Sam Vilain
  1 sibling, 1 reply; 67+ messages in thread
From: Pierre Habouzit @ 2009-01-07 14:39 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Linus Torvalds, davidel, Francis Galiegue, Git ML

[-- Attachment #1.1: Type: text/plain, Size: 1832 bytes --]

On Tue, Jan 06, 2009 at 07:40:02PM +0000, Johannes Schindelin wrote:
> Hi,
>
> On Tue, 6 Jan 2009, Pierre Habouzit wrote:
>
> > On jeu, jan 01, 2009 at 04:38:09 +0000, Johannes Schindelin wrote:
> > >
> > > Nothing fancy, really, just a straight-forward implementation of the
> > > heavily under-documented and under-analyzed paience diff algorithm.
> >
> > Btw, what is the status of this series ? I see it neither in pu nor in
> > next. And I would gladly see it included in git.
>
> AFAICT people wanted to be reasonably sure that it is worth the effort.

Fair enough.

> Although I would like to see it in once it is fleshed out -- even if it
> does not meet our usefulness standard -- because people said Git is
> inferior for not providing a patience diff. If we have --patience, we can
> say "but we have it, it's just not useful, check for yourself".

Well, I believe it's useful, but maybe the standard algorithm could be
tweaked the way Linus proposes, to make the "long" lines weigh more
heavily or so.

> Due to the lines being much longer than 80 characters, this example was
> not useful to me.

I hope the other one was ;)

WRT the leaks, you want to squash the attached patch on the proper
patches of your series (maybe the xdl_free on map.entries could be put
in a hashmap_destroy or similar, btw, but valgrind reports no more leaks
in xdiff now).
As for timings, with the current implementation, patience diff is around
30% slower than the default implementation, which is a bit worse than
what I would expect, probably due to the fact that you had to work
around the libxdiff interface, I guess.
-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org

[-- Attachment #1.2: memory-leaks.diff --]
[-- Type: text/plain, Size: 1120 bytes --]

diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
index d01cbdd..c2ffb03 100644
--- a/xdiff/xpatience.c
+++ b/xdiff/xpatience.c
@@ -203,8 +203,10 @@ static struct entry *find_longest_common_sequence(struct hashmap *map)
 	}
 
 	/* No common unique lines were found */
-	if (!longest)
+	if (!longest) {
+		xdl_free(sequence);
 		return NULL;
+	}
 
 	/* Iterate starting at the last element, adjusting the "next" members */
 	entry = sequence[longest - 1];
@@ -213,6 +215,7 @@ static struct entry *find_longest_common_sequence(struct hashmap *map)
 		entry->previous->next = entry;
 		entry = entry->previous;
 	}
+	xdl_free(sequence);
 	return entry;
 }
 
@@ -350,6 +353,7 @@ static int patience_diff(mmfile_t *file1, mmfile_t *file2,
 		env->xdf1.rchg[line1++ - 1] = 1;
 	while(count2--)
 		env->xdf2.rchg[line2++ - 1] = 1;
+	xdl_free(map.entries);
 	return 0;
 }
 
@@ -361,6 +365,7 @@ static int patience_diff(mmfile_t *file1, mmfile_t *file2,
 	result = fall_back_to_classic_diff(&map,
 			line1, count1, line2, count2);
 
+	xdl_free(map.entries);
 	return result;
 }

[-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply related	[flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
  2009-01-07 14:39               ` Pierre Habouzit
@ 2009-01-07 17:01                 ` Johannes Schindelin
  2009-01-07 17:04                   ` [PATCH v3 1/3] Implement " Johannes Schindelin
  0 siblings, 1 reply; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-07 17:01 UTC (permalink / raw)
  To: Pierre Habouzit; +Cc: Linus Torvalds, davidel, Francis Galiegue, Git ML

Hi,

On Wed, 7 Jan 2009, Pierre Habouzit wrote:

> On Tue, Jan 06, 2009 at 07:40:02PM +0000, Johannes Schindelin wrote:
>
> > Although I would like to see it in once it is fleshed out -- even if
> > it does not meet our usefulness standard -- because people said Git is
> > inferior for not providing a patience diff. If we have --patience, we
> > can say "but we have it, it's just not useful, check for yourself".
>
> Well, I believe it's useful, but maybe the standard algorithm could be
> tweaked the way Linus proposes, to make the "long" lines weigh more
> heavily or so.

I think this "weighting idea" is a bit too much of handwaving to start
anything close to a design; as I pointed out, anything that has something
different than a 1 for a deleted/added line affects performance
negatively.

> WRT the leaks, you want to squash the attached patch on the proper
> patches of your series (maybe the xdl_free on map.entries could be put
> in a hashmap_destroy or similar, btw, but valgrind reports no more leaks
> in xdiff now).

Thanks!

I also squashed in a patch that avoids calling xdl_cleanup_records() and
then memset()ing the rchg array to 0 (which worked around the
segmentation fault).

Patch 1/3 v3 follows,
Dscho

^ permalink raw reply	[flat|nested] 67+ messages in thread
* [PATCH v3 1/3] Implement the patience diff algorithm
  2009-01-07 17:01                 ` Johannes Schindelin
@ 2009-01-07 17:04                   ` Johannes Schindelin
  2009-01-07 18:10                     ` Davide Libenzi
  0 siblings, 1 reply; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-07 17:04 UTC (permalink / raw)
  To: Pierre Habouzit; +Cc: Linus Torvalds, davidel, Francis Galiegue, Git ML

The patience diff algorithm produces slightly more intuitive output
than the classic Myers algorithm, as it does not try to minimize the
number of +/- lines first, but tries to preserve the lines that are
unique.

To this end, it first determines lines that are unique in both files,
then the maximal sequence which preserves the order (relative to both
files) is extracted.

Starting from this initial set of common lines, the rest of the lines
is handled recursively, with Myers' algorithm as a fallback when the
patience algorithm fails (due to no common unique lines).

This patch includes memory leak fixes by Pierre Habouzit.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---

	This comes close to 'next' quality, IMHO.

	Changes vs v2: Pierre's memory leak fixes, and it now avoids
	calling xdl_cleanup_records() and then throwing away the
	results for patience diff.

	Pierre, could you do the performance tests again? I _think_ it
	might be somewhat faster now because of the
	xdl_cleanup_records() avoidance.
xdiff/xdiff.h | 1 + xdiff/xdiffi.c | 3 + xdiff/xdiffi.h | 2 + xdiff/xpatience.c | 381 +++++++++++++++++++++++++++++++++++++++++++++++++++++ xdiff/xprepare.c | 3 +- 5 files changed, 389 insertions(+), 1 deletions(-) create mode 100644 xdiff/xpatience.c diff --git a/xdiff/xdiff.h b/xdiff/xdiff.h index 361f802..4da052a 100644 --- a/xdiff/xdiff.h +++ b/xdiff/xdiff.h @@ -32,6 +32,7 @@ extern "C" { #define XDF_IGNORE_WHITESPACE (1 << 2) #define XDF_IGNORE_WHITESPACE_CHANGE (1 << 3) #define XDF_IGNORE_WHITESPACE_AT_EOL (1 << 4) +#define XDF_PATIENCE_DIFF (1 << 5) #define XDF_WHITESPACE_FLAGS (XDF_IGNORE_WHITESPACE | XDF_IGNORE_WHITESPACE_CHANGE | XDF_IGNORE_WHITESPACE_AT_EOL) #define XDL_PATCH_NORMAL '-' diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c index 9d0324a..3e97462 100644 --- a/xdiff/xdiffi.c +++ b/xdiff/xdiffi.c @@ -329,6 +329,9 @@ int xdl_do_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, xdalgoenv_t xenv; diffdata_t dd1, dd2; + if (xpp->flags & XDF_PATIENCE_DIFF) + return xdl_do_patience_diff(mf1, mf2, xpp, xe); + if (xdl_prepare_env(mf1, mf2, xpp, xe) < 0) { return -1; diff --git a/xdiff/xdiffi.h b/xdiff/xdiffi.h index 3e099dc..ad033a8 100644 --- a/xdiff/xdiffi.h +++ b/xdiff/xdiffi.h @@ -55,5 +55,7 @@ int xdl_build_script(xdfenv_t *xe, xdchange_t **xscr); void xdl_free_script(xdchange_t *xscr); int xdl_emit_diff(xdfenv_t *xe, xdchange_t *xscr, xdemitcb_t *ecb, xdemitconf_t const *xecfg); +int xdl_do_patience_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, + xdfenv_t *env); #endif /* #if !defined(XDIFFI_H) */ diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c new file mode 100644 index 0000000..e42c16a --- /dev/null +++ b/xdiff/xpatience.c @@ -0,0 +1,381 @@ +/* + * LibXDiff by Davide Libenzi ( File Differential Library ) + * Copyright (C) 2003-2009 Davide Libenzi, Johannes E. 
Schindelin + * + * This library is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * This library is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this library; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * Davide Libenzi <davidel@xmailserver.org> + * + */ +#include "xinclude.h" +#include "xtypes.h" +#include "xdiff.h" + +/* + * The basic idea of patience diff is to find lines that are unique in + * both files. These are intuitively the ones that we want to see as + * common lines. + * + * The maximal ordered sequence of such line pairs (where ordered means + * that the order in the sequence agrees with the order of the lines in + * both files) naturally defines an initial set of common lines. + * + * Now, the algorithm tries to extend the set of common lines by growing + * the line ranges where the files have identical lines. + * + * Between those common lines, the patience diff algorithm is applied + * recursively, until no unique line pairs can be found; these line ranges + * are handled by the well-known Myers algorithm. + */ + +#define NON_UNIQUE ULONG_MAX + +/* + * This is a hash mapping from line hash to line numbers in the first and + * second file. + */ +struct hashmap { + int nr, alloc; + struct entry { + unsigned long hash; + /* + * 0 = unused entry, 1 = first line, 2 = second, etc. + * line2 is NON_UNIQUE if the line is not unique + * in either the first or the second file. 
+ */ + unsigned long line1, line2; + /* + * "next" & "previous" are used for the longest common + * sequence; + * initially, "next" reflects only the order in file1. + */ + struct entry *next, *previous; + } *entries, *first, *last; + /* were common records found? */ + unsigned long has_matches; + mmfile_t *file1, *file2; + xdfenv_t *env; + xpparam_t const *xpp; +}; + +/* The argument "pass" is 1 for the first file, 2 for the second. */ +static void insert_record(int line, struct hashmap *map, int pass) +{ + xrecord_t **records = pass == 1 ? + map->env->xdf1.recs : map->env->xdf2.recs; + xrecord_t *record = records[line - 1], *other; + /* + * After xdl_prepare_env() (or more precisely, due to + * xdl_classify_record()), the "ha" member of the records (AKA lines) + * is _not_ the hash anymore, but a linearized version of it. In + * other words, the "ha" member is guaranteed to start with 0 and + * the second record's ha can only be 0 or 1, etc. + * + * So we multiply ha by 2 in the hope that the hashing was + * "unique enough". 
+ */ + int index = (int)((record->ha << 1) % map->alloc); + + while (map->entries[index].line1) { + other = map->env->xdf1.recs[map->entries[index].line1 - 1]; + if (map->entries[index].hash != record->ha || + !xdl_recmatch(record->ptr, record->size, + other->ptr, other->size, + map->xpp->flags)) { + if (++index >= map->alloc) + index = 0; + continue; + } + if (pass == 2) + map->has_matches = 1; + if (pass == 1 || map->entries[index].line2) + map->entries[index].line2 = NON_UNIQUE; + else + map->entries[index].line2 = line; + return; + } + if (pass == 2) + return; + map->entries[index].line1 = line; + map->entries[index].hash = record->ha; + if (!map->first) + map->first = map->entries + index; + if (map->last) { + map->last->next = map->entries + index; + map->entries[index].previous = map->last; + } + map->last = map->entries + index; + map->nr++; +} + +/* + * This function has to be called for each recursion into the inter-hunk + * parts, as previously non-unique lines can become unique when being + * restricted to a smaller part of the files. + * + * It is assumed that env has been prepared using xdl_prepare(). 
+ */ +static int fill_hashmap(mmfile_t *file1, mmfile_t *file2, + xpparam_t const *xpp, xdfenv_t *env, + struct hashmap *result, + int line1, int count1, int line2, int count2) +{ + result->file1 = file1; + result->file2 = file2; + result->xpp = xpp; + result->env = env; + + /* We know exactly how large we want the hash map */ + result->alloc = count1 * 2; + result->entries = (struct entry *) + xdl_malloc(result->alloc * sizeof(struct entry)); + if (!result->entries) + return -1; + memset(result->entries, 0, result->alloc * sizeof(struct entry)); + + /* First, fill with entries from the first file */ + while (count1--) + insert_record(line1++, result, 1); + + /* Then search for matches in the second file */ + while (count2--) + insert_record(line2++, result, 2); + + return 0; +} + +/* + * Find the longest sequence with a smaller last element (meaning a smaller + * line2, as we construct the sequence with entries ordered by line1). + */ +static int binary_search(struct entry **sequence, int longest, + struct entry *entry) +{ + int left = -1, right = longest; + + while (left + 1 < right) { + int middle = (left + right) / 2; + /* by construction, no two entries can be equal */ + if (sequence[middle]->line2 > entry->line2) + right = middle; + else + left = middle; + } + /* return the index in "sequence", _not_ the sequence length */ + return left; +} + +/* + * The idea is to start with the list of common unique lines sorted by + * the order in file1. For each of these pairs, the longest (partial) + * sequence whose last element's line2 is smaller is determined. + * + * For efficiency, the sequences are kept in a list containing exactly one + * item per sequence length: the sequence with the smallest last + * element (in terms of line2). 
+ */ +static struct entry *find_longest_common_sequence(struct hashmap *map) +{ + struct entry **sequence = xdl_malloc(map->nr * sizeof(struct entry *)); + int longest = 0, i; + struct entry *entry; + + for (entry = map->first; entry; entry = entry->next) { + if (!entry->line2 || entry->line2 == NON_UNIQUE) + continue; + i = binary_search(sequence, longest, entry); + entry->previous = i < 0 ? NULL : sequence[i]; + sequence[++i] = entry; + if (i == longest) + longest++; + } + + /* No common unique lines were found */ + if (!longest) { + xdl_free(sequence); + return NULL; + } + + /* Iterate starting at the last element, adjusting the "next" members */ + entry = sequence[longest - 1]; + entry->next = NULL; + while (entry->previous) { + entry->previous->next = entry; + entry = entry->previous; + } + xdl_free(sequence); + return entry; +} + +static int match(struct hashmap *map, int line1, int line2) +{ + xrecord_t *record1 = map->env->xdf1.recs[line1 - 1]; + xrecord_t *record2 = map->env->xdf2.recs[line2 - 1]; + return xdl_recmatch(record1->ptr, record1->size, + record2->ptr, record2->size, map->xpp->flags); +} + +static int patience_diff(mmfile_t *file1, mmfile_t *file2, + xpparam_t const *xpp, xdfenv_t *env, + int line1, int count1, int line2, int count2); + +static int walk_common_sequence(struct hashmap *map, struct entry *first, + int line1, int count1, int line2, int count2) +{ + int end1 = line1 + count1, end2 = line2 + count2; + int next1, next2; + + for (;;) { + /* Try to grow the line ranges of common lines */ + if (first) { + next1 = first->line1; + next2 = first->line2; + while (next1 > line1 && next2 > line2 && + match(map, next1 - 1, next2 - 1)) { + next1--; + next2--; + } + } else { + next1 = end1; + next2 = end2; + } + while (line1 < next1 && line2 < next2 && + match(map, line1, line2)) { + line1++; + line2++; + } + + /* Recurse */ + if (next1 > line1 || next2 > line2) { + struct hashmap submap; + + memset(&submap, 0, sizeof(submap)); + if 
(patience_diff(map->file1, map->file2, + map->xpp, map->env, + line1, next1 - line1, + line2, next2 - line2)) + return -1; + } + + if (!first) + return 0; + + while (first->next && + first->next->line1 == first->line1 + 1 && + first->next->line2 == first->line2 + 1) + first = first->next; + + line1 = first->line1 + 1; + line2 = first->line2 + 1; + + first = first->next; + } +} + +static int fall_back_to_classic_diff(struct hashmap *map, + int line1, int count1, int line2, int count2) +{ + /* + * This probably does not work outside Git, since + * we have a very simple mmfile structure. + * + * Note: ideally, we would reuse the prepared environment, but + * the libxdiff interface does not (yet) allow for diffing only + * ranges of lines instead of the whole files. + */ + mmfile_t subfile1, subfile2; + xpparam_t xpp; + xdfenv_t env; + + subfile1.ptr = (char *)map->env->xdf1.recs[line1 - 1]->ptr; + subfile1.size = map->env->xdf1.recs[line1 + count1 - 2]->ptr + + map->env->xdf1.recs[line1 + count1 - 2]->size - subfile1.ptr; + subfile2.ptr = (char *)map->env->xdf2.recs[line2 - 1]->ptr; + subfile2.size = map->env->xdf2.recs[line2 + count2 - 2]->ptr + + map->env->xdf2.recs[line2 + count2 - 2]->size - subfile2.ptr; + xpp.flags = map->xpp->flags & ~XDF_PATIENCE_DIFF; + if (xdl_do_diff(&subfile1, &subfile2, &xpp, &env) < 0) + return -1; + + memcpy(map->env->xdf1.rchg + line1 - 1, env.xdf1.rchg, count1); + memcpy(map->env->xdf2.rchg + line2 - 1, env.xdf2.rchg, count2); + + xdl_free_env(&env); + + return 0; +} + +/* + * Recursively find the longest common sequence of unique lines, + * and if none was found, ask xdl_do_diff() to do the job. + * + * This function assumes that env was prepared with xdl_prepare_env(). 
+ */ +static int patience_diff(mmfile_t *file1, mmfile_t *file2, + xpparam_t const *xpp, xdfenv_t *env, + int line1, int count1, int line2, int count2) +{ + struct hashmap map; + struct entry *first; + int result = 0; + + /* trivial case: one side is empty */ + if (!count1) { + while(count2--) + env->xdf2.rchg[line2++ - 1] = 1; + return 0; + } else if (!count2) { + while(count1--) + env->xdf1.rchg[line1++ - 1] = 1; + return 0; + } + + memset(&map, 0, sizeof(map)); + if (fill_hashmap(file1, file2, xpp, env, &map, + line1, count1, line2, count2)) + return -1; + + /* are there any matching lines at all? */ + if (!map.has_matches) { + while(count1--) + env->xdf1.rchg[line1++ - 1] = 1; + while(count2--) + env->xdf2.rchg[line2++ - 1] = 1; + xdl_free(map.entries); + return 0; + } + + first = find_longest_common_sequence(&map); + if (first) + result = walk_common_sequence(&map, first, + line1, count1, line2, count2); + else + result = fall_back_to_classic_diff(&map, + line1, count1, line2, count2); + + xdl_free(map.entries); + return result; +} + +int xdl_do_patience_diff(mmfile_t *file1, mmfile_t *file2, + xpparam_t const *xpp, xdfenv_t *env) +{ + if (xdl_prepare_env(file1, file2, xpp, env) < 0) + return -1; + + /* environment is cleaned up in xdl_diff() */ + return patience_diff(file1, file2, xpp, env, + 1, env->xdf1.nrec, 1, env->xdf2.nrec); +} diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c index a43aa72..1689085 100644 --- a/xdiff/xprepare.c +++ b/xdiff/xprepare.c @@ -290,7 +290,8 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, xdl_free_classifier(&cf); - if (xdl_optimize_ctxs(&xe->xdf1, &xe->xdf2) < 0) { + if (!(xpp->flags & XDF_PATIENCE_DIFF) && + xdl_optimize_ctxs(&xe->xdf1, &xe->xdf2) < 0) { xdl_free_ctx(&xe->xdf2); xdl_free_ctx(&xe->xdf1); -- 1.6.0.2.GIT ^ permalink raw reply related [flat|nested] 67+ messages in thread
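[The core of find_longest_common_sequence() in the patch above is patience sorting: entries arrive ordered by line1, and binary_search() keeps, for each sequence length, the candidate whose line2 is smallest. The following Python sketch of that bookkeeping is an editorial illustration only, not part of the patch; the function name and use of the stdlib `bisect` module are this sketch's own choices.]

```python
# Patience-style longest increasing subsequence over (line1, line2) pairs,
# mirroring binary_search() + find_longest_common_sequence() in xpatience.c:
# pairs arrive sorted by line1, and for each sequence length we remember the
# candidate ending with the smallest line2, found by binary search on line2.
import bisect

def longest_common_subsequence(pairs):
    """pairs: (line1, line2) of lines unique in both files, sorted by line1.
    Returns the longest subsequence whose line2 values also increase."""
    tails = []   # tails[k]: best (smallest-line2) end of a length-(k+1) seq
    line2s = []  # line2 of each tails[k]; sorted by construction
    prev = {}    # backlink: pair -> predecessor in its sequence
    for pair in pairs:
        i = bisect.bisect_left(line2s, pair[1])
        prev[pair] = tails[i - 1] if i > 0 else None
        if i == len(tails):
            tails.append(pair)
            line2s.append(pair[1])
        else:  # a cheaper end for sequences of length i+1
            tails[i] = pair
            line2s[i] = pair[1]
    # walk the backlinks from the end of the longest sequence
    result = []
    entry = tails[-1] if tails else None
    while entry is not None:
        result.append(entry)
        entry = prev[entry]
    result.reverse()
    return result
```

The C version does the same backward walk via the entries' previous pointers and then fixes up the next members, so the caller can traverse the sync points front to back.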
* Re: [PATCH v3 1/3] Implement the patience diff algorithm
  2009-01-07 17:04                   ` [PATCH v3 1/3] Implement " Johannes Schindelin
@ 2009-01-07 18:10                     ` Davide Libenzi
  2009-01-07 18:32                       ` Johannes Schindelin
  2009-01-07 18:59                       ` Linus Torvalds
  0 siblings, 2 replies; 67+ messages in thread
From: Davide Libenzi @ 2009-01-07 18:10 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Pierre Habouzit, Linus Torvalds, Francis Galiegue, Git ML

On Wed, 7 Jan 2009, Johannes Schindelin wrote:
>
> The patience diff algorithm produces slightly more intuitive output
> than the classic Myers algorithm, as it does not try to minimize the
> number of +/- lines first, but tries to preserve the lines that are
> unique.

Johannes, sorry I had not time to follow this one. A couple of minor
comments that arose just at glancing at the patch.

> +/*
> + * LibXDiff by Davide Libenzi ( File Differential Library )
> + * Copyright (C) 2003-2009 Davide Libenzi, Johannes E. Schindelin
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
> + *
> + * Davide Libenzi <davidel@xmailserver.org>

You do not need to give me credit for something I don't even know how it
works ;)

> +static int fall_back_to_classic_diff(struct hashmap *map,
> +		int line1, int count1, int line2, int count2)
> +{
> +	/*
> +	 * This probably does not work outside Git, since
> +	 * we have a very simple mmfile structure.
> +	 *
> +	 * Note: ideally, we would reuse the prepared environment, but
> +	 * the libxdiff interface does not (yet) allow for diffing only
> +	 * ranges of lines instead of the whole files.
> +	 */
> +	mmfile_t subfile1, subfile2;
> +	xpparam_t xpp;
> +	xdfenv_t env;
> +
> +	subfile1.ptr = (char *)map->env->xdf1.recs[line1 - 1]->ptr;
> +	subfile1.size = map->env->xdf1.recs[line1 + count1 - 2]->ptr +
> +		map->env->xdf1.recs[line1 + count1 - 2]->size - subfile1.ptr;
> +	subfile2.ptr = (char *)map->env->xdf2.recs[line2 - 1]->ptr;
> +	subfile2.size = map->env->xdf2.recs[line2 + count2 - 2]->ptr +
> +		map->env->xdf2.recs[line2 + count2 - 2]->size - subfile2.ptr;
> +	xpp.flags = map->xpp->flags & ~XDF_PATIENCE_DIFF;
> +	if (xdl_do_diff(&subfile1, &subfile2, &xpp, &env) < 0)
> +		return -1;

xdiff allows for diffing ranges, and the most efficient method is exactly
how you did ;) Once you know the lines pointers, there's no need to pass
it the whole file and have it scan it whole to find the lines range it
has to diff. Just pass the limited view like you did.


- Davide

^ permalink raw reply	[flat|nested] 67+ messages in thread
* Re: [PATCH v3 1/3] Implement the patience diff algorithm
  2009-01-07 18:10                     ` Davide Libenzi
@ 2009-01-07 18:32                       ` Johannes Schindelin
  2009-01-07 20:09                         ` Davide Libenzi
  2009-01-07 18:59                       ` Linus Torvalds
  1 sibling, 1 reply; 67+ messages in thread
From: Johannes Schindelin @ 2009-01-07 18:32 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Pierre Habouzit, Linus Torvalds, Francis Galiegue, Git ML

Hi,

On Wed, 7 Jan 2009, Davide Libenzi wrote:

> On Wed, 7 Jan 2009, Johannes Schindelin wrote:
>
> > The patience diff algorithm produces slightly more intuitive output
> > than the classic Myers algorithm, as it does not try to minimize the
> > number of +/- lines first, but tries to preserve the lines that are
> > unique.
>
> Johannes, sorry I had not time to follow this one.

What? You mean you actually took some time off around Christmas???

:-)

> A couple of minor comments that arose just at glancing at the patch.

Thanks.

> > +/*
> > + * LibXDiff by Davide Libenzi ( File Differential Library )
> > + * Copyright (C) 2003-2009 Davide Libenzi, Johannes E. Schindelin
> > + *
> > + * This library is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU Lesser General Public
> > + * License as published by the Free Software Foundation; either
> > + * version 2.1 of the License, or (at your option) any later version.
> > + *
> > + * This library is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> > + * Lesser General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU Lesser General Public
> > + * License along with this library; if not, write to the Free Software
> > + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
> > + *
> > + * Davide Libenzi <davidel@xmailserver.org>
>
> You do not need to give me credit for something I don't even know how it
> works ;)

Well, I meant to pass the copyright to you, or at least share it.

> > +static int fall_back_to_classic_diff(struct hashmap *map,
> > +		int line1, int count1, int line2, int count2)
> > +{
> > +	/*
> > +	 * This probably does not work outside Git, since
> > +	 * we have a very simple mmfile structure.
> > +	 *
> > +	 * Note: ideally, we would reuse the prepared environment, but
> > +	 * the libxdiff interface does not (yet) allow for diffing only
> > +	 * ranges of lines instead of the whole files.
> > +	 */
> > +	mmfile_t subfile1, subfile2;
> > +	xpparam_t xpp;
> > +	xdfenv_t env;
> > +
> > +	subfile1.ptr = (char *)map->env->xdf1.recs[line1 - 1]->ptr;
> > +	subfile1.size = map->env->xdf1.recs[line1 + count1 - 2]->ptr +
> > +		map->env->xdf1.recs[line1 + count1 - 2]->size - subfile1.ptr;
> > +	subfile2.ptr = (char *)map->env->xdf2.recs[line2 - 1]->ptr;
> > +	subfile2.size = map->env->xdf2.recs[line2 + count2 - 2]->ptr +
> > +		map->env->xdf2.recs[line2 + count2 - 2]->size - subfile2.ptr;
> > +	xpp.flags = map->xpp->flags & ~XDF_PATIENCE_DIFF;
> > +	if (xdl_do_diff(&subfile1, &subfile2, &xpp, &env) < 0)
> > +		return -1;
>
> xdiff allows for diffing ranges, and the most efficient method is exactly
> how you did ;) Once you know the lines pointers, there's no need to pass
> it the whole file and have it scan it whole to find the lines range it
> has to diff. Just pass the limited view like you did.

Heh.

Could it be that you misread my patch, and assumed that I faked an
xdfenv?

I did not, but instead faked two mmfiles, which is only as simple as I did
it because in git.git, we only have contiguous mmfiles. (I recall that
libxdiff allows for ropes instead of arrays.)

The way I did it has one big shortcoming: I need to prepare an xdfenv for
the subfiles even if I already prepared one for the complete files. IOW
the lines are rehashed all over again.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 67+ messages in thread
* Re: [PATCH v3 1/3] Implement the patience diff algorithm
  2009-01-07 18:32                       ` Johannes Schindelin
@ 2009-01-07 20:09                         ` Davide Libenzi
  2009-01-07 20:19                           ` Johannes Schindelin
  0 siblings, 1 reply; 67+ messages in thread
From: Davide Libenzi @ 2009-01-07 20:09 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Pierre Habouzit, Linus Torvalds, Francis Galiegue, Git ML

On Wed, 7 Jan 2009, Johannes Schindelin wrote:

> Heh.
>
> Could it be that you misread my patch, and assumed that I faked an
> xdfenv?
>
> I did not, but instead faked two mmfiles, which is only as simple as I did
> it because in git.git, we only have contiguous mmfiles. (I recall that
> libxdiff allows for ropes instead of arrays.)
>
> The way I did it has one big shortcoming: I need to prepare an xdfenv for
> the subfiles even if I already prepared one for the complete files. IOW
> the lines are rehashed all over again.

I told you I just glanced at the code :)
In that way, if you guys decide to merge this new algo, you'll need to
split the prepare from the optimize, and feed it with an already prepared
env.
Before going that way, have you ever tried to tweak xdl_cleanup_records
and xdl_clean_mmatch to reduce the level of optimization, and see the
results you get? It is possible that you won't need two different algos
inside git.


- Davide

^ permalink raw reply	[flat|nested] 67+ messages in thread
* Re: [PATCH v3 1/3] Implement the patience diff algorithm 2009-01-07 20:09 ` Davide Libenzi @ 2009-01-07 20:19 ` Johannes Schindelin 0 siblings, 0 replies; 67+ messages in thread From: Johannes Schindelin @ 2009-01-07 20:19 UTC (permalink / raw) To: Davide Libenzi; +Cc: Pierre Habouzit, Linus Torvalds, Francis Galiegue, Git ML Hi, On Wed, 7 Jan 2009, Davide Libenzi wrote: > On Wed, 7 Jan 2009, Johannes Schindelin wrote: > > > Could it be that you misread my patch, and assumed that I faked an > > xdfenv? > > > > I did not, but instead faked two mmfiles, which is only as simple as I did > > it because in git.git, we only have contiguous mmfiles. (I recall that > > libxdiff allows for ropes instead of arrays.) > > > > The way I did it has one big shortcoming: I need to prepare an xdfenv for > > the subfiles even if I already prepared one for the complete files. IOW > > the lines are rehashed all over again. > > I told you I just glanced at the code :) > In that way, if you guys decide to merge this new algo, you'll need to > split the prepare from the optimize, and feed it with an already prepared > env. Right. > Before going that way, have you ever tried to tweak xdl_cleanup_records > and xdl_clean_mmatch to reduce the level of optimization, and see the > results you get? It is possible that you won't need two different algos > inside git. No, I hadn't thought that libxdiff already determines uniqueness before actually running xdl_do_diff(). I also have to admit that I am not as clever as other people, and had quite a hard time figuring out as much as I did (for example, that rchg[i] == 1 means that this line is to be added/deleted, and that i is in the range 0, ..., N - 1 rather than 1, ..., N). So it is quite possible that something patience-like can be done earlier. However, I do not see a way to implement the recursion necessary for the patience diff. 
Remember:

	patience(line range):
		find unique lines
		if no unique lines found:
			resort to classical diff
			return
		extract the longest common sequence of unique common lines
		between those, recurse

When recursing, previously non-unique lines can turn unique, of course. And I do not see how that recursion could be done before xdl_clean_mmatch(), short of redoing the hashing and cleaning records up. Of course, it might well be possible, but I am already out of my depth reading something like "rdis0" and "rpdis1", and being close to despair. :') Ciao, Dscho ^ permalink raw reply [flat|nested] 67+ messages in thread
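[Editorial note: the recursion in the pseudocode above can be illustrated outside libxdiff. The following is a self-contained Python sketch of the anchor-finding step only; the helper names and structure are mine, not xdiff's. It pairs up lines that are unique in both files and keeps the longest chain that is in order on both sides, using patience sorting (longest increasing subsequence).]

```python
from bisect import bisect_left
from collections import Counter

def unique_common(a, b):
    """(i, j) pairs for lines occurring exactly once in a AND once in b,
    sorted by position i in a."""
    ca, cb = Counter(a), Counter(b)
    pos_b = {line: j for j, line in enumerate(b)}
    return sorted((i, pos_b[line]) for i, line in enumerate(a)
                  if ca[line] == 1 and cb[line] == 1)

def patience_anchors(a, b):
    """Longest subsequence of unique-common pairs increasing in both files,
    found by patience sorting on the j coordinates."""
    piles = []   # smallest top j of each pile
    tops = []    # the (i, j) pair currently on top of each pile
    back = {}    # backpointer: pair -> predecessor in the best chain
    for pair in unique_common(a, b):
        k = bisect_left(piles, pair[1])
        back[pair] = tops[k - 1] if k else None
        if k == len(piles):
            piles.append(pair[1]); tops.append(pair)
        else:
            piles[k] = pair[1]; tops[k] = pair
    # walk back from the top of the last pile
    chain, p = [], tops[-1] if tops else None
    while p is not None:
        chain.append(p); p = back[p]
    return chain[::-1]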
* Re: [PATCH v3 1/3] Implement the patience diff algorithm 2009-01-07 18:10 ` Davide Libenzi 2009-01-07 18:32 ` Johannes Schindelin @ 2009-01-07 18:59 ` Linus Torvalds 2009-01-07 20:00 ` Johannes Schindelin 2009-01-07 20:11 ` Davide Libenzi 1 sibling, 2 replies; 67+ messages in thread From: Linus Torvalds @ 2009-01-07 18:59 UTC (permalink / raw) To: Davide Libenzi Cc: Johannes Schindelin, Pierre Habouzit, Francis Galiegue, Git ML On Wed, 7 Jan 2009, Davide Libenzi wrote: > > xdiff allows for diffing ranges, and the most efficent method is exactly > how you did ;) No, the performance problem is how it needs to re-hash everything. xdiff doesn't seem to have any way to use a "subset of the hash", so what the patience diff does is to use a subset of the mmfile, and then that will force all the rehashing to take place, which is kind of sad. It would be nice (for patience diff) if it could partition the _hashes_ instead of partitioning the _data_. That way it wouldn't need to rehash. See? Linus ^ permalink raw reply [flat|nested] 67+ messages in thread
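[Editorial note: a small Python sketch of the idea Linus describes, my own illustration rather than a proposed patch. The lines are hashed exactly once, up front; the recursion then passes index ranges into those hash arrays instead of re-slicing (and therefore re-hashing) the file data.]

```python
from zlib import crc32

def hash_lines(lines):
    """Hash the whole file once; nothing below touches the text again."""
    return [crc32(line.encode()) for line in lines]

def changed_region(ha, hb, a_lo, a_hi, b_lo, b_hi):
    """Trim the common head and tail of two half-open index ranges by
    comparing precomputed hashes; return the remaining changed window,
    on which any diff core could recurse -- no re-hashing involved."""
    while a_lo < a_hi and b_lo < b_hi and ha[a_lo] == hb[b_lo]:
        a_lo += 1; b_lo += 1
    while a_lo < a_hi and b_lo < b_hi and ha[a_hi - 1] == hb[b_hi - 1]:
        a_hi -= 1; b_hi -= 1
    return a_lo, a_hi, b_lo, b_hi
```

Partitioning at a patience sync point then means handing two (lo, hi) windows to the recursion, while the hash arrays stay untouched.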
* Re: [PATCH v3 1/3] Implement the patience diff algorithm 2009-01-07 18:59 ` Linus Torvalds @ 2009-01-07 20:00 ` Johannes Schindelin 2009-01-07 20:11 ` Davide Libenzi 1 sibling, 0 replies; 67+ messages in thread From: Johannes Schindelin @ 2009-01-07 20:00 UTC (permalink / raw) To: Linus Torvalds; +Cc: Davide Libenzi, Pierre Habouzit, Francis Galiegue, Git ML Hi, On Wed, 7 Jan 2009, Linus Torvalds wrote: > On Wed, 7 Jan 2009, Davide Libenzi wrote: > > > > xdiff allows for diffing ranges, and the most efficent method is > > exactly how you did ;) > > No, the performance problem is how it needs to re-hash everything. xdiff > doesn't seem to have any way to use a "subset of the hash", so what the > patience diff does is to use a subset of the mmfile, and then that will > force all the rehashing to take place, which is kind of sad. > > It would be nice (for patience diff) if it could partition the _hashes_ > instead of partitioning the _data_. That way it wouldn't need to rehash. > See? Actually, for libxdiff (non-patience), this is not possible, as xdl_cleanup_records() rewrites the hashes so that they are in a contiguous interval (0, ..., N-1), where N is the number of distinct lines found. I am also pretty certain that at least the non-patience part needs the maximum of that "hash" to be as small as possible. Having said that, if Linus likes patience diff enough to want it faster, Dscho will improve the speed by avoiding the rehashing for the patience diff part (although the lines for which patience has to fall back to Myers diff will _still_ rehash, but that does not hurt _that_ much in practice). Ciao, Dscho ^ permalink raw reply [flat|nested] 67+ messages in thread
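[Editorial note: a loose model of the renumbering step Dscho mentions, not the actual xdl_cleanup_records() code. Arbitrary line hashes get mapped into a dense 0..N-1 interval, so later passes can use them directly as array indices.]

```python
def compact_hashes(hashes_a, hashes_b):
    """Renumber arbitrary hash values into a contiguous 0..N-1 interval,
    where N is the number of distinct lines across both files."""
    mapping = {}
    def renumber(hashes):
        out = []
        for h in hashes:
            if h not in mapping:
                mapping[h] = len(mapping)  # next free dense id
            out.append(mapping[h])
        return out
    return renumber(hashes_a), renumber(hashes_b), len(mapping)
```

This is why partitioning the raw hashes is not enough for the classical path: once compacted, the "hashes" are really indices into per-file occurrence tables, and keeping N as small as possible matters.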
* Re: [PATCH v3 1/3] Implement the patience diff algorithm 2009-01-07 18:59 ` Linus Torvalds 2009-01-07 20:00 ` Johannes Schindelin @ 2009-01-07 20:11 ` Davide Libenzi 1 sibling, 0 replies; 67+ messages in thread From: Davide Libenzi @ 2009-01-07 20:11 UTC (permalink / raw) To: Linus Torvalds Cc: Johannes Schindelin, Pierre Habouzit, Francis Galiegue, Git ML On Wed, 7 Jan 2009, Linus Torvalds wrote: > On Wed, 7 Jan 2009, Davide Libenzi wrote: > > > > xdiff allows for diffing ranges, and the most efficent method is exactly > > how you did ;) > > No, the performance problem is how it needs to re-hash everything. xdiff > doesn't seem to have any way to use a "subset of the hash", so what the > patience diff does is to use a subset of the mmfile, and then that will > force all the rehashing to take place, which is kind of sad. > > It would be nice (for patience diff) if it could partition the _hashes_ > instead of partitioning the _data_. That way it wouldn't need to rehash. > See? Yeah, saw that afterwards ;) - Davide ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm 2009-01-06 19:40 ` Johannes Schindelin 2009-01-07 14:39 ` Pierre Habouzit @ 2009-01-07 20:15 ` Sam Vilain 2009-01-07 20:25 ` Linus Torvalds 2009-01-07 20:38 ` Johannes Schindelin 1 sibling, 2 replies; 67+ messages in thread From: Sam Vilain @ 2009-01-07 20:15 UTC (permalink / raw) To: Johannes Schindelin Cc: Pierre Habouzit, Linus Torvalds, davidel, Francis Galiegue, Git ML On Tue, 2009-01-06 at 20:40 +0100, Johannes Schindelin wrote: > Although I would like to see it in once it is fleshed out -- even if it > does not meet our usefulness standard -- because people said Git is > inferior for not providing a patience diff. If we have --patience, we can > say "but we have it, it's just not useful, check for yourself". Whatever happens, the current deterministic diff algorithm needs to stay for generating patch-id's... those really can't be allowed to change. Sam. ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm 2009-01-07 20:15 ` [PATCH 0/3] Teach Git about " Sam Vilain @ 2009-01-07 20:25 ` Linus Torvalds 2009-01-08 2:31 ` Sam Vilain 2009-01-07 20:38 ` Johannes Schindelin 1 sibling, 1 reply; 67+ messages in thread From: Linus Torvalds @ 2009-01-07 20:25 UTC (permalink / raw) To: Sam Vilain Cc: Johannes Schindelin, Pierre Habouzit, davidel, Francis Galiegue, Git ML On Thu, 8 Jan 2009, Sam Vilain wrote: > > Whatever happens, the current deterministic diff algorithm needs to stay > for generating patch-id's... those really can't be allowed to change. Sure they can. We never cache patch-id's over a long time. And we _have_ changed xdiff to modify the output of the patches before, quite regardless of any patience issues: see commit 9b28d55401a529ff08c709f42f66e765c93b0a20, which admittedly doesn't affect any _normal_ diffs, but can generate subtly different results for some cases. It's true that we want the diff algorithm to be deterministic in the sense that over the run of a _single_ rebase operation, the diff between two files should give similar and deterministic results, but that's certainly true of patience diff too. Linus ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm 2009-01-07 20:25 ` Linus Torvalds @ 2009-01-08 2:31 ` Sam Vilain 0 siblings, 0 replies; 67+ messages in thread From: Sam Vilain @ 2009-01-08 2:31 UTC (permalink / raw) To: Linus Torvalds Cc: Johannes Schindelin, Pierre Habouzit, davidel, Francis Galiegue, Git ML Linus Torvalds wrote: > On Thu, 8 Jan 2009, Sam Vilain wrote: > >> Whatever happens, the current deterministic diff algorithm needs to stay >> for generating patch-id's... those really can't be allowed to change. >> > > Sure they can. > > We never cache patch-id's over a long time. And we _have_ changed xdiff to > modify the output of the patches before, quite regardless of any patience > issues: see commit 9b28d55401a529ff08c709f42f66e765c93b0a20, which > admittedly doesn't affect any _normal_ diffs, but can generate subtly > different results for some cases. > There's at least one person who thinks that they should be deterministic enough to be able to be placed in commit messages; http://article.gmane.org/gmane.comp.version-control.git/95671 Now of course the git cherry-pick feature to add the old patch ID to the commit message isn't written yet; but unless there was a fall-back mode to produce a "stable" patch ID, these breadcrumbs would become (even more) worthless. Sam ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm 2009-01-07 20:15 ` [PATCH 0/3] Teach Git about " Sam Vilain 2009-01-07 20:25 ` Linus Torvalds @ 2009-01-07 20:38 ` Johannes Schindelin 2009-01-07 20:48 ` Junio C Hamano 1 sibling, 1 reply; 67+ messages in thread From: Johannes Schindelin @ 2009-01-07 20:38 UTC (permalink / raw) To: Sam Vilain Cc: Pierre Habouzit, Linus Torvalds, davidel, Francis Galiegue, Git ML Hi, On Thu, 8 Jan 2009, Sam Vilain wrote: > On Tue, 2009-01-06 at 20:40 +0100, Johannes Schindelin wrote: > > Although I would like to see it in once it is fleshed out -- even if it > > does not meet our usefulness standard -- because people said Git is > > inferior for not providing a patience diff. If we have --patience, we can > > say "but we have it, it's just not useful, check for yourself". > > Whatever happens, the current deterministic diff algorithm needs to stay > for generating patch-id's... those really can't be allowed to change. Oh, there is no discussion about replacing the diff algorithm we have right now. Even if we never output patch-ids anywhere, so we always recalculate them, and therefore would be free to replace the diff algorithm with something else. The thing why I do not want patience diff to replace the existing code is: - patience diff is slower, - patience diff is often not worth it (produces the same, maybe even worse output), and - patience diff needs the existing code as a fallback anyway. Where it could possibly help to change existing behavior is with merging. 
So maybe somebody has some time to play with, and can apply this patch:

-- snip --
diff --git a/ll-merge.c b/ll-merge.c
index fa2ca52..f731026 100644
--- a/ll-merge.c
+++ b/ll-merge.c
@@ -79,6 +79,8 @@ static int ll_xdl_merge(const struct ll_merge_driver *drv_unused,
 	memset(&xpp, 0, sizeof(xpp));
 	if (git_xmerge_style >= 0)
 		style = git_xmerge_style;
+	if (getenv("GIT_PATIENCE_MERGE"))
+		xpp.flags |= XDF_PATIENCE_DIFF;
 	return xdl_merge(orig, src1, name1, src2, name2,
-- snap --

After compiling and installing, something like this should be fun to watch:

$ git rev-list --all --parents | \
	grep " .* " | \
	while read commit parent1 parent2 otherparents
	do
		test -z "$otherparents" || continue
		git checkout $parent1 &&
		git merge $parent2 &&
		git diff > without-patience.txt &&
		git reset --hard $parent1 &&
		GIT_PATIENCE_MERGE=1 git merge $parent2 &&
		git diff > with-patience.txt &&
		if ! cmp without-patience.txt with-patience.txt
		then
			echo '==============================='
			echo "differences found in merge $commit"
			echo "without patience: $(wc -l < without-patience.txt)"
			echo "with patience: $(wc -l < with-patience.txt)"
			echo '-------------------------------'
			echo 'without patience:'
			cat without-patience.txt
			echo '-------------------------------'
			echo 'with patience:'
			cat with-patience.txt
		fi ||
		exit
	done | tee patience-merge.out

Ciao, Dscho ^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm 2009-01-07 20:38 ` Johannes Schindelin @ 2009-01-07 20:48 ` Junio C Hamano 2009-01-07 22:00 ` Johannes Schindelin 0 siblings, 1 reply; 67+ messages in thread From: Junio C Hamano @ 2009-01-07 20:48 UTC (permalink / raw) To: Johannes Schindelin Cc: Sam Vilain, Pierre Habouzit, Linus Torvalds, davidel, Francis Galiegue, Git ML

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> After compiling and installing, something like this should be fun to
> watch:
>
> $ git rev-list --all --parents | \
> 	grep " .* " | \
> 	while read commit parent1 parent2 otherparents
> 	do
> 		test -z "$otherparents" || continue
> 		git checkout $parent1 &&
> 		git merge $parent2 &&
> 		git diff > without-patience.txt &&
> ...
> 		if ! cmp without-patience.txt with-patience.txt
> 		then
> 			echo '==============================='
> 			echo "differences found in merge $commit"
> ...
> 			cat with-patience.txt
> 		fi ||
> 		exit
> 	done | tee patience-merge.out

An even more interesting test would be possible by dropping "&&" from the two "git merge" invocations.

- Your sample will exit at the first conflicting merge otherwise.

- You may find cases where one resolves cleanly while the other leaves conflicts.

^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm 2009-01-07 20:48 ` Junio C Hamano @ 2009-01-07 22:00 ` Johannes Schindelin 2009-01-07 22:45 ` Pierre Habouzit 0 siblings, 1 reply; 67+ messages in thread From: Johannes Schindelin @ 2009-01-07 22:00 UTC (permalink / raw) To: Junio C Hamano Cc: Sam Vilain, Pierre Habouzit, Linus Torvalds, davidel, Francis Galiegue, Git ML

Hi,

On Wed, 7 Jan 2009, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> > After compiling and installing, something like this should be fun to
> > watch:
> >
> > $ git rev-list --all --parents | \
> > 	grep " .* " | \
> > 	while read commit parent1 parent2 otherparents
> > 	do
> > 		test -z "$otherparents" || continue
> > 		git checkout $parent1 &&
> > 		git merge $parent2 &&
> > 		git diff > without-patience.txt &&
> > ...
> > 		if ! cmp without-patience.txt with-patience.txt
> > 		then
> > 			echo '==============================='
> > 			echo "differences found in merge $commit"
> > ...
> > 			cat with-patience.txt
> > 		fi ||
> > 		exit
> > 	done | tee patience-merge.out
>
> An even more interesting test would be possible by dropping "&&" from the
> two "git merge" invocations.
>
> - Your sample will exit at the first conflicting merge otherwise.
>
> - You may find cases where one resolves cleanly while the other leaves
>   conflicts.

Yeah, that's why I always put "like" into that phrase "something like this"... :-)

Actually, I had to read something and did not want my box to sit idle while I was doing that, so...

The most interesting thing to me was: of the 4072 merges I have in my local git.git clone, only 66 show a difference.

The next interesting thing: none -- I repeat, none! -- resulted in only one of both methods having conflicts. In all cases, if patience merge had conflicts, so had the classical merge, and vice versa. I would have expected patience merge to handle some conflicts more gracefully.
So let's go on to the next metric: what are the differences in the --cc diffs' line counts? On average, patience merge produced 1.03225806451613 more lines of --cc diff, and the standard deviation between the line counts is 42.9823669772587. So from the line counts' point of view, the difference is lost in the noise.

So let's look at a concrete example. I take 41a3e3aa9bdaede9ab7afed206428c1b071060d2, as it is one of the three merges with minimal --cc diff line counts (they all have 33 lines) and where patience merge makes a difference.

This is the --cc diff without patience merge:

-- snip --
diff --cc git-am.sh
index a391254,5a7695e..0000000
--- a/git-am.sh
+++ b/git-am.sh
@@@ -327,11 -327,20 +327,28 @@@ d
 	echo "Patch is empty. Was it split wrong?"
 	stop_here $this
 	}
++<<<<<<< HEAD:git-am.sh
+	SUBJECT="$(sed -n '/^Subject/ s/Subject: //p' "$dotest/info")"
+	case "$keep_subject" in -k) SUBJECT="[PATCH] $SUBJECT" ;; esac
+
+	(printf '%s\n\n' "$SUBJECT"; cat "$dotest/msg") |
+	git stripspace > "$dotest/msg-clean"
++=======
+	if test -f "$dotest/rebasing" &&
+		commit=$(sed -e 's/^From \([0-9a-f]*\) .*/\1/' \
+			-e q "$dotest/$msgnum") &&
+		test "$(git cat-file -t "$commit")" = commit
+	then
+		git cat-file commit "$commit" |
+		sed -e '1,/^$/d' >"$dotest/msg-clean"
+	else
+		SUBJECT="$(sed -n '/^Subject/ s/Subject: //p' "$dotest/info")"
+		case "$keep_subject" in -k) SUBJECT="[PATCH] $SUBJECT" ;; esac
+
+		(echo "$SUBJECT" ; echo ; cat "$dotest/msg") |
+		git stripspace > "$dotest/msg-clean"
+	fi
++>>>>>>> 5e835cac8622373724235d299f1331ac4cf81ccf:git-am.sh
 	;;
 esac
-- snap --

Compare this with the --cc diff _with_ patience merge:

-- snip --
diff --cc git-am.sh
index a391254,5a7695e..0000000
--- a/git-am.sh
+++ b/git-am.sh
@@@ -327,11 -327,20 +327,25 @@@ d
 	echo "Patch is empty. Was it split wrong?"
 	stop_here $this
 	}
-	SUBJECT="$(sed -n '/^Subject/ s/Subject: //p' "$dotest/info")"
-	case "$keep_subject" in -k) SUBJECT="[PATCH] $SUBJECT" ;; esac
-
+	if test -f "$dotest/rebasing" &&
+		commit=$(sed -e 's/^From \([0-9a-f]*\) .*/\1/' \
+			-e q "$dotest/$msgnum") &&
+		test "$(git cat-file -t "$commit")" = commit
+	then
+		git cat-file commit "$commit" |
+		sed -e '1,/^$/d' >"$dotest/msg-clean"
+	else
+		SUBJECT="$(sed -n '/^Subject/ s/Subject: //p' "$dotest/info")"
+		case "$keep_subject" in -k) SUBJECT="[PATCH] $SUBJECT" ;; esac
+
++<<<<<<< HEAD:git-am.sh
+	(printf '%s\n\n' "$SUBJECT"; cat "$dotest/msg") |
+	git stripspace > "$dotest/msg-clean"
++=======
+	(echo "$SUBJECT" ; echo ; cat "$dotest/msg") |
+	git stripspace > "$dotest/msg-clean"
+	fi
++>>>>>>> 5e835cac8622373724235d299f1331ac4cf81ccf:git-am.sh
 	;;
 esac
-- snap --

So, the patience merge resulted in a much smaller _conflict_.

However, another such merge is 276328ffb87cefdc515bee5f09916aea6e0244ed. This is the --cc diff without patience merge:

-- snip --
diff --cc diff.c
index 4e4e439,f91f256..0000000
--- a/diff.c
+++ b/diff.c
@@@ -1498,19 -1464,13 +1498,28 @@@ static void builtin_diff(const char *na
 	char *a_one, *b_two;
 	const char *set = diff_get_color_opt(o, DIFF_METAINFO);
 	const char *reset = diff_get_color_opt(o, DIFF_RESET);
+	const char *a_prefix, *b_prefix;
+
+	diff_set_mnemonic_prefix(o, "a/", "b/");
+	if (DIFF_OPT_TST(o, REVERSE_DIFF)) {
+		a_prefix = o->b_prefix;
+		b_prefix = o->a_prefix;
+	} else {
+		a_prefix = o->a_prefix;
+		b_prefix = o->b_prefix;
+	}
++<<<<<<< HEAD:diff.c
+	a_one = quote_two(a_prefix, name_a + (*name_a == '/'));
+	b_two = quote_two(b_prefix, name_b + (*name_b == '/'));
++=======
+	/* Never use a non-valid filename anywhere if at all possible */
+	name_a = DIFF_FILE_VALID(one) ? name_a : name_b;
+	name_b = DIFF_FILE_VALID(two) ? name_b : name_a;
+
+	a_one = quote_two(o->a_prefix, name_a + (*name_a == '/'));
+	b_two = quote_two(o->b_prefix, name_b + (*name_b == '/'));
++>>>>>>> e261cf94848d31868c21fb11cade51c30dfcdbe7:diff.c
 	lbl[0] = DIFF_FILE_VALID(one) ? a_one : "/dev/null";
 	lbl[1] = DIFF_FILE_VALID(two) ? b_two : "/dev/null";
 	fprintf(o->file, "%sdiff --git %s %s%s\n", set, a_one, b_two, reset);
-- snap --

And this is _with_ patience merge:

-- snip --
diff --cc diff.c
index 4e4e439,f91f256..0000000
--- a/diff.c
+++ b/diff.c
@@@ -1498,19 -1464,13 +1498,28 @@@ static void builtin_diff(const char *na
 	char *a_one, *b_two;
 	const char *set = diff_get_color_opt(o, DIFF_METAINFO);
 	const char *reset = diff_get_color_opt(o, DIFF_RESET);
+	const char *a_prefix, *b_prefix;
+
++<<<<<<< HEAD:diff.c
+	diff_set_mnemonic_prefix(o, "a/", "b/");
+	if (DIFF_OPT_TST(o, REVERSE_DIFF)) {
+		a_prefix = o->b_prefix;
+		b_prefix = o->a_prefix;
+	} else {
+		a_prefix = o->a_prefix;
+		b_prefix = o->b_prefix;
+	}
+	a_one = quote_two(a_prefix, name_a + (*name_a == '/'));
+	b_two = quote_two(b_prefix, name_b + (*name_b == '/'));
++=======
+	/* Never use a non-valid filename anywhere if at all possible */
+	name_a = DIFF_FILE_VALID(one) ? name_a : name_b;
+	name_b = DIFF_FILE_VALID(two) ? name_b : name_a;
+
+	a_one = quote_two(o->a_prefix, name_a + (*name_a == '/'));
+	b_two = quote_two(o->b_prefix, name_b + (*name_b == '/'));
++>>>>>>> e261cf94848d31868c21fb11cade51c30dfcdbe7:diff.c
 	lbl[0] = DIFF_FILE_VALID(one) ? a_one : "/dev/null";
 	lbl[1] = DIFF_FILE_VALID(two) ? b_two : "/dev/null";
 	fprintf(o->file, "%sdiff --git %s %s%s\n", set, a_one, b_two, reset);
-- snap --

So again, we have no clear winner.

Therefore I counted the lines between conflict markers (actually, a perl script did). Of these 66 merges, on average patience merge produced 4.46774193548387 _fewer_ lines between conflict markers.

Take that with a grain of salt, though: the standard deviation of this difference is a hefty 121.163046639509 lines.
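[Editorial note: the Perl script that did the counting is not shown in the thread; the following hypothetical Python equivalent illustrates the metric. Whether the ======= separator line counts toward the total is my assumption, not stated in the original.]

```python
def conflict_line_count(merged_text):
    """Count lines lying between <<<<<<< and >>>>>>> conflict markers;
    the ======= separator is counted, the markers themselves are not."""
    count, inside = 0, False
    for line in merged_text.splitlines():
        if line.startswith("<<<<<<<"):
            inside = True
        elif line.startswith(">>>>>>>"):
            inside = False
        elif inside:
            count += 1
    return count
```

Run over each merged file of both merge results, summed per merge, this yields the per-merge difference being averaged above.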
The worst case for patience diff was the merge 4698ef555a1706fe322a68a02a21fb1087940ac3, where the --cc diff line counts are 1300 (without) vs 1301 (with patience merge), but the lines between conflict markers are 197 vs a ridiculous 826 lines! But you should take that also with a grain of salt: this merge is a _subtree_ merge, and my test redid it as a _non-subtree_ merge. So I restricted the analysis to the non-subtree merges, and now non-patience merge comes out 6.97297297297297 conflict lines fewer than patience merge, with a standard deviation of 58.941106657867 (with a total count of 37 merges). Note that ~7 lines difference with a standard deviation of ~59 lines is pretty close to ~0 lines difference. In the end, the additional expense of patience merge might just not be worth it. Ciao, Dscho ^ permalink raw reply [flat|nested] 67+ messages in thread
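[Editorial note: the statistical argument above -- a mean difference much smaller than its standard deviation -- can be reproduced on any sample. The numbers below are made up for illustration; they are not Dscho's actual per-merge data.]

```python
from statistics import mean, stdev

# Hypothetical per-merge differences in conflict-line counts
# (patience minus classical); illustrative only.
diffs = [-3, 0, 12, -40, 5, 1, -2, 0, 7, -8]

m, s = mean(diffs), stdev(diffs)
# A mean lying well inside one standard deviation is
# statistically indistinguishable from zero.
print(f"mean={m:.2f} stdev={s:.2f}")
```

With a spread this large relative to the mean, neither algorithm can be declared the winner on conflict size alone.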
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm 2009-01-07 22:00 ` Johannes Schindelin @ 2009-01-07 22:45 ` Pierre Habouzit 2009-01-07 23:03 ` Johannes Schindelin 0 siblings, 1 reply; 67+ messages in thread From: Pierre Habouzit @ 2009-01-07 22:45 UTC (permalink / raw) To: Johannes Schindelin Cc: Junio C Hamano, Sam Vilain, Linus Torvalds, davidel, Francis Galiegue, Git ML [-- Attachment #1: Type: text/plain, Size: 2168 bytes --] On Wed, Jan 07, 2009 at 10:00:07PM +0000, Johannes Schindelin wrote: > Therefore I counted the lines between conflict markers (actually, a perl > script did). Of these 66 merges, on average patience merge produced > 4.46774193548387 _fewer_ lines between conflict markers. > > Take that with a grain of salt, though: the standard deviation of this > difference is a hefty 121.163046639509 lines. > > The worst case for patience diff was the merge > 4698ef555a1706fe322a68a02a21fb1087940ac3, where the --cc diff line counts > are 1300 (without) vs 1301 (with patience merge), but the lines between > conflict markers are 197 vs a ridiculous 826 lines! > > But you should take that also with a grain of salt: this merge is a > _subtree_ merge, and my test redid it as a _non-subtree_ merge. > > So I restricted the analysis to the non-subtree merges, and now > non-patience merge comes out 6.97297297297297 conflict lines fewer than > patience merge, with a standard deviation of 58.941106657867 (with a total > count of 37 merges). > > Note that ~7 lines difference with a standard deviation of ~59 lines is > pretty close to ~0 lines difference. > > In the end, the additional expense of patience merge might just not be > worth it. Depends, if it can help generating nicer merges, it's good to have. We could have an option to git-merge that tries hard to generate the smallest conflict possible. _that_ would really really be worth it. 
I mean, I've had really tricky conflicts to work with where git-merge generated ridiculously big conflicts, and where I had to resort to using UI tools to perform the merge (meld IIRC, to name it), and given how slow and crappy those tools are, I would gladly restart a merge with a --generate-smallest-conflicts-as-possible option if it could save me from those merge tools. YMMV though.

PS: I never thought the patience diff is a silver bullet, it's just yet another tool in the toolbox.

-- ·O· Pierre Habouzit ··O madcoder@debian.org OOO http://www.madism.org

[-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm 2009-01-07 22:45 ` Pierre Habouzit @ 2009-01-07 23:03 ` Johannes Schindelin 0 siblings, 0 replies; 67+ messages in thread From: Johannes Schindelin @ 2009-01-07 23:03 UTC (permalink / raw) To: Pierre Habouzit Cc: Junio C Hamano, Sam Vilain, Linus Torvalds, davidel, Francis Galiegue, Git ML Hi, On Wed, 7 Jan 2009, Pierre Habouzit wrote: > On Wed, Jan 07, 2009 at 10:00:07PM +0000, Johannes Schindelin wrote: > > Therefore I counted the lines between conflict markers (actually, a perl > > script did). Of these 66 merges, on average patience merge produced > > 4.46774193548387 _fewer_ lines between conflict markers. > > > > Take that with a grain of salt, though: the standard deviation of this > > difference is a hefty 121.163046639509 lines. > > > > The worst case for patience diff was the merge > > 4698ef555a1706fe322a68a02a21fb1087940ac3, where the --cc diff line counts > > are 1300 (without) vs 1301 (with patience merge), but the lines between > > conflict markers are 197 vs a ridiculous 826 lines! > > > > But you should take that also with a grain of salt: this merge is a > > _subtree_ merge, and my test redid it as a _non-subtree_ merge. > > > > So I restricted the analysis to the non-subtree merges, and now > > non-patience merge comes out 6.97297297297297 conflict lines fewer than > > patience merge, with a standard deviation of 58.941106657867 (with a total > > count of 37 merges). > > > > Note that ~7 lines difference with a standard deviation of ~59 lines is > > pretty close to ~0 lines difference. > > > > In the end, the additional expense of patience merge might just not be > > worth it. > > Depends, if it can help generating nicer merges, it's good to have. > > We could have an option to git-merge that tries hard to generate the > smallest conflict possible. I also showed you examples, in addition to numbers, exactly to point out that shorter conflicts do not necessarily mean nicer conflicts. 
Ciao, Dscho ^ permalink raw reply [flat|nested] 67+ messages in thread
end of thread, other threads:[~2009-01-10 11:37 UTC | newest] Thread overview: 67+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-11-04 0:40 libxdiff and patience diff Pierre Habouzit 2008-11-04 3:17 ` Davide Libenzi 2008-11-04 8:33 ` Pierre Habouzit 2008-11-04 5:39 ` Johannes Schindelin 2008-11-04 8:30 ` Pierre Habouzit 2008-11-04 14:34 ` Johannes Schindelin 2008-11-04 15:23 ` Pierre Habouzit 2008-11-04 15:57 ` Johannes Schindelin 2008-11-04 16:15 ` Pierre Habouzit 2009-01-01 16:38 ` [PATCH 0/3] Teach Git about the patience diff algorithm Johannes Schindelin 2009-01-01 16:38 ` [PATCH 1/3] Implement " Johannes Schindelin 2009-01-01 16:39 ` [PATCH 2/3] Introduce the diff option '--patience' Johannes Schindelin 2009-01-01 16:39 ` [PATCH 3/3] bash completions: Add the --patience option Johannes Schindelin 2009-01-01 19:45 ` [PATCH 0/3] Teach Git about the patience diff algorithm Linus Torvalds 2009-01-01 20:00 ` Linus Torvalds 2009-01-02 18:17 ` Johannes Schindelin 2009-01-02 18:49 ` Linus Torvalds 2009-01-02 19:07 ` Johannes Schindelin 2009-01-02 18:51 ` Jeff King 2009-01-02 21:59 ` [PATCH 1/3 v2] Implement " Johannes Schindelin 2009-01-02 21:59 ` Johannes Schindelin 2009-01-01 20:46 ` [PATCH 0/3] Teach Git about " Adeodato Simó 2009-01-02 1:56 ` Linus Torvalds 2009-01-02 10:55 ` Clemens Buchacher 2009-01-02 10:58 ` Clemens Buchacher 2009-01-02 16:42 ` Linus Torvalds 2009-01-02 18:46 ` Johannes Schindelin 2009-01-02 19:03 ` Linus Torvalds 2009-01-02 19:22 ` Johannes Schindelin 2009-01-02 19:39 ` Jeff King 2009-01-02 19:50 ` Jeff King 2009-01-02 20:52 ` Jeff King 2009-01-02 23:05 ` Linus Torvalds 2009-01-03 16:24 ` Bazaar's patience diff as GIT_EXTERNAL_DIFF Adeodato Simó 2009-01-02 21:59 ` [PATCH 0/3] Teach Git about the patience diff algorithm Johannes Schindelin 2009-01-08 19:55 ` Adeodato Simó 2009-01-08 20:06 ` Adeodato Simó 2009-01-09 6:54 ` Junio C Hamano 2009-01-09 13:07 ` Johannes Schindelin 2009-01-09 
15:59 ` Adeodato Simó 2009-01-09 18:09 ` Linus Torvalds 2009-01-09 18:13 ` Linus Torvalds 2009-01-09 20:53 ` Junio C Hamano 2009-01-10 11:36 ` Johannes Schindelin 2009-01-02 11:03 ` Junio C Hamano 2009-01-02 18:50 ` Adeodato Simó 2009-01-06 11:17 ` Pierre Habouzit 2009-01-06 11:39 ` Pierre Habouzit 2009-01-06 19:40 ` Johannes Schindelin 2009-01-07 14:39 ` Pierre Habouzit 2009-01-07 17:01 ` Johannes Schindelin 2009-01-07 17:04 ` [PATCH v3 1/3] Implement " Johannes Schindelin 2009-01-07 18:10 ` Davide Libenzi 2009-01-07 18:32 ` Johannes Schindelin 2009-01-07 20:09 ` Davide Libenzi 2009-01-07 20:19 ` Johannes Schindelin 2009-01-07 18:59 ` Linus Torvalds 2009-01-07 20:00 ` Johannes Schindelin 2009-01-07 20:11 ` Davide Libenzi 2009-01-07 20:15 ` [PATCH 0/3] Teach Git about " Sam Vilain 2009-01-07 20:25 ` Linus Torvalds 2009-01-08 2:31 ` Sam Vilain 2009-01-07 20:38 ` Johannes Schindelin 2009-01-07 20:48 ` Junio C Hamano 2009-01-07 22:00 ` Johannes Schindelin 2009-01-07 22:45 ` Pierre Habouzit 2009-01-07 23:03 ` Johannes Schindelin