[PATCH 0/4] diff-highlight: make a few improvements

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/4] diff-highlight: make a few improvements
@ 2015-11-03  2:05 Jonathan Lebon
  2015-11-03  2:05 ` [PATCH 1/4] diff-highlight: add `less -r` to cmd in README Jonathan Lebon
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Jonathan Lebon @ 2015-11-03  2:05 UTC (permalink / raw)
  To: git; +Cc: peff, Jonathan Lebon

These patches bring a few improvements to the contrib/diff-highlight
Perl script. The major improvement is done in patch 3/4, which improves
diff-highlighting accuracy by implementing a recursive line matching
algorithm.

Please note that I have limited experience with Perl, so there may be
better ways to do things. (Let me know if that is the case!)

Jonathan Lebon (4):
  diff-highlight: add `less -r` to cmd in README
  diff-highlight: factor out prefix/suffix functions
  diff-highlight: match up lines before highlighting
  diff-highlight: add maxhunksize config option

 contrib/diff-highlight/README         |  87 +++++-----------
 contrib/diff-highlight/diff-highlight | 189 ++++++++++++++++++++++++----------
 2 files changed, 161 insertions(+), 115 deletions(-)

-- 
2.6.0

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/4] diff-highlight: add `less -r` to cmd in README
  2015-11-03  2:05 [PATCH 0/4] diff-highlight: make a few improvements Jonathan Lebon
@ 2015-11-03  2:05 ` Jonathan Lebon
  2015-11-03  2:41   ` Junio C Hamano
  2015-11-03 21:14   ` Jeff King
  2015-11-03  2:05 ` [PATCH 2/4] diff-highlight: factor out prefix/suffix functions Jonathan Lebon
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 17+ messages in thread
From: Jonathan Lebon @ 2015-11-03  2:05 UTC (permalink / raw)
  To: git; +Cc: peff, Jonathan Lebon

As it is, the suggested command for trying out diff-highlight will just
dump the whole git log output to the terminal. Let's pipe it through
`less` so users aren't surprised on the first try.

Signed-off-by: Jonathan Lebon <jonathan.lebon@gmail.com>
---
 contrib/diff-highlight/README | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/contrib/diff-highlight/README b/contrib/diff-highlight/README
index 836b97a..bbbfdda 100644
--- a/contrib/diff-highlight/README
+++ b/contrib/diff-highlight/README
@@ -44,9 +44,9 @@ Use
 
 You can try out the diff-highlight program with:
 
----------------------------------------------
-git log -p --color | /path/to/diff-highlight
----------------------------------------------
+------------------------------------------------------
+git log -p --color | /path/to/diff-highlight | less -r
+------------------------------------------------------
 
 If you want to use it all the time, drop it in your $PATH and put the
 following in your git configuration:
-- 
2.6.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/4] diff-highlight: factor out prefix/suffix functions
  2015-11-03  2:05 [PATCH 0/4] diff-highlight: make a few improvements Jonathan Lebon
  2015-11-03  2:05 ` [PATCH 1/4] diff-highlight: add `less -r` to cmd in README Jonathan Lebon
@ 2015-11-03  2:05 ` Jonathan Lebon
  2015-11-03 21:17   ` Jeff King
  2015-11-03  2:05 ` [PATCH 3/4] diff-highlight: match up lines before highlighting Jonathan Lebon
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Jonathan Lebon @ 2015-11-03  2:05 UTC (permalink / raw)
  To: git; +Cc: peff, Jonathan Lebon

In preparation for the next patch, we factor out the functions for
finding the common prefix and suffix between two lines.

Signed-off-by: Jonathan Lebon <jonathan.lebon@gmail.com>
---
 contrib/diff-highlight/diff-highlight | 98 ++++++++++++++++++++---------------
 1 file changed, 56 insertions(+), 42 deletions(-)

diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight
index ffefc31..a332f86 100755
--- a/contrib/diff-highlight/diff-highlight
+++ b/contrib/diff-highlight/diff-highlight
@@ -110,48 +110,8 @@ sub highlight_pair {
 	my @a = split_line(shift);
 	my @b = split_line(shift);
 
-	# Find common prefix, taking care to skip any ansi
-	# color codes.
-	my $seen_plusminus;
-	my ($pa, $pb) = (0, 0);
-	while ($pa < @a && $pb < @b) {
-		if ($a[$pa] =~ /$COLOR/) {
-			$pa++;
-		}
-		elsif ($b[$pb] =~ /$COLOR/) {
-			$pb++;
-		}
-		elsif ($a[$pa] eq $b[$pb]) {
-			$pa++;
-			$pb++;
-		}
-		elsif (!$seen_plusminus && $a[$pa] eq '-' && $b[$pb] eq '+') {
-			$seen_plusminus = 1;
-			$pa++;
-			$pb++;
-		}
-		else {
-			last;
-		}
-	}
-
-	# Find common suffix, ignoring colors.
-	my ($sa, $sb) = ($#a, $#b);
-	while ($sa >= $pa && $sb >= $pb) {
-		if ($a[$sa] =~ /$COLOR/) {
-			$sa--;
-		}
-		elsif ($b[$sb] =~ /$COLOR/) {
-			$sb--;
-		}
-		elsif ($a[$sa] eq $b[$sb]) {
-			$sa--;
-			$sb--;
-		}
-		else {
-			last;
-		}
-	}
+	my ($pa, $pb) = find_common_prefix(\@a, \@b);
+	my ($sa, $sb) = find_common_suffix(\@a, $pa, \@b, $pb);
 
 	if (is_pair_interesting(\@a, $pa, $sa, \@b, $pb, $sb)) {
 		return highlight_line(\@a, $pa, $sa, \@OLD_HIGHLIGHT),
@@ -173,6 +133,60 @@ sub split_line {
 		split /($COLOR+)/;
 }
 
+sub find_common_prefix {
+	my ($a, $b) = @_;
+
+	# Take care to skip any ansi color codes.
+	my $seen_plusminus;
+	my ($pa, $pb) = (0, 0);
+	while ($pa < @$a && $pb < @$b) {
+		if ($a->[$pa] =~ /$COLOR/) {
+			$pa++;
+		}
+		elsif ($b->[$pb] =~ /$COLOR/) {
+			$pb++;
+		}
+		elsif ($a->[$pa] eq $b->[$pb]) {
+			$pa++;
+			$pb++;
+		}
+		elsif (!$seen_plusminus && $a->[$pa] eq '-' && $b->[$pb] eq '+') {
+			$seen_plusminus = 1;
+			$pa++;
+			$pb++;
+		}
+		else {
+			last;
+		}
+	}
+
+	return $pa, $pb;
+}
+
+sub find_common_suffix {
+	my ($a, $pa, $b, $pb) = @_;
+
+	# Take care to skip any ansi color codes.
+	my ($sa, $sb) = ($#$a, $#$b);
+	while ($sa >= $pa && $sb >= $pb) {
+		if ($a->[$sa] =~ /$COLOR/) {
+			$sa--;
+		}
+		elsif ($b->[$sb] =~ /$COLOR/) {
+			$sb--;
+		}
+		elsif ($a->[$sa] eq $b->[$sb]) {
+			$sa--;
+			$sb--;
+		}
+		else {
+			last;
+		}
+	}
+
+	return $sa, $sb;
+}
+
 sub highlight_line {
 	my ($line, $prefix, $suffix, $theme) = @_;
 
-- 
2.6.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 3/4] diff-highlight: match up lines before highlighting
  2015-11-03  2:05 [PATCH 0/4] diff-highlight: make a few improvements Jonathan Lebon
  2015-11-03  2:05 ` [PATCH 1/4] diff-highlight: add `less -r` to cmd in README Jonathan Lebon
  2015-11-03  2:05 ` [PATCH 2/4] diff-highlight: factor out prefix/suffix functions Jonathan Lebon
@ 2015-11-03  2:05 ` Jonathan Lebon
  2015-11-03 21:44   ` Jeff King
  2015-11-03  2:05 ` [PATCH 4/4] diff-highlight: add maxhunksize config option Jonathan Lebon
  2015-11-03 22:07 ` [PATCH 0/4] diff-highlight: make a few improvements Jeff King
  4 siblings, 1 reply; 17+ messages in thread
From: Jonathan Lebon @ 2015-11-03  2:05 UTC (permalink / raw)
  To: git; +Cc: peff, Jonathan Lebon

As mentioned in the README, one of the current limitations of
diff-highlight is that it only calculates highlights when the hunk
contains the same number of removed lines as added lines.

A further limitation upon this is that diff-highlight assumes that the
first line removed matches the first line added, similarly with the
second, the third, etc... As was demonstrated in the "Bugs" section of
the README, this poses limitations since that assumption does not always
give the best result.

With this patch, we eliminate those limitations by trying to match up
the removed and added lines before highlighting them. This is done using
a recursive algorithm.

Note that I did not bother with some common optimizations such as
memoization since the usual number of removed/added lines in a single
hunk are small. In practice, I have not felt any lag at all during
paging.

Signed-off-by: Jonathan Lebon <jonathan.lebon@gmail.com>
---
 contrib/diff-highlight/README         | 61 +------------------------
 contrib/diff-highlight/diff-highlight | 83 +++++++++++++++++++++++++++++------
 2 files changed, 70 insertions(+), 74 deletions(-)

diff --git a/contrib/diff-highlight/README b/contrib/diff-highlight/README
index bbbfdda..885ff2f 100644
--- a/contrib/diff-highlight/README
+++ b/contrib/diff-highlight/README
@@ -14,17 +14,7 @@ Instead, this script post-processes the line-oriented diff, finds pairs
 of lines, and highlights the differing segments.  It's currently very
 simple and stupid about doing these tasks. In particular:
 
-  1. It will only highlight hunks in which the number of removed and
-     added lines is the same, and it will pair lines within the hunk by
-     position (so the first removed line is compared to the first added
-     line, and so forth). This is simple and tends to work well in
-     practice. More complex changes don't highlight well, so we tend to
-     exclude them due to the "same number of removed and added lines"
-     restriction. Or even if we do try to highlight them, they end up
-     not highlighting because of our "don't highlight if the whole line
-     would be highlighted" rule.
-
-  2. It will find the common prefix and suffix of two lines, and
+  1. It will find the common prefix and suffix of two lines, and
      consider everything in the middle to be "different". It could
      instead do a real diff of the characters between the two lines and
      find common subsequences. However, the point of the highlight is to
@@ -142,52 +132,3 @@ heuristics.
 -----------------------------------------------------
 
    which is less readable than the current output.
-
-2. The multi-line matching assumes that lines in the pre- and post-image
-   match by position. This is often the case, but can be fooled when a
-   line is removed from the top and a new one added at the bottom (or
-   vice versa). Unless the lines in the middle are also changed, diffs
-   will show this as two hunks, and it will not get highlighted at all
-   (which is good). But if the lines in the middle are changed, the
-   highlighting can be misleading. Here's a pathological case:
-
------------------------------------------------------
--one
--two
--three
--four
-+two 2
-+three 3
-+four 4
-+five 5
------------------------------------------------------
-
-   which gets highlighted as:
-
------------------------------------------------------
--one
--t-{wo}
--three
--f-{our}
-+two 2
-+t+{hree 3}
-+four 4
-+f+{ive 5}
------------------------------------------------------
-
-   because it matches "two" to "three 3", and so forth. It would be
-   nicer as:
-
------------------------------------------------------
--one
--two
--three
--four
-+two +{2}
-+three +{3}
-+four +{4}
-+five 5
------------------------------------------------------
-
-   which would probably involve pre-matching the lines into pairs
-   according to some heuristic.
diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight
index a332f86..46556fc 100755
--- a/contrib/diff-highlight/diff-highlight
+++ b/contrib/diff-highlight/diff-highlight
@@ -88,24 +88,79 @@ sub show_hunk {
 		return;
 	}
 
-	# If we have mismatched numbers of lines on each side, we could try to
-	# be clever and match up similar lines. But for now we are simple and
-	# stupid, and only handle multi-line hunks that remove and add the same
-	# number of lines.
-	if (@$a != @$b) {
-		print @$a, @$b;
-		return;
-	}
-
 	my @queue;
-	for (my $i = 0; $i < @$a; $i++) {
-		my ($rm, $add) = highlight_pair($a->[$i], $b->[$i]);
-		print $rm;
-		push @queue, $add;
-	}
+	match_and_highlight_pairs($a, 0, scalar @$a, $b, 0, scalar @$b, \@queue);
 	print @queue;
 }
 
+# Here, we try to be clever and match up similar lines. I.e. we try to
+# find which lines in the `rem` lines (array a) became which other lines
+# in the `add` lines (array b). To do this, we use a recursive algorithm
+# that works as follow:
+# 	1. Find the most similar pair of lines in all possible pairs
+# 	2. Do a recursive call to find the most similar pair of lines in all
+# 	   pairs, restricted to lower indices
+# 	3. Print the `rem` line of the best pair
+# 	4. Queue the `add` line of the best pair
+# 	5. Do a recursive call to find the most similar pair of lines in all
+# 	   pairs, restricted to higher indices
+sub match_and_highlight_pairs {
+	my ($a, $a_first, $a_last, $b,  $b_first, $b_last, $queue) = @_;
+
+	# base case: no more rem or add lines to pair up
+	if ($a_first >= $a_last || $b_first >= $b_last) {
+
+		# flush out any remaining rem lines
+		for (my $i = $a_first; $i < $a_last; $i++) {
+			print $a->[$i];
+		}
+
+		# queue up any remaining add lines
+		for (my $i = $b_first; $i < $b_last; $i++) {
+			push @$queue, $b->[$i];
+		}
+
+		return;
+	}
+
+	# prime the loop
+	my ($besti, $bestj) = ($a_first, $b_first);
+	my $bestn = calculate_match($a->[$a_first], $b->[$b_first]) + 1;
+
+	for (my $i = $a_first; $i < $a_last; $i++) {
+		for (my $j = $b_first; $j < $b_last; $j++) {
+			my $n = calculate_match($a->[$i], $b->[$j]);
+			if ($n < $bestn) {
+				($besti, $bestj, $bestn) = ($i, $j, $n);
+			}
+		}
+	}
+
+	# find the best matches in the lower pairs
+	match_and_highlight_pairs($a, $a_first, $besti, $b, $b_first, $bestj, $queue);
+
+	my ($rm, $add) = highlight_pair($a->[$besti], $b->[$bestj]);
+	print $rm;
+	push @$queue, $add;
+
+	# find the best matches in the higher pairs
+	match_and_highlight_pairs($a, $besti+1, $a_last, $b, $bestj+1, $b_last, $queue);
+}
+
+# A measure of how well the two lines passed match up. The smaller the
+# returned value, the better the match. The current implementation uses
+# a simple heuristic which tries to minimize the overall diff between
+# the two lines considering only their common prefix and suffix.
+sub calculate_match {
+	my @a = split_line(shift);
+	my @b = split_line(shift);
+
+	my ($pa, $pb) = find_common_prefix(\@a, \@b);
+	my ($sa, $sb) = find_common_suffix(\@a, $pa, \@b, $pb);
+
+	return ($sa - $pa) + ($sb - $pb);
+}
+
 sub highlight_pair {
 	my @a = split_line(shift);
 	my @b = split_line(shift);
-- 
2.6.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 4/4] diff-highlight: add maxhunksize config option
  2015-11-03  2:05 [PATCH 0/4] diff-highlight: make a few improvements Jonathan Lebon
                   ` (2 preceding siblings ...)
  2015-11-03  2:05 ` [PATCH 3/4] diff-highlight: match up lines before highlighting Jonathan Lebon
@ 2015-11-03  2:05 ` Jonathan Lebon
  2015-11-03 21:46   ` Jeff King
  2015-11-03 22:07 ` [PATCH 0/4] diff-highlight: make a few improvements Jeff King
  4 siblings, 1 reply; 17+ messages in thread
From: Jonathan Lebon @ 2015-11-03  2:05 UTC (permalink / raw)
  To: git; +Cc: peff, Jonathan Lebon

As the size of the hunk gets bigger, it becomes harder to jump back and
forth between the removed and added lines, and highlighting becomes less
beneficial. We add a new config option called

	diff-highlight.maxhunksize

which controls the maximum size of the hunk allowed for which
highlighting is still performed. The default value is set to 20.

Signed-off-by: Jonathan Lebon <jonathan.lebon@gmail.com>
---
 contrib/diff-highlight/README         | 20 ++++++++++++++++++++
 contrib/diff-highlight/diff-highlight | 16 ++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/contrib/diff-highlight/README b/contrib/diff-highlight/README
index 885ff2f..ed12be1 100644
--- a/contrib/diff-highlight/README
+++ b/contrib/diff-highlight/README
@@ -89,6 +89,26 @@ newHighlight = "black #aaffaa"
 ---------------------------------------------
 
 
+Max Hunk Config
+---------------
+
+By default, diff-highlight will not do any highlighting if either the
+number of removed or added lines is greater than 20. This is because as
+the hunk gets bigger, it becomes harder to jump back and forth between
+the removed and added lines, and highlighting becomes less beneficial.
+
+You can change this default by setting the "diff-highlight.maxhunksize"
+configuration.
+
+Example:
+
+---------------------------------------------
+# Increase the maximum diff-highlight to 30
+[diff-highlight]
+maxhunksize = 30
+---------------------------------------------
+
+
 Bugs
 ----
 
diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight
index 46556fc..a005146 100755
--- a/contrib/diff-highlight/diff-highlight
+++ b/contrib/diff-highlight/diff-highlight
@@ -17,6 +17,8 @@ my @NEW_HIGHLIGHT = (
 	color_config('color.diff-highlight.newreset', $OLD_HIGHLIGHT[2])
 );
 
+my $MAX_HUNK_SIZE = config('diff-highlight.maxhunksize', 20);
+
 my $RESET = "\x1b[m";
 my $COLOR = qr/\x1b\[[0-9;]*m/;
 my $BORING = qr/$COLOR|\s/;
@@ -79,6 +81,13 @@ sub color_config {
 	return length($s) ? $s : $default;
 }
 
+# Also handle our own fallback here to be independent.
+sub config {
+	my ($key, $default) = @_;
+	my $s = `git config --get $key 2>/dev/null`;
+	return length($s) ? $s : $default;
+}
+
 sub show_hunk {
 	my ($a, $b) = @_;
 
@@ -88,6 +97,13 @@ sub show_hunk {
 		return;
 	}
 
+	# Skip highlighting if the hunk gets bigger than the user configured
+	# limit.
+	if (@$a > $MAX_HUNK_SIZE || @$b > $MAX_HUNK_SIZE) {
+		print @$a, @$b;
+		return;
+	}
+
 	my @queue;
 	match_and_highlight_pairs($a, 0, scalar @$a, $b, 0, scalar @$b, \@queue);
 	print @queue;
-- 
2.6.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/4] diff-highlight: add `less -r` to cmd in README
  2015-11-03  2:05 ` [PATCH 1/4] diff-highlight: add `less -r` to cmd in README Jonathan Lebon
@ 2015-11-03  2:41   ` Junio C Hamano
  2015-11-03  3:12     ` Jonathan Lebon
  2015-11-03 21:14   ` Jeff King
  1 sibling, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2015-11-03  2:41 UTC (permalink / raw)
  To: Jonathan Lebon; +Cc: git, peff

Jonathan Lebon <jonathan.lebon@gmail.com> writes:

> As it is, the suggested command for trying out diff-highlight will just
> dump the whole git log output to the terminal. Let's pipe it through
> `less` so users aren't surprised on the first try.

That justifies the "less" part but not your choice of "-r".

I am assuming that you are telling "less" not to show the ANSI
"color" escape sequences using the caret notation with "-r", which
is a very natural and sensible thing to do when using `highlight`.

But if that is the case, you don't want "-r" (raw control chars for
everything).  You would want to say "-R", I think.

Other than that, looks like a sensible thing to do to me.

Thanks.


> Signed-off-by: Jonathan Lebon <jonathan.lebon@gmail.com>
> ---
>  contrib/diff-highlight/README | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/contrib/diff-highlight/README b/contrib/diff-highlight/README
> index 836b97a..bbbfdda 100644
> --- a/contrib/diff-highlight/README
> +++ b/contrib/diff-highlight/README
> @@ -44,9 +44,9 @@ Use
>  
>  You can try out the diff-highlight program with:
>  
> ----------------------------------------------
> -git log -p --color | /path/to/diff-highlight
> ----------------------------------------------
> +------------------------------------------------------
> +git log -p --color | /path/to/diff-highlight | less -r
> +------------------------------------------------------
>  
>  If you want to use it all the time, drop it in your $PATH and put the
>  following in your git configuration:

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/4] diff-highlight: add `less -r` to cmd in README
  2015-11-03  2:41   ` Junio C Hamano
@ 2015-11-03  3:12     ` Jonathan Lebon
  0 siblings, 0 replies; 17+ messages in thread
From: Jonathan Lebon @ 2015-11-03  3:12 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King

On Mon, Nov 2, 2015 at 9:41 PM, Junio C Hamano <gitster@pobox.com> wrote:
>
> Jonathan Lebon <jonathan.lebon@gmail.com> writes:
>
> > As it is, the suggested command for trying out diff-highlight will just
> > dump the whole git log output to the terminal. Let's pipe it through
> > `less` so users aren't surprised on the first try.
>
> That justifies the "less" part but not your choice of "-r".
>
> I am assuming that you are telling "less" not to show the ANSI
> "color" escape sequences using the caret notation with "-r", which
> is a very natural and sensible thing to do when using `highlight`.
>
> But if that is the case, you don't want "-r" (raw control chars for
> everything).  You would want to say "-R", I think.

Ahh thanks, that makes sense. I will update this for v2 tomorrow.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/4] diff-highlight: add `less -r` to cmd in README
  2015-11-03  2:05 ` [PATCH 1/4] diff-highlight: add `less -r` to cmd in README Jonathan Lebon
  2015-11-03  2:41   ` Junio C Hamano
@ 2015-11-03 21:14   ` Jeff King
  2015-11-03 21:51     ` Junio C Hamano
  1 sibling, 1 reply; 17+ messages in thread
From: Jeff King @ 2015-11-03 21:14 UTC (permalink / raw)
  To: Jonathan Lebon; +Cc: git

On Mon, Nov 02, 2015 at 09:05:31PM -0500, Jonathan Lebon wrote:

> As it is, the suggested command for trying out diff-highlight will just
> dump the whole git log output to the terminal. Let's pipe it through
> `less` so users aren't surprised on the first try.

That makes sense. I use "diff-highlight | less" myself, of course. I
think I excluded it from the README in the name of simplicity, but you
are right that it is probably better to give a complete working example.

I agree with Junio that "-R" is a more sensible default, but I don't
think that should be necessary. We already set LESS when running the
pager (and if you are overriding it, you are already in trouble, because
git itself will generate ANSI codes by default).

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/4] diff-highlight: factor out prefix/suffix functions
  2015-11-03  2:05 ` [PATCH 2/4] diff-highlight: factor out prefix/suffix functions Jonathan Lebon
@ 2015-11-03 21:17   ` Jeff King
  0 siblings, 0 replies; 17+ messages in thread
From: Jeff King @ 2015-11-03 21:17 UTC (permalink / raw)
  To: Jonathan Lebon; +Cc: git

On Mon, Nov 02, 2015 at 09:05:32PM -0500, Jonathan Lebon wrote:

> In preparation for the next patch, we factor out the functions for
> finding the common prefix and suffix between two lines.

Looks like a pretty straightforward movement. Your perl adjustments to
use array refs look good, and I think the multi-valued return is a good
interface.

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4] diff-highlight: match up lines before highlighting
  2015-11-03  2:05 ` [PATCH 3/4] diff-highlight: match up lines before highlighting Jonathan Lebon
@ 2015-11-03 21:44   ` Jeff King
  2015-11-03 22:03     ` Jeff King
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff King @ 2015-11-03 21:44 UTC (permalink / raw)
  To: Jonathan Lebon; +Cc: git

On Mon, Nov 02, 2015 at 09:05:33PM -0500, Jonathan Lebon wrote:

> As mentioned in the README, one of the current limitations of
> diff-highlight is that it only calculates highlights when the hunk
> contains the same number of removed lines as added lines.
> 
> A further limitation upon this is that diff-highlight assumes that the
> first line removed matches the first line added, similarly with the
> second, the third, etc... As was demonstrated in the "Bugs" section of
> the README, this poses limitations since that assumption does not always
> give the best result.
> 
> With this patch, we eliminate those limitations by trying to match up
> the removed and added lines before highlighting them. This is done using
> a recursive algorithm.

Hmm. So this seems like a reasonable incremental feature. I do think it
is a hack (piled on the hack that is the existing script :) ), and the
right solution is to actually do an LCS diff for each hunk that crosses
line boundaries.

I made some headway on that this summer as part of this discussion:

  http://thread.gmane.org/gmane.comp.version-control.git/271692

It's long, but there's some good stuff in there.  But I think I came to
the conclusion that this really needs to go inside of diff.c itself
(which will also require some heavy refactoring of the whitespace code;
see the referenced thread for details).

But I'm not opposed to an incremental feature like this in the meantime.
The real test for me is: how does it look in practice? These are all
heuristics, so I don't think we have anything better than eyeballing the
output.

Have you looked at a diff of the old/new output on something like
git.git?

> Note that I did not bother with some common optimizations such as
> memoization since the usual number of removed/added lines in a single
> hunk are small. In practice, I have not felt any lag at all during
> paging.

I'd worry less about normal use, and more about hitting some extreme
corner case. Your algorithm looks roughly quadratic in the size of the
hunk. I guess that is canceled out by the max-hunk-size option in the
next patch, though.

I don't think it's easy to make your algorithm non-quadratic, either, as
it inherently relies on pairwise comparisons (and not, say, generating a
fingerprint of each line and sorting them or something like that).

It might be worth memo-izing find_common_* calls, though, as that is
just repeated work (quadratic or not). It should be easy to time.

> +	# prime the loop
> +	my ($besti, $bestj) = ($a_first, $b_first);
> +	my $bestn = calculate_match($a->[$a_first], $b->[$b_first]) + 1;
> +
> +	for (my $i = $a_first; $i < $a_last; $i++) {
> +		for (my $j = $b_first; $j < $b_last; $j++) {
> +			my $n = calculate_match($a->[$i], $b->[$j]);
> +			if ($n < $bestn) {
> +				($besti, $bestj, $bestn) = ($i, $j, $n);
> +			}
> +		}
> +	}
> +
> +	# find the best matches in the lower pairs
> +	match_and_highlight_pairs($a, $a_first, $besti, $b, $b_first, $bestj, $queue);

Hmm, is this actually O(n^3)? We have a quadratic loop, and then we
recurse for the remainder.

If we have two candidates for which calculate_match returns the same
value, how do we break the tie? It looks like we'll just pick the
lowest-numbered match. I'd think we would want to prefer the one with
the closest line number.  But not having thought too hard about it, I
wonder:

  1. Does it actually make a difference which line we pick? The
     interesting bit is how much we highlight, so in that sense do we
     care only about the prefix and suffix sizes?

  2. Do you end up picking the closest line with your algorithm anyway,
     as will tend to match as we go, skipping over only lines that will
     likely have no match?

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/4] diff-highlight: add maxhunksize config option
  2015-11-03  2:05 ` [PATCH 4/4] diff-highlight: add maxhunksize config option Jonathan Lebon
@ 2015-11-03 21:46   ` Jeff King
  0 siblings, 0 replies; 17+ messages in thread
From: Jeff King @ 2015-11-03 21:46 UTC (permalink / raw)
  To: Jonathan Lebon; +Cc: git

On Mon, Nov 02, 2015 at 09:05:34PM -0500, Jonathan Lebon wrote:

> As the size of the hunk gets bigger, it becomes harder to jump back and
> forth between the removed and added lines, and highlighting becomes less
> beneficial. We add a new config option called
> 
> 	diff-highlight.maxhunksize
> 
> which controls the maximum size of the hunk allowed for which
> highlighting is still performed. The default value is set to 20.

I think this makes sense. I'm pleased it is here to limit the problem
space for patch 3, but I think it has value on its own as a heuristic.

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/4] diff-highlight: add `less -r` to cmd in README
  2015-11-03 21:14   ` Jeff King
@ 2015-11-03 21:51     ` Junio C Hamano
  2015-11-03 21:55       ` Jeff King
  0 siblings, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2015-11-03 21:51 UTC (permalink / raw)
  To: Jeff King; +Cc: Jonathan Lebon, git

Jeff King <peff@peff.net> writes:

> I agree with Junio that "-R" is a more sensible default, but I don't
> think that should be necessary. We already set LESS when running the
> pager (and if you are overriding it, you are already in trouble, because
> git itself will generate ANSI codes by default).

I do not quite understand this part.  Inside git itself when we
spawn the pager we export LESS with a sensible default, exactly to
help users who do not have LESS set and exported.  This one however
is not spawned by git but the end-user piping output of diff-hilite.

I agree that if the user has LESS exported without -R that would not
be pretty while viewing coloured output.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/4] diff-highlight: add `less -r` to cmd in README
  2015-11-03 21:51     ` Junio C Hamano
@ 2015-11-03 21:55       ` Jeff King
  0 siblings, 0 replies; 17+ messages in thread
From: Jeff King @ 2015-11-03 21:55 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Lebon, git

On Tue, Nov 03, 2015 at 01:51:54PM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > I agree with Junio that "-R" is a more sensible default, but I don't
> > think that should be necessary. We already set LESS when running the
> > pager (and if you are overriding it, you are already in trouble, because
> > git itself will generate ANSI codes by default).
> 
> I do not quite understand this part.  Inside git itself when we
> spawn the pager we export LESS with a sensible default, exactly to
> help users who do not have LESS set and exported.  This one however
> is not spawned by git but the end-user piping output of diff-hilite.

Oh, you're right. I mistook it as instructions for what to put in
pager.log, but that is clearly not what this is. I agree that telling
the user to use "-R" here is a good idea.

And indeed, the instructions just below the hunk in question describe
setting it properly in your config (with less, and no "-R"). So I was
simply confused and not reading carefully enough.

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4] diff-highlight: match up lines before highlighting
  2015-11-03 21:44   ` Jeff King
@ 2015-11-03 22:03     ` Jeff King
  2015-11-17  5:18       ` Jonathan Lebon
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff King @ 2015-11-03 22:03 UTC (permalink / raw)
  To: Jonathan Lebon; +Cc: git

On Tue, Nov 03, 2015 at 04:44:16PM -0500, Jeff King wrote:

> Have you looked at a diff of the old/new output on something like
> git.git?

This should be pretty easy to do (and time). I tried:

  git log --oneline --color -p >base
  time perl highlight.old <base >old
  time perl highlight.new <base >new
  diff -c old new | less

where the "highlight.*" scripts are the versions at master, and master
with your series applied.

Your is _much_ slower. I get:

  real    0m25.538s
  user    0m25.420s
  sys     0m0.120s

for the old versus:

  real    2m3.580s
  user    2m3.548s
  sys     0m0.156s

for your series. In an interactive setting, the latency may not be that
noticeable, but if you are digging far into history (e.g., "git log -p",
then using "/" in less to search for a commit or some test), I suspect
it would be very noticeable.

I was thinking there was some low-hanging fruit in memoizing the
calculations, but even the prefix/suffix computation is pairwise. I'm
not really sure how to make this much faster.

As for the output itself, the diff between the two looks promising. The
first several cases I looked at ar strict improvements. Some of them are
kind of weird, especially in English text. For example, see the RelNotes
update in 2635c2b. The base diff is:

    * "git rebase -i" had a minor regression recently, which stopped
    considering a line that begins with an indented '#' in its insn
-   sheet not a comment, which is now fixed.
-   (merge 1db168e gr/rebase-i-drop-warn later to maint).
+   sheet not a comment. Further, the code was still too picky on
+   Windows where CRLF left by the editor is turned into a trailing CR
+   on the line read via the "read" built-in command of bash.  Both of
+   these issues are now fixed.
+   (merge 39743cf gr/rebase-i-drop-warn later to maint).

Before we highlighted nothing, and now we hone in on "now fixed". Which
is _sort of_ a match, but really the whole sentence structure has
changed. If this is the worst regression, I can certainly live with it.
And even a proper LCS diff would probably end up making spaghetti of a
text change like this.

In the other thread I mentioned earlier, the solution I cooked up was
dropping highlighting entirely for hunks over a certain percentage of
highlighting. I wonder if we could do something similar here (e.g.,
don't match lines where more than 50% of the line would be highlighted).

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/4] diff-highlight: make a few improvements
  2015-11-03  2:05 [PATCH 0/4] diff-highlight: make a few improvements Jonathan Lebon
                   ` (3 preceding siblings ...)
  2015-11-03  2:05 ` [PATCH 4/4] diff-highlight: add maxhunksize config option Jonathan Lebon
@ 2015-11-03 22:07 ` Jeff King
  4 siblings, 0 replies; 17+ messages in thread
From: Jeff King @ 2015-11-03 22:07 UTC (permalink / raw)
  To: Jonathan Lebon; +Cc: git

On Mon, Nov 02, 2015 at 09:05:30PM -0500, Jonathan Lebon wrote:

> These patches bring a few improvements to the contrib/diff-highlight
> Perl script. The major improvement is done in patch 3/4, which improves
> diff-highlighting accuracy by implementing a recursive line matching
> algorithm.
> 
> Please note that I have limited experience with Perl, so there may be
> better ways to do things. (Let me know if that is the case!)

I forgot to mention in the rest of my review: thank you for working on
this, and for making a nice clean series. It was very pleasant to read.

I think the output overall is nicer. I am concerned with the performance
implications. I see this as a stopgap until we have a true diff inside
the hunks, but if we can overcome the performance implications, I think
it's worth applying in the meantime.

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4] diff-highlight: match up lines before highlighting
  2015-11-03 22:03     ` Jeff King
@ 2015-11-17  5:18       ` Jonathan Lebon
  2015-11-17 22:50         ` Jeff King
  0 siblings, 1 reply; 17+ messages in thread
From: Jonathan Lebon @ 2015-11-17  5:18 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Tue, Nov 3, 2015 at 5:03 PM, Jeff King <peff@peff.net> wrote:
> Your is _much_ slower. I get:
>
>   real    0m25.538s
>   user    0m25.420s
>   sys     0m0.120s
>
> for the old versus:
>
>   real    2m3.580s
>   user    2m3.548s
>   sys     0m0.156s

Thanks for investigating and trying it out. I got the same results
here as well.

> for your series. In an interactive setting, the latency may not be that
> noticeable, but if you are digging far into history (e.g., "git log -p",
> then using "/" in less to search for a commit or some test), I suspect
> it would be very noticeable.

Agreed.

> I was thinking there was some low-hanging fruit in memoizing the
> calculations, but even the prefix/suffix computation is pairwise. I'm
> not really sure how to make this much faster.

I gave memoization a try to see if it could improve the situation. I also
lowered maxhunksize to 10. Doing `git log -p` on git.git went from 2m31
to 2m11. So I think it would require a whole other approach overall.

> As for the output itself, the diff between the two looks promising. The
> first several cases I looked at ar strict improvements. Some of them are
> kind of weird, especially in English text.

Yes, I'm very happy with the improvements and run with these patches all
the time for now.

> In the other thread I mentioned earlier, the solution I cooked up was
> dropping highlighting entirely for hunks over a certain percentage of
> highlighting. I wonder if we could do something similar here (e.g.,
> don't match lines where more than 50% of the line would be highlighted).

I looked over but haven't tested the patches in the other thread yet. But
overall, the LCS definitely looks promising. I'm hoping to find some time
to have a more serious go at it and maybe pick it up where you left off.

>
> -Peff

Thanks again for reviewing these patches and apologies for the delayed
reply.

Jonathan

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4] diff-highlight: match up lines before highlighting
  2015-11-17  5:18       ` Jonathan Lebon
@ 2015-11-17 22:50         ` Jeff King
  0 siblings, 0 replies; 17+ messages in thread
From: Jeff King @ 2015-11-17 22:50 UTC (permalink / raw)
  To: Jonathan Lebon; +Cc: git

On Tue, Nov 17, 2015 at 12:18:28AM -0500, Jonathan Lebon wrote:

> > In the other thread I mentioned earlier, the solution I cooked up was
> > dropping highlighting entirely for hunks over a certain percentage of
> > highlighting. I wonder if we could do something similar here (e.g.,
> > don't match lines where more than 50% of the line would be highlighted).
> 
> I looked over but haven't tested the patches in the other thread yet. But
> overall, the LCS definitely looks promising. I'm hoping to find some time
> to have a more serious go at it and maybe pick it up where you left off.
> [...]
> Thanks again for reviewing these patches and apologies for the delayed
> reply.

Great! I look forward to seeing what you produce. Let me know if you
want me to clarify anything from that earlier discussion.

And don't worry about delayed replies; it's part of the normal workflow
around here.

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2015-11-17 22:50 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-11-03  2:05 [PATCH 0/4] diff-highlight: make a few improvements Jonathan Lebon
2015-11-03  2:05 ` [PATCH 1/4] diff-highlight: add `less -r` to cmd in README Jonathan Lebon
2015-11-03  2:41   ` Junio C Hamano
2015-11-03  3:12     ` Jonathan Lebon
2015-11-03 21:14   ` Jeff King
2015-11-03 21:51     ` Junio C Hamano
2015-11-03 21:55       ` Jeff King
2015-11-03  2:05 ` [PATCH 2/4] diff-highlight: factor out prefix/suffix functions Jonathan Lebon
2015-11-03 21:17   ` Jeff King
2015-11-03  2:05 ` [PATCH 3/4] diff-highlight: match up lines before highlighting Jonathan Lebon
2015-11-03 21:44   ` Jeff King
2015-11-03 22:03     ` Jeff King
2015-11-17  5:18       ` Jonathan Lebon
2015-11-17 22:50         ` Jeff King
2015-11-03  2:05 ` [PATCH 4/4] diff-highlight: add maxhunksize config option Jonathan Lebon
2015-11-03 21:46   ` Jeff King
2015-11-03 22:07 ` [PATCH 0/4] diff-highlight: make a few improvements Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).