Understanding and improving --word-diff

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Understanding and improving --word-diff
@ 2010-11-08 15:16 Matthijs Kooijman
  2010-11-08 15:41 ` Matthieu Moy
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Matthijs Kooijman @ 2010-11-08 15:16 UTC (permalink / raw)
  To: git; +Cc: Thomas Rast, Johannes Schindelin

[-- Attachment #1: Type: text/plain, Size: 7104 bytes --]

Hi folks,

I recently discovered --word-diff (or rather, --color-words and found
--word-diff when I started to hack on the git master version) and I had
hoped it would make the unified diffs generated by git-diff more
readable.

More specifically, I had expected to get a normal unified diff, with
colouring added to highlight the changes within the normal - and + lines
(so you don't have to review the entire changed line to see that just a
single word or character has changed).

E.g., I would like to see:

-a <r>b</r> c
+a <g>x</g> c

Unfortunately, all --word-diff types currently departs from line-based -
and + lines and show the new version of the file with the changed words
(both old and new versions) shown inline, marked with coloring or
{- ...  -} kind of syntax. E.g., with --word-diff=color, the above would
look like:

a <r>b</r><g>x</g> c

Personally, I think that the first example above is easier to read than
the second one (at least for diffs of code).

I was planning to let this mail be accompanied with a patch, so I've
started hacking on this feature already. However, halfway through some
cleanups and a prototype implementation of the above (breaking some of
the other --word-diff formats in the process), I found that the current
generalization of the different styles as stored in diff_words_styles[]
does not apply cleanly enough to my intended output format. While trying
to extend this generalization to something that would fit, I found that
I don't actually understand the rationale behind --word-diff and the
formats well enough to find a proper implemenation (see the link at the
bottom of this email for the unfinished code I hacked up until now).

So, here's some observations and questions about how --word-diff works
or should work. Comments are welcome, both in general terms as well as
in terms of the word-diff implementation (I know my way around there by
now).

Intended use
------------
First of all, it seems that the main intended use of --word-diff is for
LaTeX or HTML documents or similar, where blocks of running text might
be hard-wrapped (and thus rewrapped after a small change). In these
cases, a small change in wording could cause a lot of whitespace to
shift, resulting in a big normal diff. The current word-diff
implementation therefore simply does not show the whitespace (or rather,
non-word) changes, since they're usually not relevant to LaTeX anyway.

Is this indeed the main usecase, or are there others I'm missing?

Inexact output
--------------
Secondly, the --word-diff output currently never displays any changes to
the non-word (whitespace) parts of a file. This makes sense for the
LaTeX case, but sometimes you might want to get exact diff output
instead. At first glance this seems possible by specifiying a word-regex
of "." or something similar (i.e., make sure that the word regex matches
everything). But this is problematic for newlines. The documentation
states that stuff gets silently ignored if a newline ends up inside a
word. For the --word-diff=color format, this is probably a fixed
limitation of the otput format: you can't give a color to a newline (or
a space, for that matter).

Including a newline inside a {- ... -} block should not be a problem
with the --word-diff=plain format, and something similar can be argued
for the porcelain format.

An alternative approach would be to add a --word-diff-exact flag, which
would cause the whitespace between to matches of the word regex to be
treated as a word as well and have it included in the generate
word-diff.

This still leaves an implementation problem: To generate the word-diff,
the current code looks at one patch hunk at a time, collecting all the
plus and minus lines. It then splits those lines into words and
generates two new "files" containing one word per line. It then applies
a diff to this new document to get the word-diff.

When a word would contain a newline, this would effectively mean the
word would be split into two words for the word-diff, which will
probably screw up the output. An obvious solution would be to use some
escape sequence (e.g. \n) for a newline, though that might get messy and
inefficient. An alternative that seems feasable is to use the empty word
(i.e., an empty line in the word-diff "files") to mean a newline.

This would mean that every newline always breaks a word into two,
regardless of what the word regex is set to (but I guess that makes
sense anyway?). I also think this would allow complete diff output wrt
whitespace and newlines, for output formats that support it: plain,
(modified) porcelain and my proposed format.

Porcelain format
----------------
Lastly, the "porcelain" word-diff format seems a bit weird to me. Is
the format specified somewhere, or are there any programs that use it
currently? I couldn't find any users inside the git.git tree itself?

Looking at the format itself, it's a bit unclear to me what the ~ lines
mean exactly. Commit 882749, which introduced the format says the mean
"newlines in the input", but I'm not sure if this means the old file,
new file or both.

In fact, it seems that this uncertainty makes the porcelain-format
ambiguous wrt newlines. For example, these two diff hunks:

@@ -1,3 +1,2 @@
 a
-b
 c

@@ -1,3 +1,3 @@
 a
-b
+
 c

both look the same in porcelain format, except for the hunk header.

@@ -1,3 +1,2 @@
 a
~
-b
~
 c
~
@@ -1,3 +1,3 @@
 a
~
-b
~
 c
~

This is somewhat expected, of course, since the --word-diff formats are
documented to show only changes to words, not to non-words/whitespace.
So I guess it is expected that the output is ambigious wrt whitespace,
but if so, what is the use of this porcelain format? Wouldn't it be make
a lot more sense to make the format unambiguous and make it do
word-based diff at the same time? I think this should be possible
because of the explicit notation used for the newline.

For example, Specifying the ~ lines to mean a newline in the old, new or
both files depending on the previous +, - or space prefixed line is
probably enough for this. By generating empty +, - or space prefixed
lines when needed, every occurence of ~ could be disambiguated.

For example, the above two diff hunks would then become the following.
The only difference is the near-empty line (just a space prefix) after
-b in the second hunk.

@@ -1,3 +1,2 @@
 a
~
-b
~
 c
~
@@ -1,3 +1,3 @@
 a
~
-b

~
 c
~

So, these are some thoughts I've had while hacking on the code. As said,
suggestions are welcome. I'd like my hacking to result in some useful
patches, but right now I'm unsure what direction(s) I should be
thinking/working in.

In case you're interested in the hacking I've done so far, I've put it
up here:

http://git.stderr.nl/gitweb?p=matthijs/upstream/git.git;a=shortlog;h=refs/heads/word-diff

Most of it is broken or not properly tested, but it gets an idea what
kinds of cleanup I've been doing.

Gr.

Matthijs

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Understanding and improving --word-diff
  2010-11-08 15:16 Understanding and improving --word-diff Matthijs Kooijman
@ 2010-11-08 15:41 ` Matthieu Moy
  2010-11-08 15:49   ` Matthijs Kooijman
  2010-11-08 16:33   ` Thomas Rast
  2010-11-08 17:35 ` Wincent Colaiuta
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 10+ messages in thread
From: Matthieu Moy @ 2010-11-08 15:41 UTC (permalink / raw)
  To: git; +Cc: Thomas Rast, Johannes Schindelin

Matthijs Kooijman <matthijs@stdin.nl> writes:

> Lastly, the "porcelain" word-diff format seems a bit weird to me. Is
> the format specified somewhere, or are there any programs that use it
> currently? I couldn't find any users inside the git.git tree itself?

There's a pending patch for "gitk --color-words". Actually, the
porcelain format was precisely implemented to allow this patch to
exist IIRC.

-- 
Matthieu Moy
http://www-verimag.imag.fr/~moy/

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Understanding and improving --word-diff
  2010-11-08 15:41 ` Matthieu Moy
@ 2010-11-08 15:49   ` Matthijs Kooijman
  2010-11-08 16:33   ` Thomas Rast
  1 sibling, 0 replies; 10+ messages in thread
From: Matthijs Kooijman @ 2010-11-08 15:49 UTC (permalink / raw)
  To: Matthieu Moy; +Cc: git, Thomas Rast, Johannes Schindelin

[-- Attachment #1: Type: text/plain, Size: 428 bytes --]

Hi Matthieu,

> There's a pending patch for "gitk --color-words". Actually, the
> porcelain format was precisely implemented to allow this patch to
> exist IIRC.
I think you mean this patch?

[PATCH v4' 2/2] gitk: add the equivalent of diff --color-words
https://kerneltrap.org/mailarchive/git/2010/4/19/28707

I had seen it around, but hadn't realized it wasn't in master yet. I'll
have a closer look at it.

Thanks.

Matthijs

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Understanding and improving --word-diff
  2010-11-08 15:41 ` Matthieu Moy
  2010-11-08 15:49   ` Matthijs Kooijman
@ 2010-11-08 16:33   ` Thomas Rast
  1 sibling, 0 replies; 10+ messages in thread
From: Thomas Rast @ 2010-11-08 16:33 UTC (permalink / raw)
  To: Matthijs Kooijman; +Cc: Matthieu Moy, git, Johannes Schindelin

Matthieu Moy wrote:
> Matthijs Kooijman <matthijs@stdin.nl> writes:
> 
> > Lastly, the "porcelain" word-diff format seems a bit weird to me. Is
> > the format specified somewhere, or are there any programs that use it
> > currently? I couldn't find any users inside the git.git tree itself?
> 
> There's a pending patch for "gitk --color-words". Actually, the
> porcelain format was precisely implemented to allow this patch to
> exist IIRC.

Also, my intent was to specify the format fully under its option
description.  It's rather simple, but if you think the
explanation/definition is inadequate, please provide a patch.  We'd
really want the writers of consumers (GUIs and such) to be very clear
about how it is formatted.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Understanding and improving --word-diff
  2010-11-08 15:16 Understanding and improving --word-diff Matthijs Kooijman
  2010-11-08 15:41 ` Matthieu Moy
@ 2010-11-08 17:35 ` Wincent Colaiuta
  2010-11-08 19:22 ` Thomas Rast
  2010-11-09 22:01 ` Jeff King
  3 siblings, 0 replies; 10+ messages in thread
From: Wincent Colaiuta @ 2010-11-08 17:35 UTC (permalink / raw)
  To: Matthijs Kooijman; +Cc: git, Thomas Rast, Johannes Schindelin

On 08/11/2010, at 16:16, Matthijs Kooijman wrote:

> Hi folks,
> 
> I recently discovered --word-diff (or rather, --color-words and found
> --word-diff when I started to hack on the git master version) and I had
> hoped it would make the unified diffs generated by git-diff more
> readable.

[snip]

> Inexact output
> --------------
> Secondly, the --word-diff output currently never displays any changes to
> the non-word (whitespace) parts of a file. This makes sense for the
> LaTeX case, but sometimes you might want to get exact diff output
> instead. At first glance this seems possible by specifiying a word-regex
> of "." or something similar (i.e., make sure that the word regex matches
> everything). But this is problematic for newlines. The documentation
> states that stuff gets silently ignored if a newline ends up inside a
> word. For the --word-diff=color format, this is probably a fixed
> limitation of the otput format: you can't give a color to a newline (or
> a space, for that matter).

[snip]

> Porcelain format
> ----------------
> Lastly, the "porcelain" word-diff format seems a bit weird to me. Is
> the format specified somewhere, or are there any programs that use it
> currently? I couldn't find any users inside the git.git tree itself?
> 
> Looking at the format itself, it's a bit unclear to me what the ~ lines
> mean exactly. Commit 882749, which introduced the format says the mean
> "newlines in the input", but I'm not sure if this means the old file,
> new file or both.

I posted along similar lines back in September:

  http://article.gmane.org/gmane.comp.version-control.git/156375

I found I couldn't really use the porcelain format due to information loss and ended up having to roll my own, parse the normal diff output and using that.

Cheers,
Wincent

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Understanding and improving --word-diff
  2010-11-08 15:16 Understanding and improving --word-diff Matthijs Kooijman
  2010-11-08 15:41 ` Matthieu Moy
  2010-11-08 17:35 ` Wincent Colaiuta
@ 2010-11-08 19:22 ` Thomas Rast
  2010-11-09 22:01 ` Jeff King
  3 siblings, 0 replies; 10+ messages in thread
From: Thomas Rast @ 2010-11-08 19:22 UTC (permalink / raw)
  To: Matthijs Kooijman
  Cc: git, Johannes Schindelin, Matthieu Moy, Wincent Colaiuta

Matthijs Kooijman wrote:
> This is somewhat expected, of course, since the --word-diff formats are
> documented to show only changes to words, not to non-words/whitespace.
> So I guess it is expected that the output is ambigious wrt whitespace,
> but if so, what is the use of this porcelain format? Wouldn't it be make
> a lot more sense to make the format unambiguous and make it do
> word-based diff at the same time? I think this should be possible
> because of the explicit notation used for the newline.

It's not just the newlines.  You will note that when diffing e.g.

  a x  c
vs.
  a  b c

then because the word-diff engine doesn't bother with the non-word
parts, you will end up with (using ordinary --word-diff)

  a  [-x-]{+b+} c

or using the porcelain format

  $ git diff --no-index --word-diff=porcelain a b | cat -E
  diff --git 1/a 2/b$
  index 4f47486..b040ae0 100644$
  --- 1/a$
  +++ 2/b$
  @@ -1 +1 @@$
   a  $
  -x$
  +b$
    c$
  ~$

Note how the second space after 'x' disappeared.  The case that
Wincent explains in the email he linked (thanks; I didn't even see it
the first time around) shows a more drastic example.  So the logic
that inserts the space again is too simplistic.

What it does not is something like (haven't checked again) always
insert the postimage space that went with the preceding word.  What it
*would* have to do is actually compute differences in the space, while
at the same time ignoring them for the purposes of the LCS algorithm.
If it weren't for the customizable word-regex, it might be enough to
put the spaces in the words again but enable the internal equivalent
of -b.

> Looking at the format itself, it's a bit unclear to me what the ~ lines
> mean exactly. Commit 882749, which introduced the format says the mean
> "newlines in the input", but I'm not sure if this means the old file,
> new file or both.

As Matthieu already said, the history of --word-diff=porcelain is
just: I needed a way to pipe word-diff output to gitk for coloring,
without any messing around with the color strings.  The ~ is the
minimal required feature to at least let git communicate *some*
linebreaks.

You are probably right in that while the format leaves open the
possibility of showing addition and removal of space (although the
current engine does not really attempt to compute it), I did not think
far enough to let it show addition and removal of whitespace.

> For example, Specifying the ~ lines to mean a newline in the old, new or
> both files depending on the previous +, - or space prefixed line is
> probably enough for this. By generating empty +, - or space prefixed
> lines when needed, every occurence of ~ could be disambiguated.

AFAICS that breaks down if you couple a newline-change-aware consumer
with a non-aware producer (i.e., a new frontend with an old git),
since there is currently no guarantee that the line before a ~ is
context.

So you would have to introduce a new format (porcelain2?) anyway, and
if you're already going that route, then for simplicity I'd rather
have something like ~+ (or so) instead of requiring the parser to
track state.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Understanding and improving --word-diff
  2010-11-08 15:16 Understanding and improving --word-diff Matthijs Kooijman
                   ` (2 preceding siblings ...)
  2010-11-08 19:22 ` Thomas Rast
@ 2010-11-09 22:01 ` Jeff King
  2010-11-10  0:05   ` Johannes Schindelin
  2010-11-18  6:40   ` Jeff King
  3 siblings, 2 replies; 10+ messages in thread
From: Jeff King @ 2010-11-09 22:01 UTC (permalink / raw)
  To: git, Thomas Rast, Johannes Schindelin

On Mon, Nov 08, 2010 at 04:16:01PM +0100, Matthijs Kooijman wrote:

> E.g., I would like to see:
> 
> -a <r>b</r> c
> +a <g>x</g> c
> 
> Unfortunately, all --word-diff types currently departs from line-based -
> and + lines and show the new version of the file with the changed words
> (both old and new versions) shown inline, marked with coloring or
> {- ...  -} kind of syntax. E.g., with --word-diff=color, the above would
> look like:
> 
> a <r>b</r><g>x</g> c
> 
> Personally, I think that the first example above is easier to read than
> the second one (at least for diffs of code).

Yeah, as you figured out, word diff is really about formats that aren't
line oriented. I have seen other systems do what you are asking for, but
I'm not sure they are nearly as complex as a true word diff. In
particular, I have noticed two simplifications:

  1. Only highlight between two lines that are in part of a hunk by
     themselves. In other words, highlight on this:

        context
       -foo bar
       +foo baz
        context

     but not this:

        context
       -foo bar
       -another line
       +foo baz
       +another changed line
        context

     Even though you could perhaps match up pairs of lines (in this
     case, the first removed line obviously corresponds with the first
     added line, etc), there are a ton of cases where it would be
     uselessly distracting, as the lines simply aren't related to each
     other (of course, there are one-liner changes where that is the
     case, too, but they are hopefully less common, and since it is only
     a single pair of lines, there is no guessing about how they were
     matched up).

  2. Don't do real LCS matching between the pairs. Just find the common
     prefix and suffix, and highlight everything in the middle. This
     would yield something like:

       -a <r>b c d</r> e
       +a <g>x c z</g> e

     when you could in theory produce:

       -a <r>b</r> c <r>d</r> e
       -a <g>x</g> c <g>z</g> e

     but I suspect in practice it covers many situations.

For fun, I hacked up the perl script below which will highlight (using
ANSI reverse-codes!) a diff using both of these simplifications. It
looks for diff hunks on stdin, so you can do:

  git log | perl /path/to/script.pl

I'm curious to see how the output looks on some real-world diffs. I
admit to not having given this problem much thought, but instead just
sat down and started hacking. :)

The script tries to handle color in the input sanely, so you can see
your highlighting in color. But there are two caveats:

  1. You have to pass --color to git explicitly, since piping stdout
     turns off git's color autodetection.

  2. There is a buglet in git's color emitting code. For added lines, it
     produces "<g>+</g><g>rest of line</g>" which is annoying to parse.
     It is fixed by the patch below:

diff --git a/diff.c b/diff.c
index b1b0d3d..0efcdb7 100644
--- a/diff.c
+++ b/diff.c
@@ -363,9 +363,9 @@ static void emit_add_line(const char *reset,
 		emit_line_0(ecbdata->opt, ws, reset, '+', line, len);
 	else {
 		/* Emit just the prefix, then the rest. */
-		emit_line_0(ecbdata->opt, set, reset, '+', "", 0);
+		emit_line_0(ecbdata->opt, set, "", '+', "", 0);
 		ws_check_emit(line, len, ecbdata->ws_rule,
-			      ecbdata->opt->file, set, reset, ws);
+			      ecbdata->opt->file, "", reset, ws);
 	}
 }
 

This patch, btw, is much easier to read when piped through my script. :)

Here's the script:

-- >8 --
#!/usr/bin/perl

use strict;
my $COLOR = qr/\x1b\[[0-9;]*m/;

my @window;

while (<>) {
  chomp;
  s/^($COLOR)*//; my $prefix = $&;
  s/($COLOR)*$//; my $suffix = $&;

  push @window, [$prefix, $_, $suffix];

  if ($window[0] && $window[0]->[1] =~ /^(\@| )/ &&
      $window[1] && $window[1]->[1] =~ /^-/ &&
      $window[2] && $window[2]->[1] =~ /^\+/ &&
      $window[3] && $window[3]->[1] !~ /^\+/) {
    show_line(shift @window);
    show_pair(shift @window, shift @window);
  }

  if (@window >= 4) {
    show_line(shift @window);
  }
}

if (@window == 3 &&
    $window[0] && $window[0]->[1] =~ /^(\@| )/ &&
    $window[1] && $window[1]->[1] =~ /^-/ &&
    $window[2] && $window[2]->[1] =~ /^\+/) {
  show_line(shift @window);
  show_pair(shift @window, shift @window);
}
while (@window) {
  show_line(shift @window);
}

exit 0;

sub show_line {
  my $line = shift;
  print @$line, "\n";
}

sub show_pair {
  my ($from, $to) = @_;
  my @from = split //, $from->[1];
  my @to = split //, $to->[1];

  my $prefix = 1;
  while ($from[$prefix] eq $to[$prefix]) {
    $prefix++;
  }
  my $suffix = 0;
  while ($from[$#from-$suffix] eq $to[$#to-$suffix]) {
    $suffix++;
  }

  print $from->[0], highlight($from->[1], $prefix, $suffix), $from->[2], "\n";
  print $to->[0], highlight($to->[1], $prefix, $suffix), $to->[2], "\n";
}

sub highlight {
  my ($line, $prefix, $suffix) = @_;
  my $end = length($line) - $suffix;
  return join('',
    substr($line, 0, $prefix),
    "\x1b[7m",
    substr($line, $prefix, $end - $prefix),
    "\x1b[27m",
    substr($line, $end)
  );
}

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: Understanding and improving --word-diff
  2010-11-09 22:01 ` Jeff King
@ 2010-11-10  0:05   ` Johannes Schindelin
  2010-11-10  4:17     ` Jeff King
  2010-11-18  6:40   ` Jeff King
  1 sibling, 1 reply; 10+ messages in thread
From: Johannes Schindelin @ 2010-11-10  0:05 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Thomas Rast

Hi,

On Tue, 9 Nov 2010, Jeff King wrote:

> On Mon, Nov 08, 2010 at 04:16:01PM +0100, Matthijs Kooijman wrote:
> 
> > E.g., I would like to see:
> > 
> > -a <r>b</r> c
> > +a <g>x</g> c
> > 
> > Unfortunately, all --word-diff types currently departs from line-based 
> > - and + lines and show the new version of the file with the changed 
> > words (both old and new versions) shown inline, marked with coloring 
> > or {- ...  -} kind of syntax. E.g., with --word-diff=color, the above 
> > would look like:
> > 
> > a <r>b</r><g>x</g> c
> > 
> > Personally, I think that the first example above is easier to read 
> > than the second one (at least for diffs of code).
> 
> Yeah, as you figured out, word diff is really about formats that aren't 
> line oriented.

Not really correct. While the _current_ way to show word diffs is 
imitating GNU wdiff, the internal data structure does allow for more 
sophisticated output.

Ciao,
Dscho

P.S.: Peff, I hope you're fine with me responding to the first sentence 
only. After all, you know that my attention span is 7.2 seconds.

Some people taught me to put a smiley after such a statement, so: ;-)

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Understanding and improving --word-diff
  2010-11-10  0:05   ` Johannes Schindelin
@ 2010-11-10  4:17     ` Jeff King
  0 siblings, 0 replies; 10+ messages in thread
From: Jeff King @ 2010-11-10  4:17 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Thomas Rast

On Wed, Nov 10, 2010 at 01:05:56AM +0100, Johannes Schindelin wrote:

> > Yeah, as you figured out, word diff is really about formats that aren't 
> > line oriented.
> 
> Not really correct. While the _current_ way to show word diffs is 
> imitating GNU wdiff, the internal data structure does allow for more 
> sophisticated output.

Fair enough. I was basing my statement on the output. If somebody would
like to demonstrate how the code can be used to get line-oriented diffs
with highlighting, I would be very happy to eat my words. :)

> P.S.: Peff, I hope you're fine with me responding to the first sentence 
> only. After all, you know that my attention span is 7.2 seconds.

Heh. I am happy you responded at all. You are very quiet these days. :)

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Understanding and improving --word-diff
  2010-11-09 22:01 ` Jeff King
  2010-11-10  0:05   ` Johannes Schindelin
@ 2010-11-18  6:40   ` Jeff King
  1 sibling, 0 replies; 10+ messages in thread
From: Jeff King @ 2010-11-18  6:40 UTC (permalink / raw)
  To: git; +Cc: Thomas Rast, Johannes Schindelin

On Tue, Nov 09, 2010 at 05:01:36PM -0500, Jeff King wrote:

>   2. There is a buglet in git's color emitting code. For added lines, it
>      produces "<g>+</g><g>rest of line</g>" which is annoying to parse.
>      It is fixed by the patch below:
> 
> diff --git a/diff.c b/diff.c
> index b1b0d3d..0efcdb7 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -363,9 +363,9 @@ static void emit_add_line(const char *reset,
>  		emit_line_0(ecbdata->opt, ws, reset, '+', line, len);
>  	else {
>  		/* Emit just the prefix, then the rest. */
> -		emit_line_0(ecbdata->opt, set, reset, '+', "", 0);
> +		emit_line_0(ecbdata->opt, set, "", '+', "", 0);
>  		ws_check_emit(line, len, ecbdata->ws_rule,
> -			      ecbdata->opt->file, set, reset, ws);
> +			      ecbdata->opt->file, "", reset, ws);
>  	}
>  }

FYI, I looked further at this patch, and it should not be used. There
are cases where ws_check_emit will output a reset (e.g., because it has
colored some whitespace), and needs to be able to set the proper color
again. So while I think the intent of this patch is fine (to avoid
duplicate colorizing in the output), the actual implementation would
need to be much more complex.

Instead, here's an updated version of my color highlighting script that
handles this case.

With the patch I posted earlier today:

  [PATCH 2/2] allow command-specific pagers in pager.<cmd>
  http://article.gmane.org/gmane.comp.version-control.git/161624

you can do:

  git config pager.log '/path/to/diff-highlight.pl | less'

and get automagic coloring. Still, this is a bit of a hack, and there
are some funny corner cases, so something that used the internal diff
machinery would be much nicer.

-Peff

-- >8 --
#!/usr/bin/perl

use strict;
my $COLOR = qr/\x1b\[[0-9;]*m/;
my $RESET = "\x1b[0m";
my $RED = "\x1b[31m";
my $GREEN = "\x1b[32m";
my $REVERSE = "\x1b[7m";
my $UNREVERSE = "\x1b[27m";

my @window;

while (<>) {
  chomp;
  my $plain = $_;
  $plain =~ s/$COLOR//g;

  push @window, [$_, $plain];

  if ($window[0] && $window[0]->[1] =~ /^(\@| )/ &&
      $window[1] && $window[1]->[1] =~ /^-/ &&
      $window[2] && $window[2]->[1] =~ /^\+/ &&
      $window[3] && $window[3]->[1] !~ /^\+/) {
    show_line(shift @window);
    show_pair(shift @window, shift @window);
  }

  if (@window >= 4) {
    show_line(shift @window);
  }
}

if (@window == 3 &&
    $window[0] && $window[0]->[1] =~ /^(\@| )/ &&
    $window[1] && $window[1]->[1] =~ /^-/ &&
    $window[2] && $window[2]->[1] =~ /^\+/) {
  show_line(shift @window);
  show_pair(shift @window, shift @window);
}
while (@window) {
  show_line(shift @window);
}

exit 0;

sub show_line {
  my $line = shift;
  print $line->[0], "\n";
}

sub show_pair {
  my ($from, $to) = @_;
  my @from = split //, $from->[1];
  my @to = split //, $to->[1];

  my $prefix = 1;
  while ($from[$prefix] eq $to[$prefix]) {
    $prefix++;
  }
  my $suffix = 0;
  while ($from[$#from-$suffix] eq $to[$#to-$suffix]) {
    $suffix++;
  }

  print $RED, highlight($from->[1], $prefix, $suffix), $RESET, "\n";
  print $GREEN, highlight($to->[1], $prefix, $suffix), $RESET, "\n";
}

sub highlight {
  my ($line, $prefix, $suffix) = @_;
  my $end = length($line) - $suffix;
  return $line unless $end > $prefix;
  return join('',
    substr($line, 0, $prefix),
    $REVERSE,
    substr($line, $prefix, $end - $prefix),
    $UNREVERSE,
    substr($line, $end)
  );
}

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2010-11-18  6:41 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-11-08 15:16 Understanding and improving --word-diff Matthijs Kooijman
2010-11-08 15:41 ` Matthieu Moy
2010-11-08 15:49   ` Matthijs Kooijman
2010-11-08 16:33   ` Thomas Rast
2010-11-08 17:35 ` Wincent Colaiuta
2010-11-08 19:22 ` Thomas Rast
2010-11-09 22:01 ` Jeff King
2010-11-10  0:05   ` Johannes Schindelin
2010-11-10  4:17     ` Jeff King
2010-11-18  6:40   ` Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).