Re: Understanding and improving --word-diff

From: Thomas Rast <trast@student.ethz.ch>
To: Matthijs Kooijman <matthijs@stdin.nl>
Cc: <git@vger.kernel.org>,
	Johannes Schindelin <johannes.schindelin@gmx.de>,
	Matthieu Moy <Matthieu.Moy@grenoble-inp.fr>,
	Wincent Colaiuta <win@wincent.com>
Subject: Re: Understanding and improving --word-diff
Date: Mon, 8 Nov 2010 20:22:13 +0100
Message-ID: <201011082022.13678.trast@student.ethz.ch>
In-Reply-To: <20101108151601.GF22067@login.drsnuggles.stderr.nl>

Matthijs Kooijman wrote:
> This is somewhat expected, of course, since the --word-diff formats are
> documented to show only changes to words, not to non-words/whitespace.
> So I guess it is expected that the output is ambiguous wrt whitespace,
> but if so, what is the use of this porcelain format? Wouldn't it make
> a lot more sense to make the format unambiguous and make it do
> word-based diff at the same time? I think this should be possible
> because of the explicit notation used for the newline.

It's not just the newlines.  You will note that when diffing e.g.

  a x  c
vs.
  a  b c

then because the word-diff engine doesn't bother with the non-word
parts, you will end up with (using ordinary --word-diff)

  a  [-x-]{+b+} c

or using the porcelain format

  $ git diff --no-index --word-diff=porcelain a b | cat -E
  diff --git 1/a 2/b$
  index 4f47486..b040ae0 100644$
  --- 1/a$
  +++ 2/b$
  @@ -1 +1 @@$
   a  $
  -x$
  +b$
    c$
  ~$

Note how the second space after 'x' disappeared.  The case that
Wincent explains in the email he linked (thanks; I didn't even see it
the first time around) shows a more drastic example.  So the logic
that inserts the space again is too simplistic.

What it does now is something like (haven't checked again) always
insert the postimage space that went with the preceding word.  What it
*would* have to do is actually compute differences in the space, while
at the same time ignoring them for the purposes of the LCS algorithm.
If it weren't for the customizable word-regex, it might be enough to
put the spaces in the words again but enable the internal equivalent
of -b.
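
To make the "compute differences in the space, but ignore them for the
LCS" idea concrete, here is a rough sketch in Python.  It is not git's
code: difflib.SequenceMatcher stands in for the real diff engine, the
function names are made up for illustration, and the default word
regex \S+ is only an assumption.  Words are matched on their own;
afterwards the gap in front of each matched word is compared between
preimage and postimage.

```python
import re
from difflib import SequenceMatcher


def split_words(line, word_regex=r"\S+"):
    """Split a line into words and the non-word gaps around them.

    gaps[i] is the text before words[i]; gaps[-1] is the trailing text.
    """
    words, gaps, pos = [], [], 0
    for m in re.finditer(word_regex, line):
        gaps.append(line[pos:m.start()])
        words.append(m.group())
        pos = m.end()
    gaps.append(line[pos:])
    return words, gaps


def word_and_space_changes(old, new, word_regex=r"\S+"):
    """Diff on words only, then report gap changes before matched words."""
    old_words, old_gaps = split_words(old, word_regex)
    new_words, new_gaps = split_words(new, word_regex)
    # SequenceMatcher is a stand-in here, not the algorithm git uses.
    matcher = SequenceMatcher(None, old_words, new_words, autojunk=False)
    word_changes, space_changes = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            for k in range(i2 - i1):
                before_old, before_new = old_gaps[i1 + k], new_gaps[j1 + k]
                if before_old != before_new:
                    space_changes.append((old_words[i1 + k], before_old, before_new))
        else:
            # Gaps in front of changed words are not reported separately.
            word_changes.append((old_words[i1:i2], new_words[j1:j2]))
    return word_changes, space_changes


words, spaces = word_and_space_changes("a x  c", "a  b c")
print(words)   # [(['x'], ['b'])]
print(spaces)  # [('c', '  ', ' ')]
```

For the example above this reports the x/b replacement plus the fact
that the gap in front of 'c' shrank from two spaces to one, which is
exactly the information the current output drops.  Trailing gaps and
gaps next to changed words are left out of this sketch.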

> Looking at the format itself, it's a bit unclear to me what the ~ lines
> mean exactly. Commit 882749, which introduced the format, says they mean
> "newlines in the input", but I'm not sure if this means the old file,
> new file or both.

As Matthieu already said, the history of --word-diff=porcelain is
just: I needed a way to pipe word-diff output to gitk for coloring,
without any messing around with the color strings.  The ~ is the
minimal required feature to at least let git communicate *some*
linebreaks.
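
For illustration, a consumer of the current format can be sketched in
a few lines of Python.  This is not what gitk does; the function names
are invented here, and the input is just the hunk body from the
transcript above (the lines after the @@ header, without the cat -E
dollar signs).

```python
def parse_porcelain_hunk(lines):
    """Group porcelain word-diff lines into display lines.

    Each display line is a list of (kind, text) runs, where kind is
    ' ' (context), '+' (added) or '-' (removed).  A '~' on a line of
    its own ends the current display line.
    """
    display, current = [], []
    for line in lines:
        if line == "~":
            display.append(current)
            current = []
        elif line[:1] in (" ", "+", "-"):
            current.append((line[0], line[1:]))
    if current:
        display.append(current)
    return display


def rebuild(runs, drop):
    """Join the text of all runs except those of kind `drop`."""
    return "".join(text for kind, text in runs if kind != drop)


hunk = [" a  ", "-x", "+b", "  c", "~"]
runs = parse_porcelain_hunk(hunk)[0]
print(repr(rebuild(runs, drop="-")))  # 'a  b c'  (the postimage)
print(repr(rebuild(runs, drop="+")))  # 'a  x c'  (not the real preimage 'a x  c')
```

Dropping the '-' runs gives back the postimage line, but dropping the
'+' runs does not give back the preimage: the whitespace around 'x'
comes out wrong, which is the ambiguity being discussed.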

You are probably right in that while the format leaves open the
possibility of showing addition and removal of space (although the
current engine does not really attempt to compute it), I did not think
far enough to let it show addition and removal of newlines.

> For example, specifying the ~ lines to mean a newline in the old, new or
> both files depending on the previous +, - or space prefixed line is
> probably enough for this. By generating empty +, - or space prefixed
> lines when needed, every occurrence of ~ could be disambiguated.

AFAICS that breaks down if you couple a newline-change-aware consumer
with a non-aware producer (i.e., a new frontend with an old git),
since there is currently no guarantee that the line before a ~ is
context.

So you would have to introduce a new format (porcelain2?) anyway, and
if you're already going that route, then for simplicity I'd rather
have something like ~+ (or so) instead of requiring the parser to
track state.
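
To show what "no state tracking" would buy the parser, here is a
purely hypothetical sketch in Python.  None of this exists in git:
"porcelain2" and "~+" are only the names floated above, "~-" is added
by symmetry, and reading a bare "~" as "newline on both sides" is an
assumption made for the example.

```python
# Hypothetical marker table; only a bare "~" exists in today's format.
NEWLINE_MARKERS = {
    "~": ("old", "new"),   # assumed meaning: newline in both files
    "~+": ("new",),        # newline only in the new file
    "~-": ("old",),        # newline only in the old file
}


def newline_sides(line):
    """Return the sides a marker line applies to, or None for other lines."""
    return NEWLINE_MARKERS.get(line)


print(newline_sides("~+"))  # ('new',)
print(newline_sides("+b"))  # None
```

Each marker line can be interpreted on its own, so the consumer never
has to look at the preceding +, - or context line.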

-- 
Thomas Rast
trast@{inf,student}.ethz.ch
