git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Thomas Rast <trast@student.ethz.ch>
To: Tay Ray Chuan <rctay89@gmail.com>
Cc: Git Mailing List <git@vger.kernel.org>,
	Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH 2/2] diff --word-diff: use non-whitespace regex by default
Date: Thu, 12 Jan 2012 10:22:03 +0100	[thread overview]
Message-ID: <87ipkhqnr8.fsf@thomas.inf.ethz.ch> (raw)
In-Reply-To: <CALUzUxo3DcKqC6sQFQ1Oi0vgASFSHCcmOgHAj2_4c3vEjy663w@mail.gmail.com> (Tay Ray Chuan's message of "Thu, 12 Jan 2012 08:52:49 +0800")

Tay Ray Chuan <rctay89@gmail.com> writes:

> On Thu, Jan 12, 2012 at 4:05 AM, Thomas Rast <trast@student.ethz.ch> wrote:
>> Tay Ray Chuan <rctay89@gmail.com> writes:
>>
>>> Factor out the comprehensive non-whitespace regex in use by PATTERNS and
>>> IPATTERN and use it as the word-diff regex for the default diff driver.
>>
>> Why?

Sorry for distracting you with the performance argument; it was mostly
the first thing that came to my mind that I could use to ask for the
motivation, and evaluation of tradeoffs, that both were missing from the
proposed commit message.

> But I think it's worthwhile to trade-off performance for a sensible
> default. Something like
>
>   matrix[a,b,c]
>   matrix[d,b,c]
>
> gives
>
>   matrix[[-a-]{+d+},b,c]
>
> and when we have
>
>   ImagineALanguageLikeFoo
>   ImagineALanguageLikeBar
>
> we get
>
>   ImagineALanguageLike[-Foo-]{+Bar+}

In that case (and I should have read the original patch), I am
definitely against this change.  It turns the default word-diff into
character-diff, which is something entirely different, and frequently
useless precisely for the reason you state:

> (But I cheated. Foo and Bar have no common characters in common; if
> they did, the word diff would be messy.)

Case in point, consider my patch sent out yesterday

  http://article.gmane.org/gmane.comp.version-control.git/188391

It consists of a one-hunk doc update.  word-diff is not brilliant:

  -k::
          Usually the program [-'cleans up'-]{+removes email cruft from+} the Subject:
          header line to extract the title line for the commit log
          [-message,-]
  [-      among which (1) remove 'Re:' or 're:', (2) leading-]
  [-      whitespaces, (3) '[' up to ']', typically '[PATCH]', and-]
  [-      then prepends "[PATCH] ".-]{+message.+}  This [-flag forbids-]{+option prevents+} this munging, and is most
          useful when used to read back 'git format-patch -k' output.
[snip the rest as it's only {+}]

But character-diff tries too hard to find common subsequences:

  $ g show HEAD^^ --word-diff-regex='[^[:space:]]' | xsel
  -k::
          Usually the program [-'cl-]{+remov+}e[-an-]s {+email cr+}u[-p'-]{+ft from+} the Subject:
          header line to extract the title line for the commit log
          message[-,-]
  [-      among which (1) remove 'Re:' or 're:', (2) leading-]
  [-      w-]{+.  T+}hi[-te-]s[-paces, (3) '[' up t-] o[-']', ty-]p[-ically '[PATCH]', and-]t[-he-]{+io+}n pre[-p-]{+v+}en[-ds "[PATCH] ".  This flag forbid-]{+t+}s this munging, and is most
          useful when used to read back 'git format-patch -k' output.
[snip]

Wouldn't you agree that

  w-]{+.  T+}hi[-te-]s[-paces, (3) '[' up t-] o[-']', ty-]p[

is just line noise?  The colors don't even help as most of it is removed
(red).

Regarding your examples

> [1] http://article.gmane.org/gmane.comp.version-control.git/105896
> [2] http://article.gmane.org/gmane.comp.version-control.git/105237

first please notice that both of them were written before (and actually
discussing) the introduction of the wordRegex feature.  At this point,
we were trying to make up our minds w.r.t. how powerful the feature
needs to be.  Nowadays (or in fact, starting a few days after those
emails) the user can easily achieve everything discussed here by setting
the wordRegex to taste.

That being said, I can see some arguments for changing the default to
split punctuation into a separate word.  That is, whereas the current
default is semantically equivalent to a wordRegex of

  [^[:space:]]*

(but has a faster code path) and your proposal is equivalent to

  [^[:space:]]|UTF_8_GUARD

I think there is a case to be made for a default of

  [^[:space:]]|([[:alnum:]]|UTF_8_GUARD)+

or some such.  There's a lot of bikeshedding lurking in the (non)extent
of the [[:alnum:]] here, however.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

  reply	other threads:[~2012-01-12  9:22 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-01-11 17:25 [PATCH 1/2] t4034-diff-words: replace regex for diff driver Tay Ray Chuan
2012-01-11 17:25 ` [PATCH 2/2] diff --word-diff: use non-whitespace regex by default Tay Ray Chuan
2012-01-11 20:05   ` Thomas Rast
2012-01-12  0:52     ` Tay Ray Chuan
2012-01-12  9:22       ` Thomas Rast [this message]
2012-01-18  7:32         ` Tay Ray Chuan
2012-01-19 15:53           ` Thomas Rast
2012-01-20  1:14             ` Tay Ray Chuan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87ipkhqnr8.fsf@thomas.inf.ethz.ch \
    --to=trast@student.ethz.ch \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=rctay89@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).