From: Thomas Rast <trast@student.ethz.ch>
To: Tay Ray Chuan <rctay89@gmail.com>
Cc: Git Mailing List <git@vger.kernel.org>,
Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH 2/2] diff --word-diff: use non-whitespace regex by default
Date: Thu, 19 Jan 2012 16:53:16 +0100
Message-ID: <87bopzofir.fsf@thomas.inf.ethz.ch>
In-Reply-To: <CALUzUxqXTXZv4RE=4rBa79T3_1y7UdqZ6okjC1y-Ve+=NDbQ2g@mail.gmail.com> (Tay Ray Chuan's message of "Wed, 18 Jan 2012 15:32:29 +0800")
Tay Ray Chuan <rctay89@gmail.com> writes:
> On Thu, Jan 12, 2012 at 5:22 PM, Thomas Rast <trast@student.ethz.ch> wrote:
>> [snip]
>> Case in point, consider my patch sent out yesterday
>>
>> http://article.gmane.org/gmane.comp.version-control.git/188391
>>
>> It consists of a one-hunk doc update. word-diff is not brilliant:
>>
>> -k::
>> Usually the program [-'cleans up'-]{+removes email cruft from+} the Subject:
>> header line to extract the title line for the commit log
>> [-message,-]
>> [- among which (1) remove 'Re:' or 're:', (2) leading-]
>> [- whitespaces, (3) '[' up to ']', typically '[PATCH]', and-]
>> [- then prepends "[PATCH] ".-]{+message.+} This [-flag forbids-]{+option prevents+} this munging, and is most
>> useful when used to read back 'git format-patch -k' output.
>> [snip the rest as it's only {+}]
>>
>> But character-diff tries too hard to find common subsequences:
>>
>> $ g show HEAD^^ --word-diff-regex='[^[:space:]]' | xsel
>>[snip]
>> w-]{+. T+}hi[-te-]s[-paces, (3) '[' up t-] o[-']', ty-]p[
>>
>> is just line noise? The colors don't even help as most of it is removed
>> (red).
>
> You missed the '+' quantifier, as in
>
> [^[:space:]]+
Did I? I was working from the example you provided earlier:
} matrix[a,b,c]
} matrix[d,b,c]
} gives
} matrix[[-a-]{+d+},b,c]
}
} and when we have
}
} ImagineALanguageLikeFoo
} ImagineALanguageLikeBar
} we get
} ImagineALanguageLike[-Foo-]{+Bar+}
Under [^[:space:]]+ neither of the examples would work. Actually,
[^[:space:]]+ is the same as today's default; the [^[:space:]]* I
mentioned later is (strictly speaking) broken, as it allows for a
0-length match. (It doesn't really matter because IIRC the engine
ignores 0-length words.)
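To make the difference concrete, here is a minimal Python sketch of a word diff driven by a configurable word regex. It is not git's implementation: word_diff is a hypothetical helper, difflib stands in for git's diff machinery, and Python's \S stands in for the POSIX class [^[:space:]], which Python's re module does not support. Whitespace between words is simply dropped, which is enough for the single-word examples above.

```python
import difflib
import re


def word_diff(old, new, word_regex):
    """Diff two strings word by word, where a 'word' is one regex match.

    Hypothetical helper for illustration only; 0-length matches are
    discarded, mirroring the 'engine ignores 0-length words' remark.
    """
    a = [m.group(0) for m in re.finditer(word_regex, old) if m.group(0)]
    b = [m.group(0) for m in re.finditer(word_regex, new) if m.group(0)]
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.append("".join(a[i1:i2]))
            continue
        if i1 < i2:  # words only present in the old text
            out.append("[-" + "".join(a[i1:i2]) + "-]")
        if j1 < j2:  # words only present in the new text
            out.append("{+" + "".join(b[j1:j2]) + "+}")
    return "".join(out)


# One character per word: the change is isolated inside the brackets.
print(word_diff("matrix[a,b,c]", "matrix[d,b,c]", r"\S"))
# matrix[[-a-]{+d+},b,c]

# With the '+' quantifier the whole run of non-whitespace is one word.
print(word_diff("matrix[a,b,c]", "matrix[d,b,c]", r"\S+"))
# [-matrix[a,b,c]-]{+matrix[d,b,c]+}

# The '*' variant also produces 0-length matches, which must be ignored.
print("" in re.findall(r"\S*", "a b"))
# True
```

The second call shows why neither example works with the '+' quantifier: both lines consist of a single whitespace-delimited word, so the whole word is reported as changed.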
>> That being said, I can see some arguments for changing the default to
>> split punctuation into a separate word. That is, whereas the current
>> default is semantically equivalent to a wordRegex of
>>
>> [^[:space:]]*
>>
>> (but has a faster code path)
>
> Oh right, there *is* a sensible default implemented in. Somehow I was
> under the impression that there wasn't.
>
> I wonder which is faster, using the non-whitespace regex, or the
> isspace() calls...
I tried measuring it across a few commits, but it mostly gets drowned
out by the diff effort. For a commit with stat
exercises/cgal/cover/cover.cpp | 5 +-
exercises/cgal/cover/cover.in1 |27014 +++++++++++++++-----
exercises/cgal/cover/cover.in2 |48996 +++++++++++++++++++++++------------
exercises/cgal/cover/cover.in3 |55041 +++++++++++++++++++++++++--------------
exercises/cgal/cover/cover.in4 |47600 ++++++++++++++++++++--------------
exercises/cgal/cover/cover.int |43491 ++++++++++++++++++++++---------
exercises/cgal/cover/cover.out1 | 53 +-
exercises/cgal/cover/cover.out2 | 24 +-
exercises/cgal/cover/cover.out3 | 11 +-
exercises/cgal/cover/cover.out4 | 2 +-
exercises/cgal/cover/cover.outt | 23 +-
exercises/cgal/cover/gen | 39 +-
exercises/cgal/cover/gen-1.cpp | 4 +-
exercises/cgal/cover/gen-2.cpp | 6 +-
exercises/cgal/cover/gen-3.cpp | 6 +-
(sorry, can't share as those testcases are secret) I get best-of-5
timings
--word-diff-regex='[^[:space:]]+'    0:07.50real 7.40user 0.07system
--word-diff                          0:07.47real 7.41user 0.03system
In conclusion, "meh". I think ripping out the isspace() part would make
for a nice code reduction.
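For anyone who wants to repeat the comparison on their own repository, a best-of-N measurement could be sketched as follows. This is not the script used for the numbers above: best_of is a hypothetical helper, it only measures wall-clock time (not user/system), and the commit name in the usage comment is a placeholder.

```python
import subprocess
import time


def best_of(cmd, runs=5):
    """Return the smallest wall-clock time in seconds over `runs` executions.

    Output is discarded so that terminal speed does not skew the result.
    """
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=True,
        )
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical usage inside a git repository (COMMIT is a placeholder):
#
#   best_of(["git", "show", "--word-diff", "COMMIT"])
#   best_of(["git", "show", "--word-diff-regex=[^[:space:]]+", "COMMIT"])
```

Taking the minimum rather than the mean filters out runs that were disturbed by other activity on the machine.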
>> and your proposal is equivalent to
>>
>> [^[:space:]]|UTF_8_GUARD
>>
>> I think there is a case to be made for a default of
>>
>> [^[:space:]]|([[:alnum:]]|UTF_8_GUARD)+
>>
>> or some such. There's a lot of bikeshedding lurking in the (non)extent
>> of the [[:alnum:]] here, however.
>
> Care to explain further? Not to sure what you mean here.
For natural language, it may or may not make sense to match numbers as
part of a word.
For typical use in e.g. emails, a lot of punctuation has a double role;
breaking words in
http://article.gmane.org/gmane.comp.version-control.git/188391
may or may not make sense.
For some uses, especially source code, it would be better to match an
underscore _ as part of a complete word, too.
For some programming languages, say lisp, a dash - would also belong in
the same category.
There's no real reason other than ease of implementation why the pattern
handles ASCII non-alphanumerics separately, but non-ASCII UTF-8
non-alnums (like, say, Unicode NO-BREAK SPACE, which would show as \xc2
\xa0) always go into a word. But if you were to make UTF-8 sequences
a single word, text in (say) many European languages would become
chunked at accented letters.
I'm sure you can find more items for this list. It's a grey area.
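The trade-offs in this list can be tried out with a small byte-level tokenizer. The sketch below rests on several assumptions: UTF_8_GUARD is taken to mean "a lead byte followed by continuation bytes", written here as [\xc0-\xff][\x80-\xbf]+; [[:alnum:]] is spelled out as ASCII letters and digits because Python's re module has no POSIX classes; and the two alternatives are swapped because Python picks the first alternative that matches rather than the longest one. make_word_regex and words are hypothetical names.

```python
import re

# Assumed meaning of UTF_8_GUARD: one multi-byte UTF-8 sequence.
UTF_8_GUARD = rb"[\xc0-\xff][\x80-\xbf]+"


def make_word_regex(extra=b""):
    """Build a byte regex: runs of alnum/UTF-8 sequences, else one character.

    `extra` lists additional ASCII bytes to treat as word characters,
    e.g. b"_" for source code or b"-" for lisp.
    """
    alnum = rb"[A-Za-z0-9" + re.escape(extra) + rb"]"
    # Long alternative first: Python's alternation is leftmost-first.
    return re.compile(rb"(?:" + alnum + rb"|" + UTF_8_GUARD + rb")+|[^\s]")


def words(text, extra=b""):
    """Split text (str) into words, returned as UTF-8 encoded bytes."""
    return make_word_regex(extra).findall(text.encode("utf-8"))


# Punctuation becomes its own word; the accented letter stays in its word.
print(words("foo_bar, na\u00efve"))
# [b'foo', b'_', b'bar', b',', b'na\xc3\xafve']

# Treating '_' as a word character keeps identifiers whole.
print(words("foo_bar", extra=b"_"))
# [b'foo_bar']

# NO-BREAK SPACE (\xc2\xa0) is swallowed into the surrounding word.
print(words("a\u00a0b"))
# [b'a\xc2\xa0b']
```

The last call illustrates the asymmetry described above: an ASCII space would separate "a" and "b", while the non-ASCII NO-BREAK SPACE glues them into one word.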
--
Thomas Rast
trast@{inf,student}.ethz.ch