* [PATCH 1/2] t4034-diff-words: replace regex for diff driver @ 2012-01-11 17:25 Tay Ray Chuan 2012-01-11 17:25 ` [PATCH 2/2] diff --word-diff: use non-whitespace regex by default Tay Ray Chuan 0 siblings, 1 reply; 8+ messages in thread From: Tay Ray Chuan @ 2012-01-11 17:25 UTC (permalink / raw) To: Git Mailing List; +Cc: Junio C Hamano The next patch uses a non-whitespace regex, similar to the regex currently used by the 'testdriver' diff driver; replace the regex with a distinct one so that we can continue to conclude its effects. Signed-off-by: Tay Ray Chuan <rctay89@gmail.com> --- Kept separate to keep the next patch clean. --- t/t4034-diff-words.sh | 20 +++++++++++++++++--- 1 files changed, 17 insertions(+), 3 deletions(-) diff --git a/t/t4034-diff-words.sh b/t/t4034-diff-words.sh index 6f1e5a2..9ae0e1a 100755 --- a/t/t4034-diff-words.sh +++ b/t/t4034-diff-words.sh @@ -46,6 +46,20 @@ cat >expect.non-whitespace-is-word <<-\EOF <GREEN>aeff = aeff * ( aaa )<RESET> EOF +cat >expect.everything-is-word <<-\EOF + <BOLD>diff --git a/pre b/post<RESET> + <BOLD>index 330b04f..5ed8eff 100644<RESET> + <BOLD>--- a/pre<RESET> + <BOLD>+++ b/post<RESET> + <CYAN>@@ -1,3 +1,7 @@<RESET> + <RED>h(4)<RESET><GREEN>h(4),hh[44]<RESET> + + a = b + c<RESET> + + <GREEN>aa = a<RESET> + + <GREEN>aeff = aeff * ( aaa )<RESET> +EOF word_diff () { test_must_fail git diff --no-index "$@" pre post >output && @@ -179,7 +193,7 @@ test_expect_success 'word diff with a regular expression' ' ' test_expect_success 'set up a diff driver' ' - git config diff.testdriver.wordRegex "[^[:space:]]" && + git config diff.testdriver.wordRegex ".+" && cat <<-\EOF >.gitattributes pre diff=testdriver post diff=testdriver @@ -192,7 +206,7 @@ test_expect_success 'option overrides .gitattributes' ' ' test_expect_success 'use regex supplied by driver' ' - cp expect.non-whitespace-is-word expect && + cp expect.everything-is-word expect && word_diff --color-words ' @@ -224,7 +238,7 @@ test_expect_success 'command-line overrides config: --word-diff-regex' ' ' test_expect_success '.gitattributes override config' ' - cp expect.non-whitespace-is-word expect && + cp expect.everything-is-word expect && word_diff --color-words ' -- 1.7.7.584.g16d0ea ^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH 2/2] diff --word-diff: use non-whitespace regex by default 2012-01-11 17:25 [PATCH 1/2] t4034-diff-words: replace regex for diff driver Tay Ray Chuan @ 2012-01-11 17:25 ` Tay Ray Chuan 2012-01-11 20:05 ` Thomas Rast 0 siblings, 1 reply; 8+ messages in thread From: Tay Ray Chuan @ 2012-01-11 17:25 UTC (permalink / raw) To: Git Mailing List; +Cc: Junio C Hamano Factor out the comprehensive non-whitespace regex in use by PATTERNS and IPATTERN and use it as the word-diff regex for the default diff driver. As the default regex is no longer non-empty, update the word-regex selection logic (non-default driver from pre-image, then post-image, then the diff.wordRegex config) accordingly. Signed-off-by: Tay Ray Chuan <rctay89@gmail.com> --- diff.c | 14 ++++++++------ t/t4034-diff-words.sh | 31 +++++++++---------------------- userdiff.c | 8 +++++--- 3 files changed, 22 insertions(+), 31 deletions(-) diff --git a/diff.c b/diff.c index 374ecf3..5f71f9f 100644 --- a/diff.c +++ b/diff.c @@ -1987,9 +1987,10 @@ static const struct userdiff_funcname *diff_funcname_pattern(struct diff_filespe return one->driver->funcname.pattern ? &one->driver->funcname : NULL; } -static const char *userdiff_word_regex(struct diff_filespec *one) +static const char *userdiff_word_regex(struct diff_filespec *one, int *is_default) { diff_filespec_load_driver(one); + *is_default = !strcmp(one->driver->name, "default"); return one->driver->word_regex; } @@ -2180,17 +2181,18 @@ static void builtin_diff(const char *name_a, else if (!prefixcmp(diffopts, "-u")) xecfg.ctxlen = strtoul(diffopts + 2, NULL, 10); if (o->word_diff) { - int i; + int i, is_default; ecbdata.diff_words = xcalloc(1, sizeof(struct diff_words_data)); ecbdata.diff_words->type = o->word_diff; ecbdata.diff_words->opt = o; + is_default = 0; if (!o->word_regex) - o->word_regex = userdiff_word_regex(one); - if (!o->word_regex) - o->word_regex = userdiff_word_regex(two); - if (!o->word_regex) + o->word_regex = userdiff_word_regex(one, &is_default); + if (is_default) + o->word_regex = userdiff_word_regex(two, &is_default); + if (is_default && diff_word_regex_cfg) o->word_regex = diff_word_regex_cfg; if (o->word_regex) { ecbdata.diff_words->word_regex = (regex_t *) diff --git a/t/t4034-diff-words.sh b/t/t4034-diff-words.sh index 9ae0e1a..e588849 100755 --- a/t/t4034-diff-words.sh +++ b/t/t4034-diff-words.sh @@ -84,26 +84,13 @@ test_expect_success setup ' git config diff.color.func magenta ' -test_expect_success 'set up pre and post with runs of whitespace' ' +test_expect_success 'set up pre and post with runs of non-whitespace' ' cp pre.simple pre && - cp post.simple post + cp post.simple post && + cp expect.non-whitespace-is-word expect ' -test_expect_success 'word diff with runs of whitespace' ' - cat >expect <<-\EOF && - <BOLD>diff --git a/pre b/post<RESET> - <BOLD>index 330b04f..5ed8eff 100644<RESET> - <BOLD>--- a/pre<RESET> - <BOLD>+++ b/post<RESET> - <CYAN>@@ -1,3 +1,7 @@<RESET> - <RED>h(4)<RESET><GREEN>h(4),hh[44]<RESET> - - a = b + c<RESET> - - <GREEN>aa = a<RESET> - - <GREEN>aeff = aeff * ( aaa )<RESET> - EOF +test_expect_success 'word diff defaults to runs of non-whitespace' ' word_diff --color-words && word_diff --word-diff=color && word_diff --color --word-diff=color @@ -116,8 +103,8 @@ test_expect_success '--word-diff=porcelain' ' --- a/pre +++ b/post @@ -1,3 +1,7 @@ - -h(4) - +h(4),hh[44] + h(4) + +,hh[44] ~ # significant space ~ @@ -140,7 +127,7 @@ test_expect_success '--word-diff=plain' ' --- a/pre +++ b/post @@ -1,3 +1,7 @@ - [-h(4)-]{+h(4),hh[44]+} + h(4){+,hh[44]+} a = b + c @@ -159,7 +146,7 @@ test_expect_success '--word-diff=plain --color' ' <BOLD>--- a/pre<RESET> <BOLD>+++ b/post<RESET> <CYAN>@@ -1,3 +1,7 @@<RESET> - <RED>[-h(4)-]<RESET><GREEN>{+h(4),hh[44]+}<RESET> + h(4)<GREEN>{+,hh[44]+}<RESET> a = b + c<RESET> @@ -177,7 +164,7 @@ test_expect_success 'word diff without context' ' <BOLD>--- a/pre<RESET> <BOLD>+++ b/post<RESET> <CYAN>@@ -1 +1 @@<RESET> - <RED>h(4)<RESET><GREEN>h(4),hh[44]<RESET> + h(4)<GREEN>,hh[44]<RESET> <CYAN>@@ -3,0 +4,4 @@<RESET> <RESET><MAGENTA>a = b + c<RESET> <GREEN>aa = a<RESET> diff --git a/userdiff.c b/userdiff.c index 76109da..cf38566 100644 --- a/userdiff.c +++ b/userdiff.c @@ -7,12 +7,14 @@ static struct userdiff_driver *drivers; static int ndrivers; static int drivers_alloc; +#define NON_WHITESPACE \ + "[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" #define PATTERNS(name, pattern, word_regex) \ { name, NULL, -1, { pattern, REG_EXTENDED }, \ - word_regex "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" } + word_regex "|" NON_WHITESPACE } #define IPATTERN(name, pattern, word_regex) \ { name, NULL, -1, { pattern, REG_EXTENDED | REG_ICASE }, \ - word_regex "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" } + word_regex "|" NON_WHITESPACE } static struct userdiff_driver builtin_drivers[] = { IPATTERN("fortran", "!^([C*]|[ \t]*!)\n" @@ -140,7 +142,7 @@ PATTERNS("csharp", "[a-zA-Z_][a-zA-Z0-9_]*" "|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lL]?" "|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->"), -{ "default", NULL, -1, { NULL, 0 } }, +{ "default", NULL, -1, { NULL, 0 }, NON_WHITESPACE }, }; #undef PATTERNS #undef IPATTERN -- 1.7.7.584.g16d0ea ^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH 2/2] diff --word-diff: use non-whitespace regex by default 2012-01-11 17:25 ` [PATCH 2/2] diff --word-diff: use non-whitespace regex by default Tay Ray Chuan @ 2012-01-11 20:05 ` Thomas Rast 2012-01-12 0:52 ` Tay Ray Chuan 0 siblings, 1 reply; 8+ messages in thread From: Thomas Rast @ 2012-01-11 20:05 UTC (permalink / raw) To: Tay Ray Chuan; +Cc: Git Mailing List, Junio C Hamano Tay Ray Chuan <rctay89@gmail.com> writes: > Factor out the comprehensive non-whitespace regex in use by PATTERNS and > IPATTERN and use it as the word-diff regex for the default diff driver. Why? I seem to recall that the motivation for keeping the original code as-is instead of just emulating its behavior with a default regex was that it is faster. So disabling the default mode should at least have an advantage? </devils-advocate> -- Thomas Rast trast@{inf,student}.ethz.ch ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 2/2] diff --word-diff: use non-whitespace regex by default 2012-01-11 20:05 ` Thomas Rast @ 2012-01-12 0:52 ` Tay Ray Chuan 2012-01-12 9:22 ` Thomas Rast 0 siblings, 1 reply; 8+ messages in thread From: Tay Ray Chuan @ 2012-01-12 0:52 UTC (permalink / raw) To: Thomas Rast; +Cc: Git Mailing List, Junio C Hamano Hi, Thomas, first off, thanks for looking through this. On Thu, Jan 12, 2012 at 4:05 AM, Thomas Rast <trast@student.ethz.ch> wrote: > Tay Ray Chuan <rctay89@gmail.com> writes: > >> Factor out the comprehensive non-whitespace regex in use by PATTERNS and >> IPATTERN and use it as the word-diff regex for the default diff driver. > > Why? > > I seem to recall that the motivation for keeping the original code as-is > instead of just emulating its behavior with a default regex was that it > is faster. So disabling the default mode should at least have an > advantage? > > </devils-advocate> If you're talking about speed, yeah, that's probably true. But I think it's worthwhile to trade-off performance for a sensible default. Something like matrix[a,b,c] matrix[d,b,c] gives matrix[[-a-]{+d+},b,c] and when we have ImagineALanguageLikeFoo ImagineALanguageLikeBar we get ImagineALanguageLike[-Foo-]{+Bar+} (But I cheated. Foo and Bar have no common characters in common; if they did, the word diff would be messy.) Both of which seem sensible. From a usability/effectiveness standpoint, I think it's more useful than what the current word-diff defaults to - the whole line is taken as a "word", with the pre-image shown as deleted and the post-image as added; we don't even try to run LCS on it. Examples are lifted from: [1] http://article.gmane.org/gmane.comp.version-control.git/105896 [2] http://article.gmane.org/gmane.comp.version-control.git/105237 -- Cheers, Ray Chuan ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 2/2] diff --word-diff: use non-whitespace regex by default 2012-01-12 0:52 ` Tay Ray Chuan @ 2012-01-12 9:22 ` Thomas Rast 2012-01-18 7:32 ` Tay Ray Chuan 0 siblings, 1 reply; 8+ messages in thread From: Thomas Rast @ 2012-01-12 9:22 UTC (permalink / raw) To: Tay Ray Chuan; +Cc: Git Mailing List, Junio C Hamano Tay Ray Chuan <rctay89@gmail.com> writes: > On Thu, Jan 12, 2012 at 4:05 AM, Thomas Rast <trast@student.ethz.ch> wrote: >> Tay Ray Chuan <rctay89@gmail.com> writes: >> >>> Factor out the comprehensive non-whitespace regex in use by PATTERNS and >>> IPATTERN and use it as the word-diff regex for the default diff driver. >> >> Why? Sorry for distracting you with the performance argument; it was mostly the first thing that came to my mind that I could use to ask for the motivation, and evaluation of tradeoffs, that both were missing from the proposed commit message. > But I think it's worthwhile to trade-off performance for a sensible > default. Something like > > matrix[a,b,c] > matrix[d,b,c] > > gives > > matrix[[-a-]{+d+},b,c] > > and when we have > > ImagineALanguageLikeFoo > ImagineALanguageLikeBar > > we get > > ImagineALanguageLike[-Foo-]{+Bar+} In that case (and I should have read the original patch), I am definitely against this change. It turns the default word-diff into character-diff, which is something entirely different, and frequently useless precisely for the reason you state: > (But I cheated. Foo and Bar have no common characters in common; if > they did, the word diff would be messy.) Case in point, consider my patch sent out yesterday http://article.gmane.org/gmane.comp.version-control.git/188391 It consists of a one-hunk doc update. word-diff is not brilliant: -k:: Usually the program [-'cleans up'-]{+removes email cruft from+} the Subject: header line to extract the title line for the commit log [-message,-] [- among which (1) remove 'Re:' or 're:', (2) leading-] [- whitespaces, (3) '[' up to ']', typically '[PATCH]', and-] [- then prepends "[PATCH] ".-]{+message.+} This [-flag forbids-]{+option prevents+} this munging, and is most useful when used to read back 'git format-patch -k' output. [snip the rest as it's only {+}] But character-diff tries too hard to find common subsequences: $ g show HEAD^^ --word-diff-regex='[^[:space:]]' | xsel -k:: Usually the program [-'cl-]{+remov+}e[-an-]s {+email cr+}u[-p'-]{+ft from+} the Subject: header line to extract the title line for the commit log message[-,-] [- among which (1) remove 'Re:' or 're:', (2) leading-] [- w-]{+. T+}hi[-te-]s[-paces, (3) '[' up t-] o[-']', ty-]p[-ically '[PATCH]', and-]t[-he-]{+io+}n pre[-p-]{+v+}en[-ds "[PATCH] ". This flag forbid-]{+t+}s this munging, and is most useful when used to read back 'git format-patch -k' output. [snip] Wouldn't you agree that w-]{+. T+}hi[-te-]s[-paces, (3) '[' up t-] o[-']', ty-]p[ is just line noise? The colors don't even help as most of it is removed (red). Regarding your examples > [1] http://article.gmane.org/gmane.comp.version-control.git/105896 > [2] http://article.gmane.org/gmane.comp.version-control.git/105237 first please notice that both of them were written before (and actually discussing) the introduction of the wordRegex feature. At this point, we were trying to make up our minds w.r.t. how powerful the feature needs to be. Nowadays (or in fact, starting a few days after those emails) the user can easily achieve everything discussed here by setting the wordRegex to taste. That being said, I can see some arguments for changing the default to split punctuation into a separate word. That is, whereas the current default is semantically equivalent to a wordRegex of [^[:space:]]* (but has a faster code path) and your proposal is equivalent to [^[:space:]]|UTF_8_GUARD I think there is a case to be made for a default of [^[:space:]]|([[:alnum:]]|UTF_8_GUARD)+ or some such. There's a lot of bikeshedding lurking in the (non)extent of the [[:alnum:]] here, however. -- Thomas Rast trast@{inf,student}.ethz.ch ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 2/2] diff --word-diff: use non-whitespace regex by default 2012-01-12 9:22 ` Thomas Rast @ 2012-01-18 7:32 ` Tay Ray Chuan 2012-01-19 15:53 ` Thomas Rast 0 siblings, 1 reply; 8+ messages in thread From: Tay Ray Chuan @ 2012-01-18 7:32 UTC (permalink / raw) To: Thomas Rast; +Cc: Git Mailing List, Junio C Hamano On Thu, Jan 12, 2012 at 5:22 PM, Thomas Rast <trast@student.ethz.ch> wrote: > [snip] > Case in point, consider my patch sent out yesterday > > http://article.gmane.org/gmane.comp.version-control.git/188391 > > It consists of a one-hunk doc update. word-diff is not brilliant: > > -k:: > Usually the program [-'cleans up'-]{+removes email cruft from+} the Subject: > header line to extract the title line for the commit log > [-message,-] > [- among which (1) remove 'Re:' or 're:', (2) leading-] > [- whitespaces, (3) '[' up to ']', typically '[PATCH]', and-] > [- then prepends "[PATCH] ".-]{+message.+} This [-flag forbids-]{+option prevents+} this munging, and is most > useful when used to read back 'git format-patch -k' output. > [snip the rest as it's only {+}] > > But character-diff tries too hard to find common subsequences: > > $ g show HEAD^^ --word-diff-regex='[^[:space:]]' | xsel >[snip] > w-]{+. T+}hi[-te-]s[-paces, (3) '[' up t-] o[-']', ty-]p[ > > is just line noise? The colors don't even help as most of it is removed > (red). You missed the '+' quantifier, as in [^[:space:]]+ Using that regex, that abomination of a word-diff that you mentioned disappears, like this: -k:: Usually the program [-'cleans up'-]{+removes email cruft from+} the Subject: header line to extract the title line for the commit log [-message,-] [- among which (1) remove 'Re:' or 're:', (2) leading-] [- whitespaces, (3) '[' up to ']', typically '[PATCH]', and-] [- then prepends "[PATCH] ".-]{+message.+} This [-flag forbids-]{+option prevents+} this munging, and is most useful when used to read back 'git format-patch -k' output. > [snip] > That being said, I can see some arguments for changing the default to > split punctuation into a separate word. That is, whereas the current > default is semantically equivalent to a wordRegex of > > [^[:space:]]* > > (but has a faster code path) Oh right, there *is* a sensible default implemented in. Somehow I was under the impression that there wasn't. I wonder which is faster, using the non-whitespace regex, or the isspace() calls... > and your proposal is equivalent to > > [^[:space:]]|UTF_8_GUARD > > I think there is a case to be made for a default of > > [^[:space:]]|([[:alnum:]]|UTF_8_GUARD)+ > > or some such. There's a lot of bikeshedding lurking in the (non)extent > of the [[:alnum:]] here, however. Care to explain further? Not to sure what you mean here. -- Cheers, Ray Chuan ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 2/2] diff --word-diff: use non-whitespace regex by default 2012-01-18 7:32 ` Tay Ray Chuan @ 2012-01-19 15:53 ` Thomas Rast 2012-01-20 1:14 ` Tay Ray Chuan 0 siblings, 1 reply; 8+ messages in thread From: Thomas Rast @ 2012-01-19 15:53 UTC (permalink / raw) To: Tay Ray Chuan; +Cc: Git Mailing List, Junio C Hamano Tay Ray Chuan <rctay89@gmail.com> writes: > On Thu, Jan 12, 2012 at 5:22 PM, Thomas Rast <trast@student.ethz.ch> wrote: >> [snip] >> Case in point, consider my patch sent out yesterday >> >> http://article.gmane.org/gmane.comp.version-control.git/188391 >> >> It consists of a one-hunk doc update. word-diff is not brilliant: >> >> -k:: >> Usually the program [-'cleans up'-]{+removes email cruft from+} the Subject: >> header line to extract the title line for the commit log >> [-message,-] >> [- among which (1) remove 'Re:' or 're:', (2) leading-] >> [- whitespaces, (3) '[' up to ']', typically '[PATCH]', and-] >> [- then prepends "[PATCH] ".-]{+message.+} This [-flag forbids-]{+option prevents+} this munging, and is most >> useful when used to read back 'git format-patch -k' output. >> [snip the rest as it's only {+}] >> >> But character-diff tries too hard to find common subsequences: >> >> $ g show HEAD^^ --word-diff-regex='[^[:space:]]' | xsel >>[snip] >> w-]{+. T+}hi[-te-]s[-paces, (3) '[' up t-] o[-']', ty-]p[ >> >> is just line noise? The colors don't even help as most of it is removed >> (red). > > You missed the '+' quantifier, as in > > [^[:space:]]+ Did I? I was working from the example you provided earlier } matrix[a,b,c] } matrix[d,b,c] } gives } matrix[[-a-]{+d+},b,c] } } and when we have } } ImagineALanguageLikeFoo } ImagineALanguageLikeBar } we get } ImagineALanguageLike[-Foo-]{+Bar+} Under [^[:space:]]+ neither of the examples would work. Actually, [^[:space:]]+ is the same as today's default, the [^[:space:]]* I mentioned later is (strictly speaking) broken as it allows for a 0-length match. (It doesn't really matter because IIRC the engine ignores 0-length words.) >> That being said, I can see some arguments for changing the default to >> split punctuation into a separate word. That is, whereas the current >> default is semantically equivalent to a wordRegex of >> >> [^[:space:]]* >> >> (but has a faster code path) > > Oh right, there *is* a sensible default implemented in. Somehow I was > under the impression that there wasn't. > > I wonder which is faster, using the non-whitespace regex, or the > isspace() calls... I tried measuring it across a few commits, but it mostly gets drowned out by the diff effort. For a commit with stat exercises/cgal/cover/cover.cpp | 5 +- exercises/cgal/cover/cover.in1 |27014 +++++++++++++++----- exercises/cgal/cover/cover.in2 |48996 +++++++++++++++++++++++------------ exercises/cgal/cover/cover.in3 |55041 +++++++++++++++++++++++++-------------- exercises/cgal/cover/cover.in4 |47600 ++++++++++++++++++++-------------- exercises/cgal/cover/cover.int |43491 ++++++++++++++++++++++--------- exercises/cgal/cover/cover.out1 | 53 +- exercises/cgal/cover/cover.out2 | 24 +- exercises/cgal/cover/cover.out3 | 11 +- exercises/cgal/cover/cover.out4 | 2 +- exercises/cgal/cover/cover.outt | 23 +- exercises/cgal/cover/gen | 39 +- exercises/cgal/cover/gen-1.cpp | 4 +- exercises/cgal/cover/gen-2.cpp | 6 +- exercises/cgal/cover/gen-3.cpp | 6 +- (sorry, can't share as those testcases are secret) I get best-of-5 timings --word-diff-regex='[^[:space:]]+' 0:07.50real 7.40user 0.07system --word-diff 0:07.47real 7.41user 0.03system In conclusion, "meh". I think ripping out the isspace() part would make for a nice code reduction. >> and your proposal is equivalent to >> >> [^[:space:]]|UTF_8_GUARD >> >> I think there is a case to be made for a default of >> >> [^[:space:]]|([[:alnum:]]|UTF_8_GUARD)+ >> >> or some such. There's a lot of bikeshedding lurking in the (non)extent >> of the [[:alnum:]] here, however. > > Care to explain further? Not to sure what you mean here. For natural language, it may or may not make sense to match numbers as part of a word. For typical use in e.g. emails, a lot of punctuation has a double role; breaking words in http://article.gmane.org/gmane.comp.version-control.git/188391 may or may not make sense. For some uses, especially source code, it would be better to match an underscore _ as part of a complete word, too. For some programming languages, say lisp, a dash - would also belong in the same category. There's no real reason other than ease of implementation why the pattern handles ASCII non-alphanumerics separately, but non-ASCII UTF-8 non-alnums (like, say, unicode NO-BREAK SPACE which would show as \xc2 \xa0) always goes into a word. But if you were to make UTF-8 sequences a single word, text in (say) many European languages would become chunked at accented letters. I'm sure you can find more items for this list. It's a grey area. -- Thomas Rast trast@{inf,student}.ethz.ch ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 2/2] diff --word-diff: use non-whitespace regex by default 2012-01-19 15:53 ` Thomas Rast @ 2012-01-20 1:14 ` Tay Ray Chuan 0 siblings, 0 replies; 8+ messages in thread From: Tay Ray Chuan @ 2012-01-20 1:14 UTC (permalink / raw) To: Thomas Rast; +Cc: Git Mailing List, Junio C Hamano On Thu, Jan 19, 2012 at 11:53 PM, Thomas Rast <trast@student.ethz.ch> wrote: >[snip] > Under [^[:space:]]+ neither of the examples would work. Actually, > [^[:space:]]+ is the same as today's default, the [^[:space:]]* I > mentioned later is (strictly speaking) broken as it allows for a > 0-length match. (It doesn't really matter because IIRC the engine > ignores 0-length words.) My bad. >[snip] > I tried measuring it across a few commits, but it mostly gets drowned > out by the diff effort. For a commit with stat > > exercises/cgal/cover/cover.cpp | 5 +- > exercises/cgal/cover/cover.in1 |27014 +++++++++++++++----- > exercises/cgal/cover/cover.in2 |48996 +++++++++++++++++++++++------------ > exercises/cgal/cover/cover.in3 |55041 +++++++++++++++++++++++++-------------- > exercises/cgal/cover/cover.in4 |47600 ++++++++++++++++++++-------------- > exercises/cgal/cover/cover.int |43491 ++++++++++++++++++++++--------- > exercises/cgal/cover/cover.out1 | 53 +- > exercises/cgal/cover/cover.out2 | 24 +- > exercises/cgal/cover/cover.out3 | 11 +- > exercises/cgal/cover/cover.out4 | 2 +- > exercises/cgal/cover/cover.outt | 23 +- > exercises/cgal/cover/gen | 39 +- > exercises/cgal/cover/gen-1.cpp | 4 +- > exercises/cgal/cover/gen-2.cpp | 6 +- > exercises/cgal/cover/gen-3.cpp | 6 +- > > (sorry, can't share as those testcases are secret) I get best-of-5 > timings > > --word-diff-regex='[^[:space:]]+' 0:07.50real 7.40user 0.07system > --word-diff 0:07.47real 7.41user 0.03system > > In conclusion, "meh". I think ripping out the isspace() part would make > for a nice code reduction. Thanks for the numbers. Well, that agrees with the intuition that regex is slower than isspace(), since you have run it through the regex engine. >>> and your proposal is equivalent to >>> >>> [^[:space:]]|UTF_8_GUARD >>> >>> I think there is a case to be made for a default of >>> >>> [^[:space:]]|([[:alnum:]]|UTF_8_GUARD)+ >>> >>> or some such. There's a lot of bikeshedding lurking in the (non)extent >>> of the [[:alnum:]] here, however. >> >> Care to explain further? Not to sure what you mean here. > > For natural language, it may or may not make sense to match numbers as > part of a word. > > For typical use in e.g. emails, a lot of punctuation has a double role; > breaking words in > > http://article.gmane.org/gmane.comp.version-control.git/188391 > > may or may not make sense. > > For some uses, especially source code, it would be better to match an > underscore _ as part of a complete word, too. > > For some programming languages, say lisp, a dash - would also belong in > the same category. > > There's no real reason other than ease of implementation why the pattern > handles ASCII non-alphanumerics separately, but non-ASCII UTF-8 > non-alnums (like, say, unicode NO-BREAK SPACE which would show as \xc2 > \xa0) always goes into a word. But if you were to make UTF-8 sequences > a single word, text in (say) many European languages would become > chunked at accented letters. > > I'm sure you can find more items for this list. It's a grey area. Thanks. -- Cheers, Ray Chuan ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2012-01-20 1:14 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2012-01-11 17:25 [PATCH 1/2] t4034-diff-words: replace regex for diff driver Tay Ray Chuan 2012-01-11 17:25 ` [PATCH 2/2] diff --word-diff: use non-whitespace regex by default Tay Ray Chuan 2012-01-11 20:05 ` Thomas Rast 2012-01-12 0:52 ` Tay Ray Chuan 2012-01-12 9:22 ` Thomas Rast 2012-01-18 7:32 ` Tay Ray Chuan 2012-01-19 15:53 ` Thomas Rast 2012-01-20 1:14 ` Tay Ray Chuan
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).