From: Junio C Hamano <gitster@pobox.com>
To: "Ping Yin" <pkufranky@gmail.com>
Cc: "Johannes Schindelin" <Johannes.Schindelin@gmx.de>, git@vger.kernel.org
Subject: Re: [PATCH v2 4/5] Make boundary characters for --color-words configurable
Date: Sun, 04 May 2008 22:00:13 -0700 [thread overview]
Message-ID: <7vprs1ny5e.fsf@gitster.siamese.dyndns.org> (raw)
In-Reply-To: <46dff0320805041840g1b9362d3u138b9d40cde160f2@mail.gmail.com> (Ping Yin's message of "Mon, 5 May 2008 09:40:47 +0800")
"Ping Yin" <pkufranky@gmail.com> writes:
> For this example,both "/if/while/ (i />/>=/ /1/0/)" and "/if/while/
> (i >//=/ /1/0/)" are fine to me.
For the particular example, both are Ok, but for this other example:
-if (i > 1...
+if ((i > 1...
it probably is better to treat each non-word character as a separate
token, that is, it would be easier to read if we said "( stayed intact,
and another ( was added", instead of saying "( is changed to ((".
So "a run of punct chars" rule only sometimes produces better output but
otherwise worse output, and to make it produce better output consistently,
we would need to know the syntax of the target language for tokenization,
i.e. ">=" and ">" are comparison operators, while "(" is a token and "(("
is better split into two open-paren tokens.
So as a very longer term subproject, we may want to teach the mechanism
language specific tokenization rules, just like we can specify the hunk
header pattern via gitattributes(5) to the diff output layer.
Of course, I do not expect you to do that during this round --- and if we
choose to keep the rule simple, I think it is probably better to use
one-char-one-token rule for now.
> And when designing, i think it's better to take multi-byte characters
> into account. For multi-byte characters (especially CJK), every
> character should be considered as a token.
If we take an idealistic view for the longer term, we should be tokenizing
even CJK sensibly, but unlike Occidental scripts, we cannot even use
inter-word spacing for tokenizing hint, so unless we are willing to learn
morphological analysis (which we are not for now), the best we can do is
to use one-char-one-token rule.
Side Note. For Japanese we could cheat and often do a slightly
better job than simple one-char-one-token without having full
morphological analysis by splicing between Kanji and Kana
boundaries, but I'd prefer not to go there and keep the rules we
would use to the minimum.
I should stress that I said "character" in the above "punct" and "CJK"
discussions, not "byte".
next prev parent reply other threads:[~2008-05-05 5:01 UTC|newest]
Thread overview: 92+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-05-02 3:39 [PATCH] Make words boundary for --color-words configurable Ping Yin
2008-05-02 3:54 ` Junio C Hamano
2008-05-02 4:28 ` Ping Yin
2008-05-02 13:59 ` [PATCH] Make boundary characters " Ping Yin
2008-05-02 14:26 ` Ping Yin
2008-05-02 14:27 ` Ping Yin
2008-05-03 11:57 ` [PATCH v2 0/5] " Ping Yin
2008-05-03 11:57 ` [PATCH v2 1/5] diff.c: Remove code redundancy in diff_words_show Ping Yin
2008-05-03 11:57 ` [PATCH v2 2/5] diff.c: Use show variable name in fn_out_diff_words_aux Ping Yin
2008-05-03 11:57 ` [PATCH v2 3/5] diff.c: Fix --color-words showing trailing deleted words at another line Ping Yin
2008-05-03 11:57 ` [PATCH v2 4/5] Make boundary characters for --color-words configurable Ping Yin
2008-05-03 11:57 ` [PATCH v2 5/5] fn_out_diff_words_aux: Handle common diff line more carefully Ping Yin
2008-05-03 18:18 ` [PATCH v2 4/5] Make boundary characters for --color-words configurable Junio C Hamano
2008-05-03 18:41 ` Teemu Likonen
2008-05-04 0:32 ` Ping Yin
2008-05-04 9:44 ` Johannes Schindelin
2008-05-04 16:35 ` Ping Yin
2008-05-04 20:16 ` Junio C Hamano
2008-05-04 20:47 ` Jakub Narebski
2008-05-04 21:27 ` Teemu Likonen
2008-05-05 12:14 ` Johannes Schindelin
2008-05-05 1:40 ` Ping Yin
2008-05-05 5:00 ` Junio C Hamano [this message]
2008-05-05 12:10 ` Ping Yin
2008-05-06 0:40 ` Ping Yin
2008-05-06 8:55 ` Johannes Schindelin
2008-05-07 1:15 ` Ping Yin
2008-05-07 11:24 ` Johannes Schindelin
2008-05-07 12:19 ` Ping Yin
2008-05-07 13:10 ` Johannes Schindelin
2008-05-07 14:11 ` Ping Yin
2008-05-07 19:13 ` Junio C Hamano
2008-05-07 19:33 ` Junio C Hamano
2008-05-07 19:45 ` Jeff King
2008-05-07 20:02 ` Junio C Hamano
2008-05-07 22:04 ` Jeff King
2008-05-08 10:34 ` Teemu Likonen
2008-05-10 9:02 ` Ping Yin
2008-05-10 9:14 ` Teemu Likonen
2008-05-11 13:16 ` Ping Yin
2008-05-11 13:27 ` Ping Yin
2008-05-11 16:27 ` Junio C Hamano
2008-05-12 16:31 ` Ping Yin
2008-05-12 18:57 ` Jakub Narebski
2008-05-12 19:17 ` Junio C Hamano
2008-05-12 19:57 ` Jakub Narebski
2008-05-13 1:37 ` Ping Yin
2008-05-13 1:42 ` Ping Yin
2008-05-10 8:20 ` Ping Yin
2008-05-05 11:51 ` Johannes Schindelin
2008-05-05 12:02 ` Ping Yin
2008-05-03 18:01 ` [PATCH v2 3/5] diff.c: Fix --color-words showing trailing deleted words at another line Junio C Hamano
2008-05-03 12:01 ` [PATCH v2 2/5] diff.c: Use show variable name in fn_out_diff_words_aux Ping Yin
2008-05-03 17:47 ` Junio C Hamano
2008-05-03 18:20 ` [PATCH v2 1/5] diff.c: Remove code redundancy in diff_words_show Junio C Hamano
2008-05-04 4:20 ` [PATCH v3 0/6] --color-words improvement Ping Yin
2008-05-04 4:20 ` [PATCH v3 1/6] diff.c: Remove code redundancy in diff_words_show Ping Yin
2008-05-04 4:20 ` [PATCH v3 2/6] fn_out_diff_words_aux: Use short variable name Ping Yin
2008-05-04 4:20 ` [PATCH v3 3/6] --color-words: Fix showing trailing deleted words at another line Ping Yin
2008-05-04 4:20 ` [PATCH v3 4/6] --color-words: Make non-word characters configurable Ping Yin
2008-05-04 4:20 ` [PATCH v3 5/6] fn_out_diff_words_aux: Handle common diff line more carefully Ping Yin
2008-05-04 4:20 ` [PATCH v3 6/6] --color-words: Add test t4030 Ping Yin
2008-05-04 9:54 ` [PATCH v3 5/6] fn_out_diff_words_aux: Handle common diff line more carefully Johannes Schindelin
2008-05-04 16:53 ` Ping Yin
2008-05-05 12:11 ` Johannes Schindelin
2008-05-05 14:18 ` Ping Yin
2008-05-04 6:45 ` [PATCH v3 4/6] --color-words: Make non-word characters configurable Junio C Hamano
2008-05-04 7:04 ` Ping Yin
2008-05-04 9:52 ` [PATCH v3 3/6] --color-words: Fix showing trailing deleted words at another line Johannes Schindelin
2008-05-04 16:48 ` Ping Yin
2008-05-05 12:10 ` Johannes Schindelin
2008-05-04 9:47 ` [PATCH v3 2/6] fn_out_diff_words_aux: Use short variable name Johannes Schindelin
2008-05-04 16:39 ` Ping Yin
2008-05-05 12:05 ` Johannes Schindelin
2008-05-04 9:46 ` [PATCH v3 1/6] diff.c: Remove code redundancy in diff_words_show Johannes Schindelin
2008-05-02 14:36 ` [PATCH] Make boundary characters for --color-words configurable Teemu Likonen
2008-05-03 0:22 ` Ping Yin
2008-05-03 13:22 ` Dirk Süsserott
2008-05-03 13:57 ` Ping Yin
2008-05-03 14:03 ` [PATCH] --color-words: Make the word characters configurable Johannes Schindelin
2008-05-03 14:13 ` Ping Yin
2008-05-03 14:23 ` Johannes Schindelin
2008-05-03 14:43 ` Teemu Likonen
2008-05-04 9:18 ` Johannes Schindelin
2008-05-03 17:43 ` Junio C Hamano
2008-05-04 9:25 ` Johannes Schindelin
2008-05-02 7:45 ` [PATCH] Make words boundary for --color-words configurable Johannes Schindelin
2008-05-02 8:14 ` Teemu Likonen
2008-05-02 9:23 ` Ping Yin
2008-05-02 10:01 ` Teemu Likonen
2008-05-02 9:28 ` Ping Yin
2008-05-03 0:18 ` Jakub Narebski
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=7vprs1ny5e.fsf@gitster.siamese.dyndns.org \
--to=gitster@pobox.com \
--cc=Johannes.Schindelin@gmx.de \
--cc=git@vger.kernel.org \
--cc=pkufranky@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).