From mboxrd@z Thu Jan 1 00:00:00 1970 From: Junio C Hamano Subject: Re: [PATCH v2 4/5] Make boundary characters for --color-words configurable Date: Sun, 04 May 2008 22:00:13 -0700 Message-ID: <7vprs1ny5e.fsf@gitster.siamese.dyndns.org> References: <46dff0320805020726y2592732cj9aef0111e5b2288a@mail.gmail.com> <1209815828-6548-2-git-send-email-pkufranky@gmail.com> <1209815828-6548-3-git-send-email-pkufranky@gmail.com> <1209815828-6548-4-git-send-email-pkufranky@gmail.com> <1209815828-6548-5-git-send-email-pkufranky@gmail.com> <7vy76rtfns.fsf@gitster.siamese.dyndns.org> <46dff0320805031732x25286707r991358162046c07c@mail.gmail.com> <46dff0320805040935n22354e1bta85b3f3fe7c16cad@mail.gmail.com> <7v63ttq0y8.fsf@gitster.siamese.dyndns.org> <46dff0320805041840g1b9362d3u138b9d40cde160f2@mail.gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: "Johannes Schindelin" , git@vger.kernel.org To: "Ping Yin" X-From: git-owner@vger.kernel.org Mon May 05 07:01:21 2008 Return-path: Envelope-to: gcvg-git-2@gmane.org Received: from vger.kernel.org ([209.132.176.167]) by lo.gmane.org with esmtp (Exim 4.50) id 1JsspE-0005qx-17 for gcvg-git-2@gmane.org; Mon, 05 May 2008 07:01:20 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752037AbYEEFAc (ORCPT ); Mon, 5 May 2008 01:00:32 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1751910AbYEEFAb (ORCPT ); Mon, 5 May 2008 01:00:31 -0400 Received: from a-sasl-fastnet.sasl.smtp.pobox.com ([207.106.133.19]:37635 "EHLO sasl.smtp.pobox.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751800AbYEEFAb (ORCPT ); Mon, 5 May 2008 01:00:31 -0400 Received: from localhost.localdomain (localhost [127.0.0.1]) by a-sasl-fastnet.sasl.smtp.pobox.com (Postfix) with ESMTP id B21DD15B0; Mon, 5 May 2008 01:00:29 -0400 (EDT) Received: from pobox.com (ip68-225-240-77.oc.oc.cox.net [68.225.240.77]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by a-sasl-fastnet.sasl.smtp.pobox.com (Postfix) with ESMTP id 368C415AD; Mon, 5 May 2008 01:00:24 -0400 (EDT) In-Reply-To: <46dff0320805041840g1b9362d3u138b9d40cde160f2@mail.gmail.com> (Ping Yin's message of "Mon, 5 May 2008 09:40:47 +0800") User-Agent: Gnus/5.110006 (No Gnus v0.6) Emacs/21.4 (gnu/linux) X-Pobox-Relay-ID: 2EE8D014-1A60-11DD-97D5-80001473D85F-77302942!a-sasl-fastnet.pobox.com Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: "Ping Yin" writes: > For this example,both "/if/while/ (i />/>=/ /1/0/)" and "/if/while/ > (i >//=/ /1/0/)" are fine to me. For the particular example, both are Ok, but for this other example: -if (i > 1... +if ((i > 1... it probably is better to treat each non-word character as a separate token, that is, it would be easier to read if we said "( stayed intact, and another ( was added", instead of saying "( is changed to ((". So "a run of punct chars" rule only sometimes produces better output but otherwise worse output, and to make it produce better output consistently, we would need to know the syntax of the target language for tokenization, i.e. ">=" and ">" are comparison operators, while "(" is a token and "((" is better split into two open-paren tokens. So as a very longer term subproject, we may want to teach the mechanism language specific tokenization rules, just like we can specify the hunk header pattern via gitattributes(5) to the diff output layer. Of course, I do not expect you to do that during this round --- and if we choose to keep the rule simple, I think it is probably better to use one-char-one-token rule for now. > And when designing, i think it's better to take multi-byte characters > into account. For multi-byte characters (especially CJK), every > character should be considered as a token. If we take an idealistic view for the longer term, we should be tokenizing even CJK sensibly, but unlike Occidental scripts, we cannot even use inter-word spacing for tokenizing hint, so unless we are willing to learn morphological analysis (which we are not for now), the best we can do is to use one-char-one-token rule. Side Note. For Japanese we could cheat and often do a slightly better job than simple one-char-one-token without having full morphological analysis by splicing between Kanji and Kana boundaries, but I'd prefer not to go there and keep the rules we would use to the minimum. I should stress that I said "character" in the above "punct" and "CJK" discussions, not "byte".