From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Ping Yin"
Subject: Re: [PATCH v2 4/5] Make boundary characters for --color-words configurable
Date: Mon, 5 May 2008 20:10:11 +0800
Message-ID: <46dff0320805050510t3bc5fd0eq44e0d58d1bb57629@mail.gmail.com>
References: <46dff0320805020726y2592732cj9aef0111e5b2288a@mail.gmail.com>
	<1209815828-6548-4-git-send-email-pkufranky@gmail.com>
	<1209815828-6548-5-git-send-email-pkufranky@gmail.com>
	<7vy76rtfns.fsf@gitster.siamese.dyndns.org>
	<46dff0320805031732x25286707r991358162046c07c@mail.gmail.com>
	<46dff0320805040935n22354e1bta85b3f3fe7c16cad@mail.gmail.com>
	<7v63ttq0y8.fsf@gitster.siamese.dyndns.org>
	<46dff0320805041840g1b9362d3u138b9d40cde160f2@mail.gmail.com>
	<7vprs1ny5e.fsf@gitster.siamese.dyndns.org>
In-Reply-To: <7vprs1ny5e.fsf@gitster.siamese.dyndns.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
To: "Junio C Hamano"
Cc: "Johannes Schindelin", git@vger.kernel.org
Sender: git-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: git@vger.kernel.org

On Mon, May 5, 2008 at 1:00 PM, Junio C Hamano wrote:
> "Ping Yin" writes:
>
> > For this example,both "/if/while/ (i />/>=/ /1/0/)" and "/if/while/
> > (i >//=/ /1/0/)" are fine to me.
>
> For the particular example, both are Ok, but for this other example:
>
>         -if (i > 1...
>         +if ((i > 1...
>
> it probably is better to treat each non-word character as a separate
> token, that is, it would be easier to read if we said "( stayed intact,
> and another ( was added", instead of saying "( is changed to ((".
>
> So "a run of punct chars" rule only sometimes produces better output but
> otherwise worse output, and to make it produce better output consistently,
> we would need to know the syntax of the target language for tokenization,
> i.e. ">=" and ">" are comparison operators, while "(" is a token and "(("
> is better split into two open-paren tokens.
>
> So as a very longer term subproject, we may want to teach the mechanism
> language specific tokenization rules, just like we can specify the hunk
> header pattern via gitattributes(5) to the diff output layer.
>
> Of course, I do not expect you to do that during this round --- and if we
> choose to keep the rule simple, I think it is probably better to use
> one-char-one-token rule for now.
>
> > And when designing, i think it's better to take multi-byte characters
> > into account. For multi-byte characters (especially CJK), every
> > character should be considered as a token.
>
> If we take an idealistic view for the longer term, we should be tokenizing
> even CJK sensibly, but unlike Occidental scripts, we cannot even use
> inter-word spacing for tokenizing hint, so unless we are willing to learn
> morphological analysis (which we are not for now), the best we can do is
> to use one-char-one-token rule.
>
>     Side Note.  For Japanese we could cheat and often do a slightly
>     better job than simple one-char-one-token without having full
>     morphological analysis by splicing between Kanji and Kana
>     boundaries, but I'd prefer not to go there and keep the rules we
>     would use to the minimum.
>
> I should stress that I said "character" in the above "punct" and "CJK"
> discussions, not "byte".
>

The one-char-one-token and multi-char-one-token rules may raise
different implementation issues, and I think the multi-char-one-token
rule is the more representative of the two. So for now, I prefer to
treat both a run of word characters and a single non-word character
as a token.

--
Ping Yin
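
To make the rule discussed above concrete, here is a minimal sketch in
Python. It is only an illustration, not the patch itself and not git's
C code: the names TOKEN_RE, tokenize and added_tokens are hypothetical,
the CJK range is a rough assumption (kana plus the main ideograph block),
and difflib stands in for the real diff machinery. A run of word
characters is one token, every other non-space character is its own
token, and each CJK character is its own token. It works on characters,
not bytes.

```python
import difflib
import re

# Assumption: a rough CJK range (hiragana, katakana, common ideographs).
CJK = "\u3040-\u30ff\u4e00-\u9fff"

# Alternatives, tried in order:
#   1. one CJK character
#   2. a run of word characters that are not CJK
#   3. one non-word, non-space character (punctuation)
TOKEN_RE = re.compile(rf"[{CJK}]|[^\W{CJK}]+|[^\w\s]")


def tokenize(line: str) -> list[str]:
    """Split a line into tokens; whitespace is dropped."""
    return TOKEN_RE.findall(line)


def added_tokens(old: str, new: str) -> list[str]:
    """Return the tokens that a token-level diff reports as new in `new`."""
    a, b = tokenize(old), tokenize(new)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            added.extend(b[j1:j2])
    return added
```

With this rule, tokenize("if ((i >= 1") yields "if", "(", "(", "i", ">",
"=", "1", and added_tokens("if (i > 1", "if ((i > 1") reports a single
added "(", which is the "( stayed intact, and another ( was added"
reading from the quoted example. Note that ">=" is split into two tokens;
keeping it whole would need the language-specific rules mentioned above.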