From mboxrd@z Thu Jan 1 00:00:00 1970 From: Junio C Hamano Subject: Re: [PATCH v3 2/4] word diff: customizable word splits Date: Sun, 11 Jan 2009 14:20:07 -0800 Message-ID: <7vfxjppjs8.fsf@gitster.siamese.dyndns.org> References: <4aea85caafd38a058145c5769fe8a30ffdbd4d13.1231669012.git.trast@student.ethz.ch> <529cd830908f018f796dbc46d3b055c1f8ba9c1b.1231669012.git.trast@student.ethz.ch> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: git@vger.kernel.org, Johannes Schindelin , Teemu Likonen To: Thomas Rast X-From: git-owner@vger.kernel.org Sun Jan 11 23:21:56 2009 Return-path: Envelope-to: gcvg-git-2@gmane.org Received: from vger.kernel.org ([209.132.176.167]) by lo.gmane.org with esmtp (Exim 4.50) id 1LM8ge-0003c3-I7 for gcvg-git-2@gmane.org; Sun, 11 Jan 2009 23:21:41 +0100 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752441AbZAKWUU (ORCPT ); Sun, 11 Jan 2009 17:20:20 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1752433AbZAKWUS (ORCPT ); Sun, 11 Jan 2009 17:20:18 -0500 Received: from a-sasl-quonix.sasl.smtp.pobox.com ([208.72.237.25]:33337 "EHLO sasl.smtp.pobox.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752410AbZAKWUQ (ORCPT ); Sun, 11 Jan 2009 17:20:16 -0500 Received: from localhost.localdomain (unknown [127.0.0.1]) by b-sasl-quonix.sasl.smtp.pobox.com (Postfix) with ESMTP id 4C1741C45E; Sun, 11 Jan 2009 17:20:16 -0500 (EST) Received: from pobox.com (unknown [68.225.240.211]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by b-sasl-quonix.sasl.smtp.pobox.com (Postfix) with ESMTPSA id 170AC1C45C; Sun, 11 Jan 2009 17:20:09 -0500 (EST) User-Agent: Gnus/5.110006 (No Gnus v0.6) Emacs/21.4 (gnu/linux) X-Pobox-Relay-ID: 05E50900-E02E-11DD-A73F-2E3B113D384A-77302942!a-sasl-quonix.pobox.com Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: Thomas Rast writes: > Allows for user-configurable word splits when using --color-words. > This can make the diff more readable if the regex is configured > according to the language of the file. > > Each non-overlapping match of the regex is a word; everything in > between is whitespace. What happens if the input "language" does not have any inter-word spacing but its words can still be expressed by regexp patterns? ImagineALanguageThatAllowsYouToWriteSomethingLikeThis. Does the mechanism help users who want to do word-diff files written in such a language by outputting: ImagineALanguageThatWhichAllowsYou... when '[A-Z][a-z]*' is given by the word pattern? > We disallow matching the empty string (because > it results in an endless loop) or a newline (breaks color escapes and > interacts badly with the input coming from the usual line diff). To > help the user, we set REG_NEWLINE so that [^...] and . do not match > newlines. AndImagineALanguageWhoseWordStruc tureDoesNotCareAboutLineBreak Can you help users with such payload? Side note. Yes, I am coming from Japanese background. Side note 2. No, I am not saying your code must support both of the above to be acceptable. I am just gauging the design assumptions and limitations. > Insertion of spaces is somewhat subtle. We echo a "context" space > twice (once on each side of the diff) if it follows directly after a > word. While this loses a tiny bit of accuracy, it runs together long > sequences of changed word into one removed and one added block, making > the diff much more readable. I guess this part can be later enhanced to be more precise, so that it keeps the original context space more faithfully (i.e. does not lose two consecutive spaces in the original occidental script, and does not insert any extra space to the oriental script), if we were to support the second example I gave above in the future as a follow-up patch. > +--color-words[=]:: > + Show colored word diff, i.e., color words which have changed. > + By default, a new word only starts at whitespace, so that a > + 'word' is defined as a maximal sequence of non-whitespace > + characters. The optional argument can be used to > + configure this. > ++ > +The must be an (extended) regular expression. When set, every > +non-overlapping match of the is considered a word. (Regular > +expression semantics ensure that quantifiers grab a maximal sequence > +of characters.) Anything between these matches is considered > +whitespace and ignored for the purposes of finding differences. You > +may want to append `|\S` to your regular expression to make sure that > +it matches all non-whitespace characters. Whose regexp library do we assume here? Traditionally we limited ourselves to POSIX BRE, and I do not think anybody minds using POSIX ERE here, but we need to be clear. In either case \S is a pcre outside POSIX. The rest I only skimmed but did not spot anything glaringly wrong; thanks.