All of lore.kernel.org
 help / color / mirror / Atom feed
From: Junio C Hamano <gitster@pobox.com>
To: Brandon Casey <casey@nrlssc.navy.mil>
Cc: Arjen Laarhoven <arjen@yaph.org>,
	Mike Ralphson <mike.ralphson@gmail.com>,
	Johannes Sixt <j.sixt@viscovery.net>, Jeff King <peff@peff.net>,
	Boyd Lynn Gerber <gerberb@zenez.com>,
	Git Mailing List <git@vger.kernel.org>,
	Avery Pennarun <apenwarr@gmail.com>,
	Johan Herland <johan@herland.net>, Andreas Ericsson <ae@op5.se>,
	Kirill Smelkov <kirr@mns.spb.ru>,
	Giuseppe Bilotta <giuseppe.bilotta@gmail.com>,
	Gustaf Hendeby <hendeby@isy.liu.se>,
	Jonathan del Strother <maillist@steelskies.com>
Subject: Re: [PATCH v2 4/4] diff.c: convert builtin funcname patterns to non-GNU extended regex syntax
Date: Sat, 20 Sep 2008 00:51:39 -0700	[thread overview]
Message-ID: <7vej3f45mc.fsf@gitster.siamese.dyndns.org> (raw)
In-Reply-To: <7vtzcb47vh.fsf@gitster.siamese.dyndns.org> (Junio C. Hamano's message of "Sat, 20 Sep 2008 00:02:58 -0700")

Junio C Hamano <gitster@pobox.com> writes:

> static const struct funcname_pattern_entry builtin_funcname_pattern[] = {
> 	{ "bibtex", "(@[a-zA-Z]{1,}[ \t]*\{{0,1}[ \t]*[^ \t\"@',\\#}{~%]*).*$",
> 	  REG_EXTENDED },
> 	{ "html", "^[ \t]*(<[Hh][1-6][ \t].*>.*)$", REG_EXTENDED },
> 	{ "java",
> 	  "!^[ \t]*(catch|do|for|if|instanceof|new|return|switch|throw|while)\n"
> 	  "^[ \t]*(([ \t]*[A-Za-z_][A-Za-z_0-9]*){2,}[ \t]*\\([^;]*)$",
> 	  REG_EXTENDED },
> 	{ "pascal",
> 	  "^((procedure|function|constructor|destructor|interface|"
> 		"implementation|initialization|finalization)[ \t]*.*)$"
> 	  "|"
> 	  "^(.*=[ \t]*(class|record).*)$",
> 	  REG_EXTENDED },
> 	{ "php", "^[\t ]*((function|class).*)", REG_EXTENDED },
> 	{ "python", "^[ \t]*((class|def)[ \t].*)$", REG_EXTENDED },
> 	{ "ruby", "^[ \t]*((class|module|def)[ \t].*)$",
> 	  REG_EXTENDED },
> 	{ "tex",
> 	  "^(\\\\((sub)*section|chapter|part)\\*{0,1}\{.*)$",
> 	  REG_EXTENDED },
> };

On the "Objective-C hunk header" topic, I mentioned that I wondered how a
construct like:

	"( A | B ) | ( C | D )"

could possibly work, given that xdiff-interface.c::ff_regexp() uses a
regmatch_t array with only two elements to capture (which means it can
only capture $0 and $1).

It turns out that the function is not really prepared to do this
properly.  In order to cope with a pattern without _any_ capturing
parentheses in the regexp, it has a trick to use $0 if $1 is undefined.

In other words, if you write:

	"junk ( A | B ) | garbage ( C | D )"

then lines "junk A", "junk B", "garbage C" and "garbage D" would all match
and be on the hunk header line.  However, "garbage" is not stripped out
(while "junk" does get left out of the capture).

This happens not to be a problem with any of the above built-in patterns.
The risky one is "pascal", but it's $2 captures the entire line anyway, so
it does not make any difference if the code used $0 instead.  E.g.  a line
that matches the second alternative, "foo = class bar", will be captured
as $0 and this is the same as (un)captured $2.

But it would be a good idea to fix this while our attention to the issue
is still fresh.  As we discussed earlier, we can replace the top-level "|"
with "\n" and fix the semantics of multiple positive regexp to grab $1 (or
$0 if $1 is unavaialble) of the first positive one.  If we did so, when
the pattern

	"junk ( A | B ) | garbage ( C | D )"

matches a line "garbage C", the initial garbage part will not be included
inside the capture.

Here is such a proposed fix.

-- >8 --
diff: fix "multiple regexp" semantics to find hunk header comment

When multiple regular expressions are concatenated with "\n", they were
traditionally AND'ed together, and only a line that matches _all_ of them
were taken as a match.  This however is unwieldy when multiple regexp
feature is used to specify alternatives.

This fixes the semantics to take the first match.  A nagative pattern, if
matches, makes the line to fail.  A positive pattern, if matches, will be
the final match and what it captures in $1 is used as the hunk header
comment.

We could write alternatives using "|" in ERE, but the machinery can only
use captured $1 as the hunk header comment (or $0 if there is no match in
$1), so you cannot write:

    "junk ( A | B ) | garbage ( C | D )"

and expect both "junk" and "garbage" to get stripped with the existing
code.  With this fix, you can write it as:

    "junk ( A | B ) \n garbage ( C | D )"

and the way capture works would match the user expectation more
naturally.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 diff.c            |    2 +-
 xdiff-interface.c |   17 ++++++++++-------
 2 files changed, 11 insertions(+), 8 deletions(-)

diff --git c/diff.c w/diff.c
index a733010..1bcbbd5 100644
--- c/diff.c
+++ w/diff.c
@@ -1414,7 +1414,7 @@ static const struct funcname_pattern_entry builtin_funcname_pattern[] = {
 	{ "pascal",
 	  "^((procedure|function|constructor|destructor|interface|"
 		"implementation|initialization|finalization)[ \t]*.*)$"
-	  "|"
+	  "\n"
 	  "^(.*=[ \t]*(class|record).*)$",
 	  REG_EXTENDED },
 	{ "php", "^[\t ]*((function|class).*)", REG_EXTENDED },
diff --git c/xdiff-interface.c w/xdiff-interface.c
index 7f1a7d3..6c6bb19 100644
--- c/xdiff-interface.c
+++ w/xdiff-interface.c
@@ -194,26 +194,29 @@ static long ff_regexp(const char *line, long len,
 	char *line_buffer = xstrndup(line, len); /* make NUL terminated */
 	struct ff_regs *regs = priv;
 	regmatch_t pmatch[2];
-	int result = 0, i;
+	int i;
+	int result = -1;
 
 	for (i = 0; i < regs->nr; i++) {
 		struct ff_reg *reg = regs->array + i;
-		if (reg->negate ^ !!regexec(&reg->re,
-					line_buffer, 2, pmatch, 0)) {
-			free(line_buffer);
-			return -1;
+		if (!regexec(&reg->re, line_buffer, 2, pmatch, 0)) {
+			if (reg->negate)
+				goto fail;
+			break;
 		}
 	}
+	if (regs->nr <= i)
+		goto fail;
 	i = pmatch[1].rm_so >= 0 ? 1 : 0;
 	line += pmatch[i].rm_so;
 	result = pmatch[i].rm_eo - pmatch[i].rm_so;
 	if (result > buffer_size)
 		result = buffer_size;
 	else
-		while (result > 0 && (isspace(line[result - 1]) ||
-					line[result - 1] == '\n'))
+		while (result > 0 && (isspace(line[result - 1])))
 			result--;
 	memcpy(buffer, line, result);
+ fail:
 	free(line_buffer);
 	return result;
 }

  reply	other threads:[~2008-09-20  7:53 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <7vskry1485.fsf@gitster.siamese.dyndns.org>
2008-09-18 22:40 ` [PATCH v2 1/4] diff.c: return pattern entry pointer rather than just the hunk header pattern Brandon Casey
2008-09-19 18:14   ` Boyd Lynn Gerber
2008-09-19 19:11     ` Brandon Casey
2008-09-18 22:42 ` [PATCH v2 2/4] diff.c: associate a flag with each pattern and use it for compiling regex Brandon Casey
2008-09-18 22:44 ` [PATCH v2 3/4] diff.*.xfuncname which uses "extended" regex's for hunk header selection Brandon Casey
2008-09-18 22:47 ` [PATCH v2 4/4] diff.c: convert builtin funcname patterns to non-GNU extended regex syntax Brandon Casey
2008-09-18 22:59   ` Brandon Casey
2008-09-19 15:59     ` Brandon Casey
2008-09-18 23:40   ` Johan Herland
2008-09-19 15:55     ` Brandon Casey
2008-09-19 20:29   ` Junio C Hamano
2008-09-20  6:58     ` Junio C Hamano
2008-09-20 21:03       ` Brandon Casey
2008-09-20 22:29         ` Junio C Hamano
2008-09-22 16:59           ` Brandon Casey
2008-09-24  0:04             ` Brandon Casey
2008-09-26 17:49               ` Brandon Casey
2008-09-29 21:52                 ` [PATCH] diff.c: remove duplicate bibtex pattern introduced by merge 92bb9785 Brandon Casey
2008-09-22 23:19       ` [PATCH bc/maint-diff-hunk-header] t4018-diff-funcname: test syntax of builtin xfuncname patterns Brandon Casey
2008-09-22 23:26         ` [PATCH bc/master-diff-hunk-header] " Brandon Casey
2008-09-23  0:49           ` [PATCH] diff funcname_pattern: Allow HTML header tags without attributes Johan Herland
2008-09-23  1:46             ` Junio C Hamano
2008-09-23  2:05               ` Johan Herland
2008-09-20  7:02     ` [PATCH v2 4/4] diff.c: convert builtin funcname patterns to non-GNU extended regex syntax Junio C Hamano
2008-09-20  7:51       ` Junio C Hamano [this message]
2008-09-22  8:29   ` Gustaf Hendeby

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7vej3f45mc.fsf@gitster.siamese.dyndns.org \
    --to=gitster@pobox.com \
    --cc=ae@op5.se \
    --cc=apenwarr@gmail.com \
    --cc=arjen@yaph.org \
    --cc=casey@nrlssc.navy.mil \
    --cc=gerberb@zenez.com \
    --cc=git@vger.kernel.org \
    --cc=giuseppe.bilotta@gmail.com \
    --cc=hendeby@isy.liu.se \
    --cc=j.sixt@viscovery.net \
    --cc=johan@herland.net \
    --cc=kirr@mns.spb.ru \
    --cc=maillist@steelskies.com \
    --cc=mike.ralphson@gmail.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.