From: Thomas Rast <trast@student.ethz.ch>
To: Junio C Hamano <gitster@pobox.com>
Cc: Scott Johnson <scottj75074@yahoo.com>,
Michael J Gruber <git@drmicha.warpmail.net>,
Matthijs Kooijman <matthijs@stdin.nl>, <git@vger.kernel.org>
Subject: Re: [PATCH v2 2/4] diff.c: implement a sanity check for word regexes
Date: Sun, 19 Dec 2010 02:59:44 +0100
Message-ID: <201012190259.45301.trast@student.ethz.ch>
In-Reply-To: <7vvd2qg5jj.fsf@alter.siamese.dyndns.org>

Junio C Hamano wrote:
> Thomas Rast <trast@student.ethz.ch> writes:
>
> > * The word regex matches anything that is !isspace().
> >
> > * The word regex does not match '\n'. (This case is not very harmful,
> > but we used to silently cut off at the '\n' which may go against
> > user expectations.)
>
> How expensive to run this check twice, every time word_regex finds a
> match?

It runs the check for the first bullet point on every non-match, and
the check for the second bullet point on every match. So it looks at
every input character exactly once.
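
To illustrate the idea (this is only a rough sketch, not the code from
the patch): the stretches between matches are checked for non-space
characters, and each match is checked for '\n'. The function name
check_word_regex(), the WORD_REGEX_* constants and the use of plain
REG_EXTENDED are assumptions made for this example.

```c
#include <ctype.h>
#include <regex.h>
#include <string.h>

/* Hypothetical result codes, for illustration only. */
enum {
	WORD_REGEX_OK = 0,
	WORD_REGEX_SKIPS_NONSPACE,	/* first bullet point violated */
	WORD_REGEX_MATCHES_NEWLINE,	/* second bullet point violated */
	WORD_REGEX_BAD_PATTERN
};

static int only_space(const char *s, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (!isspace((unsigned char)s[i]))
			return 0;
	return 1;
}

/* Check one pattern against one NUL-terminated payload. */
int check_word_regex(const char *pattern, const char *text)
{
	regex_t re;
	regmatch_t m;
	const char *p = text;
	int result = WORD_REGEX_OK;

	if (regcomp(&re, pattern, REG_EXTENDED))
		return WORD_REGEX_BAD_PATTERN;
	while (*p) {
		const char *next;

		if (regexec(&re, p, 1, &m, 0)) {
			/* no more matches: the rest must be whitespace */
			if (!only_space(p, strlen(p)))
				result = WORD_REGEX_SKIPS_NONSPACE;
			break;
		}
		/* non-match before the word: must be whitespace */
		if (!only_space(p, (size_t)m.rm_so)) {
			result = WORD_REGEX_SKIPS_NONSPACE;
			break;
		}
		/* the match itself: must not contain a newline */
		if (memchr(p + m.rm_so, '\n', (size_t)(m.rm_eo - m.rm_so))) {
			result = WORD_REGEX_MATCHES_NEWLINE;
			break;
		}
		next = p + m.rm_eo;
		if (m.rm_eo == m.rm_so) {
			/* empty match: the next character is not covered */
			if (*next == '\0')
				break;
			if (!isspace((unsigned char)*next)) {
				result = WORD_REGEX_SKIPS_NONSPACE;
				break;
			}
			next++;
		}
		p = next;
	}
	regfree(&re);
	return result;
}
```

With this sketch, "[^[:space:]]+" passes on any payload, "[a-z]+" is
flagged as soon as the payload contains something like '=', and
"[^ ]+" is flagged on a payload where a word runs across a '\n'.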

> As this is about making sure that we got a sane regex from the user (or a
> builtin pattern), I wonder if we can make it not depend on the payload we
> are matching the regex against. Then before using a word_regex that we
> have not checked, we check if that regex is sane, mark it checked, and do
> not have to do the check over and over again.

Algorithmically, it should be easy once you have the finite state
automaton corresponding to the regex: just verify that for every
possible non-terminal state, there is a transition for every
!isspace() character to a state other than "fail to match" or "match
the empty string".
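
As a minimal sketch of that verification, taking the description above
literally and assuming a hand-built transition table: struct toy_dfa,
the DFA_FAIL and DFA_EMPTY sentinels and the two builder functions are
all made up for this example and have nothing to do with the
structures regcomp() creates internally.

```c
#include <ctype.h>

#define DFA_MAX_STATES 8
#define DFA_FAIL  (-1)	/* hypothetical sentinel: "fail to match" */
#define DFA_EMPTY (-2)	/* hypothetical sentinel: "match the empty string" */

/* Hypothetical byte-indexed automaton, for illustration only. */
struct toy_dfa {
	int nr_states;
	int next[DFA_MAX_STATES][256];
	unsigned char terminal[DFA_MAX_STATES];
};

/*
 * Returns 1 if every non-terminal state has, for every non-space
 * byte, a transition that is neither DFA_FAIL nor DFA_EMPTY.
 * Byte 0 is skipped because it cannot occur in a C string.
 */
int toy_dfa_is_sane(const struct toy_dfa *dfa)
{
	for (int s = 0; s < dfa->nr_states; s++) {
		if (dfa->terminal[s])
			continue;
		for (int c = 1; c < 256; c++) {
			if (isspace(c))
				continue;
			if (dfa->next[s][c] == DFA_FAIL ||
			    dfa->next[s][c] == DFA_EMPTY)
				return 0;
		}
	}
	return 1;
}

/* Toy automaton for [^[:space:]]+ (state 0: start, state 1: terminal). */
void toy_dfa_nonspace_plus(struct toy_dfa *dfa)
{
	dfa->nr_states = 2;
	dfa->terminal[0] = 0;
	dfa->terminal[1] = 1;
	for (int s = 0; s < 2; s++)
		for (int c = 0; c < 256; c++)
			dfa->next[s][c] = (c && !isspace(c)) ? 1 : DFA_FAIL;
}

/* Toy automaton for [a-z]+ (state 0: start, state 1: terminal). */
void toy_dfa_lower_plus(struct toy_dfa *dfa)
{
	dfa->nr_states = 2;
	dfa->terminal[0] = 0;
	dfa->terminal[1] = 1;
	for (int s = 0; s < 2; s++)
		for (int c = 0; c < 256; c++)
			dfa->next[s][c] = (c >= 'a' && c <= 'z') ? 1 : DFA_FAIL;
}
```

The first toy automaton passes the check; the second one does not,
because its start state has no usable transition for a byte such as
'='. The point is that the check walks the table only and never looks
at a payload.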

In the implementation, it might be doable if we switch to compat/regex
on all platforms, since we then have ready access to all internal
structures regcomp() creates, including the DFA.

I'll think about at least using compat/regex for a static check of all
*builtin* patterns, which would be superior to the brute force
approach in 4/4.

--
Thomas Rast
trast@{inf,student}.ethz.ch