From: Christian Himpel <chressie@googlemail.com>
To: Johannes Sixt <j.sixt@viscovery.net>
Cc: Jeff King <peff@peff.net>,
Christian Himpel <chressie@googlemail.com>,
git@vger.kernel.org
Subject: Re: [PATCH] git-am: force egrep to use correct characters set
Date: Mon, 28 Sep 2009 11:32:16 +0200
Message-ID: <20090928093216.GA31459@lamagra.informatik.uni-ulm.de>
In-Reply-To: <4AC06FFF.20008@viscovery.net>
On Mon, Sep 28, 2009 at 10:12:47AM +0200, Johannes Sixt wrote:
> Jeff King schrieb:
> > On Fri, Sep 25, 2009 at 06:43:20PM +0200, Christian Himpel wrote:
> >
> >> According to egrep(1) the US-ASCII table is used when LC_ALL=C is set.
> >> We do not rely here on the LC_ALL value we get from the environment.
> >
> > Hmm. Probably makes sense here, as it is a wide enough range that it may
> > pick up other stray non-ascii characters in other charsets (though as
> > the manpage notes, the likely thing is to pick up A-Z along with a-z,
> > which is OK here as we encompass both in our range).
> >
> > There are two other calls to egrep with brackets (both in
> > git-submodule.sh), but they are just [0-7], which is presumably OK in
> > just about any charset.
> >
> > Do you happen to know a charset in which this is a problem, just for
> > reference?
>
> It's not so much about charsets than about languages:
>
> Within a bracket expression, a range expression consists
> of two characters separated by a hyphen. It matches any
> single character that sorts between the two characters,
> inclusive, using the locale's collating sequence and
> character set. For example, in the default C locale,
> [a-d] is equivalent to [abcd]. Many locales sort char-
> acters in dictionary order, and in these locales [a-d]
> is typically not equivalent to [abcd]; it might be
> equivalent to [aBbCcDd], for example. To obtain the
> traditional interpretation of bracket expressions, you
> can use the C locale by setting the LC_ALL environment
> variable to the value C.
>
> For example, in locale de_DE.UTF-8, GNU grep '[a-z]' matches lowercase
> letters, uppercase letters (!), and umlauts (!!) because in dictionary
> order, 'A' and 'a' are equivalent and 'Ä' sorts after 'A'. (The input must
> be UTF-8, of course.)
Thanks for pointing this out. You are right. I must have overlooked the
"dictionary order" part.
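To make the difference concrete, here is a minimal sketch, assuming a POSIX
shell and any grep. The helper name count_lower is made up for illustration.
With LC_ALL=C forced, [a-z] is a plain ASCII range, so a line consisting only
of uppercase letters is not counted. What happens without the override depends
on the locale and the libc in use, so no output is claimed for that case.

```shell
# Hypothetical helper: count the input lines that contain a character
# in [a-z], with the C locale forced so the range means ASCII a..z only.
count_lower() {
	LC_ALL=C grep -c '[a-z]'
}

# Uppercase-only line: not matched under the C locale.
# grep -c still prints the count, but exits non-zero when it is 0.
printf 'ABC\n' | count_lower || true

# Mixed input: only the line with a lowercase letter is counted.
printf 'ABC\nabc\n' | count_lower
```

Dropping the LC_ALL=C prefix and running the same input under, say,
de_DE.UTF-8 is a quick way to check how a given system behaves.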
> Given that this applies not only to egrep, but to grep in general (and
> perhaps even to other tools that support ranges, like sed), it may be
> necessary to audit all range expressions.
A quick search with

    LC_ALL=C find . -name '*.sh' -exec \
        egrep -Hne '(grep|awk|sed).*\[.*-.*\]' {} \;

shows that, as far as I can see, range expressions are used:
1. to replace or grep for hexadecimal numbers (SHA-1 sums). This shouldn't
be a problem as long as we can assume that these numbers are never malformed.
2. to replace or grep for decimal numbers. This shouldn't be a
problem, since digits should sort in the same order in every language
(?!).
3. in git-rebase--interactive.sh:742 to grep for a previously generated
string. So it should be safe here.
> The case identified by Christian is certainly important because it is
> applied to a file whose contents can be anything, and the purpose of the
> check is to identify the text as an mbox file, whose header section can be
> only US-ASCII by definition. So, I think it has merit to apply the patch.
Yes. It seems that this is the only place where it is important to match
just the ASCII printable characters.
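For reference, a rough sketch of such a header check, assuming a POSIX shell,
sed and grep. The function name looks_like_mbox_header is hypothetical and
this is not necessarily the exact code in git-am. The range [!-9;-~] covers
the printable US-ASCII characters except the colon, which is what RFC 2822
allows in a header field name, and LC_ALL=C keeps the range from being
interpreted by the locale's collating order.

```shell
# Hypothetical helper: succeed if every header line of the file "$1"
# starts with a field name made of printable US-ASCII, followed by ':'.
looks_like_mbox_header() {
	# Print the lines up to the first blank line, dropping folded
	# continuation lines (those starting with a space or a tab).
	# grep -v then selects any line that does NOT start with a
	# valid field name and a colon.
	if sed -n -e '/^$/q' -e '/^[[:blank:]]/d' -e p "$1" |
		LC_ALL=C grep -E -v '^[!-9;-~]+:' >/dev/null
	then
		return 1	# at least one line is not a header field
	fi
	return 0
}
```

It could be used as: looks_like_mbox_header "$file" && echo mbox. Note that a
file whose first line is blank passes trivially in this sketch, so a real
caller would want to check for non-empty leading lines first.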
Regards,
chressie