From: Johannes Sixt <j.sixt@viscovery.net>
To: Jeff King <peff@peff.net>
Cc: Christian Himpel <chressie@googlemail.com>, git@vger.kernel.org
Subject: Re: [PATCH] git-am: force egrep to use correct characters set
Date: Mon, 28 Sep 2009 10:12:47 +0200
Message-ID: <4AC06FFF.20008@viscovery.net>
In-Reply-To: <20090927074015.GB15393@coredump.intra.peff.net>
Jeff King wrote:
> On Fri, Sep 25, 2009 at 06:43:20PM +0200, Christian Himpel wrote:
>
>> According to egrep(1) the US-ASCII table is used when LC_ALL=C is set.
>> We do not rely here on the LC_ALL value we get from the environment.
>
> Hmm. Probably makes sense here, as it is a wide enough range that it may
> pick up other stray non-ascii characters in other charsets (though as
> the manpage notes, the likely thing is to pick up A-Z along with a-z,
> which is OK here as we encompass both in our range).
>
> There are two other calls to egrep with brackets (both in
> git-submodule.sh), but they are just [0-7], which is presumably OK in
> just about any charset.
>
> Do you happen to know a charset in which this is a problem, just for
> reference?
It's not so much about charsets as about languages:
    Within a bracket expression, a range expression consists
    of two characters separated by a hyphen. It matches any
    single character that sorts between the two characters,
    inclusive, using the locale's collating sequence and
    character set. For example, in the default C locale,
    [a-d] is equivalent to [abcd]. Many locales sort
    characters in dictionary order, and in these locales [a-d]
    is typically not equivalent to [abcd]; it might be
    equivalent to [aBbCcDd], for example. To obtain the
    traditional interpretation of bracket expressions, you
    can use the C locale by setting the LC_ALL environment
    variable to the value C.
For example, in locale de_DE.UTF-8, GNU grep '[a-z]' matches lowercase
letters, uppercase letters (!), and umlauts (!!) because in dictionary
order, 'A' and 'a' are equivalent and 'Ä' sorts after 'A'. (The input must
be UTF-8, of course.)
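A minimal sketch of how one might check this, assuming a POSIX shell and a
grep that honours LC_ALL. The sample() helper and the variable names are
made up for illustration. The de_DE.UTF-8 part only runs if that locale is
installed, and its count is deliberately not stated here because it depends
on the libc and grep versions in use:

```shell
#!/bin/sh
# Three test lines: a lowercase letter, an uppercase letter, and
# A-umlaut encoded as UTF-8 (bytes 0xC3 0x84).
sample() { printf 'a\nB\n\303\204\n'; }

# In the C locale the range is plain US-ASCII a..z, so only "a" matches.
c_count=$(sample | LC_ALL=C grep -c '^[a-z]$')
echo "LC_ALL=C: $c_count"

# Assumption: an installed German UTF-8 locale shows up in `locale -a`
# as de_DE.utf8 or de_DE.UTF-8. Skip the comparison if it is missing.
if locale -a 2>/dev/null | grep -qi '^de_DE\.utf-\{0,1\}8$'; then
    de_count=$(sample | LC_ALL=de_DE.UTF-8 grep -c '^[a-z]$')
    echo "LC_ALL=de_DE.UTF-8: $de_count"
fi
```

If the second count is larger than the first, the range has picked up
characters outside a-z on that system.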
Given that this applies not only to egrep, but to grep in general (and
perhaps even to other tools that support ranges, like sed), it may be
necessary to audit all range expressions.
The case identified by Christian is certainly important because the check is
applied to a file whose contents can be anything, and its purpose is to
identify the text as an mbox file, whose header section can only be
US-ASCII by definition. So I think the patch has merit and should be applied.
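To illustrate the kind of check being discussed, here is a rough sketch,
not the code from the patch. The function name all_header_lines is
hypothetical, and the range is an assumption derived from RFC 2822, where a
field name consists of printable US-ASCII characters except the colon
(codes 33-57 and 59-126). It uses grep -E, which is equivalent to egrep:

```shell
#!/bin/sh
# Hypothetical helper, not the code from the patch: succeeds when every
# line on stdin starts with "field-name:".
all_header_lines() {
    # grep -v selects lines that do NOT match the pattern. If it selects
    # anything, it exits 0, and the input is rejected.
    # LC_ALL=C pins the ranges to US-ASCII regardless of the user's locale.
    if LC_ALL=C grep -E -v '^[!-9;-~]+:' >/dev/null; then
        return 1
    fi
    return 0
}

if printf 'From: A U Thor <author@example.com>\nSubject: test\n' |
    all_header_lines
then
    echo "looks like mbox headers"
fi
```

Without LC_ALL=C, what the two ranges cover would depend on the collating
sequence of whatever locale the user happens to have set. Note that this
sketch also accepts empty input; a real check would have to decide what to
do in that case.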
-- Hannes