From: Sam Vilain <sam@vilain.net>
To: Jeff King <peff@peff.net>
Cc: Robin Rosenberg <robin.rosenberg.lists@dewire.com>,
Junio C Hamano <gitster@pobox.com>,
git@vger.kernel.org
Subject: Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
Date: Sun, 30 Mar 2008 16:40:53 +1300
Message-ID: <47EF0BC5.4040102@vilain.net>
In-Reply-To: <20080329214516.GB30851@coredump.intra.peff.net>
Jeff King wrote:
> My point is that we don't _know_ what is happening in between the decode
> and encode. Does that intermediate form have the information required to
> convert back to the exact same bytes as the original form?
No, it doesn't. If you want that, save a copy of the string (it's a
lazy copy anyway).
The module that will let you look inside the strings to see what is
happening is Devel::Peek. Using that, you can see the state of the
UTF8 scalar flag. For example:
maia:~$ perl -Mutf8 -MDevel::Peek -le 'Dump "Güt"'
SV = PV(0x605d08) at 0x62f230
REFCNT = 1
FLAGS = (PADBUSY,PADTMP,POK,READONLY,pPOK,UTF8)
PV = 0x60cd20 "G\303\274t"\0 [UTF8 "G\x{fc}t"]
CUR = 4
LEN = 8
By default, all strings that are read from files will NOT have this flag
set, unless the filehandle that was read from was marked as being utf-8
(in order to preserve C semantics by default);
maia:~$ echo "Güt" | perl -MDevel::Peek -nle 'Dump $_'
SV = PV(0x6052d0) at 0x604220
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x62f0e0 "G\303\274t"\0
CUR = 4
LEN = 80
maia:~$ echo "Güt" | perl -MDevel::Peek -nle 'BEGIN { binmode STDIN,
":utf8" } Dump $_'
SV = PV(0x6052d0) at 0x604220
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x62f100 "G\303\274t"\0 [UTF8 "G\x{fc}t"]
CUR = 4
LEN = 80
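The same effect can be observed without Devel::Peek. The following is a minimal sketch, assuming in-memory filehandles stand in for STDIN; the variable names are illustrative only. It reads the same bytes once with no layer and once through the :utf8 layer, then reports the flag and the length of each result.

```perl
use strict;
use warnings;

my $bytes = "G\303\274t\n";    # the UTF-8 encoding of "Güt", as raw bytes

# Read the same bytes twice through in-memory filehandles:
# once with no layer, once with the :utf8 layer.
open(my $raw, '<', \$bytes) or die "open: $!";
my $plain = <$raw>;
close $raw;

open(my $dec, '<:utf8', \$bytes) or die "open: $!";
my $decoded = <$dec>;
close $dec;

chomp($plain, $decoded);

# utf8::is_utf8 reports the same flag that Devel::Peek shows as UTF8.
printf "no layer: flag=%d length=%d\n",
    (utf8::is_utf8($plain) ? 1 : 0), length $plain;
printf ":utf8   : flag=%d length=%d\n",
    (utf8::is_utf8($decoded) ? 1 : 0), length $decoded;
```

Without the layer the string is four bytes with the flag off; with the layer it is three characters with the flag on.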
> But it still feels a little wrong to test by converting.
utf8::decode works in-place; it is essentially checking that the string
is valid, and if so, marking it as UTF8.
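    my $encoding;
    if (utf8::decode($string)) {
        # Valid UTF-8; the flag is only set if multi-byte characters were seen.
        if (utf8::is_utf8($string)) {
            $encoding = "UTF-8";
        }
        else {
            $encoding = "US-ASCII";
        }
    }
    else {
        # Not valid UTF-8; fall back to a single-byte guess.
        $encoding = "ISO-8859-1";
    }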
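As a rough sketch of how that check behaves, the same logic can be wrapped in a function and run on three sample byte strings. The name guess_encoding is hypothetical and not part of git-send-email. The function works on a copy, because utf8::decode modifies its argument in place.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a string of raw bytes the same way
# the snippet above does.
sub guess_encoding {
    my ($bytes) = @_;    # copy, since utf8::decode works in place
    if (utf8::decode($bytes)) {
        # Well-formed UTF-8. The flag is only turned on if there was at
        # least one multi-byte sequence, so plain ASCII leaves it off.
        return utf8::is_utf8($bytes) ? "UTF-8" : "US-ASCII";
    }
    # Not valid UTF-8: fall back to a single-byte guess.
    return "ISO-8859-1";
}

print guess_encoding("Gut"), "\n";           # plain ASCII
print guess_encoding("G\303\274t"), "\n";    # "Güt" encoded as UTF-8
print guess_encoding("G\374t"), "\n";        # "Güt" encoded as Latin-1
```

The three calls print US-ASCII, UTF-8 and ISO-8859-1 respectively. The last case is only a guess: any byte string that is not valid UTF-8 ends up there, whatever its real encoding.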
For US-ASCII, you'll only have to encode if the string contains control
characters (\000 through \037) or any "=" characters.
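That rule is small enough to write as a single pattern match. This is a sketch only: ascii_needs_rfc2047 is a hypothetical name, and the character class simply transcribes the rule stated here (control characters plus "="), not the full set of conditions in RFC 2047.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if an all-ASCII header value contains
# a control character (\000-\037) or an "=", and 0 otherwise.
sub ascii_needs_rfc2047 {
    my ($value) = @_;
    return $value =~ /[\000-\037=]/ ? 1 : 0;
}

for my $subject ("plain subject", "a=b", "tab\there") {
    printf "%-16s %s\n",
        ascii_needs_rfc2047($subject) ? "needs encoding:" : "send as-is:",
        $subject;
}
```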
You could try using langinfo CODESET instead of hardcoding ISO-8859-1
like that, but at least on my system it can return bizarre values like
ANSI_X3.4-1968, which may be in some contexts a "correct" description of
the encoding, but is unlikely to be understood by mail clients.
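For illustration, the codeset can be queried with the core I18N::Langinfo module. In this sketch the %mime_name table and the mime_charset_for function are assumptions made up for the example; the single entry reflects that ANSI_X3.4-1968 is a registered alias of US-ASCII. What langinfo returns depends on the system and locale, so no output is shown.

```perl
use strict;
use warnings;
use I18N::Langinfo qw(langinfo CODESET);

# Hypothetical table mapping locale codeset names to the charset names
# mail clients are more likely to recognise. Extend as needed.
my %mime_name = (
    'ANSI_X3.4-1968' => 'US-ASCII',
);

# Hypothetical helper: translate a codeset name if we know a better
# spelling, otherwise pass it through unchanged.
sub mime_charset_for {
    my ($codeset) = @_;
    return exists $mime_name{$codeset} ? $mime_name{$codeset} : $codeset;
}

my $codeset = langinfo(CODESET);
print "locale codeset: $codeset\n";
print "MIME charset:   ", mime_charset_for($codeset), "\n";
```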
> There must be
> some way to ask "is this valid utf-8" (there are several candidate
> functions, but I don't think either of us quite knows the right way to
> invoke them).
I think you were just reading the note on the utf8::valid function a
little too strongly.
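You could use this block;
    use Encode ();

    # Only strings with high-bit bytes can be multi-byte UTF-8.
    if ($string =~ m/[\200-\377]/) {
        # Tentatively mark the bytes as UTF-8, and undo that if the
        # result is not well-formed.
        Encode::_utf8_on($string);
        if (!utf8::valid($string)) {
            Encode::_utf8_off($string);
        }
    }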
Anyway, I guess all this rubbish is why people use CPAN modules, so that
they don't have to continually rediscover every single protocol quirk
and reinvent the wheel.
i.e., it would be much, much simpler to use MIME::Entity->build for all of
this, and remove the duplication of code.
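As a small illustration of letting a module do the RFC 2047 work, here is a sketch that uses the 'MIME-Header' encoding from the core Encode module. Note the substitution: this is not MIME::Entity, which is a CPAN module and may not be installed. The sample subject is made up.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A subject as a character string; \x{fc} is "ü".
my $subject = "G\x{fc}t";

# Encode::MIME::Header produces an RFC 2047 encoded-word for the header.
my $header = encode('MIME-Header', $subject);
print "Subject: $header\n";

# Decoding the header again should give back the original characters.
my $roundtrip = decode('MIME-Header', $header);
print "round-trip ok\n" if $roundtrip eq $subject;
```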
Sam.