From: Linus Torvalds <torvalds@linux-foundation.org>
To: Junio C Hamano <junkio@cox.net>, Git Mailing List <git@vger.kernel.org>
Subject: Do a better job at guessing unknown character sets
Date: Tue, 17 Jul 2007 10:34:44 -0700 (PDT) [thread overview]
Message-ID: <alpine.LFD.0.999.0707171027100.19166@woody.linux-foundation.org> (raw)
At least in the kernel development community, we're generally slowly
converting to UTF-8 everywhere, and the old default of Latin1 in emails is
being supplanted by UTF-8, and it doesn't necessarily show up as such in
the mail headers (because, quite frankly, when people send patches
around, they want the email client to do as little as humanly possible
about the patch)
Despite that, it's often the case that email addresses etc still have
Latin1, so I've seen emails where this is a mixed bag, with Signed-off
parts being copied from email (and containing Latin1 characters), and the
rest of the email being a patch in UTF-8.
So this suggests a very natural change: if the target character set is
utf-8 (the default), and if the source already looks like utf-8, just
assume that it doesn't need any conversion at all.
Only assume that it needs conversion if it isn't already valid utf-8, in
which case we (for historical reasons) will assume it's Latin1.
Basically no really _valid_ latin1 will ever look like utf-8, so while
this changes our historical behaviour, it doesn't do so in practice, and
makes the default behaviour saner for the case where the input was already
in proper format.
We could do a more fancy guess, of course, but this correctly handled a
series of patches I just got from Andrew that had a mixture of Latin1 and
UTF-8 (in different emails, but without any character set indication).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
I think this makes sense from a "the world is moving to utf-8" standpoint,
even if obviously some people might consider it a bit ugly to do per-line
"guessing".
Comments?
builtin-mailinfo.c | 33 +++++++++++++++++++++++++++++----
1 files changed, 29 insertions(+), 4 deletions(-)
diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c
index 489c2c5..a37a4ff 100644
--- a/builtin-mailinfo.c
+++ b/builtin-mailinfo.c
@@ -499,15 +499,40 @@ static int decode_b_segment(char *in, char *ot, char *ep)
return 0;
}
+/*
+ * When there is no known charset, guess.
+ *
+ * Right now we assume that if the target is UTF-8 (the default),
+ * and it already looks like UTF-8 (which includes US-ASCII as its
+ * subset, of course) then that is what it is and there is nothing
+ * to do.
+ *
+ * Otherwise, we default to assuming it is Latin1 for historical
+ * reasons.
+ */
+static const char *guess_charset(const char *line, const char *target_charset)
+{
+ if (is_encoding_utf8(target_charset)) {
+ if (is_utf8(line))
+ return NULL;
+ }
+ return "latin1";
+}
+
static void convert_to_utf8(char *line, const char *charset)
{
- static const char latin_one[] = "latin1";
- const char *input_charset = *charset ? charset : latin_one;
- char *out = reencode_string(line, metainfo_charset, input_charset);
+ char *out;
+
+ if (!charset || !*charset) {
+ charset = guess_charset(line, metainfo_charset);
+ if (!charset)
+ return;
+ }
+ out = reencode_string(line, metainfo_charset, charset);
if (!out)
die("cannot convert from %s to %s\n",
- input_charset, metainfo_charset);
+ charset, metainfo_charset);
strcpy(line, out);
free(out);
}
next reply other threads:[~2007-07-17 17:34 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-07-17 17:34 Linus Torvalds [this message]
2007-07-17 19:56 ` Do a better job at guessing unknown character sets Johannes Schindelin
2007-07-17 20:01 ` david
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=alpine.LFD.0.999.0707171027100.19166@woody.linux-foundation.org \
--to=torvalds@linux-foundation.org \
--cc=git@vger.kernel.org \
--cc=junkio@cox.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).