Do a better job at guessing unknown character sets

* Do a better job at guessing unknown character sets
@ 2007-07-17 17:34 Linus Torvalds
  2007-07-17 19:56 ` Johannes Schindelin
  0 siblings, 1 reply; 3+ messages in thread
From: Linus Torvalds @ 2007-07-17 17:34 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List

At least in the kernel development community, we're slowly converting to 
UTF-8 everywhere, and the old default of Latin1 in emails is being 
supplanted by UTF-8. That doesn't necessarily show up as such in the mail 
headers (because, quite frankly, when people send patches around, they 
want the email client to do as little as humanly possible to the patch).

Despite that, it's often the case that email addresses etc. still contain 
Latin1, so I've seen emails that are a mixed bag, with Signed-off-by 
lines copied from email (and containing Latin1 characters) and the rest 
of the email being a patch in UTF-8.

So this suggests a very natural change: if the target character set is 
utf-8 (the default), and if the source already looks like utf-8, just 
assume that it doesn't need any conversion at all.

Only assume that it needs conversion if it isn't already valid utf-8, in 
which case we (for historical reasons) will assume it's Latin1.
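
As a rough standalone illustration of that rule, here is a minimal sketch in C. It assumes the target charset is already known to be UTF-8, and both function names (looks_like_utf8 and guess_source_charset) are hypothetical stand-ins rather than the helpers the patch itself calls. The validity check is deliberately simplified: it only checks lead bytes and continuation bytes, and does not reject every overlong or surrogate sequence.

```c
#include <stddef.h>

/*
 * Hypothetical, simplified UTF-8 check (not git's is_utf8()).
 * Returns 1 if every byte sequence in the NUL-terminated string has a
 * plausible UTF-8 lead byte followed by the right number of
 * continuation bytes, 0 otherwise. Pure US-ASCII passes trivially.
 */
static int looks_like_utf8(const unsigned char *s)
{
	while (*s) {
		unsigned char c = *s++;
		int extra;

		if (c < 0x80)
			continue;		/* plain US-ASCII byte */
		else if (c >= 0xc2 && c <= 0xdf)
			extra = 1;		/* lead byte of a 2-byte sequence */
		else if (c >= 0xe0 && c <= 0xef)
			extra = 2;		/* lead byte of a 3-byte sequence */
		else if (c >= 0xf0 && c <= 0xf4)
			extra = 3;		/* lead byte of a 4-byte sequence */
		else
			return 0;		/* stray continuation or invalid lead */

		while (extra--) {
			/* A NUL here also fails the test, so we never run past the end. */
			if ((*s & 0xc0) != 0x80)
				return 0;
			s++;
		}
	}
	return 1;
}

/*
 * Hypothetical guess for a line with no declared charset, assuming the
 * target is UTF-8: NULL means "leave it alone", otherwise fall back to
 * Latin1.
 */
static const char *guess_source_charset(const char *line)
{
	if (looks_like_utf8((const unsigned char *)line))
		return NULL;
	return "latin1";
}
```

With this sketch, a line containing the bytes 0xc3 0xbc (u-umlaut in UTF-8) is left alone, while a line containing a lone 0xf6 (o-umlaut in Latin1) is treated as Latin1 and would be converted.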

Basically no really _valid_ latin1 will ever look like utf-8, so while 
this changes our historical behaviour, it doesn't do so in practice, and 
makes the default behaviour saner for the case where the input was already 
in proper format.

We could do a fancier guess, of course, but this correctly handled a 
series of patches I just got from Andrew that had a mixture of Latin1 and 
UTF-8 (in different emails, but without any character set indication).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

I think this makes sense from a "the world is moving to utf-8" standpoint, 
even if obviously some people might consider it a bit ugly to do per-line 
"guessing".

Comments?

 builtin-mailinfo.c |   33 +++++++++++++++++++++++++++++----
 1 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c
index 489c2c5..a37a4ff 100644
--- a/builtin-mailinfo.c
+++ b/builtin-mailinfo.c
@@ -499,15 +499,40 @@ static int decode_b_segment(char *in, char *ot, char *ep)
 	return 0;
 }

+/*
+ * When there is no known charset, guess.
+ *
+ * Right now we assume that if the target is UTF-8 (the default),
+ * and it already looks like UTF-8 (which includes US-ASCII as its
+ * subset, of course) then that is what it is and there is nothing
+ * to do.
+ *
+ * Otherwise, we default to assuming it is Latin1 for historical
+ * reasons.
+ */
+static const char *guess_charset(const char *line, const char *target_charset)
+{
+	if (is_encoding_utf8(target_charset)) {
+		if (is_utf8(line))
+			return NULL;
+	}
+	return "latin1";
+}
+
 static void convert_to_utf8(char *line, const char *charset)
 {
-	static const char latin_one[] = "latin1";
-	const char *input_charset = *charset ? charset : latin_one;
-	char *out = reencode_string(line, metainfo_charset, input_charset);
+	char *out;
+
+	if (!charset || !*charset) {
+		charset = guess_charset(line, metainfo_charset);
+		if (!charset)
+			return;
+	}

+	out = reencode_string(line, metainfo_charset, charset);
 	if (!out)
 		die("cannot convert from %s to %s\n",
-		    input_charset, metainfo_charset);
+		    charset, metainfo_charset);
 	strcpy(line, out);
 	free(out);
 }


* Re: Do a better job at guessing unknown character sets
  2007-07-17 17:34 Do a better job at guessing unknown character sets Linus Torvalds
@ 2007-07-17 19:56 ` Johannes Schindelin
  2007-07-17 20:01   ` david
  0 siblings, 1 reply; 3+ messages in thread
From: Johannes Schindelin @ 2007-07-17 19:56 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List

Hi,

On Tue, 17 Jul 2007, Linus Torvalds wrote:

> I think this makes sense from a "the world is moving to utf-8" 
> standpoint, even if obviously some people might consider it a bit ugly 
> to do per-line "guessing".
> 
> Comments?

IMHO this is a good change.  Encodings are such a hassle, and probably 
only because the inventors of ASCII just were narrow-minded enough not to 
care.  With this patch, the hassle factor diminishes AFAICT.

Ciao,
Dscho

P.S.: I think that in case of undesired behaviour, even if it is detected 
late in the game, filter-branch/rewrite-commits will help.


* Re: Do a better job at guessing unknown character sets
  2007-07-17 19:56 ` Johannes Schindelin
@ 2007-07-17 20:01   ` david
  0 siblings, 0 replies; 3+ messages in thread
From: david @ 2007-07-17 20:01 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Linus Torvalds, Junio C Hamano, Git Mailing List

On Tue, 17 Jul 2007, Johannes Schindelin wrote:

> On Tue, 17 Jul 2007, Linus Torvalds wrote:
>
>> I think this makes sense from a "the world is moving to utf-8"
>> standpoint, even if obviously some people might consider it a bit ugly
>> to do per-line "guessing".
>>
>> Comments?
>
> Encodings are such a hassle, and probably
> only because the inventors of ASCII just were narrow-minded enough not to
> care.

To be perfectly fair, ASCII was invented to eliminate the different, 
incompatible character sets that were in use at the time. And it did the 
job well (I think the only survivor from those sets is EBCDIC, and only 
due to the legacy installed base).

Current character encodings are doing things that weren't dreamed of by 
anyone at the time.

David Lang

