From mboxrd@z Thu Jan  1 00:00:00 1970
From: dwmw2@infradead.org (David Woodhouse)
Date: Wed, 06 Jan 2010 23:43:00 +0000
Subject: Sending UTF-8 patches (was: [PATCH 2/2] Remove now-defunct
	ts7250 nand driver)
In-Reply-To: <20100106232128.GE24250@shareable.org>
References: <201001051459.58621.hartleys@visionengravers.com>
	<1262784693.3181.8034.camel@macbook.infradead.org>
	<20100106180705.GC11773@shareable.org>
	<1262803010.3181.8484.camel@macbook.infradead.org>
	<20100106232128.GE24250@shareable.org>
Message-ID: <1262821380.3181.8838.camel@macbook.infradead.org>
To: linux-arm-kernel@lists.infradead.org
List-Id: linux-arm-kernel.lists.infradead.org

On Wed, 2010-01-06 at 23:21 +0000, Jamie Lokier wrote:
> > Personally, I suspect you're right, and it should be converted too.
> 
> It would need to be optional for git users whose source code isn't UTF-8 -
> possibly converting the other way for them.  But yeah I think it'd make
> sense to be on by default.

It's silly to talk of 'converting the other way'. The tool converts from
the charset of the email to the charset that the git repository is
configured for (which is UTF-8 by default and in all sane cases).

Why would you want to convert the other way?

It is currently optional, but I suspect that's the wrong approach. The
only reason you'd ever want that is if the mail it's interpreting is
mislabelled -- and in that case surely the best workaround is an option
which lets you override the Content-Type: header, not just disable the
charset conversion altogether. Obviously if you override the input
charset to be equal to your repository configuration, that means that no
conversion is done. But that's just a fairly unimportant special case of
the override, surely?
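
As a rough illustration of that override idea (a sketch only, not how git
actually implements it; the function name recode_patch and its parameters
are made up for this example), the "no conversion" case falls out
naturally when the overridden input charset matches the repository charset:

```python
import codecs
from typing import Optional


def recode_patch(raw: bytes,
                 declared_charset: str,
                 repo_charset: str = "utf-8",
                 override_charset: Optional[str] = None) -> bytes:
    """Convert a mail body to the repository charset.

    declared_charset is what the Content-Type: header claims;
    override_charset (hypothetical option) replaces it when the
    mail is mislabelled.
    """
    source = override_charset or declared_charset
    # Compare canonical codec names so "UTF8" and "utf-8" count as equal.
    if codecs.lookup(source).name == codecs.lookup(repo_charset).name:
        # Input already matches the repository: pass the bytes through.
        return raw
    return raw.decode(source).encode(repo_charset)


# A correctly labelled ISO-8859-1 mail is converted to UTF-8.
latin1_body = "J\u00f6rg".encode("iso-8859-1")
converted = recode_patch(latin1_body, "iso-8859-1")

# A UTF-8 mail mislabelled as ISO-8859-1 is left alone once overridden.
utf8_body = "J\u00f6rg".encode("utf-8")
untouched = recode_patch(utf8_body, "iso-8859-1", override_charset="utf-8")
```

Without the override, the mislabelled mail in the second case would be
decoded as ISO-8859-1 and re-encoded, which double-encodes the non-ASCII
bytes.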

> Section 4.1.2, Charset Parameter, final paragraph:
> 
> >>   In general, composition software should always use the "lowest common
> >>   denominator" character set possible.  For example, if a body contains
> >>   only US-ASCII characters, it SHOULD be marked as being in the US-
> >>   ASCII character set, not ISO-8859-1, which, like all the ISO-8859
> >>   family of character sets, is a superset of US-ASCII.  More generally,
> >>   if a widely-used character set is a subset of another character set,
> >>   and a body contains only characters in the widely-used subset, it
> >>   should be labelled as being in that subset.  This will increase the
> >>   chances that the recipient will be able to view the resulting entity
> >>   correctly.
> 
> It's a SHOULD, but it's still a good idea.  ISO-8859-1 is still very
> widely-used for email.

That might have made sense 13 years ago, but today I would suggest that
we ought to be applying a general principle that UTF-8 SHOULD be used
everywhere.
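
For comparison, the "lowest common denominator" rule quoted above can be
sketched as a loop over candidate charsets from narrowest to widest. This
is only an illustration: the function name and the candidate list are
assumptions, not taken from the RFC or from any particular mailer.

```python
def lowest_common_charset(body: str,
                          candidates=("us-ascii", "iso-8859-1", "utf-8")) -> str:
    """Return the first (narrowest) candidate charset that can encode body."""
    for charset in candidates:
        try:
            body.encode(charset)
        except UnicodeEncodeError:
            # This charset cannot represent the body; try a wider one.
            continue
        return charset
    raise ValueError("no candidate charset can encode this body")


ascii_label = lowest_common_charset("plain patch text")
latin1_label = lowest_common_charset("J\u00f6rg")
utf8_label = lowest_common_charset("price: \u20ac5")
```

Always labelling as UTF-8 replaces that whole selection step with a
constant.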

By doing so, we mostly sidestep the need for the whole charset labelling
and conversion clusterfuck that nobody ever quite managed to get right
anyway. If everything on the system is UTF-8, all the time, then it
doesn't _matter_ that nobody ever really managed to get labelling to
work.

Quoting RFC 2119:
3. SHOULD   This word, or the adjective "RECOMMENDED", mean that there
   may exist valid reasons in particular circumstances to ignore a
   particular item, but the full implications must be understood and
   carefully weighed before choosing a different course.

I understand the full implications of not using legacy character sets,
and have carefully weighed them before choosing to send UTF-8.

What's the down-side? That some Luddite out there might have a system
which _still_ can't render UTF-8 email in 2010, so they're only seeing
the 99.999% of the email which is readable as ASCII?

-- 
dwmw2