[PATCH 0/2] Locally alias "latin-1" to "ISO-8859-1"

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/2] Locally alias "latin-1" to "ISO-8859-1"
@ 2016-09-27  1:22 Junio C Hamano
  2016-09-27  1:22 ` [PATCH 1/2] utf8: refactor code to decide fallback encoding Junio C Hamano
  2016-09-27  1:22 ` [PATCH 2/2] utf8: accept "latin-1" as ISO-8859-1 Junio C Hamano
  0 siblings, 2 replies; 7+ messages in thread
From: Junio C Hamano @ 2016-09-27  1:22 UTC (permalink / raw)
  To: git

Some systems do not seem to ship "latin-1" as a valid locale, even
though they happilly accept more modern official name "ISO-8859-1".
Naturally, "iconv -f iso-8859-1" succeeds while "iconv -f latin-1"
fails on such a system.

We already have in utf8.c to accomodate overly strict iconv_open()
that does not like various spellings of UTF-8 when our users spell
it differently from the most official "UTF-8" form.  Piggyback on
the mechanism and teach outselves that "latin-1" used to be the way
to say "ISO-8859-1".

I feel dirty for doing it this way, but I found it the easiest
workaround to apply recent patches we saw on the mailing list.

Junio C Hamano (2):
  utf8: refactor code to decide fallback encoding
  utf8: accept "latin-1" as ISO-8859-1

 utf8.c | 36 +++++++++++++++++++++++++-----------
 1 file changed, 25 insertions(+), 11 deletions(-)

-- 
2.10.0-556-g5bbc40b

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 1/2] utf8: refactor code to decide fallback encoding
  2016-09-27  1:22 [PATCH 0/2] Locally alias "latin-1" to "ISO-8859-1" Junio C Hamano
@ 2016-09-27  1:22 ` Junio C Hamano
  2016-09-27  5:52   ` Jeff King
  2016-09-27  1:22 ` [PATCH 2/2] utf8: accept "latin-1" as ISO-8859-1 Junio C Hamano
  1 sibling, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2016-09-27  1:22 UTC (permalink / raw)
  To: git

The codepath we use to call iconv_open() has a provision to use a
fallback encoding when it fails, hoping that "UTF-8" being spelled
differently could be the reason why the library function did not
like the encoding names we gave it.  Essentially, we turn what we
have observed to be used as variants of "UTF-8" (e.g. "utf8") into
the most official spelling and use that as a fallback.

We do the same thing for input and output encoding.  Introduce a
helper function to do just one side and call that twice.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 utf8.c | 29 ++++++++++++++++++-----------
 1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/utf8.c b/utf8.c
index 00e10c8..550e785 100644
--- a/utf8.c
+++ b/utf8.c
@@ -489,6 +489,21 @@ char *reencode_string_iconv(const char *in, size_t insz, iconv_t conv, int *outs
 	return out;
 }
 
+static const char *fallback_encoding(const char *name)
+{
+	/*
+	 * Some platforms do not have the variously spelled variants of
+	 * UTF-8, so let's fall back to trying the most official
+	 * spelling. We do so only as a fallback in case the platform
+	 * does understand the user's spelling, but not our official
+	 * one.
+	 */
+	if (is_encoding_utf8(name))
+		return "UTF-8";
+
+	return name;
+}
+
 char *reencode_string_len(const char *in, int insz,
 			  const char *out_encoding, const char *in_encoding,
 			  int *outsz)
@@ -501,17 +516,9 @@ char *reencode_string_len(const char *in, int insz,
 
 	conv = iconv_open(out_encoding, in_encoding);
 	if (conv == (iconv_t) -1) {
-		/*
-		 * Some platforms do not have the variously spelled variants of
-		 * UTF-8, so let's fall back to trying the most official
-		 * spelling. We do so only as a fallback in case the platform
-		 * does understand the user's spelling, but not our official
-		 * one.
-		 */
-		if (is_encoding_utf8(in_encoding))
-			in_encoding = "UTF-8";
-		if (is_encoding_utf8(out_encoding))
-			out_encoding = "UTF-8";
+		in_encoding = fallback_encoding(in_encoding);
+		out_encoding = fallback_encoding(out_encoding);
+
 		conv = iconv_open(out_encoding, in_encoding);
 		if (conv == (iconv_t) -1)
 			return NULL;
-- 
2.10.0-556-g5bbc40b


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/2] utf8: refactor code to decide fallback encoding
  2016-09-27  1:22 ` [PATCH 1/2] utf8: refactor code to decide fallback encoding Junio C Hamano
@ 2016-09-27  5:52   ` Jeff King
  2016-09-27 15:33     ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff King @ 2016-09-27  5:52 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Mon, Sep 26, 2016 at 06:22:10PM -0700, Junio C Hamano wrote:

> @@ -501,17 +516,9 @@ char *reencode_string_len(const char *in, int insz,
>  
>  	conv = iconv_open(out_encoding, in_encoding);
>  	if (conv == (iconv_t) -1) {
> -		/*
> -		 * Some platforms do not have the variously spelled variants of
> -		 * UTF-8, so let's fall back to trying the most official
> -		 * spelling. We do so only as a fallback in case the platform
> -		 * does understand the user's spelling, but not our official
> -		 * one.
> -		 */
> -		if (is_encoding_utf8(in_encoding))
> -			in_encoding = "UTF-8";
> -		if (is_encoding_utf8(out_encoding))
> -			out_encoding = "UTF-8";
> +		in_encoding = fallback_encoding(in_encoding);
> +		out_encoding = fallback_encoding(out_encoding);
> +

This comment is interesting. We're concerned about a platform knowing
"utf8" but not "UTF-8". When we fallback, we do it for both the input
and output encodings, because we don't know which may have caused the
problem. So is it possible that we improve one case but break the other?

With just UTF-8, I don't think so. That could only be the case with
something like "utf8 -> utf-8" because they both become "UTF-8". So
either it improves the situation or not (because we either understand
UTF-8 or not).

But once we introduce other fallbacks, then "utf8 -> latin1" may become
"UTF-8 -> iso8859-1". A system that knows only "utf8" and "iso8859-1"
_could_ work if we turned the knobs individually, but won't if we turn
them both at once. Worse, a system that knows only "UTF-8" and "latin1"
works now, but would break with your patches.

I'm not convinced it's worth worrying about, though. The existence of
such a system is theoretical at this point. I'm not even sure how common
the "know about utf8 but not UTF-8" thing is, or if we were merely being
overly cautious.

-Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/2] utf8: refactor code to decide fallback encoding
  2016-09-27  5:52   ` Jeff King
@ 2016-09-27 15:33     ` Junio C Hamano
  0 siblings, 0 replies; 7+ messages in thread
From: Junio C Hamano @ 2016-09-27 15:33 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> But once we introduce other fallbacks, then "utf8 -> latin1" may become
> "UTF-8 -> iso8859-1". A system that knows only "utf8" and "iso8859-1"
> _could_ work if we turned the knobs individually, but won't if we turn
> them both at once. Worse, a system that knows only "UTF-8" and "latin1"
> works now, but would break with your patches.
>
> I'm not convinced it's worth worrying about, though. The existence of
> such a system is theoretical at this point. I'm not even sure how common
> the "know about utf8 but not UTF-8" thing is, or if we were merely being
> overly cautious.

Yeah, I did consider having to try the permutations until it works,
but suspecting that somebody takes "utf8" without taking "UTF-8" is
to pretty much invalidate the basic premise of the existing code,
i.e. spelling it as "UTF-8" is the most likely to work anywhere as
long as UTF-8 is supported, so I stopped worrying about it at that
point.

I'd actually welcome a more generic suggestions we can put in our
documentation so that we can _lose_ the fallback entirely (e.g. "if
your contributor spelled 'utf8' and your system, which does take
'UTF-8', does not like it, then here is what you can do to your
/etc/locale.alias").

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 2/2] utf8: accept "latin-1" as ISO-8859-1
  2016-09-27  1:22 [PATCH 0/2] Locally alias "latin-1" to "ISO-8859-1" Junio C Hamano
  2016-09-27  1:22 ` [PATCH 1/2] utf8: refactor code to decide fallback encoding Junio C Hamano
@ 2016-09-27  1:22 ` Junio C Hamano
  2016-09-27  5:57   ` Jeff King
  1 sibling, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2016-09-27  1:22 UTC (permalink / raw)
  To: git

Even though latin-1 is still seen in e-mail headers, some platforms
only install ISO-8859-1.  "iconv -f ISO-8859-1" succeeds, while
"iconv -f latin-1" fails on such a system.

Using the same fallback_encoding() mechanism factored out in the
previous step, teach ourselves that "ISO-8859-1" has a better chance
of being accepted than "latin-1".

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 utf8.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/utf8.c b/utf8.c
index 550e785..0c8e011 100644
--- a/utf8.c
+++ b/utf8.c
@@ -501,6 +501,13 @@ static const char *fallback_encoding(const char *name)
 	if (is_encoding_utf8(name))
 		return "UTF-8";
 
+	/*
+	 * Even though latin-1 is still seen in e-mail
+	 * headers, some platforms only install ISO-8859-1.
+	 */
+	if (!strcasecmp(name, "latin-1"))
+		return "ISO-8859-1";
+
 	return name;
 }
 
-- 
2.10.0-556-g5bbc40b


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] utf8: accept "latin-1" as ISO-8859-1
  2016-09-27  1:22 ` [PATCH 2/2] utf8: accept "latin-1" as ISO-8859-1 Junio C Hamano
@ 2016-09-27  5:57   ` Jeff King
  2016-09-27  6:08     ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff King @ 2016-09-27  5:57 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Mon, Sep 26, 2016 at 06:22:11PM -0700, Junio C Hamano wrote:

> Even though latin-1 is still seen in e-mail headers, some platforms
> only install ISO-8859-1.  "iconv -f ISO-8859-1" succeeds, while
> "iconv -f latin-1" fails on such a system.
> 
> Using the same fallback_encoding() mechanism factored out in the
> previous step, teach ourselves that "ISO-8859-1" has a better chance
> of being accepted than "latin-1".

I was curious if this was the most official or accepted spelling.
Grepping a few hundred thousand messages from my mail archives, it does
seem to be the most common.

> diff --git a/utf8.c b/utf8.c
> index 550e785..0c8e011 100644
> --- a/utf8.c
> +++ b/utf8.c
> @@ -501,6 +501,13 @@ static const char *fallback_encoding(const char *name)
>  	if (is_encoding_utf8(name))
>  		return "UTF-8";
>  
> +	/*
> +	 * Even though latin-1 is still seen in e-mail
> +	 * headers, some platforms only install ISO-8859-1.
> +	 */
> +	if (!strcasecmp(name, "latin-1"))
> +		return "ISO-8859-1";
> +

For the UTF-8 fallbacks, we actually detect their equivalence via
same_encoding() before even hitting iconv. Is it worth doing the same
here?

I have to admit that I don't care too deeply about performance for
somebody who wants to convert "latin1" to "ISO-8859-1". If one of your
encodings is not UTF-8, you are probably Doing It Wrong. :)

-Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] utf8: accept "latin-1" as ISO-8859-1
  2016-09-27  5:57   ` Jeff King
@ 2016-09-27  6:08     ` Junio C Hamano
  0 siblings, 0 replies; 7+ messages in thread
From: Junio C Hamano @ 2016-09-27  6:08 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> I have to admit that I don't care too deeply about performance for
> somebody who wants to convert "latin1" to "ISO-8859-1". If one of your
> encodings is not UTF-8, you are probably Doing It Wrong. :)

Exactly.  Note that the "you" in the above are usually plural,
collectively referring to both the sender and the receiver.  I
usually am on the poor receiving end ;-)

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2016-09-27 15:33 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-09-27  1:22 [PATCH 0/2] Locally alias "latin-1" to "ISO-8859-1" Junio C Hamano
2016-09-27  1:22 ` [PATCH 1/2] utf8: refactor code to decide fallback encoding Junio C Hamano
2016-09-27  5:52   ` Jeff King
2016-09-27 15:33     ` Junio C Hamano
2016-09-27  1:22 ` [PATCH 2/2] utf8: accept "latin-1" as ISO-8859-1 Junio C Hamano
2016-09-27  5:57   ` Jeff King
2016-09-27  6:08     ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).