* [PATCH 0/2] Locally alias "latin-1" to "ISO-8859-1" @ 2016-09-27 1:22 Junio C Hamano 2016-09-27 1:22 ` [PATCH 1/2] utf8: refactor code to decide fallback encoding Junio C Hamano 2016-09-27 1:22 ` [PATCH 2/2] utf8: accept "latin-1" as ISO-8859-1 Junio C Hamano 0 siblings, 2 replies; 7+ messages in thread From: Junio C Hamano @ 2016-09-27 1:22 UTC (permalink / raw) To: git Some systems do not seem to ship "latin-1" as a valid locale, even though they happilly accept more modern official name "ISO-8859-1". Naturally, "iconv -f iso-8859-1" succeeds while "iconv -f latin-1" fails on such a system. We already have in utf8.c to accomodate overly strict iconv_open() that does not like various spellings of UTF-8 when our users spell it differently from the most official "UTF-8" form. Piggyback on the mechanism and teach outselves that "latin-1" used to be the way to say "ISO-8859-1". I feel dirty for doing it this way, but I found it the easiest workaround to apply recent patches we saw on the mailing list. Junio C Hamano (2): utf8: refactor code to decide fallback encoding utf8: accept "latin-1" as ISO-8859-1 utf8.c | 36 +++++++++++++++++++++++++----------- 1 file changed, 25 insertions(+), 11 deletions(-) -- 2.10.0-556-g5bbc40b ^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH 1/2] utf8: refactor code to decide fallback encoding 2016-09-27 1:22 [PATCH 0/2] Locally alias "latin-1" to "ISO-8859-1" Junio C Hamano @ 2016-09-27 1:22 ` Junio C Hamano 2016-09-27 5:52 ` Jeff King 2016-09-27 1:22 ` [PATCH 2/2] utf8: accept "latin-1" as ISO-8859-1 Junio C Hamano 1 sibling, 1 reply; 7+ messages in thread From: Junio C Hamano @ 2016-09-27 1:22 UTC (permalink / raw) To: git The codepath we use to call iconv_open() has a provision to use a fallback encoding when it fails, hoping that "UTF-8" being spelled differently could be the reason why the library function did not like the encoding names we gave it. Essentially, we turn what we have observed to be used as variants of "UTF-8" (e.g. "utf8") into the most official spelling and use that as a fallback. We do the same thing for input and output encoding. Introduce a helper function to do just one side and call that twice. Signed-off-by: Junio C Hamano <gitster@pobox.com> --- utf8.c | 29 ++++++++++++++++++----------- 1 file changed, 18 insertions(+), 11 deletions(-) diff --git a/utf8.c b/utf8.c index 00e10c8..550e785 100644 --- a/utf8.c +++ b/utf8.c @@ -489,6 +489,21 @@ char *reencode_string_iconv(const char *in, size_t insz, iconv_t conv, int *outs return out; } +static const char *fallback_encoding(const char *name) +{ + /* + * Some platforms do not have the variously spelled variants of + * UTF-8, so let's fall back to trying the most official + * spelling. We do so only as a fallback in case the platform + * does understand the user's spelling, but not our official + * one. + */ + if (is_encoding_utf8(name)) + return "UTF-8"; + + return name; +} + char *reencode_string_len(const char *in, int insz, const char *out_encoding, const char *in_encoding, int *outsz) @@ -501,17 +516,9 @@ char *reencode_string_len(const char *in, int insz, conv = iconv_open(out_encoding, in_encoding); if (conv == (iconv_t) -1) { - /* - * Some platforms do not have the variously spelled variants of - * UTF-8, so let's fall back to trying the most official - * spelling. We do so only as a fallback in case the platform - * does understand the user's spelling, but not our official - * one. - */ - if (is_encoding_utf8(in_encoding)) - in_encoding = "UTF-8"; - if (is_encoding_utf8(out_encoding)) - out_encoding = "UTF-8"; + in_encoding = fallback_encoding(in_encoding); + out_encoding = fallback_encoding(out_encoding); + conv = iconv_open(out_encoding, in_encoding); if (conv == (iconv_t) -1) return NULL; -- 2.10.0-556-g5bbc40b ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH 1/2] utf8: refactor code to decide fallback encoding 2016-09-27 1:22 ` [PATCH 1/2] utf8: refactor code to decide fallback encoding Junio C Hamano @ 2016-09-27 5:52 ` Jeff King 2016-09-27 15:33 ` Junio C Hamano 0 siblings, 1 reply; 7+ messages in thread From: Jeff King @ 2016-09-27 5:52 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Mon, Sep 26, 2016 at 06:22:10PM -0700, Junio C Hamano wrote: > @@ -501,17 +516,9 @@ char *reencode_string_len(const char *in, int insz, > > conv = iconv_open(out_encoding, in_encoding); > if (conv == (iconv_t) -1) { > - /* > - * Some platforms do not have the variously spelled variants of > - * UTF-8, so let's fall back to trying the most official > - * spelling. We do so only as a fallback in case the platform > - * does understand the user's spelling, but not our official > - * one. > - */ > - if (is_encoding_utf8(in_encoding)) > - in_encoding = "UTF-8"; > - if (is_encoding_utf8(out_encoding)) > - out_encoding = "UTF-8"; > + in_encoding = fallback_encoding(in_encoding); > + out_encoding = fallback_encoding(out_encoding); > + This comment is interesting. We're concerned about a platform knowing "utf8" but not "UTF-8". When we fallback, we do it for both the input and output encodings, because we don't know which may have caused the problem. So is it possible that we improve one case but break the other? With just UTF-8, I don't think so. That could only be the case with something like "utf8 -> utf-8" because they both become "UTF-8". So either it improves the situation or not (because we either understand UTF-8 or not). But once we introduce other fallbacks, then "utf8 -> latin1" may become "UTF-8 -> iso8859-1". A system that knows only "utf8" and "iso8859-1" _could_ work if we turned the knobs individually, but won't if we turn them both at once. Worse, a system that knows only "UTF-8" and "latin1" works now, but would break with your patches. I'm not convinced it's worth worrying about, though. The existence of such a system is theoretical at this point. I'm not even sure how common the "know about utf8 but not UTF-8" thing is, or if we were merely being overly cautious. -Peff ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 1/2] utf8: refactor code to decide fallback encoding 2016-09-27 5:52 ` Jeff King @ 2016-09-27 15:33 ` Junio C Hamano 0 siblings, 0 replies; 7+ messages in thread From: Junio C Hamano @ 2016-09-27 15:33 UTC (permalink / raw) To: Jeff King; +Cc: git Jeff King <peff@peff.net> writes: > But once we introduce other fallbacks, then "utf8 -> latin1" may become > "UTF-8 -> iso8859-1". A system that knows only "utf8" and "iso8859-1" > _could_ work if we turned the knobs individually, but won't if we turn > them both at once. Worse, a system that knows only "UTF-8" and "latin1" > works now, but would break with your patches. > > I'm not convinced it's worth worrying about, though. The existence of > such a system is theoretical at this point. I'm not even sure how common > the "know about utf8 but not UTF-8" thing is, or if we were merely being > overly cautious. Yeah, I did consider having to try the permutations until it works, but suspecting that somebody takes "utf8" without taking "UTF-8" is to pretty much invalidate the basic premise of the existing code, i.e. spelling it as "UTF-8" is the most likely to work anywhere as long as UTF-8 is supported, so I stopped worrying about it at that point. I'd actually welcome a more generic suggestions we can put in our documentation so that we can _lose_ the fallback entirely (e.g. "if your contributor spelled 'utf8' and your system, which does take 'UTF-8', does not like it, then here is what you can do to your /etc/locale.alias"). ^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH 2/2] utf8: accept "latin-1" as ISO-8859-1 2016-09-27 1:22 [PATCH 0/2] Locally alias "latin-1" to "ISO-8859-1" Junio C Hamano 2016-09-27 1:22 ` [PATCH 1/2] utf8: refactor code to decide fallback encoding Junio C Hamano @ 2016-09-27 1:22 ` Junio C Hamano 2016-09-27 5:57 ` Jeff King 1 sibling, 1 reply; 7+ messages in thread From: Junio C Hamano @ 2016-09-27 1:22 UTC (permalink / raw) To: git Even though latin-1 is still seen in e-mail headers, some platforms only install ISO-8859-1. "iconv -f ISO-8859-1" succeeds, while "iconv -f latin-1" fails on such a system. Using the same fallback_encoding() mechanism factored out in the previous step, teach ourselves that "ISO-8859-1" has a better chance of being accepted than "latin-1". Signed-off-by: Junio C Hamano <gitster@pobox.com> --- utf8.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/utf8.c b/utf8.c index 550e785..0c8e011 100644 --- a/utf8.c +++ b/utf8.c @@ -501,6 +501,13 @@ static const char *fallback_encoding(const char *name) if (is_encoding_utf8(name)) return "UTF-8"; + /* + * Even though latin-1 is still seen in e-mail + * headers, some platforms only install ISO-8859-1. + */ + if (!strcasecmp(name, "latin-1")) + return "ISO-8859-1"; + return name; } -- 2.10.0-556-g5bbc40b ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH 2/2] utf8: accept "latin-1" as ISO-8859-1 2016-09-27 1:22 ` [PATCH 2/2] utf8: accept "latin-1" as ISO-8859-1 Junio C Hamano @ 2016-09-27 5:57 ` Jeff King 2016-09-27 6:08 ` Junio C Hamano 0 siblings, 1 reply; 7+ messages in thread From: Jeff King @ 2016-09-27 5:57 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Mon, Sep 26, 2016 at 06:22:11PM -0700, Junio C Hamano wrote: > Even though latin-1 is still seen in e-mail headers, some platforms > only install ISO-8859-1. "iconv -f ISO-8859-1" succeeds, while > "iconv -f latin-1" fails on such a system. > > Using the same fallback_encoding() mechanism factored out in the > previous step, teach ourselves that "ISO-8859-1" has a better chance > of being accepted than "latin-1". I was curious if this was the most official or accepted spelling. Grepping a few hundred thousand messages from my mail archives, it does seem to be the most common. > diff --git a/utf8.c b/utf8.c > index 550e785..0c8e011 100644 > --- a/utf8.c > +++ b/utf8.c > @@ -501,6 +501,13 @@ static const char *fallback_encoding(const char *name) > if (is_encoding_utf8(name)) > return "UTF-8"; > > + /* > + * Even though latin-1 is still seen in e-mail > + * headers, some platforms only install ISO-8859-1. > + */ > + if (!strcasecmp(name, "latin-1")) > + return "ISO-8859-1"; > + For the UTF-8 fallbacks, we actually detect their equivalence via same_encoding() before even hitting iconv. Is it worth doing the same here? I have to admit that I don't care too deeply about performance for somebody who wants to convert "latin1" to "ISO-8859-1". If one of your encodings is not UTF-8, you are probably Doing It Wrong. :) -Peff ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 2/2] utf8: accept "latin-1" as ISO-8859-1 2016-09-27 5:57 ` Jeff King @ 2016-09-27 6:08 ` Junio C Hamano 0 siblings, 0 replies; 7+ messages in thread From: Junio C Hamano @ 2016-09-27 6:08 UTC (permalink / raw) To: Jeff King; +Cc: git Jeff King <peff@peff.net> writes: > I have to admit that I don't care too deeply about performance for > somebody who wants to convert "latin1" to "ISO-8859-1". If one of your > encodings is not UTF-8, you are probably Doing It Wrong. :) Exactly. Note that the "you" in the above are usually plural, collectively referring to both the sender and the receiver. I usually am on the poor receiving end ;-) ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2016-09-27 15:33 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2016-09-27 1:22 [PATCH 0/2] Locally alias "latin-1" to "ISO-8859-1" Junio C Hamano 2016-09-27 1:22 ` [PATCH 1/2] utf8: refactor code to decide fallback encoding Junio C Hamano 2016-09-27 5:52 ` Jeff King 2016-09-27 15:33 ` Junio C Hamano 2016-09-27 1:22 ` [PATCH 2/2] utf8: accept "latin-1" as ISO-8859-1 Junio C Hamano 2016-09-27 5:57 ` Jeff King 2016-09-27 6:08 ` Junio C Hamano
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).