From: Robin Rosenberg <robin.rosenberg.lists@dewire.com>
To: Jeff King <peff@peff.net>
Cc: Junio C Hamano <gitster@pobox.com>, git@vger.kernel.org
Subject: Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
Date: Sat, 29 Mar 2008 22:43:40 +0100
Message-ID: <200803292243.40733.robin.rosenberg.lists@dewire.com>
In-Reply-To: <20080329211849.GA30851@coredump.intra.peff.net>
On Saturday 29 March 2008 22.18.49, Jeff King wrote:
> On Sat, Mar 29, 2008 at 01:54:10PM +0100, Robin Rosenberg wrote:
> > I think you really should try the UTF-8 guess, since a file may well be
> > UTF-8 even if the user locale is something else. Especially for XML
> > files, UTF-8 is common, but there are many more cases. Look into
> > git-gui/po for more examples. The probability of a UTF-8 test being wrong
> > is just so unimaginably low.
>
> Thinking about this more, I think it is only half the solution. If
> something is not valid utf-8, then we know it must be something else.
> But if something is valid utf-8, is it necessarily utf-8? I think we are
> going to have a much higher probability of guessing wrong there.
>
> For example, consider the bytes { 0xc3, 0xb6 }. In utf-8, they are 'ö'.
> But in iso8859-1, they also have meaning (Ã followed by the paragraph
> symbol ¶). Now that is an unlikely combination to come up. And maybe for
> Latin-1, having two non-ascii characters next to each other is unlikely.
First, that is an unlikely sequence even at random. For any "real" string
it simply won't happen, even in this context. Try scanning everything you
can think of and see if you find such a sequence that is not actually UTF-8.
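The byte pair under discussion can be checked directly. The following is a minimal sketch in Python, used here only for illustration (the thread itself concerns Perl); it decodes the same two bytes under both encodings:

```python
# The two bytes from the example above.
data = bytes([0xC3, 0xB6])

# Interpreted as UTF-8, the pair is a single character, U+00F6.
as_utf8 = data.decode("utf-8")

# Interpreted as ISO-8859-1, each byte is its own character:
# U+00C3 followed by U+00B6.
as_latin1 = data.decode("iso-8859-1")

print(repr(as_utf8), len(as_utf8))
print(repr(as_latin1), len(as_latin1))
```

Both decodings succeed, which is the ambiguity being debated; the question is only how often the two-character reading occurs in real text.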
> But over all commonly used encodings, what is the probability in an
> average text of that encoding that it contains valid UTF-8?
> For example, I have no idea what patterns can be found in EUCJP.
See here http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf
Note that a random string here means a randomly generated string, not a
string picked at random from the set of strings that actually exist.
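To put a rough number on the "randomly generated" case, here is a small sketch that counts how many pairs of non-ASCII bytes happen to form valid UTF-8. It is an illustration in Python, not something taken from the paper linked above, and the helper name is_valid_utf8 is made up for this example:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# Enumerate every pair of bytes with the high bit set (0x80..0xFF).
high_bytes = range(0x80, 0x100)
total = 0
valid = 0
for first in high_bytes:
    for second in high_bytes:
        total += 1
        if is_valid_utf8(bytes([first, second])):
            valid += 1

# Valid pairs need a lead byte in 0xC2..0xDF (30 values) and a
# continuation byte in 0x80..0xBF (64 values): 30 * 64 = 1920.
print(valid, total, valid / total)
```

So even for uniformly random pairs of high bytes, fewer than one in eight is valid UTF-8, and real text in a legacy encoding is not uniformly random.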
> There is some magic with how Perl marks strings as "binary" versus
> "utf-8" that I don't quite understand. And I think is_utf8 is really
> about asking "is the utf-8 flag set".
>
> I think this discussion would benefit greatly from somebody who has more
> of a clue how perl i18n stuff works. Why don't you work up a patch that
> makes sense for you, and then hopefully that will get some attention?
The only real question as I see it is whether Perl has a built-in method that
works better than the decode/encode approach. Anyone?
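For reference, the guess being proposed amounts to "try strict UTF-8 first, fall back to something else if that fails". A minimal sketch of that idea follows, written in Python rather than Perl; the function name guess_decode and the ISO-8859-1 fallback are assumptions made for this example, not what git-send-email does:

```python
from typing import Tuple

def guess_decode(data: bytes, fallback: str = "iso-8859-1") -> Tuple[str, str]:
    """Decode bytes as strict UTF-8 if possible, else with the fallback.

    Returns the decoded text and the name of the encoding that was used.
    The fallback encoding is an assumption; a real tool would more likely
    take it from the user's locale or configuration.
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return data.decode(fallback), fallback

# The same character, encoded two different ways.
print(guess_decode("ö".encode("utf-8")))
print(guess_decode("ö".encode("iso-8859-1")))
```

The fallback branch is only reached when the input is not valid UTF-8, so the risk discussed in this thread is confined to legacy-encoded input that happens to pass the strict check.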
-- robin