From: Jeff King <peff@peff.net>
To: Robin Rosenberg <robin.rosenberg.lists@dewire.com>
Cc: Junio C Hamano <gitster@pobox.com>, git@vger.kernel.org
Subject: Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
Date: Sat, 29 Mar 2008 18:00:11 -0400 [thread overview]
Message-ID: <20080329220011.GC30851@coredump.intra.peff.net> (raw)
In-Reply-To: <200803292243.40733.robin.rosenberg.lists@dewire.com>
On Sat, Mar 29, 2008 at 10:43:40PM +0100, Robin Rosenberg wrote:
> First that is even by random an unlikely sequence. For any "real" is string
> it simply won't happen, even in this context. Try scanning everything you
> can think of and see if you find such a sequence that is not actually UTF-8.
That's the problem I was mentioning: "everything I can think of" is
basically just us-ascii with a few accented characters. I don't know
how, e.g., Japanese texts will fare with such a test.
> > But over all commonly used encodings, what is the probability in an
> > average text of that encoding that it contains valid UTF-8?
> > For example, I have no idea what patterns can be found in EUCJP.
>
> See here http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf
Thanks, that is an interesting read. And he seems to indicate that you
can guess with a reasonable degree of success. But a few points on that
work:
- he has a specific methodology for guessing, which is more elaborate
than what you proposed. So to get his results, you would need to
implement his method. Hopefully if perl does have a "guess if this
looks like utf8" method, it uses a similar scheme.
- he does admit that some encodings have difficult to assess
probabilities, and it will vary from language to language. See page
22:
If a specific language does not use all three letters (a single
letter on the left and the corresponding two letters on the
right), then this combination presents no danger. Further checks
can then be made with a dictionary, although there is the problem
that a dictionary never contains all possible words, and that of
course resource names don't necessarily have to be words.
- he mentions Latin, Cyrillic, and Hebrew encodings. I note the
conspicuous absence of any Asian languages.
> Note that a random string is a randomly generated string. Not a random
> string from the set of actually existing strings.
Sure. But looking at random strings isn't terribly useful; there is a
non-uniform distribution over the set of strings, dependent on the
_actual_ encoding. So there are going to be "good" encodings that will
guess well, and there will be "bad" encodings that might not (and by
"will", I mean "there may be"; that is the very thing I am saying we
don't have good evidence for).
-Peff
next prev parent reply other threads:[~2008-03-29 22:01 UTC|newest]
Thread overview: 43+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-03-28 6:30 [ANNOUNCE] GIT 1.5.5-rc2 Junio C Hamano
2008-03-28 18:13 ` Jeff King
2008-03-28 21:05 ` Junio C Hamano
2008-03-28 21:23 ` Jeff King
2008-03-28 21:27 ` Jeff King
2008-03-28 21:28 ` [PATCH 1/2] send-email: specify content-type of --compose body Jeff King
2008-03-28 21:29 ` [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters Jeff King
2008-03-29 7:19 ` Robin Rosenberg
2008-03-29 7:22 ` Jeff King
2008-03-29 8:41 ` Robin Rosenberg
2008-03-29 8:49 ` Jeff King
2008-03-29 9:02 ` Robin Rosenberg
2008-03-29 9:11 ` Jeff King
2008-03-29 9:39 ` Robin Rosenberg
2008-03-29 9:43 ` Jeff King
2008-03-29 12:54 ` Robin Rosenberg
2008-03-29 21:45 ` Jeff King
2008-03-30 3:40 ` Sam Vilain
2008-03-30 4:39 ` Jeff King
2008-03-30 23:47 ` Junio C Hamano
2008-03-29 8:44 ` Robin Rosenberg
2008-03-29 8:53 ` Jeff King
2008-03-29 9:38 ` Robin Rosenberg
2008-03-29 9:52 ` Jeff King
2008-03-29 12:54 ` Robin Rosenberg
2008-03-29 21:18 ` Jeff King
2008-03-29 21:43 ` Robin Rosenberg
2008-03-29 22:00 ` Jeff King [this message]
2008-03-30 2:12 ` Sam Vilain
2008-03-30 4:31 ` Jeff King
2008-05-21 19:39 ` Junio C Hamano
2008-05-21 19:47 ` Jeff King
[not found] <7caf19ae394accab538d2f94953bb62b55a2c79f.1206486012.git.peff@peff.net>
2008-03-25 23:03 ` Jeff King
2008-03-26 5:59 ` Teemu Likonen
2008-03-26 6:20 ` Jeff King
2008-03-26 8:30 ` Teemu Likonen
2008-03-26 8:39 ` Jeff King
2008-03-26 9:23 ` Teemu Likonen
2008-03-26 9:32 ` Teemu Likonen
2008-03-26 9:35 ` Jeff King
2008-03-26 9:33 ` Jeff King
2008-03-27 7:38 ` Jeff King
2008-03-27 19:44 ` Todd Zullinger
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20080329220011.GC30851@coredump.intra.peff.net \
--to=peff@peff.net \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=robin.rosenberg.lists@dewire.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).