From: "Uwe Kleine-König" <zeisberg@informatik.uni-freiburg.de>
To: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Cc: Junio C Hamano <junkio@cox.net>,
Alexander Litvinov <litvinov2004@gmail.com>,
git@vger.kernel.org
Subject: Re: specify charset for commits
Date: Fri, 22 Dec 2006 16:09:49 +0100
Message-ID: <20061222150948.GA6005@cepheus>
In-Reply-To: <Pine.LNX.4.63.0612220351520.19693@wbgn013.biozentrum.uni-wuerzburg.de>

Hello Johannes,

Johannes Schindelin wrote:
> The problem is: you cannot easily recognize if it is UTF8 or not,
> programatically. There is a good indicator _against_ UTF8, namely the
> first byte can _only_ be 0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx. But there
> is no _positive_ sign that it is UTF8. For example, many umlauts and other
> special modifications to letters, stay in the range 0x7f-0xff.
That's not the only indication. Here is a (Python) function that
checks whether the byte string s is correctly UTF-8 encoded:

    def is_utf8_str(s):
        # Number of continuation bytes (10xxxxxx) still expected.
        cnt_furtherbytes = 0
        # bytearray() yields integers on both Python 2 and Python 3.
        for c in bytearray(s):
            if cnt_furtherbytes > 0:
                if c & 0xc0 == 0x80:
                    cnt_furtherbytes -= 1
                else:
                    return False
            else:
                if c < 0x80:
                    continue
                elif c < 0xc0:
                    return False
                elif c < 0xe0:
                    cnt_furtherbytes = 1
                elif c < 0xf0:
                    cnt_furtherbytes = 2
                elif c < 0xf8:
                    cnt_furtherbytes = 3
                elif c < 0xfc:
                    cnt_furtherbytes = 4
                elif c < 0xfe:
                    cnt_furtherbytes = 5
                else:
                    return False
        # A sequence cut off at the end of the string is invalid too.
        return cnt_furtherbytes == 0

A UTF-8 character is either one byte long with the msb 0, or a sequence
starting with a value between 0xc0 and 0xfd (inclusive) followed, depending
on that first value, by up to five further bytes in the range 0x80 to 0xbf.
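
To make the layout concrete, here is a small sketch (not part of the
function above; the helper name utf8_seq_len is made up for this
illustration) that reads the total sequence length off the first byte,
using the old 1..6 byte scheme described in utf8(7):

```python
def utf8_seq_len(first_byte):
    """Return the total sequence length implied by a lead byte, or 0 if
    the byte cannot start a sequence (hypothetical helper, 1..6 byte scheme)."""
    if first_byte < 0x80:
        return 1          # 0xxxxxxx: plain ASCII
    if first_byte < 0xc0:
        return 0          # 10xxxxxx: continuation byte, not a valid start
    if first_byte < 0xe0:
        return 2          # 110xxxxx
    if first_byte < 0xf0:
        return 3          # 1110xxxx
    if first_byte < 0xf8:
        return 4          # 11110xxx
    if first_byte < 0xfc:
        return 5          # 111110xx
    if first_byte < 0xfe:
        return 6          # 1111110x
    return 0              # 0xfe and 0xff never start a sequence


# "ü" is 0xc3 0xbc in UTF-8 but the single byte 0xfc in ISO-8859-1.
print(utf8_seq_len(0xc3))  # 2
print(utf8_seq_len(0xfc))  # 6
```

So an ISO-8859-1 "ü" would have to be followed by five bytes in the
range 0x80 to 0xbf to pass as UTF-8, which ordinary text is unlikely to
provide.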
You could be even stricter by checking for Unicode 3.1 conformance
(i.e. a character has to be encoded in its shortest form).
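
A rough sketch of what such a shortest-form check could look like
follows. The names is_shortest_form and MIN_CODE_POINT are invented for
this example, and the function assumes it is handed exactly one sequence
that already passed the structural check (correct lead byte, correct
number of continuation bytes):

```python
# Smallest code point that legitimately needs a sequence of each length
# (hypothetical table for this sketch, 1..6 byte scheme).
MIN_CODE_POINT = {1: 0x0, 2: 0x80, 3: 0x800,
                  4: 0x10000, 5: 0x200000, 6: 0x4000000}


def is_shortest_form(seq):
    """seq: byte string holding one structurally valid UTF-8 sequence."""
    data = bytearray(seq)
    n = len(data)
    if n == 1:
        return data[0] < 0x80
    # Payload bits of the lead byte: 5 bits for n == 2, 4 for n == 3, ...
    code_point = data[0] & (0x7f >> n)
    for b in data[1:]:
        # Each continuation byte contributes its low six bits.
        code_point = (code_point << 6) | (b & 0x3f)
    # Overlong forms decode to a value that a shorter sequence could hold.
    return code_point >= MIN_CODE_POINT[n]


print(is_shortest_form(b'\xc3\xbc'))  # "ü" in two bytes
print(is_shortest_form(b'\xc0\xaf'))  # "/" padded out to two bytes
```

The first call accepts the regular encoding of "ü", the second rejects
the overlong encoding of "/". Python 3's built-in strict decoder
(bytes.decode('utf-8')) also refuses overlong forms, so in practice it
may be simpler to rely on that.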
Look at utf8(7) for further details. (This manpage is included in the
Debian manpages package.)
Best regards
Uwe
--
Uwe Kleine-König
http://www.google.com/search?q=5+choose+3