git.vger.kernel.org archive mirror
From: "Uwe Kleine-König" <zeisberg@informatik.uni-freiburg.de>
To: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Cc: Junio C Hamano <junkio@cox.net>,
	Alexander Litvinov <litvinov2004@gmail.com>,
	git@vger.kernel.org
Subject: Re: specify charset for commits
Date: Fri, 22 Dec 2006 16:09:49 +0100
Message-ID: <20061222150948.GA6005@cepheus>
In-Reply-To: <Pine.LNX.4.63.0612220351520.19693@wbgn013.biozentrum.uni-wuerzburg.de>

Hello Johannes,

Johannes Schindelin wrote:
> The problem is: you cannot easily recognize if it is UTF8 or not, 
> programatically. There is a good indicator _against_ UTF8, namely the 
> first byte can _only_ be 0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx. But there 
> is no _positive_ sign that it is UTF8. For example, many umlauts and other 
> special modifications to letters, stay in the range 0x7f-0xff.
That's not the only indication.  Here is a (Python) function that
checks whether a string s is correctly UTF-8 encoded:

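	def is_utf8_str(s):
	    """Return True if the byte string s is structurally valid UTF-8.

	    Follows the 1- to 6-byte scheme from utf8(7).  Overlong forms
	    are not rejected here.
	    """
	    cnt_furtherbytes = 0  # continuation bytes still expected
	    for c in bytearray(s):  # yields integers in the range 0..255
	        if cnt_furtherbytes > 0:
	            # inside a multi-byte sequence: must be 10xxxxxx
	            if c & 0xc0 == 0x80:
	                cnt_furtherbytes -= 1
	            else:
	                return False
	        else:
	            if c < 0x80:
	                continue  # 0xxxxxxx, plain ASCII
	            elif c < 0xc0:
	                return False  # continuation byte without a lead byte
	            elif c < 0xe0:
	                cnt_furtherbytes = 1
	            elif c < 0xf0:
	                cnt_furtherbytes = 2
	            elif c < 0xf8:
	                cnt_furtherbytes = 3
	            elif c < 0xfc:
	                cnt_furtherbytes = 4
	            elif c < 0xfe:
	                cnt_furtherbytes = 5
	            else:
	                return False  # 0xfe and 0xff never occur
	    # a sequence cut off at the end of the string is invalid, too
	    return cnt_furtherbytes == 0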

A UTF-8 character is either one byte long with the msb 0, or a sequence
starting with a value between 0xc0 and 0xfd (inclusive) followed,
depending on that first value, by one to five further bytes in the range
0x80 to 0xbf.
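
To make the byte patterns concrete, here is a small sketch that encodes
one name both ways and asks Python's own strict decoder about the result.
The helper name decodes_as_utf8 and the sample text are only for
illustration; they are not part of git or of the function above.

```python
def decodes_as_utf8(data):
    """Ask Python's strict UTF-8 decoder whether data is valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

text = u"K\u00f6nig"                    # "König"
utf8_bytes = text.encode("utf-8")       # o-umlaut becomes 0xc3 0xb6
latin1_bytes = text.encode("latin-1")   # o-umlaut becomes the single byte 0xf6

print(decodes_as_utf8(utf8_bytes))      # lead byte 0xc3, one continuation byte
print(decodes_as_utf8(latin1_bytes))    # 0xf6 is followed by 'n', not by 10xxxxxx
```

The Latin-1 version fails because a byte above 0x7f has to be followed by
continuation bytes, which ordinary 8-bit text almost never provides.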

You could even be more strict by checking for Unicode 3.1 conformance
(i.e. a character has to be encoded in its shortest form).
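
A rough sketch of what such a stricter check might look like follows.  It
decodes each sequence and rejects it if the value would have fit into
fewer bytes.  The name is_strict_utf8 and the min_value table are made up
for this example; surrogates and values above U+10FFFF are still
accepted, so this is not a full conformance test.

```python
def is_strict_utf8(data):
    """Structural UTF-8 check that also rejects overlong encodings."""
    # smallest value that legitimately needs a sequence of the given length
    min_value = {2: 0x80, 3: 0x800, 4: 0x10000, 5: 0x200000, 6: 0x4000000}
    data = bytearray(data)
    i = 0
    while i < len(data):
        c = data[i]
        if c < 0x80:             # plain ASCII
            i += 1
            continue
        if c < 0xc0 or c >= 0xfe:
            return False         # stray continuation byte, or 0xfe/0xff
        # sequence length and payload bits of the lead byte
        if c < 0xe0:
            length, value = 2, c & 0x1f
        elif c < 0xf0:
            length, value = 3, c & 0x0f
        elif c < 0xf8:
            length, value = 4, c & 0x07
        elif c < 0xfc:
            length, value = 5, c & 0x03
        else:
            length, value = 6, c & 0x01
        if i + length > len(data):
            return False         # sequence is cut off
        for b in data[i + 1:i + length]:
            if b & 0xc0 != 0x80:
                return False     # not a continuation byte
            value = (value << 6) | (b & 0x3f)
        if value < min_value[length]:
            return False         # overlong: a shorter form exists
        i += length
    return True
```

For instance, the two-byte sequence 0xc0 0xaf decodes to 0x2f ('/'),
which fits into one byte, so the sketch rejects it, while 0xc3 0xb6
(o-umlaut) passes.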

Look at utf8(7) for further details.  (This manpage is included in the
Debian manpages package.)

Best regards
Uwe

-- 
Uwe Kleine-König

http://www.google.com/search?q=5+choose+3
