* [PATCH] Fix documentation copy&paste typo
@ 2006-12-08 11:44 Uwe Kleine-Koenig
  2006-12-19 14:16 ` Uwe Kleine-König
  0 siblings, 1 reply; 39+ messages in thread
From: Uwe Kleine-Koenig @ 2006-12-08 11:44 UTC (permalink / raw)
  To: git; +Cc: Uwe Zeisberger
From: Uwe Zeisberger <zeisberg@informatik.uni-freiburg.de>
This was introduced in 45a3b12cfd3eaa05bbb0954790d5be5b8240a7b5
Signed-off-by: Uwe Kleine-König <zeisberg@informatik.uni-freiburg.de>
---
 gitweb/gitweb.perl |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 093bd72..ed40810 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -120,7 +120,7 @@ our %feature = (
 	# To disable system wide have in $GITWEB_CONFIG
 	# $feature{'snapshot'}{'default'} = [undef];
 	# To have project specific config enable override in $GITWEB_CONFIG
-	# $feature{'blame'}{'override'} = 1;
+	# $feature{'snapshot'}{'override'} = 1;
 	# and in project config gitweb.snapshot = none|gzip|bzip2;
 	'snapshot' => {
 		'sub' => \&feature_snapshot,
-- 
1.4.4.2.gb772
* Re: [PATCH] Fix documentation copy&paste typo
  2006-12-08 11:44 [PATCH] Fix documentation copy&paste typo Uwe Kleine-Koenig
@ 2006-12-19 14:16 ` Uwe Kleine-König
  2006-12-19 17:27   ` Junio C Hamano
  0 siblings, 1 reply; 39+ messages in thread
From: Uwe Kleine-König @ 2006-12-19 14:16 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
Hello Junio,
Uwe Kleine-König wrote:
> From: Uwe Zeisberger <zeisberg@informatik.uni-freiburg.de>
> 
> This was introduced in 45a3b12cfd3eaa05bbb0954790d5be5b8240a7b5
> 
> Signed-off-by: Uwe Kleine-König <zeisberg@informatik.uni-freiburg.de>
> ---
> [...]
you took that patch as bbee1d971dc07c29f840b439aa2a2c890a12cf9f, thanks
for that.
Somehow the 'ö' (o-umlaut) in my name was messed up.  If I do
	git cat-file -p bbee1d971dc07c29 | xxd | grep eine
I get:
	0000160: 6569 6e65 2d4b 1b2c 4143 361b 2842 6e69 eine-K.,AC6.(Bni
That is, the 'ö' became 8 bytes long.  Can you tell me what went wrong
there?
The commits by Karl Hasselström <kha@treskal.com> (e.g. e67c66251a4165)
use UTF-8.
Does there exist a (maybe project specific) convention for the encoding
of commit logs?
Best regards
Uwe
-- 
Uwe Kleine-Koenig
* Re: [PATCH] Fix documentation copy&paste typo
  2006-12-19 14:16 ` Uwe Kleine-König
@ 2006-12-19 17:27   ` Junio C Hamano
  2006-12-21  8:59     ` specify charset for commits (Was: [PATCH] Fix documentation copy&paste typo) Uwe Kleine-König
  0 siblings, 1 reply; 39+ messages in thread
From: Junio C Hamano @ 2006-12-19 17:27 UTC (permalink / raw)
  To: Uwe Kleine-König; +Cc: git
Uwe Kleine-König <zeisberg@informatik.uni-freiburg.de> writes:
> I get:
>
> 	0000160: 6569 6e65 2d4b 1b2c 4143 361b 2842 6e69 eine-K.,AC6.(Bni
>
> That is, the 'ö' became 8 byte long.  Can you tell me what went wrong
> there?
Me, keyboard and Emacs screwed up and stored it in ISO-2022
instead of UTF-8.  Sorry.
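
The hexdump quoted above can be taken apart byte by byte. The following is a minimal Python sketch, not part of the original exchange. The reading of the escape sequences (ESC , A as a switch to the upper half of Latin-1 in 7-bit form, ESC ( B as the switch back to ASCII) is an assumption about how the editor's ISO-2022 coding works; the thread itself only says "ISO-2022".

```python
# Bytes copied from the xxd output quoted above, starting right after "eine-K".
observed = bytes.fromhex("1b2c4143361b2842")

# What a UTF-8 encoded commit object would hold for the same character
# (U+00F6, LATIN SMALL LETTER O WITH DIAERESIS).
expected = "\u00f6".encode("utf-8")

print(len(observed))  # 8
print(len(expected))  # 2

# Assumption: the first three bytes (ESC , A) and the last three (ESC ( B)
# are escape sequences, leaving two payload bytes in between.
payload = observed[3:5]

# In a 7-bit coding the high bit is stripped; set it again to compare.
restored = bytes(b | 0x80 for b in payload)
print(restored == expected)
```

Under that assumption the two payload bytes are the UTF-8 bytes of the umlaut with the high bit cleared, which would suggest the UTF-8 sequence was treated as two separate Latin-1 characters before being stored. That is only one possible reading of the dump.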
* specify charset for commits (Was: [PATCH] Fix documentation copy&paste typo)
  2006-12-19 17:27   ` Junio C Hamano
@ 2006-12-21  8:59     ` Uwe Kleine-König
  2006-12-21  9:51       ` Johannes Schindelin
  0 siblings, 1 reply; 39+ messages in thread
From: Uwe Kleine-König @ 2006-12-21  8:59 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
Hello Junio,
Junio C Hamano wrote:
> Me, keyboard and Emacs screwed up and stored it in ISO-2022
> instead of UTF-8.  Sorry.
It's a pity, but too late to change.[1]
What do you think about a patch that makes git-commit-tree call iconv on
its input to convert it to UTF-8 (or any other charset)?  Maybe it
makes sense to add another header to commit objects (e.g.
"charset UTF-8") if something in the commit object is non-ASCII?
In my eyes it would make sense to even force UTF-8 for commit logs (and
author, committer).  The downside is that it becomes impossible to store
arbitrary byte sequences in commit objects.  (IMHO not a real
limitation.)
Best regards
Uwe
[1] actually I think it's worse, because my iconv (from Debian's libc6
    version 2.3.6.ds1-8) was unable to convert it correctly to utf-8 for
    any encoding that starts with ISO-2022.
-- 
Uwe Kleine-König
http://www.google.com/search?q=sin%28pi%2F2%29
* Re: specify charset for commits (Was: [PATCH] Fix documentation copy&paste typo)
  2006-12-21  8:59     ` specify charset for commits (Was: [PATCH] Fix documentation copy&paste typo) Uwe Kleine-König
@ 2006-12-21  9:51       ` Johannes Schindelin
  2006-12-21 10:11         ` Santi Béjar
  2006-12-21 10:23         ` Alexander Litvinov
  0 siblings, 2 replies; 39+ messages in thread
From: Johannes Schindelin @ 2006-12-21  9:51 UTC (permalink / raw)
  To: Uwe Kleine-König; +Cc: Junio C Hamano, git
Hi,
On Thu, 21 Dec 2006, Uwe Kleine-König wrote:
> Junio C Hamano wrote:
> > Me, keyboard and Emacs screwed up and stored it in ISO-2022
> > instead of UTF-8.  Sorry.
> It's a pity, but too late to change.[1]
> 
> What do you think about a patch that makes git-commit-tree call iconv on
> its input to get it to UTF-8 (or any other charset).
We had this discussion over and over again. Last time (I think) was here:
http://article.gmane.org/gmane.comp.version-control.git/11710
Summary: we do not want to force the use of utf8.
Hth,
Dscho
* Re: specify charset for commits (Was: [PATCH] Fix documentation copy&paste typo)
  2006-12-21  9:51       ` Johannes Schindelin
@ 2006-12-21 10:11         ` Santi Béjar
  2006-12-21 10:23         ` Alexander Litvinov
  1 sibling, 0 replies; 39+ messages in thread
From: Santi Béjar @ 2006-12-21 10:11 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Uwe Kleine-König, Junio C Hamano, git
On 12/21/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> Hi,
>
> On Thu, 21 Dec 2006, Uwe Kleine-König wrote:
>
> > Junio C Hamano wrote:
> > > Me, keyboard and Emacs screwed up and stored it in ISO-2022
> > > instead of UTF-8.  Sorry.
> > It's a pity, but too late to change.[1]
> >
> > What do you think about a patch that makes git-commit-tree call iconv on
> > its input to get it to UTF-8 (or any other charset).
>
> We had this discussion over and over again. Last time (I think) was here:
>
> http://article.gmane.org/gmane.comp.version-control.git/11710
>
> Summary: we do not want to force the use of utf8.
>
Maybe git could have an example of the hook "commit-msg" that checks
that the commit message is in certain charset.
Santi
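
A check of the kind suggested here could look roughly like the sketch below. It is written in Python for illustration and is not an existing sample hook shipped with git. The function names `message_is_valid` and `check_message_file` are made up for this sketch, and the default encoding of UTF-8 is an assumption. The one thing taken from git itself is that a commit-msg hook receives the path of the file holding the proposed message as its first argument.

```python
import sys


def message_is_valid(raw, encoding="utf-8"):
    """Return True if the raw message bytes decode cleanly in `encoding`."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


def check_message_file(path, encoding="utf-8"):
    """Return a hook exit status: 0 accepts the commit, 1 rejects it."""
    with open(path, "rb") as handle:
        raw = handle.read()
    if message_is_valid(raw, encoding):
        return 0
    sys.stderr.write("commit message is not valid %s\n" % encoding)
    return 1
```

Installed as an executable .git/hooks/commit-msg script, the file would end with a call such as sys.exit(check_message_file(sys.argv[1])). A non-zero exit status makes git abort the commit.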
* Re: specify charset for commits (Was: [PATCH] Fix documentation copy&paste typo)
  2006-12-21  9:51       ` Johannes Schindelin
  2006-12-21 10:11         ` Santi Béjar
@ 2006-12-21 10:23         ` Alexander Litvinov
  2006-12-21 10:52           ` Jakub Narebski
  2006-12-21 18:19           ` specify charset for commits Junio C Hamano
  1 sibling, 2 replies; 39+ messages in thread
From: Alexander Litvinov @ 2006-12-21 10:23 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Uwe Kleine-König, Junio C Hamano, git
> > What do you think about a patch that makes git-commit-tree call iconv on
> > its input to get it to UTF-8 (or any other charset).
>
> We had this discussion over and over again. Last time (I think) was here:
> http://article.gmane.org/gmane.comp.version-control.git/11710
> Summary: we do not want to force the use of utf8.
Maybe we can add a new header to the commit that records the encoding of the commit text?
* Re: specify charset for commits (Was: [PATCH] Fix documentation copy&paste typo)
  2006-12-21 10:23         ` Alexander Litvinov
@ 2006-12-21 10:52           ` Jakub Narebski
  2006-12-21 13:05             ` Alexander Litvinov
  2006-12-21 13:43             ` Uwe Kleine-König
  2006-12-21 18:19           ` specify charset for commits Junio C Hamano
  1 sibling, 2 replies; 39+ messages in thread
From: Jakub Narebski @ 2006-12-21 10:52 UTC (permalink / raw)
  To: git
Alexander Litvinov wrote:
>>> What do you think about a patch that makes git-commit-tree call iconv on
>>> its input to get it to UTF-8 (or any other charset).
>>
>> We had this discussion over and over again. Last time (I think) was here:
>> http://article.gmane.org/gmane.comp.version-control.git/11710
>> Summary: we do not want to force the use of utf8.
> 
> May we can add new header into commit with commit text encoding ?
I think it should be repository-wide decision. And we have
i18n.commitEncoding configuration variable (perhaps it should be propagated
on clone?).
-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git
* Re: specify charset for commits (Was: [PATCH] Fix documentation copy&paste typo)
  2006-12-21 10:52           ` Jakub Narebski
@ 2006-12-21 13:05             ` Alexander Litvinov
  2006-12-21 13:14               ` Jakub Narebski
  2006-12-21 13:43             ` Uwe Kleine-König
  1 sibling, 1 reply; 39+ messages in thread
From: Alexander Litvinov @ 2006-12-21 13:05 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git
On Thursday 21 December 2006 16:52, Jakub Narebski wrote:
> > May we can add new header into commit with commit text encoding ?
>
> I think it should be repository-wide decision. And we have
> i18n.commitEncoding configuration variable (perhaps it should be propagated
> on clone?).
I would disagree with you. It is not hard to imagine an international project
managed by git. We [developers] can start to use UTF-8 or a similar universal
encoding, but that is not always easy. For example, not long ago all Russian
Linux machines had LANG set to ru_RU.KOI8-R; now it tends to be ru_RU.UTF-8.
It would not be a big surprise to me if a developer from China or Japan used
something very unusual. And just imagine one developer using Windows and
Cygwin - ha ha ha, try to ask him to change the encoding :-)
The easiest way for git is just to store the commit encoding and let the tool
for history browsing deal with it. Or do as it does now: simply ignore
encodings altogether and work with bytes, not chars. But in that case the
history browsing tool must have some magic knowledge about the encoding, taken
out of thin air :-). Or from a config file that covers the most-used encodings.
Alexander Litvinov.
* Re: specify charset for commits (Was: [PATCH] Fix documentation copy&paste typo)
  2006-12-21 13:05             ` Alexander Litvinov
@ 2006-12-21 13:14               ` Jakub Narebski
  0 siblings, 0 replies; 39+ messages in thread
From: Jakub Narebski @ 2006-12-21 13:14 UTC (permalink / raw)
  To: git
Alexander Litvinov wrote:
> On Thursday 21 December 2006 16:52, Jakub Narebski wrote:
>>> May we can add new header into commit with commit text encoding ?
>>
>> I think it should be repository-wide decision. And we have
>> i18n.commitEncoding configuration variable (perhaps it should be propagated
>> on clone?).
> 
> I would disagree with you. Is is not hard to imagine international project 
> managed by git. We [developers] can start to use utf-8 or similar universal 
> encoding but it is not easy sometimes. Fir example, not long ago all russian 
> linux machines has LANG set to ru_RU.KOI8-R, now it tend to be ru_RU.UTF-8. 
> It will not big surprise to me if developr from China or Japan use something 
> very unusual. And just imagine one developer using Windows and Cygwin - ha ha 
> ha, try to ask him to change the encoding :-)
> 
> The easiest way for git is just to store commit encoding and let tool for 
> history browsing deal with encoding. Or as it does now - simply ignore all 
> encodings at all and work with bytes not chars. But at this case history 
> browsing tool must have some magic knowlage about encoding taken from 
> air :-). Or from config file that cover most used encodings.
Perhaps it is time to resurrect the idea of a "note" header in the commit object?
It could be used to store the charset for those commits that don't use
the default charset...
-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git
* Re: specify charset for commits (Was: [PATCH] Fix documentation copy&paste typo)
  2006-12-21 10:52           ` Jakub Narebski
  2006-12-21 13:05             ` Alexander Litvinov
@ 2006-12-21 13:43             ` Uwe Kleine-König
  1 sibling, 0 replies; 39+ messages in thread
From: Uwe Kleine-König @ 2006-12-21 13:43 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git
Hello,
Jakub Narebski wrote:
> Alexander Litvinov wrote:
> 
> >>> What do you think about a patch that makes git-commit-tree call iconv on
> >>> its input to get it to UTF-8 (or any other charset).
> >>
> >> We had this discussion over and over again. Last time (I think) was here:
> >> http://article.gmane.org/gmane.comp.version-control.git/11710
> >> Summary: we do not want to force the use of utf8.
> > 
> > May we can add new header into commit with commit text encoding ?
> 
> I think it should be repository-wide decision. And we have
> i18n.commitEncoding configuration variable
The disadvantage of a repository-wide decision is that you cannot
change it after a while.
I didn't know about that variable, but I think that, since it exists,
git-commit-tree should iconv the commit message from the local encoding to
i18n.commitEncoding before writing it.
Moreover I like the idea of a new header for commits specifying the
encoding.  Git could default to the current handling (i.e. just
bytes) if the header is missing.
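
The two ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the header name "charset" follows the "charset UTF-8" example suggested earlier in this thread and is hypothetical, as are the function names, and the parsing assumes a commit body whose header block and message are separated by one empty line.

```python
def reencode(message, local_encoding, commit_encoding):
    """Convert message bytes from the local encoding to the commit encoding."""
    return message.decode(local_encoding).encode(commit_encoding)


def decode_message(raw_commit):
    """Decode the message via a hypothetical 'charset' header, if present.

    Without such a header the raw bytes are returned unchanged, which
    mirrors the "just bytes" fallback described above.
    """
    head, _, message = raw_commit.partition(b"\n\n")
    for line in head.split(b"\n"):
        if line.startswith(b"charset "):
            name = line[len(b"charset "):].decode("ascii")
            return message.decode(name)
    return message
```

A reader tool would then get text when the header exists and bytes when it does not, so old commits keep working as before.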
Best regards
Uwe
-- 
Uwe Kleine-König
http://www.google.com/search?q=1+newton+in+kg*m+%2F+s%5E2
* Re: specify charset for commits
  2006-12-21 10:23         ` Alexander Litvinov
  2006-12-21 10:52           ` Jakub Narebski
@ 2006-12-21 18:19           ` Junio C Hamano
  2006-12-21 18:48             ` Nicolas Pitre
                               ` (3 more replies)
  1 sibling, 4 replies; 39+ messages in thread
From: Junio C Hamano @ 2006-12-21 18:19 UTC (permalink / raw)
  To: Alexander Litvinov; +Cc: Johannes Schindelin, Uwe Kleine-König, git
Alexander Litvinov <litvinov2004@gmail.com> writes:
> May we can add new header into commit with commit text encoding ?
I do not think we want to change the commit header, nor would we
want to re-encode, but I can see two possible improvements:
 (1) git-am should default to -u; this was suggested on the list
     a long time ago, but is an incompatible change.  With v1.5.0 we
     can afford to be incompatible to make it more usable and
     safer.
 (2) update commit-tree to reject non utf-8 log messages and
     author/committer names when i18n.commitEncoding is _NOT_
     set, or set to utf-8.
     Maybe later we can use encoding validation routines for
     other encodings by checking i18n.commitEncoding, but at the
     minimum the above would be safe enough for recommended UTF-8
     only cases.
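
The rule in (2) amounts to a small decision function. The Python sketch below is an illustration only, not the eventual git implementation. `should_reject` and `is_utf8` are hypothetical names, `None` stands for an unset i18n.commitEncoding, and Python's built-in strict UTF-8 decoder is swapped in for whatever validation routine git would use.

```python
def is_utf8(raw):
    """Return True if the bytes decode under Python's strict UTF-8 decoder."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def should_reject(raw, commit_encoding=None):
    """Validate only when the encoding is unset or set to UTF-8."""
    if commit_encoding is None or commit_encoding.lower() in ("utf-8", "utf8"):
        return not is_utf8(raw)
    # Any other configured encoding is accepted without validation here.
    return False
```

With this shape, validators for further encodings could later be added as extra branches keyed on the configured value.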
* Re: specify charset for commits
  2006-12-21 18:19           ` specify charset for commits Junio C Hamano
@ 2006-12-21 18:48             ` Nicolas Pitre
  2006-12-21 19:11             ` Uwe Kleine-König
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 39+ messages in thread
From: Nicolas Pitre @ 2006-12-21 18:48 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Alexander Litvinov, Johannes Schindelin, Uwe Kleine-König,
	git
On Thu, 21 Dec 2006, Junio C Hamano wrote:
> Alexander Litvinov <litvinov2004@gmail.com> writes:
> 
> > May we can add new header into commit with commit text encoding ?
> 
> I do not think we want to change the commit header, nor we would
> want to re-encode, but I can see two possible improvements:
> 
>  (1) git-am should default to -u; this was suggested on the list
>      long time ago, but is an incompatible change.  v1.5.0 we
>      can afford to be incompatible to make it more usable and
>      safer.
> 
>  (2) update commit-tree to reject non utf-8 log messages and
>      author/committer names when i18n.commitEncoding is _NOT_
>      set, or set to utf-8.
This would be a good thing, both of them actually.
The Linux kernel already contains different charsets for author full 
name and/or commit messages making git-log output a mix of encodings 
already.  And sometimes this inconsistency comes from the _same_ author.
Nicolas
* Re: specify charset for commits
  2006-12-21 18:19           ` specify charset for commits Junio C Hamano
  2006-12-21 18:48             ` Nicolas Pitre
@ 2006-12-21 19:11             ` Uwe Kleine-König
  2006-12-21 19:36             ` Alexander Litvinov
  2006-12-22 12:07             ` Johannes Schindelin
  3 siblings, 0 replies; 39+ messages in thread
From: Uwe Kleine-König @ 2006-12-21 19:11 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Alexander Litvinov, Johannes Schindelin, git
Junio C Hamano wrote:
> Alexander Litvinov <litvinov2004@gmail.com> writes:
> 
> > May we can add new header into commit with commit text encoding ?
> 
> I do not think we want to change the commit header, nor we would
> want to re-encode, but I can see two possible improvements:
> 
>  (1) git-am should default to -u; this was suggested on the list
>      long time ago, but is an incompatible change.  v1.5.0 we
>      can afford to be incompatible to make it more usable and
>      safer.
> 
>  (2) update commit-tree to reject non utf-8 log messages and
>      author/committer names when i18n.commitEncoding is _NOT_
>      set, or set to utf-8.
> 
>      Maybe later we can use encoding validation routines for
>      other encodings by checking i18n.commitEncoding, but at the
>      minimum the above would be safe enough for recommended UTF-8
>      only cases.
As I only want to use UTF-8 both suggestions are fine for me.
Is there a generic way to check an encoding?  (I don't know if there is
an encoding that can encode everything.  If so, we could use iconv -f
$enc -t superencoding.  Until now I thought UTF-8 could do that, but in the
post Johannes pointed to, you (Junio) implied that it cannot.)
Best regards
Uwe
-- 
Uwe Kleine-König
dd if=/proc/self/exe bs=1 skip=1 count=3 2>/dev/null
* Re: specify charset for commits
  2006-12-21 18:19           ` specify charset for commits Junio C Hamano
  2006-12-21 18:48             ` Nicolas Pitre
  2006-12-21 19:11             ` Uwe Kleine-König
@ 2006-12-21 19:36             ` Alexander Litvinov
  2006-12-22 12:07             ` Johannes Schindelin
  3 siblings, 0 replies; 39+ messages in thread
From: Alexander Litvinov @ 2006-12-21 19:36 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Johannes Schindelin, Uwe Kleine-König, git
> I do not think we want to change the commit header
Can you please explain why not ?
>  (1) git-am should default to -u; this was suggested on the list
>      long time ago, but is an incompatible change.  v1.5.0 we
>      can afford to be incompatible to make it more usable and
>      safer.
I use git-am rarely so can't comment on this
>  (2) update commit-tree to reject non utf-8 log messages and
>      author/committer names when i18n.commitEncoding is _NOT_
>      set, or set to utf-8.
>
>      Maybe later we can use encoding validation routines for
>      other encodings by checking i18n.commitEncoding, but at the
>      minimum the above would be safe enough for recommended UTF-8
>      only cases.
Consider this situation:
1. I have a UTF-8 encoded repo.
2. Somebody clones my repo, tries to commit using a non-UTF-8 encoding, fails,
and changes i18n.commitEncoding. Then he commits something and asks me to pull.
3. I pull and get a non-UTF-8 commit message :-)
* Re: specify charset for commits
  2006-12-21 18:19           ` specify charset for commits Junio C Hamano
                               ` (2 preceding siblings ...)
  2006-12-21 19:36             ` Alexander Litvinov
@ 2006-12-22 12:07             ` Johannes Schindelin
  2006-12-22 15:09               ` Uwe Kleine-König
  2006-12-22 15:31               ` Nicolas Pitre
  3 siblings, 2 replies; 39+ messages in thread
From: Johannes Schindelin @ 2006-12-22 12:07 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Alexander Litvinov, Uwe Kleine-König, git
Hi,
On Thu, 21 Dec 2006, Junio C Hamano wrote:
>  (2) update commit-tree to reject non utf-8 log messages and
>      author/committer names when i18n.commitEncoding is _NOT_
>      set, or set to utf-8.
The problem is: you cannot easily recognize programmatically whether it is
UTF8 or not. There is a good indicator _against_ UTF8, namely that the
first byte can _only_ be 0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx. But there
is no _positive_ sign that it is UTF8. For example, many umlauts and other
special modifications to letters stay in the range 0x7f-0xff.
Ciao,
Dscho
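
The asymmetry described in this message can be shown with two short byte strings. This is a minimal Python sketch. The helper `decodes_as` is a hypothetical name, and Python's strict decoders stand in for whatever check git would perform.

```python
def decodes_as(raw, encoding):
    """Return True if the bytes are acceptable input for `encoding`."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# A Latin-1 encoded name: 0xf6 is not followed by continuation bytes,
# so this is a clear sign *against* UTF-8.
latin1_name = b"K\xf6nig"
print(decodes_as(latin1_name, "utf-8"))    # False

# These two bytes are acceptable in both encodings: one character in
# UTF-8, two unrelated characters in Latin-1.
ambiguous = b"\xc3\xb6"
as_utf8 = ambiguous.decode("utf-8")
as_latin1 = ambiguous.decode("latin-1")
print(len(as_utf8), len(as_latin1))        # 1 2
```

So a failed check proves the text is not UTF-8, while a passed check only shows that the bytes could be UTF-8.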
* Re: specify charset for commits
  2006-12-22 12:07             ` Johannes Schindelin
@ 2006-12-22 15:09               ` Uwe Kleine-König
  2006-12-22 22:02                 ` Uwe Kleine-König
  2006-12-22 15:31               ` Nicolas Pitre
  1 sibling, 1 reply; 39+ messages in thread
From: Uwe Kleine-König @ 2006-12-22 15:09 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Junio C Hamano, Alexander Litvinov, git
Hello Johannes,
Johannes Schindelin wrote:
> The problem is: you cannot easily recognize if it is UTF8 or not, 
> programatically. There is a good indicator _against_ UTF8, namely the 
> first byte can _only_ be 0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx. But there 
> is no _positive_ sign that it is UTF8. For example, many umlauts and other 
> special modifications to letters, stay in the range 0x7f-0xff.
That's not the only indication.  Here comes a (Python) function that
checks whether string s is correctly UTF-8 encoded:
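	def is_utf8_str(s):
	    """Return True if the byte string s is structurally valid UTF-8."""
	    cnt_furtherbytes = 0  # continuation bytes still expected
	    for c in bytearray(s):  # yields integers on Python 2.6+ and 3
	        if cnt_furtherbytes > 0:
	            # inside a sequence: the byte must look like 10xxxxxx
	            if (c & 0xc0) != 0x80:
	                return False
	            cnt_furtherbytes -= 1
	        elif c < 0x80:
	            continue  # plain ASCII
	        elif c < 0xc0:
	            return False  # stray continuation byte
	        elif c < 0xe0:
	            cnt_furtherbytes = 1
	        elif c < 0xf0:
	            cnt_furtherbytes = 2
	        elif c < 0xf8:
	            cnt_furtherbytes = 3
	        elif c < 0xfc:
	            cnt_furtherbytes = 4
	        elif c < 0xfe:
	            cnt_furtherbytes = 5
	        else:
	            return False  # 0xfe and 0xff never occur in UTF-8
	    # a sequence cut off at the end of the string is invalid too
	    return cnt_furtherbytes == 0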
A UTF-8 character is either one byte long with the msb 0, or a sequence
starting with a value between 0xc0 and 0xfd (inclusive) followed, depending on
that first value, by up to five further bytes in the range 0x80 to 0xbf.
You could even be more strict by checking for Unicode 3.1 conformance
(i.e. a character has to be encoded in its shortest form).
Look at utf8(7) for further details.  (This manpage is included in the
Debian manpages package.)
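
The difference between a purely structural check and the stricter shortest-form rule can be seen with one two-byte sequence. This is a minimal Python sketch under stated assumptions: `structurally_ok` is a hypothetical helper restricted to sequences of up to four bytes, and Python's built-in decoder is used as an example of a strict checker.

```python
def structurally_ok(raw):
    """Check lead-byte and continuation-byte patterns only."""
    pending = 0  # continuation bytes still expected
    for byte in bytearray(raw):
        if pending:
            if (byte & 0xC0) != 0x80:
                return False
            pending -= 1
        elif byte < 0x80:
            continue
        elif (byte & 0xE0) == 0xC0:
            pending = 1
        elif (byte & 0xF0) == 0xE0:
            pending = 2
        elif (byte & 0xF8) == 0xF0:
            pending = 3
        else:
            return False
    return pending == 0


def strictly_ok(raw):
    """Use Python's decoder, which also rejects non-shortest forms."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# "/" (U+002F) encoded in two bytes instead of the shortest one-byte form.
overlong_slash = b"\xc0\xaf"
print(structurally_ok(overlong_slash))  # True
print(strictly_ok(overlong_slash))      # False
```

A pattern-only check accepts the overlong form, which is why the stricter test is mentioned as a possible refinement.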
Best regards
Uwe
-- 
Uwe Kleine-König
http://www.google.com/search?q=5+choose+3
* Re: specify charset for commits
  2006-12-22 12:07             ` Johannes Schindelin
  2006-12-22 15:09               ` Uwe Kleine-König
@ 2006-12-22 15:31               ` Nicolas Pitre
  2006-12-22 19:01                 ` Junio C Hamano
  1 sibling, 1 reply; 39+ messages in thread
From: Nicolas Pitre @ 2006-12-22 15:31 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Junio C Hamano, Alexander Litvinov, Uwe Kleine-König, git
On Fri, 22 Dec 2006, Johannes Schindelin wrote:
> Hi,
> 
> On Thu, 21 Dec 2006, Junio C Hamano wrote:
> 
> >  (2) update commit-tree to reject non utf-8 log messages and
> >      author/committer names when i18n.commitEncoding is _NOT_
> >      set, or set to utf-8.
> 
> The problem is: you cannot easily recognize if it is UTF8 or not, 
> programatically. There is a good indicator _against_ UTF8, namely the 
> first byte can _only_ be 0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx. But there 
> is no _positive_ sign that it is UTF8. For example, many umlauts and other 
> special modifications to letters, stay in the range 0x7f-0xff.
Still... that would be a good enough thing to have in the majority of 
cases, wouldn't it?
Nicolas
* Re: specify charset for commits
  2006-12-22 15:31               ` Nicolas Pitre
@ 2006-12-22 19:01                 ` Junio C Hamano
  2006-12-22 21:03                   ` [PATCH 1/2] libgit.a: add some UTF-8 handling functions Johannes Schindelin
                                     ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Junio C Hamano @ 2006-12-22 19:01 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Johannes Schindelin, Alexander Litvinov, Uwe Kleine-König,
	git
Nicolas Pitre <nico@cam.org> writes:
> On Fri, 22 Dec 2006, Johannes Schindelin wrote:
>
>> Hi,
>> 
>> On Thu, 21 Dec 2006, Junio C Hamano wrote:
>> 
>> >  (2) update commit-tree to reject non utf-8 log messages and
>> >      author/committer names when i18n.commitEncoding is _NOT_
>> >      set, or set to utf-8.
>> 
>> The problem is: you cannot easily recognize if it is UTF8 or not, 
>> programatically. There is a good indicator _against_ UTF8, namely the 
>> first byte can _only_ be 0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx. But there 
>> is no _positive_ sign that it is UTF8. For example, many umlauts and other 
>> special modifications to letters, stay in the range 0x7f-0xff.
>
> Still... that would be a good enough thing to have in the majority of 
> cases, wouldn't it?
I think that would be a very sane thing to do.
* [PATCH 1/2] libgit.a: add some UTF-8 handling functions
  2006-12-22 19:01                 ` Junio C Hamano
@ 2006-12-22 21:03                   ` Johannes Schindelin
  2006-12-22 21:27                     ` Junio C Hamano
  2006-12-22 22:19                     ` Uwe Kleine-König
  2006-12-22 21:06                   ` [PATCH 2/2] git-commit-tree: if i18n.commitencoding is utf-8 (default), check it Johannes Schindelin
  2006-12-22 21:15                   ` [RFC/PATCH 3/2] Wrap lines in shortlog Johannes Schindelin
  2 siblings, 2 replies; 39+ messages in thread
From: Johannes Schindelin @ 2006-12-22 21:03 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Nicolas Pitre, Uwe Kleine-König, git
This adds utf8_byte_count(), utf8_strlen() and print_wrapped_text().
The most important is probably utf8_strlen(), which returns the length
of the text, if it is in UTF-8, otherwise -1.
Note that we do not go the full nine yards: we could also check that
the character is encoded with the minimum amount of bytes, as pointed
out by Uwe Kleine-Koenig.
The function print_wrapped_text() can be used to wrap text to a certain
line length.
Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
---
	On Fri, 22 Dec 2006, Junio C Hamano wrote:
	> Nicolas Pitre <nico@cam.org> writes:
	> 
	> > On Fri, 22 Dec 2006, Johannes Schindelin wrote:
	> >> 
	> >> On Thu, 21 Dec 2006, Junio C Hamano wrote:
	> >> 
	> >> >  (2) update commit-tree to reject non utf-8 log messages and
	> >> >      author/committer names when i18n.commitEncoding is _NOT_
	> >> >      set, or set to utf-8.
	> >> 
	> >> The problem is: you cannot easily recognize if it is UTF8 or 
	> >> not, programatically. There is a good indicator _against_ 
	> >> UTF8, namely the first byte can _only_ be 0xxxxxxx, 110xxxxx, 
	> >> 1110xxxx, 11110xxx. But there is no _positive_ sign that it 
	> >> is UTF8. For example, many umlauts and other special 
	> >> modifications to letters, stay in the range 0x7f-0xff.
	> >
	> > Still... that would be a good enough thing to have in the 
	> > majority of cases, wouldn't it?
	> 
	> I think that would be very sane thing to do.
	Well, this patch together with the next one implements that.
 Makefile |    6 ++-
 utf8.c   |   93 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 utf8.h   |    8 +++++
 3 files changed, 105 insertions(+), 2 deletions(-)
diff --git a/Makefile b/Makefile
index 29c4662..b4ca48b 100644
--- a/Makefile
+++ b/Makefile
@@ -237,7 +237,8 @@ LIB_H = \
 	archive.h blob.h cache.h commit.h csum-file.h delta.h grep.h \
 	diff.h object.h pack.h pkt-line.h quote.h refs.h list-objects.h sideband.h \
 	run-command.h strbuf.h tag.h tree.h git-compat-util.h revision.h \
-	tree-walk.h log-tree.h dir.h path-list.h unpack-trees.h builtin.h
+	tree-walk.h log-tree.h dir.h path-list.h unpack-trees.h builtin.h \
+	utf8.h
 
 DIFF_OBJS = \
 	diff.o diff-lib.o diffcore-break.o diffcore-order.o \
@@ -256,7 +257,8 @@ LIB_OBJS = \
 	revision.o pager.o tree-walk.o xdiff-interface.o \
 	write_or_die.o trace.o list-objects.o grep.o \
 	alloc.o merge-file.o path-list.o help.o unpack-trees.o $(DIFF_OBJS) \
-	color.o wt-status.o archive-zip.o archive-tar.o shallow.o
+	color.o wt-status.o archive-zip.o archive-tar.o shallow.o \
+	utf8.o
 
 BUILTIN_OBJS = \
 	builtin-add.o \
diff --git a/utf8.c b/utf8.c
new file mode 100644
index 0000000..06a66c7
--- /dev/null
+++ b/utf8.c
@@ -0,0 +1,93 @@
+#include "git-compat-util.h"
+#include "utf8.h"
+
+/*
+ * This function returns the number of bytes occupied by the character
+ * pointed to by the variable start. If it is not valid UTF-8, it
+ * returns -1.
+ */
+int utf8_byte_count(const char *start)
+{
+	unsigned char c = *(unsigned char *)start;
+	int i, count = 0;
+
+	if (!(c & 0x80))
+		count = 1;
+	else if ((c & 0xe0) == 0xc0)
+		count = 2;
+	else if ((c & 0xf0) == 0xe0)
+		count = 3;
+	else if ((c & 0xf8) == 0xf0)
+		count = 4;
+	else
+		return -1;
+
+	for (i = 1; i < count; i++)
+		if ((start[i] & 0xc0) != 0x80)
+			return -1;
+	return count;
+}
+
+int utf8_strlen(const char *text)
+{
+	int len = 0;
+	while (*text) {
+		int count = utf8_byte_count(text);
+		if (count < 0)
+			return -1;
+		len += count;
+		text += count;
+	}
+	return len;
+}
+
+static void print_spaces(int count)
+{
+	static const char s[] = "                    ";
+	while (count >= sizeof(s)) {
+		fwrite(s, sizeof(s) - 1, 1, stdout);
+		count -= sizeof(s) - 1;
+	}
+	fwrite(s, count, 1, stdout);
+}
+
+/*
+ * Wrap the text, if necessary. The variable indent is the indent for the
+ * first line, indent2 is the indent for all other lines.
+ */
+void print_wrapped_text(const char *text, int indent, int indent2, int len)
+{
+	int count = 0, space = -1;
+	int l = utf8_strlen(text), assume_utf8 = (l >= 0);
+
+	l = indent;
+
+	for (;;) {
+		char c = text[count];
+		if (!c || isspace(c)) {
+			if (l < len || space < 0) {
+				const char *start = text;
+				if (space >= 0)
+					start += space;
+				else
+					print_spaces(indent);
+				fwrite(start, text + count - start, 1, stdout);
+				if (!c) {
+					putchar('\n');
+					return;
+				} else if (c == '\t')
+					l |= 0x07;
+				space = count;
+			} else {
+				putchar('\n');
+				text += space + 1;
+				indent = indent2;
+				space = -1;
+				count = l = 0;
+				continue;
+			}
+		}
+		count += assume_utf8 ? utf8_byte_count(text + count) : 1;
+		l++;
+	}
+}
diff --git a/utf8.h b/utf8.h
new file mode 100644
index 0000000..96dded9
--- /dev/null
+++ b/utf8.h
@@ -0,0 +1,8 @@
+#ifndef GIT_UTF8_H
+#define GIT_UTF8_H
+
+int utf8_byte_count(const char *start);
+int utf8_strlen(const char *text);
+void print_wrapped_text(const char *text, int indent, int indent2, int len);
+
+#endif
-- 
1.4.4.3.ge5f98-dirty
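
For readers who want to try out the wrapping behaviour described in the commit message without building the C code, here is a rough Python sketch. It is not a port of print_wrapped_text(): it swaps in the standard library's textwrap module, it does not reproduce the patch's exact line-breaking or tab handling, and the name `wrap_text` is made up for this sketch. The parameters mirror those of the C function (indent for the first line, indent2 for the rest, a target width).

```python
import textwrap


def wrap_text(text, indent, indent2, width):
    """Wrap text to `width` columns with separate first-line indentation."""
    return textwrap.fill(
        text,
        width=width,
        initial_indent=" " * indent,      # applies to the first line only
        subsequent_indent=" " * indent2,  # applies to all other lines
    )


print(wrap_text("one two three four five six", 2, 4, 12))
```

Because Python strings are sequences of characters rather than bytes, a multi-byte UTF-8 character counts once towards the width here, which is the same goal the patch pursues by advancing with utf8_byte_count().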
* [PATCH 2/2] git-commit-tree: if i18n.commitencoding is utf-8 (default), check it
  2006-12-22 19:01                 ` Junio C Hamano
  2006-12-22 21:03                   ` [PATCH 1/2] libgit.a: add some UTF-8 handling functions Johannes Schindelin
@ 2006-12-22 21:06                   ` Johannes Schindelin
  2006-12-22 21:50                     ` Junio C Hamano
  2006-12-22 21:15                   ` [RFC/PATCH 3/2] Wrap lines in shortlog Johannes Schindelin
  2 siblings, 1 reply; 39+ messages in thread
From: Johannes Schindelin @ 2006-12-22 21:06 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Nicolas Pitre, Uwe Kleine-König, git
Now, git-commit-tree refuses to commit when i18n.commitencoding is
either unset, or set to "utf-8", and the commit message does not
minimally conform to the UTF-8 encoding.
Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
---
	Unfortunately, I could not think of a shorter oneline description. 
	But my next patch fixes at least the output in shortlog.
 builtin-commit-tree.c |   14 ++++++++++++--
 1 files changed, 12 insertions(+), 2 deletions(-)
diff --git a/builtin-commit-tree.c b/builtin-commit-tree.c
index 856f3cd..810b440 100644
--- a/builtin-commit-tree.c
+++ b/builtin-commit-tree.c
@@ -7,6 +7,7 @@
 #include "commit.h"
 #include "tree.h"
 #include "builtin.h"
+#include "utf8.h"
 
 #define BLOCKING (1ul << 14)
 
@@ -32,7 +33,7 @@ static void add_buffer(char **bufp, unsigned int *sizep, const char *fmt, ...)
 	len = vsnprintf(one_line, sizeof(one_line), fmt, args);
 	va_end(args);
 	size = *sizep;
-	newsize = size + len;
+	newsize = size + len + 1;
 	alloc = (size + 32767) & ~32767;
 	buf = *bufp;
 	if (newsize > alloc) {
@@ -40,7 +41,7 @@ static void add_buffer(char **bufp, unsigned int *sizep, const char *fmt, ...)
 		buf = xrealloc(buf, alloc);
 		*bufp = buf;
 	}
-	*sizep = newsize;
+	*sizep = newsize - 1;
 	memcpy(buf + size, one_line, len);
 }
 
@@ -127,6 +128,15 @@ int cmd_commit_tree(int argc, const char **argv, const char *prefix)
 	while (fgets(comment, sizeof(comment), stdin) != NULL)
 		add_buffer(&buffer, &size, "%s", comment);
 
+	/* And check the encoding */
+	buffer[size] = '\0';
+	if (!strcmp(git_commit_encoding, "utf-8") && utf8_strlen(buffer) < 0) {
+		fprintf(stderr, "Commit message does not conform to UTF-8.\n"
+			"Please fix the message,"
+			" or set the config variable i18n.commitencoding.\n");
+		return 1;
+	}
+
 	if (!write_sha1_file(buffer, size, commit_type, commit_sha1)) {
 		printf("%s\n", sha1_to_hex(commit_sha1));
 		return 0;
-- 
1.4.4.3.ge5f98-dirty
* [RFC/PATCH 3/2] Wrap lines in shortlog
  2006-12-22 19:01                 ` Junio C Hamano
  2006-12-22 21:03                   ` [PATCH 1/2] libgit.a: add some UTF-8 handling functions Johannes Schindelin
  2006-12-22 21:06                   ` [PATCH 2/2] git-commit-tree: if i18n.commitencoding is utf-8 (default), check it Johannes Schindelin
@ 2006-12-22 21:15                   ` Johannes Schindelin
  2 siblings, 0 replies; 39+ messages in thread
From: Johannes Schindelin @ 2006-12-22 21:15 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Nicolas Pitre, Uwe Kleine-König, git
[-- Attachment #1: Type: TEXT/PLAIN, Size: 253 bytes --]
Hi,
It is nicer to wrap the lines of too long oneline descriptions. This patch 
even works in UTF-8.
The patch is attached, since I cannot find the setting in pine to make it 
a UTF-8 one. Besides, I deliberately fscked up one test case.
Ciao,
Dscho
[-- Attachment #2: shortlog.patch --]
[-- Type: TEXT/PLAIN, Size: 2978 bytes --]
[PATCH] Use print_wrapped_text() in shortlog
Some oneline descriptions are just too long. In shortlog, it looks much
nicer when they are wrapped. Since print_wrapped_text() is UTF-8 aware,
it also works with those descriptions.
Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
---
	Probably this should check i18n.commitencoding, too...
 builtin-shortlog.c  |    4 +++-
 t/t4201-shortlog.sh |   50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 53 insertions(+), 1 deletions(-)
diff --git a/builtin-shortlog.c b/builtin-shortlog.c
index edb4042..30e7cb5 100644
--- a/builtin-shortlog.c
+++ b/builtin-shortlog.c
@@ -4,6 +4,7 @@
 #include "diff.h"
 #include "path-list.h"
 #include "revision.h"
+#include "utf8.h"
 
 static const char shortlog_usage[] =
 "git-shortlog [-n] [-s] [<commit-id>... ]";
@@ -321,7 +322,8 @@ int cmd_shortlog(int argc, const char **argv, const char *prefix)
 		} else {
 			printf("%s (%d):\n", list.items[i].path, onelines->nr);
 			for (j = onelines->nr - 1; j >= 0; j--)
-				printf("      %s\n", onelines->items[j].path);
+				print_wrapped_text(onelines->items[j].path,
+					6, 9, 76);
 			printf("\n");
 		}
 
diff --git a/t/t4201-shortlog.sh b/t/t4201-shortlog.sh
new file mode 100644
index 0000000..e4085f9
--- /dev/null
+++ b/t/t4201-shortlog.sh
@@ -0,0 +1,50 @@
+#!/bin/sh
+#
+# Copyright (c) 2006 Johannes E. Schindelin
+#
+
+test_description='git-shortlog
+'
+
+. ./test-lib.sh
+
+echo 1 > a1
+git add a1
+tree=$(git write-tree)
+commit=$( (echo "Test"; echo) | git commit-tree $tree)
+git update-ref HEAD $commit
+
+echo 2 > a1
+git commit -m "This is a very, very long first line for the commit message to see if it is wrapped correctly" a1
+
+# test if the wrapping is still valid when replacing all i's by treble clefs.
+echo 3 > a1
+git commit -m "$(echo "This is a very, very long first line for the commit message to see if it is wrapped correctly" | sed "s/i/1234/g" | tr 1234 '\360\235\204\236')" a1
+
+# now fsck up the utf8
+git repo-config i18n.commitencoding non-utf-8
+echo 4 > a1
+git commit -m "$(echo "This is a very, very long first line for the commit message to see if it is wrapped correctly" | sed "s/i/1234/g" | tr 1234 '\370\235\204\236')" a1
+
+echo 5 > a1
+git commit -m "a								12	34	56	78" a1
+
+git shortlog HEAD > out
+
+cat > expect << EOF
+A U Thor (5):
+      Test
+      This is a very, very long first line for the commit message to see if
+         it is wrapped correctly
+      Th𝄞s 𝄞s a very, very long f𝄞rst l𝄞ne for the comm𝄞t message to see 𝄞f
+         𝄞t 𝄞s wrapped correctly
+      Thøs øs a very, very long først løne for the commøt
+         message to see øf øt øs wrapped correctly
+      a								12	34
+         56	78
+
+EOF
+
+test_expect_success 'shortlog wrapping' 'diff -u expect out'
+
+test_done
* Re: [PATCH 1/2] libgit.a: add some UTF-8 handling functions
  2006-12-22 21:03                   ` [PATCH 1/2] libgit.a: add some UTF-8 handling functions Johannes Schindelin
@ 2006-12-22 21:27                     ` Junio C Hamano
  2006-12-22 21:36                       ` Johannes Schindelin
  2006-12-22 22:19                     ` Uwe Kleine-König
  1 sibling, 1 reply; 39+ messages in thread
From: Junio C Hamano @ 2006-12-22 21:27 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Nicolas Pitre, Uwe Kleine-König, git
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> This adds utf8_byte_count(), utf8_strlen() and print_wrapped_text().
>
> The most important is probably utf8_strlen(), which returns the length
> of the text, if it is in UTF-8, otherwise -1.
>
> Note that we do not go the full nine yards: we could also check that
> the character is encoded with the minimum amount of bytes, as pointed
> out by Uwe Kleine-Koenig.
>
> The function print_wrapped_text() can be used to wrap text to a certain
> line length.
If you do wrapped_text, I think you do not _want_ strlen (the
definition to me of strlen is "number of characters in the
string").  What you want is a function that returns the number
of columns consumed when displayed on monospace terminal.
* Re: [PATCH 1/2] libgit.a: add some UTF-8 handling functions
  2006-12-22 21:27                     ` Junio C Hamano
@ 2006-12-22 21:36                       ` Johannes Schindelin
  2006-12-22 21:58                         ` Junio C Hamano
  2006-12-22 22:14                         ` Uwe Kleine-König
  0 siblings, 2 replies; 39+ messages in thread
From: Johannes Schindelin @ 2006-12-22 21:36 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Nicolas Pitre, Uwe Kleine-König, git
Hi,
On Fri, 22 Dec 2006, Junio C Hamano wrote:
> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> 
> > This adds utf8_byte_count(), utf8_strlen() and print_wrapped_text().
> >
> > The most important is probably utf8_strlen(), which returns the length
> > of the text, if it is in UTF-8, otherwise -1.
> >
> > Note that we do not go the full nine yards: we could also check that
> > the character is encoded with the minimum amount of bytes, as pointed
> > out by Uwe Kleine-Koenig.
> >
> > The function print_wrapped_text() can be used to wrap text to a certain
> > line length.
> 
> If you do wrapped_text, I think you do not _want_ strlen (the
> definition to me of strlen is "number of characters in the
> string").  What you want is a function that returns the number
> of columns consumed when displayed on monospace terminal.
To me, characters are the symbols occupying one "column" each. Bytes are 
the 8-bit thingies that you usually use to encode the characters.
Ciao,
Dscho
* Re: [PATCH 2/2] git-commit-tree: if i18n.commitencoding is utf-8 (default), check it
  2006-12-22 21:06                   ` [PATCH 2/2] git-commit-tree: if i18n.commitencoding is utf-8 (default), check it Johannes Schindelin
@ 2006-12-22 21:50                     ` Junio C Hamano
  2006-12-22 22:21                       ` Johannes Schindelin
  0 siblings, 1 reply; 39+ messages in thread
From: Junio C Hamano @ 2006-12-22 21:50 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> Now, git-commit-tree refuses to commit when i18n.commitencoding is
> either unset, or set to "utf-8", and the commit message does not
> minimally conform to the UTF-8 encoding.
>
> Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
> ---
>
> 	Unfortunately, I could not think of a shorter oneline description. 
> 	But my next patch fixes at least the output in shortlog.
I think the rule that you described on the one-line description
makes more sense than "either unset or set to utf-8".  In other
words, I'd prefer doing this in a repository that explicitly
asks for it.
I do not want to get burned by too many incompatible changes X-<.
* Re: [PATCH 1/2] libgit.a: add some UTF-8 handling functions
  2006-12-22 21:36                       ` Johannes Schindelin
@ 2006-12-22 21:58                         ` Junio C Hamano
  2006-12-22 22:20                           ` Johannes Schindelin
  2006-12-22 22:14                         ` Uwe Kleine-König
  1 sibling, 1 reply; 39+ messages in thread
From: Junio C Hamano @ 2006-12-22 21:58 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Nicolas Pitre, Uwe Kleine-König, git
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>> If you do wrapped_text, I think you do not _want_ strlen (the
>> definition to me of strlen is "number of characters in the
>> string").  What you want is a function that returns the number
>> of columns consumed when displayed on monospace terminal.
>
> To me, characters are the symbols occupying one "column" each. Bytes are 
> the 8-bit thingies that you usually use to encode the characters.
I cannot tell from your response if you are very well aware of
Asian "double-width" characters and your version of strlen()
counts one such character as two, or if you are totally unaware
about the issue and your function returns 1 for a string that
consists of a single such character.
If the former, then the function is not strlen() anymore, and if
the latter, then it is unusable for wrapping purposes.
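To make the distinction concrete, here is a minimal sketch (not part of the patch under discussion) that computes both quantities for a string assumed to be valid UTF-8. The names demo_next_codepoint(), demo_char_count() and demo_column_count() are hypothetical, and the width rule is deliberately crude: it treats only a few East Asian ranges as double-width and ignores combining characters altogether.

```c
/*
 * Decode one code point and advance *s past it.
 * Assumption: the input is valid UTF-8; nothing is validated here.
 */
unsigned long demo_next_codepoint(const unsigned char **s)
{
	const unsigned char *p = *s;
	unsigned long ch;
	int extra;

	if (p[0] < 0x80) {
		ch = p[0];
		extra = 0;
	} else if ((p[0] & 0xe0) == 0xc0) {
		ch = p[0] & 0x1f;
		extra = 1;
	} else if ((p[0] & 0xf0) == 0xe0) {
		ch = p[0] & 0x0f;
		extra = 2;
	} else {
		ch = p[0] & 0x07;
		extra = 3;
	}
	p++;
	/* Stop early at NUL so a truncated string cannot overrun. */
	while (extra-- > 0 && *p)
		ch = (ch << 6) | (*p++ & 0x3f);
	*s = p;
	return ch;
}

/* Number of characters (code points): what a "strlen" would report. */
int demo_char_count(const char *text)
{
	const unsigned char *s = (const unsigned char *)text;
	int n = 0;

	while (*s) {
		demo_next_codepoint(&s);
		n++;
	}
	return n;
}

/* Number of terminal columns, under a crude wide-character rule. */
int demo_column_count(const char *text)
{
	const unsigned char *s = (const unsigned char *)text;
	int cols = 0;

	while (*s) {
		unsigned long ch = demo_next_codepoint(&s);
		int wide = (ch >= 0x1100 && ch <= 0x115f) ||
			   (ch >= 0x2e80 && ch <= 0xa4cf) ||
			   (ch >= 0xac00 && ch <= 0xd7a3) ||
			   (ch >= 0xff00 && ch <= 0xff60);

		cols += wide ? 2 : 1;
	}
	return cols;
}
```

For the two-character string "\xe6\x97\xa5\xe6\x9c\xac" (U+65E5 U+672C), demo_char_count() gives 2 while demo_column_count() gives 4; a line wrapper needs the second number.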
* Re: specify charset for commits
  2006-12-22 15:09               ` Uwe Kleine-König
@ 2006-12-22 22:02                 ` Uwe Kleine-König
  0 siblings, 0 replies; 39+ messages in thread
From: Uwe Kleine-König @ 2006-12-22 22:02 UTC (permalink / raw)
  To: Johannes Schindelin, Junio C Hamano, Alexander Litvinov, git
Hello,
Uwe Kleine-König wrote:
> 	def is_utf8_str(s):
> 	  cnt_furtherbytes = 0
> 	  for c in s:
> 	    if cnt_furtherbytes > 0:
> 	      if ord(c) & 0xc0 == 0x80:
> 	        cnt_furtherbytes -= 1
> 	      else:
> 	        return False
> 	    else:
> 	      if ord(c) < 0x80:
> 	        continue
> 	      elif ord(c) < 0xc0:
> 	        return False
> 	      elif ord(c) < 0xe0:
> 	        cnt_furtherbytes = 1
> 	      elif ord(c) < 0xf0:
> 	        cnt_furtherbytes = 2
> 	      elif ord(c) < 0xf8:
> 	        cnt_furtherbytes = 3
> 	      elif ord(c) < 0xfc:
> 	        cnt_furtherbytes = 4
> 	      elif ord(c) < 0xfe:
> 	        cnt_furtherbytes = 5
> 	      else:
> 	        return False
> 	  return True
While I washed the dishes I noticed that the last "return True" should
be "return cnt_furtherbytes == 0".  Just before someone else corrects me
... :-)
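
For the C side of this thread, the same checker with the corrected final return could be sketched as follows. This is only a hedged port: the name demo_is_utf8_str() and the explicit length parameter are assumptions made for illustration, and, like the Python version, it checks structure only (overlong forms are not rejected).

```c
#include <stddef.h>

/*
 * Structural UTF-8 check, ported from the Python sketch quoted above.
 * Returns 1 if every lead byte is followed by the right number of
 * continuation bytes and the input does not end mid-sequence, else 0.
 */
int demo_is_utf8_str(const unsigned char *s, size_t len)
{
	int cnt_furtherbytes = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		unsigned char c = s[i];

		if (cnt_furtherbytes > 0) {
			/* inside a sequence: expect 10xxxxxx */
			if ((c & 0xc0) != 0x80)
				return 0;
			cnt_furtherbytes--;
		} else if (c < 0x80)
			continue;		/* plain ASCII */
		else if (c < 0xc0)
			return 0;		/* stray continuation byte */
		else if (c < 0xe0)
			cnt_furtherbytes = 1;
		else if (c < 0xf0)
			cnt_furtherbytes = 2;
		else if (c < 0xf8)
			cnt_furtherbytes = 3;
		else if (c < 0xfc)
			cnt_furtherbytes = 4;
		else if (c < 0xfe)
			cnt_furtherbytes = 5;
		else
			return 0;		/* 0xfe and 0xff never occur */
	}
	/* the correction: a truncated trailing sequence is invalid */
	return cnt_furtherbytes == 0;
}
```

With the original "return True", an input ending in a lone 0xc3 would have been accepted; here it is rejected.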
Best regards
Uwe
-- 
Uwe Kleine-König
http://www.google.com/search?q=parsec%5E2*Joule%2FNewton+in+tablespoon
* Re: [PATCH 1/2] libgit.a: add some UTF-8 handling functions
  2006-12-22 21:36                       ` Johannes Schindelin
  2006-12-22 21:58                         ` Junio C Hamano
@ 2006-12-22 22:14                         ` Uwe Kleine-König
  1 sibling, 0 replies; 39+ messages in thread
From: Uwe Kleine-König @ 2006-12-22 22:14 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Junio C Hamano, Nicolas Pitre, git
Hello Johannes,
Johannes Schindelin wrote:
> On Fri, 22 Dec 2006, Junio C Hamano wrote:
> 
> > Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> > 
> > > This adds utf8_byte_count(), utf8_strlen() and print_wrapped_text().
> > >
> > > The most important is probably utf8_strlen(), which returns the length
> > > of the text, if it is in UTF-8, otherwise -1.
> > >
> > > Note that we do not go the full nine yards: we could also check that
> > > the character is encoded with the minimum amount of bytes, as pointed
> > > out by Uwe Kleine-Koenig.
> > >
> > > The function print_wrapped_text() can be used to wrap text to a certain
> > > line length.
> > 
> > If you do wrapped_text, I think you do not _want_ strlen (the
> > definition to me of strlen is "number of characters in the
> > string").  What you want is a function that returns the number
> > of columns consumed when displayed on monospace terminal.
> 
> To me, characters are the symbols occupying one "column" each. Bytes are 
> the 8-bit thingies that you usually use to encode the characters.
Quoting utf-8(7):
	are no longer valid in UTF-8 locales.  Firstly, a single byte
	does not necessarily correspond any more to a single character.
	Secondly, since modern terminal emulators in UTF-8 mode also
	support Chinese, Japanese, and Korean double-width characters as
	well as non-spacing combining characters, outputting a single
	character does not necessarily advance the cursor by one
	position as it did in ASCII.  Library functions such as
	mbsrtowcs(3) and wcswidth(3) should be used today to count
	characters and cursor positions.
I'd prefer using a similar naming scheme.  To acknowledge Junio,
wcslen(3) (the wide-character equivalent of the strlen() function)
counts the number of (wide-)characters in a string.
Best regards,
Uwe
-- 
Uwe Kleine-König
http://www.google.com/search?q=e+%5E+%28i+pi%29
* Re: [PATCH 1/2] libgit.a: add some UTF-8 handling functions
  2006-12-22 21:03                   ` [PATCH 1/2] libgit.a: add some UTF-8 handling functions Johannes Schindelin
  2006-12-22 21:27                     ` Junio C Hamano
@ 2006-12-22 22:19                     ` Uwe Kleine-König
  2006-12-22 22:34                       ` Johannes Schindelin
  1 sibling, 1 reply; 39+ messages in thread
From: Uwe Kleine-König @ 2006-12-22 22:19 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Junio C Hamano, Nicolas Pitre, git
Hello,
Johannes Schindelin wrote:
> Note that we do not go the full nine yards: we could also check that
> the character is encoded with the minimum amount of bytes, as pointed
> out by Uwe Kleine-Koenig.
While we're talking about UTF-8 in commit-logs:  I'd prefer to have my
name properly written with o-umlaut.
Best regards
Uwe
-- 
Uwe Kleine-König
http://www.google.com/search?q=0+degree+Celsius+in+kelvin
* Re: [PATCH 1/2] libgit.a: add some UTF-8 handling functions
  2006-12-22 21:58                         ` Junio C Hamano
@ 2006-12-22 22:20                           ` Johannes Schindelin
  2006-12-22 22:33                             ` Junio C Hamano
  2006-12-25  4:03                             ` Alexander Litvinov
  0 siblings, 2 replies; 39+ messages in thread
From: Johannes Schindelin @ 2006-12-22 22:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Nicolas Pitre, Uwe Kleine-König, git
Hi,
On Fri, 22 Dec 2006, Junio C Hamano wrote:
> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> 
> >> If you do wrapped_text, I think you do not _want_ strlen (the
> >> definition to me of strlen is "number of characters in the
> >> string").  What you want is a function that returns the number
> >> of columns consumed when displayed on monospace terminal.
> >
> > To me, characters are the symbols occupying one "column" each. Bytes are 
> > the 8-bit thingies that you usually use to encode the characters.
> 
> I cannot tell from your response if you are very well aware of
> Asian "double-width" characters and your version of strlen()
> counts one such character as two, or if you are totally unaware
> about the issue and your function returns 1 for a string that
> consists of a single such character.
> 
> If the former, then the function is not strlen() anymore, and if
> the latter, then it is unusable for wrapping purposes.
The latter. Oh, well. Call me a Western idiot.
And scrap that patch.
Ciao,
Dscho
* Re: [PATCH 2/2] git-commit-tree: if i18n.commitencoding is utf-8 (default), check it
  2006-12-22 21:50                     ` Junio C Hamano
@ 2006-12-22 22:21                       ` Johannes Schindelin
  0 siblings, 0 replies; 39+ messages in thread
From: Johannes Schindelin @ 2006-12-22 22:21 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
Hi,
On Fri, 22 Dec 2006, Junio C Hamano wrote:
> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> 
> > Now, git-commit-tree refuses to commit when i18n.commitencoding is
> > either unset, or set to "utf-8", and the commit message does not
> > minimally conform to the UTF-8 encoding.
> >
> > Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
> > ---
> >
> > 	Unfortunately, I could not think of a shorter oneline description. 
> > 	But my next patch fixes at least the output in shortlog.
> 
> I think the rule that you described on the one-line description
> makes more sense than "either unset or set to utf-8".  In other
> words, I'd prefer doing this in a repository that explicitly
> asks for it.
Well, the problem is this line:
environment.c:21:char git_commit_encoding[MAX_ENCODING_LENGTH] = "utf-8";
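Because the variable is pre-initialized, strcmp() alone cannot tell an unset i18n.commitencoding from one explicitly set to "utf-8". One way around that, sketched here with hypothetical demo_* names and an assumed buffer size (nothing below is existing git code), is to record separately whether the configuration supplied a value:

```c
#include <string.h>

#define DEMO_MAX_ENCODING_LENGTH 64	/* assumed size, for illustration */

/* Same default as in environment.c, plus a flag the config parser sets. */
char demo_commit_encoding[DEMO_MAX_ENCODING_LENGTH] = "utf-8";
int demo_commit_encoding_explicit;

/* Would be called when the config file contains i18n.commitencoding. */
void demo_set_commit_encoding(const char *value)
{
	strncpy(demo_commit_encoding, value, sizeof(demo_commit_encoding) - 1);
	demo_commit_encoding[sizeof(demo_commit_encoding) - 1] = '\0';
	demo_commit_encoding_explicit = 1;
}

/* Check the message only in a repository that explicitly asks for it. */
int demo_should_check_utf8(void)
{
	return demo_commit_encoding_explicit &&
		!strcmp(demo_commit_encoding, "utf-8");
}
```

With this, a repository without the setting keeps today's behaviour, and only an explicit "utf-8" turns the check on.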
> I do not want to get burned by too many incompatible changes X-<.
Understandable.
Ciao,
Dscho
* Re: [PATCH 1/2] libgit.a: add some UTF-8 handling functions
  2006-12-22 22:20                           ` Johannes Schindelin
@ 2006-12-22 22:33                             ` Junio C Hamano
  2006-12-25  4:03                             ` Alexander Litvinov
  1 sibling, 0 replies; 39+ messages in thread
From: Junio C Hamano @ 2006-12-22 22:33 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Nicolas Pitre, Uwe Kleine-König, git
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>> > To me, characters are the symbols occupying one "column" each. Bytes are 
>> > the 8-bit thingies that you usually use to encode the characters.
>> 
>> I cannot tell from your response if you are very well aware of
>> Asian "double-width" characters and your version of strlen()
>> counts one such character as two, or if you are totally unaware
>> about the issue and your function returns 1 for a string that
>> consists of a single such character.
>> 
>> If the former, then the function is not strlen() anymore, and if
>> the latter, then it is unusable for wrapping purposes.
>
> The latter. Oh, well. Call me a Western idiot.
>
> And scrap that patch.
Hey, don't give up too quickly.  Although the execution of the
initial revision might have been less than desirable, the series
really meant well and was in the right direction.  We do that
all the time ;-).
* Re: [PATCH 1/2] libgit.a: add some UTF-8 handling functions
  2006-12-22 22:19                     ` Uwe Kleine-König
@ 2006-12-22 22:34                       ` Johannes Schindelin
  2006-12-22 23:50                         ` Johannes Schindelin
  0 siblings, 1 reply; 39+ messages in thread
From: Johannes Schindelin @ 2006-12-22 22:34 UTC (permalink / raw)
  To: Uwe Kleine-König; +Cc: git
[-- Attachment #1: Type: TEXT/PLAIN, Size: 791 bytes --]
Dear Mr Zeisberg,
On Fri, 22 Dec 2006, Uwe Kleine-König wrote:
> Johannes Schindelin wrote:
> > Note that we do not go the full nine yards: we could also check that
> > the character is encoded with the minimum amount of bytes, as pointed
> > out by Uwe Kleine-Koenig.
> While we're talking about UTF-8 in commit-logs:  I'd prefer to have my
> name properly written with o-umlaut.
I did this because I have no easy way to input UTF-8, and because I am 
lazy, and because I did not know how many times this patch has to be 
revised.
Apart from that, it seems that the checking of UTF-8 is actually quite 
simple, and we could even copy it from 
http://www.cl.cam.ac.uk/~mgk25/ucs/utf8_check.c, where the check you 
proposed is included.
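As an illustration of that minimal-length check (a hand-written sketch, not the code from utf8_check.c; demo_utf8_shortest_form() is a hypothetical name), one can decode the sequence and compare the result with the smallest code point that needs that many bytes. Surrogates and values above U+10FFFF are not rejected here.

```c
#include <stddef.h>

/*
 * Returns the sequence length (1..4) if s starts with a well-formed,
 * shortest-form UTF-8 sequence, or 0 otherwise.
 */
int demo_utf8_shortest_form(const unsigned char *s, size_t len)
{
	/* smallest code point that requires n bytes, indexed by n */
	static const unsigned long min_for_len[] = {
		0, 0, 0x80, 0x800, 0x10000
	};
	unsigned long ch;
	size_t n, i;

	if (len == 0)
		return 0;
	if (s[0] < 0x80)
		return 1;
	if ((s[0] & 0xe0) == 0xc0) {
		n = 2;
		ch = s[0] & 0x1f;
	} else if ((s[0] & 0xf0) == 0xe0) {
		n = 3;
		ch = s[0] & 0x0f;
	} else if ((s[0] & 0xf8) == 0xf0) {
		n = 4;
		ch = s[0] & 0x07;
	} else
		return 0;	/* continuation byte or invalid lead byte */

	if (len < n)
		return 0;	/* truncated sequence */
	for (i = 1; i < n; i++) {
		if ((s[i] & 0xc0) != 0x80)
			return 0;
		ch = (ch << 6) | (s[i] & 0x3f);
	}
	/* overlong: the value would have fit into fewer bytes */
	if (ch < min_for_len[n])
		return 0;
	return (int)n;
}
```

The two-byte sequence 0xc0 0xaf, an overlong encoding of '/', is rejected, while 0xc3 0xb6 (the o-umlaut) is accepted.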
But I had enough of UTF-8 for a day.
Ciao,
Dscho
* Re: [PATCH 1/2] libgit.a: add some UTF-8 handling functions
  2006-12-22 22:34                       ` Johannes Schindelin
@ 2006-12-22 23:50                         ` Johannes Schindelin
  2006-12-23  8:52                           ` Uwe Kleine-König
  2006-12-23 19:53                           ` warn non utf-8 commit log messages Junio C Hamano
  0 siblings, 2 replies; 39+ messages in thread
From: Johannes Schindelin @ 2006-12-22 23:50 UTC (permalink / raw)
  To: Uwe Kleine-König; +Cc: git, junkio
[-- Attachment #1: Type: TEXT/PLAIN, Size: 12645 bytes --]
Hi,
On Fri, 22 Dec 2006, Johannes Schindelin wrote:
> Dear Mr Zeisberg,
> 
> On Fri, 22 Dec 2006, Uwe Kleine-König wrote:
> 
> > Johannes Schindelin wrote:
> > > Note that we do not go the full nine yards: we could also check that
> > > the character is encoded with the minimum amount of bytes, as pointed
> > > out by Uwe Kleine-Koenig.
> > While we're talking about UTF-8 in commit-logs:  I'd prefer to have my
> > name properly written with o-umlaut.
> 
> I did this because I have no easy way to input UTF-8, and because I am 
> lazy, and because I did not know how many times this patch has to be 
> revised.
> 
> Apart from that, it seems that the checking of UTF-8 is actually quite 
> simple, and we could even copy it from 
> http://www.cl.cam.ac.uk/~mgk25/ucs/utf8_check.c, where the check you 
> proposed is included.
> 
> But I had enough of UTF-8 for a day.
Okay, so I lied (these are both patches revised and combined):
--
 Makefile              |    6 +
 builtin-commit-tree.c |   14 ++
 utf8.c                |  277 +++++++++++++++++++++++++++++++++++++++++++++++++
 utf8.h                |    8 +
 4 files changed, 301 insertions(+), 4 deletions(-)
diff --git a/Makefile b/Makefile
--- a/Makefile
+++ b/Makefile
@@ -234,7 +237,8 @@ LIB_H = \
 	archive.h blob.h cache.h commit.h csum-file.h delta.h grep.h \
 	diff.h object.h pack.h pkt-line.h quote.h refs.h list-objects.h sideband.h \
 	run-command.h strbuf.h tag.h tree.h git-compat-util.h revision.h \
-	tree-walk.h log-tree.h dir.h path-list.h unpack-trees.h builtin.h
+	tree-walk.h log-tree.h dir.h path-list.h unpack-trees.h builtin.h \
+	utf8.h
 
 DIFF_OBJS = \
 	diff.o diff-lib.o diffcore-break.o diffcore-order.o \
@@ -253,7 +257,8 @@ LIB_OBJS = \
 	revision.o pager.o tree-walk.o xdiff-interface.o \
 	write_or_die.o trace.o list-objects.o grep.o \
 	alloc.o merge-file.o path-list.o help.o unpack-trees.o $(DIFF_OBJS) \
-	color.o wt-status.o archive-zip.o archive-tar.o shallow.o
+	color.o wt-status.o archive-zip.o archive-tar.o shallow.o \
+	utf8.o
 
 BUILTIN_OBJS = \
 	builtin-add.o \
diff --git a/builtin-commit-tree.c b/builtin-commit-tree.c
index 856f3cd..ef7cc91 100644
--- a/builtin-commit-tree.c
+++ b/builtin-commit-tree.c
@@ -7,6 +7,7 @@
 #include "commit.h"
 #include "tree.h"
 #include "builtin.h"
+#include "utf8.h"
 
 #define BLOCKING (1ul << 14)
 
@@ -32,7 +33,7 @@ static void add_buffer(char **bufp, unsigned int *sizep, const char *fmt, ...)
 	len = vsnprintf(one_line, sizeof(one_line), fmt, args);
 	va_end(args);
 	size = *sizep;
-	newsize = size + len;
+	newsize = size + len + 1;
 	alloc = (size + 32767) & ~32767;
 	buf = *bufp;
 	if (newsize > alloc) {
@@ -40,7 +41,7 @@ static void add_buffer(char **bufp, unsigned int *sizep, const char *fmt, ...)
 		buf = xrealloc(buf, alloc);
 		*bufp = buf;
 	}
-	*sizep = newsize;
+	*sizep = newsize - 1;
 	memcpy(buf + size, one_line, len);
 }
 
@@ -127,6 +128,15 @@ int cmd_commit_tree(int argc, const char **argv, const char *prefix)
 	while (fgets(comment, sizeof(comment), stdin) != NULL)
 		add_buffer(&buffer, &size, "%s", comment);
 
+	/* And check the encoding */
+	buffer[size] = '\0';
+	if (!strcmp(git_commit_encoding, "utf-8") && !is_utf8(buffer)) {
+		fprintf(stderr, "Commit message does not conform to UTF-8.\n"
+			"Please fix the message,"
+			" or set the config variable i18n.commitencoding.\n");
+		return 1;
+	}
+
 	if (!write_sha1_file(buffer, size, commit_type, commit_sha1)) {
 		printf("%s\n", sha1_to_hex(commit_sha1));
 		return 0;
diff --git a/utf8.c b/utf8.c
new file mode 100644
index 0000000..8daec78
--- /dev/null
+++ b/utf8.c
@@ -0,0 +1,277 @@
+#include "git-compat-util.h"
+#include "utf8.h"
+
+/* This code is originally from http://www.cl.cam.ac.uk/~mgk25/ucs/ */
+
+struct interval {
+  int first;
+  int last;
+};
+
+/* auxiliary function for binary search in interval table */
+static int bisearch(wchar_t ucs, const struct interval *table, int max) {
+	int min = 0;
+	int mid;
+
+	if (ucs < table[0].first || ucs > table[max].last)
+		return 0;
+	while (max >= min) {
+		mid = (min + max) / 2;
+		if (ucs > table[mid].last)
+			min = mid + 1;
+		else if (ucs < table[mid].first)
+			max = mid - 1;
+		else
+			return 1;
+	}
+
+	return 0;
+}
+
+/* The following two functions define the column width of an ISO 10646
+ * character as follows:
+ *
+ *    - The null character (U+0000) has a column width of 0.
+ *
+ *    - Other C0/C1 control characters and DEL will lead to a return
+ *      value of -1.
+ *
+ *    - Non-spacing and enclosing combining characters (general
+ *      category code Mn or Me in the Unicode database) have a
+ *      column width of 0.
+ *
+ *    - SOFT HYPHEN (U+00AD) has a column width of 1.
+ *
+ *    - Other format characters (general category code Cf in the Unicode
+ *      database) and ZERO WIDTH SPACE (U+200B) have a column width of 0.
+ *
+ *    - Hangul Jamo medial vowels and final consonants (U+1160-U+11FF)
+ *      have a column width of 0.
+ *
+ *    - Spacing characters in the East Asian Wide (W) or East Asian
+ *      Full-width (F) category as defined in Unicode Technical
+ *      Report #11 have a column width of 2.
+ *
+ *    - All remaining characters (including all printable
+ *      ISO 8859-1 and WGL4 characters, Unicode control characters,
+ *      etc.) have a column width of 1.
+ *
+ * This implementation assumes that wchar_t characters are encoded
+ * in ISO 10646.
+ */
+
+static int wcwidth(wchar_t ch)
+{
+	/*
+	 * Sorted list of non-overlapping intervals of non-spacing characters,
+	 * generated by
+	 *   "uniset +cat=Me +cat=Mn +cat=Cf -00AD +1160-11FF +200B c".
+	 */
+	static const struct interval combining[] = {
+		{ 0x0300, 0x0357 }, { 0x035D, 0x036F }, { 0x0483, 0x0486 },
+		{ 0x0488, 0x0489 }, { 0x0591, 0x05A1 }, { 0x05A3, 0x05B9 },
+		{ 0x05BB, 0x05BD }, { 0x05BF, 0x05BF }, { 0x05C1, 0x05C2 },
+		{ 0x05C4, 0x05C4 }, { 0x0600, 0x0603 }, { 0x0610, 0x0615 },
+		{ 0x064B, 0x0658 }, { 0x0670, 0x0670 }, { 0x06D6, 0x06E4 },
+		{ 0x06E7, 0x06E8 }, { 0x06EA, 0x06ED }, { 0x070F, 0x070F },
+		{ 0x0711, 0x0711 }, { 0x0730, 0x074A }, { 0x07A6, 0x07B0 },
+		{ 0x0901, 0x0902 }, { 0x093C, 0x093C }, { 0x0941, 0x0948 },
+		{ 0x094D, 0x094D }, { 0x0951, 0x0954 }, { 0x0962, 0x0963 },
+		{ 0x0981, 0x0981 }, { 0x09BC, 0x09BC }, { 0x09C1, 0x09C4 },
+		{ 0x09CD, 0x09CD }, { 0x09E2, 0x09E3 }, { 0x0A01, 0x0A02 },
+		{ 0x0A3C, 0x0A3C }, { 0x0A41, 0x0A42 }, { 0x0A47, 0x0A48 },
+		{ 0x0A4B, 0x0A4D }, { 0x0A70, 0x0A71 }, { 0x0A81, 0x0A82 },
+		{ 0x0ABC, 0x0ABC }, { 0x0AC1, 0x0AC5 }, { 0x0AC7, 0x0AC8 },
+		{ 0x0ACD, 0x0ACD }, { 0x0AE2, 0x0AE3 }, { 0x0B01, 0x0B01 },
+		{ 0x0B3C, 0x0B3C }, { 0x0B3F, 0x0B3F }, { 0x0B41, 0x0B43 },
+		{ 0x0B4D, 0x0B4D }, { 0x0B56, 0x0B56 }, { 0x0B82, 0x0B82 },
+		{ 0x0BC0, 0x0BC0 }, { 0x0BCD, 0x0BCD }, { 0x0C3E, 0x0C40 },
+		{ 0x0C46, 0x0C48 }, { 0x0C4A, 0x0C4D }, { 0x0C55, 0x0C56 },
+		{ 0x0CBC, 0x0CBC }, { 0x0CBF, 0x0CBF }, { 0x0CC6, 0x0CC6 },
+		{ 0x0CCC, 0x0CCD }, { 0x0D41, 0x0D43 }, { 0x0D4D, 0x0D4D },
+		{ 0x0DCA, 0x0DCA }, { 0x0DD2, 0x0DD4 }, { 0x0DD6, 0x0DD6 },
+		{ 0x0E31, 0x0E31 }, { 0x0E34, 0x0E3A }, { 0x0E47, 0x0E4E },
+		{ 0x0EB1, 0x0EB1 }, { 0x0EB4, 0x0EB9 }, { 0x0EBB, 0x0EBC },
+		{ 0x0EC8, 0x0ECD }, { 0x0F18, 0x0F19 }, { 0x0F35, 0x0F35 },
+		{ 0x0F37, 0x0F37 }, { 0x0F39, 0x0F39 }, { 0x0F71, 0x0F7E },
+		{ 0x0F80, 0x0F84 }, { 0x0F86, 0x0F87 }, { 0x0F90, 0x0F97 },
+		{ 0x0F99, 0x0FBC }, { 0x0FC6, 0x0FC6 }, { 0x102D, 0x1030 },
+		{ 0x1032, 0x1032 }, { 0x1036, 0x1037 }, { 0x1039, 0x1039 },
+		{ 0x1058, 0x1059 }, { 0x1160, 0x11FF }, { 0x1712, 0x1714 },
+		{ 0x1732, 0x1734 }, { 0x1752, 0x1753 }, { 0x1772, 0x1773 },
+		{ 0x17B4, 0x17B5 }, { 0x17B7, 0x17BD }, { 0x17C6, 0x17C6 },
+		{ 0x17C9, 0x17D3 }, { 0x17DD, 0x17DD }, { 0x180B, 0x180D },
+		{ 0x18A9, 0x18A9 }, { 0x1920, 0x1922 }, { 0x1927, 0x1928 },
+		{ 0x1932, 0x1932 }, { 0x1939, 0x193B }, { 0x200B, 0x200F },
+		{ 0x202A, 0x202E }, { 0x2060, 0x2063 }, { 0x206A, 0x206F },
+		{ 0x20D0, 0x20EA }, { 0x302A, 0x302F }, { 0x3099, 0x309A },
+		{ 0xFB1E, 0xFB1E }, { 0xFE00, 0xFE0F }, { 0xFE20, 0xFE23 },
+		{ 0xFEFF, 0xFEFF }, { 0xFFF9, 0xFFFB }, { 0x1D167, 0x1D169 },
+		{ 0x1D173, 0x1D182 }, { 0x1D185, 0x1D18B },
+		{ 0x1D1AA, 0x1D1AD }, { 0xE0001, 0xE0001 },
+		{ 0xE0020, 0xE007F }, { 0xE0100, 0xE01EF }
+	};
+
+	/* test for 8-bit control characters */
+	if (ch == 0)
+		return 0;
+	if (ch < 32 || (ch >= 0x7f && ch < 0xa0))
+		return -1;
+
+	/* binary search in table of non-spacing characters */
+	if (bisearch(ch, combining, sizeof(combining)
+				/ sizeof(struct interval) - 1))
+		return 0;
+
+	/*
+	 * If we arrive here, ch is neither a combining nor a C0/C1
+	 * control character.
+	 */
+
+	return 1 +
+		(ch >= 0x1100 &&
+                    /* Hangul Jamo init. consonants */
+		 (ch <= 0x115f ||
+		  ch == 0x2329 || ch == 0x232a ||
+                  /* CJK ... Yi */
+		  (ch >= 0x2e80 && ch <= 0xa4cf &&
+		   ch != 0x303f) ||
+		  /* Hangul Syllables */
+		  (ch >= 0xac00 && ch <= 0xd7a3) ||
+		  /* CJK Compatibility Ideographs */
+		  (ch >= 0xf900 && ch <= 0xfaff) ||
+		  /* CJK Compatibility Forms */
+		  (ch >= 0xfe30 && ch <= 0xfe6f) ||
+		  /* Fullwidth Forms */
+		  (ch >= 0xff00 && ch <= 0xff60) ||
+		  (ch >= 0xffe0 && ch <= 0xffe6) ||
+		  (ch >= 0x20000 && ch <= 0x2fffd) ||
+		  (ch >= 0x30000 && ch <= 0x3fffd)));
+}
+
+/*
+ * This function returns the number of columns occupied by the character
+ * pointed to by the variable start. The pointer is updated to point at
+ * the next character. If it was not valid UTF-8, the pointer is set to NULL.
+ */
+int utf8_width(const char **start)
+{
+	unsigned char *s = (unsigned char *)*start;
+	wchar_t ch;
+
+	if (*s < 0x80) {
+		/* 0xxxxxxx */
+		ch = *s;
+		*start += 1;
+	} else if ((s[0] & 0xe0) == 0xc0) {
+		/* 110XXXXx 10xxxxxx */
+		if ((s[1] & 0xc0) != 0x80 ||
+				/* overlong? */
+				(s[0] & 0xfe) == 0xc0)
+			goto invalid;
+		ch = ((s[0] & 0x1f) << 6) | (s[1] & 0x3f);
+		*start += 2;
+	} else if ((s[0] & 0xf0) == 0xe0) {
+		/* 1110XXXX 10Xxxxxx 10xxxxxx */
+		if ((s[1] & 0xc0) != 0x80 ||
+				(s[2] & 0xc0) != 0x80 ||
+				/* overlong? */
+				(s[0] == 0xe0 && (s[1] & 0xe0) == 0x80) ||
+				/* surrogate? */
+				(s[0] == 0xed && (s[1] & 0xe0) == 0xa0) ||
+				/* U+FFFE or U+FFFF? */
+				(s[0] == 0xef && s[1] == 0xbf &&
+				 (s[2] & 0xfe) == 0xbe))
+			goto invalid;
+		ch = ((s[0] & 0x0f) << 12) |
+			((s[1] & 0x3f) << 6) | (s[2] & 0x3f);
+		*start += 3;
+	} else if ((s[0] & 0xf8) == 0xf0) {
+		/* 11110XXX 10XXxxxx 10xxxxxx 10xxxxxx */
+		if ((s[1] & 0xc0) != 0x80 ||
+				(s[2] & 0xc0) != 0x80 ||
+				(s[3] & 0xc0) != 0x80 ||
+				/* overlong? */
+				(s[0] == 0xf0 && (s[1] & 0xf0) == 0x80) ||
+				/* > U+10FFFF? */
+				(s[0] == 0xf4 && s[1] > 0x8f) || s[0] > 0xf4)
+			goto invalid;
+		ch = ((s[0] & 0x07) << 18) | ((s[1] & 0x3f) << 12) |
+			((s[2] & 0x3f) << 6) | (s[3] & 0x3f);
+		*start += 4;
+	} else {
+invalid:
+		*start = NULL;
+		return 0;
+	}
+
+	return wcwidth(ch);
+}
+
+int is_utf8(const char *text)
+{
+	while (*text) {
+		if (*text == '\n' || *text == '\t' || *text == '\r') {
+			text++;
+			continue;
+		}
+		utf8_width(&text);
+		if (!text)
+			return 0;
+	}
+	return 1;
+}
+
+static void print_spaces(int count)
+{
+	static const char s[] = "                    ";
+	while (count >= sizeof(s)) {
+		fwrite(s, sizeof(s) - 1, 1, stdout);
+		count -= sizeof(s) - 1;
+	}
+	fwrite(s, count, 1, stdout);
+}
+
+/*
+ * Wrap the text, if necessary. The variable indent is the indent for the
+ * first line, indent2 is the indent for all other lines.
+ */
+void print_wrapped_text(const char *text, int indent, int indent2, int width)
+{
+	int w = indent, assume_utf8 = is_utf8(text);
+	const char *bol = text, *space = NULL;
+
+	for (;;) {
+		char c = *text;
+		if (!c || isspace(c)) {
+			if (w < width || space < 0) {
+				const char *start = bol;
+				if (space)
+					start = space;
+				else
+					print_spaces(indent);
+				fwrite(start, text - start, 1, stdout);
+				if (!c) {
+					putchar('\n');
+					return;
+				} else if (c == '\t')
+					w |= 0x07;
+				space = text;
+				w++;
+				text++;
+			} else {
+				putchar('\n');
+				text = bol = space + 1;
+				space = NULL;
+				w = indent = indent2;
+			}
+			continue;
+		}
+		if (assume_utf8)
+			w += utf8_width(&text);
+		else {
+			w++;
+			text++;
+		}
+	}
+}
diff --git a/utf8.h b/utf8.h
new file mode 100644
index 0000000..a0d7f59
--- /dev/null
+++ b/utf8.h
@@ -0,0 +1,8 @@
+#ifndef GIT_UTF8_H
+#define GIT_UTF8_H
+
+int utf8_width(const char **start);
+int is_utf8(const char *text);
+void print_wrapped_text(const char *text, int indent, int indent2, int len);
+
+#endif
* Re: [PATCH 1/2] libgit.a: add some UTF-8 handling functions
  2006-12-22 23:50                         ` Johannes Schindelin
@ 2006-12-23  8:52                           ` Uwe Kleine-König
  2006-12-23 14:12                             ` Johannes Schindelin
  2006-12-23 19:53                           ` warn non utf-8 commit log messages Junio C Hamano
  1 sibling, 1 reply; 39+ messages in thread
From: Uwe Kleine-König @ 2006-12-23  8:52 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, junkio
Hallo Johannes,
Johannes Schindelin wrote:
> @@ -127,6 +128,15 @@ int cmd_commit_tree(int argc, const char **argv, const char *prefix)
>  	while (fgets(comment, sizeof(comment), stdin) != NULL)
>  		add_buffer(&buffer, &size, "%s", comment);
>  
> +	/* And check the encoding */
> +	buffer[size] = '\0';
> +	if (!strcmp(git_commit_encoding, "utf-8") && !is_utf8(buffer)) {
Maybe you could be more generous here.  E.g.
	if ((!strcasecmp(git_commit_encoding, "utf-8") ||
	!strcasecmp(git_commit_encoding, "utf8")) && !is_utf8(buffer))
Junio suggested also making this check when i18n.commitEncoding is empty.  I
didn't check the code to see whether this case is covered.
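A minimal sketch of the more generous comparison, assuming a hypothetical
helper is_utf8_encoding_name() that is not part of the posted patch.  Treating
a NULL or empty name as UTF-8 is an assumption based on Junio's suggestion,
not something the current code is known to do.

```c
#include <stddef.h>
#include <strings.h>	/* strcasecmp() (POSIX) */

/*
 * Hypothetical helper, not part of the patch: return 1 if the
 * configured encoding name should be treated as UTF-8.
 */
static int is_utf8_encoding_name(const char *name)
{
	/* Assumption: an unset or empty value means "use the default" */
	if (!name || !*name)
		return 1;
	return !strcasecmp(name, "utf-8") || !strcasecmp(name, "utf8");
}
```

With such a helper the check in cmd_commit_tree() could read
"if (is_utf8_encoding_name(git_commit_encoding) && !is_utf8(buffer))".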
Gruessle
Uwe
-- 
Uwe Kleine-König
http://www.google.com/search?q=2+to+the+power+of+12
* Re: [PATCH 1/2] libgit.a: add some UTF-8 handling functions
  2006-12-23  8:52                           ` Uwe Kleine-König
@ 2006-12-23 14:12                             ` Johannes Schindelin
  0 siblings, 0 replies; 39+ messages in thread
From: Johannes Schindelin @ 2006-12-23 14:12 UTC (permalink / raw)
  To: Uwe Kleine-König; +Cc: git, junkio
Hi,
On Sat, 23 Dec 2006, Uwe Kleine-König wrote:
> Johannes Schindelin wrote:
> > @@ -127,6 +128,15 @@ int cmd_commit_tree(int argc, const char **argv, const char *prefix)
> >  	while (fgets(comment, sizeof(comment), stdin) != NULL)
> >  		add_buffer(&buffer, &size, "%s", comment);
> >  
> > +	/* And check the encoding */
> > +	buffer[size] = '\0';
> > +	if (!strcmp(git_commit_encoding, "utf-8") && !is_utf8(buffer)) {
> Maybe you could be more generous here.  E.g.
> 
> 	if ((!strcasecmp(git_commit_encoding, "utf-8") ||
> 	!strcasecmp(git_commit_encoding, "utf8")) && !is_utf8(buffer))
> 
> Junio suggested also making this check when i18n.commitEncoding is empty.  I
> didn't check the code to see whether this case is covered.
The problem is, as I pointed out in another mail, that environment.c sets 
the default git_commit_encoding to "utf-8". This is hardwired, and I have 
no way to check if that was set by the config or not, other than reparsing 
the config myself.
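One possible way around this, sketched under the assumption that the default
could be left unset (NULL) and only applied on lookup.  The names
commit_encoding_cfg, set_commit_encoding(), get_commit_encoding() and
commit_encoding_is_explicit() are hypothetical; nothing like them exists in
environment.c as far as this thread shows.

```c
#include <stddef.h>

/* Hypothetical: NULL means the config never set i18n.commitencoding */
static const char *commit_encoding_cfg;

/* Would be called from the config parser when the variable is seen */
static void set_commit_encoding(const char *value)
{
	commit_encoding_cfg = value;
}

/* Effective encoding: the configured value, else the "utf-8" default */
static const char *get_commit_encoding(void)
{
	return commit_encoding_cfg ? commit_encoding_cfg : "utf-8";
}

/* 1 if the user set the encoding explicitly, 0 if it is the default */
static int commit_encoding_is_explicit(void)
{
	return commit_encoding_cfg != NULL;
}
```

Callers that only need the name keep working through get_commit_encoding(),
while the commit-tree check can ask whether the value came from the config.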
> Gruessle
Hah! You don't use umlauts and ssharp yourself!
Ciao,
Dscho
* warn non utf-8 commit log messages.
  2006-12-22 23:50                         ` Johannes Schindelin
  2006-12-23  8:52                           ` Uwe Kleine-König
@ 2006-12-23 19:53                           ` Junio C Hamano
  2006-12-23 23:46                             ` Johannes Schindelin
  1 sibling, 1 reply; 39+ messages in thread
From: Junio C Hamano @ 2006-12-23 19:53 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>> But I had enough of UTF-8 for a day.
>
> Okay, so I lied (these are both patches, revised and combined):
I am thinking of putting this in 'next', with the following
changes on top of your combined patch.
git-commit-tree warns if the commit message does not minimally
conform to the UTF-8 encoding when i18n.commitencoding is either
unset, or set to "utf-8".  It does not die as in your version.
 builtin-commit-tree.c |   13 +++++++------
 utf8.c                |    2 +-
 2 files changed, 8 insertions(+), 7 deletions(-)
diff --git a/builtin-commit-tree.c b/builtin-commit-tree.c
index f274721..f641787 100644
--- a/builtin-commit-tree.c
+++ b/builtin-commit-tree.c
@@ -78,6 +78,11 @@ static int new_parent(int idx)
 	return 1;
 }
 
+static const char commit_utf8_warn[] =
+"Warning: commit message does not conform to UTF-8.\n"
+"You may want to amend it after fixing the message, or set the config\n"
+"variable i18n.commitencoding to the encoding your project uses.\n";
+
 int cmd_commit_tree(int argc, const char **argv, const char *prefix)
 {
 	int i;
@@ -133,12 +138,8 @@ int cmd_commit_tree(int argc, const char **argv, const char *prefix)
 
 	/* And check the encoding */
 	buffer[size] = '\0';
-	if (!strcmp(git_commit_encoding, "utf-8") && !is_utf8(buffer)) {
-		fprintf(stderr, "Commit message does not conform to UTF-8.\n"
-			"Please fix the message,"
-			" or set the config variable i18n.commitencoding.\n");
-		return 1;
-	}
+	if (!strcmp(git_commit_encoding, "utf-8") && !is_utf8(buffer))
+		fprintf(stderr, commit_utf8_warn);
 
 	if (!write_sha1_file(buffer, size, commit_type, commit_sha1)) {
 		printf("%s\n", sha1_to_hex(commit_sha1));
diff --git a/utf8.c b/utf8.c
index aed60ad..8fa6257 100644
--- a/utf8.c
+++ b/utf8.c
@@ -244,7 +244,7 @@ void print_wrapped_text(const char *text, int indent, int indent2, int width)
 	for (;;) {
 		char c = *text;
 		if (!c || isspace(c)) {
-			if (w < width || space < 0) {
+			if (w < width || !space) {
 				const char *start = bol;
 				if (space)
 					start = space;
* Re: warn non utf-8 commit log messages.
  2006-12-23 19:53                           ` warn non utf-8 commit log messages Junio C Hamano
@ 2006-12-23 23:46                             ` Johannes Schindelin
  0 siblings, 0 replies; 39+ messages in thread
From: Johannes Schindelin @ 2006-12-23 23:46 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
Hi,
On Sat, 23 Dec 2006, Junio C Hamano wrote:
> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> 
> >> But I had enough of UTF-8 for a day.
> >
> > Okay, so I lied (these are both patches, revised and combined):
> 
> I am thinking of putting this in 'next', with the following
> changes on top of your combined patch.
> 
> git-commit-tree warns if the commit message does not minimally
> conform to the UTF-8 encoding when i18n.commitencoding is either
> unset, or set to "utf-8".  It does not die as in your version.
Yeah, this is nicer.
> -			if (w < width || space < 0) {
> +			if (w < width || !space) {
This is a real bug fix. Thank you. I changed quite a bit between offset 
and char*, and eventually forgot this part.
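To spell out the difference: with an integer offset, "no space seen yet" is
naturally encoded as -1 and tested with "< 0"; once the variable is a
const char *, the sentinel is NULL and the test has to be "!space".  A minimal
sketch of the two conventions, with hypothetical helper names that are not
taken from utf8.c:

```c
#include <stddef.h>

/* Offset flavour: returns the index of the last ' ' in s, or -1 */
static int last_space_offset(const char *s)
{
	int i, found = -1;

	for (i = 0; s[i]; i++)
		if (s[i] == ' ')
			found = i;
	return found;
}

/* Pointer flavour: returns a pointer to the last ' ' in s, or NULL */
static const char *last_space_ptr(const char *s)
{
	const char *found = NULL;

	for (; *s; s++)
		if (*s == ' ')
			found = s;
	return found;
}
```

A caller of the first variant checks "last_space_offset(s) < 0", a caller of
the second checks "!last_space_ptr(s)"; mixing the two is the slip fixed above.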
Ciao,
Dscho
* Re: [PATCH 1/2] libgit.a: add some UTF-8 handling functions
  2006-12-22 22:20                           ` Johannes Schindelin
  2006-12-22 22:33                             ` Junio C Hamano
@ 2006-12-25  4:03                             ` Alexander Litvinov
  1 sibling, 0 replies; 39+ messages in thread
From: Alexander Litvinov @ 2006-12-25  4:03 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Junio C Hamano, Nicolas Pitre, Uwe Kleine-König, git
> > > To me, characters are the symbols occupying one "column" each. Bytes
> > > are the 8-bit thingies that you usually use to encode the characters.
You can check man 3 wcwidth:
wcwidth - determine columns needed for a wide character
We could possibly convert the UTF-8 encoded string into wchar_t[] and use that
function.
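A rough sketch of that idea, using the standard mbstowcs() and the
POSIX/XSI wcwidth().  The helper name display_width() is hypothetical, and the
result depends on the current locale: the caller is assumed to have called
setlocale(LC_CTYPE, "") with a UTF-8 locale, otherwise non-ASCII input will
not decode.

```c
#define _XOPEN_SOURCE 700	/* for wcwidth() */
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical helper: columns needed to display the multibyte string s
 * in the current locale, or -1 if s cannot be decoded or contains a
 * nonprintable character.
 */
static int display_width(const char *s)
{
	size_t i, n = mbstowcs(NULL, s, 0);	/* length only (POSIX) */
	wchar_t *w;
	int width = 0;

	if (n == (size_t)-1)
		return -1;
	w = malloc((n + 1) * sizeof(*w));
	if (!w)
		return -1;
	mbstowcs(w, s, n + 1);
	for (i = 0; i < n; i++) {
		int cw = wcwidth(w[i]);
		if (cw < 0) {
			width = -1;
			break;
		}
		width += cw;
	}
	free(w);
	return width;
}
```

Unlike the table-driven code in the patch, this relies on the C library's
locale data, so the same bytes can yield different widths on different systems.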
Thread overview: 39+ messages
2006-12-08 11:44 [PATCH] Fix documentation copy&paste typo Uwe Kleine-Koenig
2006-12-19 14:16 ` Uwe Kleine-König
2006-12-19 17:27   ` Junio C Hamano
2006-12-21  8:59     ` specify charset for commits (Was: [PATCH] Fix documentation copy&paste typo) Uwe Kleine-König
2006-12-21  9:51       ` Johannes Schindelin
2006-12-21 10:11         ` Santi Béjar
2006-12-21 10:23         ` Alexander Litvinov
2006-12-21 10:52           ` Jakub Narebski
2006-12-21 13:05             ` Alexander Litvinov
2006-12-21 13:14               ` Jakub Narebski
2006-12-21 13:43             ` Uwe Kleine-König
2006-12-21 18:19           ` specify charset for commits Junio C Hamano
2006-12-21 18:48             ` Nicolas Pitre
2006-12-21 19:11             ` Uwe Kleine-König
2006-12-21 19:36             ` Alexander Litvinov
2006-12-22 12:07             ` Johannes Schindelin
2006-12-22 15:09               ` Uwe Kleine-König
2006-12-22 22:02                 ` Uwe Kleine-König
2006-12-22 15:31               ` Nicolas Pitre
2006-12-22 19:01                 ` Junio C Hamano
2006-12-22 21:03                   ` [PATCH 1/2] libgit.a: add some UTF-8 handling functions Johannes Schindelin
2006-12-22 21:27                     ` Junio C Hamano
2006-12-22 21:36                       ` Johannes Schindelin
2006-12-22 21:58                         ` Junio C Hamano
2006-12-22 22:20                           ` Johannes Schindelin
2006-12-22 22:33                             ` Junio C Hamano
2006-12-25  4:03                             ` Alexander Litvinov
2006-12-22 22:14                         ` Uwe Kleine-König
2006-12-22 22:19                     ` Uwe Kleine-König
2006-12-22 22:34                       ` Johannes Schindelin
2006-12-22 23:50                         ` Johannes Schindelin
2006-12-23  8:52                           ` Uwe Kleine-König
2006-12-23 14:12                             ` Johannes Schindelin
2006-12-23 19:53                           ` warn non utf-8 commit log messages Junio C Hamano
2006-12-23 23:46                             ` Johannes Schindelin
2006-12-22 21:06                   ` [PATCH 2/2] git-commit-tree: if i18n.commitencoding is utf-8 (default), check it Johannes Schindelin
2006-12-22 21:50                     ` Junio C Hamano
2006-12-22 22:21                       ` Johannes Schindelin
2006-12-22 21:15                   ` [RFC/PATCH 3/2] Wrap lines in shortlog Johannes Schindelin