* [PATCH] Fix documentation copy&paste typo
@ 2006-12-08 11:44 Uwe Kleine-Koenig
  2006-12-19 14:16 ` Uwe Kleine-König
  0 siblings, 1 reply; 39+ messages in thread
From: Uwe Kleine-Koenig @ 2006-12-08 11:44 UTC (permalink / raw)
  To: git; +Cc: Uwe Zeisberger

From: Uwe Zeisberger <zeisberg@informatik.uni-freiburg.de>

This was introduced in 45a3b12cfd3eaa05bbb0954790d5be5b8240a7b5

Signed-off-by: Uwe Kleine-König <zeisberg@informatik.uni-freiburg.de>
---
 gitweb/gitweb.perl |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 093bd72..ed40810 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -120,7 +120,7 @@ our %feature = (
 	# To disable system wide have in $GITWEB_CONFIG
 	# $feature{'snapshot'}{'default'} = [undef];
 	# To have project specific config enable override in $GITWEB_CONFIG
-	# $feature{'blame'}{'override'} = 1;
+	# $feature{'snapshot'}{'override'} = 1;
 	# and in project config gitweb.snapshot = none|gzip|bzip2;
 	'snapshot' => {
 		'sub' => \&feature_snapshot,
-- 
1.4.4.2.gb772

^ permalink raw reply related	[flat|nested] 39+ messages in thread
* Re: [PATCH] Fix documentation copy&paste typo
  2006-12-08 11:44 [PATCH] Fix documentation copy&paste typo Uwe Kleine-Koenig
@ 2006-12-19 14:16 ` Uwe Kleine-König
  2006-12-19 17:27   ` Junio C Hamano
  0 siblings, 1 reply; 39+ messages in thread
From: Uwe Kleine-König @ 2006-12-19 14:16 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Hello Junio,

Uwe Kleine-König wrote:
> From: Uwe Zeisberger <zeisberg@informatik.uni-freiburg.de>
>
> This was introduced in 45a3b12cfd3eaa05bbb0954790d5be5b8240a7b5
>
> Signed-off-by: Uwe Kleine-König <zeisberg@informatik.uni-freiburg.de>
> ---
> [...]

you took that patch as bbee1d971dc07c29f840b439aa2a2c890a12cf9f, thanks
for that.

Somehow the 'ö' (o-umlaut) in my name was messed up. If I do

	git cat-file -p bbee1d971dc07c29 | xxd | grep eine

I get:

	0000160: 6569 6e65 2d4b 1b2c 4143 361b 2842 6e69  eine-K.,AC6.(Bni

That is, the 'ö' became 8 byte long. Can you tell me what went wrong
there?

The commits by Karl Hasselström <kha@treskal.com> (e.g. e67c66251a4165)
use UTF-8. Does there exist a (maybe project specific) convention for
the encoding of commit logs?

Best regards
Uwe

-- 
Uwe Kleine-Koenig

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCH] Fix documentation copy&paste typo 2006-12-19 14:16 ` Uwe Kleine-König @ 2006-12-19 17:27 ` Junio C Hamano 2006-12-21 8:59 ` specify charset for commits (Was: [PATCH] Fix documentation copy&paste typo) Uwe Kleine-König 0 siblings, 1 reply; 39+ messages in thread From: Junio C Hamano @ 2006-12-19 17:27 UTC (permalink / raw) To: Uwe Kleine-König; +Cc: git Uwe Kleine-König <zeisberg@informatik.uni-freiburg.de> writes: > I get: > > 0000160: 6569 6e65 2d4b 1b2c 4143 361b 2842 6e69 eine-K.,AC6.(Bni > > That is, the 'ö' became 8 byte long. Can you tell me what went wrong > there? Me, keyboard and Emacs screwed up and stored it in ISO-2022 instead of UTF-8. Sorry. ^ permalink raw reply [flat|nested] 39+ messages in thread
* specify charset for commits (Was: [PATCH] Fix documentation copy&paste typo) 2006-12-19 17:27 ` Junio C Hamano @ 2006-12-21 8:59 ` Uwe Kleine-König 2006-12-21 9:51 ` Johannes Schindelin 0 siblings, 1 reply; 39+ messages in thread From: Uwe Kleine-König @ 2006-12-21 8:59 UTC (permalink / raw) To: Junio C Hamano; +Cc: git Hello Junio, Junio C Hamano wrote: > Me, keyboard and Emacs screwed up and stored it in ISO-2022 > instead of UTF-8. Sorry. It's a pity, but too late to change.[1] What do you think about a patch that makes git-commit-tree call iconv on its input to get it to UTF-8 (or any other charset). Maybe it makes sense to add another header to commit objects (e.g. "charset UTF-8") if something in the commit object is non-ASCII? In my eyes it would make sense to even force UTF-8 for commit logs (and author, committer). The downside is that it becomes impossible to store arbitrary byte sequences in commit objects. (IMHO not a real limitation.) Best regards Uwe [1] actually I think it's worse, because my iconv (from Debian's libc6 version 2.3.6.ds1-8) was unable to convert it correctly to utf-8 for any encoding that starts with ISO-2022. -- Uwe Kleine-König http://www.google.com/search?q=sin%28pi%2F2%29 ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: specify charset for commits (Was: [PATCH] Fix documentation copy&paste typo) 2006-12-21 8:59 ` specify charset for commits (Was: [PATCH] Fix documentation copy&paste typo) Uwe Kleine-König @ 2006-12-21 9:51 ` Johannes Schindelin 2006-12-21 10:11 ` Santi Béjar 2006-12-21 10:23 ` Alexander Litvinov 0 siblings, 2 replies; 39+ messages in thread From: Johannes Schindelin @ 2006-12-21 9:51 UTC (permalink / raw) To: Uwe Kleine-König; +Cc: Junio C Hamano, git [-- Attachment #1: Type: TEXT/PLAIN, Size: 544 bytes --] Hi, On Thu, 21 Dec 2006, Uwe Kleine-König wrote: > Junio C Hamano wrote: > > Me, keyboard and Emacs screwed up and stored it in ISO-2022 > > instead of UTF-8. Sorry. > It's a pity, but too late to change.[1] > > What do you think about a patch that makes git-commit-tree call iconv on > its input to get it to UTF-8 (or any other charset). We had this discussion over and over again. Last time (I think) was here: http://article.gmane.org/gmane.comp.version-control.git/11710 Summary: we do not want to force the use of utf8. Hth, Dscho ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: specify charset for commits (Was: [PATCH] Fix documentation copy&paste typo) 2006-12-21 9:51 ` Johannes Schindelin @ 2006-12-21 10:11 ` Santi Béjar 2006-12-21 10:23 ` Alexander Litvinov 1 sibling, 0 replies; 39+ messages in thread From: Santi Béjar @ 2006-12-21 10:11 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Uwe Kleine-König, Junio C Hamano, git On 12/21/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > Hi, > > On Thu, 21 Dec 2006, Uwe Kleine-König wrote: > > > Junio C Hamano wrote: > > > Me, keyboard and Emacs screwed up and stored it in ISO-2022 > > > instead of UTF-8. Sorry. > > It's a pity, but too late to change.[1] > > > > What do you think about a patch that makes git-commit-tree call iconv on > > its input to get it to UTF-8 (or any other charset). > > We had this discussion over and over again. Last time (I think) was here: > > http://article.gmane.org/gmane.comp.version-control.git/11710 > > Summary: we do not want to force the use of utf8. > Maybe git could have an example of the hook "commit-msg" that checks that the commit message is in certain charset. Santi ^ permalink raw reply [flat|nested] 39+ messages in thread
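A commit-msg hook along the lines suggested here could look roughly like the sketch below. Git runs the commit-msg hook with the path of the file holding the proposed message as its only argument, and a non-zero exit status aborts the commit. The helper name `message_is_valid` and the UTF-8 default are assumptions made for illustration; this is not a sample hook that ships with git.

```python
#!/usr/bin/env python3
"""Sketch of a commit-msg hook that rejects messages not valid in a charset."""
import sys


def message_is_valid(raw: bytes, encoding: str = "utf-8") -> bool:
    """Return True if the raw message bytes decode cleanly in `encoding`."""
    try:
        raw.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def main(argv):
    # Git passes the path of the commit message file as the first argument.
    with open(argv[1], "rb") as fh:
        raw = fh.read()
    if not message_is_valid(raw):
        print("commit message is not valid UTF-8; commit aborted",
              file=sys.stderr)
        return 1
    return 0


# Only act when a message file was actually supplied on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

One limitation of a purely structural check: the eight bytes `1b 2c 41 43 36 1b 28 42` quoted earlier in the thread are all below 0x80, so they pass as valid UTF-8 and a hook like this would not have caught that particular commit.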
* Re: specify charset for commits (Was: [PATCH] Fix documentation copy&paste typo) 2006-12-21 9:51 ` Johannes Schindelin 2006-12-21 10:11 ` Santi Béjar @ 2006-12-21 10:23 ` Alexander Litvinov 2006-12-21 10:52 ` Jakub Narebski 2006-12-21 18:19 ` specify charset for commits Junio C Hamano 1 sibling, 2 replies; 39+ messages in thread From: Alexander Litvinov @ 2006-12-21 10:23 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Uwe Kleine-König, Junio C Hamano, git > > What do you think about a patch that makes git-commit-tree call iconv on > > its input to get it to UTF-8 (or any other charset). > > We had this discussion over and over again. Last time (I think) was here: > http://article.gmane.org/gmane.comp.version-control.git/11710 > Summary: we do not want to force the use of utf8. May we can add new header into commit with commit text encoding ? ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: specify charset for commits (Was: [PATCH] Fix documentation copy&paste typo) 2006-12-21 10:23 ` Alexander Litvinov @ 2006-12-21 10:52 ` Jakub Narebski 2006-12-21 13:05 ` Alexander Litvinov 2006-12-21 13:43 ` Uwe Kleine-König 2006-12-21 18:19 ` specify charset for commits Junio C Hamano 1 sibling, 2 replies; 39+ messages in thread From: Jakub Narebski @ 2006-12-21 10:52 UTC (permalink / raw) To: git Alexander Litvinov wrote: >>> What do you think about a patch that makes git-commit-tree call iconv on >>> its input to get it to UTF-8 (or any other charset). >> >> We had this discussion over and over again. Last time (I think) was here: >> http://article.gmane.org/gmane.comp.version-control.git/11710 >> Summary: we do not want to force the use of utf8. > > May we can add new header into commit with commit text encoding ? I think it should be repository-wide decision. And we have i18n.commitEncoding configuration variable (perhaps it should be propagated on clone?). -- Jakub Narebski Warsaw, Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: specify charset for commits (Was: [PATCH] Fix documentation copy&paste typo)
  2006-12-21 10:52 ` Jakub Narebski
@ 2006-12-21 13:05 ` Alexander Litvinov
  2006-12-21 13:14   ` Jakub Narebski
  2006-12-21 13:43 ` Uwe Kleine-König
  1 sibling, 1 reply; 39+ messages in thread
From: Alexander Litvinov @ 2006-12-21 13:05 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

On Thursday 21 December 2006 16:52, Jakub Narebski wrote:
> > May we can add new header into commit with commit text encoding ?
>
> I think it should be repository-wide decision. And we have
> i18n.commitEncoding configuration variable (perhaps it should be propagated
> on clone?).

I would disagree with you. It is not hard to imagine an international
project managed by git. We [developers] can start to use utf-8 or a similar
universal encoding, but that is not always easy. For example, not long ago
all Russian Linux machines had LANG set to ru_RU.KOI8-R; now it tends to be
ru_RU.UTF-8. It would not be a big surprise to me if a developer from China
or Japan used something very unusual. And just imagine one developer using
Windows and Cygwin - ha ha ha, try to ask him to change the encoding :-)

The easiest way for git is just to store the commit encoding and let the
history browsing tool deal with it. Or, as it does now, simply ignore
encodings altogether and work with bytes, not chars. But in that case the
history browsing tool must have some magic knowledge about the encoding,
taken out of thin air :-). Or from a config file that covers the most used
encodings.

Alexander Litvinov.

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: specify charset for commits (Was: [PATCH] Fix documentation copy&paste typo)
  2006-12-21 13:05 ` Alexander Litvinov
@ 2006-12-21 13:14 ` Jakub Narebski
  0 siblings, 0 replies; 39+ messages in thread
From: Jakub Narebski @ 2006-12-21 13:14 UTC (permalink / raw)
  To: git

Alexander Litvinov wrote:

> On Thursday 21 December 2006 16:52, Jakub Narebski wrote:
>>> May we can add new header into commit with commit text encoding ?
>>
>> I think it should be repository-wide decision. And we have
>> i18n.commitEncoding configuration variable (perhaps it should be propagated
>> on clone?).
>
> I would disagree with you. It is not hard to imagine an international
> project managed by git. We [developers] can start to use utf-8 or a similar
> universal encoding, but that is not always easy. For example, not long ago
> all Russian Linux machines had LANG set to ru_RU.KOI8-R; now it tends to be
> ru_RU.UTF-8. It would not be a big surprise to me if a developer from China
> or Japan used something very unusual. And just imagine one developer using
> Windows and Cygwin - ha ha ha, try to ask him to change the encoding :-)
>
> The easiest way for git is just to store the commit encoding and let the
> history browsing tool deal with it. Or, as it does now, simply ignore
> encodings altogether and work with bytes, not chars. But in that case the
> history browsing tool must have some magic knowledge about the encoding,
> taken out of thin air :-). Or from a config file that covers the most used
> encodings.

Perhaps it is time to resurrect the idea of a "note" header in the commit
object? It could be used to store the charset for those commits which don't
use the default charset...

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: specify charset for commits (Was: [PATCH] Fix documentation copy&paste typo) 2006-12-21 10:52 ` Jakub Narebski 2006-12-21 13:05 ` Alexander Litvinov @ 2006-12-21 13:43 ` Uwe Kleine-König 1 sibling, 0 replies; 39+ messages in thread From: Uwe Kleine-König @ 2006-12-21 13:43 UTC (permalink / raw) To: Jakub Narebski; +Cc: git Hello, Jakub Narebski wrote: > Alexander Litvinov wrote: > > >>> What do you think about a patch that makes git-commit-tree call iconv on > >>> its input to get it to UTF-8 (or any other charset). > >> > >> We had this discussion over and over again. Last time (I think) was here: > >> http://article.gmane.org/gmane.comp.version-control.git/11710 > >> Summary: we do not want to force the use of utf8. > > > > May we can add new header into commit with commit text encoding ? > > I think it should be repository-wide decision. And we have > i18n.commitEncoding configuration variable The disadvantage from a repository-wide decision is that you cannot change it after a while. I didn't know that variable, but I think as it exists, git-commit-tree should iconv the commit message from local to i18n.commitEncoding before writing it. Moreover I like the idea of a new header for commits specifing the encoding. Git could default to the handling as it is now (i.e. just bytes) if the header is missing. Best regards Uwe -- Uwe Kleine-König http://www.google.com/search?q=1+newton+in+kg*m+%2F+s%5E2 ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: specify charset for commits
  2006-12-21 10:23 ` Alexander Litvinov
  2006-12-21 10:52   ` Jakub Narebski
@ 2006-12-21 18:19 ` Junio C Hamano
  2006-12-21 18:48   ` Nicolas Pitre
                     ` (3 more replies)
  1 sibling, 4 replies; 39+ messages in thread
From: Junio C Hamano @ 2006-12-21 18:19 UTC (permalink / raw)
  To: Alexander Litvinov; +Cc: Johannes Schindelin, Uwe Kleine-König, git

Alexander Litvinov <litvinov2004@gmail.com> writes:

> May we can add new header into commit with commit text encoding ?

I do not think we want to change the commit header, nor we would
want to re-encode, but I can see two possible improvements:

 (1) git-am should default to -u; this was suggested on the list
     long time ago, but is an incompatible change. v1.5.0 we
     can afford to be incompatible to make it more usable and
     safer.

 (2) update commit-tree to reject non utf-8 log messages and
     author/committer names when i18n.commitEncoding is _NOT_
     set, or set to utf-8.

     Maybe later we can use encoding validation routines for
     other encodings by checking i18n.commitEncoding, but at the
     minimum the above would be safe enough for recommended UTF-8
     only cases.

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: specify charset for commits 2006-12-21 18:19 ` specify charset for commits Junio C Hamano @ 2006-12-21 18:48 ` Nicolas Pitre 2006-12-21 19:11 ` Uwe Kleine-König ` (2 subsequent siblings) 3 siblings, 0 replies; 39+ messages in thread From: Nicolas Pitre @ 2006-12-21 18:48 UTC (permalink / raw) To: Junio C Hamano Cc: Alexander Litvinov, Johannes Schindelin, Uwe Kleine-König, git On Thu, 21 Dec 2006, Junio C Hamano wrote: > Alexander Litvinov <litvinov2004@gmail.com> writes: > > > May we can add new header into commit with commit text encoding ? > > I do not think we want to change the commit header, nor we would > want to re-encode, but I can see two possible improvements: > > (1) git-am should default to -u; this was suggested on the list > long time ago, but is an incompatible change. v1.5.0 we > can afford to be incompatible to make it more usable and > safer. > > (2) update commit-tree to reject non utf-8 log messages and > author/committer names when i18n.commitEncoding is _NOT_ > set, or set to utf-8. This would be a good thing, both of them actually. The Linux kernel already contains different charsets for author full name and/or commit messages making git-log output a mix of encodings already. And sometimes this inconsistency comes from the _same_ author. Nicolas ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: specify charset for commits 2006-12-21 18:19 ` specify charset for commits Junio C Hamano 2006-12-21 18:48 ` Nicolas Pitre @ 2006-12-21 19:11 ` Uwe Kleine-König 2006-12-21 19:36 ` Alexander Litvinov 2006-12-22 12:07 ` Johannes Schindelin 3 siblings, 0 replies; 39+ messages in thread From: Uwe Kleine-König @ 2006-12-21 19:11 UTC (permalink / raw) To: Junio C Hamano; +Cc: Alexander Litvinov, Johannes Schindelin, git Junio C Hamano wrote: > Alexander Litvinov <litvinov2004@gmail.com> writes: > > > May we can add new header into commit with commit text encoding ? > > I do not think we want to change the commit header, nor we would > want to re-encode, but I can see two possible improvements: > > (1) git-am should default to -u; this was suggested on the list > long time ago, but is an incompatible change. v1.5.0 we > can afford to be incompatible to make it more usable and > safer. > > (2) update commit-tree to reject non utf-8 log messages and > author/committer names when i18n.commitEncoding is _NOT_ > set, or set to utf-8. > > Maybe later we can use encoding validation routines for > other encodings by checking i18n.commitEncoding, but at the > minimum the above would be safe enough for recommended UTF-8 > only cases. As I only want to use UTF-8 both suggestions are fine for me. Is there a generic way to check an encoding? (I don't know if there is an encoding that can encode everything. If so, we could use iconv -f $enc -t superencoding. Until now I thought UTF-8 can do, but in the post Johannes pointed out, you (Junio) implyed that it cannot.) Best regards Uwe -- Uwe Kleine-König dd if=/proc/self/exe bs=1 skip=1 count=3 2>/dev/null ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: specify charset for commits
  2006-12-21 18:19 ` specify charset for commits Junio C Hamano
  2006-12-21 18:48   ` Nicolas Pitre
  2006-12-21 19:11   ` Uwe Kleine-König
@ 2006-12-21 19:36 ` Alexander Litvinov
  2006-12-22 12:07 ` Johannes Schindelin
  3 siblings, 0 replies; 39+ messages in thread
From: Alexander Litvinov @ 2006-12-21 19:36 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Johannes Schindelin, Uwe Kleine-König, git

> I do not think we want to change the commit header

Can you please explain why not?

>  (1) git-am should default to -u; this was suggested on the list
>      long time ago, but is an incompatible change. v1.5.0 we
>      can afford to be incompatible to make it more usable and
>      safer.

I use git-am rarely, so I can't comment on this.

>  (2) update commit-tree to reject non utf-8 log messages and
>      author/committer names when i18n.commitEncoding is _NOT_
>      set, or set to utf-8.
>
>      Maybe later we can use encoding validation routines for
>      other encodings by checking i18n.commitEncoding, but at the
>      minimum the above would be safe enough for recommended UTF-8
>      only cases.

Consider this situation:
1. I have a utf-8 encoded repo.
2. Somebody clones my repo, tries to commit using a non-utf-8 encoding,
   fails, and changes i18n.commitEncoding. Then they commit something and
   ask me to pull.
3. I pull and get a non-utf-8 commit message :-)

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: specify charset for commits
  2006-12-21 18:19 ` specify charset for commits Junio C Hamano
                   ` (2 preceding siblings ...)
  2006-12-21 19:36 ` Alexander Litvinov
@ 2006-12-22 12:07 ` Johannes Schindelin
  2006-12-22 15:09   ` Uwe Kleine-König
  2006-12-22 15:31   ` Nicolas Pitre
  3 siblings, 2 replies; 39+ messages in thread
From: Johannes Schindelin @ 2006-12-22 12:07 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Alexander Litvinov, Uwe Kleine-König, git

Hi,

On Thu, 21 Dec 2006, Junio C Hamano wrote:

>  (2) update commit-tree to reject non utf-8 log messages and
>      author/committer names when i18n.commitEncoding is _NOT_
>      set, or set to utf-8.

The problem is: you cannot easily recognize if it is UTF8 or not,
programmatically. There is a good indicator _against_ UTF8, namely the
first byte can _only_ be 0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx. But there
is no _positive_ sign that it is UTF8. For example, many umlauts and other
special modifications to letters stay in the range 0x7f-0xff.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 39+ messages in thread
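The asymmetry described in this message (invalid bytes can rule UTF-8 out, but valid bytes cannot prove it was intended) can be seen with a two-byte example. The sketch below uses Python's built-in codecs purely as an illustration; the variable names are made up.

```python
# The same two bytes are well-formed in both encodings, so passing a UTF-8
# validity check is not proof that the author meant UTF-8.
raw = b"\xc3\xb6"
as_utf8 = raw.decode("utf-8")      # one character: o-umlaut
as_latin1 = raw.decode("latin-1")  # two characters: A-tilde, pilcrow sign

# A Latin-1 encoded o-umlaut is the single byte 0xf6 and is rejected when
# decoded as UTF-8: this is the negative indicator.
latin1_umlaut = "ö".encode("latin-1")
try:
    latin1_umlaut.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

In practice the negative indicator is what a validity check relies on: ordinary Latin-1 text containing accented letters almost always fails to decode as UTF-8, even though the reverse inference is not sound.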
* Re: specify charset for commits
  2006-12-22 12:07 ` Johannes Schindelin
@ 2006-12-22 15:09 ` Uwe Kleine-König
  2006-12-22 22:02   ` Uwe Kleine-König
  2006-12-22 15:31 ` Nicolas Pitre
  1 sibling, 1 reply; 39+ messages in thread
From: Uwe Kleine-König @ 2006-12-22 15:09 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Junio C Hamano, Alexander Litvinov, git

Hello Johannes,

Johannes Schindelin wrote:
> The problem is: you cannot easily recognize if it is UTF8 or not,
> programatically. There is a good indicator _against_ UTF8, namely the
> first byte can _only_ be 0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx. But there
> is no _positive_ sign that it is UTF8. For example, many umlauts and other
> special modifications to letters, stay in the range 0x7f-0xff.

That's not the only indication. Here comes a (Python) function that
checks if string s is correctly UTF-8 encoded:

	def is_utf8_str(s):
	    cnt_furtherbytes = 0
	    for c in s:
	        if cnt_furtherbytes > 0:
	            if ord(c) & 0xc0 == 0x80:
	                cnt_furtherbytes -= 1
	            else:
	                return False
	        else:
	            if ord(c) < 0x80:
	                continue
	            elif ord(c) < 0xc0:
	                return False
	            elif ord(c) < 0xe0:
	                cnt_furtherbytes = 1
	            elif ord(c) < 0xf0:
	                cnt_furtherbytes = 2
	            elif ord(c) < 0xf8:
	                cnt_furtherbytes = 3
	            elif ord(c) < 0xfc:
	                cnt_furtherbytes = 4
	            elif ord(c) < 0xfe:
	                cnt_furtherbytes = 5
	            else:
	                return False
	    return True

A UTF-8 character is either one byte long with the msb 0, or a sequence
starting with a value between 0xc0 and 0xfd (inclusive) followed by,
depending on that first value, up to five further bytes in the range 0x80
to 0xbf (six bytes in total).

You could even be more strict by checking for Unicode 3.1 conformance
(i.e. a character has to be encoded in its shortest form).

Look at utf8(7) for further details. (This manpage is included in the
Debian manpages package.)

Best regards
Uwe

-- 
Uwe Kleine-König

http://www.google.com/search?q=5+choose+3

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: specify charset for commits
  2006-12-22 15:09 ` Uwe Kleine-König
@ 2006-12-22 22:02 ` Uwe Kleine-König
  0 siblings, 0 replies; 39+ messages in thread
From: Uwe Kleine-König @ 2006-12-22 22:02 UTC (permalink / raw)
  To: Johannes Schindelin, Junio C Hamano, Alexander Litvinov, git

Hello,

Uwe Kleine-König wrote:
> 	def is_utf8_str(s):
> 	    cnt_furtherbytes = 0
> 	    for c in s:
> 	        if cnt_furtherbytes > 0:
> 	            if ord(c) & 0xc0 == 0x80:
> 	                cnt_furtherbytes -= 1
> 	            else:
> 	                return False
> 	        else:
> 	            if ord(c) < 0x80:
> 	                continue
> 	            elif ord(c) < 0xc0:
> 	                return False
> 	            elif ord(c) < 0xe0:
> 	                cnt_furtherbytes = 1
> 	            elif ord(c) < 0xf0:
> 	                cnt_furtherbytes = 2
> 	            elif ord(c) < 0xf8:
> 	                cnt_furtherbytes = 3
> 	            elif ord(c) < 0xfc:
> 	                cnt_furtherbytes = 4
> 	            elif ord(c) < 0xfe:
> 	                cnt_furtherbytes = 5
> 	            else:
> 	                return False
> 	    return True

While I washed the dishes I noticed that the last "return True" should
be "return cnt_furtherbytes == 0".

Just before someone else corrects me ... :-)

Best regards
Uwe

-- 
Uwe Kleine-König

http://www.google.com/search?q=parsec%5E2*Joule%2FNewton+in+tablespoon

^ permalink raw reply	[flat|nested] 39+ messages in thread
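With that correction folded in, the check might look as follows when ported to Python 3 and bytes input. This is a sketch rather than the code posted in the thread: the name `is_utf8_bytes` is made up, and it accepts only sequences of up to four bytes (the limit in current UTF-8), whereas the posted function allowed up to six. Like the posted function, it does not reject overlong encodings or surrogates.

```python
def is_utf8_bytes(data: bytes) -> bool:
    """Structural UTF-8 check: each lead byte must be followed by the right
    number of continuation bytes. Overlong forms are not rejected."""
    remaining = 0  # continuation bytes still expected
    for b in data:  # iterating over bytes yields ints in Python 3
        if remaining:
            if (b & 0xC0) != 0x80:
                return False  # expected a 10xxxxxx continuation byte
            remaining -= 1
        elif b < 0x80:
            continue  # plain ASCII
        elif b < 0xC0:
            return False  # stray continuation byte
        elif b < 0xE0:
            remaining = 1
        elif b < 0xF0:
            remaining = 2
        elif b < 0xF8:
            remaining = 3
        else:
            return False
    # A sequence truncated at the end of the input must not count as valid.
    return remaining == 0
```

The final `return remaining == 0` is the fix described in the message above: without it, input ending in the middle of a multi-byte sequence (for example a lone `0xc3`) would be reported as valid.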
* Re: specify charset for commits 2006-12-22 12:07 ` Johannes Schindelin 2006-12-22 15:09 ` Uwe Kleine-König @ 2006-12-22 15:31 ` Nicolas Pitre 2006-12-22 19:01 ` Junio C Hamano 1 sibling, 1 reply; 39+ messages in thread From: Nicolas Pitre @ 2006-12-22 15:31 UTC (permalink / raw) To: Johannes Schindelin Cc: Junio C Hamano, Alexander Litvinov, Uwe Kleine-König, git On Fri, 22 Dec 2006, Johannes Schindelin wrote: > Hi, > > On Thu, 21 Dec 2006, Junio C Hamano wrote: > > > (2) update commit-tree to reject non utf-8 log messages and > > author/committer names when i18n.commitEncoding is _NOT_ > > set, or set to utf-8. > > The problem is: you cannot easily recognize if it is UTF8 or not, > programatically. There is a good indicator _against_ UTF8, namely the > first byte can _only_ be 0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx. But there > is no _positive_ sign that it is UTF8. For example, many umlauts and other > special modifications to letters, stay in the range 0x7f-0xff. Still... that would be a good enough thing to have in the majority of cases, wouldn't it? Nicolas ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: specify charset for commits 2006-12-22 15:31 ` Nicolas Pitre @ 2006-12-22 19:01 ` Junio C Hamano 2006-12-22 21:03 ` [PATCH 1/2] libgit.a: add some UTF-8 handling functions Johannes Schindelin ` (2 more replies) 0 siblings, 3 replies; 39+ messages in thread From: Junio C Hamano @ 2006-12-22 19:01 UTC (permalink / raw) To: Nicolas Pitre Cc: Johannes Schindelin, Alexander Litvinov, Uwe Kleine-König, git Nicolas Pitre <nico@cam.org> writes: > On Fri, 22 Dec 2006, Johannes Schindelin wrote: > >> Hi, >> >> On Thu, 21 Dec 2006, Junio C Hamano wrote: >> >> > (2) update commit-tree to reject non utf-8 log messages and >> > author/committer names when i18n.commitEncoding is _NOT_ >> > set, or set to utf-8. >> >> The problem is: you cannot easily recognize if it is UTF8 or not, >> programatically. There is a good indicator _against_ UTF8, namely the >> first byte can _only_ be 0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx. But there >> is no _positive_ sign that it is UTF8. For example, many umlauts and other >> special modifications to letters, stay in the range 0x7f-0xff. > > Still... that would be a good enough thing to have in the majority of > cases, wouldn't it? I think that would be very sane thing to do. ^ permalink raw reply [flat|nested] 39+ messages in thread
* [PATCH 1/2] libgit.a: add some UTF-8 handling functions
  2006-12-22 19:01 ` Junio C Hamano
@ 2006-12-22 21:03 ` Johannes Schindelin
  2006-12-22 21:27   ` Junio C Hamano
  2006-12-22 22:19   ` Uwe Kleine-König
  2006-12-22 21:06 ` [PATCH 2/2] git-commit-tree: if i18n.commitencoding is utf-8 (default), check it Johannes Schindelin
  2006-12-22 21:15 ` [RFC/PATCH 3/2] Wrap lines in shortlog Johannes Schindelin
  2 siblings, 2 replies; 39+ messages in thread
From: Johannes Schindelin @ 2006-12-22 21:03 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Nicolas Pitre, Uwe Kleine-König, git

This adds utf8_byte_count(), utf8_strlen() and print_wrapped_text().

The most important is probably utf8_strlen(), which returns the length
of the text, if it is in UTF-8, otherwise -1.

Note that we do not go the full nine yards: we could also check that
the character is encoded with the minimum amount of bytes, as pointed
out by Uwe Kleine-Koenig.

The function print_wrapped_text() can be used to wrap text to a certain
line length.

Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
---

	On Fri, 22 Dec 2006, Junio C Hamano wrote:

	> Nicolas Pitre <nico@cam.org> writes:
	>
	> > On Fri, 22 Dec 2006, Johannes Schindelin wrote:
	> >>
	> >> On Thu, 21 Dec 2006, Junio C Hamano wrote:
	> >>
	> >> >  (2) update commit-tree to reject non utf-8 log messages and
	> >> >      author/committer names when i18n.commitEncoding is _NOT_
	> >> >      set, or set to utf-8.
	> >>
	> >> The problem is: you cannot easily recognize if it is UTF8 or
	> >> not, programatically. There is a good indicator _against_
	> >> UTF8, namely the first byte can _only_ be 0xxxxxxx, 110xxxxx,
	> >> 1110xxxx, 11110xxx. But there is no _positive_ sign that it
	> >> is UTF8. For example, many umlauts and other special
	> >> modifications to letters, stay in the range 0x7f-0xff.
	> >
	> > Still... that would be a good enough thing to have in the
	> > majority of cases, wouldn't it?
	>
	> I think that would be very sane thing to do.

	Well, this patch together with the next one implements that.

 Makefile |    6 ++-
 utf8.c   |   93 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 utf8.h   |    8 +++++
 3 files changed, 105 insertions(+), 2 deletions(-)

diff --git a/Makefile b/Makefile
index 29c4662..b4ca48b 100644
--- a/Makefile
+++ b/Makefile
@@ -237,7 +237,8 @@ LIB_H = \
 	archive.h blob.h cache.h commit.h csum-file.h delta.h grep.h \
 	diff.h object.h pack.h pkt-line.h quote.h refs.h list-objects.h sideband.h \
 	run-command.h strbuf.h tag.h tree.h git-compat-util.h revision.h \
-	tree-walk.h log-tree.h dir.h path-list.h unpack-trees.h builtin.h
+	tree-walk.h log-tree.h dir.h path-list.h unpack-trees.h builtin.h \
+	utf8.h
 
 DIFF_OBJS = \
 	diff.o diff-lib.o diffcore-break.o diffcore-order.o \
@@ -256,7 +257,8 @@ LIB_OBJS = \
 	revision.o pager.o tree-walk.o xdiff-interface.o \
 	write_or_die.o trace.o list-objects.o grep.o \
 	alloc.o merge-file.o path-list.o help.o unpack-trees.o $(DIFF_OBJS) \
-	color.o wt-status.o archive-zip.o archive-tar.o shallow.o
+	color.o wt-status.o archive-zip.o archive-tar.o shallow.o \
+	utf8.o
 
 BUILTIN_OBJS = \
 	builtin-add.o \
diff --git a/utf8.c b/utf8.c
new file mode 100644
index 0000000..06a66c7
--- /dev/null
+++ b/utf8.c
@@ -0,0 +1,93 @@
+#include "git-compat-util.h"
+#include "utf8.h"
+
+/*
+ * This function returns the number of bytes occupied by the character
+ * pointed to by the variable start. If it is not valid UTF-8, it
+ * returns -1.
+ */
+int utf8_byte_count(const char *start)
+{
+	unsigned char c = *(unsigned char *)start;
+	int i, count = 0;
+
+	if (!(c & 0x80))
+		count = 1;
+	else if ((c & 0xe0) == 0xc0)
+		count = 2;
+	else if ((c & 0xf0) == 0xe0)
+		count = 3;
+	else if ((c & 0xf8) == 0xf0)
+		count = 4;
+	else
+		return -1;
+
+	for (i = 1; i < count; i++)
+		if ((start[i] & 0xc0) != 0x80)
+			return -1;
+	return count;
+}
+
+int utf8_strlen(const char *text)
+{
+	int len = 0;
+	while (*text) {
+		int count = utf8_byte_count(text);
+		if (count < 0)
+			return -1;
+		len += count;
+		text += count;
+	}
+	return len;
+}
+
+static void print_spaces(int count)
+{
+	static const char s[] = " ";
+	while (count >= sizeof(s)) {
+		fwrite(s, sizeof(s) - 1, 1, stdout);
+		count -= sizeof(s) - 1;
+	}
+	fwrite(s, count, 1, stdout);
+}
+
+/*
+ * Wrap the text, if necessary. The variable indent is the indent for the
+ * first line, indent2 is the indent for all other lines.
+ */
+void print_wrapped_text(const char *text, int indent, int indent2, int len)
+{
+	int count = 0, space = -1;
+	int l = utf8_strlen(text), assume_utf8 = (l >= 0);
+
+	l = indent;
+
+	for (;;) {
+		char c = text[count];
+		if (!c || isspace(c)) {
+			if (l < len || space < 0) {
+				const char *start = text;
+				if (space >= 0)
+					start += space;
+				else
+					print_spaces(indent);
+				fwrite(start, text + count - start, 1, stdout);
+				if (!c) {
+					putchar('\n');
+					return;
+				} else if (c == '\t')
+					l |= 0x07;
+				space = count;
+			} else {
+				putchar('\n');
+				text += space + 1;
+				indent = indent2;
+				space = -1;
+				count = l = 0;
+				continue;
+			}
+		}
+		count += assume_utf8 ? utf8_byte_count(text + count) : 1;
+		l++;
+	}
+}
diff --git a/utf8.h b/utf8.h
new file mode 100644
index 0000000..96dded9
--- /dev/null
+++ b/utf8.h
@@ -0,0 +1,8 @@
+#ifndef GIT_UTF8_H
+#define GIT_UTF8_H
+
+int utf8_byte_count(const char *start);
+int utf8_strlen(const char *text);
+void print_wrapped_text(const char *text, int indent, int indent2, int len);
+
+#endif
-- 
1.4.4.3.ge5f98-dirty

^ permalink raw reply related	[flat|nested] 39+ messages in thread
* Re: [PATCH 1/2] libgit.a: add some UTF-8 handling functions 2006-12-22 21:03 ` [PATCH 1/2] libgit.a: add some UTF-8 handling functions Johannes Schindelin @ 2006-12-22 21:27 ` Junio C Hamano 2006-12-22 21:36 ` Johannes Schindelin 2006-12-22 22:19 ` Uwe Kleine-König 1 sibling, 1 reply; 39+ messages in thread From: Junio C Hamano @ 2006-12-22 21:27 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Nicolas Pitre, Uwe Kleine-König, git Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: > This adds utf8_byte_count(), utf8_strlen() and print_wrapped_text(). > > The most important is probably utf8_strlen(), which returns the length > of the text, if it is in UTF-8, otherwise -1. > > Note that we do not go the full nine yards: we could also check that > the character is encoded with the minimum amount of bytes, as pointed > out by Uwe Kleine-Koenig. > > The function print_wrapped_text() can be used to wrap text to a certain > line length. If you do wrapped_text, I think you do not _want_ strlen (the definition to me of strlen is "number of characters in the string"). What you want is a function that returns the number of columns consumed when displayed on monospace terminal. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 1/2] libgit.a: add some UTF-8 handling functions
  2006-12-22 21:27 ` Junio C Hamano
@ 2006-12-22 21:36 ` Johannes Schindelin
  2006-12-22 21:58 ` Junio C Hamano
  2006-12-22 22:14 ` Uwe Kleine-König
  0 siblings, 2 replies; 39+ messages in thread
From: Johannes Schindelin @ 2006-12-22 21:36 UTC
To: Junio C Hamano; +Cc: Nicolas Pitre, Uwe Kleine-König, git

Hi,

On Fri, 22 Dec 2006, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> > This adds utf8_byte_count(), utf8_strlen() and print_wrapped_text().
> >
> > The most important is probably utf8_strlen(), which returns the length
> > of the text, if it is in UTF-8, otherwise -1.
> >
> > Note that we do not go the full nine yards: we could also check that
> > the character is encoded with the minimum amount of bytes, as pointed
> > out by Uwe Kleine-Koenig.
> >
> > The function print_wrapped_text() can be used to wrap text to a certain
> > line length.
>
> If you do wrapped_text, I think you do not _want_ strlen (the
> definition to me of strlen is "number of characters in the
> string").  What you want is a function that returns the number
> of columns consumed when displayed on monospace terminal.

To me, characters are the symbols occupying one "column" each. Bytes are
the 8-bit thingies that you usually use to encode the characters.

Ciao,
Dscho

* Re: [PATCH 1/2] libgit.a: add some UTF-8 handling functions
  2006-12-22 21:36 ` Johannes Schindelin
@ 2006-12-22 21:58 ` Junio C Hamano
  2006-12-22 22:20 ` Johannes Schindelin
  2006-12-22 22:14 ` Uwe Kleine-König
  1 sibling, 1 reply; 39+ messages in thread
From: Junio C Hamano @ 2006-12-22 21:58 UTC
To: Johannes Schindelin; +Cc: Nicolas Pitre, Uwe Kleine-König, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

>> If you do wrapped_text, I think you do not _want_ strlen (the
>> definition to me of strlen is "number of characters in the
>> string").  What you want is a function that returns the number
>> of columns consumed when displayed on monospace terminal.
>
> To me, characters are the symbols occupying one "column" each. Bytes are
> the 8-bit thingies that you usually use to encode the characters.

I cannot tell from your response if you are very well aware of
Asian "double-width" characters and your version of strlen()
counts one such character as two, or if you are totally unaware
about the issue and your function returns 1 for a string that
consists of a single such character.

If the former, then the function is not strlen() anymore, and if
the latter, then it is unusable for wrapping purposes.

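The distinction drawn in this message (bytes, characters, and terminal columns are three different counts) can be made concrete with a small C sketch. It is not part of the patches in this thread: measure(), toy_width() and next_code_point() are hypothetical names, the input is assumed to be well-formed UTF-8, and the width rule is deliberately cut down to a few East Asian ranges, so treat it as an illustration only.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Decode one code point and advance *s past it.
 * Assumption: the input is well-formed UTF-8; nothing is validated here.
 */
static unsigned long next_code_point(const unsigned char **s)
{
	const unsigned char *p = *s;
	unsigned long ch;
	int extra;

	if (p[0] < 0x80) {
		ch = p[0];
		extra = 0;
	} else if ((p[0] & 0xe0) == 0xc0) {
		ch = p[0] & 0x1f;
		extra = 1;
	} else if ((p[0] & 0xf0) == 0xe0) {
		ch = p[0] & 0x0f;
		extra = 2;
	} else {
		ch = p[0] & 0x07;
		extra = 3;
	}
	p++;
	while (extra-- > 0)
		ch = (ch << 6) | (*p++ & 0x3f);
	*s = p;
	return ch;
}

/* Deliberately incomplete: only a few wide ranges count as two columns. */
static int toy_width(unsigned long ch)
{
	if ((ch >= 0x1100 && ch <= 0x115f) ||	/* Hangul Jamo initial consonants */
	    (ch >= 0x2e80 && ch <= 0xa4cf) ||	/* CJK ... Yi */
	    (ch >= 0xac00 && ch <= 0xd7a3) ||	/* Hangul syllables */
	    (ch >= 0xff00 && ch <= 0xff60))	/* fullwidth forms */
		return 2;
	return 1;
}

struct text_metrics {
	size_t bytes;
	size_t characters;
	size_t columns;
};

/* Report all three counts for one NUL-terminated string. */
struct text_metrics measure(const char *text)
{
	const unsigned char *s = (const unsigned char *)text;
	struct text_metrics m = { 0, 0, 0 };

	assert(text != NULL);
	while (*s) {
		const unsigned char *start = s;
		unsigned long ch = next_code_point(&s);

		m.bytes += (size_t)(s - start);
		m.characters++;
		m.columns += (size_t)toy_width(ch);
	}
	return m;
}
```

Under these assumptions the two-character string U+65E5 U+672C measures 6 bytes, 2 characters and 4 columns, which is why a character count alone cannot drive line wrapping.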
* Re: [PATCH 1/2] libgit.a: add some UTF-8 handling functions
  2006-12-22 21:58 ` Junio C Hamano
@ 2006-12-22 22:20 ` Johannes Schindelin
  2006-12-22 22:33 ` Junio C Hamano
  2006-12-25  4:03 ` Alexander Litvinov
  0 siblings, 2 replies; 39+ messages in thread
From: Johannes Schindelin @ 2006-12-22 22:20 UTC
To: Junio C Hamano; +Cc: Nicolas Pitre, Uwe Kleine-König, git

Hi,

On Fri, 22 Dec 2006, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> >> If you do wrapped_text, I think you do not _want_ strlen (the
> >> definition to me of strlen is "number of characters in the
> >> string").  What you want is a function that returns the number
> >> of columns consumed when displayed on monospace terminal.
> >
> > To me, characters are the symbols occupying one "column" each. Bytes are
> > the 8-bit thingies that you usually use to encode the characters.
>
> I cannot tell from your response if you are very well aware of
> Asian "double-width" characters and your version of strlen()
> counts one such character as two, or if you are totally unaware
> about the issue and your function returns 1 for a string that
> consists of a single such character.
>
> If the former, then the function is not strlen() anymore, and if
> the latter, then it is unusable for wrapping purposes.

The latter. Oh, well. Call me a Western idiot.

And scrap that patch.

Ciao,
Dscho

* Re: [PATCH 1/2] libgit.a: add some UTF-8 handling functions
  2006-12-22 22:20 ` Johannes Schindelin
@ 2006-12-22 22:33 ` Junio C Hamano
  2006-12-25  4:03 ` Alexander Litvinov
  1 sibling, 0 replies; 39+ messages in thread
From: Junio C Hamano @ 2006-12-22 22:33 UTC
To: Johannes Schindelin; +Cc: Nicolas Pitre, Uwe Kleine-König, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

>> > To me, characters are the symbols occupying one "column" each. Bytes are
>> > the 8-bit thingies that you usually use to encode the characters.
>>
>> I cannot tell from your response if you are very well aware of
>> Asian "double-width" characters and your version of strlen()
>> counts one such character as two, or if you are totally unaware
>> about the issue and your function returns 1 for a string that
>> consists of a single such character.
>>
>> If the former, then the function is not strlen() anymore, and if
>> the latter, then it is unusable for wrapping purposes.
>
> The latter. Oh, well. Call me a Western idiot.
>
> And scrap that patch.

Hey, don't give up too quickly.

Although the execution of the initial revision might have been
less than desirable, the series really meant well and was in the
right direction.  We do that all the time ;-).

* Re: [PATCH 1/2] libgit.a: add some UTF-8 handling functions
  2006-12-22 22:20 ` Johannes Schindelin
  2006-12-22 22:33 ` Junio C Hamano
@ 2006-12-25  4:03 ` Alexander Litvinov
  1 sibling, 0 replies; 39+ messages in thread
From: Alexander Litvinov @ 2006-12-25 4:03 UTC
To: Johannes Schindelin
Cc: Junio C Hamano, Nicolas Pitre, Uwe Kleine-König, git

> > > To me, characters are the symbols occupying one "column" each. Bytes
> > > are the 8-bit thingies that you usually use to encode the characters.

You can check man 3 wcwidth:

	wcwidth - determine columns needed for a wide character

We could possibly convert the UTF-8 encoded string into wchar_t[] and
use that function.

* Re: [PATCH 1/2] libgit.a: add some UTF-8 handling functions
  2006-12-22 21:36 ` Johannes Schindelin
  2006-12-22 21:58 ` Junio C Hamano
@ 2006-12-22 22:14 ` Uwe Kleine-König
  1 sibling, 0 replies; 39+ messages in thread
From: Uwe Kleine-König @ 2006-12-22 22:14 UTC
To: Johannes Schindelin; +Cc: Junio C Hamano, Nicolas Pitre, git

Hello Johannes,

Johannes Schindelin wrote:
> On Fri, 22 Dec 2006, Junio C Hamano wrote:
>
> > Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> >
> > > This adds utf8_byte_count(), utf8_strlen() and print_wrapped_text().
> > >
> > > The most important is probably utf8_strlen(), which returns the length
> > > of the text, if it is in UTF-8, otherwise -1.
> > >
> > > Note that we do not go the full nine yards: we could also check that
> > > the character is encoded with the minimum amount of bytes, as pointed
> > > out by Uwe Kleine-Koenig.
> > >
> > > The function print_wrapped_text() can be used to wrap text to a certain
> > > line length.
> >
> > If you do wrapped_text, I think you do not _want_ strlen (the
> > definition to me of strlen is "number of characters in the
> > string").  What you want is a function that returns the number
> > of columns consumed when displayed on monospace terminal.
>
> To me, characters are the symbols occupying one "column" each. Bytes are
> the 8-bit thingies that you usually use to encode the characters.

Quoting utf-8(7):

	are no longer valid in UTF-8 locales.  Firstly, a single byte
	does not necessarily correspond any more to a single character.
	Secondly, since modern terminal emulators in UTF-8 mode also
	support Chinese, Japanese, and Korean double-width characters
	as well as non-spacing combining characters, outputting a
	single character does not necessarily advance the cursor by one
	position as it did in ASCII.  Library functions such as
	mbsrtowcs(3) and wcswidth(3) should be used today to count
	characters and cursor positions.

I'd prefer using a similar naming scheme.

To acknowledge Junio, wcslen(3) (the wide-character equivalent of the
strlen() function) counts the number of (wide-)characters in a string.

Best regards,
Uwe

-- 
Uwe Kleine-König

http://www.google.com/search?q=e+%5E+%28i+pi%29

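The character count that wcslen(3) would report can also be obtained without allocating a wchar_t buffer, by asking mbsrtowcs(3) to count only. A minimal sketch, assuming the caller has already selected a locale (for example with setlocale(LC_CTYPE, "")); count_characters() is a hypothetical name, and the result depends entirely on the active locale.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/*
 * Count the characters of the multibyte string s according to the
 * current locale.  Returns -1 if s is not valid in that locale.
 */
long count_characters(const char *s)
{
	mbstate_t state;
	const char *p = s;
	size_t n;

	assert(s != NULL);
	memset(&state, 0, sizeof(state));	/* initial conversion state */
	n = mbsrtowcs(NULL, &p, 0, &state);	/* NULL destination: count only */
	if (n == (size_t)-1)
		return -1;
	return (long)n;
}
```

Counting columns rather than characters would additionally need wcswidth(3) on the converted string; that step is left out of this sketch.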
* Re: [PATCH 1/2] libgit.a: add some UTF-8 handling functions
  2006-12-22 21:03 ` [PATCH 1/2] libgit.a: add some UTF-8 handling functions Johannes Schindelin
  2006-12-22 21:27 ` Junio C Hamano
@ 2006-12-22 22:19 ` Uwe Kleine-König
  2006-12-22 22:34 ` Johannes Schindelin
  1 sibling, 1 reply; 39+ messages in thread
From: Uwe Kleine-König @ 2006-12-22 22:19 UTC
To: Johannes Schindelin; +Cc: Junio C Hamano, Nicolas Pitre, git

Hello,

Johannes Schindelin wrote:
> Note that we do not go the full nine yards: we could also check that
> the character is encoded with the minimum amount of bytes, as pointed
> out by Uwe Kleine-Koenig.

While we're talking about UTF-8 in commit-logs: I'd prefer to have my
name properly written with o-umlaut.

Best regards
Uwe

-- 
Uwe Kleine-König

http://www.google.com/search?q=0+degree+Celsius+in+kelvin

* Re: [PATCH 1/2] libgit.a: add some UTF-8 handling functions
  2006-12-22 22:19 ` Uwe Kleine-König
@ 2006-12-22 22:34 ` Johannes Schindelin
  2006-12-22 23:50 ` Johannes Schindelin
  0 siblings, 1 reply; 39+ messages in thread
From: Johannes Schindelin @ 2006-12-22 22:34 UTC
To: Uwe Kleine-König; +Cc: git

Dear Mr Zeisberg,

On Fri, 22 Dec 2006, Uwe Kleine-König wrote:

> Johannes Schindelin wrote:
> > Note that we do not go the full nine yards: we could also check that
> > the character is encoded with the minimum amount of bytes, as pointed
> > out by Uwe Kleine-Koenig.
> While we're talking about UTF-8 in commit-logs: I'd prefer to have my
> name properly written with o-umlaut.

I did this because I have no easy way to input UTF-8, and because I am
lazy, and because I did not know how many times this patch has to be
revised.

Apart from that, it seems that the checking of UTF-8 is actually quite
simple, and we could even copy it from
http://www.cl.cam.ac.uk/~mgk25/ucs/utf8_check.c, where the check you
proposed is included.

But I had enough of UTF-8 for a day.

Ciao,
Dscho

* Re: [PATCH 1/2] libgit.a: add some UTF-8 handling functions
  2006-12-22 22:34 ` Johannes Schindelin
@ 2006-12-22 23:50 ` Johannes Schindelin
  2006-12-23  8:52 ` Uwe Kleine-König
  2006-12-23 19:53 ` warn non utf-8 commit log messages Junio C Hamano
  0 siblings, 2 replies; 39+ messages in thread
From: Johannes Schindelin @ 2006-12-22 23:50 UTC
To: Uwe Kleine-König; +Cc: git, junkio

Hi,

On Fri, 22 Dec 2006, Johannes Schindelin wrote:

> Dear Mr Zeisberg,
>
> On Fri, 22 Dec 2006, Uwe Kleine-König wrote:
>
> > Johannes Schindelin wrote:
> > > Note that we do not go the full nine yards: we could also check that
> > > the character is encoded with the minimum amount of bytes, as pointed
> > > out by Uwe Kleine-Koenig.
> > While we're talking about UTF-8 in commit-logs: I'd prefer to have my
> > name properly written with o-umlaut.
>
> I did this because I have no easy way to input UTF-8, and because I am
> lazy, and because I did not know how many times this patch has to be
> revised.
>
> Apart from that, it seems that the checking of UTF-8 is actually quite
> simple, and we could even copy it from
> http://www.cl.cam.ac.uk/~mgk25/ucs/utf8_check.c, where the check you
> proposed is included.
>
> But I had enough of UTF-8 for a day.

Okay, so I lied (these are both patches revised and combined):

--

 Makefile              |    6 +
 builtin-commit-tree.c |   14 ++
 utf8.c                |  277 +++++++++++++++++++++++++++++++++++++++++++++++++
 utf8.h                |    8 +
 4 files changed, 301 insertions(+), 4 deletions(-)

diff --git a/Makefile b/Makefile
--- a/Makefile
+++ b/Makefile
@@ -234,7 +237,8 @@ LIB_H = \
 	archive.h blob.h cache.h commit.h csum-file.h delta.h grep.h \
 	diff.h object.h pack.h pkt-line.h quote.h refs.h list-objects.h sideband.h \
 	run-command.h strbuf.h tag.h tree.h git-compat-util.h revision.h \
-	tree-walk.h log-tree.h dir.h path-list.h unpack-trees.h builtin.h
+	tree-walk.h log-tree.h dir.h path-list.h unpack-trees.h builtin.h \
+	utf8.h
 
 DIFF_OBJS = \
 	diff.o diff-lib.o diffcore-break.o diffcore-order.o \
@@ -253,7 +257,8 @@ LIB_OBJS = \
 	revision.o pager.o tree-walk.o xdiff-interface.o \
 	write_or_die.o trace.o list-objects.o grep.o \
 	alloc.o merge-file.o path-list.o help.o unpack-trees.o $(DIFF_OBJS) \
-	color.o wt-status.o archive-zip.o archive-tar.o shallow.o
+	color.o wt-status.o archive-zip.o archive-tar.o shallow.o \
+	utf8.o
 
 BUILTIN_OBJS = \
 	builtin-add.o \
diff --git a/builtin-commit-tree.c b/builtin-commit-tree.c
index 856f3cd..ef7cc91 100644
--- a/builtin-commit-tree.c
+++ b/builtin-commit-tree.c
@@ -7,6 +7,7 @@
 #include "commit.h"
 #include "tree.h"
 #include "builtin.h"
+#include "utf8.h"
 
 #define BLOCKING (1ul << 14)
 
@@ -32,7 +33,7 @@ static void add_buffer(char **bufp, unsigned int *sizep, const char *fmt, ...)
 	len = vsnprintf(one_line, sizeof(one_line), fmt, args);
 	va_end(args);
 	size = *sizep;
-	newsize = size + len;
+	newsize = size + len + 1;
 	alloc = (size + 32767) & ~32767;
 	buf = *bufp;
 	if (newsize > alloc) {
@@ -40,7 +41,7 @@ static void add_buffer(char **bufp, unsigned int *sizep, const char *fmt, ...)
 		buf = xrealloc(buf, alloc);
 		*bufp = buf;
 	}
-	*sizep = newsize;
+	*sizep = newsize - 1;
 	memcpy(buf + size, one_line, len);
 }
 
@@ -127,6 +128,15 @@ int cmd_commit_tree(int argc, const char **argv, const char *prefix)
 	while (fgets(comment, sizeof(comment), stdin) != NULL)
 		add_buffer(&buffer, &size, "%s", comment);
 
+	/* And check the encoding */
+	buffer[size] = '\0';
+	if (!strcmp(git_commit_encoding, "utf-8") && !is_utf8(buffer)) {
+		fprintf(stderr, "Commit message does not conform to UTF-8.\n"
+			"Please fix the message,"
+			" or set the config variable i18n.commitencoding.\n");
+		return 1;
+	}
+
 	if (!write_sha1_file(buffer, size, commit_type, commit_sha1)) {
 		printf("%s\n", sha1_to_hex(commit_sha1));
 		return 0;
diff --git a/utf8.c b/utf8.c
new file mode 100644
index 0000000..8daec78
--- /dev/null
+++ b/utf8.c
@@ -0,0 +1,277 @@
+#include "git-compat-util.h"
+#include "utf8.h"
+
+/* This code is originally from http://www.cl.cam.ac.uk/~mgk25/ucs/ */
+
+struct interval {
+  int first;
+  int last;
+};
+
+/* auxiliary function for binary search in interval table */
+static int bisearch(wchar_t ucs, const struct interval *table, int max) {
+	int min = 0;
+	int mid;
+
+	if (ucs < table[0].first || ucs > table[max].last)
+		return 0;
+	while (max >= min) {
+		mid = (min + max) / 2;
+		if (ucs > table[mid].last)
+			min = mid + 1;
+		else if (ucs < table[mid].first)
+			max = mid - 1;
+		else
+			return 1;
+	}
+
+	return 0;
+}
+
+/* The following two functions define the column width of an ISO 10646
+ * character as follows:
+ *
+ *    - The null character (U+0000) has a column width of 0.
+ *
+ *    - Other C0/C1 control characters and DEL will lead to a return
+ *      value of -1.
+ *
+ *    - Non-spacing and enclosing combining characters (general
+ *      category code Mn or Me in the Unicode database) have a
+ *      column width of 0.
+ *
+ *    - SOFT HYPHEN (U+00AD) has a column width of 1.
+ *
+ *    - Other format characters (general category code Cf in the Unicode
+ *      database) and ZERO WIDTH SPACE (U+200B) have a column width of 0.
+ *
+ *    - Hangul Jamo medial vowels and final consonants (U+1160-U+11FF)
+ *      have a column width of 0.
+ *
+ *    - Spacing characters in the East Asian Wide (W) or East Asian
+ *      Full-width (F) category as defined in Unicode Technical
+ *      Report #11 have a column width of 2.
+ *
+ *    - All remaining characters (including all printable
+ *      ISO 8859-1 and WGL4 characters, Unicode control characters,
+ *      etc.) have a column width of 1.
+ *
+ * This implementation assumes that wchar_t characters are encoded
+ * in ISO 10646.
+ */
+
+static int wcwidth(wchar_t ch)
+{
+	/*
+	 * Sorted list of non-overlapping intervals of non-spacing characters,
+	 * generated by
+	 *   "uniset +cat=Me +cat=Mn +cat=Cf -00AD +1160-11FF +200B c".
+	 */
+	static const struct interval combining[] = {
+		{ 0x0300, 0x0357 }, { 0x035D, 0x036F }, { 0x0483, 0x0486 },
+		{ 0x0488, 0x0489 }, { 0x0591, 0x05A1 }, { 0x05A3, 0x05B9 },
+		{ 0x05BB, 0x05BD }, { 0x05BF, 0x05BF }, { 0x05C1, 0x05C2 },
+		{ 0x05C4, 0x05C4 }, { 0x0600, 0x0603 }, { 0x0610, 0x0615 },
+		{ 0x064B, 0x0658 }, { 0x0670, 0x0670 }, { 0x06D6, 0x06E4 },
+		{ 0x06E7, 0x06E8 }, { 0x06EA, 0x06ED }, { 0x070F, 0x070F },
+		{ 0x0711, 0x0711 }, { 0x0730, 0x074A }, { 0x07A6, 0x07B0 },
+		{ 0x0901, 0x0902 }, { 0x093C, 0x093C }, { 0x0941, 0x0948 },
+		{ 0x094D, 0x094D }, { 0x0951, 0x0954 }, { 0x0962, 0x0963 },
+		{ 0x0981, 0x0981 }, { 0x09BC, 0x09BC }, { 0x09C1, 0x09C4 },
+		{ 0x09CD, 0x09CD }, { 0x09E2, 0x09E3 }, { 0x0A01, 0x0A02 },
+		{ 0x0A3C, 0x0A3C }, { 0x0A41, 0x0A42 }, { 0x0A47, 0x0A48 },
+		{ 0x0A4B, 0x0A4D }, { 0x0A70, 0x0A71 }, { 0x0A81, 0x0A82 },
+		{ 0x0ABC, 0x0ABC }, { 0x0AC1, 0x0AC5 }, { 0x0AC7, 0x0AC8 },
+		{ 0x0ACD, 0x0ACD }, { 0x0AE2, 0x0AE3 }, { 0x0B01, 0x0B01 },
+		{ 0x0B3C, 0x0B3C }, { 0x0B3F, 0x0B3F }, { 0x0B41, 0x0B43 },
+		{ 0x0B4D, 0x0B4D }, { 0x0B56, 0x0B56 }, { 0x0B82, 0x0B82 },
+		{ 0x0BC0, 0x0BC0 }, { 0x0BCD, 0x0BCD }, { 0x0C3E, 0x0C40 },
+		{ 0x0C46, 0x0C48 }, { 0x0C4A, 0x0C4D }, { 0x0C55, 0x0C56 },
+		{ 0x0CBC, 0x0CBC }, { 0x0CBF, 0x0CBF }, { 0x0CC6, 0x0CC6 },
+		{ 0x0CCC, 0x0CCD }, { 0x0D41, 0x0D43 }, { 0x0D4D, 0x0D4D },
+		{ 0x0DCA, 0x0DCA }, { 0x0DD2, 0x0DD4 }, { 0x0DD6, 0x0DD6 },
+		{ 0x0E31, 0x0E31 }, { 0x0E34, 0x0E3A }, { 0x0E47, 0x0E4E },
+		{ 0x0EB1, 0x0EB1 }, { 0x0EB4, 0x0EB9 }, { 0x0EBB, 0x0EBC },
+		{ 0x0EC8, 0x0ECD }, { 0x0F18, 0x0F19 }, { 0x0F35, 0x0F35 },
+		{ 0x0F37, 0x0F37 }, { 0x0F39, 0x0F39 }, { 0x0F71, 0x0F7E },
+		{ 0x0F80, 0x0F84 }, { 0x0F86, 0x0F87 }, { 0x0F90, 0x0F97 },
+		{ 0x0F99, 0x0FBC }, { 0x0FC6, 0x0FC6 }, { 0x102D, 0x1030 },
+		{ 0x1032, 0x1032 }, { 0x1036, 0x1037 }, { 0x1039, 0x1039 },
+		{ 0x1058, 0x1059 }, { 0x1160, 0x11FF }, { 0x1712, 0x1714 },
+		{ 0x1732, 0x1734 }, { 0x1752, 0x1753 }, { 0x1772, 0x1773 },
+		{ 0x17B4, 0x17B5 }, { 0x17B7, 0x17BD }, { 0x17C6, 0x17C6 },
+		{ 0x17C9, 0x17D3 }, { 0x17DD, 0x17DD }, { 0x180B, 0x180D },
+		{ 0x18A9, 0x18A9 }, { 0x1920, 0x1922 }, { 0x1927, 0x1928 },
+		{ 0x1932, 0x1932 }, { 0x1939, 0x193B }, { 0x200B, 0x200F },
+		{ 0x202A, 0x202E }, { 0x2060, 0x2063 }, { 0x206A, 0x206F },
+		{ 0x20D0, 0x20EA }, { 0x302A, 0x302F }, { 0x3099, 0x309A },
+		{ 0xFB1E, 0xFB1E }, { 0xFE00, 0xFE0F }, { 0xFE20, 0xFE23 },
+		{ 0xFEFF, 0xFEFF }, { 0xFFF9, 0xFFFB }, { 0x1D167, 0x1D169 },
+		{ 0x1D173, 0x1D182 }, { 0x1D185, 0x1D18B },
+		{ 0x1D1AA, 0x1D1AD }, { 0xE0001, 0xE0001 },
+		{ 0xE0020, 0xE007F }, { 0xE0100, 0xE01EF }
+	};
+
+	/* test for 8-bit control characters */
+	if (ch == 0)
+		return 0;
+	if (ch < 32 || (ch >= 0x7f && ch < 0xa0))
+		return -1;
+
+	/* binary search in table of non-spacing characters */
+	if (bisearch(ch, combining, sizeof(combining)
+				/ sizeof(struct interval) - 1))
+		return 0;
+
+	/*
+	 * If we arrive here, ch is neither a combining nor a C0/C1
+	 * control character.
+	 */
+
+	return 1 +
+		(ch >= 0x1100 &&
+                    /* Hangul Jamo init. consonants */
+		 (ch <= 0x115f ||
+		  ch == 0x2329 || ch == 0x232a ||
+                  /* CJK ... Yi */
+		  (ch >= 0x2e80 && ch <= 0xa4cf &&
+		   ch != 0x303f) ||
+		  /* Hangul Syllables */
+		  (ch >= 0xac00 && ch <= 0xd7a3) ||
+		  /* CJK Compatibility Ideographs */
+		  (ch >= 0xf900 && ch <= 0xfaff) ||
+		  /* CJK Compatibility Forms */
+		  (ch >= 0xfe30 && ch <= 0xfe6f) ||
+		  /* Fullwidth Forms */
+		  (ch >= 0xff00 && ch <= 0xff60) ||
+		  (ch >= 0xffe0 && ch <= 0xffe6) ||
+		  (ch >= 0x20000 && ch <= 0x2fffd) ||
+		  (ch >= 0x30000 && ch <= 0x3fffd)));
+}
+
+/*
+ * This function returns the number of columns occupied by the character
+ * pointed to by the variable start. The pointer is updated to point at
+ * the next character. If it was not valid UTF-8, the pointer is set to NULL.
+ */
+int utf8_width(const char **start)
+{
+	unsigned char *s = (unsigned char *)*start;
+	wchar_t ch;
+
+	if (*s < 0x80) {
+		/* 0xxxxxxx */
+		ch = *s;
+		*start += 1;
+	} else if ((s[0] & 0xe0) == 0xc0) {
+		/* 110XXXXx 10xxxxxx */
+		if ((s[1] & 0xc0) != 0x80 ||
+				/* overlong? */
+				(s[0] & 0xfe) == 0xc0)
+			goto invalid;
+		ch = ((s[0] & 0x1f) << 6) | (s[1] & 0x3f);
+		*start += 2;
+	} else if ((s[0] & 0xf0) == 0xe0) {
+		/* 1110XXXX 10Xxxxxx 10xxxxxx */
+		if ((s[1] & 0xc0) != 0x80 ||
+				(s[2] & 0xc0) != 0x80 ||
+				/* overlong? */
+				(s[0] == 0xe0 && (s[1] & 0xe0) == 0x80) ||
+				/* surrogate? */
+				(s[0] == 0xed && (s[1] & 0xe0) == 0xa0) ||
+				/* U+FFFE or U+FFFF? */
+				(s[0] == 0xef && s[1] == 0xbf &&
+				 (s[2] & 0xfe) == 0xbe))
+			goto invalid;
+		ch = ((s[0] & 0x0f) << 12) |
+			((s[1] & 0x3f) << 6) | (s[2] & 0x3f);
+		*start += 3;
+	} else if ((s[0] & 0xf8) == 0xf0) {
+		/* 11110XXX 10XXxxxx 10xxxxxx 10xxxxxx */
+		if ((s[1] & 0xc0) != 0x80 ||
+				(s[2] & 0xc0) != 0x80 ||
+				(s[3] & 0xc0) != 0x80 ||
+				/* overlong? */
+				(s[0] == 0xf0 && (s[1] & 0xf0) == 0x80) ||
+				/* > U+10FFFF? */
+				(s[0] == 0xf4 && s[1] > 0x8f) || s[0] > 0xf4)
+			goto invalid;
+		ch = ((s[0] & 0x07) << 18) | ((s[1] & 0x3f) << 12) |
+			((s[2] & 0x3f) << 6) | (s[3] & 0x3f);
+		*start += 4;
+	} else {
+invalid:
+		*start = NULL;
+		return 0;
+	}
+
+	return wcwidth(ch);
+}
+
+int is_utf8(const char *text)
+{
+	while (*text) {
+		if (*text == '\n' || *text == '\t' || *text == '\r') {
+			text++;
+			continue;
+		}
+		utf8_width(&text);
+		if (!text)
+			return 0;
+	}
+	return 1;
+}
+
+static void print_spaces(int count)
+{
+	static const char s[] = " ";
+	while (count >= sizeof(s)) {
+		fwrite(s, sizeof(s) - 1, 1, stdout);
+		count -= sizeof(s) - 1;
+	}
+	fwrite(s, count, 1, stdout);
+}
+
+/*
+ * Wrap the text, if necessary. The variable indent is the indent for the
+ * first line, indent2 is the indent for all other lines.
+ */
+void print_wrapped_text(const char *text, int indent, int indent2, int width)
+{
+	int w = indent, assume_utf8 = is_utf8(text);
+	const char *bol = text, *space = NULL;
+
+	for (;;) {
+		char c = *text;
+		if (!c || isspace(c)) {
+			if (w < width || space < 0) {
+				const char *start = bol;
+				if (space)
+					start = space;
+				else
+					print_spaces(indent);
+				fwrite(start, text - start, 1, stdout);
+				if (!c) {
+					putchar('\n');
+					return;
+				} else if (c == '\t')
+					w |= 0x07;
+				space = text;
+				w++;
+				text++;
+			} else {
+				putchar('\n');
+				text = bol = space + 1;
+				space = NULL;
+				w = indent = indent2;
+			}
+			continue;
+		}
+		if (assume_utf8)
+			w += utf8_width(&text);
+		else {
+			w++;
+			text++;
+		}
+	}
+}
diff --git a/utf8.h b/utf8.h
new file mode 100644
index 0000000..a0d7f59
--- /dev/null
+++ b/utf8.h
@@ -0,0 +1,8 @@
+#ifndef GIT_UTF8_H
+#define GIT_UTF8_H
+
+int utf8_width(const char **start);
+int is_utf8(const char *text);
+void print_wrapped_text(const char *text, int indent, int indent2, int len);
+
+#endif

* Re: [PATCH 1/2] libgit.a: add some UTF-8 handling functions
  2006-12-22 23:50 ` Johannes Schindelin
@ 2006-12-23  8:52 ` Uwe Kleine-König
  2006-12-23 14:12 ` Johannes Schindelin
  2006-12-23 19:53 ` warn non utf-8 commit log messages Junio C Hamano
  1 sibling, 1 reply; 39+ messages in thread
From: Uwe Kleine-König @ 2006-12-23 8:52 UTC
To: Johannes Schindelin; +Cc: git, junkio

Hallo Johannes,

Johannes Schindelin wrote:
> @@ -127,6 +128,15 @@ int cmd_commit_tree(int argc, const char **argv, const char *prefix)
>  	while (fgets(comment, sizeof(comment), stdin) != NULL)
>  		add_buffer(&buffer, &size, "%s", comment);
>
> +	/* And check the encoding */
> +	buffer[size] = '\0';
> +	if (!strcmp(git_commit_encoding, "utf-8") && !is_utf8(buffer)) {
Maybe you could be more generous here.  E.g.

	if ((!strcasecmp(git_commit_encoding, "utf-8") ||
	     !strcasecmp(git_commit_encoding, "utf8")) && !is_utf8(buffer))

Junio suggested to make this check if i18n.commitEncoding is empty.  I
didn't check the code to see if this case is included.

Gruessle
Uwe

-- 
Uwe Kleine-König

http://www.google.com/search?q=2+to+the+power+of+12

* Re: [PATCH 1/2] libgit.a: add some UTF-8 handling functions
  2006-12-23  8:52 ` Uwe Kleine-König
@ 2006-12-23 14:12 ` Johannes Schindelin
  0 siblings, 0 replies; 39+ messages in thread
From: Johannes Schindelin @ 2006-12-23 14:12 UTC
To: Uwe Kleine-König; +Cc: git, junkio

Hi,

On Sat, 23 Dec 2006, Uwe Kleine-König wrote:

> Johannes Schindelin wrote:
> > @@ -127,6 +128,15 @@ int cmd_commit_tree(int argc, const char **argv, const char *prefix)
> >  	while (fgets(comment, sizeof(comment), stdin) != NULL)
> >  		add_buffer(&buffer, &size, "%s", comment);
> >
> > +	/* And check the encoding */
> > +	buffer[size] = '\0';
> > +	if (!strcmp(git_commit_encoding, "utf-8") && !is_utf8(buffer)) {
> Maybe you could be more generous here.  E.g.
>
> 	if ((!strcasecmp(git_commit_encoding, "utf-8") ||
> 	     !strcasecmp(git_commit_encoding, "utf8")) && !is_utf8(buffer))
>
> Junio suggested to make this check if i18n.commitEncoding is empty.  I
> didn't check the code to see if this case is included.

The problem is, as I pointed out in another mail, that environment.c sets
the default git_commit_encoding to "utf-8". This is hardwired, and I have
no way to check if that was set by the config or not, other than reparsing
the config myself.

> Gruessle

Hah! You don't use umlauts and ssharp yourself!

Ciao,
Dscho

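The two points raised in this exchange, accepting several spellings of the encoding name and treating an unset value like UTF-8, could be folded into one helper. A minimal sketch: encoding_is_utf8() is a hypothetical name that the patches in this thread do not define, and mapping NULL or an empty string to "UTF-8" is only an assumption about how an unset i18n.commitencoding might be represented.

```c
#include <stddef.h>
#include <strings.h>

/*
 * Return 1 if name designates UTF-8, 0 otherwise.
 * Assumption: NULL or "" stands for "not configured", which is
 * treated as the UTF-8 default discussed above.
 */
int encoding_is_utf8(const char *name)
{
	if (name == NULL || *name == '\0')
		return 1;
	/* strcasecmp() is POSIX; it ignores case, so "UTF-8" matches too */
	return !strcasecmp(name, "utf-8") || !strcasecmp(name, "utf8");
}
```

With such a helper the call site would not need to know whether the value came from the configuration or from the hardwired default, which sidesteps (but does not solve) the problem of telling the two apart.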
* warn non utf-8 commit log messages.
  2006-12-22 23:50 ` Johannes Schindelin
  2006-12-23  8:52 ` Uwe Kleine-König
@ 2006-12-23 19:53 ` Junio C Hamano
  2006-12-23 23:46 ` Johannes Schindelin
  1 sibling, 1 reply; 39+ messages in thread
From: Junio C Hamano @ 2006-12-23 19:53 UTC
To: Johannes Schindelin; +Cc: git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

>> But I had enough of UTF-8 for a day.
>
> Okay, so I lied (these are both patches revised and combined):

I am thinking of putting this in 'next', with the following
changes on top of your combined patch.

git-commit-tree warns if the commit message does not minimally
conform to the UTF-8 encoding when i18n.commitencoding is either
unset, or set to "utf-8".  It does not die as in your version.

 builtin-commit-tree.c |   13 +++++++------
 utf8.c                |    2 +-
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/builtin-commit-tree.c b/builtin-commit-tree.c
index f274721..f641787 100644
--- a/builtin-commit-tree.c
+++ b/builtin-commit-tree.c
@@ -78,6 +78,11 @@ static int new_parent(int idx)
 	return 1;
 }
 
+static const char commit_utf8_warn[] =
+"Warning: commit message does not conform to UTF-8.\n"
+"You may want to amend it after fixing the message, or set the config\n"
+"variable i18n.commitencoding to the encoding your project uses.\n";
+
 int cmd_commit_tree(int argc, const char **argv, const char *prefix)
 {
 	int i;
@@ -133,12 +138,8 @@ int cmd_commit_tree(int argc, const char **argv, const char *prefix)
 
 	/* And check the encoding */
 	buffer[size] = '\0';
-	if (!strcmp(git_commit_encoding, "utf-8") && !is_utf8(buffer)) {
-		fprintf(stderr, "Commit message does not conform to UTF-8.\n"
-			"Please fix the message,"
-			" or set the config variable i18n.commitencoding.\n");
-		return 1;
-	}
+	if (!strcmp(git_commit_encoding, "utf-8") && !is_utf8(buffer))
+		fprintf(stderr, commit_utf8_warn);
 
 	if (!write_sha1_file(buffer, size, commit_type, commit_sha1)) {
 		printf("%s\n", sha1_to_hex(commit_sha1));
diff --git a/utf8.c b/utf8.c
index aed60ad..8fa6257 100644
--- a/utf8.c
+++ b/utf8.c
@@ -244,7 +244,7 @@ void print_wrapped_text(const char *text, int indent, int indent2, int width)
 	for (;;) {
 		char c = *text;
 		if (!c || isspace(c)) {
-			if (w < width || space < 0) {
+			if (w < width || !space) {
 				const char *start = bol;
 				if (space)
 					start = space;

* Re: warn non utf-8 commit log messages.
  2006-12-23 19:53 ` warn non utf-8 commit log messages Junio C Hamano
@ 2006-12-23 23:46 ` Johannes Schindelin
  0 siblings, 0 replies; 39+ messages in thread
From: Johannes Schindelin @ 2006-12-23 23:46 UTC
To: Junio C Hamano; +Cc: git

Hi,

On Sat, 23 Dec 2006, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> >> But I had enough of UTF-8 for a day.
> >
> > Okay, so I lied (these are both patches revised and combined):
>
> I am thinking of putting this in 'next', with the following
> changes on top of your combined patch.
>
> git-commit-tree warns if the commit message does not minimally
> conform to the UTF-8 encoding when i18n.commitencoding is either
> unset, or set to "utf-8".  It does not die as in your version.

Yeah, this is nicer.

> -			if (w < width || space < 0) {
> +			if (w < width || !space) {

This is a real bug fix. Thank you. I changed quite a bit between offset
and char*, and eventually forgot this part.

Ciao,
Dscho

* [PATCH 2/2] git-commit-tree: if i18n.commitencoding is utf-8 (default), check it
  2006-12-22 19:01 ` Junio C Hamano
  2006-12-22 21:03 ` [PATCH 1/2] libgit.a: add some UTF-8 handling functions Johannes Schindelin
@ 2006-12-22 21:06 ` Johannes Schindelin
  2006-12-22 21:50 ` Junio C Hamano
  2006-12-22 21:15 ` [RFC/PATCH 3/2] Wrap lines in shortlog Johannes Schindelin
  2 siblings, 1 reply; 39+ messages in thread
From: Johannes Schindelin @ 2006-12-22 21:06 UTC
To: Junio C Hamano; +Cc: Nicolas Pitre, Uwe Kleine-König, git

Now, git-commit-tree refuses to commit when i18n.commitencoding is
either unset, or set to "utf-8", and the commit message does not
minimally conform to the UTF-8 encoding.

Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
---

	Unfortunately, I could not think of a shorter oneline description.
	But my next patch fixes at least the output in shortlog.

 builtin-commit-tree.c |   14 ++++++++++++--
 1 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/builtin-commit-tree.c b/builtin-commit-tree.c
index 856f3cd..810b440 100644
--- a/builtin-commit-tree.c
+++ b/builtin-commit-tree.c
@@ -7,6 +7,7 @@
 #include "commit.h"
 #include "tree.h"
 #include "builtin.h"
+#include "utf8.h"
 
 #define BLOCKING (1ul << 14)
 
@@ -32,7 +33,7 @@ static void add_buffer(char **bufp, unsigned int *sizep, const char *fmt, ...)
 	len = vsnprintf(one_line, sizeof(one_line), fmt, args);
 	va_end(args);
 	size = *sizep;
-	newsize = size + len;
+	newsize = size + len + 1;
 	alloc = (size + 32767) & ~32767;
 	buf = *bufp;
 	if (newsize > alloc) {
@@ -40,7 +41,7 @@ static void add_buffer(char **bufp, unsigned int *sizep, const char *fmt, ...)
 		buf = xrealloc(buf, alloc);
 		*bufp = buf;
 	}
-	*sizep = newsize;
+	*sizep = newsize - 1;
 	memcpy(buf + size, one_line, len);
 }
 
@@ -127,6 +128,15 @@ int cmd_commit_tree(int argc, const char **argv, const char *prefix)
 	while (fgets(comment, sizeof(comment), stdin) != NULL)
 		add_buffer(&buffer, &size, "%s", comment);
 
+	/* And check the encoding */
+	buffer[size] = '\0';
+	if (!strcmp(git_commit_encoding, "utf-8") && utf8_strlen(buffer) < 0) {
+		fprintf(stderr, "Commit message does not conform to UTF-8.\n"
+			"Please fix the message,"
+			" or set the config variable i18n.commitencoding.\n");
+		return 1;
+	}
+
 	if (!write_sha1_file(buffer, size, commit_type, commit_sha1)) {
 		printf("%s\n", sha1_to_hex(commit_sha1));
 		return 0;
-- 
1.4.4.3.ge5f98-dirty

* Re: [PATCH 2/2] git-commit-tree: if i18n.commitencoding is utf-8 (default), check it
  2006-12-22 21:06 ` [PATCH 2/2] git-commit-tree: if i18n.commitencoding is utf-8 (default), check it Johannes Schindelin
@ 2006-12-22 21:50 ` Junio C Hamano
  2006-12-22 22:21 ` Johannes Schindelin
  0 siblings, 1 reply; 39+ messages in thread
From: Junio C Hamano @ 2006-12-22 21:50 UTC
To: Johannes Schindelin; +Cc: git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> Now, git-commit-tree refuses to commit when i18n.commitencoding is
> either unset, or set to "utf-8", and the commit message does not
> minimally conform to the UTF-8 encoding.
>
> Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
> ---
>
> 	Unfortunately, I could not think of a shorter oneline description.
> 	But my next patch fixes at least the output in shortlog.

I think the rule that you described on the one-line description
makes more sense than "either unset or set to utf-8".  In other
words, I'd prefer doing this in a repository that explicitly
asks for it.

I do not want to get burned by too many incompatible changes
X-<.

* Re: [PATCH 2/2] git-commit-tree: if i18n.commitencoding is utf-8 (default), check it
From: Johannes Schindelin @ 2006-12-22 22:21 UTC
To: Junio C Hamano; +Cc: git

Hi,

On Fri, 22 Dec 2006, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> > Now, git-commit-tree refuses to commit when i18n.commitencoding is
> > either unset, or set to "utf-8", and the commit message does not
> > minimally conform to the UTF-8 encoding.
> >
> > Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
> > ---
> >
> > 	Unfortunately, I could not think of a shorter oneline description.
> > 	But my next patch fixes at least the output in shortlog.
>
> I think the rule that you described on the one-line description
> makes more sense than "either unset or set to utf-8". In other
> words, I'd prefer doing this in a repository that explicitly
> asks for it.

Well, the problem is this line:

environment.c:21:char git_commit_encoding[MAX_ENCODING_LENGTH] = "utf-8";

> I do not want to get burned by too many incompatible changes X-<.

Understandable.

Ciao,
Dscho
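The difficulty raised here is that a buffer pre-filled with "utf-8" cannot tell "unset" apart from "explicitly set to utf-8". One possible way around that, shown purely as a sketch, is to track explicitness separately. Everything below is hypothetical: the flag, both function names, and the value chosen for MAX_ENCODING_LENGTH are assumptions for illustration, not code from git or from this series.

```c
#include <assert.h>
#include <string.h>

/* Assumed value for illustration only; the real constant is defined by git. */
#define MAX_ENCODING_LENGTH 64

/* Mirrors the default discussed in the thread. */
static char commit_encoding[MAX_ENCODING_LENGTH] = "utf-8";

/* Hypothetical flag: becomes 1 once the configuration sets the encoding. */
static int commit_encoding_explicit;

/* Hypothetical config hook: record the value and that it was given explicitly. */
static int set_commit_encoding(const char *value)
{
	size_t len;

	assert(value != NULL);
	len = strlen(value);
	if (len >= MAX_ENCODING_LENGTH)
		return -1;                       /* value does not fit */
	memcpy(commit_encoding, value, len + 1); /* copy including the NUL */
	commit_encoding_explicit = 1;
	return 0;
}

/* Validate only when the repository explicitly asked for utf-8. */
static int should_check_utf8(void)
{
	return commit_encoding_explicit && !strcmp(commit_encoding, "utf-8");
}
```

With this arrangement a repository that never sets i18n.commitencoding keeps the old behaviour, and only one that sets it to "utf-8" opts in to the check.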
* [RFC/PATCH 3/2] Wrap lines in shortlog
From: Johannes Schindelin @ 2006-12-22 21:15 UTC
To: Junio C Hamano; +Cc: Nicolas Pitre, Uwe Kleine-König, git

[-- Attachment #1: Type: TEXT/PLAIN, Size: 253 bytes --]

Hi,

It is nicer to wrap the lines of too long oneline descriptions. This patch
even works in UTF-8.

The patch is attached, since I cannot find the setting in pine to make it
a UTF-8 one.

Besides, I deliberately fscked up one test case.

Ciao,
Dscho

[-- Attachment #2: shortlog.patch --]
[-- Type: TEXT/PLAIN, Size: 2978 bytes --]

[PATCH] Use print_wrapped_text() in shortlog

Some oneline descriptions are just too long. In shortlog, it looks much
nicer when they are wrapped. Since print_wrapped_text() is UTF-8 aware,
it also works with those descriptions.

Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
---

	Probably this should check i18n.commitencoding, too...

 builtin-shortlog.c  |    4 +++-
 t/t4201-shortlog.sh |   50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 53 insertions(+), 1 deletions(-)

diff --git a/builtin-shortlog.c b/builtin-shortlog.c
index edb4042..30e7cb5 100644
--- a/builtin-shortlog.c
+++ b/builtin-shortlog.c
@@ -4,6 +4,7 @@
 #include "diff.h"
 #include "path-list.h"
 #include "revision.h"
+#include "utf8.h"
 
 static const char shortlog_usage[] =
 "git-shortlog [-n] [-s] [<commit-id>... ]";
@@ -321,7 +322,8 @@ int cmd_shortlog(int argc, const char **argv, const char *prefix)
 		} else {
 			printf("%s (%d):\n", list.items[i].path, onelines->nr);
 			for (j = onelines->nr - 1; j >= 0; j--)
-				printf(" %s\n", onelines->items[j].path);
+				print_wrapped_text(onelines->items[j].path,
+						6, 9, 76);
 			printf("\n");
 		}
 
diff --git a/t/t4201-shortlog.sh b/t/t4201-shortlog.sh
new file mode 100644
index 0000000..e4085f9
--- /dev/null
+++ b/t/t4201-shortlog.sh
@@ -0,0 +1,50 @@
+#!/bin/sh
+#
+# Copyright (c) 2006 Johannes E. Schindelin
+#
+
+test_description='git-shortlog
+'
+
+. ./test-lib.sh
+
+echo 1 > a1
+git add a1
+tree=$(git write-tree)
+commit=$((echo "Test"; echo) | git commit-tree $tree)
+git update-ref HEAD $commit
+
+echo 2 > a1
+git commit -m "This is a very, very long first line for the commit message to see if it is wrapped correctly" a1
+
+# test if the wrapping is still valid when replacing all i's by treble clefs.
+echo 3 > a1
+git commit -m "$(echo "This is a very, very long first line for the commit message to see if it is wrapped correctly" | sed "s/i/1234/g" | tr 1234 '\360\235\204\236')" a1
+
+# now fsck up the utf8
+git repo-config i18n.commitencoding non-utf-8
+echo 4 > a1
+git commit -m "$(echo "This is a very, very long first line for the commit message to see if it is wrapped correctly" | sed "s/i/1234/g" | tr 1234 '\370\235\204\236')" a1
+
+echo 5 > a1
+git commit -m "a 12 34 56 78" a1
+
+git shortlog HEAD > out
+
+cat > expect << EOF
+A U Thor (5):
+Test
+This is a very, very long first line for the commit message to see if
+it is wrapped correctly
+Thðs ðs a very, very long fðrst lðne for the commðt message to see ðf
+ðt ðs wrapped correctly
+Thøs øs a very, very long først løne for the commøt
+message to see øf øt øs wrapped correctly
+a 12 34
+56 78
+
+EOF
+
+test_expect_success 'shortlog wrapping' 'diff -u expect out'
+
+test_done
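To make the idea of UTF-8 aware wrapping concrete, here is a small sketch. It is not git's print_wrapped_text(); `utf8_columns()`, `fit_line()` and `wrap_text()` are hypothetical names, and the sketch assumes one display column per code point (no wide or combining characters) and breaks only at spaces. The `indent`, `indent2` and `width` parameters are modelled on the `6, 9, 76` arguments in the hunk above: first-line indent, indent of continuation lines, and total line width.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Columns used by s[0..len), assuming one column per code point:
 * every byte that is not a continuation byte (10xxxxxx) starts one.
 */
static int utf8_columns(const char *s, size_t len)
{
	int cols = 0;
	size_t i;

	for (i = 0; i < len; i++)
		if (((unsigned char)s[i] & 0xc0) != 0x80)
			cols++;
	return cols;
}

/*
 * Number of leading bytes of text that fit into `avail` columns when
 * breaking only at spaces. A first word that is too wide is kept whole.
 * The caller is expected to have skipped leading spaces.
 */
static size_t fit_line(const char *text, int avail)
{
	size_t len = strlen(text), best = 0, pos = 0;

	while (pos < len) {
		size_t end = pos;

		while (end < len && text[end] != ' ')
			end++;                      /* end of the next word */
		if (best > 0 && utf8_columns(text, end) > avail)
			break;                      /* this word no longer fits */
		best = end;
		pos = end;
		while (pos < len && text[pos] == ' ')
			pos++;                      /* skip the separating spaces */
	}
	return best;
}

/* Print text wrapped to `width` columns; later lines use `indent2`. */
static void wrap_text(FILE *out, const char *text,
		      int indent, int indent2, int width)
{
	int cur = indent;

	assert(width > indent && width > indent2);
	while (*text == ' ')
		text++;
	while (*text) {
		size_t n = fit_line(text, width - cur);

		fprintf(out, "%*s%.*s\n", cur, "", (int)n, text);
		text += n;
		while (*text == ' ')
			text++;
		cur = indent2;
	}
}
```

Calling `wrap_text(stdout, msg, 6, 9, 76)` mirrors the call in the patch. Counting code points instead of bytes is what keeps a line full of multi-byte characters from being broken too early; input that is not valid UTF-8 is simply counted byte by byte wherever the bytes do not look like continuation bytes.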