gitweb on kernel.org and UTF-8

* gitweb on kernel.org and UTF-8
@ 2005-11-23  0:03 Junio C Hamano
  2005-11-23  0:59 ` H. Peter Anvin
  0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2005-11-23  0:03 UTC
  To: git

Is it possible that the UTF-8 check in gitweb running on
kernel.org machine is somehow too strict?

The following two commits in git.git repository are not showing
properly.

I have a track record of getting people's names wrong, so I
double checked my commit objects, and as far as I can tell, all
of them are encoded in UTF-8 properly (or at least I can view
what I expect if I throw raw bytes from the commit objects at my
Firefox):

        c3df8568424684bbcc7df7722eb3ec34bdae8b2d

        This is from Yoshifuji-san; the third character in
        author name field is mangled.

	bb931cf9d73d94d9940b6d0ee56b6c13ad42f1a0

	This is from Lukas Sandstr*m; o with Umlaut on top is
	showing a ?.  Incidentally, the blob that records recent
	version of Documentation/git-pack-redundant.txt has his
	name in it, which has the same ? problem, but "plain"
	option shows his name correctly in UTF-8.

Interestingly enough, my name spelled in Japanese
(Documentation/git-lost-found.txt) is intact.  Am I getting a
VIP treatment somehow?

* Re: gitweb on kernel.org and UTF-8
  2005-11-23  0:03 gitweb on kernel.org and UTF-8 Junio C Hamano
@ 2005-11-23  0:59 ` H. Peter Anvin
  2005-11-23  3:35   ` Kay Sievers
  0 siblings, 1 reply; 7+ messages in thread
From: H. Peter Anvin @ 2005-11-23  0:59 UTC
  To: Junio C Hamano; +Cc: git

Junio C Hamano wrote:
> Is it possible that the UTF-8 check in gitweb running on
> kernel.org machine is somehow too strict?
> 
> The following two commits in git.git repository are not showing
> properly.
> 
> I have a track record of getting people's names wrong, so I
> double checked my commit objects, and as far as I can tell, all
> of them are encoded in UTF-8 properly (or at least I can view
> what I expect if I throw raw bytes from the commit objects at my
> Firefox):
> 
>         c3df8568424684bbcc7df7722eb3ec34bdae8b2d
> 
>         This is from Yoshifuji-san; the third character in
>         author name field is mangled.
> 
> 	bb931cf9d73d94d9940b6d0ee56b6c13ad42f1a0
> 
> 	This is from Lukas Sandstr*m; o with Umlaut on top is
> 	showing a ?.  Incidentally, the blob that records recent
> 	version of Documentation/git-pack-redundant.txt has his
> 	name in it, which has the same ? problem, but "plain"
> 	option shows his name correctly in UTF-8.
> 
> Interestingly enough, my name spelled in Japanese
> (Documentation/git-lost-found.txt) is intact.  Am I getting a
> VIP treatment somehow?
> 

I think it's missing a "binmode STDOUT, ':utf8';" somewhere...

For what it's worth, I looked at both the above examples and the binary 
encoding in the git repository is undoubtedly correct; the two 
characters are U+82F1/E8 8B B1 (英) and U+00F6/C3 B6 (ö) respectively, 
both of which are 100% valid UTF-8.
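
The byte values quoted above can be checked independently. This is a
minimal sketch in Python (chosen for illustration; gitweb itself is
Perl), and the helper name is_valid_utf8 is hypothetical, not
something from gitweb:

```python
# Byte sequences quoted in the message above, keyed by the character
# they are expected to decode to.
samples = {
    "\u82f1": bytes([0xE8, 0x8B, 0xB1]),  # U+82F1
    "\u00f6": bytes([0xC3, 0xB6]),        # U+00F6
}


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as strict UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


for char, raw in samples.items():
    # Report the code point, the raw bytes, whether they are valid UTF-8,
    # and whether they decode to the expected character.
    print(f"U+{ord(char):04X}", raw.hex(" "),
          is_valid_utf8(raw), raw.decode("utf-8") == char)
```

A lone 0xF6 byte, which is how Latin-1 would encode the same letter,
is rejected by the same check.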

	-hpa

* Re: gitweb on kernel.org and UTF-8
  2005-11-23  0:59 ` H. Peter Anvin
@ 2005-11-23  3:35   ` Kay Sievers
  2005-11-23  3:42     ` H. Peter Anvin
  2005-11-24  3:24     ` Junio C Hamano
  0 siblings, 2 replies; 7+ messages in thread
From: Kay Sievers @ 2005-11-23  3:35 UTC
  To: H. Peter Anvin; +Cc: Junio C Hamano, git

On Tue, Nov 22, 2005 at 04:59:16PM -0800, H. Peter Anvin wrote:
> Junio C Hamano wrote:
> >Is it possible that the UTF-8 check in gitweb running on
> >kernel.org machine is somehow too strict?
> >
> >The following two commits in git.git repository are not showing
> >properly.
> >
> >I have a track record of getting people's names wrong, so I
> >double checked my commit objects, and as far as I can tell, all
> >of them are encoded in UTF-8 properly (or at least I can view
> >what I expect if I throw raw bytes from the commit objects at my
> >Firefox):
> >
> >        c3df8568424684bbcc7df7722eb3ec34bdae8b2d
> >
> >        This is from Yoshifuji-san; the third character in
> >        author name field is mangled.
> >
> >	bb931cf9d73d94d9940b6d0ee56b6c13ad42f1a0
> >
> >	This is from Lukas Sandstr*m; o with Umlaut on top is
> >	showing a ?.  Incidentally, the blob that records recent
> >	version of Documentation/git-pack-redundant.txt has his
> >	name in it, which has the same ? problem, but "plain"
> >	option shows his name correctly in UTF-8.
> >
> >Interestingly enough, my name spelled in Japanese
> >(Documentation/git-lost-found.txt) is intact.  Am I getting a
> >VIP treatment somehow?
> >
> 
> I think it's missing a "binmode STDOUT, ':utf8';" somewhere...
> 
> For what it's worth, I looked at both the above examples and the binary 
> encoding in the git repository is undoubtedly correct; the two 
> characters are U+82F1/E8 8B B1 (英) and U+00F6/C3 B6 (ö) respectively, 
> both of which are 100% valid UTF-8.

Should be fine now. The escapeHTML() garbled the utf8 "ö", and the
decode() failed that.
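
The general failure mode, escaping raw bytes before they have been
decoded, can be illustrated as follows. This is only a sketch in
Python of why the order of the two steps matters; it is not the actual
gitweb change, and both function names are hypothetical:

```python
import html

# Raw UTF-8 bytes as they might come out of a commit object.
raw = b"<Sandstr\xc3\xb6m>"


def escape_bytes_as_latin1(data: bytes) -> str:
    """Wrong order: treat every byte as one character, then escape.

    The two bytes C3 B6 become two separate characters, so the
    original letter can no longer be shown as a single character.
    """
    return html.escape(data.decode("latin-1"))


def decode_then_escape(data: bytes) -> str:
    """Right order: decode the UTF-8 bytes first, then escape the text."""
    return html.escape(data.decode("utf-8"))


print(escape_bytes_as_latin1(raw))
print(decode_then_escape(raw))
```

Decoding first keeps the letter as one character, and only the markup
characters are turned into entities.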

Thanks,
Kay

* Re: gitweb on kernel.org and UTF-8
  2005-11-23  3:35   ` Kay Sievers
@ 2005-11-23  3:42     ` H. Peter Anvin
  2005-11-24  3:24     ` Junio C Hamano
  1 sibling, 0 replies; 7+ messages in thread
From: H. Peter Anvin @ 2005-11-23  3:42 UTC
  To: Kay Sievers; +Cc: Junio C Hamano, git

Kay Sievers wrote:
> 
> Should be fine now. The escapeHTML() garbled the utf8 "ö", and the
> decode() failed that.
> 

Indeed, looks much better.

Now if I could only figure out why both Konsole and Firefox seem to use
a standalone cedilla to represent U+FFFD, instead of something more 
logical like an inverted question mark or empty box.

	-hpa

* Re: gitweb on kernel.org and UTF-8
  2005-11-23  3:35   ` Kay Sievers
  2005-11-23  3:42     ` H. Peter Anvin
@ 2005-11-24  3:24     ` Junio C Hamano
  2005-11-24  5:01       ` Ryan Anderson
  1 sibling, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2005-11-24  3:24 UTC
  To: Kay Sievers, Paul Mackerras; +Cc: git

Kay Sievers <kay.sievers@vrfy.org> writes:

> Should be fine now. The escapeHTML() garbled the utf8 "ö", and the
> decode() failed that.

Looking better.  Thanks.

This begs for addressing another issue, although I am hesitant
to open this can of worms at this moment.

It might be a good idea to have configuration items gitk and
gitweb can use to get a hint to decide what the commit log
message and the blob data encodings might be.  gitweb already
adds its own information to the git repository format
(.git/description), so this _could_ be stored outside just like
that (e.g. .git/commit_log_encoding), but using .git/config is
probably better.

How about doing something like this?

	[i18n]
		commitEncoding = utf8
		blobEncoding = utf8

to mean:

	If you _have_ to make an assumption on an encoding
	commit and blob objects are in, utf8 is your best bet
	(but mistakes can happen, and some blobs can be binary).

Then gitweb and gitk can look at commitEncoding and blobEncoding
as a hint to base their display defaults on.  Sending everything
out in utf8 would be a sane and safe choice these days for gitweb,
so if commitEncoding is latin-1 it may need to iconv latin-1 to
utf8 while reading commits.  For blobs, it might be better off
asking file(1) or File::MMagic (since you are using Perl in
gitweb --- sorry I do not know tcl equivalent of that) what they
are; eventually you would want to be able to show repositories
full of jpeg pictures anyway ;-).
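
To make the reader side concrete, here is a minimal sketch in Python
(for illustration only; gitweb is Perl and gitk is Tcl). The function
name commit_text_as_utf8 is hypothetical, and the commit_encoding
argument stands in for whatever value the proposed commitEncoding
setting would carry. Because the setting is only a hint, undecodable
bytes are replaced rather than treated as fatal:

```python
def commit_text_as_utf8(raw: bytes, commit_encoding: str = "utf8") -> bytes:
    """Transcode raw commit bytes from the hinted encoding to UTF-8.

    Bytes that are invalid in the hinted encoding become U+FFFD
    instead of raising, since the hint may simply be wrong.
    """
    text = raw.decode(commit_encoding, errors="replace")
    return text.encode("utf-8")


# A latin-1 commit is converted on the way out ...
print(commit_text_as_utf8(b"Sandstr\xf6m", "latin-1"))
# ... while a commit that is already UTF-8 passes through unchanged.
print(commit_text_as_utf8(b"Sandstr\xc3\xb6m", "utf8"))
```

Note that a wrong latin-1 hint can never be detected this way, because
every byte sequence is valid latin-1; only a wrong utf8 hint shows up
as replacement characters.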

On the commit-producing side, I could have:

	[i18n]
		editorEncoding = latin-1

and if editorEncoding is different from commitEncoding,
"git-commit -c $commit" would first iconv from utf8 to latin-1
before populating the user's editor, and iconv back from latin-1
to utf8 before feeding what the user edited to commit-tree.
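
The round trip described above could look roughly like this. It is a
sketch in Python under the assumption that both settings exist as
proposed; to_editor and from_editor are hypothetical names, not
git-commit internals:

```python
def to_editor(commit_bytes: bytes,
              commit_encoding: str = "utf8",
              editor_encoding: str = "latin-1") -> bytes:
    """Convert a stored commit message into the editor's encoding."""
    return commit_bytes.decode(commit_encoding).encode(editor_encoding)


def from_editor(edited_bytes: bytes,
                commit_encoding: str = "utf8",
                editor_encoding: str = "latin-1") -> bytes:
    """Convert the edited message back before it goes to commit-tree."""
    return edited_bytes.decode(editor_encoding).encode(commit_encoding)


stored = b"Thanks to Lukas Sandstr\xc3\xb6m"
in_editor = to_editor(stored)          # what the latin-1 editor sees
assert from_editor(in_editor) == stored
```

One gap this sketch leaves open: a message containing a character the
editor encoding cannot represent makes to_editor raise
UnicodeEncodeError, so a real implementation would need a policy for
that case.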

Pathname encoding is the reason why I was hesitant about bringing
this up.  Although it is too late for 1.0 now, we _could_ have
declared that the paths recorded in git tree objects and index
files are internally utf8, and working tree paths can be in
different encoding.  As a local repository configuration not
project wide configuration, we could have something like:

	[i18n]
		pathnameEncoding = latin-1

to mean that the filesystem paths returned by readdir(3) and
accepted by open(2) and friends are in latin-1.  Comparison and
movement between working tree files, the index file, and tree
objects have to involve iconv and do the right thing.  So if you
fetch from such a repository into a filesystem that stores
pathnames in utf8, the right thing should happen.
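
As an illustration of the comparison step, the following Python sketch
normalises working tree names to utf8 before comparing them with the
paths recorded in a tree. The function unmatched_paths and its
arguments are hypothetical, and it assumes the tree paths really are
utf8 as suggested above:

```python
def unmatched_paths(tree_paths, fs_names, pathname_encoding="latin-1"):
    """Compare utf8 tree paths against filesystem names.

    fs_names are raw bytes in pathname_encoding, as readdir(3) would
    return them.  Returns (only_in_tree, only_on_disk), both as sets
    of utf8-encoded bytes.
    """
    fs_as_utf8 = {
        name.decode(pathname_encoding).encode("utf-8") for name in fs_names
    }
    tree = set(tree_paths)
    return tree - fs_as_utf8, fs_as_utf8 - tree


# The same file name, stored as utf8 in the tree and latin-1 on disk,
# is recognised as matching once the disk name is converted.
print(unmatched_paths([b"Sandstr\xc3\xb6m.txt"], [b"Sandstr\xf6m.txt"]))
```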

I personally feel any sane project should restrict its pathname
to ASCII only, so this issue might be moot (or is the right word
"mute"?), but something like this _might_ be useful in later
versions of git.

But not in 1.0.

* Re: gitweb on kernel.org and UTF-8
  2005-11-24  3:24     ` Junio C Hamano
@ 2005-11-24  5:01       ` Ryan Anderson
  2005-11-24  6:19         ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Ryan Anderson @ 2005-11-24  5:01 UTC
  To: Junio C Hamano; +Cc: Kay Sievers, Paul Mackerras, git

On Wed, Nov 23, 2005 at 07:24:38PM -0800, Junio C Hamano wrote:
> 
> How about doing something like this?
> 
> 	[i18n]
>         	commitEncoding = utf8
> 		blobEncoding = utf8
> 
> to mean:
> 
> 	If you _have_ to make an assumption on an encoding
> 	commit and blob objects are in, utf8 is your best bet
> 	(but mistakes can happen, and some blobs can be binary).

The rest of the options help clarify this, but can you make these
options 'assumeCommitEncoding' and 'assumeBlobEncoding' to make it clear
that these are *assumptions* and not actually controlling what gets
written?


-- 

Ryan Anderson
  sometimes Pug Majere

* Re: gitweb on kernel.org and UTF-8
  2005-11-24  5:01       ` Ryan Anderson
@ 2005-11-24  6:19         ` Junio C Hamano
  0 siblings, 0 replies; 7+ messages in thread
From: Junio C Hamano @ 2005-11-24  6:19 UTC
  To: Ryan Anderson; +Cc: git

Ryan Anderson <ryan@michonline.com> writes:

> On Wed, Nov 23, 2005 at 07:24:38PM -0800, Junio C Hamano wrote:
>> 
>> How about doing something like this?
>> 
>> 	[i18n]
>>         	commitEncoding = utf8
>> 		blobEncoding = utf8
>> 
>> to mean:
>> 
>> 	If you _have_ to make an assumption on an encoding
>> 	commit and blob objects are in, utf8 is your best bet
>> 	(but mistakes can happen, and some blobs can be binary).
>
> The rest of the options help clarify this, but can you make these
> options 'assumeCommitEncoding' and 'assumeBlobEncoding' to make it clear
> that these are *assumptions* and not actually controlling what gets
> written?

As I outlined in the "editorEncoding" part, if everything works
as planned, your latin-1 editor would leave a latin-1 message
for git-commit to pick up (or the command line "-m $msg"
option would be encoded in latin-1), and iconv would munge that
to utf8 to feed commit-tree (because of "commitEncoding" being
utf8). In that sense, commitEncoding is not an assumption for the
writers.  If everybody, including outside sources we merge from,
makes a best effort not to screw up, these settings would
faithfully describe what encoding the logs are in.

But writers can screw up, and funnily encoded commit messages
that a merge from an outside source brings in cannot be fixed after
the fact, so the "assume" part must be implied anyway for readers.
