git.vger.kernel.org archive mirror
From: Junio C Hamano <junkio@cox.net>
To: Kay Sievers <kay.sievers@vrfy.org>, Paul Mackerras <paulus@samba.org>
Cc: git@vger.kernel.org
Subject: Re: gitweb on kernel.org and UTF-8
Date: Wed, 23 Nov 2005 19:24:38 -0800
Message-ID: <7vfypm20eh.fsf@assigned-by-dhcp.cox.net>
In-Reply-To: <20051123033526.GA24098@vrfy.org> (Kay Sievers's message of "Wed, 23 Nov 2005 04:35:26 +0100")

Kay Sievers <kay.sievers@vrfy.org> writes:

> Should be fine now. The escapeHTML() garbled the utf8 "ö", and the
> decode() failed that.

Looking better.  Thanks.

This begs for addressing another issue, although I am hesitant
to open this can of worms at this moment.

It might be a good idea to have configuration items gitk and
gitweb can use to get a hint to decide what the commit log
message and the blob data encodings might be.  gitweb already
adds its own information to the git repository format
(.git/description), so this _could_ be stored outside just like
that (e.g. .git/commit_log_encoding), but using .git/config is
probably better.

How about doing something like this?

	[i18n]
		commitEncoding = utf8
		blobEncoding = utf8

to mean:

	If you _have_ to make an assumption on an encoding
	commit and blob objects are in, utf8 is your best bet
	(but mistakes can happen, and some blobs can be binary).

Then gitweb and gitk can look at commitEncoding and blobEncoding
as a hint to base their display defaults on.  Sending everything
out in utf8 would be a sane and safe choice these days for gitweb,
so if commitEncoding is latin-1 it may need to iconv latin-1 to
utf8 while reading commits.  For blobs, it might be better off
asking file(1) or File::MMagic (since you are using Perl in
gitweb --- sorry I do not know the tcl equivalent of that) what
they are; eventually you would want to be able to show
repositories full of jpeg pictures anyway ;-).
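
To make the commit half of that concrete, here is a rough sketch,
in Python rather than gitweb's Perl.  The function name
commit_log_to_utf8 and its hint argument are invented for
illustration; hint stands for whatever the proposed commitEncoding
item says.  Bytes that do not fit the hinted encoding are replaced
rather than rejected, since the hint can be wrong.

```python
def commit_log_to_utf8(raw: bytes, hint: str = "utf8") -> bytes:
    """Re-encode a raw commit log as utf8, treating `hint` as the
    assumed source encoding (hypothetical helper, not gitweb code)."""
    # errors="replace" turns undecodable bytes into U+FFFD instead of
    # raising, because the configured hint is only a best guess.
    text = raw.decode(hint, errors="replace")
    return text.encode("utf-8")


# A log written in latin-1, converted for display in a utf8 page.
latin1_log = "Gr\u00f6\u00dfe korrigiert\n".encode("latin-1")
print(commit_log_to_utf8(latin1_log, "latin-1"))
```

With hint set to utf8 the same function just validates the input,
so a viewer could call it unconditionally.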

On the commit-producing side, I could have:

	[i18n]
		editorEncoding = latin-1

and if editorEncoding is different from commitEncoding,
"git-commit -c $commit" would first iconv from utf8 to latin-1
before populating the user's editor, and iconv back from latin-1
to utf8 before feeding what the user edited to commit-tree.
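
The round trip could look roughly like the Python sketch below.
This is an illustration of the idea, not how git-commit is
implemented; recode, same_encoding, edit_commit_message and the
edit callback (standing in for the user's editor) are all made-up
names.

```python
import codecs


def recode(data: bytes, src: str, dst: str) -> bytes:
    """Convert bytes from encoding src to dst, like iconv -f src -t dst.
    Raises if a character cannot be represented in dst."""
    return data.decode(src).encode(dst)


def same_encoding(a: str, b: str) -> bool:
    # codecs.lookup() normalizes aliases such as "utf8" and "UTF-8".
    return codecs.lookup(a).name == codecs.lookup(b).name


def edit_commit_message(stored: bytes, commit_enc: str,
                        editor_enc: str, edit) -> bytes:
    """Hand the stored message to `edit` in editor_enc and return the
    result in commit_enc.  `edit` takes bytes and returns bytes."""
    if same_encoding(commit_enc, editor_enc):
        return edit(stored)
    shown = recode(stored, commit_enc, editor_enc)   # utf8 -> latin-1
    edited = edit(shown)
    return recode(edited, editor_enc, commit_enc)    # latin-1 -> utf8


# The "editor" here appends a line in latin-1, as a latin-1 terminal would.
result = edit_commit_message(
    "Gr\u00f6\u00dfe\n".encode("utf-8"), "utf8", "latin-1",
    lambda text: text + "gepr\u00fcft\n".encode("latin-1"))
print(result)
```

The sketch uses strict conversion on purpose: a message containing
characters that latin-1 cannot express fails loudly instead of
being silently damaged on the way to the editor.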

Pathname encoding is the reason why I was hesitant about bringing
this up.  Although it is too late for 1.0 now, we _could_ have
declared that the paths recorded in git tree objects and index
files are internally utf8, and working tree paths can be in a
different encoding.  As a local repository configuration, not a
project-wide one, we could have something like:

	[i18n]
		pathnameEncoding = latin-1

to mean that the filesystem paths returned by readdir(3) and
accepted by open(2) and friends are in latin-1.  Comparison and
movement between working tree files, the index file, and tree
objects have to involve iconv and do the right thing.  So if you
fetch from such a repository into a filesystem that stores
pathnames in utf8, the right thing should happen.
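
As a sketch of what "involve iconv" would mean at that boundary,
again in Python and with invented names (worktree_to_internal,
internal_to_worktree, same_path), assuming the internal form is
utf8 as described above.  Nothing here is what git actually does.

```python
def worktree_to_internal(name: bytes, pathname_encoding: str) -> bytes:
    """Filesystem name (as readdir would return it) -> utf8 name as it
    would be recorded in the index and tree objects."""
    return name.decode(pathname_encoding).encode("utf-8")


def internal_to_worktree(name: bytes, pathname_encoding: str) -> bytes:
    """utf8 name from the index or a tree -> bytes to hand to open()."""
    return name.decode("utf-8").encode(pathname_encoding)


def same_path(worktree_name: bytes, index_name: bytes,
              pathname_encoding: str) -> bool:
    # Compare in the internal (utf8) form, never byte-for-byte across
    # the two encodings.
    return worktree_to_internal(worktree_name, pathname_encoding) == index_name


# "cafe" with an acute accent: one byte in latin-1, two bytes in utf8.
print(same_path(b"caf\xe9.txt", b"caf\xc3\xa9.txt", "latin-1"))
```

A checkout on a utf8 filesystem would call internal_to_worktree
with "utf8" and get the recorded bytes back unchanged.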

I personally feel any sane project should restrict its pathnames
to ASCII only, so this issue might be moot (or is the right word
"mute"?), but something like this _might_ be useful in later
versions of git.

But not in 1.0.
