git.vger.kernel.org archive mirror
* gitweb: charset problem
@ 2005-10-24  7:18 Nico -telmich- Schottelius
  2005-10-24 12:34 ` Kay Sievers
  2005-10-24 13:56 ` Horst von Brand
  0 siblings, 2 replies; 8+ messages in thread
From: Nico -telmich- Schottelius @ 2005-10-24  7:18 UTC (permalink / raw)
  To: Kay Sievers, Git Mailing List; +Cc: Christian Gierke, Peter Portmann

Hello!

gitweb (my $version =           "247";) seems to send utf-8 as meta tag encoding
(<meta http-equiv="content-type" content="text/html; charset=utf-8"/>).
The problem is that the name of the user "HansjOErg" (OE is the german umlaut)
is in iso8859-1 in /etc/passwd.
This is guessed, but it does not look like utf-8, as it's a one byte encoding:

00007b0: 3031 323a 3130 303a 4861 6e73 6af6 7267  012:100:Hansj.rg

What would be the correct way to fix that? Change the username to utf-8?
(Is this possible without causing problems in other programs?)
Or tell gitweb that it should convert non-UTF-8 to UTF-8?

But we also have another problem: Sometimes we have umlauts in the commit messages.
Those are also displayed incorrectly. When I switch to iso-8859-1 encoding in mozilla,
the characters in the username and in the commit message are ok.

Greetings,

Nico

-- 
Latest project: cconfig (http://nico.schotteli.us/papers/linux/cconfig/)
Open Source nurtures open minds and free, creative developers.

* Re: gitweb: charset problem
  2005-10-24  7:18 gitweb: charset problem Nico -telmich- Schottelius
@ 2005-10-24 12:34 ` Kay Sievers
  2005-10-24 13:56 ` Horst von Brand
  1 sibling, 0 replies; 8+ messages in thread
From: Kay Sievers @ 2005-10-24 12:34 UTC (permalink / raw)
  To: Nico -telmich- Schottelius
  Cc: Git Mailing List, Christian Gierke, Peter Portmann

On Mon, Oct 24, 2005 at 09:18:39AM +0200, Nico -telmich- Schottelius wrote:
> gitweb (my $version =           "247";) seems to send utf-8 as meta tag encoding
> (<meta http-equiv="content-type" content="text/html; charset=utf-8"/>).

Yes, that is intentional. The http header is also overridden if the
webserver's default is not utf8.

> The problem is that the name of the user "HansjOErg" (OE is the german umlaut)
> is in iso8859-1 in /etc/passwd.

Huh, not sure if it's a good idea to put that into a username,
never tried or ever seen that.

> This is guessed, but it does not look like utf-8, as it's a one byte encoding:
> 
> 00007b0: 3031 323a 3130 303a 4861 6e73 6af6 7267  012:100:Hansj.rg

Sure, 0xf6 is the single letter 'ö' (oe) in iso-8859.
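
A minimal Python sketch of that observation, using the eight bytes from the
hex dump above. The variable names are illustrative only and not taken from
gitweb:

```python
# The bytes "Hansj" + 0xf6 + "rg", copied from the hex dump above.
raw = bytes.fromhex("48616e736af67267")

# ISO-8859-1 maps every byte to one character, so this always succeeds.
as_latin1 = raw.decode("iso-8859-1")

# The byte 0xf6 never occurs in well-formed UTF-8, so strict decoding fails.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# The same name stored as UTF-8 needs two bytes (0xc3 0xb6) for the umlaut.
as_utf8_bytes = as_latin1.encode("utf-8")

print(ascii(as_latin1), valid_utf8, as_utf8_bytes.hex())
```

A browser told to expect utf-8 hits the same invalid byte, which is why the
name only looks right after switching the page to iso-8859-1.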

> What would be the correct way to fix that? Change the username to utf-8?
> (Is this possible without causing problems in other programs?)
> Or tell gitweb that it should convert non-UTF-8 to UTF-8?

Don't know. At best get rid of the non-ascii chars in /etc/passwd
if you don't want to get in trouble... :)

All other programs should definitely use utf8.

> But we also have another problem: Sometimes we have umlauts in the commit messages.
> Those are also displayed incorrectly. When I switch to iso-8859-1 encoding in mozilla,
> the characters in the username and in the commit message are ok.

utf8 is the one and only sane encoding if you need more than ascii chars.
Just convert everything to utf8 from your locale to your webserver and
all that pain will go away immediately. :)
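
One way such a conversion could look, sketched in Python under the
assumption that anything which is not valid UTF-8 is ISO-8859-1. The
function name to_utf8 and the fallback default are hypothetical. This is a
heuristic, not what gitweb does:

```python
def to_utf8(raw: bytes, fallback: str = "iso-8859-1") -> bytes:
    """Return raw as UTF-8 bytes, guessing the fallback encoding if needed."""
    try:
        # Valid UTF-8 (which includes plain ASCII) is passed through as-is.
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Assumption: anything else is in the legacy single-byte encoding.
        return raw.decode(fallback).encode("utf-8")


print(to_utf8(b"Hansj\xf6rg").hex())
```

The guess can be wrong: some ISO-8859-1 byte sequences happen to be valid
UTF-8 and would be passed through unchanged, so converting the data at the
source remains the more reliable fix.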

Best,
Kay

* Re: gitweb: charset problem
  2005-10-24  7:18 gitweb: charset problem Nico -telmich- Schottelius
  2005-10-24 12:34 ` Kay Sievers
@ 2005-10-24 13:56 ` Horst von Brand
  2005-10-24 21:55   ` Daniel Barkalow
  1 sibling, 1 reply; 8+ messages in thread
From: Horst von Brand @ 2005-10-24 13:56 UTC (permalink / raw)
  To: Nico -telmich- Schottelius
  Cc: Kay Sievers, Git Mailing List, Christian Gierke, Peter Portmann

Nico -telmich- Schottelius <nico-linux-git@schottelius.org> wrote:
> gitweb (my $version = "247";) seems to send utf-8 as meta tag encoding
> (<meta http-equiv="content-type" content="text/html; charset=utf-8"/>).

> The problem is that the name of the user "HansjOErg" (OE is the german
> umlaut) is in iso8859-1 in /etc/passwd.  This is guessed, but it does not
> look like utf-8, as it's a one byte encoding:

> 00007b0: 3031 323a 3130 303a 4861 6e73 6af6 7267  012:100:Hansj.rg

> What would be the correct way to fix that? Change the username to utf-8?
> (Is this possible without causing problems in other programs?)
> Or tell gitweb that it should convert non-UTF-8 to UTF-8?

I'd be /very/ wary of usernames that aren't plain ASCII, just lowercase
letters and digits, not starting with a digit, at most 8 characters long.
I've seen more than enough hard-to-debug funnies in the most surprising
places otherwise.
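
That conservative rule can be written as a short check. The sketch below is
in Python, and the name is_conservative_username is made up for
illustration. It encodes only the rule stated above, not any system's
actual account policy:

```python
import re

# One lowercase ASCII letter, then up to seven lowercase letters or digits.
_CONSERVATIVE_USERNAME = re.compile(r"[a-z][a-z0-9]{0,7}")


def is_conservative_username(name: str) -> bool:
    """True if name is 1-8 chars of [a-z0-9] and does not start with a digit."""
    return _CONSERVATIVE_USERNAME.fullmatch(name) is not None


print(is_conservative_username("hansjorg"))
```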

> But we also have another problem: Sometimes we have umlauts in the commit
> messages.  Those are also displayed incorrectly. When I switch to
> iso-8859-1 encoding in mozilla, the characters in the username and in the
> commit message are ok.

I believe the Emperor Penguin decreed messages have to be ASCII, or else
UTF-8. Please don't add to the mess by using non-portable encodings!
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513

* Re: gitweb: charset problem
  2005-10-24 13:56 ` Horst von Brand
@ 2005-10-24 21:55   ` Daniel Barkalow
  2005-10-24 22:39     ` Junio C Hamano
  0 siblings, 1 reply; 8+ messages in thread
From: Daniel Barkalow @ 2005-10-24 21:55 UTC (permalink / raw)
  To: Horst von Brand
  Cc: Nico -telmich- Schottelius, Kay Sievers, Git Mailing List,
	Christian Gierke, Peter Portmann, Junio C Hamano

On Mon, 24 Oct 2005, Horst von Brand wrote:

> > But we also have another problem: Sometimes we have umlauts in the commit
> > messages.  Those are also displayed incorrectly. When I switch to
> > iso-8859-1 encoding in mozilla, the characters in the username and in the
> > commit message are ok.
> 
> I believe the Emperor Penguin decreed messages have to be ASCII, or else
> UTF-8. Please don't add to the mess by using non-portable encodings!

Should we possibly reject non-UTF-8 input to commits?

IIRC, we actually define that to be UTF-8, unlike most of the other stuff, 
for which we don't actually insist on a policy.

	-Daniel
*This .sig left intentionally blank*

* Re: gitweb: charset problem
  2005-10-24 21:55   ` Daniel Barkalow
@ 2005-10-24 22:39     ` Junio C Hamano
  2005-10-25 16:01       ` Daniel Barkalow
  0 siblings, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2005-10-24 22:39 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: git

Daniel Barkalow <barkalow@iabervon.org> writes:

> On Mon, 24 Oct 2005, Horst von Brand wrote:
>
>> I believe the Emperor Penguin decreed messages have to be
>> ASCII, or else UTF-8. Please don't add to the mess by using
>> non-portable encodings!
>
> Should we possibly reject non-UTF-8 input to commits?

Please, don't.

> IIRC, we actually define that to be UTF-8, unlike most of the
> other stuff, for which we don't actually insist on a policy.

No, we do not define nor insist on a particular policy as far as
I know.  We suggest the use of UTF-8 merely from common sense to
help interoperability, and make UTF-8 slightly easier to use
than other encodings by giving specific support for it in some
tools, namely -u flag in git-mailinfo.

It is perfectly reasonable for a company-internal project that
works in Russia to standardize on KOI, or in Japan on EUC-JP.
We simply allow it, neither encouraging nor discouraging it.  If
gitweb can take a configuration mechanism to override the
built-in UTF-8 header, that is a perfectly valid thing to do to
help such an environment.
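
As a rough illustration of such an override, here is a Python sketch in
which the charset is a parameter instead of a hard-coded "utf-8". gitweb
itself is a Perl script, and the function names below are hypothetical and
do not correspond to any existing gitweb option:

```python
import codecs


def content_type_header(charset: str = "utf-8") -> str:
    """Build the HTTP Content-Type header line for the chosen charset."""
    # codecs.lookup raises LookupError for names Python does not know;
    # it is used here only to catch typos in the configured value.
    codecs.lookup(charset)
    return f"Content-Type: text/html; charset={charset}"


def meta_tag(charset: str = "utf-8") -> str:
    """Build the matching <meta> tag so header and page body agree."""
    codecs.lookup(charset)
    return f'<meta http-equiv="content-type" content="text/html; charset={charset}"/>'


print(content_type_header("euc-jp"))
print(meta_tag("euc-jp"))
```

The header and the meta tag are derived from the same value so that a site
standardized on, say, EUC-JP does not announce two different encodings.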

However, we suggest UTF-8 if the project does not have a
compelling reason to do otherwise [*1*].  If you want to be
prepared for the day your project might have wider participants
than you originally envisioned, that is the most sensible thing
to do.  This is especially true because the commit logs cannot
be re-encoded after the fact.

[Footnote]

*1* For example, I've never gotten GNU emacs to work well with
Japanese in UTF-8, so if people in my company internal project
wanted to use Japanese in commit logs, I would probably
standardize on EUC-JP for such a project.  Luckily so far I have
not been forced to make that decision.

* Re: gitweb: charset problem
  2005-10-24 22:39     ` Junio C Hamano
@ 2005-10-25 16:01       ` Daniel Barkalow
  2005-10-25 17:31         ` Junio C Hamano
  2005-10-25 17:44         ` Junio C Hamano
  0 siblings, 2 replies; 8+ messages in thread
From: Daniel Barkalow @ 2005-10-25 16:01 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Mon, 24 Oct 2005, Junio C Hamano wrote:

> > IIRC, we actually define that to be UTF-8, unlike most of the
> > other stuff, for which we don't actually insist on a policy.
> 
> No, we do not define nor insist on a particular policy as far as
> I know.  We suggest the use of UTF-8 merely from common sense to
> help interoperability, and make UTF-8 slightly easier to use
> than other encodings by giving specific support for it in some
> tools, namely -u flag in git-mailinfo.

I thought we'd decided on uninterpreted byte values for blobs, filenames, 
and trees (and everything in the working tree), but using UTF-8 for tag 
and commit objects.

Consider if you started a project in EUC-JP, and then decided to switch to 
UTF-8 later (when your environment handled it cleanly, perhaps). You could 
convert all the file contents and move files to re-encoded names, but 
you'd then want to commit these changes and have the log before and after 
simultaneously intelligible.

> [Footnote]
> 
> *1* For example, I've never gotten GNU emacs to work well with
> Japanese in UTF-8, so if people in my company internal project
> wanted to use Japanese in commit logs, I would probably
> standardize on EUC-JP for such a project.  Luckily so far I have
> not been forced to make that decision.

It wouldn't be hard to convert at some point between the editor and the 
commit object, and you don't re-edit the commit objects like you do 
tracked files. It probably wouldn't even be hard for commit-tree to 
convert its input based on locale. (And stuff which prints commit contents 
for user consumption probably ought to re-encode it if necessary, too)
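
A sketch of the two conversions meant here, in Python, assuming commit
objects store UTF-8 and the user's encoding comes from the locale. Both
function names are invented for illustration and nothing like this exists
in commit-tree today:

```python
import locale
from typing import Optional


def commit_message_to_utf8(raw: bytes, source_encoding: Optional[str] = None) -> bytes:
    """Re-encode editor output into UTF-8 before it goes into the commit."""
    if source_encoding is None:
        # Assumption: the editor wrote the message in the locale's encoding.
        source_encoding = locale.getpreferredencoding(False)
    return raw.decode(source_encoding).encode("utf-8")


def commit_message_for_display(stored: bytes, target_encoding: str) -> bytes:
    """Re-encode a stored UTF-8 message for a terminal in another encoding."""
    # Characters the target encoding cannot represent become '?'.
    return stored.decode("utf-8").encode(target_encoding, errors="replace")


euc_jp_message = "\u65e5\u672c\u8a9e".encode("euc-jp")
stored = commit_message_to_utf8(euc_jp_message, "euc-jp")
print(stored.hex())
```

With this arrangement the editor and the terminal can stay in EUC-JP while
the log itself has a single encoding before and after a later switch.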

	-Daniel
*This .sig left intentionally blank*

* Re: gitweb: charset problem
  2005-10-25 16:01       ` Daniel Barkalow
@ 2005-10-25 17:31         ` Junio C Hamano
  2005-10-25 17:44         ` Junio C Hamano
  1 sibling, 0 replies; 8+ messages in thread
From: Junio C Hamano @ 2005-10-25 17:31 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: git

Daniel Barkalow <barkalow@iabervon.org> writes:

> Consider if you started a project in EUC-JP, and then decided to switch to 
> UTF-8 later (when your environment handled it cleanly, perhaps). You could 
> convert...

I am not saying that using UTF-8 is impossible in some
situations.  I am saying that if all involved parties agree to
use something else in a private project, that is their choice
and there is no reason to forbid it.  They may be shooting
themselves in the foot in the long run, or they may not.

For the internal project I was using as an example, I do not
foresee anybody who does not do Japanese ever touching it, nor will
it need to record any other language in the future (this comes
from the nature of the project -- keeping track of some
documents that are written in Japanese and we are not in
translation business).  Log and contents being encoded in EUC-JP
is perfectly valid right now and in the future in that project.
In such an application there is nothing gained by using UTF-8,
and it will only inconvenience the users if we insisted on
UTF-8.

* Re: gitweb: charset problem
  2005-10-25 16:01       ` Daniel Barkalow
  2005-10-25 17:31         ` Junio C Hamano
@ 2005-10-25 17:44         ` Junio C Hamano
  1 sibling, 0 replies; 8+ messages in thread
From: Junio C Hamano @ 2005-10-25 17:44 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: git

Daniel Barkalow <barkalow@iabervon.org> writes:

> It wouldn't be hard to convert at some point between the editor and the 
> commit object, and you don't re-edit the commit objects like you do 
> tracked files. It probably wouldn't even be hard for commit-tree to 
> convert its input based on locale. (And stuff which prints commit contents 
> for user consumption probably ought to re-encode it if necessary, too)

Don't get me wrong.  I am not opposed to giving preferential
treatment to UTF-8 by supporting it better.  I think it may be a
good idea to have an *option* in commit-tree and mktag to
convert from LC_CTYPE to utf-8, just like mailinfo does.

I am just opposed to making UTF-8 mandatory.
