* Alphabet of kernel source
@ 2004-06-23 21:06 Pete Zaitcev
  2004-06-23 21:46 ` David Eger
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Pete Zaitcev @ 2004-06-23 21:06 UTC (permalink / raw)
  To: linux-kernel, zaitcev

Guys,

I have a silly question, for which I am unable to google out the answer
so far. Do we have a Linus' decree on the charset and encoding of the
kernel source?

I had a funny situation recently... I prefer non-MIME attachments
for two reasons: a) I grab parts of the header and fold them into the
patch and b) it is easier to quote fragments of the patch with the clients
I tried (mutt and sylpheed). Admittedly, different MUA software may
change these habits, but please bear with me here. So, someone sent
me a patch which included a context line with MODULE_AUTHOR() with
an accented name, which the author entered in ISO-8859-1 (he was German).
I replied, but my mail agent recoded the reply as UTF-8. The author
agreed to my patch, copied my reply, and sent it back to me. Everything was
perfectly readable at this point, but the patch was rejected. Because
I use Russian and Japanese simultaneously, all utilities run with UTF-8
on my boxes, so it took me a moment to do "LANG=C vi" and find the problem.
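
A minimal sketch of the mismatch described above, assuming that patch
compares context lines byte for byte. The author name is invented for
illustration; only the encodings (ISO-8859-1 and UTF-8) come from the
message.

```python
# Hypothetical context line; the name is made up for illustration.
line = 'MODULE_AUTHOR("J\u00fcrgen Beispiel");'

latin1_bytes = line.encode("iso-8859-1")  # bytes as stored in the tree
utf8_bytes = line.encode("utf-8")         # bytes after the mail agent recoded it


def context_matches(file_line: bytes, patch_line: bytes) -> bool:
    """Assumed patch-style check: context must match byte for byte."""
    return file_line == patch_line


print(latin1_bytes)  # the accented letter is the single byte 0xFC
print(utf8_bytes)    # the same letter is the two bytes 0xC3 0xBC
print(context_matches(latin1_bytes, utf8_bytes))
```

Both byte strings render as the same text in a terminal set up for the
matching charset, which is why the patch looked fine on screen and
still failed to apply.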

Anyhow, long story short, this got me thinking... What is the charset
and the encoding of the actual source? I saw quite a discussion about
the filenames, but this is different. I am sorry if this was discussed
previously.

-- Pete


* Re: Alphabet of kernel source
  2004-06-23 21:06 Alphabet of kernel source Pete Zaitcev
@ 2004-06-23 21:46 ` David Eger
  2004-06-23 22:18   ` Kalin KOZHUHAROV
  2004-06-23 21:58 ` Alphabet of kernel source Andries Brouwer
  2004-06-24 11:06 ` Richard B. Johnson
  2 siblings, 1 reply; 7+ messages in thread
From: David Eger @ 2004-06-23 21:46 UTC (permalink / raw)
  To: Pete Zaitcev; +Cc: linux-kernel


I started a thread a while ago (2.6.3/2.6.4) where I submitted some
patches to UTF-8ify the kernel sources.  Basically, most of the
kernel is ASCII (98.4% of the files).  The rest are mostly ISO-Latin-1,
with the rare bit of Japanese (in a couple of charsets) and some just
random bytes in some of the Documentation/...

http://www.yak.net/random/linux-2.6.4-utf8-cleanup-auto.diff
http://www.yak.net/random/linux-2.6.4-utf8-cleanup-cstrings2utf8.diff
http://www.yak.net/random/linux-2.6.4-utf8-cleanup-jp.diff
http://www.yak.net/random/linux-2.6.4-utf8-cleanup-wrong.diff
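
A rough sketch of how files could be sorted into the buckets mentioned
above (ASCII, UTF-8, everything else). This is not the script used to
produce the patches or the 98.4% figure; the function name and the
sample data are assumptions for illustration.

```python
def classify(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'other' for a file's raw bytes."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "other"  # e.g. ISO-8859-1, EUC-JP, or stray bytes


# Hypothetical sample contents; in practice these would be read with
# open(path, "rb").read() for each file in the tree.
samples = {
    "plain.c": b"int main(void) { return 0; }\n",
    "utf8.c": "/* 25 \u00b0C */\n".encode("utf-8"),
    "latin1.c": "/* 25 \u00b0C */\n".encode("iso-8859-1"),
}
for name, data in samples.items():
    print(name, classify(data))
```

One caveat: 7-bit encodings such as ISO-2022-JP use only bytes below
0x80, so a check like this reports them as "ascii" even though they
carry Japanese text.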

It's sorta difficult to do non-ASCII patches over email because
the kernel developers like reading their mail in mutt, and don't 
like attachments (the only sane ways to send non 7-bit clean data:
8-bit MIME: tagged and bagged or uuencoded)

Further, you confuse the hell out of vi if you have any trash (8bit data
in another charset) in a file that's supposed to be UTF-8.  i.e. don't
think you're going to be able to look at a charset changing patch in
anything.
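
To find the trash without fighting the editor, one option is to check
each line on its own. The sketch below reports which lines of a
supposedly UTF-8 file fail to decode; the helper name and the sample
data are hypothetical.

```python
from typing import List


def invalid_utf8_lines(data: bytes) -> List[int]:
    """Return 1-based numbers of lines that do not decode as UTF-8."""
    bad = []
    # Splitting on 0x0A is safe: that byte never occurs inside a
    # multi-byte UTF-8 sequence.
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(number)
    return bad


mixed = (
    "caf\u00e9\n".encode("utf-8")          # valid UTF-8
    + "caf\u00e9\n".encode("iso-8859-1")   # same text in Latin-1 bytes
    + b"plain ascii\n"
)
print(invalid_utf8_lines(mixed))
```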

-dte



* Re: Alphabet of kernel source
  2004-06-23 21:06 Alphabet of kernel source Pete Zaitcev
  2004-06-23 21:46 ` David Eger
@ 2004-06-23 21:58 ` Andries Brouwer
  2004-06-24 11:06 ` Richard B. Johnson
  2 siblings, 0 replies; 7+ messages in thread
From: Andries Brouwer @ 2004-06-23 21:58 UTC (permalink / raw)
  To: Pete Zaitcev; +Cc: linux-kernel

On Wed, Jun 23, 2004 at 02:06:28PM -0700, Pete Zaitcev wrote:

> Anyhow, long story short, this got me thinking... What is the charset
> and the encoding of the actual source? I saw quite a discussion about
> the filenames, but this is different. I am sorry if this was discussed
> previously.

This has come up repeatedly. As far as I recall, Linus has never said
anything. The de facto situation can be seen by just inspecting the
MAINTAINERS file. Kai Makisara has a diaeresis on the first vowel of
his last name. Today (2.6.6) that is still coded in ISO 8859-1.

In old discussions people who disliked 8859-1 expressed strong preference
for plain ASCII (possibly with TeX-like escape sequences for non-ASCII).
These days it seems that, if anything is changed, the only reasonable action
would be to switch to UTF-8.
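
A minimal sketch of what such a switch means at the byte level,
assuming the non-UTF-8 content really is ISO 8859-1 (the thread notes
that some files use other charsets, which this does not handle). The
function name is hypothetical.

```python
def to_utf8(data: bytes, fallback: str = "iso-8859-1") -> bytes:
    """Re-encode data as UTF-8; valid UTF-8 (including ASCII) is left alone."""
    try:
        data.decode("utf-8")
        return data
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is in `fallback`.
        return data.decode(fallback).encode("utf-8")


entry = "Kai M\u00e4kisara\n".encode("iso-8859-1")
converted = to_utf8(entry)
print(entry)      # the diaeresis is the single byte 0xE4
print(converted)  # after conversion it is the two bytes 0xC3 0xA4
```

Checking for valid UTF-8 first keeps the conversion idempotent, so
running it twice does not double-encode. It is only a heuristic: a few
Latin-1 byte sequences also happen to be valid UTF-8 and would be left
untouched.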

Andries


* Re: Alphabet of kernel source
  2004-06-23 21:46 ` David Eger
@ 2004-06-23 22:18   ` Kalin KOZHUHAROV
  2004-06-24  6:16     ` David Eger
  0 siblings, 1 reply; 7+ messages in thread
From: Kalin KOZHUHAROV @ 2004-06-23 22:18 UTC (permalink / raw)
  To: David Eger; +Cc: LKML

David Eger wrote:
> I started a thread a while ago (2.6.3/2.6.4) where I submitted some
> patches to UTF-8ify the kernel sources.  Basically, most of the
> kernel is ASCII (98.4% of the files).  The rest are mostly ISO-Latin-1,
> with the rare bit of Japanese (in a couple of charsets) and some just
> random bytes in some of the Documentation/...

The "problem" is contributor names, although having everything in plain ASCII is reasonable, I guess.

> http://www.yak.net/random/linux-2.6.4-utf8-cleanup-auto.diff
A lot of names and some art supposed to be ASCII.

> http://www.yak.net/random/linux-2.6.4-utf8-cleanup-cstrings2utf8.diff
Some degree symbols and microseconds... and names.
I remember having problems with lm-sensors trying to print degrees, how did they fight the problem?

> http://www.yak.net/random/linux-2.6.4-utf8-cleanup-jp.diff
Ok, this Japanese is only in the comments.
I can translate that in no time and fix this diff.
WTF is arch/v850/ ?
I guess you had some kind of script; can you try it on vanilla 2.6.7, please, and post the results?

> http://www.yak.net/random/linux-2.6.4-utf8-cleanup-wrong.diff
There are a few microsecond signs written properly, but they could just as well be typed as "us", or the abbreviation could be avoided altogether.

> It's sorta difficult to do non-ASCII patches over email because
> the kernel developers like reading their mail in mutt, and don't 
> like attachments (the only sane ways to send non 7-bit clean data:
> 8-bit MIME: tagged and bagged or uuencoded)
> 
> Further, you confuse the hell out of vi if you have any trash (8bit data
> in another charset) in a file that's supposed to be UTF-8.  i.e. don't
> think you're going to be able to look at a charset changing patch in
> anything.
Totally agree, although I use Mozilla Mail (and sometimes mutt).

Kalin.

-- 
||///_ o  *****************************
||//'_/>     WWW: http://ThinRope.net/
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


* Re: Alphabet of kernel source
  2004-06-23 22:18   ` Kalin KOZHUHAROV
@ 2004-06-24  6:16     ` David Eger
  2004-06-27  5:48       ` [PATCH] Translate Japanese comments in arch/v850 ( was: Alphabet of kernel source) Kalin KOZHUHAROV
  0 siblings, 1 reply; 7+ messages in thread
From: David Eger @ 2004-06-24  6:16 UTC (permalink / raw)
  To: Kalin KOZHUHAROV; +Cc: LKML

On Thu, Jun 24, 2004 at 07:18:41AM +0900, Kalin KOZHUHAROV wrote:
> >http://www.yak.net/random/linux-2.6.4-utf8-cleanup-cstrings2utf8.diff
> Some degree symbols and microseconds... and names.
> I remember having problems with lm-sensors trying to print degrees, how did 
> they fight the problem?

I assume the machines where they cat /proc/blah are running with
ISO Latin 1 as the local charset ;-)
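
To illustrate, assuming the degree sign is emitted as the single
Latin-1 byte 0xB0: the same output decodes cleanly on a Latin-1
machine and turns into a replacement character on a UTF-8 one. The
sample line is invented.

```python
raw = b"temp: 42\xb0C\n"  # hypothetical output line with a Latin-1 degree sign

as_latin1 = raw.decode("iso-8859-1")
as_utf8 = raw.decode("utf-8", errors="replace")

print(as_latin1)  # degree sign shows up as intended
print(as_utf8)    # 0xB0 alone is not valid UTF-8, so U+FFFD appears instead
```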

> >http://www.yak.net/random/linux-2.6.4-utf8-cleanup-jp.diff
> Ok, this Japanese is only in the comments.
> I can translate that in no time and fix this diff.

actually, I'm pretty sure the diff is correct against 2.6.4 - the bytes
should all be correct, as I checked it with someone who works with
said files...

> I guess you had some kind of script, can you try it on vanilla 2.6.7, 
> please, and post results.

I will regenerate the patches if someone in charge (Linus or Andrew) 
actually wants them.

-dte


* Re: Alphabet of kernel source
  2004-06-23 21:06 Alphabet of kernel source Pete Zaitcev
  2004-06-23 21:46 ` David Eger
  2004-06-23 21:58 ` Alphabet of kernel source Andries Brouwer
@ 2004-06-24 11:06 ` Richard B. Johnson
  2 siblings, 0 replies; 7+ messages in thread
From: Richard B. Johnson @ 2004-06-24 11:06 UTC (permalink / raw)
  To: Pete Zaitcev; +Cc: linux-kernel

On Wed, 23 Jun 2004, Pete Zaitcev wrote:

> Guys,
>
> I have a silly question, for which I am unable to google out the answer
> so far. Do we have a Linus' decree on the charset and encoding of the
> kernel source?
>
[SNIPPED...]

Good question!  It was supposed to be ASCII, which, I guess, is also
valid UTF-8 (ASCII being a subset of it). However, I find that tabs, which
were decreed to be at 8-column intervals, end up being used
instead of single spaces (i.e., as one column), etc. So, if you look at
some well-patched source you sometimes see a mess.

The names of contributors often have non-ASCII characters
in them. This may not be a problem, but when using `pine`
without the 'latest-and-greatest' version, they sometimes
are unreadable.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.26 on an i686 machine (5570.56 BogoMips).
            Note 96.31% of all statistics are fiction.




* [PATCH] Translate Japanese comments in arch/v850  ( was: Alphabet of kernel source)
  2004-06-24  6:16     ` David Eger
@ 2004-06-27  5:48       ` Kalin KOZHUHAROV
  0 siblings, 0 replies; 7+ messages in thread
From: Kalin KOZHUHAROV @ 2004-06-27  5:48 UTC (permalink / raw)
  To: David Eger; +Cc: LKML

[-- Attachment #1: Type: text/plain, Size: 1020 bytes --]

David Eger wrote:
>>>http://www.yak.net/random/linux-2.6.4-utf8-cleanup-jp.diff
>>
>>Ok, this Japanese is only in the comments.
>>I can translate that in no time and fix this diff.
> 
> actually, I'm pretty sure the diff is correct against 2.6.4 - the bytes
> should all be correct, as I checked it with someone who works with
> said files...

OK, I had a few idle minutes, so I did patch the Japanese comments in arch/v850.

I am not exactly 100% sure I translated it correctly, since I have no idea what exactly that NEC v850 evaluation board was, but it should be OK (say 95% sure).

Patches just the comments, so code is untouched.

The other thing is that one of the files was encoded (i.e. readable) in iso-2022-jp, the other in euc-jp...
No idea how patch will handle this, I hope it doesn't bother with locale settings, etc.
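
A small sketch of why the two files differ on disk, assuming patch
compares raw bytes regardless of locale. The sample text is an
arbitrary Japanese word and is not taken from arch/v850.

```python
text = "\u30b3\u30e1\u30f3\u30c8"  # "komento" (comment) in katakana, sample only

iso2022 = text.encode("iso2022_jp")  # 7-bit, switches charsets with ESC sequences
eucjp = text.encode("euc_jp")        # 8-bit, high bit set on every byte of the text

print(iso2022)
print(eucjp)
print(iso2022.isascii(), eucjp.isascii())
```

The same comment is therefore two unrelated byte sequences, and a hunk
written against one encoding would not match a file stored in the
other.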

Attaching as application/octet-stream in the hope of better handling of i18n issues, sorry for the inconvenience.

Here goes the patch:

Signed-off-by: Kalin KOZHUHAROV <kalin@thinrope.net>


[-- Attachment #2: v850-jp2en.diff --]
[-- Type: application/octet-stream, Size: 1947 bytes --]
