* [Patch] Support UTF-8 scripts
@ 2005-08-13 12:07 "Martin v. Löwis"
2005-08-13 16:35 ` Stephen Pollei
2005-08-31 23:27 ` H. Peter Anvin
0 siblings, 2 replies; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-08-13 12:07 UTC (permalink / raw)
To: linux-kernel
This patch adds support for UTF-8 signatures (aka BOM, byte order
mark) to binfmt_script. Files that start with EF BB BF # ! are now
recognized as scripts (in addition to files starting with # !).
With such support, creating scripts that reliably carry non-ASCII
characters is simplified. Editors and the script interpreter can
easily agree on what the encoding of the script is, and the
interpreter can then render strings appropriately. Currently,
Python supports source files that start with the UTF-8 signature;
the approach would naturally extend to Perl to enhance/replace
the "use utf8" pragma. Likewise, Tcl could use the UTF-8 signature
to reliably identify UTF-8 source code (instead of assuming
[encoding system] for source code).
Please find the patch attached below.
Regards,
Martin
Signed-off-by: Martin v. Löwis <martin@v.loewis.de>
diff --git a/fs/binfmt_script.c b/fs/binfmt_script.c
--- a/fs/binfmt_script.c
+++ b/fs/binfmt_script.c
@@ -1,7 +1,7 @@
/*
* linux/fs/binfmt_script.c
*
- * Copyright (C) 1996 Martin von Löwis
+ * Copyright (C) 1996, 2005 Martin von Löwis
* original #!-checking implemented by tytso.
*/
@@ -23,7 +23,16 @@ static int load_script(struct linux_binp
char interp[BINPRM_BUF_SIZE];
int retval;
- if ((bprm->buf[0] != '#') || (bprm->buf[1] != '!') || (bprm->sh_bang))
+ /* It is a recursive invocation. */
+ if (bprm->sh_bang)
+ return -ENOEXEC;
+
+ /* It starts neither with #!, nor with #! preceded by
+ the UTF-8 signature. */
+ if (!(((bprm->buf[0] == '#') && (bprm->buf[1] == '!'))
+ || ((bprm->buf[0] == '\xef') && (bprm->buf[1] == '\xbb')
+ && (bprm->buf[2] == '\xbf') && (bprm->buf[3] == '#')
+ && (bprm->buf[4] == '!'))))
return -ENOEXEC;
/*
* This section does the #! interpretation.
@@ -46,7 +55,8 @@ static int load_script(struct linux_binp
else
break;
}
- for (cp = bprm->buf+2; (*cp == ' ') || (*cp == '\t'); cp++);
+ cp = (bprm->buf[0]=='\xef') ? bprm->buf+5 : bprm->buf+2;
+ while ((*cp == ' ') || (*cp == '\t')) cp++;
if (*cp == '\0')
return -ENOEXEC; /* No interpreter name found */
i_name = cp;
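The check added by the patch can be sketched outside the kernel as follows. This is only a minimal userspace illustration, not the kernel code: script_magic_len() is a hypothetical helper name, and it assumes the caller hands in the first bytes of the file together with their count.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* UTF-8 encoding of U+FEFF, i.e. the "UTF-8 signature": EF BB BF. */
static const unsigned char utf8_sig[3] = { 0xEF, 0xBB, 0xBF };

/*
 * Hypothetical helper, not part of the patch: return the offset of the
 * first byte after "#!" (2 without a signature, 5 with one), or -1 if
 * the buffer does not start like a script.
 */
int script_magic_len(const unsigned char *buf, size_t len)
{
	size_t off = 0;

	assert(buf != NULL);

	/* Skip the optional signature. */
	if (len >= sizeof(utf8_sig) &&
	    memcmp(buf, utf8_sig, sizeof(utf8_sig)) == 0)
		off = sizeof(utf8_sig);

	/* "#!" must follow immediately. */
	if (len < off + 2 || buf[off] != '#' || buf[off + 1] != '!')
		return -1;

	return (int)(off + 2);
}
```

The returned offset corresponds to the bprm->buf+2 / bprm->buf+5 choice in the last hunk. Unlike the patch, the sketch takes an explicit length and checks it before every read.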
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts 2005-08-13 12:07 [Patch] Support UTF-8 scripts "Martin v. Löwis" @ 2005-08-13 16:35 ` Stephen Pollei 2005-08-13 18:42 ` Lee Revell 2005-08-31 23:27 ` H. Peter Anvin 1 sibling, 1 reply; 80+ messages in thread From: Stephen Pollei @ 2005-08-13 16:35 UTC (permalink / raw) To: Martin v. Löwis; +Cc: linux-kernel On 8/13/05, "Martin v. Löwis" <martin@v.loewis.de> wrote: > This patch adds support for UTF-8 signatures (aka BOM, byte order > mark) to binfmt_script. > With such support, creating scripts that reliably carry non-ASCII > characters is simplified. > the approach would naturally extend to Perl to enhance/replace > the "use utf8" pragma. Thats great for the perl6 people. http://dev.perl.org/perl6/doc/design/syn/S03.html says they are going to be using « and » as operators... So I'd imagine that a lot of perl6 scripts would be utf8. -- http://dmoz.org/profiles/pollei.html http://sourceforge.net/users/stephen_pollei/ http://www.orkut.com/Profile.aspx?uid=2455954990164098214 http://stephen_pollei.home.comcast.net/ ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts 2005-08-13 16:35 ` Stephen Pollei @ 2005-08-13 18:42 ` Lee Revell 2005-08-13 18:49 ` Hugo Mills ` (3 more replies) 0 siblings, 4 replies; 80+ messages in thread From: Lee Revell @ 2005-08-13 18:42 UTC (permalink / raw) To: Stephen Pollei; +Cc: Martin v. Löwis, linux-kernel On Sat, 2005-08-13 at 09:35 -0700, Stephen Pollei wrote: > Thats great for the perl6 people. > http://dev.perl.org/perl6/doc/design/syn/S03.html says they are going > to be using « and » as operators... Is Larry smoking crack? That's one of the worst ideas I've heard in a long time. There's no easy way to enter those at the keyboard! http://www.cl.cam.ac.uk/~mgk25/unicode.html#input Lee ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts 2005-08-13 18:42 ` Lee Revell @ 2005-08-13 18:49 ` Hugo Mills 2005-08-13 18:53 ` Lee Revell ` (2 more replies) 2005-08-14 0:53 ` Alan Cox ` (2 subsequent siblings) 3 siblings, 3 replies; 80+ messages in thread From: Hugo Mills @ 2005-08-13 18:49 UTC (permalink / raw) To: Lee Revell; +Cc: Stephen Pollei, Martin v. Löwis, linux-kernel [-- Attachment #1: Type: text/plain, Size: 948 bytes --] On Sat, Aug 13, 2005 at 02:42:52PM -0400, Lee Revell wrote: > On Sat, 2005-08-13 at 09:35 -0700, Stephen Pollei wrote: > > Thats great for the perl6 people. > > http://dev.perl.org/perl6/doc/design/syn/S03.html says they are going > > to be using « and » as operators... > > Is Larry smoking crack? That's one of the worst ideas I've heard in a > long time. There's no easy way to enter those at the keyboard! I have "setxkbmap -symbols 'en_US(pc102)+gb'" in my ~/.xsession, and « and » are available as AltGr-z and AltGr-x respectively. Hugo. -- === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk === PGP key: 1C335860 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk --- Anyone who claims their cryptographic protocol is secure is --- either a genius or a fool. Given the genius/fool ratio for our species, the odds aren't good. [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts 2005-08-13 18:49 ` Hugo Mills @ 2005-08-13 18:53 ` Lee Revell 2005-08-14 0:57 ` Alan Cox 2005-08-13 19:20 ` Lee Revell 2005-08-16 9:46 ` Jan Engelhardt 2 siblings, 1 reply; 80+ messages in thread From: Lee Revell @ 2005-08-13 18:53 UTC (permalink / raw) To: Hugo Mills; +Cc: Stephen Pollei, Martin v. Löwis, linux-kernel On Sat, 2005-08-13 at 19:49 +0100, Hugo Mills wrote: > On Sat, Aug 13, 2005 at 02:42:52PM -0400, Lee Revell wrote: > > On Sat, 2005-08-13 at 09:35 -0700, Stephen Pollei wrote: > > > Thats great for the perl6 people. > > > http://dev.perl.org/perl6/doc/design/syn/S03.html says they are going > > > to be using « and » as operators... > > > > Is Larry smoking crack? That's one of the worst ideas I've heard in a > > long time. There's no easy way to enter those at the keyboard! > > I have "setxkbmap -symbols 'en_US(pc102)+gb'" in my ~/.xsession, > and « and » are available as AltGr-z and AltGr-x respectively. Most keyboards don't have an AltGr key. Lee ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts 2005-08-13 18:53 ` Lee Revell @ 2005-08-14 0:57 ` Alan Cox 2005-08-14 1:19 ` Kyle Moffett 0 siblings, 1 reply; 80+ messages in thread From: Alan Cox @ 2005-08-14 0:57 UTC (permalink / raw) To: Lee Revell; +Cc: Hugo Mills, Stephen Pollei, Martin v. Löwis, linux-kernel > > I have "setxkbmap -symbols 'en_US(pc102)+gb'" in my ~/.xsession, > > and « and » are available as AltGr-z and AltGr-x respectively. > > Most keyboards don't have an AltGr key. You must be an American. Most of the world's keyboards have an AltGr key. You'll find that US keyboards have two alt keys to avoid confusing people (like one button mice ;)) but the right one is understood by the X bindings to be "AltGr". Even though the US keyboard is apparently lacking functionality, it's purely a text label issue. Alan ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts 2005-08-14 0:57 ` Alan Cox @ 2005-08-14 1:19 ` Kyle Moffett 2005-08-14 1:40 ` Lee Revell 0 siblings, 1 reply; 80+ messages in thread From: Kyle Moffett @ 2005-08-14 1:19 UTC (permalink / raw) To: Alan Cox Cc: Lee Revell, Hugo Mills, Stephen Pollei, Martin v. Löwis , linux-kernel On Aug 13, 2005, at 20:57:45, Alan Cox wrote: >>> I have "setxkbmap -symbols 'en_US(pc102)+gb'" in my ~/.xsession, >>> and « and » are available as AltGr-z and AltGr-x respectively. >> >> Most keyboards don't have an AltGr key. > > You must be an American. Most old the worlds keyboards have an AltGr > key. You'll find that US keyboards have two alt keys to avoid > confusing > people (like one button mice ;)) but the right one is understood by > the > X bindings to be "AltGr". Even though the US keyboard is apparently > lacking functionality its purely a text label issue And those of us who are Mac OS X oriented have patched our console and X keycodes to match the mac way of generating symbols: Alt-\ = « Alt-Shift-\ = » Alt-Shift-+ = ± If only someone could come up with a good character palette like exists on that OS, something that could generate a wide variety of keysyms, preferably all of UTF-8, and send them to the topmost window. Cheers, Kyle Moffett -- Unix was not designed to stop people from doing stupid things, because that would also stop them from doing clever things. -- Doug Gwyn ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts 2005-08-14 1:19 ` Kyle Moffett @ 2005-08-14 1:40 ` Lee Revell 2005-08-14 10:40 ` Wichert Akkerman 0 siblings, 1 reply; 80+ messages in thread From: Lee Revell @ 2005-08-14 1:40 UTC (permalink / raw) To: Kyle Moffett Cc: Alan Cox, Hugo Mills, Stephen Pollei, Martin v. Löwis , linux-kernel On Sat, 2005-08-13 at 21:19 -0400, Kyle Moffett wrote: > And those of us who are Mac OS X oriented have patched our console and > X keycodes to match the mac way of generating symbols: > > Alt-\ = « > Alt-Shift-\ = » > Alt-Shift-+ = ± > My point exactly, it's idiotic for Perl6 to use these as OPERATORS, the atoms of the language, when there's not even a platform independent way to type them in. Lee ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts 2005-08-14 1:40 ` Lee Revell @ 2005-08-14 10:40 ` Wichert Akkerman 0 siblings, 0 replies; 80+ messages in thread From: Wichert Akkerman @ 2005-08-14 10:40 UTC (permalink / raw) To: linux-kernel Previously Lee Revell wrote: > My point exactly, it's idiotic for Perl6 to use these as OPERATORS, the > atoms of the language, when there's not even a platform independent way > to type them in. If anyone had bothered to read the URL in one of the earlier emails, you would have seen that '<<' is an accepted alternative spelling. Wichert. -- Wichert Akkerman <wichert@wiggy.net> It is simple to make things. http://www.wiggy.net/ It is hard to make things simple. ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts 2005-08-13 18:49 ` Hugo Mills 2005-08-13 18:53 ` Lee Revell @ 2005-08-13 19:20 ` Lee Revell 2005-08-16 9:46 ` Jan Engelhardt 2 siblings, 0 replies; 80+ messages in thread From: Lee Revell @ 2005-08-13 19:20 UTC (permalink / raw) To: Hugo Mills; +Cc: Stephen Pollei, Martin v. Löwis, linux-kernel On Sat, 2005-08-13 at 19:49 +0100, Hugo Mills wrote: > On Sat, Aug 13, 2005 at 02:42:52PM -0400, Lee Revell wrote: > > On Sat, 2005-08-13 at 09:35 -0700, Stephen Pollei wrote: > > > Thats great for the perl6 people. > > > http://dev.perl.org/perl6/doc/design/syn/S03.html says they are going > > > to be using « and » as operators... > > > > Is Larry smoking crack? That's one of the worst ideas I've heard in a > > long time. There's no easy way to enter those at the keyboard! > > I have "setxkbmap -symbols 'en_US(pc102)+gb'" in my ~/.xsession, > and « and » are available as AltGr-z and AltGr-x respectively. > Well, now it's obvious he's just trying to raise the bar for the obfuscated perl contest. If you thought these were fun before, you'll love them with ¥ and « and »! Lee ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts 2005-08-13 18:49 ` Hugo Mills 2005-08-13 18:53 ` Lee Revell 2005-08-13 19:20 ` Lee Revell @ 2005-08-16 9:46 ` Jan Engelhardt 2 siblings, 0 replies; 80+ messages in thread From: Jan Engelhardt @ 2005-08-16 9:46 UTC (permalink / raw) To: Hugo Mills; +Cc: Lee Revell, Stephen Pollei, Martin v. Löwis, linux-kernel [-- Attachment #1: Type: TEXT/PLAIN, Size: 574 bytes --] >> > Thats great for the perl6 people. >> > http://dev.perl.org/perl6/doc/design/syn/S03.html says they are going >> > to be using « and » as operators... >> >> Is Larry smoking crack? That's one of the worst ideas I've heard in a >> long time. There's no easy way to enter those at the keyboard! > > I have "setxkbmap -symbols 'en_US(pc102)+gb'" in my ~/.xsession, >and « and » are available as AltGr-z and AltGr-x respectively. .Xmodmap: keycode 117 = MultiKey and then use [the Windows(R) Context Menu Key],[<],[<] to generate « Cheers :) Jan Engelhardt -- ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts 2005-08-13 18:42 ` Lee Revell 2005-08-13 18:49 ` Hugo Mills @ 2005-08-14 0:53 ` Alan Cox 2005-08-14 4:10 ` James Cloos 2005-08-14 6:18 ` Jason L Tibbitts III 2005-08-15 8:01 ` Helge Hafting 3 siblings, 1 reply; 80+ messages in thread From: Alan Cox @ 2005-08-14 0:53 UTC (permalink / raw) To: Lee Revell; +Cc: Stephen Pollei, Martin v. Löwis, linux-kernel On Sad, 2005-08-13 at 14:42 -0400, Lee Revell wrote: > Is Larry smoking crack? That's one of the worst ideas I've heard in a > long time. There's no easy way to enter those at the keyboard! The command line console mappings may not include them by default (you can obviously add them if your keyboard lacks them). The X keyboard however does include compose functionality for » and « and many other symbols that might be useful eg ± Alan ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts 2005-08-14 0:53 ` Alan Cox @ 2005-08-14 4:10 ` James Cloos 0 siblings, 0 replies; 80+ messages in thread From: James Cloos @ 2005-08-14 4:10 UTC (permalink / raw) To: linux-kernel; +Cc: Lee Revell >>>>> "Alan" == Alan Cox <alan@lxorguk.ukuu.org.uk> writes: Alan> The command line console mappings may not include them by Alan> default (you can obviously add them if your keyboard lacks Alan> them). The X keyboard however does include compose functionality Alan> for » and « and many other symbols that might be useful eg ± Not to mention that many editors, including emacs and vim, have their own support for entering such non-ascii characters no matter what the console or X11 keyboards look like. -JimC -- James H. Cloos, Jr. <cloos@jhcloos.com> ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts 2005-08-13 18:42 ` Lee Revell 2005-08-13 18:49 ` Hugo Mills 2005-08-14 0:53 ` Alan Cox @ 2005-08-14 6:18 ` Jason L Tibbitts III [not found] ` <feed8cdd050814125845fe4e2e@mail.gmail.com> 2005-08-14 21:52 ` Kyle Moffett 2005-08-15 8:01 ` Helge Hafting 3 siblings, 2 replies; 80+ messages in thread From: Jason L Tibbitts III @ 2005-08-14 6:18 UTC (permalink / raw) To: Lee Revell; +Cc: Stephen Pollei, Martin v. Löwis, linux-kernel >>>>> "LR" == Lee Revell <rlrevell@joe-job.com> writes: LR> Is Larry smoking crack? That's one of the worst ideas I've heard LR> in a long time. There's no easy way to enter those at the LR> keyboard! I know folks enjoy trashing Perl these days, but it's not justified in this case. From the Perl6-Bible - http://search.cpan.org/dist/Perl6-Bible/lib/Perl6/Bible/S03.pod: For those still living without the blessings of Unicode, that can also be written: << ... >>. - J< ^ permalink raw reply [flat|nested] 80+ messages in thread
[parent not found: <feed8cdd050814125845fe4e2e@mail.gmail.com>]
* Re: [Patch] Support UTF-8 scripts [not found] ` <feed8cdd050814125845fe4e2e@mail.gmail.com> @ 2005-08-14 19:59 ` Lee Revell 2005-08-14 20:13 ` Stephen Pollei ` (3 more replies) 0 siblings, 4 replies; 80+ messages in thread From: Lee Revell @ 2005-08-14 19:59 UTC (permalink / raw) To: Stephen Pollei; +Cc: Jason L Tibbitts III, Martin v. Löwis, linux-kernel On Sun, 2005-08-14 at 12:58 -0700, Stephen Pollei wrote: > My main point was that utf-8 for identifiers, operators, and string > constants are becoming more prevalent, so BOM support for scripts > sounds like a Good Idea™ . > I know the alternatives are available. That doesn't make it any less idiotic to use non ASCII characters as operators. I think it's a very slippery slope. We write code in ASCII, dammit. Lee ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts 2005-08-14 19:59 ` Lee Revell @ 2005-08-14 20:13 ` Stephen Pollei 2005-08-14 20:22 ` Lee Revell 2005-08-14 23:55 ` Alan Cox ` (2 subsequent siblings) 3 siblings, 1 reply; 80+ messages in thread From: Stephen Pollei @ 2005-08-14 20:13 UTC (permalink / raw) To: Lee Revell; +Cc: Jason L Tibbitts III, Martin v. Löwis, linux-kernel On 8/14/05, Lee Revell <rlrevell@joe-job.com> wrote: > I know the alternatives are available. That doesn't make it any less > idiotic to use non ASCII characters as operators. I think it's a very > slippery slope. We write code in ASCII, dammit. Yes you and I might write 99.9% of our code in good'ol **American** Standard Code for Information Interchange -- however not all the world is USA. For instance notice the http://de.wikipedia.org/wiki/Umlaut/ in "Löwis"... Seems like lots of Europeans might want a bigger charset, not to mention Asians, Hindus, and whomever else. -- http://dmoz.org/profiles/pollei.html http://sourceforge.net/users/stephen_pollei/ http://www.orkut.com/Profile.aspx?uid=2455954990164098214 http://stephen_pollei.home.comcast.net/ ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts 2005-08-14 20:13 ` Stephen Pollei @ 2005-08-14 20:22 ` Lee Revell 2005-08-14 22:10 ` "Martin v. Löwis" 0 siblings, 1 reply; 80+ messages in thread From: Lee Revell @ 2005-08-14 20:22 UTC (permalink / raw) To: Stephen Pollei; +Cc: Jason L Tibbitts III, Martin v. Löwis, linux-kernel On Sun, 2005-08-14 at 13:13 -0700, Stephen Pollei wrote: > Seems like lots of Europeans might want a bigger > charset, not to mention Asians, Hindus, and whomever else. For strings, of course. But there's no need for UTF-8 operators. Lee ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts 2005-08-14 20:22 ` Lee Revell @ 2005-08-14 22:10 ` "Martin v. Löwis" 0 siblings, 0 replies; 80+ messages in thread From: "Martin v. Löwis" @ 2005-08-14 22:10 UTC (permalink / raw) To: Lee Revell; +Cc: Stephen Pollei, Jason L Tibbitts III, linux-kernel Lee Revell wrote: > For strings, of course. But there's no need for UTF-8 operators. Indeed - this is the main rationale for the patch, of course. People want to write non-ASCII in scripts primarily in string literals, and (perhaps even more often) in comments. Now, for comments, it wouldn't really matter that the interpreter knows what the encoding is - but the editor would have to know, and the UTF-8 signature primarily helps the editor (*). Then we are back to the rationale for this patch: if you need the UTF-8 signature to reliably identify the script as being UTF-8 encoded, you then currently cannot easily run it as a script through binfmt_script, as that code requires a script to start with #!. Regards, Martin (*) As I said before: at least for Python, the UTF-8 signature also has syntactic meaning. It is allowed at the beginning of a file as an addition to the language syntax, and it tells the interpreter that Unicode literals (usually represented internally as UCS-2) are represented as UTF-8 in the source code. ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts 2005-08-14 19:59 ` Lee Revell 2005-08-14 20:13 ` Stephen Pollei @ 2005-08-14 23:55 ` Alan Cox 2005-08-16 13:56 ` David Madore [not found] ` <mailman.1124063520.13257.linux-kernel2news@redhat.com> 3 siblings, 0 replies; 80+ messages in thread From: Alan Cox @ 2005-08-14 23:55 UTC (permalink / raw) To: Lee Revell Cc: Stephen Pollei, Jason L Tibbitts III, Martin v. Löwis, linux-kernel On Sul, 2005-08-14 at 15:59 -0400, Lee Revell wrote: > I know the alternatives are available. That doesn't make it any less > idiotic to use non ASCII characters as operators. I think it's a very > slippery slope. We write code in ASCII, dammit. Its a trivial patch and there is a lot to be said for UTF-8 scripts. As to writing code in ascii, the kernel regularly has outbreaks of either UTF-8 or ISO-8859-* especially in the docs directory. Standardising these on UTF-8 would be helpful. Yes the kernel code is C so ASCII except for the odd abuser of the © symbol. Alan ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts 2005-08-14 19:59 ` Lee Revell 2005-08-14 20:13 ` Stephen Pollei 2005-08-14 23:55 ` Alan Cox @ 2005-08-16 13:56 ` David Madore [not found] ` <mailman.1124063520.13257.linux-kernel2news@redhat.com> 3 siblings, 0 replies; 80+ messages in thread From: David Madore @ 2005-08-16 13:56 UTC (permalink / raw) To: linux-kernel On Sun, Aug 14, 2005 at 08:00:31PM +0000, Lee Revell wrote: > We write code in ASCII, dammit. <URL: http://www.madore.org/~david/weblog/2004-12.html#d.2004-12-03.0813 > :-) -- David A. Madore (david.madore@ens.fr, http://www.madore.org/~david/ ) ^ permalink raw reply [flat|nested] 80+ messages in thread
[parent not found: <mailman.1124063520.13257.linux-kernel2news@redhat.com>]
* Re: [Patch] Support UTF-8 scripts [not found] ` <mailman.1124063520.13257.linux-kernel2news@redhat.com> @ 2005-08-16 20:17 ` Pete Zaitcev 0 siblings, 0 replies; 80+ messages in thread From: Pete Zaitcev @ 2005-08-16 20:17 UTC (permalink / raw) To: Alan Cox; +Cc: zaitcev, linux-kernel On Mon, 15 Aug 2005 00:55:54 +0100, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote: > On Sul, 2005-08-14 at 15:59 -0400, Lee Revell wrote: > > I know the alternatives are available. That doesn't make it any less > > idiotic to use non ASCII characters as operators. I think it's a very > > slippery slope. We write code in ASCII, dammit. > > Its a trivial patch and there is a lot to be said for UTF-8 scripts. As > to writing code in ascii, the kernel regularly has outbreaks of either > UTF-8 or ISO-8859-* especially in the docs directory. Standardising > these on UTF-8 would be helpful. > > Yes the kernel code is C so ASCII except for the odd abuser of the © > symbol. We write kernel code in ASCII because of patches in e-mail. When a patch is saved (often by a script), it is divorced of the encoding in which e-mail was done. Forwarding of patches then causes them to fail to apply. Everything else can be worked around. In my experience, the most common case of such patch rejects has to do with a European using a non-UTF-8 encoding for his name, rather than with the copyright symbol. -- Pete ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts 2005-08-14 6:18 ` Jason L Tibbitts III [not found] ` <feed8cdd050814125845fe4e2e@mail.gmail.com> @ 2005-08-14 21:52 ` Kyle Moffett 2005-08-14 22:12 ` Valdis.Kletnieks 1 sibling, 1 reply; 80+ messages in thread From: Kyle Moffett @ 2005-08-14 21:52 UTC (permalink / raw) To: Jason L Tibbitts III Cc: Lee Revell, Stephen Pollei, Martin v. Löwis , linux-kernel On Aug 14, 2005, at 02:18:13, Jason L Tibbitts III wrote: >>>>>> "LR" == Lee Revell <rlrevell@joe-job.com> writes: > LR> Is Larry smoking crack? > > From the Perl6-Bible: http://search.cpan.org/dist/Perl6-Bible/lib/ > Perl6/Bible/S03.pod: I think this confirms that the answer is yes. See the following at the above URL: > Note that ?^ is functionally identical to !.?| differs from || in > that ?| always > returns a standard boolean value (either 1 or 0), whereas || > returns the actual > value of the first of its arguments that is true. Since when is the string "!.?|" an operator??? Or "?^", "+|", "~|", "?|", etc. I think Larry's gone off the deep end on this one. It may be an incredibly powerful and expressive language, but it seems _really_ strange, and probably will produce the best Obfuscated-code contest the world has ever seen. (Better even than the Perl5 one). Cheers, Kyle Moffett -- Simple things should be simple and complex things should be possible -- Alan Kay ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts 2005-08-14 21:52 ` Kyle Moffett @ 2005-08-14 22:12 ` Valdis.Kletnieks 0 siblings, 0 replies; 80+ messages in thread From: Valdis.Kletnieks @ 2005-08-14 22:12 UTC (permalink / raw) To: Kyle Moffett Cc: Jason L Tibbitts III, Lee Revell, Stephen Pollei, Martin v. Löwis , linux-kernel [-- Attachment #1: Type: text/plain, Size: 372 bytes --] On Sun, 14 Aug 2005 17:52:36 EDT, Kyle Moffett said: > > Note that ?^ is functionally identical to !.?| differs from || in > Since when is the string "!.?|" an operator??? I think that was supposed to read: Note that ?^ is functionally identical to !. ?| differs from ?? in that ?| returns (and so on) (two separate sentences lacking whitespace between them.... [-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --] ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts 2005-08-13 18:42 ` Lee Revell ` (2 preceding siblings ...) 2005-08-14 6:18 ` Jason L Tibbitts III @ 2005-08-15 8:01 ` Helge Hafting 3 siblings, 0 replies; 80+ messages in thread From: Helge Hafting @ 2005-08-15 8:01 UTC (permalink / raw) To: Lee Revell Cc: Stephen Pollei, "\"Martin v.\" Löwis", linux-kernel Lee Revell wrote: >On Sat, 2005-08-13 at 09:35 -0700, Stephen Pollei wrote: > > >>Thats great for the perl6 people. >>http://dev.perl.org/perl6/doc/design/syn/S03.html says they are going >>to be using « and » as operators... >> >> > >Is Larry smoking crack? That's one of the worst ideas I've heard in a >long time. There's no easy way to enter those at the keyboard! > > On your keyboard, that is. So what? My keyboard happen to have no easy way of entering a dollar sign, even though it is in «ascii». That makes sense though, as it is one of those ascii characters that is almost never used in my part of the world. Still, if I needed to use the «$» when programming, I sure could map it to some key combination. X is nice that way. Helge Hafting ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts 2005-08-13 12:07 [Patch] Support UTF-8 scripts "Martin v. Löwis" 2005-08-13 16:35 ` Stephen Pollei @ 2005-08-31 23:27 ` H. Peter Anvin 1 sibling, 0 replies; 80+ messages in thread From: H. Peter Anvin @ 2005-08-31 23:27 UTC (permalink / raw) To: linux-kernel Followup to: <42FDE286.40707@v.loewis.de> By author: "Martin v. Löwis" <martin@v.loewis.de> In newsgroup: linux.dev.kernel > > This patch adds support for UTF-8 signatures (aka BOM, byte order > mark) to binfmt_script. Files that start with EF BB BF # ! are now > recognized as scripts (in addition to files starting with # !). > > With such support, creating scripts that reliably carry non-ASCII > characters is simplified. Editors and the script interpreter can > easily agree on what the encoding of the script is, and the > interpreter can then render strings appropriately. Currently, > Python supports source files that start with the UTF-8 signature; > the approach would naturally extend to Perl to enhance/replace > the "use utf8" pragma. Likewise, Tcl could use the UTF-8 signature > to reliably identify UTF-8 source code (instead of assuming > [encoding system] for source code). > BOM should not be used in UTF-8. In fact, it shouldn't be used at all. -hpa ^ permalink raw reply [flat|nested] 80+ messages in thread
[parent not found: <4B2ZV-2dl-7@gated-at.bofh.it>]
[parent not found: <4HKbZ-Cx-37@gated-at.bofh.it>]
* Re: [Patch] Support UTF-8 scripts [not found] ` <4HKbZ-Cx-37@gated-at.bofh.it> @ 2005-09-15 18:24 ` "Martin v. Löwis" 2005-09-15 18:25 ` H. Peter Anvin 0 siblings, 1 reply; 80+ messages in thread From: "Martin v. Löwis" @ 2005-09-15 18:24 UTC (permalink / raw) To: H. Peter Anvin, linux-kernel H. Peter Anvin wrote: > BOM should not be used in UTF-8. In fact, it shouldn't be used at > all. Says who? In UTF-8, it is not used to indicate a byte order; instead, it is used to indicate the fact that the file is UTF-8, like a magic. That's why I prefer to call it "UTF-8 signature". The Unicode consortium thinks that the BOM can be used in UTF-8: http://www.unicode.org/faq/utf_bom.html#29 The UTF-8 signature is very useful, and I would prefer if it would be used instead of format-specific encoding declarations. Regards, Martin ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts 2005-09-15 18:24 ` "Martin v. Löwis" @ 2005-09-15 18:25 ` H. Peter Anvin 2005-09-15 18:39 ` "Martin v. Löwis" 0 siblings, 1 reply; 80+ messages in thread From: H. Peter Anvin @ 2005-09-15 18:25 UTC (permalink / raw) To: "Martin v. Löwis"; +Cc: linux-kernel Martin v. Löwis wrote: > > Says who? In UTF-8, it is not used to indicate a byte order; instead, > it is used to indicate the fact that the file is UTF-8, like a magic. > That's why I prefer to call it "UTF-8 signature". > > The Unicode consortium thinks that the BOM can be used in UTF-8: > > http://www.unicode.org/faq/utf_bom.html#29 > > The UTF-8 signature is very useful, and I would prefer if it would > be used instead of format-specific encoding declarations. > In Unix, it's a hideously bad idea. The reason is that Unix inherently assumes that text streams can be merged, split, and modified. In other words, unless you can guarantee that EVERY program can handle BOM EVERYWHERE, it's broken. In other words, it's broken. -hpa ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts 2005-09-15 18:25 ` H. Peter Anvin @ 2005-09-15 18:39 ` "Martin v. Löwis" 2005-09-15 19:20 ` H. Peter Anvin 2005-09-16 8:13 ` Bernd Petrovitsch 0 siblings, 2 replies; 80+ messages in thread From: "Martin v. Löwis" @ 2005-09-15 18:39 UTC (permalink / raw) To: H. Peter Anvin; +Cc: linux-kernel H. Peter Anvin wrote: > In Unix, it's a hideously bad idea. The reason is that Unix inherently > assumes that text streams can be merged, split, and modified. In other > words, unless you can guarantee that EVERY program can handle BOM > EVERYWHERE, it's broken. This argument is bogus. We are talking about scripts here, which cannot be merged, split, and modified. You don't cat(1) or sort(1) them - it's just pointless to do that. You create them with text editors, and those *can* handle the UTF-8 signature. > In other words, it's broken. We can do that now, or in five or ten years. I'm willing to wait that long, but I'm certain that more people will find the UTF-8 signature useful over time. It's the only sane way to get non-ASCII into script source in a consistent way. Regards, Martin ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts 2005-09-15 18:39 ` "Martin v. Löwis" @ 2005-09-15 19:20 ` H. Peter Anvin 2005-09-16 8:13 ` Bernd Petrovitsch 1 sibling, 0 replies; 80+ messages in thread From: H. Peter Anvin @ 2005-09-15 19:20 UTC (permalink / raw) To: "Martin v. Löwis"; +Cc: linux-kernel Martin v. Löwis wrote: > > We can do that now, or in five or ten years. I'm willing to wait that > long, but I'm certain that more people will find the UTF-8 signature > useful over time. It's the only sane way to get non-ASCII into script > source in a consistent way. > No. The sane way is to just use UTF-8. In five or ten years, by the time you've gotten your idiotic BOM mess to sort-of work, it will be completely pointless to have anything *but* UTF-8, and thus it's pointless. Don't perpetuate the braindamage. -hpa ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts
2005-09-15 18:39 ` "Martin v. Löwis"
2005-09-15 19:20 ` H. Peter Anvin
@ 2005-09-16 8:13 ` Bernd Petrovitsch
1 sibling, 0 replies; 80+ messages in thread
From: Bernd Petrovitsch @ 2005-09-16 8:13 UTC (permalink / raw)
To: "Martin v. Löwis"; +Cc: H. Peter Anvin, linux-kernel
On Thu, 2005-09-15 at 20:39 +0200, "Martin v. Löwis" wrote:
> H. Peter Anvin wrote:
> > In Unix, it's a hideously bad idea. The reason is that Unix inherently
> > assumes that text streams can be merged, split, and modified. In other
> > words, unless you can guarantee that EVERY program can handle BOM
> > EVERYWHERE, it's broken.
>
> This argument is bogus. We are talking about scripts here, which cannot
> be merged, split, and modified. You don't cat(1) or sort(1) them - it's
Sure they can since they are plain text files.
How do you think one merges scripts?
Just `cat`ing them all into one new file and editing that new file is
much faster and simpler than opening an empty new file with your editor,
then opening all the other scripts in your editor and copying them by
hand.
And you (or at least I) do `grep`/`egrep`/`fgrep`, `wc` them. And
probably with several other tools too - think of
`find <dir> -type f -print0 | xargs -0r <cmd>`.
> just pointless to do that. You create them with text editors, and those
> *can* handle the UTF-8 signature.
It is not uncommon to create scripts and the like with other programs,
other scripts, what-else.
Apart from the fact that a "script" is merely a plain text file with the
eXecutable bit set. And *that* is the only difference, so you have to at
least patch (all instances of) `chmod` to insert and remove the BOM.
This gets funny if you think of file systems without a concept of
"executable bit" and copying files around. Another standard tool to
patch.
And how do you solve `cat`ing a script (with set X bit) like:
`cat <script >other-file` where other-file will not have the X bit set.
The `cat` program doesn't even know (or care about) the names of the two
files.
Bernd
--
Firmix Software GmbH http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
Embedded Linux Development and Services
^ permalink raw reply [flat|nested] 80+ messages in thread
[parent not found: <4N6EL-4Hq-3@gated-at.bofh.it>]
[parent not found: <4N6EL-4Hq-5@gated-at.bofh.it>]
[parent not found: <4N6EK-4Hq-1@gated-at.bofh.it>]
[parent not found: <4N6EX-4Hq-27@gated-at.bofh.it>]
[parent not found: <4N6Ox-4Ts-33@gated-at.bofh.it>]
[parent not found: <4N7AS-67L-3@gated-at.bofh.it>]
* Re: [Patch] Support UTF-8 scripts
[not found] ` <4N7AS-67L-3@gated-at.bofh.it>
@ 2005-09-16 18:02 ` Bodo Eggert
2005-09-16 18:09 ` H. Peter Anvin
[not found] ` <200509170028.59973.dhazelton@enter.net>
0 siblings, 2 replies; 80+ messages in thread
From: Bodo Eggert @ 2005-09-16 18:02 UTC (permalink / raw)
To: H. Peter Anvin, Martin v. Löwis, linux-kernel
Bernd Petrovitsch <bernd@firmix.at> wrote:
> On Thu, 2005-09-15 at 20:39 +0200, "Martin v. Löwis" wrote:
>> H. Peter Anvin wrote:
>> > In Unix, it's a hideously bad idea. The reason is that Unix inherently
>> > assumes that text streams can be merged, split, and modified. In other
>> > words, unless you can guarantee that EVERY program can handle BOM
>> > EVERYWHERE, it's broken.
You can't sort /bin/ls into /tmp/ls and expect /tmp/ls to be meaningful,
but /bin/ls works as expected. You can't usually concat perl scripts and
shell scripts either, but both kinds of script run quite well.
And if you do "cat /bin/cat /bin/cp > /bin/catcp", what's "catcp foo bar"
supposed to do? First output foo and bar to stdout, then copy foo to bar?
Is execve() broken if it doesn't do what I described? Is the ELF header
broken because it's not recognized EVERYWHERE? I don't think so.
>> This argument is bogus. We are talking about scripts here, which cannot
>> be merged, split, and modified. You don't cat(1) or sort(1) them - it's
>
> Sure they can since they are plain text files.
> How do you think one merges scripts?
> Just `cat`ing them all into one new file and edit that new file is much
> faster and simpler than to open an empty new file with your editor, then
> you open all the other scripts in your editor and copy them by hand.
What's supposed to happen if you concatenate a script from your french
user and from your russian user, both using localized text, into one file?
Unless you can guarantee every editor to correctly handle this case, all
usage of 8-bit-characters should be disabled - NOT!
If you concatenate two plain text files, you will use cat.
If you concatenate two pnm image files, you will use pnmcat.
If you concatenate two utf-8 files, you will use utf8cat.
If you concatenate two binaries, you will shoot your feet.
That's easy, isn't it?
BTW: I think decent utf-8 capable programs SHOULD ignore extra BOM
markers.
> And you (or at least I) do `grep`/`egrep`/`fgrep`, `wc` them.
You can *grep utf-8 scripts, but you can't *grep binaries. Shouldn't this
be fixed by implementing an in-kernel ASCII assembler and convert all
binaries to assembler text?
> And
> probably with several other tools too - think of `find <dir> -type f
> -print0 | xargs -0r <cmd>`.
utf-8 filenames will work correctly (unless used as an extended BASIC
script with non-ASCII variable names, but that would be insane).
>> just pointless to do that. You create them with text editors, and those
>> can handle the UTF-8 signature.
>
> It is not uncommon to create scripts and the like with other programs,
> other scripts, what-else.
It's not uncommon to create binaries using other programs. So what?
> Apart from the fact that a "script" is merely a plain text file with the
> eXecutable bit set.
And a utf-8 script is a utf-8 encoded text file with its executable bit
set.
> And that is the only difference, so you have to at
> least patch (all instances of) `chmod` to insert and remove the BOM.
[...]
In order to make it harder for the interpreter to correctly detect utf-8?
You can have DOS executables run in dosboxes, windows applications run in
windows, java archives run in java, but utf-8 scripts should be mangled in
order to work "correctly", and mangled back in order to be editable?
*That*'s insane!
Just make execve ignore the BOM marker before "#!" as the patch does, and
you're done. The rest is somebody else's not-a-problem.
BTW2: However, I don't like the patch. I'd first check for a utf-8
signature, and if it's found, adjust the buffer offset by 3. Then I'd run
the old code checking for the sh_bang. OTOH, I just read the patch and not
the .c file, maybe (unlikely) my idea wouldn't work correctly.
--
I thank GMX for sabotaging the use of my addresses by means of lies
spread via SPF.
^ permalink raw reply [flat|nested] 80+ messages in thread
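The "check for the signature first, then adjust the offset" idea from the
BTW2 paragraph can be sketched in userspace C. This is only an
illustration under stated assumptions, not the code from the patch or from
fs/binfmt_script.c: the helper name shebang_offset and the array utf8_sig
are hypothetical. It compares unsigned bytes, which sidesteps any question
about whether plain char is signed when testing for 0xEF.

```c
#include <stddef.h>
#include <string.h>

/* UTF-8 signature (BOM) as raw bytes: EF BB BF. */
static const unsigned char utf8_sig[3] = { 0xEF, 0xBB, 0xBF };

/*
 * Hypothetical helper, not kernel code: return the offset of the first
 * byte after "#!" in buf, or -1 if buf starts neither with "#!" nor with
 * the UTF-8 signature followed by "#!".
 */
static int shebang_offset(const unsigned char *buf, size_t len)
{
	size_t off = 0;

	/* Step 1: skip the signature if it is present. */
	if (len >= sizeof(utf8_sig) &&
	    memcmp(buf, utf8_sig, sizeof(utf8_sig)) == 0)
		off = sizeof(utf8_sig);

	/* Step 2: the old "#!" check, applied at the adjusted offset. */
	if (len < off + 2 || buf[off] != '#' || buf[off + 1] != '!')
		return -1;

	return (int)(off + 2);
}
```

With this shape, everything after the returned offset (the interpreter
path and its argument) could be parsed by the same code for both kinds of
script, which seems to be the point of the suggestion.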
* Re: [Patch] Support UTF-8 scripts
2005-09-16 18:02 ` Bodo Eggert
@ 2005-09-16 18:09 ` H. Peter Anvin
2005-09-16 18:57 ` Bodo Eggert
[not found] ` <200509170028.59973.dhazelton@enter.net>
1 sibling, 1 reply; 80+ messages in thread
From: H. Peter Anvin @ 2005-09-16 18:09 UTC (permalink / raw)
To: 7eggert; +Cc: "Martin v. Löwis", linux-kernel
Bodo Eggert wrote:
> What's supposed to happen if you concatenate a script from your french
> user and from your russian user, both using localized text, into one file?
> Unless you can guarantee every editor to correctly handle this case, all
> usage of 8-bit-characters should be disabled - NOT!
Actually, it's quite easy to avoid problems by using UTF-8 consistently.
The 8-bit characters are oddballs and need to be treated specially, but
look, guys, it's 2005 - UTF-8 should be the norm, not the exception.
-hpa
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts
2005-09-16 18:09 ` H. Peter Anvin
@ 2005-09-16 18:57 ` Bodo Eggert
2005-09-16 19:08 ` Martin Mares
` (2 more replies)
0 siblings, 3 replies; 80+ messages in thread
From: Bodo Eggert @ 2005-09-16 18:57 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: 7eggert, "Martin v. Löwis", linux-kernel
On Fri, 16 Sep 2005, H. Peter Anvin wrote:
> Bodo Eggert wrote:
> > What's supposed to happen if you concatenate a script from your french
> > user and from your russian user, both using localized text, into one file?
> > Unless you can guarantee every editor to correctly handle this case, all
> > usage of 8-bit-characters should be disabled - NOT!
>
> Actually, it's quite easy to avoid problems by using UTF-8 consistently.
> The 8-bit characters are oddballs and need to be treated specially,
> but look, guys, it's 2005 - UTF-8 should be the norm, not the exception.
It should, but as long as old programs are still around, we'll have both
and need a marker to distinguish them. Otherwise we'll be stuck with
legacy scripts for a long time.
--
I'm a member of DNA (National Assocciation of Dyslexics).
-- Storm in <5Z4Z7.52353$4x4.6445347@news2-win.server.ntlworld.com>
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts
2005-09-16 18:57 ` Bodo Eggert
@ 2005-09-16 19:08 ` Martin Mares
2005-09-16 19:25 ` H. Peter Anvin
2005-09-16 19:57 ` Horst von Brand
2 siblings, 0 replies; 80+ messages in thread
From: Martin Mares @ 2005-09-16 19:08 UTC (permalink / raw)
To: Bodo Eggert; +Cc: H. Peter Anvin, "Martin v. Löwis", linux-kernel
Hello!
> It should, but as long as old programs are still around, we'll have both
> and need a marker to distinguish them.
I doubt that. For ages people were using several different encodings on
a single system (at least here in .cz) without any markers and although
there were some rough edges, almost everything worked. Now we do the same
with ISO-8859-2 and UTF-8, again with no need for a marker.
Have a nice fortnight
--
Martin `MJ' Mares <mj@ucw.cz> http://atrey.karlin.mff.cuni.cz/~mj/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
Linux vs. Windows is a no-WIN situation.
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts
2005-09-16 18:57 ` Bodo Eggert
2005-09-16 19:08 ` Martin Mares
@ 2005-09-16 19:25 ` H. Peter Anvin
2005-09-16 19:57 ` Horst von Brand
2 siblings, 0 replies; 80+ messages in thread
From: H. Peter Anvin @ 2005-09-16 19:25 UTC (permalink / raw)
To: Bodo Eggert; +Cc: "Martin v. Löwis", linux-kernel
Bodo Eggert wrote:
> It should, but as long as old programs are still around, we'll have both
> and need a marker to distinguish them. Otherwise we'll be stuck with
> legacy scripts for a long time.
You don't have markers (although they're defined, see ISO 2022) for your
8-bit encodings, and *THEY'RE THE ONES THAT NEED TO BE DISTINGUISHED.*
Flagging UTF-8, especially with the BOM (as opposed to the ISO 2022
signature, <ESC>%G) is pointless in the context, since you still can't
distinguish your arbitrary number of legacy encodings.
Oh, yes, and try to stick ISO 2022 signatures in scripts or whatnot, and
you can see what current software does with a signature standard that
dates back to the 1970's.
-hpa
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts
2005-09-16 18:57 ` Bodo Eggert
2005-09-16 19:08 ` Martin Mares
2005-09-16 19:25 ` H. Peter Anvin
@ 2005-09-16 19:57 ` Horst von Brand
2 siblings, 0 replies; 80+ messages in thread
From: Horst von Brand @ 2005-09-16 19:57 UTC (permalink / raw)
To: Bodo Eggert; +Cc: H. Peter Anvin, "Martin v. Löwis", linux-kernel
Bodo Eggert <7eggert@gmx.de> wrote:
> On Fri, 16 Sep 2005, H. Peter Anvin wrote:
> > Bodo Eggert wrote:
[...]
> > > Unless you can guarantee every editor to correctly handle this case, all
> > > usage of 8-bit-characters should be disabled - NOT!
> > Actually, it's quite easy to avoid problems by using UTF-8 consistently.
> > The 8-bit characters are oddballs and need to be treated specially,
> > but look, guys, it's 2005 - UTF-8 should be the norm, not the exception.
Right.
> It should, but as long as old programs are still around, we'll have both
> and need a marker to distinguish them. Otherwise we'll be stuck with
> legacy scripts for a long time.
Please. Let people who mess with legacy stuff suffer, don't make everybody
else (and forevermore!) pay the price.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513
^ permalink raw reply [flat|nested] 80+ messages in thread
[parent not found: <200509170028.59973.dhazelton@enter.net>]
* Re: [Patch] Support UTF-8 scripts
[not found] ` <200509170028.59973.dhazelton@enter.net>
@ 2005-09-17 6:28 ` "Martin v. Löwis"
2005-09-17 22:31 ` D. Hazelton
2005-09-17 17:16 ` Bodo Eggert
1 sibling, 1 reply; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-17 6:28 UTC (permalink / raw)
To: D. Hazelton; +Cc: 7eggert, H. Peter Anvin, linux-kernel
D. Hazelton wrote:
> This is a bogus argument. You're comparing the way a _binary_
> executable works to the way an interpreted _text_ script works.
> execve(), at least on my system, isn't capable of running a script -
> if I want to do that from a program I have to tell execve() that it's
> running /bin/sh and the script file is in the parameter list.
This being the linux-kernel list, I assume your system is Linux, no?
Well, on Linux, execve *does* support script files. This is the whole
point of my patch - I would not propose a kernel patch to improve this
support if it weren't there in the first place.
> While I appreciate that the kernel is capable of performing complex
> actions when execve runs into a file that is not an a.out or elf
> binary I have yet to see a "binfmt script" option in the kernel
> config files ever.
It's not a config option because it is always enabled. See
fs/binfmt_script.c for details. It wasn't integrated into the binfmt
system until I made it so some ten years ago, though.
> On the other hand, there is the "binfmt_misc" option, which does the
> work that you seem to be looking for and can, AFAIK, be set to handle
> both ASCII and UTF-8 scripts. Why add the complexity to the kernel
> when it's not needed?
One shouldn't add complexity if it's not needed. However, this patch
does not add complexity. It is fairly trivial.
Regards,
Martin
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts
2005-09-17 6:28 ` "Martin v. Löwis"
@ 2005-09-17 22:31 ` D. Hazelton
2005-09-18 3:45 ` Kyle Moffett
2005-09-18 6:58 ` "Martin v. Löwis"
0 siblings, 2 replies; 80+ messages in thread
From: D. Hazelton @ 2005-09-17 22:31 UTC (permalink / raw)
To: Martin v. Löwis; +Cc: 7eggert, H. Peter Anvin, linux-kernel
On Saturday 17 September 2005 06:28, "Martin v. Löwis" wrote:
> D. Hazelton wrote:
> > This is a bogus argument. You're comparing the way a _binary_
> > executable works to the way an interpreted _text_ script works.
> > execve(), at least on my system, isn't capable of running a
> > script - if I want to do that from a program I have to tell
> > execve() that it's running /bin/sh and the script file is in the
> > parameter list.
>
> This being the linux-kernel list, I assume your system is Linux,
> no? Well, on Linux, execve *does* support script files. This is the
> whole point of my patch - I would not propose a kernel patch to
> improve this support if it weren't there in the first place.
This is news to me. The last time I handed execve() a script as a
parameter I had errors returned from execve() -- I must admit that this
was not on my current system and I had assumed that the behavior would
be consistent.
> > While I appreciate that the kernel is capable of performing
> > complex actions when execve runs into a file that is not an a.out
> > or elf binary I have yet to see a "binfmt script" option in the
> > kernel config files ever.
>
> It's not a config option because it is always enabled. See
> fs/binfmt_script.c for details. It wasn't integrated into the
> binfmt system until I made it so some ten years ago, though.
I haven't gotten into that section of the code yet. I've been slowly
working my way through the code from the drivers that seem to cause
strange behavior on my system and then up the tree from there.
> > On the other hand, there is the "binfmt_misc" option, which does
> > the work that you seem to be looking for and can, AFAIK, be set
> > to handle both ASCII and UTF-8 scripts. Why add the complexity to
> > the kernel when it's not needed?
>
> One shouldn't add complexity if its not needed. However, this patch
> does not add complexity. It is fairly trivial.
You are correct. It is fairly trivial. However my point still is valid
that the Kernel has the whole binfmt_misc system -- I will admit that I
have recently been shown numbers that show a noticeable difference in
the speed of a binary executed using the binfmt_misc system and the
binfmt_script system, but the fact remains that offering handling for
UTF8 and ASCII scripts directly in the kernel will likely lead to at
least one more patch in which the full Unicode standard is implemented.
That, and my point remains that the kernel should know absolutely
nothing about how to execute a text file - the kernel should return an
error to the extent of "I don't know what to do with this file" to the
shell that tries to execute it, and the shell can then check for the
sh_bang. I do admit that this change would break a lot of existing code,
so I'll leave the argument to the experts.
> Regards,
> Martin
DRH
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts
2005-09-17 22:31 ` D. Hazelton
@ 2005-09-18 3:45 ` Kyle Moffett
2005-09-19 0:14 ` D. Hazelton
2005-09-18 6:58 ` "Martin v. Löwis"
1 sibling, 1 reply; 80+ messages in thread
From: Kyle Moffett @ 2005-09-18 3:45 UTC (permalink / raw)
To: D. Hazelton; +Cc: Martin v. Löwis, 7eggert, H. Peter Anvin, linux-kernel
On Sep 17, 2005, at 18:31:33, D. Hazelton wrote:
> That, and my point remains that the kernel should know absolutely
> nothing about how to execute a text file - the kernel should return
> an error to the extent of "I don't know what to do with this file"
> to the shell that tries to execute it, and the shell can then check
> for the sh_bang. I do admit that this change would break a lot of
> existing code, so I'll leave the argument to the experts.
No, that would not work at all. We have a very nice system to allow
set-uid scripts (Specifically, I like my nice secure taint-mode set-uid
perl scripts). If you did this, they would break completely, not to
mention _add_ all sorts of unsolvable race conditions to the few ways of
working around such a lack of SUID scripts. Also, it means that I can't
just "mv /sbin/init /sbin/init.real ; vim /sbin/init" to do a simple
wrapper around the init program, I would need to write a compiled C
program to do all sorts of fragile hackish things like calling a script
/sbin/init.sh.
Cheers,
Kyle Moffett
--
There are two ways of constructing a software design. One way is to make
it so simple that there are obviously no deficiencies. And the other way
is to make it so complicated that there are no obvious deficiencies.
The first method is far more difficult.
-- C.A.R. Hoare
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts
2005-09-18 3:45 ` Kyle Moffett
@ 2005-09-19 0:14 ` D. Hazelton
0 siblings, 0 replies; 80+ messages in thread
From: D. Hazelton @ 2005-09-19 0:14 UTC (permalink / raw)
To: Kyle Moffett; +Cc: Martin v. Löwis, 7eggert, H. Peter Anvin, linux-kernel
On Sunday 18 September 2005 03:45, Kyle Moffett wrote:
> On Sep 17, 2005, at 18:31:33, D. Hazelton wrote:
> > That, and my point remains that the kernel should know absolutely
> > nothing about how to execute a text file - the kernel should
> > return an error to the extent of "I don't know what to do with
> > this file" to the shell that tries to execute it, and the shell
> > can then check for the sh_bang. I do admit that this change would
> > break a lot of existing code, so I'll leave the argument to the
> > experts.
>
> No, that would not work at all. We have a very nice system to
> allow set-uid scripts (Specifically, I like my nice secure
> taint-mode set-uid perl scripts). If you did this, they would
> break completely, not to mention _add_ all sorts of unsolvable race
> conditions to the few ways of working around such a lack of SUID
> scripts. Also, it means that I can't just "mv /sbin/init
> /sbin/init.real ; vim /sbin/init" to do a simple wrapper around the
> init program, I would need to write a compiled C program to do all
> sorts of fragile hackish things like calling a script
> /sbin/init.sh.
This makes a lot more sense than I expected to hear. This argument alone
is enough for me to understand the reasoning behind the kernel knowing
how to interpret a shell script. Problem is, the program would not be
fragile or hackish - it'd be almost as simple as a "hello world"
program.
    #include <unistd.h>
    int main(void)
    {
        /* execve() wants an argv[] array and an environment, not a string */
        char *argv[] = { "/bin/sh", "/sbin/init.sh", NULL };
        char *envp[] = { NULL };
        /* if this fails the system is busted anyway */
        return execve(argv[0], argv, envp);
    }
This program would do the trick nicely, and since init is run as root,
there is no need to worry about the program having to grab privs.
However, the real problem is that this would break the initrd systems
used by most distributions for installation, and it would probably break
most of the "early userspace" systems just coming into use.
As I said originally - my comment about having the shell itself
interpret the sh_bang would break a lot of stuff and I've been shown
that I have to spend more time in the kernel code (as I haven't finished
going through the various drivers to see how those have been made to
work) before I can make a good suggestion in a discussion like this.
DRH
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts
2005-09-17 22:31 ` D. Hazelton
2005-09-18 3:45 ` Kyle Moffett
@ 2005-09-18 6:58 ` "Martin v. Löwis"
2005-09-19 0:31 ` D. Hazelton
1 sibling, 1 reply; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-18 6:58 UTC (permalink / raw)
To: D. Hazelton; +Cc: 7eggert, H. Peter Anvin, linux-kernel
D. Hazelton wrote:
> This is news to me. The last time I handed execve() a script as a
> parameter I had errors returned from execve() -- I must admit that
> this was not on my current system and I had assumed that the behavior
> would be consistent.
The kernel checks for #!<path>, and that <path> is an existing
executable. If not, execve fails.
> You are correct. It is fairly trivial. However my point still is valid
> that the Kernel has the whole binfmt_misc system -- I will admit that
> I have recently been shown numbers that show a noticeable difference
> in the speed of a binary executed using the binfmt_misc system and
> the binfmt_script system, but the fact remains that offering handling
> for UTF8 and ASCII scripts directly in the kernel will likely lead to
> at least one more patch in which the full Unicode standard is
> implemented.
The problem with the binfmt_misc approach is that you need *another*
execve call: with binfmt_misc, you register <utf8sig>#!, and a generic
binary. Then, this generic binary will interpret the #! signature
*again*, and invoke the proper interpreter. This will interpret the
first line *yet again* (finding that it is a comment), and continue
processing the file.
However, this is not the real problem. The real problem is that the
specific binfmt_misc "backend" would not be universally available, and
then the same script would start on some systems, and break on others.
This may be acceptable for large or specific applications (e.g. you have
to setup the ibcs2 module to run SCO applications); it is not for
scripts.
Now, the "universally available" part would not apply right now, as only
the most recent kernels would provide the feature. However, within a few
years, the feature would be part of "Linux" - then people can start
using it extensively.
> That, and my point remains that the kernel should know absolutely
> nothing about how to execute a text file - the kernel should return
> an error to the extent of "I don't know what to do with this file" to
> the shell that tries to execute it, and the shell can then check for
> the sh_bang. I do admit that this change would break a lot of
> existing code, so I'll leave the argument to the experts.
The point is that it is not necessarily the shell which starts programs -
the shell is but one creator of new processes. It is very common today
that, say, httpd starts new programs - this mechanism is called CGI.
Your approach was in use until 1985 or so, when Unix implementations
started to support #! natively. This was done both for convenience and
for performance: if programs would always use system(3) to start new
processes, there would always be a shell that execs the eventual
interpreter.
I'm not sure, but I believe that most current shells have "forgotten"
how to do the #! magic, since, by now, "traditionally" this is a kernel
responsibility.
Regards,
Martin
^ permalink raw reply [flat|nested] 80+ messages in thread
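The "#!<path>" check described in the message above can be illustrated
with a small userspace C sketch. It is not the code in
fs/binfmt_script.c; the helper name parse_interp and its return
convention are hypothetical, and it only extracts the path (checking
that the path names an existing executable, and handling an optional
argument after it, are left out).

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userspace helper, not kernel code: copy the interpreter
 * path from a "#!" line into out (capacity cap). Returns 0 on success,
 * -1 if the line does not start with "#!", names no interpreter, or the
 * path does not fit into out.
 */
static int parse_interp(const char *line, char *out, size_t cap)
{
	size_t n;

	if (strncmp(line, "#!", 2) != 0)
		return -1;
	line += 2;
	line += strspn(line, " \t");   /* blanks may follow "#!" */
	n = strcspn(line, " \t\n");    /* path ends at blank, newline or NUL */
	if (n == 0 || n >= cap)
		return -1;
	memcpy(out, line, n);
	out[n] = '\0';
	return 0;
}
```

A caller would read the first line of the file, call parse_interp on it,
and treat -1 as "not a script", which roughly corresponds to the case
where execve fails as described.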
* Re: [Patch] Support UTF-8 scripts
2005-09-18 6:58 ` "Martin v. Löwis"
@ 2005-09-19 0:31 ` D. Hazelton
0 siblings, 0 replies; 80+ messages in thread
From: D. Hazelton @ 2005-09-19 0:31 UTC (permalink / raw)
To: Martin v. Löwis; +Cc: 7eggert, H. Peter Anvin, linux-kernel
On Sunday 18 September 2005 06:58, "Martin v. Löwis" wrote:
> D. Hazelton wrote:
> > This is news to me. The last time I handed execve() a script as a
> > parameter I had errors returned from execve() -- I must admit that
> > this was not on my current system and I had assumed that the
> > behavior would be consistent.
>
> The kernel checks for #!<path>, and that <path> is an existing
> executable. If not, execve fails.
>
> > You are correct. It is fairly trivial. However my point still is
> > valid that the Kernel has the whole binfmt_misc system -- I will
> > admit that I have recently been shown numbers that show a
> > noticeable difference in the speed of a binary executed using the
> > binfmt_misc system and the binfmt_script system, but the fact
> > remains that offering handling for UTF8 and ASCII scripts
> > directly in the kernel will likely lead to at least one more
> > patch in which the full Unicode standard is implemented.
>
> The problem with the binfmt_misc approach is that you need
> *another* execve call: with binfmt_misc, you register <utf8sig>#!,
> and a generic binary. Then, this generic binary will interpret the
> #! signature *again*, and invoke the proper interpreter. This will
> interpret the first line *yet again* (finding that it is a comment),
> and continue processing the file.
True. I had forgotten that for truly generic rules about handling the #!
there would be double the overhead for the sh_bang.
> However, this is not the real problem. The real problem is that
> the specific binfmt_misc "backend" would not be universally
> available, and then the same script would start on some systems,
> and break on others. This may be acceptable for large or specific
> applications (e.g. you have to setup the ibcs2 module to run
> SCO applications); it is not for scripts.
Again this is all too true. Doubly so with the problem of an initrd that
has 'init' as a script.
> Now, the "universally available" part would not apply right now,
> as only the most recent kernels would provide the feature. However,
> within a few years, the feature would be part of "Linux" - then
> people can start using it extensively.
This sounds to me like you're saying in a few years my suggestion of
using binfmt_misc would be tenable. Unfortunately, unless forced into
it, no distro would ever use it. As I now see it, binfmt_script is
pretty much a hard-coded hack that gives the system a bit more speed for
running scripts. And since I've thought about the consequences of
ripping it out after the posts yesterday - there is no clean way to
remove it and still have a large number of systems still function.
> > That, and my point remains that the kernel should know absolutely
> > nothing about how to execute a text file - the kernel should
> > return an error to the extent of "I don't know what to do with
> > this file" to the shell that tries to execute it, and the shell
> > can then check for the sh_bang. I do admit that this change would
> > break a lot of existing code, so I'll leave the argument to the
> > experts.
>
> The point is that it is not necessarily the shell which starts
> programs - the shell is but one creator of new processes. It is
> very common today that, say, httpd starts new programs - this
> mechanism is called CGI. Your approach was in use until 1985 or
> so, when Unix implementations started to support #! natively.
> This was done both for convenience and for performance: if
> programs would always use system(3) to start new processes,
> there would always be a shell that execs the eventual
> interpreter.
True. In some cases, though, system(3) is really unusable - like you
mentioned, httpd often starts new processes. Since daemons don't,
technically, run on top of a shell, having one use system(3) to start a
new process would add a lot of unnecessary overhead.
> I'm not sure, but I believe that most current shells have
> "forgotten" how to do the #! magic, since, by now, "traditionally"
> this is a kernel responsibility.
Not true. Bash, at least, still handles the sh_bang. (Provable by using
it to call a perl script that doesn't have the exec bit set. This worked
for me just a week ago :)
DRH
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts [not found] ` <200509170028.59973.dhazelton@enter.net> 2005-09-17 6:28 ` "Martin v. Löwis" @ 2005-09-17 17:16 ` Bodo Eggert 1 sibling, 0 replies; 80+ messages in thread From: Bodo Eggert @ 2005-09-17 17:16 UTC (permalink / raw) To: D. Hazelton; +Cc: 7eggert, H. Peter Anvin, Martin v.Löwis, linux-kernel On Sat, 17 Sep 2005, D. Hazelton wrote: > On Friday 16 September 2005 18:02, Bodo Eggert wrote: > > Bernd Petrovitsch <bernd@firmix.at> wrote: > > > On Thu, 2005-09-15 at 20:39 +0200, "Martin v. Löwis" wrote: > > >> H. Peter Anvin wrote: > > >> > In Unix, it's a hideously bad > > >> > idea. The reason is that Unix inherently assumes that text > > >> > streams can be merged, split, and modified. In other words, > > >> > unless you can guarantee that EVERY program can handle BOM > > >> > EVERYWHERE, it's broken. > > > > You can't sort /bin/ls into /tmp/ls and expect /tmp/ls to be > > meaningfull, but /bin/ls works as expected. You can't usurally > > concat perl scripts and shell scripts either, but both kinds of > > script run quite well. > > > > And if you do "cat /bin/cat /bin/cp > /bin/catcp", what's "catcp > > foo bar" supposed to do? First output foo and bar to stdout, then > > copy foo to bar? Is execve() broken if it doesn't do what I > > described? Is the ELF header broken because it's not recogmized > > EVERYWHERE? I don't think so. > > This is a bogus argument. You're comparing the way a _binary_ > executable works to the way an interpreted _text_ script works. You can live with binaries, therefore the features not provided by binaries aren't vital for each and every executable. > execve(), at least on my system, isn't capable of running a script - > if I want to do that from a program I have to tell execve() that it's > running /bin/sh and the script file is in the parameter list. Fix your system, it's broken. 
> While I appreciate that the kernel is capable of performing complex > actions when execve runs into a file that is not an a.out or elf > binary I have yet to see a "binfmt script" option in the kernel > config files ever. Your wish ... but you won't be happy. --- ../t/linux-2.6.12/fs/Makefile 2005-06-17 21:48:29.000000000 +0200 +++ ./fs/Makefile 2005-09-17 18:02:36.000000000 +0200 @@ -20,9 +20,7 @@ obj-y += $(nfsd-y) $(nfsd-m) obj-$(CONFIG_BINFMT_AOUT) += binfmt_aout.o obj-$(CONFIG_BINFMT_EM86) += binfmt_em86.o obj-$(CONFIG_BINFMT_MISC) += binfmt_misc.o - -# binfmt_script is always there -obj-y += binfmt_script.o +obj-$(CONFIG_BINFMT_SCRIPT) += binfmt_script.o obj-$(CONFIG_BINFMT_ELF) += binfmt_elf.o obj-$(CONFIG_BINFMT_ELF_FDPIC) += binfmt_elf_fdpic.o --- ../t/linux-2.6.12/fs/Kconfig.binfmt 2005-06-17 21:48:29.000000000 +0200 +++ ./fs/Kconfig.binfmt 2005-09-17 17:59:39.000000000 +0200 @@ -42,6 +42,12 @@ config BINFMT_FLAT help Support uClinux FLAT format binaries. +config BINFMT_SCRIPT + bool "Kernel support for script files" + default y + help + Support script files starting with a '#!' marker. + config BINFMT_ZFLAT bool "Enable ZFLAT support" depends on BINFMT_FLAT > On the other hand, there is the "binfmt_misc" option, which does the > work that you seem to be looking for and can, AFAIK, be set to handle > both ASCII and UTF-8 scripts. Why add the complexity to the kernel > when it's not needed? Skipping 3 bytes vs. handling tons of binary formats? I bet the memory required to hold the utf8 binfmt_misc entry alone will be bigger than the code added by this patch. > > BTW: I think decent utf-8 capable programs SHOULD ignore extra BOM > > markers. > > All well and good if you use UTF-8. 
I, personally, am happy with ASCII > and have found no need for the extensive UTF character set (in fact, > I despise it when people insist on using UTF-8 in mediums in which > the character set is defined in the standards to be ASCII or a subset > of ASCII) I'm not using it, because nobody else is using it, and evrybody else does the same for exactly the same reasons. That's why just using utf-8 does not work out. However, if there were means of using both transparently, people could migrate. The editor part is simple, but if you can't use your favorite editor to generate shell scripts, it's a showstopper. > Since I am quite happy with the small subset of ASCII that I use on a > regular basis, and since I am always seeking ways to optimize my code > and my scripts I don't want the editor I'm using adding extra > characters behind my back. ACK. But you should be able to edit international text without tons of helper scripts, so a BOM will be usefull to mark utf-8. > > > And you (or at least I) do `grep`/`egrep`/`fgrep`, `wc` them. > > > > You can *grep utf-8 scripts, but you can't *grep binaries. > > Shouldn't this be fixed by implementing an in-kernel ASCII > > assembler and convert all binaries to assembler text? > > Bogus argument. Every shell I've ever used has expected the command > line to contain only ASCII characters. With that restriction in mind > it's clear that it'd be hard to put a UTF8 string as the argument to > grep. Although I doubt wc would be buggered by UTF8 input... If your shell isn't 8-bit-clean, it should have been replaced in the last millenium. Handling combined characters will be the problem. > > > And > > > probably with several other tools too - think of `find <dir> > > > -type f -print0 | xargs -0r <cmd>`. > > > > utf-8 filenames will work correctly (unless used as an extended > > BASIC script with non-ASCII variable names, but that would be > > insane). > > This is the truth. 
> As I previously mentioned I have yet to find a
> shell that accepted UTF8 on the command line without choking. And
> allowing UTF8 for filenames would, I believe, require any number of
> changes to the kernel, not the least of which would be changes to
> the various filesystems to allow for UTF8 and to any number of system
> calls that would be taking a filename for an argument.

It's not a task of allowing utf-8 filenames, but a task of disallowing
non-canonicalized and non-utf8 filenames if files might be created.
Systems doing that won't be strictly POSIX conformant, but as long
as there is a mounted FAT partition, it can't be anyway.

> > >> just pointless to do that. You create them with text editors,
> > >> and those can handle the UTF-8 signature.
> > >
> > > It is not uncommon to create scripts and the like with other
> > > programs, other scripts, what-else.
> >
> > It's not uncommon to create binaries using other programs. So what?
>
> Bullsh*t. The case of one binary creating another doesn't apply -
> because you either enter the data for the binary by hand (tedious and
> difficult) or you use a binary that takes input and produces the
> binary you need. And if the binary is missing the proper headers,
> it's pretty much useless.

And you can live with binaries being non-editable, non-generatable without
proper tools.

> When a script creates another script it is
> just creating a text file, putting the data in the file as it reaches
> those parts and has no way to know that it should be inserting the
> BOM.

If scripts are just text files, why doesn't sort<script|sh usually do the
right thing?

Scripts are _not_ random text, they have specific structures. They consist
of well-formed data, and you had better know what kind of script you're
creating, therefore you should also know whether to write sh_bang or
BOM_sh_bang. If you don't, don't generate the script!
> > > Apart from the fact that a "script" is merely a plain text file with
> > > the eXecutable bit set.
> >
> > And a utf-8 script is a utf-8 encoded text file with its
> > executable bit set.
>
> And the kernel should have no more code in it to execute them than is
> already present in the binfmt_misc code. No need for special kernel
> code when you can simply hand a chunk of parameters regarding the
> various executable formats to the kernel using a clean, simple and
> proven interface. And even then I feel it should be limited to
> binaries - a script is, by definition, interpreted. As such, it
> belongs in the same place as the interpreter - in userland. (And I
> fail to see why this is even brought up other than some people being
> lazy and not wanting to do things _correctly_)

So the binfmt_sh code should be completely abandoned in favor of binfmt_misc?

> > > And that is the only difference, so you have to at
> > > least (all instances of) `chmod` to insert and remove the BOM.
> >
> > [...]
> >
> > In order to make it harder for the interpreter to correctly detect
> > utf-8? You can have DOS executables run in dosboxes, windows
> > applications run in windows, java archives run in java, but utf-8
> > scripts should be mangled in order to work "correctly", and mangled
> > back in order to be editable? *That*'s insane!
> >
> > Just make execve ignore the BOM marker before "#!" as the patch
> > does, and you're done. The rest is somebody else's not-a-problem.
>
> GCC allows for non-ascii input as a formality. The specifications of
> both C and C++ clearly define the input character set to be limited
> to an extremely limited subset of ASCII, as do the specifications of
> most other languages.

This is a userspace problem.

> (Perl 6 is the first language I've ever heard of
> that directly includes non-ascii characters in the accepted character
> set)

The MS-DOS 3.3 shell accepted international characters in program names.
> AFAIK, the most common shells don't accept UTF-8 in the command set -
> they instead see the non-ascii UTF-8 characters as a series of bytes,
> and if one of them happens to be NULL, you're pretty much screwed.

There is no '\0' in utf-8-encoded data.

> > BTW2: However, I don't like the patch.
>
> Neither do I. such a thing doesn't belong in the kernel.

It's better than
- using a legacy wrapper script for each script.
- mangling each utf8 file before and after editing it
- forcing the world to convert to utf-8 within two weeks
- using a wrapper script around each and every utf-8 script which would
  unnecessarily throw out pages and waste CPU cycles while requiring
  each user to add several KB of kernel code for binfmt_misc and to have
  the interpreter for the wrapper script installed

I actually created a wrapper script for binfmt_misc and called it a hundred
times, here is the result:

$ time for((i=0;i<100;i++));do ./foo > /dev/null;done # with wrapper
real	0m2.350s
user	0m1.808s
sys	0m0.476s
$ time for((i=0;i<100;i++));do ./bar > /dev/null;done # without wrapper
real	0m0.461s
user	0m0.232s
sys	0m0.216s

And I'm sure this script has a bug to exploit.

(foo and bar will just print "test\n" to stdout)
-- 
"Our parents, worse than our grandparents, gave birth to us who are worse
than they, and we shall in our turn bear offspring still more evil."
	-- Horace (BC 65-8)
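The claim above that UTF-8 data contains no '\0' can be checked with a small encoder. The following is a minimal userspace sketch, not code from the thread or the kernel; the names `utf8_encode` and `utf8_has_zero_byte` are hypothetical. It illustrates that the only code point whose UTF-8 encoding contains a 0x00 byte is U+0000 itself, since lead and continuation bytes of multi-byte sequences always have the high bit set.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into out[].
 * Returns the number of bytes written, or 0 for an invalid code point
 * (UTF-16 surrogates or values above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return 0;
	if (cp < 0x10000) {
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	if (cp < 0x110000) {
		out[0] = (unsigned char)(0xF0 | (cp >> 18));
		out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[3] = (unsigned char)(0x80 | (cp & 0x3F));
		return 4;
	}
	return 0;
}

/* Returns 1 if the UTF-8 encoding of cp contains a 0x00 byte, else 0. */
int utf8_has_zero_byte(uint32_t cp)
{
	unsigned char out[4];
	size_t n = utf8_encode(cp, out);

	for (size_t i = 0; i < n; i++)
		if (out[i] == 0x00)
			return 1;
	return 0;
}
```

For example, U+0100 contains a zero byte when stored as UTF-16 but encodes as C4 80 in UTF-8, so NUL-terminated string handling is not disturbed by it.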
* Re: [Patch] Support UTF-8 scripts
  [not found] ` <4NTvO-yJ-13@gated-at.bofh.it>
@ 2005-09-18 0:53 ` Bodo Eggert
  2005-09-18 16:53 ` Bernd Petrovitsch
  [not found] ` <4O1MJ-3Hf-5@gated-at.bofh.it>
  1 sibling, 1 reply; 80+ messages in thread
From: Bodo Eggert @ 2005-09-18 0:53 UTC (permalink / raw)
To: Bernd Petrovitsch, Martin v. Löwis, H. Peter Anvin, linux-kernel

Bernd Petrovitsch <bernd@firmix.at> wrote:
> On Sat, 2005-09-17 at 08:20 +0200, "Martin v. Löwis" wrote:
>> Bernd Petrovitsch wrote:
>> > On Fri, 2005-09-16 at 22:41 +0200, "Martin v. Löwis" wrote:
>> > [ Language-specific examples ]
>> >
>> > And that's the only working way - the programming languages can actually
>> > do it because it defines the syntax and semantics of the contents
>> > anyways.
>>
>> It works from the programming language point of view, but it is a mess
>> from the text editor point of view.
>
> Most of the text editors have ways to markup the source files. Not even
> the various editors are able to agree on one method for all, so why
> could the (Linux) world agree on one for all text files?

You don't need a marker for all text files, but it's legal to have a marker
for utf-8 text files (see the Unicode standard 4.0.0 section 15.9), and
it's handy to use it until you made everybody in the world convert
everything to utf-8 (but not utf-{16,32}{le,be}).

>> > With this marker you are interfering with (at least) *all* text files.
>>
>> Hmm. What does that have to do with the patch I'm proposing? This
>> patch does *not* interfere with all text files. It is only relevant
>> for executable files starting with the #! magic.
>
> It *does* interfere since scripts are also text files in every aspect.
> So every feature you want for "scripts" you also get for text files (and
> vice versa BTW).

If utf-8 encoded text files are text files, and text files are scripts,
and all of them shall have the same features, utf-8 encoded text files
with BOM MUST be recognized as legal scripts, too.
Therefore this patch fixes a kernel bug.

BTW: Implementing the other utf signatures from Table 15.3 is left to the
reader as an exercise.-)

> If you think "script" and "text file" are different, define both of
> them, please, otherwise a discussion is pointless.

If all text files are script files, execute this mail.

>> > And there are always tools out there which simply do not understand the
>> > generic marker and can not ignore it since these bytes are part of the
>> > file.
>>
>> This conclusion is false. Many tools that don't understand the file
>> structure still can do their job on the files. So the fact that a tool
>> does not understand the structure does not necessarily imply that
>> the tool breaks when the structure changes.
>
> It *may* break just because of some to-be-ignored inline marking due to
> some questionable feature.

How exactly does it break, and what is it? And why must *it* be prevented
from breaking by ignoring script signatures in valid text files?

> And *when* (not if) it breaks, it is probably cumbersome to find since
> you have pretty unprintable characters.

If your tools can't print utf-8 encoded characters, they are broken for
ISO-8859-*, too. Besides that, it's not a kernel problem.

> Let alone the confusion why the size of a file with `ls -l` is different
> from the size in the editor or a marker-aware `wc -c`.
> So IMHO either you have a clear and visible marker or you have none at all.

Like e.g. the "From "-line starting each message in a mbox file? Virtually
no email client will display it.

The size of email messages does differ from its unencoded content size, too!
Of course nobody can handle this, and all users constantly try to kill
themselves because of that - NOT.
-- 
I thank GMX for sabotaging the use of my addresses by means of lies
spread via SPF.
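The check under discussion (accept "#!" either at offset 0 or directly after the UTF-8 signature EF BB BF) can be sketched in userspace C as follows. This is an illustration only, not the patch itself; the function name `script_magic_offset` and its return convention are assumptions made for this example.

```c
#include <stddef.h>
#include <string.h>

/* Returns the offset of the "#!" magic in buf:
 *   0  for a plain script,
 *   3  when "#!" directly follows the UTF-8 signature EF BB BF,
 *  -1  when the buffer does not start like a script at all. */
int script_magic_offset(const unsigned char *buf, size_t len)
{
	static const unsigned char utf8_sig[3] = { 0xEF, 0xBB, 0xBF };

	if (len >= 2 && buf[0] == '#' && buf[1] == '!')
		return 0;
	if (len >= 5 && memcmp(buf, utf8_sig, sizeof(utf8_sig)) == 0 &&
	    buf[3] == '#' && buf[4] == '!')
		return 3;
	return -1;
}
```

Working on `unsigned char` avoids comparing bytes against negative `char` constants, which matters on platforms where plain `char` is signed.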
* Re: [Patch] Support UTF-8 scripts
  2005-09-18 0:53 ` Bodo Eggert
@ 2005-09-18 16:53 ` Bernd Petrovitsch
  0 siblings, 0 replies; 80+ messages in thread
From: Bernd Petrovitsch @ 2005-09-18 16:53 UTC (permalink / raw)
To: 7eggert; +Cc: Martin v. Löwis, H. Peter Anvin, linux-kernel

On Sun, 2005-09-18 at 02:53 +0200, Bodo Eggert wrote:
> Bernd Petrovitsch <bernd@firmix.at> wrote:
[...]
> > Most of the text editors have ways to markup the source files. Not even
> > the various editors are able to agree on one method for all, so why
> > could the (Linux) world agree on one for all text files?
>
> You don't need a marker for all text files, but it's legal to have a marker
> for utf-8 text files (see the Unicode standard 4.0.0 section 15.9), and
> it's handy to use it until you made everybody in the world convert
> everything to utf-8 (but not utf-{16,32}{le,be}).

Have fun patching almost every text processing tool and concept out
there.
(Apart from that the way of that marker is wrong it seems to me that the
UTF-8 body has no other choice than such an insane "rule" or
recommendation).

> >> > With this marker you are interfering with (at least) *all* text files.
> >>
> >> Hmm. What does that have to do with the patch I'm proposing? This
> >> patch does *not* interfere with all text files. It is only relevant
> >> for executable files starting with the #! magic.
> >
> > It *does* interfere since scripts are also text files in every aspect.
> > So every feature you want for "scripts" you also get for text files (and
> > vice versa BTW).
>
> If utf-8 encoded text files are text files, and text files are scripts,

No one said all text files are scripts, instead it is the other way
'round.
[ snipped because of ex falso quod libet ]

> > If you think "script" and "text file" are different, define both of
> > them, please, otherwise a discussion is pointless.
>
> If all text files are script files, execute this mail.

See above. Obviously you misunderstand something.
> >> > And there are always tools out there which simply do not understand the
> >> > generic marker and can not ignore it since these bytes are part of the
> >> > file.
> >>
> >> This conclusion is false. Many tools that don't understand the file
> >> structure still can do their job on the files. So the fact that a tool
> >> does not understand the structure does not necessarily imply that
> >> the tool breaks when the structure changes.
> >
> > It *may* break just because of some to-be-ignored inline marking due to
> > some questionable feature.
>
> How exactly does it break, and what is it? And why must *it* be prevented
> from breaking by ignoring script signatures in valid text files?

The question was: What happens if this marker is encountered within a file?
To be ignored (by UTF-8 aware tools)? Some other interpretation?
Illegal/Forbidden?

> > And *when* (not if) it breaks, it is probably cumbersome to find since
> > you have pretty unprintable characters.
>
> If your tools can't print utf-8 encoded characters, they are broken for
> ISO-8859-*, too. Besides that, it's not a kernel problem.

Which is again not true since lots of tools out there printed ISO-8859-*
correctly before UTF-8 was deployed.

[...]

	Bernd
-- 
Firmix Software GmbH                  http://www.firmix.at/
mobil: +43 664 4416156                fax: +43 1 7890849-55
          Embedded Linux Development and Services
* Re: [Patch] Support UTF-8 scripts
  [not found] ` <4O8Oh-5jp-7@gated-at.bofh.it>
@ 2005-09-18 19:23 ` Bodo Eggert
  2005-09-18 21:03 ` Bernd Petrovitsch
  ` (2 more replies)
  2005-09-19 4:54 ` "Martin v. Löwis"
  1 sibling, 3 replies; 80+ messages in thread
From: Bodo Eggert @ 2005-09-18 19:23 UTC (permalink / raw)
To: Bernd Petrovitsch, Martin v. Löwis, H. Peter Anvin, linux-kernel

Bernd Petrovitsch <bernd@firmix.at> wrote:
> On Sun, 2005-09-18 at 09:23 +0200, "Martin v. Löwis" wrote:
> [...]
>> >>Hmm. What does that have to do with the patch I'm proposing? This
>> >>patch does *not* interfere with all text files. It is only relevant
>> >>for executable files starting with the #! magic.
>> >
>> > It *does* interfere since scripts are also text files in every aspect.
>> > So every feature you want for "scripts" you also get for text files (and
>> > vice versa BTW).
>>
>> The specific feature I get is that when I pass a file starting
>> with <utf8sig>#! to execve, Linux will execute the file following
>> the #!. In what way do I get this feature for text in general?
>> And if I do, why is that a problem?
>
> After applying this patch it seems that "Linux" is supporting this
> marker officially in general - especially if the kernel supports it.

It will be the first POSIX kernel to correctly support utf-8 scripts.
It's 2005, and according to other(?) posters, this should be standard.

> I
> suppose the next kernel patch is to support Win-like CR-LF sequences
> (which is not the case AFAIK).

Maybe it should, maybe it shouldn't. If I used MAC or DOS, I'd be sure
it should.-)

> BTW even some standards body thinks that this is the way to go,

Not surprisingly the Unicode Consortium is one of them.

> it
> raises more problems and questions than resolves anything.

The problem of how to handle BOM is solved by reading the standard.

> And though scripts are usually edited/changed/"parsed"/... with a text
> editor, it is not always the case.
> Therefore the automatic extension to
> *all text files* (especially as the marker basically applies to all text
> files, not only scripts).
> You want to focus just on your patch and ignore the directly implied
> potential problems arising ...

There is no problem arising from the patch, it solves one. To solve the
rest, use recode.

[...]
> Apparently I have to repeat: If you do `cat a.txt b.txt >c.txt` where
> a.txt and b.txt have this marker, then c.txt have the marker of b.txt
> somewhere in the middle. Does this make sense in any way?
> How do I get rid of the marker in the middle transparently?

The Unicode standard defines how to handle them.

>> > Let alone the confusion why the size of a file with `ls -l` is different
>> > from the size in the editor or a marker-aware `wc -c`.
>>
>> This is true for any UTF-8 file, or any multibyte encoding. For any
>> multibyte encoding, the number of bytes in the file is different from
>> the number of characters. That doesn't (and shouldn't) stop people from
>> using multi-byte encodings.
>
> It is different even if a pure ASCII file is marked as UTF-8.

No pure ASCII file will be marked, since a marked file will be no
ASCII file.

> And sure, the problem exists in general with multi-byte encodings.

ACK, but that's not a kernel problem nor a specific Unicode problem.
Fix it by making China, Greece and Japan convert to ASCII and by making
all mathematicians stop using strange characters. All other users will
follow.

>> What the editor displays as the number of "things" is up to its own.
>> The output of wc -c will always be the same as the one of ls -l,
>> as wc -c does *not* give you characters:
>>
>> -c, --bytes
>>        print the byte counts
>>
>> You might have been thinking of 'wc -m'.
>
> It depends on the definition of "character". There are other standards
> which define "character" as "byte".

There are architectures defining a byte to be 32 bit. They are
irrelevant, too.

[...]
>> Not sure what this has
>> to do with the specific patch, though.
>
> It is not supported by the kernel. So either you remove it or you make
> some compatibility hack (like an appropriate sym-link

-EDOESNOTWORK

#!/usr/bin/perl -T -s -w

>, etc.). Since the
> kernel can start java classes directly, you can probably make a similar
> thing for the UTF-8 stuff.

If MSDOS text files are text files are legal scripts, the kernel
should recognize [\x0D\x0A] as valid line breaks.
(The real reason would be unicode allowing NEL to be encoded as
0x0D or 0x0A.)

This compile-tested patch adds 32 bytes to binfmt_script:

--- ./fs/binfmt_script.c.old	2005-09-18 20:28:32.000000000 +0200
+++ ./fs/binfmt_script.c	2005-09-18 20:29:44.000000000 +0200
@@ -18,7 +18,7 @@
 
 static int load_script(struct linux_binprm *bprm,struct pt_regs *regs)
 {
-	char *cp, *i_name, *i_arg;
+	char *cp, *cp2, *i_name, *i_arg;
 	struct file *file;
 	char interp[BINPRM_BUF_SIZE];
 	int retval;
@@ -47,6 +47,9 @@ static int load_script(struct linux_binp
 	bprm->buf[BINPRM_BUF_SIZE - 1] = '\0';
 	if ((cp = strchr(bprm->buf, '\n')) == NULL)
 		cp = bprm->buf+BINPRM_BUF_SIZE-1;
+	if ((cp2 = strchr(bprm->buf, '\x0D')) != NULL
+	    && cp2 < cp)
+		cp = cp2;
 	*cp = '\0';
 	while (cp > bprm->buf) {
 		cp--;
-- 
I thank GMX for sabotaging the use of my addresses by means of lies
spread via SPF.
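The effect of the patch above (end the interpreter line at the first CR or LF, whichever comes first) can be tried out in userspace. The sketch below is an assumption-laden illustration rather than the kernel code: it uses the standard `strcspn` instead of the two `strchr` calls in the patch, and the name `first_line_len` is hypothetical.

```c
#include <string.h>

/* Length of the first line of a NUL-terminated buffer.
 * The line ends at the first '\n' or '\r' (whichever comes first),
 * or at the terminating NUL if neither is present. */
size_t first_line_len(const char *buf)
{
	return strcspn(buf, "\r\n");
}
```

With this rule a script saved with CR-LF or bare CR line endings yields `#!/bin/sh` as the interpreter line rather than `#!/bin/sh\r`.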
* Re: [Patch] Support UTF-8 scripts
  2005-09-18 19:23 ` Bodo Eggert
@ 2005-09-18 21:03 ` Bernd Petrovitsch
  2005-09-19 19:37 ` Bodo Eggert
  2005-09-18 22:29 ` Valdis.Kletnieks
  2005-09-19 6:03 ` H. Peter Anvin
  2 siblings, 1 reply; 80+ messages in thread
From: Bernd Petrovitsch @ 2005-09-18 21:03 UTC (permalink / raw)
To: 7eggert; +Cc: Martin v. Löwis, H. Peter Anvin, linux-kernel

On Sun, 2005-09-18 at 21:23 +0200, Bodo Eggert wrote:
[...]
> >> Not sure what this has
> >> to do with the specific patch, though.
> >
> > It is not supported by the kernel. So either you remove it or you make
> > some compatibility hack (like an appropriate sym-link
>
> -EDOESNOTWORK
>
> #!/usr/bin/perl -T -s -w

It depends on /usr/bin/perl how it handles a white-space character
directly after "-w".

> >, etc.). Since the
> > kernel can start java classes directly, you can probably make a similar
> > thing for the UTF-8 stuff.
>
> If MSDOS text files are text files are legal scripts, the kernel
> should recognize [\x0D\x0A] as valid line breaks.

The Unix world does recognize the line breaks. It's up to the tool how to
handle the white-space character before it.
Especially for C and similar languages with continuation lines this
leads to interesting (or now more boring) problems.

	Bernd
-- 
Firmix Software GmbH                  http://www.firmix.at/
mobil: +43 664 4416156                fax: +43 1 7890849-55
          Embedded Linux Development and Services
* Re: [Patch] Support UTF-8 scripts
  2005-09-18 21:03 ` Bernd Petrovitsch
@ 2005-09-19 19:37 ` Bodo Eggert
  0 siblings, 0 replies; 80+ messages in thread
From: Bodo Eggert @ 2005-09-19 19:37 UTC (permalink / raw)
To: Bernd Petrovitsch
Cc: 7eggert, Martin v. Löwis, H. Peter Anvin, linux-kernel

On Sun, 18 Sep 2005, Bernd Petrovitsch wrote:
> On Sun, 2005-09-18 at 21:23 +0200, Bodo Eggert wrote:

> > >, etc.). Since the
> > > kernel can start java classes directly, you can probably make a similar
> > > thing for the UTF-8 stuff.
> >
> > If MSDOS text files are text files are legal scripts, the kernel
> > should recognize [\x0D\x0A] as valid line breaks.
>
> The Unix world does recognize the line breaks.

Create a valid text file with Macintosh line breaks (as allowed in Unicode
files) and try it.
-- 
If enough data is collected, a board of inquiry can prove ANYTHING.
* Re: [Patch] Support UTF-8 scripts
  2005-09-18 19:23 ` Bodo Eggert
  2005-09-18 21:03 ` Bernd Petrovitsch
@ 2005-09-18 22:29 ` Valdis.Kletnieks
  2005-09-19 6:03 ` H. Peter Anvin
  2 siblings, 0 replies; 80+ messages in thread
From: Valdis.Kletnieks @ 2005-09-18 22:29 UTC (permalink / raw)
To: 7eggert
Cc: Bernd Petrovitsch, Martin v. Löwis, H. Peter Anvin, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1242 bytes --]

On Sun, 18 Sep 2005 21:23:42 +0200, Bodo Eggert said:
> Bernd Petrovitsch <bernd@firmix.at> wrote:
> > Apparently I have to repeat: If you do `cat a.txt b.txt >c.txt` where
> > a.txt and b.txt have this marker, then c.txt have the marker of b.txt
> > somewhere in the middle. Does this make sense in anyway?
> > How do I get rid of the marker in the middle transparently?
>
> The unicode standard defines how to handle them.

For the benefit of those of us who are interested in the problem, but aren't
in the mood to wade through a long standard looking for the answer to a
specific question, can you elaborate? It isn't as obvious as all that,
because of all the nasty corner cases...

> > It is different even if a pure ASCII file is marked as UTF-8.
>
> No pure ASCII file will be marked, since a marked file will be no
> ASCII file.

Given a file "a.txt" that's pure ASCII, and a file "b.txt" that has the BOM
marker on it, what happens when you do "cat a.txt b.txt > c.txt"?

'cat' doesn't know, and has no way of knowing, that c.txt needs a BOM at the
*front* of the file until it's already written past the point in c.txt where
the BOM has to go. What does the Unicode standard say to do in this case?

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]
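To make the `cat a.txt b.txt > c.txt` scenario concrete, here is a small sketch that locates UTF-8 signatures inside a byte buffer. It only demonstrates where the marker ends up after a byte-wise concatenation and does not answer what the Unicode standard prescribes; the function name `find_utf8_bom` is hypothetical.

```c
#include <stddef.h>

/* Returns the offset of the first UTF-8 signature (EF BB BF) found at or
 * after position start in buf, or -1 if there is none. */
long find_utf8_bom(const unsigned char *buf, size_t len, size_t start)
{
	for (size_t i = start; i + 3 <= len; i++)
		if (buf[i] == 0xEF && buf[i + 1] == 0xBB && buf[i + 2] == 0xBF)
			return (long)i;
	return -1;
}
```

For a pure-ASCII `a.txt` containing "plain\n" followed by a signed `b.txt`, the signature is found at offset 6 of the concatenation, not at offset 0, which is exactly the corner case raised above.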
* Re: [Patch] Support UTF-8 scripts
  2005-09-18 19:23 ` Bodo Eggert
  2005-09-18 21:03 ` Bernd Petrovitsch
  2005-09-18 22:29 ` Valdis.Kletnieks
@ 2005-09-19 6:03 ` H. Peter Anvin
  2 siblings, 0 replies; 80+ messages in thread
From: H. Peter Anvin @ 2005-09-19 6:03 UTC (permalink / raw)
To: 7eggert; +Cc: Bernd Petrovitsch, "Martin v. Löwis", linux-kernel

Bodo Eggert wrote:
>
> It will be the first POSIX kernel to correctly support utf-8 scripts.
> It's 2005, and according to other(?) posters, this should be standard.
>

UTF-8, yes. BOM bullshit, no.

	-hpa
* Re: [Patch] Support UTF-8 scripts
  [not found] ` <4O8Oh-5jp-7@gated-at.bofh.it>
  2005-09-18 19:23 ` Bodo Eggert
@ 2005-09-19 4:54 ` "Martin v. Löwis"
  2005-09-19 8:26 ` Bernd Petrovitsch
  1 sibling, 1 reply; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-19 4:54 UTC (permalink / raw)
To: Bernd Petrovitsch, linux-kernel

Bernd Petrovitsch wrote:
>>The specific feature I get is that when I pass a file starting
>>with <utf8sig>#! to execve, Linux will execute the file following
>>the #!. In what way do I get this feature for text in general?
>>And if I do, why is that a problem?
>
>
> After applying this patch it seems that "Linux" is supporting this
> marker officially in general - especially if the kernel supports it.

What makes it seem so? That binfmt_script supports a certain convention
doesn't mean that all other programs also somehow need to support that
convention - and certainly not in the same way.

> I suppose the next kernel patch is to support Win-like CR-LF sequences
> (which is not the case AFAIK).

What makes you suppose that? I have no plans to submit such a patch.

> And though scripts are usually edited/changed/"parsed"/... with an text
> editor, it is not always the case. Therefore the automatic extension to
> *all text files* (especially as the marker basically applies to all text
> files, not only scripts).
> You want to focus just on your patch and ignore the directly implied
> potential problems arising ...

Because there are no problems arising. The next time somebody submits
a patch to cat(1) to strip off UTF-8 signatures, you *then* complain
that this is the wrong thing to do, because it violates the
specification of cat.

This reasoning is just flawed: it is like saying to a web browser
developer: "don't _support_ XHTML, because there are so many tools
which use HTML 4".

> Apparently I have to repeat: If you do `cat a.txt b.txt >c.txt` where
> a.txt and b.txt have this marker, then c.txt have the marker of b.txt
> somewhere in the middle.
> Does this make sense in any way?

Indeed, it does. There is nothing inherently wrong with having
the marker in the middle.

> How do I get rid of the marker in the middle transparently?

http://www.unicode.org/faq/utf_bom.html#38

>>What the editor displays as the number of "things" is up to its own.
>>The output of wc -c will always be the same as the one of ls -l,
>>as wc -c does *not* give you characters:
>>
>>       -c, --bytes
>>              print the byte counts
>>
>>You might have been thinking of 'wc -m'.
>
>
> It depends on the definition of "character". There are other standards
> which define "character" as "byte".

Certainly. However, you specifically talked about 'wc -c', and, in
wc(1), at least in the implementation commonly used on Linux, characters
and bytes are not the same.

>>It depends on the editor I use, of course
>
>
> No, more on the OS the editor runs on.

You talked about Windows specifically. On Windows, most editors give you
the choice of choosing the line ending, and will preserve whatever line
ending they find when adding new lines to a file.

Regards,
Martin
* Re: [Patch] Support UTF-8 scripts
  2005-09-19 4:54 ` "Martin v. Löwis"
@ 2005-09-19 8:26 ` Bernd Petrovitsch
  2005-09-19 9:00 ` Valdis.Kletnieks
  2005-09-19 21:40 ` "Martin v. Löwis"
  0 siblings, 2 replies; 80+ messages in thread
From: Bernd Petrovitsch @ 2005-09-19 8:26 UTC (permalink / raw)
To: "Martin v. Löwis"; +Cc: linux-kernel

On Mon, 2005-09-19 at 06:54 +0200, "Martin v. Löwis" wrote:
> Bernd Petrovitsch wrote:
> >>The specific feature I get is that when I pass a file starting
> >>with <utf8sig>#! to execve, Linux will execute the file following
> >>the #!. In what way do I get this feature for text in general?
> >>And if I do, why is that a problem?
> >
> > After applying this patch it seems that "Linux" is supporting this
> > marker officially in general - especially if the kernel supports it.
>
> What makes it seem so? That binfmt_script supports a certain convention
> doesn't mean that all other programs also somehow need to support that
> convention - and certainly not in the same way.

We will see how it develops. Actually the marker could be used to detect
endianness of the file if I read below URL correctly ....

> > I suppose the next kernel patch is to support Win-like CR-LF sequences
> > (which is not the case AFAIK).
>
> What makes you suppose that? I have no plans to submit such a patch.

No need to. Other people tried already.

> This reasoning is just flawed: it is like saying to a web browser
> developer: "don't _support_ XHTML, because there are so many tools
> which use HTML 4".

No, the saying was more: "don't support XHTML since it may break HTML
compliant browsers."
For XHTML/HTML we all know that this is not the case, so the comparison
is flawed.

> > Apparently I have to repeat: If you do `cat a.txt b.txt >c.txt` where
> > a.txt and b.txt have this marker, then c.txt have the marker of b.txt
> > somewhere in the middle. Does this make sense in anyway?
>
> Indeed, it does.
> There is nothing inherently wrong with having
> the marker in the middle.
>
> > How do I get rid of the marker in the middle transparently?
>
> http://www.unicode.org/faq/utf_bom.html#38

Thanks.
---- snip ----
In that case, any U+FEFF occurring in the middle of the file can be
ignored, or treated as an error.
---- snip ----
Well, this doesn't sound like a clear rule stating that it *must* be
ignored.

BTW:
---- snip ----
Q: How I should deal with BOMs?
[...]
3. Some byte oriented protocols expect ASCII characters at the beginning
of a file. If UTF-8 is used with these protocols, use of the BOM as
encoding form signature should be avoided.
---- snip ----
Voila. Avoid the BOM in your scripts and be done.

> > It depends on the definition of "character". There are other standards
> > which define "character" as "byte".
>
> Certainly. However, you specifically talked about 'wc -c', and, in
> wc(1), atleast in the implementation commonly used on Linux, characters
> and bytes are not the same.

Yes, now since multi-byte character sets get more commonly used.
However, I don't think you get this into the C standard. But we are now
far off the discussion ....

> >>It depends on the editor I use, of course
> >
> > No, more on the OS the editor runs on.
>
> You talked about Windows specifically. On Windows, most editors give you
> the choice of chosing the line ending, and will preserve whatever line
> ending they find when adding new lines to a file.

I believe this for vim, emacs, etc. but I don't believe it for the native
ones ...

	Bernd
-- 
Firmix Software GmbH                  http://www.firmix.at/
mobil: +43 664 4416156                fax: +43 1 7890849-55
          Embedded Linux Development and Services
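The FAQ text quoted above leaves two options for a U+FEFF in the middle of a file: ignore it or treat it as an error. The following sketch implements only the "ignore" option, as an assumption chosen for illustration: it keeps a signature at the very start of the buffer and drops every later one. The name `strip_inner_utf8_bom` is hypothetical, and the input is assumed to be valid UTF-8, in which the byte sequence EF BB BF can only ever be an encoded U+FEFF.

```c
#include <stddef.h>

/* Removes every UTF-8 encoded U+FEFF (EF BB BF) except one at offset 0.
 * Works in place and returns the new length of the buffer. */
size_t strip_inner_utf8_bom(unsigned char *buf, size_t len)
{
	size_t in = 0, out = 0;

	/* Keep a leading signature untouched. */
	if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
		in = out = 3;

	while (in < len) {
		if (in + 3 <= len && buf[in] == 0xEF &&
		    buf[in + 1] == 0xBB && buf[in + 2] == 0xBF) {
			in += 3;	/* skip an interior signature */
			continue;
		}
		buf[out++] = buf[in++];
	}
	return out;
}
```

A filter built on this would give the "transparent" removal asked about for the output of `cat`, at the cost of also dropping any U+FEFF that was meant as a zero-width no-break space.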
* Re: [Patch] Support UTF-8 scripts
  2005-09-19 8:26 ` Bernd Petrovitsch
@ 2005-09-19 9:00 ` Valdis.Kletnieks
  2005-09-19 9:41 ` Bernd Petrovitsch
  2005-09-19 21:40 ` "Martin v. Löwis"
  1 sibling, 1 reply; 80+ messages in thread
From: Valdis.Kletnieks @ 2005-09-19 9:00 UTC (permalink / raw)
To: Bernd Petrovitsch; +Cc: "Martin v. Löwis", linux-kernel

[-- Attachment #1: Type: text/plain, Size: 635 bytes --]

On Mon, 19 Sep 2005 10:26:22 +0200, Bernd Petrovitsch said:

> We will see how it develops. Actually the marker could be used to detect
> endianness of the file if I read below URL correctly ....

Text files have endianness????

> ---- snip ----
> Q: How I should deal with BOMs?
> [...]
> 3. Some byte oriented protocols expect ASCII characters at the beginning
> of a file. If UTF-8 is used with these protocols, use of the BOM as
> encoding form signature should be avoided.
> ---- snip ----
> Voila. Avoid the BOM in your scripts and be done.

At which point the proposed kernel patch becomes pointless.. ;)

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]
* Re: [Patch] Support UTF-8 scripts
  2005-09-19 9:00 ` Valdis.Kletnieks
@ 2005-09-19 9:41 ` Bernd Petrovitsch
  0 siblings, 0 replies; 80+ messages in thread
From: Bernd Petrovitsch @ 2005-09-19 9:41 UTC (permalink / raw)
To: Valdis.Kletnieks; +Cc: "Martin v. Löwis", linux-kernel

On Mon, 2005-09-19 at 05:00 -0400, Valdis.Kletnieks@vt.edu wrote:
> On Mon, 19 Sep 2005 10:26:22 +0200, Bernd Petrovitsch said:
>
> > We will see how it develops. Actually the marker could be used to detect
> > endianness of the file if I read below URL correctly ....
>
> Text files have endianness????

Unicode-16 ones with 16 bit per character (as in Win NT), yes.
UTF-8 ones not AFAIK.

	Bernd
-- 
Firmix Software GmbH                  http://www.firmix.at/
mobil: +43 664 4416156                fax: +43 1 7890849-55
          Embedded Linux Development and Services
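The endianness point can be illustrated with the byte patterns of U+FEFF: FE FF in big-endian UTF-16, FF FE in little-endian UTF-16, and EF BB BF in UTF-8, where byte order plays no role. The sketch below is a simplified detector with hypothetical names (`enum bom_kind`, `detect_bom`); as a stated limitation it does not handle UTF-32, whose little-endian signature FF FE 00 00 also begins with FF FE.

```c
#include <stddef.h>

enum bom_kind { BOM_NONE, BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE };

/* Classifies the first bytes of buf by the encoded form of U+FEFF. */
enum bom_kind detect_bom(const unsigned char *buf, size_t len)
{
	if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
		return BOM_UTF8;	/* signature only, no byte order */
	if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
		return BOM_UTF16_BE;
	if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
		return BOM_UTF16_LE;
	return BOM_NONE;
}
```

In UTF-8 the marker therefore serves purely as an encoding signature, which is the role it has in the proposed patch.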
* Re: [Patch] Support UTF-8 scripts
2005-09-19 8:26 ` Bernd Petrovitsch
2005-09-19 9:00 ` Valdis.Kletnieks
@ 2005-09-19 21:40 ` "Martin v. Löwis"
1 sibling, 0 replies; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-19 21:40 UTC (permalink / raw)
To: Bernd Petrovitsch; +Cc: "Martin v. Löwis", linux-kernel

Bernd Petrovitsch wrote:
>>>It depends on the definition of "character". There are other standards
>>>which define "character" as "byte".
>>
>>Certainly. However, you specifically talked about 'wc -c', and, in
>>wc(1), at least in the implementation commonly used on Linux, characters
>>and bytes are not the same.
>
> Yes, now since multi-byte character sets get more commonly used.
> However, I don't think you get this into the C standard. But we are now
> far off the discussion ....

It does indeed, so just one final clarification. wc(1) is not part of
the C standard - ISO 9899 does not talk about command line utilities
at all. The relevant standard is POSIX; IEEE Std 1003.1, 2004 Edition
says, in

http://www.opengroup.org/onlinepubs/009695399/utilities/wc.html

-c Write to the standard output the number of bytes in each input file.
[...]
-m Write to the standard output the number of characters in each input
file.
[...]
RATIONALE
[...]
The -c option stands for "character" count, even though it counts bytes.
This stems from the sometimes erroneous historical view that bytes and
characters are the same size. Due to international requirements, the -m
option (reminiscent of "multi-byte") was added to obtain actual
character counts.

Regards,
Martin
^ permalink raw reply [flat|nested] 80+ messages in thread
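To make the -c versus -m distinction quoted above concrete, here is a small Python sketch. It is only an approximation: `wc -m` counts characters according to the current locale, whereas this counts Unicode code points after decoding, and the function name is an assumption of this note.

```python
def count_bytes_and_chars(data: bytes, encoding: str = "utf-8") -> tuple:
    """Return (byte count, character count), loosely like wc -c and wc -m."""
    # len() of bytes counts bytes; len() of the decoded str counts code points.
    return len(data), len(data.decode(encoding))

sample = "Lösung\n".encode("utf-8")
print(count_bytes_and_chars(sample))  # (8, 7): the "ö" takes two bytes in UTF-8
```

A file that starts with the UTF-8 signature would report three more bytes and one more code point than the same text without it.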
* Re: [Patch] Support UTF-8 scripts
[not found] ` <4NtL0-5lQ-13@gated-at.bofh.it>
@ 2005-09-16 20:34 ` "Martin v. Löwis"
2005-09-17 12:01 ` Martin Mares
0 siblings, 1 reply; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-16 20:34 UTC (permalink / raw)
To: Martin Mares; +Cc: linux-kernel

Martin Mares wrote:
> I doubt that. For ages people were using several different encodings on
> a single system (at least here in .cz) without any markers and although
> there were some rough edges, almost everything worked. Now we do the same
> with ISO-8859-2 and UTF-8, again with no need for a marker.

This is true for text files, where a human reader can interpret the data
correctly even in absence of a declaration. For programming languages,
this is typically not the case. Instead, in order to correctly interpret
the source code, you need to declare the encoding. For a script, this
should be done inside the file itself, as there is no explicit invocation
of a compiler or some such where the script encoding could be specified
externally.

Regards,
Martin
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts
2005-09-16 20:34 ` "Martin v. Löwis"
@ 2005-09-17 12:01 ` Martin Mares
2005-09-17 12:25 ` "Martin v. Löwis"
0 siblings, 1 reply; 80+ messages in thread
From: Martin Mares @ 2005-09-17 12:01 UTC (permalink / raw)
To: "Martin v. Löwis"; +Cc: linux-kernel

Hello!

> This is true for text files, where a human reader can interpret the data
> correctly even in absence of a declaration. For programming languages,
> this is typically not the case. Instead, in order to correctly interpret
> the source code, you need to declare the encoding. For a script, [...]

This makes no sense. For a script, the shell does not care about the encoding
at all.

Also, currently, people use zillions of encodings, most of which have no
signature, so introducing a signature for UTF-8 does not win anything.

In the future, most people will probably use only UTF-8, so the signature
carries no information.

Have a nice fortnight
--
Martin `MJ' Mares <mj@ucw.cz> http://atrey.karlin.mff.cuni.cz/~mj/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
Q: Who invented the first airplane that did not fly? A: The Wrong Brothers.
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts
2005-09-17 12:01 ` Martin Mares
@ 2005-09-17 12:25 ` "Martin v. Löwis"
2005-09-17 12:28 ` Martin Mares
2005-09-19 7:08 ` Pavel Machek
0 siblings, 2 replies; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-17 12:25 UTC (permalink / raw)
To: Martin Mares; +Cc: linux-kernel

Martin Mares wrote:
> This makes no sense. For a script, the shell does not care about the encoding
> at all.

I'm not (only) talking about /bin/sh. I'm primarily talking about
/usr/bin/python, /usr/bin/perl, and /usr/bin/wish. In all these
languages, the interpreter *does* care about the encoding.

1. In Python, the syntax u"some data" denotes a Unicode literal
   (stored internally either in UCS-2 or UCS-4); the literals
   are converted from the source encoding to the internal
   representation. This requires knowledge of the source encoding.
2. In Tcl, all strings are internally represented in UTF-8, and
   converted from the source encoding (which currently is inferred
   from the locale of the process executing the script).
3. In Perl, 'use utf8' declares that the encoding of the script
   is UTF-8, meaning that non-ASCII can be used in string literals,
   identifiers, and regular expressions.

> Also, currently, people use zillions of encodings, most of which have no
> signature, so introducing a signature for UTF-8 does not win anything.

This specific patch does win something: it allows executing scripts
which start with <utf8 signature>#! This is useful e.g. for Python,
which recognizes the UTF-8 signature as declaring the source encoding
of the Python module to be UTF-8.

> In the future, most people will probably use only UTF-8, so the signature
> carries no information.

In the future, the signature *will* carry no information. But the future
is, well, in the future.

I just can't understand why (some) people are so opposed to this patch.
It is a really trivial, straight-forward change. It introduces no
policy, just a feature: you can put the UTF-8 signature in your script
file, if you want to (and your scripting language supports it). By
no means does it force you to put the UTF-8 signature in all your script
files, let alone all your text files.

Regards,
Martin
^ permalink raw reply [flat|nested] 80+ messages in thread
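For readers who want the test from the patch in a form they can try out, the following is a user-space Python restatement of the check that the patched load_script() performs on the start of the file: accept `#!`, or `#!` preceded by the bytes EF BB BF. It is a sketch for illustration only, not kernel code; the function name is invented here, and the separate recursion check on `sh_bang` is left out.

```python
UTF8_SIGNATURE = b"\xef\xbb\xbf"

def looks_like_script(buf: bytes) -> bool:
    """True if buf starts with #!, optionally preceded by the UTF-8 signature."""
    if buf.startswith(UTF8_SIGNATURE):
        # Skip the three signature bytes, as the patch does for buf[0..2].
        buf = buf[len(UTF8_SIGNATURE):]
    return buf.startswith(b"#!")

print(looks_like_script(b"#!/usr/bin/python\n"))              # True
print(looks_like_script(b"\xef\xbb\xbf#!/usr/bin/python\n"))  # True
print(looks_like_script(b"\xef\xbb\xbfprint('hi')\n"))        # False
```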
* Re: [Patch] Support UTF-8 scripts
2005-09-17 12:25 ` "Martin v. Löwis"
@ 2005-09-17 12:28 ` Martin Mares
2005-09-17 12:53 ` "Martin v. Löwis"
2005-09-19 7:08 ` Pavel Machek
1 sibling, 1 reply; 80+ messages in thread
From: Martin Mares @ 2005-09-17 12:28 UTC (permalink / raw)
To: "Martin v. Löwis"; +Cc: linux-kernel

Hello!

> I'm not (only) talking about /bin/sh. I'm primarily talking about
> /usr/bin/python, /usr/bin/perl, and /usr/bin/wish. In all these
> languages, the interpreter *does* care about the encoding.

Agreed. On the other hand, in all these languages you can pass the encoding
as a parameter to the interpreter, cannot you?

> In the future, the signature *will* carry no information. But the future
> is, well, in the future.
>
> I just can't understand why (some) people are so opposed to this patch.

Occam's razor?

Have a nice fortnight
--
Martin `MJ' Mares <mj@ucw.cz> http://atrey.karlin.mff.cuni.cz/~mj/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
"In accord to UNIX philosophy, PERL gives you enough rope to hang yourself." -- Larry Wall
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts
2005-09-17 12:28 ` Martin Mares
@ 2005-09-17 12:53 ` "Martin v. Löwis"
2005-09-17 13:05 ` Martin Mares
0 siblings, 1 reply; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-17 12:53 UTC (permalink / raw)
To: Martin Mares; +Cc: linux-kernel

Martin Mares wrote:
> Agreed. On the other hand, in all these languages you can pass the encoding
> as a parameter to the interpreter, cannot you?

Not in general, no. If you have a library of multiple modules, different
modules may have different encodings. In particular, if UTF-8 in source
code becomes more common (because it is better supported than now),
people will start using it for libraries. At the same time, a lot of
code is around that still uses other encodings (typically Latin-1).
So you may have two encodings in the same program (different modules);
that's why you need the encoding declared *in* the file.

Now, there are different ways to do that: you can find language-specific
ways (such as 'use utf8;'), and this is what most languages currently
do. However, this is a nightmare for editor developers, and a severe
inconvenience for script authors - who now have to put the encoding
declaration into the files.

With the UTF-8 signature, things become much simpler: editors can
automatically detect the presence of the signature, and need no
language-specific parsing. The language interpreters have a
guarantee that the signature is at the beginning of the file,
so they don't need to switch encodings in the middle of parsing.
Users can configure their editors to always write the signature
for certain types of files, and don't need to worry about putting
correct encoding declarations into the files.

>>In the future, the signature *will* carry no information. But the future
>>is, well, in the future.
>>
>>I just can't understand why (some) people are so opposed to this patch.
>
> Occam's razor?

Probably not literally, as we are not searching for an explanation of
some phenomenon. You are probably suggesting that people dislike the
feature because they see no need for it (as one poster stated it:
I don't use UTF-8, so I don't want that feature).

However, I do believe there is a need for the feature, and that the
gains by far outweigh the costs.

Regards,
Martin
^ permalink raw reply [flat|nested] 80+ messages in thread
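The "editor writes the signature, interpreter skips it" workflow described above can be sketched in Python with the standard `utf-8-sig` codec, which prepends EF BB BF when encoding and drops it (if present) when decoding. This is only an illustration of the idea, not a description of how any particular editor or interpreter is implemented; `decode_source` is a hypothetical helper.

```python
text = "#!/usr/bin/python\nprint('Lösung')\n"

# What an editor configured to write the signature would save.
with_signature = text.encode("utf-8-sig")
# The same text saved as plain UTF-8.
without_signature = text.encode("utf-8")

def decode_source(data: bytes) -> str:
    """Decode UTF-8 source text, dropping a leading signature if there is one."""
    return data.decode("utf-8-sig")

print(with_signature[:5])  # b'\xef\xbb\xbf#!'
print(decode_source(with_signature) == decode_source(without_signature))  # True
```

Because the signature can only appear at offset 0, the reader never has to switch encodings partway through the file.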
* Re: [Patch] Support UTF-8 scripts
2005-09-17 12:53 ` "Martin v. Löwis"
@ 2005-09-17 13:05 ` Martin Mares
2005-09-17 13:33 ` "Martin v. Löwis"
0 siblings, 1 reply; 80+ messages in thread
From: Martin Mares @ 2005-09-17 13:05 UTC (permalink / raw)
To: "Martin v. Löwis"; +Cc: linux-kernel

Hello!

> With the UTF-8 signature, things become much simpler: editors can
> automatically detect presence of the signature, and need no
> language-specific parsing.

I still think that this does solve only a completely insignificant part
of the problem. Given the zillion existing encodings, you are able to identify
UTF-8, leaving you with zillion-1 other encodings you are unable to deal with.

> Probably not literally, as we are not searching for an explanation of
> some phenomenon.

ACK, not literally.

> You are probably suggesting that people dislike the
> feature because they see no need for it (as one poster stated it:
> I don't use UTF-8, so I don't want that feature).

I see a need for a feature which would help identify the charset of the script,
but the patch in question obviously doesn't offer that -- it solves only a single
special case of the problem in a completely non-systematic way. This does not
sound right.

Have a nice fortnight
--
Martin `MJ' Mares <mj@ucw.cz> http://atrey.karlin.mff.cuni.cz/~mj/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
"How I need a drink, alcoholic in nature, after the tough chapters involving quantum mechanics!" = \pi
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts
2005-09-17 13:05 ` Martin Mares
@ 2005-09-17 13:33 ` "Martin v. Löwis"
0 siblings, 0 replies; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-17 13:33 UTC (permalink / raw)
To: Martin Mares; +Cc: linux-kernel

Martin Mares wrote:
> I still think that this does solve only a completely insignificant part
> of the problem. Given the zillion existing encodings, you are able to identify
> UTF-8, leaving you with zillion-1 other encodings you are unable to deal with.

Correct. This is a special case only. The more general problem is
already solved: both Python and Perl support source encodings in
the entire zillion encodings. As I explained, this general solution,
while being general, is also not very user-friendly.

Now, why does UTF-8 deserve to be a special case? One reason is that
it has the potential to replace the entire zillion of encodings over
time. However, this can only happen if tool support for this encoding
is really good. The patch contributes a (minor) fragment to the support -
it is a small patch only.

The other reason is that UTF-8 defines its own encoding declaration,
unlike most of the other zillion-1 encodings. So naturally, an
implementation that supports UTF-8 in this way cannot extend to
other encodings. hpa suggested that ISO-2022 would be a more
general mechanism, but pointed out that it hasn't been implemented
widely in the last 30 years, so it is unlikely that it will get much
better support in the next thirty years.

> I see a need for a feature which would help identify the charset of the script,
> but the patch in question obviously doesn't offer that -- it solves only a single
> special case of the problem in a completely non-systematic way. This does not
> sound right.

It's not a complete solution, but it *is* part of a general solution.
People have tried in the past to solve the general problem of "identify
the encoding of a text file", both in really general ways (iso-2022)
and in format-specific ways (perl, python). All these solutions are
tedious to use.

There is another general solution: gradually replace the zillion
encodings with a single one, namely Unicode (or, specifically, UTF-8).
This solution will only work when done gradually. Clearly, this patch
doesn't implement this solution entirely, but it contributes to it,
by making usage of UTF-8 in script files simpler. Many more
changes to other software (i.e. non-kernel changes) will be necessary
to implement this solution, as well as (obviously) changes to
existing files.

Regards,
Martin
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts
2005-09-17 12:25 ` "Martin v. Löwis"
2005-09-17 12:28 ` Martin Mares
@ 2005-09-19 7:08 ` Pavel Machek
2005-09-19 7:18 ` "Martin v. Löwis"
1 sibling, 1 reply; 80+ messages in thread
From: Pavel Machek @ 2005-09-19 7:08 UTC (permalink / raw)
To: Martin v. Löwis; +Cc: Martin Mares, linux-kernel

Hi!

> I just can't understand why (some) people are so opposed to this patch.
> It is a really trivial, straight-forward change. It introduces no
> policy, just a feature: you can put the UTF-8 signature in your script
> file, if you want to (and your scripting language supports it). By
> no means does it force you to put the UTF-8 signature in all your script
> files, let alone all your text files.

Why is binfmt_misc not enough for you?

Pavel
--
if you have sharp zaurus hardware you don't need... you know my address
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts
2005-09-19 7:08 ` Pavel Machek
@ 2005-09-19 7:18 ` "Martin v. Löwis"
2005-09-19 7:24 ` Pavel Machek
2005-09-19 23:49 ` Horst von Brand
0 siblings, 2 replies; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-19 7:18 UTC (permalink / raw)
To: Pavel Machek; +Cc: Martin Mares, linux-kernel

Pavel Machek wrote:
> Why is binfmt_misc not enough for you?

For two reasons: for one, it has the overhead of yet another
exec call. This is different from usages for, say, Java byte
code or Python byte code, where the registered interpreter already
is the eventual binary which has to be invoked anyway; for
a binfmt_misc application, you need an additional wrapper
which reinterprets the first line, and then invokes the eventual
interpreter.

The other reason is availability: as an author of a UTF-8
script, you would have to communicate to your users that they
need the right binfmt_misc wrapper installed (which they may
have to build first). While installing additional stuff to
run a single program is acceptable for large applications,
it is likely not for script files. To make the feature useful
in practice, it must be builtin.

Regards,
Martin
^ permalink raw reply [flat|nested] 80+ messages in thread
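To illustrate what "an additional wrapper which reinterprets the first line" could involve, here is a hedged Python sketch of the parsing half of such a binfmt_misc helper. Everything named here is hypothetical: the function `build_argv`, the wrapper path, and the rule name in the comment are assumptions made for this note, not anything proposed in the thread. The exec step is only described in a comment so that the block stays safe to run.

```python
UTF8_SIGNATURE = b"\xef\xbb\xbf"

# A binfmt_misc rule for such a wrapper might look roughly like this
# (hypothetical name and path; rule fields are
# :name:type:offset:magic:mask:interpreter:flags):
#   :utf8script:M::\xef\xbb\xbf#!::/usr/local/bin/utf8-script-wrapper:

def build_argv(script_path: str, first_line: bytes) -> list:
    """Turn '<signature>#!interp [arg]' into the argv a wrapper would exec."""
    if first_line.startswith(UTF8_SIGNATURE):
        first_line = first_line[len(UTF8_SIGNATURE):]
    if not first_line.startswith(b"#!"):
        raise ValueError("not a #! script")
    # Treat everything after the interpreter as one optional argument.
    parts = first_line[2:].strip().split(None, 1)
    if not parts:
        raise ValueError("empty interpreter")
    argv = [part.decode("utf-8").strip() for part in parts]
    argv.append(script_path)
    return argv

print(build_argv("hello.py", b"\xef\xbb\xbf#!/usr/bin/python -u\n"))
# A real wrapper would finish with os.execv(argv[0], argv), which is the
# extra exec call discussed above.
```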
* Re: [Patch] Support UTF-8 scripts
2005-09-19 7:18 ` "Martin v. Löwis"
@ 2005-09-19 7:24 ` Pavel Machek
2005-09-19 7:46 ` "Martin v. Löwis"
2005-09-19 10:48 ` Alan Cox
2005-09-19 23:49 ` Horst von Brand
1 sibling, 2 replies; 80+ messages in thread
From: Pavel Machek @ 2005-09-19 7:24 UTC (permalink / raw)
To: Martin v. Löwis; +Cc: Martin Mares, linux-kernel

On Mon 19-09-05 09:18:33, "Martin v. Löwis" wrote:
> Pavel Machek wrote:
> > Why is binfmt_misc not enough for you?
>
> For two reasons: for one, it has the overhead of yet another
> exec call. This is different from usages for, say, Java byte
> code or Python byte code, where the registered interpreter already
> is the eventual binary which has to be invoked anyway; for
> a binfmt_misc application, you need an additional wrapper
> which reinterprets the first line, and then invokes the eventual
> interpreter.

Who cares? exec is fast.

> The other reason is availability: as an author of a UTF-8
> script, you would have to communicate to your users that they
> need the right binfmt_misc wrapper installed (which they may
> have to build first). While installing additional stuff to
> run a single program is acceptable for large applications,
> it is likely not for script files. To make the feature useful
> in practice, it must be builtin.

This is a distribution problem, not a kernel problem.

"/bin/ls should be built into kernel, because otherwise you can't call
/bin/ls from script" is not an argument.

If UTF-8 compatibility is important, distros will get it right. If it
is not, you lose, but at least kernel is not messed up.

Pavel
--
if you have sharp zaurus hardware you don't need... you know my address
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts
2005-09-19 7:24 ` Pavel Machek
@ 2005-09-19 7:46 ` "Martin v. Löwis"
2005-09-19 7:50 ` Pavel Machek
2005-09-19 10:48 ` Alan Cox
1 sibling, 1 reply; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-19 7:46 UTC (permalink / raw)
To: Pavel Machek; +Cc: Martin Mares, linux-kernel

Pavel Machek wrote:
> If UTF-8 compatibility is important, distros will get it right. If it
> is not, you lose, but at least kernel is not messed up.

The patch doesn't mess up the kernel.

Regards,
Martin
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts
2005-09-19 7:46 ` "Martin v. Löwis"
@ 2005-09-19 7:50 ` Pavel Machek
0 siblings, 0 replies; 80+ messages in thread
From: Pavel Machek @ 2005-09-19 7:50 UTC (permalink / raw)
To: Martin v. Löwis; +Cc: Martin Mares, linux-kernel

On Mon 19-09-05 09:46:11, "Martin v. Löwis" wrote:
> Pavel Machek wrote:
> > If UTF-8 compatibility is important, distros will get it right. If it
> > is not, you lose, but at least kernel is not messed up.
>
> The patch doesn't mess up the kernel.

Every patch does. Except that yours does not because it is not
going in :-).

Pavel
--
if you have sharp zaurus hardware you don't need... you know my address
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts
2005-09-19 7:24 ` Pavel Machek
2005-09-19 7:46 ` "Martin v. Löwis"
@ 2005-09-19 10:48 ` Alan Cox
1 sibling, 0 replies; 80+ messages in thread
From: Alan Cox @ 2005-09-19 10:48 UTC (permalink / raw)
To: Pavel Machek; +Cc: Martin v. Löwis, Martin Mares, linux-kernel

On Mon, 2005-09-19 at 09:24 +0200, Pavel Machek wrote:
> > which reinterprets the first line, and then invokes the eventual
> > interpreter.
>
> Who cares? exec is fast.

It would be nice if it was, but exec + user space overhead of startup is
merely "faster than many equivalent systems". It's still slow.
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts
2005-09-19 7:18 ` "Martin v. Löwis"
2005-09-19 7:24 ` Pavel Machek
@ 2005-09-19 23:49 ` Horst von Brand
1 sibling, 0 replies; 80+ messages in thread
From: Horst von Brand @ 2005-09-19 23:49 UTC (permalink / raw)
To: "Martin v. Löwis"; +Cc: Pavel Machek, Martin Mares, linux-kernel

Martin v. Löwis <martin@v.loewis.de> wrote:
> Pavel Machek wrote:
> > Why is binfmt_misc not enough for you?

> For two reasons: for one, it has the overhead of yet another
> exec call.

For an interpreted language this is surely irrelevant.

[...]

> The other reason is availability: as an author of a UTF-8
> script, you would have to communicate to your users that they
> need the right binfmt_misc wrapper installed (which they may
> have to build first). While installing additional stuff to
> run a single program is acceptable for large applications,
> it is likely not for script files. To make the feature useful
> in practice, it must be builtin.

That is a distribution problem.
--
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts
[not found] ` <4Nu4p-5Js-3@gated-at.bofh.it>
@ 2005-09-16 20:41 ` "Martin v. Löwis"
2005-09-16 22:08 ` H. Peter Anvin
2005-09-16 22:45 ` Bernd Petrovitsch
0 siblings, 2 replies; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-16 20:41 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: linux-kernel

H. Peter Anvin wrote:
> You don't have markers (although they're defined, see ISO 2022) for your
> 8-bit encodings, and *THEY'RE THE ONES THAT NEED TO BE DISTINGUISHED.*
> Flagging UTF-8, especially with the BOM (as opposed to the ISO 2022
> signature, <ESC>%G) is pointless in the context, since you still can't
> distinguish your arbitrary number of legacy encodings.

In programming languages that support the notion of source encodings,
you do have markers for 8-bit encodings. For example, in Python, you
can specify

# -*- coding: iso-8859-1 -*-

to denote the source encoding. In Perl, you write

use encoding "latin-1";

(with 'use utf8;' being a special-case shortcut).

In Java, you can specify the encoding through the -encoding argument
to javac. In gcc, you use -finput-charset (with the special case of
-fexec-charset and -fwide-exec-charset potentially being different).

So you *must* use encoding declarations in some languages; the UTF-8
signature is a particularly convenient way of doing so, since it allows
for uniformity across languages, with no need for the text editors to
parse all the different programming languages.

Regards,
Martin
^ permalink raw reply [flat|nested] 80+ messages in thread
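The two Python-side mechanisms mentioned above (the `coding:` comment and the UTF-8 signature) can be observed with `tokenize.detect_encoding` from today's Python 3 standard library. This reflects current Python 3 behaviour rather than the 2005-era interpreter discussed in the thread, and the wrapper function name is an assumption of this note.

```python
import io
import tokenize

def source_encoding(source: bytes) -> str:
    """Report the encoding Python's tokenizer would assume for this source."""
    encoding, _first_lines = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding

# Encoding declared with the Emacs-style comment.
print(source_encoding(b"# -*- coding: iso-8859-1 -*-\nx = 1\n"))  # iso-8859-1
# Encoding declared by the UTF-8 signature alone.
print(source_encoding(b"\xef\xbb\xbfx = 1\n"))                    # utf-8-sig
# No declaration at all: Python 3 defaults to UTF-8.
print(source_encoding(b"x = 1\n"))                                # utf-8
```

The signature case needs no parsing of comments at all, which is the convenience the message argues for.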
* Re: [Patch] Support UTF-8 scripts
2005-09-16 20:41 ` "Martin v. Löwis"
@ 2005-09-16 22:08 ` H. Peter Anvin
2005-09-17 6:05 ` "Martin v. Löwis"
2005-09-16 22:45 ` Bernd Petrovitsch
1 sibling, 1 reply; 80+ messages in thread
From: H. Peter Anvin @ 2005-09-16 22:08 UTC (permalink / raw)
To: "Martin v. Löwis"; +Cc: linux-kernel

Martin v. Löwis wrote:
> In programming languages that support the notion of source encodings,
> you do have markers for 8-bit encodings. For example, in Python, you
> can specify
>
> # -*- coding: iso-8859-1 -*-
>
> to denote the source encoding. In Perl, you write
>
> use encoding "latin-1";
>
> (with 'use utf8;' being a special-case shortcut).
>
> In Java, you can specify the encoding through the -encoding argument
> to javac. In gcc, you use -finput-charset (with the special case of
> -fexec-charset and -fwide-exec-charset potentially being different).
>
> So you *must* use encoding declarations in some languages; the UTF-8
> signature is a particularly convenient way of doing so, since it allows
> for uniformity across languages, with no need for the text editors to
> parse all the different programming languages.

Did you miss the point? There has been a standard for marking for *30
years*, and virtually NO ONE (outside Japan) uses it.

-hpa
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts
2005-09-16 22:08 ` H. Peter Anvin
@ 2005-09-17 6:05 ` "Martin v. Löwis"
0 siblings, 0 replies; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-17 6:05 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: linux-kernel

H. Peter Anvin wrote:
> Did you miss the point? There has been a standard for marking for *30
> years*, and virtually NO ONE (outside Japan) uses it.

I understood that fact - but I fail to see the point. If you mean
to imply "people did not use ISO-2022, therefore, they will never
use encoding declarations", I think this implication is false.
People do use encoding declarations.

If you mean to imply "people did not use ISO-2022, therefore,
they will never use the UTF-8 signature", I think this implication
is also false. People do use the UTF-8 signature, even outside
Japan.

The primary reason is that the UTF-8 signature is much easier
to implement than ISO-2022: if you support UTF-8 in your tool
(say, a text editor), anyway, adding support for the UTF-8
signature is almost trivial. Therefore, many more editors support
the UTF-8 signature today than ever supported ISO-2022.

Regards,
Martin
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts
2005-09-16 20:41 ` "Martin v. Löwis"
2005-09-16 22:08 ` H. Peter Anvin
@ 2005-09-16 22:45 ` Bernd Petrovitsch
2005-09-17 6:20 ` "Martin v. Löwis"
1 sibling, 1 reply; 80+ messages in thread
From: Bernd Petrovitsch @ 2005-09-16 22:45 UTC (permalink / raw)
To: "Martin v. Löwis"; +Cc: H. Peter Anvin, linux-kernel

On Fri, 2005-09-16 at 22:41 +0200, "Martin v. Löwis" wrote:
[ Language-specific examples ]

And that's the only working way - the programming languages can actually
do it because they define the syntax and semantics of the contents
anyways.
With this marker you are interfering with (at least) *all* text files.
And thus with *all* tools which "handle" those text files.

> So you *must* use encoding declarations in some languages; the UTF-8

... if you absolutely want to use non-ASCII characters in the source
code. In most (if not all) of them there exists a native gettext()
interface ...

> signature is a particularly convenient way of doing so, since it allows
> for uniformity across languages, with no need for the text editors to
> parse all the different programming languages.

And there are always tools out there which simply do not understand the
generic marker and can not ignore it since these bytes are part of the
file.
And thus tools (and people) will kill those markers (for whatever reason,
even if it's simple ignorance) anyway.

Or another example: (Try to) start a perl/shell/... script (without
parameter on the first line) which was edited on Win* and binary copied
to a Unix system. Or at least guess what will happen ....

Bernd
--
Firmix Software GmbH http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
Embedded Linux Development and Services
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts
2005-09-16 22:45 ` Bernd Petrovitsch
@ 2005-09-17 6:20 ` "Martin v. Löwis"
2005-09-17 22:28 ` Bernd Petrovitsch
0 siblings, 1 reply; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-17 6:20 UTC (permalink / raw)
To: Bernd Petrovitsch
Cc: "Martin v. Löwis", H. Peter Anvin, linux-kernel

Bernd Petrovitsch wrote:
> On Fri, 2005-09-16 at 22:41 +0200, "Martin v. Löwis" wrote:
> [ Language-specific examples ]
>
> And that's the only working way - the programming languages can actually
> do it because they define the syntax and semantics of the contents
> anyways.

It works from the programming language point of view, but it is a mess
from the text editor point of view.

Even for the programming language, it is a pain to implement: what
if you have non-ASCII characters before the pragma that declares the
encoding? and so on.

> With this marker you are interfering with (at least) *all* text files.

Hmm. What does that have to do with the patch I'm proposing? This
patch does *not* interfere with all text files. It is only relevant
for executable files starting with the #! magic.

> And thus with *all* tools which "handle" those text files.

This is simply not true. My patch does not interfere with any such
tools. They continue to work just fine.

>>So you *must* use encoding declarations in some languages; the UTF-8
>
> ... if you absolutely want to use non-ASCII characters in the source
> code. In most (if not all) of them there exists a native gettext()
> interface ...

True. However, this is more tedious to use. Also, it doesn't apply to
all cases: e.g. if you have comments, documentation etc. in the source
code, gettext is no option. Likewise, people often want to use non-ASCII
in identifiers (e.g. class Lösung); this can also only work if you know
what the source encoding is.

You may argue that people just shouldn't do that, because it does not
work well, but this is not convincing: it doesn't work well because
language developers are too lazy to implement it. In fact, some languages
(C, C++, Java, C#) do support non-ASCII identifiers (at least in their
specifications); there really isn't a good reason not to support it in
scripting languages as well.

> And there are always tools out there which simply do not understand the
> generic marker and can not ignore it since these bytes are part of the
> file.

This conclusion is false. Many tools that don't understand the file
structure still can do their job on the files. So the fact that a tool
does not understand the structure does not necessarily imply that
the tool breaks when the structure changes.

> Or another example: (Try to) start a perl/shell/... script (without
> parameter on the first line) which was edited on Win* and binary copied
> to a Unix system. Or at least guess what will happen ....

For a Python script, I don't need to guess: It will just work.

Regards,
Martin
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts
2005-09-17 6:20 ` "Martin v. Löwis"
@ 2005-09-17 22:28 ` Bernd Petrovitsch
2005-09-18 7:23 ` "Martin v. Löwis"
0 siblings, 1 reply; 80+ messages in thread
From: Bernd Petrovitsch @ 2005-09-17 22:28 UTC (permalink / raw)
To: "Martin v. Löwis"; +Cc: H. Peter Anvin, linux-kernel

On Sat, 2005-09-17 at 08:20 +0200, "Martin v. Löwis" wrote:
> Bernd Petrovitsch wrote:
> > On Fri, 2005-09-16 at 22:41 +0200, "Martin v. Löwis" wrote:
> > [ Language-specific examples ]
> >
> > And that's the only working way - the programming languages can actually
> > do it because they define the syntax and semantics of the contents
> > anyways.
>
> It works from the programming language point of view, but it is a mess
> from the text editor point of view.

Most of the text editors have ways to markup the source files. Not even
the various editors are able to agree on one method for all, so why
could the (Linux) world agree on one for all text files?

> Even for the programming language, it is a pain to implement: what
> if you have non-ASCII characters before the pragma that declares the
> encoding? and so on.

That's the problem of the language definers who absolutely want such
(IMHO absolutely superfluous) features.

> > With this marker you are interfering with (at least) *all* text files.
>
> Hmm. What does that have to do with the patch I'm proposing? This
> patch does *not* interfere with all text files. It is only relevant
> for executable files starting with the #! magic.

It *does* interfere since scripts are also text files in every aspect.
So every feature you want for "scripts" you also get for text files (and
vice versa BTW).
If you think "script" and "text file" are different, define both of
them, please, otherwise a discussion is pointless.

> > And there are always tools out there which simply do not understand the
> > generic marker and can not ignore it since these bytes are part of the
> > file.
>
> This conclusion is false. Many tools that don't understand the file
> structure still can do their job on the files. So the fact that a tool
> does not understand the structure does not necessarily imply that
> the tool breaks when the structure changes.

It *may* break just because of some to-be-ignored inline marking due to
some questionable feature.
And *when* (not if) it breaks, it is probably cumbersome to find since
you have pretty unprintable characters. Let alone the confusion why the
size of a file with `ls -l` is different from the size in the editor or
a marker-aware `wc -c`.
So IMHO either you have a clear and visible marker or you have none at all.

> > Or another example: (Try to) start a perl/shell/... script (without
> > parameter on the first line) which was edited on Win* and binary copied
> > to a Unix system. Or at least guess what will happen ....
>
> For a Python script, I don't need to guess: It will just work.

Then write a short python script (with a "#!/usr/bin/python" line at the
start [without parameters]) natively on a Win*-system, copy it binary
over to an arbitrary Linux system and see what's happening.

Bernd
--
Firmix Software GmbH http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
Embedded Linux Development and Services
^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts
2005-09-17 22:28 ` Bernd Petrovitsch
@ 2005-09-18 7:23 ` "Martin v. Löwis"
2005-09-18 14:50 ` Bernd Petrovitsch
0 siblings, 1 reply; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-18 7:23 UTC (permalink / raw)
To: Bernd Petrovitsch
Cc: "Martin v. Löwis", H. Peter Anvin, linux-kernel

Bernd Petrovitsch wrote:
> Most of the text editors have ways to mark up the source files. Not even
> the various editors are able to agree on one method for all, so why
> could the (Linux) world agree on one for all text files?

You are ignoring the role of standardization. People invent their
own mechanism if a standard is missing (or virtually unimplementable).
For declaring encodings, there is no standard (except for ISO 2022,
which is really hard to implement correctly). Therefore, editor authors
create their own standards. At least Python abstained from creating
yet another standard, and instead supports both the declarations from
Emacs and vim. To some degree, it also supports notepad (namely through
the UTF-8 signature).

However, people are much more likely to agree on a technology when it
is defined by a recognized standards body. This is the case for the
UTF-8 signature, which is defined by the Unicode Consortium, for
precisely this purpose. Therefore, editors *will* agree on that
mechanism, while keeping their own mechanism for the more general
problem.

>>Even for the programming language, it is a pain to implement: what
>>if you have non-ASCII characters before the pragma that declares the
>>encoding? and so on.
>
>
> That's the problem of the language definers who absolutely want such
> (IMHO absolutely superfluous) features.

It's not the language designers who absolutely want this feature. It's
the language users. Of course, you'd have to be a language designer
to know that fact - language users go to the language designers asking
for the feature, not to the kernel developers.

>>Hmm. What does that have to do with the patch I'm proposing? This
>>patch does *not* interfere with all text files. It is only relevant
>>for executable files starting with the #! magic.
>
>
> It *does* interfere since scripts are also text files in every aspect.
> So every feature you want for "scripts" you also get for text files (and
> vice versa BTW).

The specific feature I get is that when I pass a file starting
with <utf8sig>#! to execve, Linux will execute the file following
the #!. In what way do I get this feature for text in general?
And if I do, why is that a problem?

> If you think "script" and "text file" are different, define both of
> them, please, otherwise a discussion is pointless.

A script file (in the context of this discussion) is a text file
that is executable (i.e. has the appropriate subset of
S_IXUSR|S_IXGRP|S_IXOTH set), starts with #!, and has the path
name of an executable file after the #!.

More generally, a script file is a text file written in a scripting
language. A scripting language is a programming language which
supports "direct" execution of source code. So in the more
general definition, a script file does not need to start with
#!; for the context of this discussion, we should restrict
attention to files actually affected by the patch.

>>This conclusion is false. Many tools that don't understand the file
>>structure still can do their job on the files. So the fact that a tool
>>does not understand the structure does not necessarily imply that
>>the tool breaks when the structure changes.
>
>
> It *may* break just because of some to-be-ignored inline marking due to
> some questionable feature.

Be more specific. For what specific kind of file will cat(1) break?

Unless cat(1) has a 2GB limitation, I very much doubt it will break
(i.e. fail to do its job, "concatenate files and print on the standard
output") for any kind of input - whether this is text files, binary
files, images, sound files, HTML files. cat always does what it is
designed to do.

> Let alone the confusion why the size of a file with `ls -l` is different
> from the size in the editor or a marker-aware `wc -c`.

This is true for any UTF-8 file, or any multibyte encoding. For any
multibyte encoding, the number of bytes in the file is different from
the number of characters. That doesn't (and shouldn't) stop people from
using multi-byte encodings.

What the editor displays as the number of "things" is up to the editor.
The output of wc -c will always be the same as the one of ls -l,
as wc -c does *not* give you characters:

    -c, --bytes
        print the byte counts

You might have been thinking of 'wc -m'.

>>For a Python script, I don't need to guess: It will just work.
>
>
> Then write a short python script (with a "#!/usr/bin/python" line at the
> start [without parameters]) natively on a Win*-system, copy it binary
> over to an arbitrary Linux system and see what's happening.

It depends on the editor I use, of course: the kernel will consider any
CR after the "n" (the end of "/usr/bin/python") as part of the
interpreter name. Not sure what this has to do with the specific patch,
though.

Regards,
Martin

^ permalink raw reply [flat|nested] 80+ messages in thread
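The distinction drawn in the message above between byte counts (wc -c, ls -l) and character counts (wc -m in a UTF-8 locale) can be illustrated with a minimal sketch. The function name utf8_count_chars is hypothetical, not taken from any tool mentioned in the thread, and the sketch assumes well-formed UTF-8 input without validating it.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count UTF-8 code points in buf[0..len).
 *
 * Every code point contains exactly one byte that is NOT a
 * continuation byte (bit pattern 10xxxxxx), so counting those bytes
 * counts code points. Assumption: the input is well-formed UTF-8;
 * nothing is validated here.
 */
size_t utf8_count_chars(const unsigned char *buf, size_t len)
{
    size_t i, chars = 0;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++)
        if ((buf[i] & 0xC0) != 0x80)
            chars++;
    return chars;
}
```

For the six bytes EF BB BF '#' '!' LF the byte count is 6 while this function reports 4 code points: the signature counts as one character (U+FEFF) but occupies three bytes.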
* Re: [Patch] Support UTF-8 scripts
2005-09-18 7:23 ` "Martin v. Löwis"
@ 2005-09-18 14:50 ` Bernd Petrovitsch
0 siblings, 0 replies; 80+ messages in thread
From: Bernd Petrovitsch @ 2005-09-18 14:50 UTC (permalink / raw)
To: "Martin v. Löwis"; +Cc: H. Peter Anvin, linux-kernel

On Sun, 2005-09-18 at 09:23 +0200, "Martin v. Löwis" wrote:
[...]
> >>Hmm. What does that have to do with the patch I'm proposing? This
> >>patch does *not* interfere with all text files. It is only relevant
> >>for executable files starting with the #! magic.
> >
> > It *does* interfere since scripts are also text files in every aspect.
> > So every feature you want for "scripts" you also get for text files (and
> > vice versa BTW).
>
> The specific feature I get is that when I pass a file starting
> with <utf8sig>#! to execve, Linux will execute the file following
> the #!. In what way do I get this feature for text in general?
> And if I do, why is that a problem?

After applying this patch it seems that "Linux" is supporting this
marker officially in general - especially if the kernel supports it.
I suppose the next kernel patch is to support Win-like CR-LF sequences
(which is not the case AFAIK).
BTW even if some standards body thinks that this is the way to go, it
raises more problems and questions than it resolves.

> > If you think "script" and "text file" are different, define both of
> > them, please, otherwise a discussion is pointless.
>
> A script file (in the context of this discussion) is a text file
> that is executable (i.e. has the appropriate subset of
> S_IXUSR|S_IXGRP|S_IXOTH set), starts with #!, and has the path
> name of an executable file after the #!.
>
> More generally, a script file is a text file written in a scripting
> language. A scripting language is a programming language which
> supports "direct" execution of source code. So in the more
> general definition, a script file does not need to start with
> #!; for the context of this discussion, we should restrict
> attention to files actually affected by the patch.

And though scripts are usually edited/changed/"parsed"/... with a text
editor, it is not always the case.
Therefore the automatic extension to *all text files* (especially as the
marker basically applies to all text files, not only scripts).

You want to focus just on your patch and ignore the directly implied
potential problems arising ...

[...]
> > It *may* break just because of some to-be-ignored inline marking due to
> > some questionable feature.
>
> Be more specific. For what specific kind of file will cat(1) break?

`cat` as such will not break.

> Unless cat(1) has a 2GB limitation, I very much doubt it will break
> (i.e. fail to do its job, "concatenate files and print on the standard
> output") for any kind of input - whether this is text files, binary
> files, images, sound files, HTML files. cat always does what it is
> designed to do.

Apparently I have to repeat: If you do `cat a.txt b.txt >c.txt` where
a.txt and b.txt have this marker, then c.txt has the marker of b.txt
somewhere in the middle. Does this make sense in any way?
How do I get rid of the marker in the middle transparently?

> > Let alone the confusion why the size of a file with `ls -l` is different
> > from the size in the editor or a marker-aware `wc -c`.
>
> This is true for any UTF-8 file, or any multibyte encoding. For any
> multibyte encoding, the number of bytes in the file is different from
> the number of characters. That doesn't (and shouldn't) stop people from
> using multi-byte encodings.

It is different even if a pure ASCII file is marked as UTF-8.
And sure, the problem exists in general with multi-byte encodings.

> What the editor displays as the number of "things" is up to the editor.
> The output of wc -c will always be the same as the one of ls -l,
> as wc -c does *not* give you characters:
>
>     -c, --bytes
>         print the byte counts
>
> You might have been thinking of 'wc -m'.

It depends on the definition of "character". There are other standards
which define "character" as "byte".

[...]
> > Then write a short python script (with a "#!/usr/bin/python" line at the
> > start [without parameters]) natively on a Win*-system, copy it binary
> > over to an arbitrary Linux system and see what's happening.
>
> It depends on the editor I use, of course: the kernel will consider any

No, more on the OS the editor runs on.

> CR after the n as part of the interpreter name. Not sure what this has

ACK.

> to do with the specific patch, though.

It is not supported by the kernel. So either you remove it or you make
some compatibility hack (like an appropriate sym-link, etc.).
Since the kernel can start java classes directly, you can probably make
a similar thing for the UTF-8 stuff.

Bernd
--
Firmix Software GmbH http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
Embedded Linux Development and Services

^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [Patch] Support UTF-8 scripts
[not found] ` <4NsOZ-3YF-9@gated-at.bofh.it>
[not found] ` <4NsYH-4bv-27@gated-at.bofh.it>
@ 2005-09-17 6:45 ` "Martin v. Löwis"
1 sibling, 0 replies; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-17 6:45 UTC (permalink / raw)
To: 7eggert; +Cc: linux-kernel

Bodo Eggert wrote:
> BTW2: However, I don't like the patch.
>
> I'd first check for a utf-8 signature, and if it's found, adjust the
> buffer offset by 3. Then I'd run the old code checking for the sh_bang.
> OTOH, I just read the patch and not the .c file, maybe (unlikely) my idea
> wouldn't work correctly.

I believe this wouldn't work. binfmt_script currently has the code

    for (cp = bprm->buf+2; (*cp == ' ') || (*cp == '\t'); cp++);

to get out the (start of the) interpreter file name. This knows
implicitly that you need to skip two bytes #!; for UTF-8 signatures,
it would be 5 bytes.

Now, if you meant to suggest that bprm->buf should be adjusted (e.g.
through 'bprm->buf += 3'): this cannot work, either. It would break
subsequent binfmt modules which assume that bprm->buf is the first
1KiB (or so) of the file to be executed.

If you suggest that the patch should merely check for the signature,
and then skip it: this is what the patch does.

Regards,
Martin

P.S. I just noticed there is a

    bprm->buf[BINPRM_BUF_SIZE - 1] = '\0';

which seems incorrect: it puts a null-byte into the buffer data,
thus (slightly) corrupting the data for subsequent binfmt modules
(although it already knows the file starts with #!, so the
subsequent modules will fail, anyway).

Also, I think the above loop should also terminate for
*cp == '\0' if there is neither a space nor a tab in the file.

^ permalink raw reply [flat|nested] 80+ messages in thread
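The approach described in the message above (check for the signature, then start the interpreter-name scan at offset 5 instead of 2, leaving the buffer itself untouched) can be sketched in userspace. This is not the kernel code: the function script_interp_offset and the constant UTF8_SIG are hypothetical names, and the buffer is passed with an explicit length rather than relying on a terminating null byte.

```c
#include <stddef.h>
#include <string.h>

/* The UTF-8 signature (encoded U+FEFF). */
static const unsigned char UTF8_SIG[3] = { 0xEF, 0xBB, 0xBF };

/*
 * Return the offset of the first byte of the interpreter path in
 * buf[0..len), or -1 if buf does not start with "#!" (optionally
 * preceded by the UTF-8 signature).
 *
 * buf is only read, never modified or advanced, so another format
 * handler inspecting the same buffer afterwards still sees the
 * original first bytes of the file.
 */
long script_interp_offset(const char *buf, size_t len)
{
    size_t pos = 0;

    /* Optional signature: skip 3 bytes if present. */
    if (len >= 3 && memcmp(buf, UTF8_SIG, 3) == 0)
        pos = 3;

    /* Mandatory "#!" magic. */
    if (len < pos + 2 || buf[pos] != '#' || buf[pos + 1] != '!')
        return -1;
    pos += 2;

    /* Skip blanks before the interpreter name; the length check
     * bounds the scan even when no terminator is present. */
    while (pos < len && (buf[pos] == ' ' || buf[pos] == '\t'))
        pos++;

    return (long)pos;
}
```

Returning an offset instead of adjusting the buffer pointer is the design point here: the caller keeps the original buffer and indexes into it.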
[parent not found: <4NXfZ-5P0-1@gated-at.bofh.it>]
[parent not found: <4NYlM-7i0-5@gated-at.bofh.it>]
[parent not found: <4Olip-6HH-13@gated-at.bofh.it>]
* Re: [Patch] Support UTF-8 scripts
[not found] ` <4Olip-6HH-13@gated-at.bofh.it>
@ 2005-09-19 4:41 ` "Martin v. Löwis"
0 siblings, 0 replies; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-19 4:41 UTC (permalink / raw)
To: D. Hazelton, linux-kernel

D. Hazelton wrote:
>>I would need to write a compiled C program to do all
>>sorts of fragile hackish things like calling a script
>>/sbin/init.sh.
>
>
> Problem is, the program
> would not be fragile or hackish - it'd be almost as simple as a
> "hello world" program.
>
> #include <unistd.h>
>
> int main() {
> /* if this fails the system is busted anyway */
> return execve( "/bin/sh", "/sbin/init.sh", 0 );
> };

This attempt nicely illustrates Kyle's point. This program *is*
fragile and hackish. It is fragile because, even though it is only
five lines, it contains two major bugs:
1. execve takes an argv array, not a null-terminated list of
   strings. So this compiles with a warning about incompatible
   pointer types; you meant to use execl(3).
2. In the exec family, the path to the program is different from
   argv[0]. So the correct line would be
   return execl("/bin/sh", "sh", "/sbin/init.sh", (char *)0);

It is hackish, because it also lacks a feature commonly found in
such wrappers:
3. arguments passed to the wrapper are not forwarded to the
   executable. In particular, init takes several arguments
   (e.g. the runlevel), which should be forwarded to the
   final executable.

Just try completing the wrapper on your own.

Regards,
Martin

^ permalink raw reply [flat|nested] 80+ messages in thread
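As an illustration of the three points in the message above (an argv array for the exec call, argv[0] distinct from the program path, and forwarding of the wrapper's own arguments), one possible shape of such a wrapper is sketched below. The helper names build_sh_argv and exec_script are hypothetical; nobody in the thread posted this code, and it is only one way to complete the exercise.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Build the argument vector for "/bin/sh <script> <forwarded args>".
 * The wrapper's own argv[0] is replaced by "sh"; argv[1..argc-1] are
 * forwarded unchanged. Returns a malloc()ed, NULL-terminated array
 * (the strings themselves are not copied), or NULL on allocation
 * failure.
 */
char **build_sh_argv(const char *script, int argc, char **argv)
{
    int extra = argc > 1 ? argc - 1 : 0;
    char **out;
    int i;

    assert(script != NULL);
    out = malloc((size_t)(extra + 3) * sizeof(*out));
    if (out == NULL)
        return NULL;

    out[0] = "sh";              /* argv[0] as seen by the shell */
    out[1] = (char *)script;    /* script for the shell to run */
    for (i = 0; i < extra; i++)
        out[2 + i] = argv[1 + i];
    out[2 + extra] = NULL;      /* execv() needs the terminator */
    return out;
}

/* Replace the current process with the script. Returns (with a
 * non-zero status) only if allocation or execv() failed. */
int exec_script(const char *script, int argc, char **argv)
{
    char **sh_argv = build_sh_argv(script, argc, argv);

    if (sh_argv == NULL)
        return 1;
    execv("/bin/sh", sh_argv);
    free(sh_argv);              /* reached only when execv() failed */
    return 1;
}
```

A main() for the wrapper would then simply return exec_script("/sbin/init.sh", argc, argv), so that an invocation with the argument 3 reaches the script as its first argument.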
[parent not found: <4NVHm-3yE-13@gated-at.bofh.it>]
[parent not found: <4NVHm-3yE-15@gated-at.bofh.it>]
[parent not found: <4NVHm-3yE-17@gated-at.bofh.it>]
[parent not found: <4NVHm-3yE-19@gated-at.bofh.it>]
[parent not found: <4NVHm-3yE-21@gated-at.bofh.it>]
[parent not found: <4NVHm-3yE-23@gated-at.bofh.it>]
[parent not found: <4NVHm-3yE-25@gated-at.bofh.it>]
[parent not found: <4NVHm-3yE-27@gated-at.bofh.it>]
[parent not found: <4NVHm-3yE-29@gated-at.bofh.it>]
[parent not found: <4NVHm-3yE-31@gated-at.bofh.it>]
[parent not found: <4NVHn-3yE-33@gated-at.bofh.it>]
[parent not found: <4NVHn-3yE-35@gated-at.bofh.it>]
[parent not found: <4NVHn-3yE-37@gated-at.bofh.it>]
[parent not found: <4NVHn-3yE-39@gated-at.bofh.it>]
[parent not found: <4Od1x-3e3-5@gated-at.bofh.it>]
[parent not found: <4Od1x-3e3-7@gated-at.bofh.it>]
[parent not found: <4Od1w-3e3-3@gated-at.bofh.it>]
[parent not found: <4OfZo-7AG-21@gated-at.bofh.it>]
* Re: [Patch] Support UTF-8 scripts
[not found] ` <4OfZo-7AG-21@gated-at.bofh.it>
@ 2005-09-19 5:11 ` "Martin v. Löwis"
0 siblings, 0 replies; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-19 5:11 UTC (permalink / raw)
To: Valdis.Kletnieks, linux-kernel

Valdis.Kletnieks@vt.edu wrote:
> For the benefit of those of us who are interested in the problem, but aren't
> in the mood to wade through a long standard looking for the answer to a
> specific question, can you elaborate?

See

http://www.unicode.org/faq/utf_bom.html#38

> It isn't as obvious as all that, because of all the nasty corner cases...

It really depends on the specific structure of the text file. For
Python scripts, the Python interpreter will reject a U+FEFF in the
middle of the file as a syntax error (*). This is, IMO, a reasonable
reaction: you just shouldn't concatenate Python scripts blindly.
They may have different source encodings, so any concatenation
of Python scripts needs to convert them both into a common
encoding. The first script may also fail to terminate with a
newline, so concatenating Python scripts also needs to insert
a line break. In addition, you would also typically want to
remove the docstring in the second file.

The same holds for many other formats: for example, you cannot
blindly concatenate XML files, either (the result often won't
be an XML file). So treating the BOM as an error causes no
additional problem there.

> Given a file "a.txt" that's pure ASCII, and a file "b.txt" that has the BOM
> marker on it, what happens when you do "cat a.txt b.txt > c.txt"?

You answer the question yourself correctly:

> 'cat' doesn't know, and has no way of knowing, that c.txt needs a BOM at the
> *front* of the file until it's already written past the point in c.txt where
> the BOM has to go.
>
> What does the Unicode standard say to do in this case?

The point is that the BOM *also* is a regular character, U+FEFF.
It used to have a specific function, too, but now U+2060
(WORD JOINER) should be used for that function. So U+FEFF is
exclusively used for the BOM now. If you see it in the middle
of a file, you know it doesn't belong there (*). In processing
the file, you can complain, you can ignore it, and you can
choose to strip it off. Which of these you do depends on the
application; if you don't know better, treating it as
ZERO WIDTH NON-BREAKING SPACE is the recommended reaction.

Regards,
Martin

(*) unless it occurs in a string literal, in which case it
becomes part of the string. In the case of concatenating two
Python files, it won't be part of a string literal, though,
but instead occur at the beginning of a line.

^ permalink raw reply [flat|nested] 80+ messages in thread
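The options listed in the message above for a U+FEFF found after concatenation (complain, ignore, or strip) all start with locating it. A minimal sketch of that first step follows; the function names leading_bom_len and find_interior_bom are hypothetical, and the sketch assumes well-formed UTF-8, in which the byte sequence EF BB BF can only be an encoded U+FEFF. What to do with a hit is left to the application, as the message says.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* U+FEFF encoded as UTF-8. */
static const unsigned char UTF8_BOM[3] = { 0xEF, 0xBB, 0xBF };

/* Number of bytes to skip at the start of buf: 3 if it opens with
 * the UTF-8 signature, 0 otherwise. */
size_t leading_bom_len(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    return (len >= 3 && memcmp(buf, UTF8_BOM, 3) == 0) ? 3 : 0;
}

/*
 * Offset of the first U+FEFF that is NOT at the very start of the
 * buffer, or -1 if there is none. Such a character is what is left
 * in the middle of c.txt after `cat a.txt b.txt > c.txt` when b.txt
 * carries the signature.
 */
long find_interior_bom(const unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 1; i + 3 <= len; i++)
        if (memcmp(buf + i, UTF8_BOM, 3) == 0)
            return (long)i;
    return -1;
}
```

An application that chooses to complain would report the returned offset; one that chooses to strip would remove the three bytes at that offset and search again.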
end of thread, other threads:[~2005-09-20 3:28 UTC | newest]
Thread overview: 80+ messages
2005-08-13 12:07 [Patch] Support UTF-8 scripts "Martin v. Löwis"
2005-08-13 16:35 ` Stephen Pollei
2005-08-13 18:42 ` Lee Revell
2005-08-13 18:49 ` Hugo Mills
2005-08-13 18:53 ` Lee Revell
2005-08-14 0:57 ` Alan Cox
2005-08-14 1:19 ` Kyle Moffett
2005-08-14 1:40 ` Lee Revell
2005-08-14 10:40 ` Wichert Akkerman
2005-08-13 19:20 ` Lee Revell
2005-08-16 9:46 ` Jan Engelhardt
2005-08-14 0:53 ` Alan Cox
2005-08-14 4:10 ` James Cloos
2005-08-14 6:18 ` Jason L Tibbitts III
[not found] ` <feed8cdd050814125845fe4e2e@mail.gmail.com>
2005-08-14 19:59 ` Lee Revell
2005-08-14 20:13 ` Stephen Pollei
2005-08-14 20:22 ` Lee Revell
2005-08-14 22:10 ` "Martin v. Löwis"
2005-08-14 23:55 ` Alan Cox
2005-08-16 13:56 ` David Madore
[not found] ` <mailman.1124063520.13257.linux-kernel2news@redhat.com>
2005-08-16 20:17 ` Pete Zaitcev
2005-08-14 21:52 ` Kyle Moffett
2005-08-14 22:12 ` Valdis.Kletnieks
2005-08-15 8:01 ` Helge Hafting
2005-08-31 23:27 ` H. Peter Anvin
[not found] <4B2ZV-2dl-7@gated-at.bofh.it>
[not found] ` <4HKbZ-Cx-37@gated-at.bofh.it>
2005-09-15 18:24 ` "Martin v. Löwis"
2005-09-15 18:25 ` H. Peter Anvin
2005-09-15 18:39 ` "Martin v. Löwis"
2005-09-15 19:20 ` H. Peter Anvin
2005-09-16 8:13 ` Bernd Petrovitsch
[not found] <4N6EL-4Hq-3@gated-at.bofh.it>
[not found] ` <4N6EL-4Hq-5@gated-at.bofh.it>
[not found] ` <4N6EK-4Hq-1@gated-at.bofh.it>
[not found] ` <4N6EX-4Hq-27@gated-at.bofh.it>
[not found] ` <4N6Ox-4Ts-33@gated-at.bofh.it>
[not found] ` <4N7AS-67L-3@gated-at.bofh.it>
2005-09-16 18:02 ` Bodo Eggert
2005-09-16 18:09 ` H. Peter Anvin
2005-09-16 18:57 ` Bodo Eggert
2005-09-16 19:08 ` Martin Mares
2005-09-16 19:25 ` H. Peter Anvin
2005-09-16 19:57 ` Horst von Brand
[not found] ` <200509170028.59973.dhazelton@enter.net>
2005-09-17 6:28 ` "Martin v. Löwis"
2005-09-17 22:31 ` D. Hazelton
2005-09-18 3:45 ` Kyle Moffett
2005-09-19 0:14 ` D. Hazelton
2005-09-18 6:58 ` "Martin v. Löwis"
2005-09-19 0:31 ` D. Hazelton
2005-09-17 17:16 ` Bodo Eggert
[not found] <4Nvab-7o5-11@gated-at.bofh.it>
[not found] ` <4Nvab-7o5-13@gated-at.bofh.it>
[not found] ` <4Nvab-7o5-15@gated-at.bofh.it>
[not found] ` <4Nvab-7o5-17@gated-at.bofh.it>
[not found] ` <4Nvab-7o5-19@gated-at.bofh.it>
[not found] ` <4Nvab-7o5-21@gated-at.bofh.it>
[not found] ` <4Nvab-7o5-23@gated-at.bofh.it>
[not found] ` <4Nvab-7o5-25@gated-at.bofh.it>
[not found] ` <4Nvab-7o5-27@gated-at.bofh.it>
[not found] ` <4NvjM-7CU-7@gated-at.bofh.it>
[not found] ` <4NvjM-7CU-5@gated-at.bofh.it>
[not found] ` <4NxbR-20S-1@gated-at.bofh.it>
[not found] ` <4NEn7-3M5-7@gated-at.bofh.it>
[not found] ` <4NTvO-yJ-13@gated-at.bofh.it>
2005-09-18 0:53 ` Bodo Eggert
2005-09-18 16:53 ` Bernd Petrovitsch
[not found] ` <4O1MJ-3Hf-5@gated-at.bofh.it>
[not found] ` <4O8Oh-5jp-7@gated-at.bofh.it>
2005-09-18 19:23 ` Bodo Eggert
2005-09-18 21:03 ` Bernd Petrovitsch
2005-09-19 19:37 ` Bodo Eggert
2005-09-18 22:29 ` Valdis.Kletnieks
2005-09-19 6:03 ` H. Peter Anvin
2005-09-19 4:54 ` "Martin v. Löwis"
2005-09-19 8:26 ` Bernd Petrovitsch
2005-09-19 9:00 ` Valdis.Kletnieks
2005-09-19 9:41 ` Bernd Petrovitsch
2005-09-19 21:40 ` "Martin v. Löwis"
[not found] <4NsP0-3YF-11@gated-at.bofh.it>
[not found] ` <4NsP0-3YF-13@gated-at.bofh.it>
[not found] ` <4NsP0-3YF-15@gated-at.bofh.it>
[not found] ` <4NsP0-3YF-17@gated-at.bofh.it>
[not found] ` <4NsP1-3YF-19@gated-at.bofh.it>
[not found] ` <4NsP1-3YF-21@gated-at.bofh.it>
[not found] ` <4NsOZ-3YF-9@gated-at.bofh.it>
[not found] ` <4NsYH-4bv-27@gated-at.bofh.it>
[not found] ` <4NtBr-4WU-3@gated-at.bofh.it>
[not found] ` <4NtL0-5lQ-13@gated-at.bofh.it>
2005-09-16 20:34 ` "Martin v. Löwis"
2005-09-17 12:01 ` Martin Mares
2005-09-17 12:25 ` "Martin v. Löwis"
2005-09-17 12:28 ` Martin Mares
2005-09-17 12:53 ` "Martin v. Löwis"
2005-09-17 13:05 ` Martin Mares
2005-09-17 13:33 ` "Martin v. Löwis"
2005-09-19 7:08 ` Pavel Machek
2005-09-19 7:18 ` "Martin v. Löwis"
2005-09-19 7:24 ` Pavel Machek
2005-09-19 7:46 ` "Martin v. Löwis"
2005-09-19 7:50 ` Pavel Machek
2005-09-19 10:48 ` Alan Cox
2005-09-19 23:49 ` Horst von Brand
[not found] ` <4Nu4p-5Js-3@gated-at.bofh.it>
2005-09-16 20:41 ` "Martin v. Löwis"
2005-09-16 22:08 ` H. Peter Anvin
2005-09-17 6:05 ` "Martin v. Löwis"
2005-09-16 22:45 ` Bernd Petrovitsch
2005-09-17 6:20 ` "Martin v. Löwis"
2005-09-17 22:28 ` Bernd Petrovitsch
2005-09-18 7:23 ` "Martin v. Löwis"
2005-09-18 14:50 ` Bernd Petrovitsch
2005-09-17 6:45 ` "Martin v. Löwis"
[not found] ` <4NXfZ-5P0-1@gated-at.bofh.it>
[not found] ` <4NYlM-7i0-5@gated-at.bofh.it>
[not found] ` <4Olip-6HH-13@gated-at.bofh.it>
2005-09-19 4:41 ` "Martin v. Löwis"
[not found] <4NVHm-3yE-13@gated-at.bofh.it>
[not found] ` <4NVHm-3yE-15@gated-at.bofh.it>
[not found] ` <4NVHm-3yE-17@gated-at.bofh.it>
[not found] ` <4NVHm-3yE-19@gated-at.bofh.it>
[not found] ` <4NVHm-3yE-21@gated-at.bofh.it>
[not found] ` <4NVHm-3yE-23@gated-at.bofh.it>
[not found] ` <4NVHm-3yE-25@gated-at.bofh.it>
[not found] ` <4NVHm-3yE-27@gated-at.bofh.it>
[not found] ` <4NVHm-3yE-29@gated-at.bofh.it>
[not found] ` <4NVHm-3yE-31@gated-at.bofh.it>
[not found] ` <4NVHn-3yE-33@gated-at.bofh.it>
[not found] ` <4NVHn-3yE-35@gated-at.bofh.it>
[not found] ` <4NVHn-3yE-37@gated-at.bofh.it>
[not found] ` <4NVHn-3yE-39@gated-at.bofh.it>
[not found] ` <4Od1x-3e3-5@gated-at.bofh.it>
[not found] ` <4Od1x-3e3-7@gated-at.bofh.it>
[not found] ` <4Od1w-3e3-3@gated-at.bofh.it>
[not found] ` <4OfZo-7AG-21@gated-at.bofh.it>
2005-09-19 5:11 ` "Martin v. Löwis"