public inbox for linux-kernel@vger.kernel.org
* [Patch] Support UTF-8 scripts
@ 2005-08-13 12:07 "Martin v. Löwis"
  2005-08-13 16:35 ` Stephen Pollei
  2005-08-31 23:27 ` H. Peter Anvin
  0 siblings, 2 replies; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-08-13 12:07 UTC (permalink / raw)
  To: linux-kernel

This patch adds support for UTF-8 signatures (aka BOM, byte order
mark) to binfmt_script. Files that start with EF BB BF # ! are now
recognized as scripts (in addition to files starting with # !).

With such support, creating scripts that reliably carry non-ASCII
characters is simplified. Editors and the script interpreter can
easily agree on what the encoding of the script is, and the
interpreter can then render strings appropriately. Currently,
Python supports source files that start with the UTF-8 signature;
the approach would naturally extend to Perl to enhance/replace
the "use utf8" pragma. Likewise, Tcl could use the UTF-8 signature
to reliably identify UTF-8 source code (instead of assuming
[encoding system] for source code).

Please find the patch attached below.

Regards,
Martin

Signed-off-by: Martin v. Löwis <martin@v.loewis.de>

diff --git a/fs/binfmt_script.c b/fs/binfmt_script.c
--- a/fs/binfmt_script.c
+++ b/fs/binfmt_script.c
@@ -1,7 +1,7 @@
 /*
  *  linux/fs/binfmt_script.c
  *
- *  Copyright (C) 1996  Martin von Löwis
+ *  Copyright (C) 1996, 2005  Martin von Löwis
  *  original #!-checking implemented by tytso.
  */

@@ -23,7 +23,16 @@ static int load_script(struct linux_binp
        char interp[BINPRM_BUF_SIZE];
        int retval;

-       if ((bprm->buf[0] != '#') || (bprm->buf[1] != '!') || (bprm->sh_bang))
+       /* It is a recursive invocation. */
+       if (bprm->sh_bang)
+               return -ENOEXEC;
+
+       /* It starts neither with #!, nor with #! preceded by
+          the UTF-8 signature. */
+       if (!(((bprm->buf[0] == '#') && (bprm->buf[1] == '!'))
+             || ((bprm->buf[0] == '\xef') && (bprm->buf[1] == '\xbb')
+                 && (bprm->buf[2] == '\xbf') && (bprm->buf[3] == '#')
+                 && (bprm->buf[4] == '!'))))
                return -ENOEXEC;
        /*
         * This section does the #! interpretation.
@@ -46,7 +55,8 @@ static int load_script(struct linux_binp
                else
                        break;
        }
-       for (cp = bprm->buf+2; (*cp == ' ') || (*cp == '\t'); cp++);
+       cp = (bprm->buf[0]=='\xef') ? bprm->buf+5 : bprm->buf+2;
+       while ((*cp == ' ') || (*cp == '\t')) cp++;
        if (*cp == '\0')
                return -ENOEXEC; /* No interpreter name found */
        i_name = cp;
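
The check made by the patch above can be sketched as a stand-alone
user-space function. This is an illustration only, not kernel code: the
helper name shebang_offset and its explicit length parameter are
assumptions made for the example (the kernel works on the fixed-size
bprm->buf instead).

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the patch.
 * Returns the offset of the first byte after "#!":
 *   2 if the buffer starts with "#!",
 *   5 if it starts with the UTF-8 signature EF BB BF followed by "#!",
 *   0 if it does not look like a script at all.
 */
static size_t shebang_offset(const unsigned char *buf, size_t len)
{
    static const unsigned char sig[] = { 0xEF, 0xBB, 0xBF };

    if (len >= 2 && buf[0] == '#' && buf[1] == '!')
        return 2;
    if (len >= 5 && memcmp(buf, sig, sizeof sig) == 0 &&
        buf[3] == '#' && buf[4] == '!')
        return 5;
    return 0;
}
```

The returned offset plays the role of the bprm->buf+2 versus bprm->buf+5
choice in the second hunk: the caller would skip blanks from there to
find the interpreter name.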


* Re: [Patch] Support UTF-8 scripts
  2005-08-13 12:07 "Martin v. Löwis"
@ 2005-08-13 16:35 ` Stephen Pollei
  2005-08-13 18:42   ` Lee Revell
  2005-08-31 23:27 ` H. Peter Anvin
  1 sibling, 1 reply; 80+ messages in thread
From: Stephen Pollei @ 2005-08-13 16:35 UTC (permalink / raw)
  To: Martin v. Löwis; +Cc: linux-kernel

On 8/13/05, "Martin v. Löwis" <martin@v.loewis.de> wrote:
> This patch adds support for UTF-8 signatures (aka BOM, byte order
> mark) to binfmt_script. 

> With such support, creating scripts that reliably carry non-ASCII
> characters is simplified. 
> the approach would naturally extend to Perl to enhance/replace
> the "use utf8" pragma. 

That's great for the perl6 people.
http://dev.perl.org/perl6/doc/design/syn/S03.html says they are going
to be using « and » as operators... So I'd imagine that a lot of perl6
scripts would be utf8.

-- 
http://dmoz.org/profiles/pollei.html
http://sourceforge.net/users/stephen_pollei/
http://www.orkut.com/Profile.aspx?uid=2455954990164098214
http://stephen_pollei.home.comcast.net/


* Re: [Patch] Support UTF-8 scripts
  2005-08-13 16:35 ` Stephen Pollei
@ 2005-08-13 18:42   ` Lee Revell
  2005-08-13 18:49     ` Hugo Mills
                       ` (3 more replies)
  0 siblings, 4 replies; 80+ messages in thread
From: Lee Revell @ 2005-08-13 18:42 UTC (permalink / raw)
  To: Stephen Pollei; +Cc: Martin v. Löwis, linux-kernel

On Sat, 2005-08-13 at 09:35 -0700, Stephen Pollei wrote:
> That's great for the perl6 people.
> http://dev.perl.org/perl6/doc/design/syn/S03.html says they are going
> to be using « and » as operators...

Is Larry smoking crack?  That's one of the worst ideas I've heard in a
long time.  There's no easy way to enter those at the keyboard!

http://www.cl.cam.ac.uk/~mgk25/unicode.html#input

Lee



* Re: [Patch] Support UTF-8 scripts
  2005-08-13 18:42   ` Lee Revell
@ 2005-08-13 18:49     ` Hugo Mills
  2005-08-13 18:53       ` Lee Revell
                         ` (2 more replies)
  2005-08-14  0:53     ` Alan Cox
                       ` (2 subsequent siblings)
  3 siblings, 3 replies; 80+ messages in thread
From: Hugo Mills @ 2005-08-13 18:49 UTC (permalink / raw)
  To: Lee Revell; +Cc: Stephen Pollei, Martin v. Löwis, linux-kernel


On Sat, Aug 13, 2005 at 02:42:52PM -0400, Lee Revell wrote:
> On Sat, 2005-08-13 at 09:35 -0700, Stephen Pollei wrote:
> > That's great for the perl6 people.
> > http://dev.perl.org/perl6/doc/design/syn/S03.html says they are going
> > to be using « and » as operators...
> 
> Is Larry smoking crack?  That's one of the worst ideas I've heard in a
> long time.  There's no easy way to enter those at the keyboard!

   I have "setxkbmap -symbols 'en_US(pc102)+gb'" in my ~/.xsession,
and « and » are available as AltGr-z and AltGr-x respectively.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 1C335860 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Anyone who claims their cryptographic protocol is secure is ---   
         either a genius or a fool.  Given the genius/fool ratio         
                 for our species,  the odds aren't good.                 



* Re: [Patch] Support UTF-8 scripts
  2005-08-13 18:49     ` Hugo Mills
@ 2005-08-13 18:53       ` Lee Revell
  2005-08-14  0:57         ` Alan Cox
  2005-08-13 19:20       ` Lee Revell
  2005-08-16  9:46       ` Jan Engelhardt
  2 siblings, 1 reply; 80+ messages in thread
From: Lee Revell @ 2005-08-13 18:53 UTC (permalink / raw)
  To: Hugo Mills; +Cc: Stephen Pollei, Martin v. Löwis, linux-kernel

On Sat, 2005-08-13 at 19:49 +0100, Hugo Mills wrote:
> On Sat, Aug 13, 2005 at 02:42:52PM -0400, Lee Revell wrote:
> > On Sat, 2005-08-13 at 09:35 -0700, Stephen Pollei wrote:
> > > That's great for the perl6 people.
> > > http://dev.perl.org/perl6/doc/design/syn/S03.html says they are going
> > > to be using « and » as operators...
> > 
> > Is Larry smoking crack?  That's one of the worst ideas I've heard in a
> > long time.  There's no easy way to enter those at the keyboard!
> 
>    I have "setxkbmap -symbols 'en_US(pc102)+gb'" in my ~/.xsession,
> and « and » are available as AltGr-z and AltGr-x respectively.

Most keyboards don't have an AltGr key.

Lee



* Re: [Patch] Support UTF-8 scripts
  2005-08-13 18:49     ` Hugo Mills
  2005-08-13 18:53       ` Lee Revell
@ 2005-08-13 19:20       ` Lee Revell
  2005-08-16  9:46       ` Jan Engelhardt
  2 siblings, 0 replies; 80+ messages in thread
From: Lee Revell @ 2005-08-13 19:20 UTC (permalink / raw)
  To: Hugo Mills; +Cc: Stephen Pollei, Martin v. Löwis, linux-kernel

On Sat, 2005-08-13 at 19:49 +0100, Hugo Mills wrote:
> On Sat, Aug 13, 2005 at 02:42:52PM -0400, Lee Revell wrote:
> > On Sat, 2005-08-13 at 09:35 -0700, Stephen Pollei wrote:
> > > That's great for the perl6 people.
> > > http://dev.perl.org/perl6/doc/design/syn/S03.html says they are going
> > > to be using « and » as operators...
> > 
> > Is Larry smoking crack?  That's one of the worst ideas I've heard in a
> > long time.  There's no easy way to enter those at the keyboard!
> 
>    I have "setxkbmap -symbols 'en_US(pc102)+gb'" in my ~/.xsession,
> and « and » are available as AltGr-z and AltGr-x respectively.
> 

Well, now it's obvious he's just trying to raise the bar for the
obfuscated perl contest.  If you thought these were fun before, you'll
love them with ¥ and « and »!

Lee



* Re: [Patch] Support UTF-8 scripts
  2005-08-13 18:42   ` Lee Revell
  2005-08-13 18:49     ` Hugo Mills
@ 2005-08-14  0:53     ` Alan Cox
  2005-08-14  4:10       ` James Cloos
  2005-08-14  6:18     ` Jason L Tibbitts III
  2005-08-15  8:01     ` Helge Hafting
  3 siblings, 1 reply; 80+ messages in thread
From: Alan Cox @ 2005-08-14  0:53 UTC (permalink / raw)
  To: Lee Revell; +Cc: Stephen Pollei, Martin v. Löwis, linux-kernel

On Sat, 2005-08-13 at 14:42 -0400, Lee Revell wrote:
> Is Larry smoking crack?  That's one of the worst ideas I've heard in a
> long time.  There's no easy way to enter those at the keyboard!

The command line console mappings may not include them by default (you
can obviously add them if your keyboard lacks them). The X keyboard
however does include compose functionality for » and « and many other
symbols that might be useful, e.g. ±

Alan



* Re: [Patch] Support UTF-8 scripts
  2005-08-13 18:53       ` Lee Revell
@ 2005-08-14  0:57         ` Alan Cox
  2005-08-14  1:19           ` Kyle Moffett
  0 siblings, 1 reply; 80+ messages in thread
From: Alan Cox @ 2005-08-14  0:57 UTC (permalink / raw)
  To: Lee Revell; +Cc: Hugo Mills, Stephen Pollei, Martin v. Löwis, linux-kernel

> >    I have "setxkbmap -symbols 'en_US(pc102)+gb'" in my ~/.xsession,
> > and « and » are available as AltGr-z and AltGr-x respectively.
> 
> Most keyboards don't have an AltGr key.

You must be an American. Most of the world's keyboards have an AltGr
key. You'll find that US keyboards have two Alt keys to avoid confusing
people (like one-button mice ;)) but the right one is understood by the
X bindings to be "AltGr". Even though the US keyboard is apparently
lacking functionality, it's purely a text label issue.

Alan



* Re: [Patch] Support UTF-8 scripts
  2005-08-14  0:57         ` Alan Cox
@ 2005-08-14  1:19           ` Kyle Moffett
  2005-08-14  1:40             ` Lee Revell
  0 siblings, 1 reply; 80+ messages in thread
From: Kyle Moffett @ 2005-08-14  1:19 UTC (permalink / raw)
  To: Alan Cox
  Cc: Lee Revell, Hugo Mills, Stephen Pollei,  Martin v. Löwis ,
	linux-kernel

On Aug 13, 2005, at 20:57:45, Alan Cox wrote:
>>>    I have "setxkbmap -symbols 'en_US(pc102)+gb'" in my ~/.xsession,
>>> and « and » are available as AltGr-z and AltGr-x respectively.
>>
>> Most keyboards don't have an AltGr key.
>
> You must be an American. Most of the world's keyboards have an AltGr
> key. You'll find that US keyboards have two Alt keys to avoid confusing
> people (like one-button mice ;)) but the right one is understood by the
> X bindings to be "AltGr". Even though the US keyboard is apparently
> lacking functionality, it's purely a text label issue.

And those of us who are Mac OS X oriented have patched our console and X
keycodes to match the mac way of generating symbols:

Alt-\        = «
Alt-Shift-\  = »
Alt-Shift-+  = ±

If only someone could come up with a good character palette like exists
on that OS, something that could generate a wide variety of keysyms,
preferably all of UTF-8, and send them to the topmost window.

Cheers,
Kyle Moffett

--
Unix was not designed to stop people from doing stupid things, because that
would also stop them from doing clever things.
   -- Doug Gwyn




* Re: [Patch] Support UTF-8 scripts
  2005-08-14  1:19           ` Kyle Moffett
@ 2005-08-14  1:40             ` Lee Revell
  2005-08-14 10:40               ` Wichert Akkerman
  0 siblings, 1 reply; 80+ messages in thread
From: Lee Revell @ 2005-08-14  1:40 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Alan Cox, Hugo Mills, Stephen Pollei,  Martin v. Löwis ,
	linux-kernel

On Sat, 2005-08-13 at 21:19 -0400, Kyle Moffett wrote:
> And those of us who are Mac OS X oriented have patched our console and
> X keycodes to match the mac way of generating symbols:
> 
> Alt-\        = «
> Alt-Shift-\  = »
> Alt-Shift-+  = ±
> 

My point exactly, it's idiotic for Perl6 to use these as OPERATORS, the
atoms of the language, when there's not even a platform independent way
to type them in.

Lee



* Re: [Patch] Support UTF-8 scripts
  2005-08-14  0:53     ` Alan Cox
@ 2005-08-14  4:10       ` James Cloos
  0 siblings, 0 replies; 80+ messages in thread
From: James Cloos @ 2005-08-14  4:10 UTC (permalink / raw)
  To: linux-kernel; +Cc: Lee Revell

>>>>> "Alan" == Alan Cox <alan@lxorguk.ukuu.org.uk> writes:

Alan> The command line console mappings may not include them by
Alan> default (you can obviously add them if your keyboard lacks
Alan> them). The X keyboard however does include compose functionality
Alan> for » and « and many other symbols that might be useful, e.g. ±

Not to mention that many editors, including emacs and vim, have their
own support for entering such non-ascii characters no matter what the
console or X11 keyboards look like.

-JimC
-- 
James H. Cloos, Jr. <cloos@jhcloos.com>


* Re: [Patch] Support UTF-8 scripts
  2005-08-13 18:42   ` Lee Revell
  2005-08-13 18:49     ` Hugo Mills
  2005-08-14  0:53     ` Alan Cox
@ 2005-08-14  6:18     ` Jason L Tibbitts III
       [not found]       ` <feed8cdd050814125845fe4e2e@mail.gmail.com>
  2005-08-14 21:52       ` Kyle Moffett
  2005-08-15  8:01     ` Helge Hafting
  3 siblings, 2 replies; 80+ messages in thread
From: Jason L Tibbitts III @ 2005-08-14  6:18 UTC (permalink / raw)
  To: Lee Revell; +Cc: Stephen Pollei, Martin v. Löwis, linux-kernel

>>>>> "LR" == Lee Revell <rlrevell@joe-job.com> writes:

LR> Is Larry smoking crack?  That's one of the worst ideas I've heard
LR> in a long time.  There's no easy way to enter those at the
LR> keyboard!

I know folks enjoy trashing Perl these days, but it's not justified in
this case.  From the Perl6-Bible -
http://search.cpan.org/dist/Perl6-Bible/lib/Perl6/Bible/S03.pod:

 For those still living without the blessings of Unicode, that can
 also be written: << ... >>.

 - J<


* Re: [Patch] Support UTF-8 scripts
  2005-08-14  1:40             ` Lee Revell
@ 2005-08-14 10:40               ` Wichert Akkerman
  0 siblings, 0 replies; 80+ messages in thread
From: Wichert Akkerman @ 2005-08-14 10:40 UTC (permalink / raw)
  To: linux-kernel

Previously Lee Revell wrote:
> My point exactly, it's idiotic for Perl6 to use these as OPERATORS, the
> atoms of the language, when there's not even a platform independent way
> to type them in.

If anyone had bothered to read the URL in one of the earlier emails, they
would have seen that '<<' is an accepted alternative spelling.

Wichert.

-- 
Wichert Akkerman <wichert@wiggy.net>    It is simple to make things.
http://www.wiggy.net/                   It is hard to make things simple.


* Re: [Patch] Support UTF-8 scripts
       [not found]       ` <feed8cdd050814125845fe4e2e@mail.gmail.com>
@ 2005-08-14 19:59         ` Lee Revell
  2005-08-14 20:13           ` Stephen Pollei
                             ` (3 more replies)
  0 siblings, 4 replies; 80+ messages in thread
From: Lee Revell @ 2005-08-14 19:59 UTC (permalink / raw)
  To: Stephen Pollei; +Cc: Jason L Tibbitts III, Martin v. Löwis, linux-kernel

On Sun, 2005-08-14 at 12:58 -0700, Stephen Pollei wrote:
> My main point was that utf-8 for identifiers, operators, and string
> constants are becoming more prevalent, so BOM support for scripts
> sounds like a Good Idea™ .
> 

I know the alternatives are available.  That doesn't make it any less
idiotic to use non ASCII characters as operators.  I think it's a very
slippery slope.  We write code in ASCII, dammit.

Lee



* Re: [Patch] Support UTF-8 scripts
  2005-08-14 19:59         ` Lee Revell
@ 2005-08-14 20:13           ` Stephen Pollei
  2005-08-14 20:22             ` Lee Revell
  2005-08-14 23:55           ` Alan Cox
                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 80+ messages in thread
From: Stephen Pollei @ 2005-08-14 20:13 UTC (permalink / raw)
  To: Lee Revell; +Cc: Jason L Tibbitts III, Martin v. Löwis, linux-kernel

On 8/14/05, Lee Revell <rlrevell@joe-job.com> wrote:
> I know the alternatives are available.  That doesn't make it any less
> idiotic to use non ASCII characters as operators.  I think it's a very
> slippery slope.  We write code in ASCII, dammit.

Yes, you and I might write 99.9% of our code in good ol' **American**
Standard Code for Information Interchange -- however, not all the world
is the USA. For instance, notice the http://de.wikipedia.org/wiki/Umlaut/
in "Löwis"... Seems like lots of Europeans might want a bigger
charset, not to mention Asians, Hindus, and whoever else.

-- 
http://dmoz.org/profiles/pollei.html
http://sourceforge.net/users/stephen_pollei/
http://www.orkut.com/Profile.aspx?uid=2455954990164098214
http://stephen_pollei.home.comcast.net/


* Re: [Patch] Support UTF-8 scripts
  2005-08-14 20:13           ` Stephen Pollei
@ 2005-08-14 20:22             ` Lee Revell
  2005-08-14 22:10               ` "Martin v. Löwis"
  0 siblings, 1 reply; 80+ messages in thread
From: Lee Revell @ 2005-08-14 20:22 UTC (permalink / raw)
  To: Stephen Pollei; +Cc: Jason L Tibbitts III, Martin v. Löwis, linux-kernel

On Sun, 2005-08-14 at 13:13 -0700, Stephen Pollei wrote:
> Seems like lots of Europeans might want a bigger
> charset, not to mention Asians, Hindus, and whoever else.

For strings, of course.  But there's no need for UTF-8 operators.

Lee



* Re: [Patch] Support UTF-8 scripts
  2005-08-14  6:18     ` Jason L Tibbitts III
       [not found]       ` <feed8cdd050814125845fe4e2e@mail.gmail.com>
@ 2005-08-14 21:52       ` Kyle Moffett
  2005-08-14 22:12         ` Valdis.Kletnieks
  1 sibling, 1 reply; 80+ messages in thread
From: Kyle Moffett @ 2005-08-14 21:52 UTC (permalink / raw)
  To: Jason L Tibbitts III
  Cc: Lee Revell, Stephen Pollei,  Martin v. Löwis , linux-kernel

On Aug 14, 2005, at 02:18:13, Jason L Tibbitts III wrote:
>>>>>> "LR" == Lee Revell <rlrevell@joe-job.com> writes:
> LR> Is Larry smoking crack?
>
> From the Perl6-Bible:
> http://search.cpan.org/dist/Perl6-Bible/lib/Perl6/Bible/S03.pod:

I think this confirms that the answer is yes.  See the following at the
above URL:
> Note that ?^ is functionally identical to !.?| differs from || in that ?|
> always returns a standard boolean value (either 1 or 0), whereas || returns
> the actual value of the first of its arguments that is true.

Since when is the string "!.?|" an operator???  Or "?^", "+|", "~|", "?|",
etc.  I think Larry's gone off the deep end on this one.  It may be an
incredibly powerful and expressive language, but it seems _really_ strange,
and probably will produce the best Obfuscated-code contest the world has
ever seen. (Better even than the Perl5 one).

Cheers,
Kyle Moffett

--
Simple things should be simple and complex things should be possible
   -- Alan Kay





* Re: [Patch] Support UTF-8 scripts
  2005-08-14 20:22             ` Lee Revell
@ 2005-08-14 22:10               ` "Martin v. Löwis"
  0 siblings, 0 replies; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-08-14 22:10 UTC (permalink / raw)
  To: Lee Revell; +Cc: Stephen Pollei, Jason L Tibbitts III, linux-kernel

Lee Revell wrote:
> For strings, of course.  But there's no need for UTF-8 operators.

Indeed - this is the main rationale for the patch, of course. People
want to write non-ASCII in script primarily in string literals,
and (perhaps even more often) in comments. Now, for comments, it
wouldn't really matter that the interpreter knows what the encoding
is - but the editor would have to know, and the UTF-8 signature
primarily helps the editor (*).

Then we are back to the rationale for this patch: if you need the
UTF-8 signature to reliably identify the script as being UTF-8
encoded, you then currently cannot easily run it as a script through
binfmt_script, as that code requires a script to start with #!.

Regards,
Martin

(*) As I said before: at least for Python, the UTF-8 signature also
has syntactic meaning. It is allowed at the beginning of a file
as an addition to the language syntax, and it tells the interpreter
that Unicode literals (usually represented internally as UCS-2)
are represented as UTF-8 in the source code.


* Re: [Patch] Support UTF-8 scripts
  2005-08-14 21:52       ` Kyle Moffett
@ 2005-08-14 22:12         ` Valdis.Kletnieks
  0 siblings, 0 replies; 80+ messages in thread
From: Valdis.Kletnieks @ 2005-08-14 22:12 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Jason L Tibbitts III, Lee Revell, Stephen Pollei,
	 Martin v. Löwis , linux-kernel


On Sun, 14 Aug 2005 17:52:36 EDT, Kyle Moffett said:

> > Note that ?^ is functionally identical to !.?| differs from || in  

> Since when is the string "!.?|" an operator??? 

I think that was supposed to read:

Note that ?^ is functionally identical to !.
?| differs from || in that ?| returns (and so on)

(two separate sentences lacking whitespace between them...)





* Re: [Patch] Support UTF-8 scripts
  2005-08-14 19:59         ` Lee Revell
  2005-08-14 20:13           ` Stephen Pollei
@ 2005-08-14 23:55           ` Alan Cox
  2005-08-16 13:56           ` David Madore
       [not found]           ` <mailman.1124063520.13257.linux-kernel2news@redhat.com>
  3 siblings, 0 replies; 80+ messages in thread
From: Alan Cox @ 2005-08-14 23:55 UTC (permalink / raw)
  To: Lee Revell
  Cc: Stephen Pollei, Jason L Tibbitts III, Martin v. Löwis,
	linux-kernel

On Sun, 2005-08-14 at 15:59 -0400, Lee Revell wrote:
> I know the alternatives are available.  That doesn't make it any less
> idiotic to use non ASCII characters as operators.  I think it's a very
> slippery slope.  We write code in ASCII, dammit.

It's a trivial patch and there is a lot to be said for UTF-8 scripts. As
to writing code in ascii, the kernel regularly has outbreaks of either
UTF-8 or ISO-8859-* especially in the docs directory. Standardising
these on UTF-8 would be helpful.

Yes the kernel code is C so ASCII except for the odd abuser of the ©
symbol.

Alan



* Re: [Patch] Support UTF-8 scripts
  2005-08-13 18:42   ` Lee Revell
                       ` (2 preceding siblings ...)
  2005-08-14  6:18     ` Jason L Tibbitts III
@ 2005-08-15  8:01     ` Helge Hafting
  3 siblings, 0 replies; 80+ messages in thread
From: Helge Hafting @ 2005-08-15  8:01 UTC (permalink / raw)
  To: Lee Revell
  Cc: Stephen Pollei, Martin v. Löwis,
	linux-kernel

Lee Revell wrote:

>On Sat, 2005-08-13 at 09:35 -0700, Stephen Pollei wrote:
>>That's great for the perl6 people.
>>http://dev.perl.org/perl6/doc/design/syn/S03.html says they are going
>>to be using « and » as operators...
>
>Is Larry smoking crack?  That's one of the worst ideas I've heard in a
>long time.  There's no easy way to enter those at the keyboard!

On your keyboard, that is.  So what?

My keyboard happens to have no easy way of entering a dollar sign,
even though it is in «ascii».  That makes sense though, as it is one
of those ascii characters that is almost never used in my part of the world.

Still, if I needed to use the «$» when programming, I sure could map it to
some key combination.  X is nice that way. 

Helge Hafting


* Re: [Patch] Support UTF-8 scripts
  2005-08-13 18:49     ` Hugo Mills
  2005-08-13 18:53       ` Lee Revell
  2005-08-13 19:20       ` Lee Revell
@ 2005-08-16  9:46       ` Jan Engelhardt
  2 siblings, 0 replies; 80+ messages in thread
From: Jan Engelhardt @ 2005-08-16  9:46 UTC (permalink / raw)
  To: Hugo Mills; +Cc: Lee Revell, Stephen Pollei, Martin v. Löwis, linux-kernel


>> > That's great for the perl6 people.
>> > http://dev.perl.org/perl6/doc/design/syn/S03.html says they are going
>> > to be using « and » as operators...
>> 
>> Is Larry smoking crack?  That's one of the worst ideas I've heard in a
>> long time.  There's no easy way to enter those at the keyboard!
>
>   I have "setxkbmap -symbols 'en_US(pc102)+gb'" in my ~/.xsession,
>and « and » are available as AltGr-z and AltGr-x respectively.

.Xmodmap: keycode 117 = MultiKey

and then use [the Windows(R) Context Menu Key],[<],[<] to generate «
Cheers :)


Jan Engelhardt
-- 


* Re: [Patch] Support UTF-8 scripts
  2005-08-14 19:59         ` Lee Revell
  2005-08-14 20:13           ` Stephen Pollei
  2005-08-14 23:55           ` Alan Cox
@ 2005-08-16 13:56           ` David Madore
       [not found]           ` <mailman.1124063520.13257.linux-kernel2news@redhat.com>
  3 siblings, 0 replies; 80+ messages in thread
From: David Madore @ 2005-08-16 13:56 UTC (permalink / raw)
  To: linux-kernel

On Sun, Aug 14, 2005 at 08:00:31PM +0000, Lee Revell wrote:
>		   We write code in ASCII, dammit.

<URL: http://www.madore.org/~david/weblog/2004-12.html#d.2004-12-03.0813 >

:-)

-- 
     David A. Madore
    (david.madore@ens.fr,
     http://www.madore.org/~david/ )


* Re: [Patch] Support UTF-8 scripts
       [not found]           ` <mailman.1124063520.13257.linux-kernel2news@redhat.com>
@ 2005-08-16 20:17             ` Pete Zaitcev
  0 siblings, 0 replies; 80+ messages in thread
From: Pete Zaitcev @ 2005-08-16 20:17 UTC (permalink / raw)
  To: Alan Cox; +Cc: zaitcev, linux-kernel

On Mon, 15 Aug 2005 00:55:54 +0100, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> On Sun, 2005-08-14 at 15:59 -0400, Lee Revell wrote:

> > I know the alternatives are available.  That doesn't make it any less
> > idiotic to use non ASCII characters as operators.  I think it's a very
> > slippery slope.  We write code in ASCII, dammit.
> 
> It's a trivial patch and there is a lot to be said for UTF-8 scripts. As
> to writing code in ascii, the kernel regularly has outbreaks of either
> UTF-8 or ISO-8859-* especially in the docs directory. Standardising
> these on UTF-8 would be helpful.
> 
> Yes the kernel code is C so ASCII except for the odd abuser of the ©
> symbol.

We write kernel code in ASCII because of patches in e-mail. When a patch
is saved (often by a script), it is divorced of the encoding in which
e-mail was done. Forwarding of patches then causes them to fail to apply.
Everything else can be worked around.

In my experience, the most common case of such patch rejects has to do
with a European using a non-UTF-8 encoding for his name, rather than
with the copyright symbol.

-- Pete
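
The failure mode described above comes down to bytes: "ö" is the single
byte F6 in ISO-8859-1 but the two bytes C3 B6 in UTF-8, so a name saved in
one encoding no longer matches a file kept in the other. A minimal sketch
of a structural UTF-8 check follows. The function name looks_like_utf8 is
an assumption for illustration, and the check is deliberately simplified
(it does not reject overlong forms or surrogates).

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns true if every byte sequence in s has valid
 * UTF-8 lead/continuation structure. Simplified on purpose: overlong
 * encodings and surrogate code points are not rejected.
 */
static bool looks_like_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        size_t need;    /* number of continuation bytes expected */

        if (s[i] < 0x80)
            need = 0;
        else if ((s[i] & 0xE0) == 0xC0)
            need = 1;
        else if ((s[i] & 0xF0) == 0xE0)
            need = 2;
        else if ((s[i] & 0xF8) == 0xF0)
            need = 3;
        else
            return false;   /* stray continuation or invalid lead byte */

        if (need > len - i - 1)
            return false;   /* sequence truncated at end of buffer */
        for (size_t k = 1; k <= need; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return false;
        i += need + 1;
    }
    return true;
}
```

Under these assumptions, "Löwis" stored as UTF-8 passes the check while
the same name stored as ISO-8859-1 fails it, which is the kind of
mismatch that makes a forwarded patch stop applying.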


* Re: [Patch] Support UTF-8 scripts
  2005-08-13 12:07 "Martin v. Löwis"
  2005-08-13 16:35 ` Stephen Pollei
@ 2005-08-31 23:27 ` H. Peter Anvin
  1 sibling, 0 replies; 80+ messages in thread
From: H. Peter Anvin @ 2005-08-31 23:27 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <42FDE286.40707@v.loewis.de>
By author:    "Martin v. Löwis" <martin@v.loewis.de>
In newsgroup: linux.dev.kernel
>
> This patch adds support for UTF-8 signatures (aka BOM, byte order
> mark) to binfmt_script. Files that start with EF BB BF # ! are now
> recognized as scripts (in addition to files starting with # !).
> 
> With such support, creating scripts that reliably carry non-ASCII
> characters is simplified. Editors and the script interpreter can
> easily agree on what the encoding of the script is, and the
> interpreter can then render strings appropriately. Currently,
> Python supports source files that start with the UTF-8 signature;
> the approach would naturally extend to Perl to enhance/replace
> the "use utf8" pragma. Likewise, Tcl could use the UTF-8 signature
> to reliably identify UTF-8 source code (instead of assuming
> [encoding system] for source code).
> 

BOM should not be used in UTF-8.  In fact, it shouldn't be used at
all.

	-hpa


* Re: [Patch] Support UTF-8 scripts
       [not found] ` <4HKbZ-Cx-37@gated-at.bofh.it>
@ 2005-09-15 18:24   ` "Martin v. Löwis"
  2005-09-15 18:25     ` H. Peter Anvin
  0 siblings, 1 reply; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-15 18:24 UTC (permalink / raw)
  To: H. Peter Anvin, linux-kernel

H. Peter Anvin wrote:
> BOM should not be used in UTF-8.  In fact, it shouldn't be used at
> all.

Says who? In UTF-8, it is not used to indicate a byte order; instead,
it is used to indicate the fact that the file is UTF-8, like a magic number.
That's why I prefer to call it "UTF-8 signature".

The Unicode consortium thinks that the BOM can be used in UTF-8:

http://www.unicode.org/faq/utf_bom.html#29

The UTF-8 signature is very useful, and I would prefer if it would
be used instead of format-specific encoding declarations.

Regards,
Martin


* Re: [Patch] Support UTF-8 scripts
  2005-09-15 18:24   ` "Martin v. Löwis"
@ 2005-09-15 18:25     ` H. Peter Anvin
  2005-09-15 18:39       ` "Martin v. Löwis"
  0 siblings, 1 reply; 80+ messages in thread
From: H. Peter Anvin @ 2005-09-15 18:25 UTC (permalink / raw)
  To: "Martin v. Löwis"; +Cc: linux-kernel

Martin v. Löwis wrote:
> 
> Says who? In UTF-8, it is not used to indicate a byte order; instead,
> it is used to indicate the fact that the file is UTF-8, like a magic number.
> That's why I prefer to call it "UTF-8 signature".
> 
> The Unicode consortium thinks that the BOM can be used in UTF-8:
> 
> http://www.unicode.org/faq/utf_bom.html#29
> 
> The UTF-8 signature is very useful, and I would prefer if it would
> be used instead of format-specific encoding declarations.
> 

In Unix, it's a hideously bad idea.  The reason is that Unix inherently 
assumes that text streams can be merged, split, and modified.  In other 
words, unless you can guarantee that EVERY program can handle BOM 
EVERYWHERE, it's broken.

In other words, it's broken.

	-hpa
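
The merging concern can be made concrete with a small sketch: when two
files that each begin with the signature are concatenated, the second
signature lands in the middle of the stream, where a consumer that only
strips a leading signature will see it as data. The helper name
stray_signatures is an assumption made for this illustration.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: counts UTF-8 signatures (EF BB BF) found anywhere
 * in the buffer except at offset 0, i.e. where a consumer that only
 * strips a leading signature would not expect one.
 */
static size_t stray_signatures(const unsigned char *buf, size_t len)
{
    static const unsigned char sig[] = { 0xEF, 0xBB, 0xBF };
    size_t count = 0;

    /* Start at 1: a signature at offset 0 is the expected place. */
    for (size_t i = 1; i + sizeof sig <= len; i++)
        if (memcmp(buf + i, sig, sizeof sig) == 0)
            count++;
    return count;
}
```

For a buffer built from two signed files back to back, this counts one
stray signature; for a single signed file it counts none.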



* Re: [Patch] Support UTF-8 scripts
  2005-09-15 18:25     ` H. Peter Anvin
@ 2005-09-15 18:39       ` "Martin v. Löwis"
  2005-09-15 19:20         ` H. Peter Anvin
  2005-09-16  8:13         ` Bernd Petrovitsch
  0 siblings, 2 replies; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-15 18:39 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-kernel

H. Peter Anvin wrote:

> In Unix, it's a hideously bad idea.  The reason is that Unix inherently
> assumes that text streams can be merged, split, and modified.  In other
> words, unless you can guarantee that EVERY program can handle BOM
> EVERYWHERE, it's broken.

This argument is bogus. We are talking about scripts here, which cannot
be merged, split, and modified. You don't cat(1) or sort(1) them - it's
just pointless to do that. You create them with text editors, and those
*can* handle the UTF-8 signature.

> In other words, it's broken.

We can do that now, or in five or ten years. I'm willing to wait that
long, but I'm certain that more people will find the UTF-8 signature
useful over time. It's the only sane way to get non-ASCII into script
source in a consistent way.

Regards,
Martin


* Re: [Patch] Support UTF-8 scripts
  2005-09-15 18:39       ` "Martin v. Löwis"
@ 2005-09-15 19:20         ` H. Peter Anvin
  2005-09-16  8:13         ` Bernd Petrovitsch
  1 sibling, 0 replies; 80+ messages in thread
From: H. Peter Anvin @ 2005-09-15 19:20 UTC (permalink / raw)
  To: "Martin v. Löwis"; +Cc: linux-kernel

Martin v. Löwis wrote:
> 
> We can do that now, or in five or ten years. I'm willing to wait that
> long, but I'm certain that more people will find the UTF-8 signature
> useful over time. It's the only sane way to get non-ASCII into script
> source in a consistent way.
> 

No.  The sane way is to just use UTF-8.

In five or ten years, by the time you've gotten your idiotic BOM mess to 
sort-of work, it will be completely pointless to have anything *but* 
UTF-8, and thus it's pointless.

Don't perpetuate the braindamage.

	-hpa


* Re: [Patch] Support UTF-8 scripts
  2005-09-15 18:39       ` "Martin v. Löwis"
  2005-09-15 19:20         ` H. Peter Anvin
@ 2005-09-16  8:13         ` Bernd Petrovitsch
  1 sibling, 0 replies; 80+ messages in thread
From: Bernd Petrovitsch @ 2005-09-16  8:13 UTC (permalink / raw)
  To: "Martin v. Löwis"; +Cc: H. Peter Anvin, linux-kernel

On Thu, 2005-09-15 at 20:39 +0200, "Martin v. Löwis" wrote:
> H. Peter Anvin wrote:
> 
> > In Unix, it's a hideously bad idea.  The reason is that Unix inherently
> > assumes that text streams can be merged, split, and modified.  In other
> > words, unless you can guarantee that EVERY program can handle BOM
> > EVERYWHERE, it's broken.
> 
> This argument is bogus. We are talking about scripts here, which cannot
> be merged, split, and modified. You don't cat(1) or sort(1) them - it's

Sure they can since they are plain text files.
How do you think one merges scripts?
Just `cat`ing them all into one new file and editing that new file is much
faster and simpler than to open an empty new file with your editor, then
you open all the other scripts in your editor and copy them by hand.
And you (or at least I) do `grep`/`egrep`/`fgrep`, `wc` them. And
probably with several other tools too - think of `find <dir> -type f
-print0 | xargs -0r <cmd>`.

> just pointless to do that. You create them with text editors, and those
> *can* handle the UTF-8 signature.

It is not uncommon to create scripts and the like with other programs,
other scripts, what-else.
Apart from the fact that a "script" is merely a plain text file with the
eXecutable bit set. And *that* is the only difference, so you have to
patch at least (all instances of) `chmod` to insert and remove the BOM.
This gets funny if you think of file systems without a concept of
"executable bit" and copying files around. Another standard tool to
patch.
And how do you solve `cat`ing a script (with set X bit) like:
`cat <script >other-file` where other-file will not have the X bit set. 
The `cat` program doesn't even know (or care about) the names of the two
files.

	Bernd
-- 
Firmix Software GmbH                   http://www.firmix.at/
mobil: +43 664 4416156                 fax: +43 1 7890849-55
          Embedded Linux Development and Services



* Re: [Patch] Support UTF-8 scripts
       [not found]         ` <4N7AS-67L-3@gated-at.bofh.it>
@ 2005-09-16 18:02           ` Bodo Eggert
  2005-09-16 18:09             ` H. Peter Anvin
       [not found]             ` <200509170028.59973.dhazelton@enter.net>
  0 siblings, 2 replies; 80+ messages in thread
From: Bodo Eggert @ 2005-09-16 18:02 UTC (permalink / raw)
  To: H. Peter Anvin, Martin v. Löwis, linux-kernel

Bernd Petrovitsch <bernd@firmix.at> wrote:
> On Thu, 2005-09-15 at 20:39 +0200, "Martin v. Löwis" wrote:
>> H. Peter Anvin wrote:

>> > In Unix, it's a hideously bad idea.  The reason is that Unix inherently
>> > assumes that text streams can be merged, split, and modified.  In other
>> > words, unless you can guarantee that EVERY program can handle BOM
>> > EVERYWHERE, it's broken.

You can't sort /bin/ls into /tmp/ls and expect /tmp/ls to be meaningful,
but /bin/ls works as expected. You can't usually concat perl scripts and
shell scripts either, but both kinds of script run quite well.

And if you do "cat /bin/cat /bin/cp > /bin/catcp", what's "catcp foo bar"
supposed to do? First output foo and bar to stdout, then copy foo to bar?
Is execve() broken if it doesn't do what I described? Is the ELF header
broken because it's not recognized EVERYWHERE? I don't think so.

>> This argument is bogus. We are talking about scripts here, which cannot
>> be merged, split, and modified. You don't cat(1) or sort(1) them - it's
> 
> Sure they can since they are plain text files.
> How do you think one merges scripts?
> Just `cat`ing them all into one new file and editing that new file is much
> faster and simpler than to open an empty new file with your editor, then
> you open all the other scripts in your editor and copy them by hand.

What's supposed to happen if you concatenate a script from your french
user and from your russian user, both using localized text, into one file?
Unless you can guarantee every editor to correctly handle this case, all
usage of 8-bit-characters should be disabled - NOT!

If you concatenate two plain text files, you will use cat.
If you concatenate two pnm image files, you will use pnmcat.
If you concatenate two utf-8 files, you will use utf8cat.
If you concatenate two binaries, you will shoot your feet.
That's easy, isn't it?

BTW: I think decent utf-8 capable programs SHOULD ignore extra BOM markers.
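
A minimal sketch of that idea in Python: `naive_cat` models what cat(1) does to the bytes, and `utf8cat` is only a stand-in for the hypothetical tool named above (no such standard utility is assumed to exist). The sample script contents are made up for illustration.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def naive_cat(chunks):
    """Byte-wise concatenation, which is what cat(1) does."""
    return b"".join(chunks)


def utf8cat(chunks):
    """Hypothetical BOM-aware concatenation: keep at most one leading signature."""
    bodies = []
    saw_bom = False
    for chunk in chunks:
        if chunk.startswith(BOM):
            saw_bom = True
            chunk = chunk[len(BOM):]
        bodies.append(chunk)
    body = b"".join(bodies)
    return BOM + body if saw_bom else body


# Two signed inputs with non-ASCII content (written with escapes on purpose).
a = BOM + "echo 'L\u00f6sung'\n".encode("utf-8")
b = BOM + "echo 'r\u00e9ponse'\n".encode("utf-8")

merged = naive_cat([a, b])
# The plain "utf-8" codec keeps every signature as the character U+FEFF,
# so the second one shows up in the middle of the decoded text.
stray = merged.decode("utf-8").count("\ufeff")

clean = utf8cat([a, b])
```

With two signed inputs, the naive result still carries a second signature in the middle of the stream, while the BOM-aware variant keeps exactly one at the start.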

> And you (or at least I) do `grep`/`egrep`/`fgrep`, `wc` them.

You can *grep utf-8 scripts, but you can't *grep binaries. Shouldn't
this be fixed by implementing an in-kernel ASCII assembler and convert
all binaries to assembler text?

> And
> probably with several other tools too - think of `find <dir> -type f
> -print0 | xargs -0r <cmd>`.

utf-8 filenames will work correctly (unless used as an extended BASIC
script with non-ASCII variable names, but that would be insane).

>> just pointless to do that. You create them with text editors, and those
>> can handle the UTF-8 signature.
> 
> It is not uncommon to create scripts and the like with other programs,
> other scripts, what-else.

It's not uncommon to create binaries using other programs. So what?

> Apart from the fact that a "script" is merely a plain text file with the
> eXecutable bit set.

And a utf-8 script is a utf-8 encoded text file with its executable bit
set.

> And that is the only difference, so you have to at
> least (all instances of) `chmod` to insert and remove the BOM.
[...]

In order to make it harder for the interpreter to correctly detect utf-8?
You can have DOS executables run in dosboxes, windows applications run
in windows, java archives run in java, but utf-8 scripts should be
mangled in order to work "correctly", and mangled back in order to be
editable? *That*'s insane!

Just make execve ignore the BOM marker before "#!" as the patch does, and
you're done. The rest is somebody else's not-a-problem.



BTW2: However, I don't like the patch.

I'd first check for a utf-8 signature, and if it's found, adjust the
buffer offset by 3. Then I'd run the old code checking for the sh_bang.
OTOH, I just read the patch and not the .c file, maybe (unlikely) my idea
wouldn't work correctly.

-- 
Ich danke GMX dafür, die Verwendung meiner Adressen mittels per SPF
verbreiteten Lügen zu sabotieren.


* Re: [Patch] Support UTF-8 scripts
  2005-09-16 18:02           ` [Patch] Support UTF-8 scripts Bodo Eggert
@ 2005-09-16 18:09             ` H. Peter Anvin
  2005-09-16 18:57               ` Bodo Eggert
       [not found]             ` <200509170028.59973.dhazelton@enter.net>
  1 sibling, 1 reply; 80+ messages in thread
From: H. Peter Anvin @ 2005-09-16 18:09 UTC (permalink / raw)
  To: 7eggert; +Cc: "Martin v. Löwis", linux-kernel

Bodo Eggert wrote:
> 
> What's supposed to happen if you concatenate a script from your french
> user and from your russian user, both using localized text, into one file?
> Unless you can guarantee every editor to correctly handle this case, all
> usage of 8-bit-characters should be disabled - NOT!
> 

Actually, it's quite easy to avoid problems by using UTF-8 consistently.
The 8-bit characters are oddballs and need to be treated specially,
but look, guys, it's 2005 - UTF-8 should be the norm, not the exception.

	-hpa


* Re: [Patch] Support UTF-8 scripts
  2005-09-16 18:09             ` H. Peter Anvin
@ 2005-09-16 18:57               ` Bodo Eggert
  2005-09-16 19:08                 ` Martin Mares
                                   ` (2 more replies)
  0 siblings, 3 replies; 80+ messages in thread
From: Bodo Eggert @ 2005-09-16 18:57 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: 7eggert, "Martin v. Löwis", linux-kernel

On Fri, 16 Sep 2005, H. Peter Anvin wrote:
> Bodo Eggert wrote:

> > What's supposed to happen if you concatenate a script from your french
> > user and from your russian user, both using localized text, into one file?
> > Unless you can guarantee every editor to correctly handle this case, all
> > usage of 8-bit-characters should be disabled - NOT!
> 
> Actually, it's quite easy to avoid problems by using UTF-8 consistently. 
>    The 8-bit characters are oddballs and need to be treated specially, 
> but look, guys, it's 2005 - UTF-8 should be the norm, not the exception.

It should, but as long as old programs are still around, we'll have both 
and need a marker to distinguish them. Otherwise we'll be stuck with
legacy scripts for a long time.

-- 
I'm a member of DNA (National Assocciation of Dyslexics).
	-- Storm in <5Z4Z7.52353$4x4.6445347@news2-win.server.ntlworld.com>


* Re: [Patch] Support UTF-8 scripts
  2005-09-16 18:57               ` Bodo Eggert
@ 2005-09-16 19:08                 ` Martin Mares
  2005-09-16 19:25                 ` H. Peter Anvin
  2005-09-16 19:57                 ` Horst von Brand
  2 siblings, 0 replies; 80+ messages in thread
From: Martin Mares @ 2005-09-16 19:08 UTC (permalink / raw)
  To: Bodo Eggert; +Cc: H. Peter Anvin, "Martin v. Löwis", linux-kernel

Hello!

> It should, but as long as old programs are still around, we'll have both 
> and need a marker to distinguish them.

I doubt that. For ages people were using several different encodings on
a single system (at least here in .cz) without any markers and although
there were some rough edges, almost everything worked. Now we do the same
with ISO-8859-2 and UTF-8, again with no need for a marker.

				Have a nice fortnight
-- 
Martin `MJ' Mares   <mj@ucw.cz>   http://atrey.karlin.mff.cuni.cz/~mj/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
Linux vs. Windows is a no-WIN situation.


* Re: [Patch] Support UTF-8 scripts
  2005-09-16 18:57               ` Bodo Eggert
  2005-09-16 19:08                 ` Martin Mares
@ 2005-09-16 19:25                 ` H. Peter Anvin
  2005-09-16 19:57                 ` Horst von Brand
  2 siblings, 0 replies; 80+ messages in thread
From: H. Peter Anvin @ 2005-09-16 19:25 UTC (permalink / raw)
  To: Bodo Eggert; +Cc: "Martin v. Löwis", linux-kernel

Bodo Eggert wrote:
> 
> It should, but as long as old programs are still around, we'll have both 
> and need a marker to distinguish them. Otherwise we'll be stuck with
> legacy scripts for a long time.
> 

You don't have markers (although they're defined, see ISO 2022) for your 
8-bit encodings, and *THEY'RE THE ONES THAT NEED TO BE DISTINGUISHED.* 
Flagging UTF-8, especially with the BOM (as opposed to the ISO 2022 
signature, <ESC>%G) is pointless in the context, since you still can't 
distinguish your arbitrary number of legacy encodings.
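
The asymmetry can be sketched in Python, assuming only the two signatures mentioned above (the UTF-8 BOM, bytes EF BB BF, and the ISO 2022 announcement ESC % G). The function name and the sample word are illustrative.

```python
UTF8_BOM = b"\xef\xbb\xbf"
ISO2022_UTF8 = b"\x1b%G"  # ESC % G


def sniff_signature(data):
    """Return a label for a recognised leading signature, else None."""
    if data.startswith(UTF8_BOM):
        return "utf-8 (BOM)"
    if data.startswith(ISO2022_UTF8):
        return "utf-8 (ISO 2022)"
    return None


text = "\u0161ifra"  # a Czech word starting with s-caron

signed = UTF8_BOM + text.encode("utf-8")
latin2 = text.encode("iso-8859-2")  # s-caron is byte 0xB9 here
cp1250 = text.encode("cp1250")      # s-caron is byte 0x9A here

labels = [sniff_signature(d) for d in (signed, latin2, cp1250)]
```

Only the UTF-8 input is identified; the two legacy encodings produce different bytes for the same word, yet neither carries anything a reader could sniff.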

Oh, yes, and try to stick ISO 2022 signatures in scripts or whatnot, and 
you can see what current software does with a signature standard that 
dates back to the 1970's.

	-hpa


* Re: [Patch] Support UTF-8 scripts
  2005-09-16 18:57               ` Bodo Eggert
  2005-09-16 19:08                 ` Martin Mares
  2005-09-16 19:25                 ` H. Peter Anvin
@ 2005-09-16 19:57                 ` Horst von Brand
  2 siblings, 0 replies; 80+ messages in thread
From: Horst von Brand @ 2005-09-16 19:57 UTC (permalink / raw)
  To: Bodo Eggert; +Cc: H. Peter Anvin, "Martin v. Löwis", linux-kernel

Bodo Eggert <7eggert@gmx.de> wrote:
> On Fri, 16 Sep 2005, H. Peter Anvin wrote:
> > Bodo Eggert wrote:

[...]

> > > Unless you can guarantee every editor to correctly handle this case, all
> > > usage of 8-bit-characters should be disabled - NOT!

> > Actually, it's quite easy to avoid problems by using UTF-8 consistently. 
> >    The 8-bit characters are oddballs and need to be treated specially, 
> > but look, guys, it's 2005 - UTF-8 should be the norm, not the exception.

Right.

> It should, but as long as old programs are still around, we'll have both 
> and need a marker to distinguish them. Otherwise we'll be stuck with
> legacy scripts for a long time.

Please. Let people who mess with legacy stuff suffer, don't make everybody
else (and forevermore!) pay the price.
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513


* Re: [Patch] Support UTF-8 scripts
       [not found]                 ` <4NtL0-5lQ-13@gated-at.bofh.it>
@ 2005-09-16 20:34                   ` "Martin v. Löwis"
  2005-09-17 12:01                     ` Martin Mares
  0 siblings, 1 reply; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-16 20:34 UTC (permalink / raw)
  To: Martin Mares; +Cc: linux-kernel

Martin Mares wrote:
> I doubt that. For ages people were using several different encodings on
> a single system (at least here in .cz) without any markers and although
> there were some rough edges, almost everything worked. Now we do the same
> with ISO-8859-2 and UTF-8, again with no need for a marker.

This is true for text files, where a human reader can interpret the data
correctly even in absence of a declaration. For programming languages,
this is typically not the case. Instead, in order to correctly interpret
the source code, you need to declare the encoding. For a script, this
should be done inside the file itself, as there is no explicit
invocation of a compiler or some such where the script encoding could
be specified externally.

Regards,
Martin


* Re: [Patch] Support UTF-8 scripts
       [not found]                 ` <4Nu4p-5Js-3@gated-at.bofh.it>
@ 2005-09-16 20:41                   ` "Martin v. Löwis"
  2005-09-16 22:08                     ` H. Peter Anvin
  2005-09-16 22:45                     ` Bernd Petrovitsch
  0 siblings, 2 replies; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-16 20:41 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-kernel

H. Peter Anvin wrote:
> You don't have markers (although they're defined, see ISO 2022) for your
> 8-bit encodings, and *THEY'RE THE ONES THAT NEED TO BE DISTINGUISHED.*
> Flagging UTF-8, especially with the BOM (as opposed to the ISO 2022
> signature, <ESC>%G) is pointless in the context, since you still can't
> distinguish your arbitrary number of legacy encodings.

In programming languages that support the notion of source encodings,
you do have markers for 8-bit encodings. For example, in Python, you
can specify

# -*- coding: iso-8859-1 -*-

to denote the source encoding. In Perl, you write

use encoding "latin-1";

(with 'use utf8;' being a special-case shortcut).

In Java, you can specify the encoding through the -encoding argument
to javac. In gcc, you use -finput-charset (with the special case of
-fexec-charset and -fwide-exec-charset potentially being different).

So you *must* use encoding declarations in some languages; the UTF-8
signature is a particularly convenient way of doing so, since it allows
for uniformity across languages, with no need for the text editors to
parse all the different programming languages.
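
For comparison, today's Python 3 standard library exposes this detection as `tokenize.detect_encoding`, which honours both the coding comment and the UTF-8 signature. A short sketch follows; the sample script bodies are made up, and note that Python 3 falls back to UTF-8 when nothing is declared, which was not the case for the Python of 2005.

```python
import codecs
import io
import tokenize


def source_encoding(data):
    """Ask the standard library which encoding a Python source file declares."""
    encoding, _first_lines = tokenize.detect_encoding(io.BytesIO(data).readline)
    return encoding


# Declaration via a coding comment on the second line.
with_cookie = b"#!/usr/bin/python\n# -*- coding: iso-8859-1 -*-\nprint('x')\n"

# Declaration via the UTF-8 signature placed before the #! line.
with_signature = codecs.BOM_UTF8 + b"#!/usr/bin/python\nprint('x')\n"

# No declaration at all.
undeclared = b"#!/usr/bin/python\nprint('x')\n"

results = [source_encoding(s) for s in (with_cookie, with_signature, undeclared)]
```

The signature case is reported as `utf-8-sig`, the codec name Python uses for "UTF-8 with a leading signature to be skipped".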

Regards,
Martin


* Re: [Patch] Support UTF-8 scripts
  2005-09-16 20:41                   ` "Martin v. Löwis"
@ 2005-09-16 22:08                     ` H. Peter Anvin
  2005-09-17  6:05                       ` "Martin v. Löwis"
  2005-09-16 22:45                     ` Bernd Petrovitsch
  1 sibling, 1 reply; 80+ messages in thread
From: H. Peter Anvin @ 2005-09-16 22:08 UTC (permalink / raw)
  To: "Martin v. Löwis"; +Cc: linux-kernel

Martin v. Löwis wrote:
> In programming languages that support the notion of source encodings,
> you do have markers for 8-bit encodings. For example, in Python, you
> can specify
> 
> # -*- coding: iso-8859-1 -*-
> 
> to denote the source encoding. In Perl, you write
> 
> use encoding "latin-1";
> 
> (with 'use utf8;' being a special-case shortcut).
> 
> In Java, you can specify the encoding through the -encoding argument
> to javac. In gcc, you use -finput-charset (with the special case of
> -fexec-charset and -fwide-exec-charset potentially being different).
> 
> So you *must* use encoding declarations in some languages; the UTF-8
> signature is a particularly convenient way of doing so, since it allows
> for uniformity across languages, with no need for the text editors to
> parse all the different programming languages.

Did you miss the point?  There has been a standard for marking for *30 
years*, and virtually NOONE (outside Japan) uses it.

	-hpa


* Re: [Patch] Support UTF-8 scripts
  2005-09-16 20:41                   ` "Martin v. Löwis"
  2005-09-16 22:08                     ` H. Peter Anvin
@ 2005-09-16 22:45                     ` Bernd Petrovitsch
  2005-09-17  6:20                       ` "Martin v. Löwis"
  1 sibling, 1 reply; 80+ messages in thread
From: Bernd Petrovitsch @ 2005-09-16 22:45 UTC (permalink / raw)
  To: "Martin v. Löwis"; +Cc: H. Peter Anvin, linux-kernel

On Fri, 2005-09-16 at 22:41 +0200, "Martin v. Löwis" wrote:
[ Language-specific examples ]

And that's the only working way - the programming languages can actually
do it because it defines the syntax and semantics of the contents
anyways.
With this marker you are interfering with (at least) *all* text files.
And thus with *all* tools which "handle" those text files.

> So you *must* use encoding declarations in some languages; the UTF-8

... if you absolutely want to use non-ASCII characters in the source
code. Most (if not all) of them have a native gettext()
interface ...

> signature is a particularly convenient way of doing so, since it allows
> for uniformity across languages, with no need for the text editors to
> parse all the different programming languages.

And there are always tools out there which simply do not understand the
generic marker and can not ignore it since these bytes are part of the
file. And thus tools (and people) will kill those markers anyway (for
whatever reason, even if it's simple ignorance).

Or another example: (Try to) start a perl/shell/... script (without
parameter on the first line) which was edited on Win* and binary copied
to a Unix system. Or at least guess what will happen ....
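
What happens can be modelled in a few lines of Python. This is a simplified user-space model, not the kernel code: it assumes the interpreter name ends at the newline and that only spaces and tabs are trimmed, so a carriage return left by a DOS-style line ending stays attached to the name. `interpreter_name` is an illustrative helper.

```python
def interpreter_name(first_bytes):
    """Model of #! parsing for a line without arguments (simplified)."""
    # Drop the two magic bytes, cut at the first newline,
    # then trim spaces and tabs only (a carriage return is not trimmed).
    line = first_bytes[2:].split(b"\n", 1)[0]
    return line.strip(b" \t")


unix_name = interpreter_name(b"#!/bin/sh\necho hi\n")
dos_name = interpreter_name(b"#!/bin/sh\r\necho hi\r\n")
```

Under this model the second script asks for an interpreter literally named `/bin/sh` followed by a carriage return, which normally does not exist.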

	Bernd
-- 
Firmix Software GmbH                   http://www.firmix.at/
mobil: +43 664 4416156                 fax: +43 1 7890849-55
          Embedded Linux Development and Services





* Re: [Patch] Support UTF-8 scripts
  2005-09-16 22:08                     ` H. Peter Anvin
@ 2005-09-17  6:05                       ` "Martin v. Löwis"
  0 siblings, 0 replies; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-17  6:05 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-kernel

H. Peter Anvin wrote:
> Did you miss the point?  There has been a standard for marking for *30
> years*, and virtually NOONE (outside Japan) uses it.

I understood that fact - but I fail to see the point. If you mean to
imply "people did not use ISO-2022, therefore, they will never use
encoding declarations", I think this implication is false. People
do use encoding declarations.

If you mean to imply "people did not use ISO-2022, therefore, they
will never use the UTF-8 signature", I think this implication is
also false. People do use the UTF-8 signature, even outside Japan.
The primary reason is that the UTF-8 signature is much easier to
implement than ISO-2022: if you support UTF-8 in your tool (say,
a text editor), anyway, adding support for the UTF-8 signature
is almost trivial. Therefore, many more editors support the UTF-8
signature today than ever supported ISO-2022.

Regards,
Martin


* Re: [Patch] Support UTF-8 scripts
  2005-09-16 22:45                     ` Bernd Petrovitsch
@ 2005-09-17  6:20                       ` "Martin v. Löwis"
  2005-09-17 22:28                         ` Bernd Petrovitsch
  0 siblings, 1 reply; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-17  6:20 UTC (permalink / raw)
  To: Bernd Petrovitsch
  Cc: "Martin v. Löwis", H. Peter Anvin, linux-kernel

Bernd Petrovitsch wrote:
> On Fri, 2005-09-16 at 22:41 +0200, "Martin v. Löwis" wrote:
> [ Language-specific examples ]
> 
> And that's the only working way - the programming languages can actually
> do it because it defines the syntax and semantics of the contents
> anyways.

It works from the programming language point of view, but it is a mess
from the text editor point of view.

Even for the programming language, it is a pain to implement: what
if you have non-ASCII characters before the pragma that declares the
encoding? and so on.

> With this marker you are interfering with (at least) *all* text files.

Hmm. What does that have to do with the patch I'm proposing? This
patch does *not* interfere with all text files. It is only relevant
for executable files starting with the #! magic.

> And thus with *all* tools which "handle" those text files.

This is simply not true. My patch does not interfere with any such
tools. They continue to work just fine.

>>So you *must* use encoding declarations in some languages; the UTF-8
> 
> 
> ... if you absolutely want to use non-ASCII characters in the source
> code. Most (if not all) of them have a native gettext()
> interface ...

True. However, this is more tedious to use. Also, it doesn't apply to
all cases: e.g. if you have comments, documentation etc. in the source
code, gettext is no option.

Likewise, people often want to use non-ASCII in identifiers (e.g. class
Lösung); this can also only work if you know what the source encoding
is. You may argue that people just shouldn't do that, because it does
not work well, but this is not convincing: it doesn't work well because
language developers are too lazy to implement it. In fact, some languages
(C, C++, Java, C#) do support non-ASCII identifiers (at least in their
specifications); there really isn't a good reason not to support it
in scripting languages as well.

> And there are always tools out there which simply do not understand the
> generic marker and can not ignore it since these bytes are part of the
> file.

This conclusion is false. Many tools that don't understand the file
structure still can do their job on the files. So the fact that a tool
does not understand the structure does not necessarily imply that
the tool breaks when the structure changes.

> Or another example: (Try to) start a perl/shell/... script (without
> parameter on the first line) which was edited on Win* and binary copied
> to a Unix system. Or at least guess what will happen ....

For a Python script, I don't need to guess: It will just work.

Regards,
Martin


* Re: [Patch] Support UTF-8 scripts
       [not found]             ` <200509170028.59973.dhazelton@enter.net>
@ 2005-09-17  6:28               ` "Martin v. Löwis"
  2005-09-17 22:31                 ` D. Hazelton
  2005-09-17 17:16               ` Bodo Eggert
  1 sibling, 1 reply; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-17  6:28 UTC (permalink / raw)
  To: D. Hazelton; +Cc: 7eggert, H. Peter Anvin, linux-kernel

D. Hazelton wrote:
> This is a bogus argument. You're comparing the way a _binary_ 
> executable works to the way an interpreted _text_ script works. 
> execve(), at least on my system, isn't capable of running a script - 
> if I want to do that from a program I have to tell execve() that it's 
> running /bin/sh and the script file is in the parameter list. 

This being the linux-kernel list, I assume your system is Linux, no?
Well, on Linux, execve *does* support script files. This is the whole
point of my patch - I would not propose a kernel patch to improve
this support if it weren't there in the first place.

> While I appreciate that the kernel is capable of performing complex 
> actions when execve runs into a file that is not an a.out or elf 
> binary I have yet to see a "binfmt script" option in the kernel 
> config files ever.

It's not a config option because it is always enabled. See
fs/binfmt_script.c for details. It wasn't integrated into the binfmt
system until I made it so some ten years ago, though.

> On the other hand, there is the "binfmt_misc" option, which does the 
> work that you seem to be looking for and can, AFAIK, be set to handle 
> both ASCII and UTF-8 scripts. Why add the complexity to the kernel 
> when it's not needed?

One shouldn't add complexity if it's not needed. However, this patch
does not add complexity. It is fairly trivial.

Regards,
Martin


* Re: [Patch] Support UTF-8 scripts
       [not found]           ` <4NsOZ-3YF-9@gated-at.bofh.it>
       [not found]             ` <4NsYH-4bv-27@gated-at.bofh.it>
@ 2005-09-17  6:45             ` "Martin v. Löwis"
  1 sibling, 0 replies; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-17  6:45 UTC (permalink / raw)
  To: 7eggert; +Cc: linux-kernel

Bodo Eggert wrote:
> BTW2: However, I don't like the patch.
> 
> I'd first check for a utf-8 signature, and if it's found, adjust the
> buffer offset by 3. Then I'd run the old code checking for the sh_bang.
> OTOH, I just read the patch and not the .c file, maybe (unlikely) my idea
> wouldn't work correctly.

I believe this wouldn't work. binfmt_script currently has the code

        for (cp = bprm->buf+2; (*cp == ' ') || (*cp == '\t'); cp++);

to get out the (start of the) interpreter file name. This knows
implicitly that you need to skip two bytes #!; for UTF-8 signatures,
it would be 5 bytes.

Now, if you meant to suggest that bprm->buf should be adjusted (e.g.
through 'bprm->buf += 3'): this cannot work, either. It would break
subsequent binfmt modules which assume that bprm->buf holds the first
BINPRM_BUF_SIZE bytes of the file to be executed.

If you suggest that the patch should merely check for the signature,
and then skip it: this is what the patch does.

Regards,
Martin

P.S. I just noticed there is a

bprm->buf[BINPRM_BUF_SIZE - 1] = '\0';

which seems incorrect: it puts a null-byte into the buffer data,
thus (slightly) corrupting the data for subsequent binfmt modules
(although it already knows the file starts with #!, so the
 subsequent modules will fail, anyway)

Also, I think the above loop should also terminate for

 *cp == '\0'

if there is neither a space nor a tab in the file.
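
The offset handling discussed above can be sketched in Python. This is a user-space model under stated assumptions, not the patch itself: `parse_interpreter` and `UTF8_SIGNATURE` are illustrative names, the buffer is never modified (mirroring the constraint on bprm->buf), and the scan also stops at a NUL byte or at the end of the buffer.

```python
UTF8_SIGNATURE = b"\xef\xbb\xbf"
BLANKS = (b" ", b"\t")
TERMINATORS = (b" ", b"\t", b"\n", b"\0")


def parse_interpreter(buf):
    """Return the interpreter path named on the #! line, or None."""
    # Remember how many bytes to skip instead of advancing the buffer itself.
    offset = len(UTF8_SIGNATURE) if buf.startswith(UTF8_SIGNATURE) else 0
    if buf[offset:offset + 2] != b"#!":
        return None

    pos = offset + 2  # 2 bytes without a signature, 5 bytes with one
    # Skip blanks after "#!"; the bounds check ends the loop on short input.
    while pos < len(buf) and buf[pos:pos + 1] in BLANKS:
        pos += 1

    # The interpreter name runs up to the next blank, newline, or NUL.
    end = pos
    while end < len(buf) and buf[end:end + 1] not in TERMINATORS:
        end += 1
    return buf[pos:end] or None


plain = parse_interpreter(b"#! /bin/sh\necho hi\n")
signed = parse_interpreter(UTF8_SIGNATURE + b"#!/usr/bin/python -u\nprint(1)\n")
not_a_script = parse_interpreter(b"\x7fELF\x01\x01")
empty_line = parse_interpreter(b"#!  ")
```

Keeping the offset in a local variable means any later consumer of the same buffer still sees the file's first bytes unchanged.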


* Re: [Patch] Support UTF-8 scripts
  2005-09-16 20:34                   ` "Martin v. Löwis"
@ 2005-09-17 12:01                     ` Martin Mares
  2005-09-17 12:25                       ` "Martin v. Löwis"
  0 siblings, 1 reply; 80+ messages in thread
From: Martin Mares @ 2005-09-17 12:01 UTC (permalink / raw)
  To: "Martin v. Löwis"; +Cc: linux-kernel

Hello!

> This is true for text files, where a human reader can interpret the data
> correctly even in absence of a declaration. For programming languages,
> this is typically not the case. Instead, in order to correctly interpret
> the source code, you need to declare the encoding. For a script,
[...]

This makes no sense. For a script, the shell does not care about the encoding
at all.

Also, currently, people use zillions of encodings, most of which have no
signature, so introducing a signature for UTF-8 does not win anything.

In the future, most people will probably use only UTF-8, so the signature
carries no information.

				Have a nice fortnight
-- 
Martin `MJ' Mares   <mj@ucw.cz>   http://atrey.karlin.mff.cuni.cz/~mj/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
Q: Who invented the first airplane that did not fly?  A: The Wrong Brothers.


* Re: [Patch] Support UTF-8 scripts
  2005-09-17 12:01                     ` Martin Mares
@ 2005-09-17 12:25                       ` "Martin v. Löwis"
  2005-09-17 12:28                         ` Martin Mares
  2005-09-19  7:08                         ` Pavel Machek
  0 siblings, 2 replies; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-17 12:25 UTC (permalink / raw)
  To: Martin Mares; +Cc: linux-kernel

Martin Mares wrote:
> This makes no sense. For a script, the shell does not care about the encoding
> at all.

I'm not (only) talking about /bin/sh. I'm primarily talking about
/usr/bin/python, /usr/bin/perl, and /usr/bin/wish. In all these
languages, the interpreter *does* care about the encoding.

1. In Python, the syntax

   u"some data"

   denotes a Unicode literal (stored internally either in UCS-2 or
   UCS-4); the literals are converted from the source encoding to
   the internal representation. This requires knowledge of the source
   encoding.

2. In Tcl, all strings are internally represented in UTF-8, and
   converted from the source encoding (which currently is inferred
   from the locale of the process executing the script).

3. In Perl, 'use utf8' declares that the encoding of the script is
   UTF-8, meaning that non-ASCII can be used in string literals,
   identifiers, and regular expressions.
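
Why the interpreter needs to know the source encoding can be shown with a short Python sketch. The byte values and the identifier are chosen for illustration; the codec names are the ones shipped with Python.

```python
# One byte above 0x7F inside a string literal, as it would sit in a script file.
raw = b'name = "\xf8"'

as_latin1 = raw.decode("iso-8859-1")  # 0xF8 is U+00F8 (o with stroke)
as_latin2 = raw.decode("iso-8859-2")  # 0xF8 is U+0159 (r with caron)

# A UTF-8 encoded identifier read by a tool that assumes ISO-8859-1:
# the two bytes of o-umlaut turn into two separate characters.
identifier = "class L\u00f6sung"
misread = identifier.encode("utf-8").decode("iso-8859-1")
```

The same bytes yield different literals depending on the assumed encoding, and a UTF-8 identifier silently changes length and spelling when read as an 8-bit encoding.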

> Also, currently, people use zillions of encodings, most of which have no
> signature, so introducing a signature for UTF-8 does not win anything.

This specific patch does win something: it allows executing scripts
which start with <utf8 signature>#!

This is useful e.g. for Python, which recognizes the UTF-8 signature
as declaring the source encoding of the Python module to be UTF-8.

> In the future, most people will probably use only UTF-8, so the signature
> carries no information.

In the future, the signature *will* carry no information. But the future
is, well, in the future.

I just can't understand why (some) people are so opposed to this patch.
It is a really trivial, straightforward change. It introduces no
policy, just a feature: you can put the UTF-8 signature in your script
file, if you want to (and your scripting language supports it). By
no means does it force you to put the UTF-8 signature in all your script
files, let alone all your text files.

Regards,
Martin

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
  2005-09-17 12:25                       ` "Martin v. Löwis"
@ 2005-09-17 12:28                         ` Martin Mares
  2005-09-17 12:53                           ` "Martin v. Löwis"
  2005-09-19  7:08                         ` Pavel Machek
  1 sibling, 1 reply; 80+ messages in thread
From: Martin Mares @ 2005-09-17 12:28 UTC (permalink / raw)
  To: "Martin v. Löwis"; +Cc: linux-kernel

Hello!

> I'm not (only) talking about /bin/sh. I'm primarily talking about
> /usr/bin/python, /usr/bin/perl, and /usr/bin/wish. In all these
> languages, the interpreter *does* care about the encoding.

Agreed. On the other hand, in all these languages you can pass the encoding
as a parameter to the interpreter, cannot you?

> In the future, the signature *will* carry no information. But the future
> is, well, in the future.
> 
> I just can't understand why (some) people are so opposed to this patch.

Occam's razor?

				Have a nice fortnight
-- 
Martin `MJ' Mares   <mj@ucw.cz>   http://atrey.karlin.mff.cuni.cz/~mj/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
"In accord to UNIX philosophy, PERL gives you enough rope to hang yourself." -- Larry Wall

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
  2005-09-17 12:28                         ` Martin Mares
@ 2005-09-17 12:53                           ` "Martin v. Löwis"
  2005-09-17 13:05                             ` Martin Mares
  0 siblings, 1 reply; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-17 12:53 UTC (permalink / raw)
  To: Martin Mares; +Cc: linux-kernel

Martin Mares wrote:
> Agreed. On the other hand, in all these languages you can pass the encoding
> as a parameter to the interpreter, cannot you?

Not in general, no. If you have a library of multiple modules, different
modules may have different encodings. In particular, if UTF-8 in source
code becomes more common (because it is better supported than now),
people will start using it for libraries. At the same time, a lot of
code is around that still uses other encodings (typically Latin-1).
So you may have two encodings in the same program (different modules);
that's why you need the encoding declared *in* the file.

Now, there are different ways to do that: you can find language-specific
ways (such as 'use utf8;'), and this is what most languages currently
do. However, this is a nightmare for editor developers, and a severe
inconvenience for script authors, who now have to put the encoding
declaration into the files.

With the UTF-8 signature, things become much simpler: editors can
automatically detect presence of the signature, and need no
language-specific parsing. The language interpreters have a guarantee
that the signature is at the beginning of the file, so they don't
need to switch encodings in the middle of parsing. Users can configure
their editors to always write the signature for certain types of
files, and don't need to worry about putting correct encoding
declarations into the files.

>>In the future, the signature *will* carry no information. But the future
>>is, well, in the future.
>>
>>I just can't understand why (some) people are so opposed to this patch.
> 
> 
> Occam's razor?

Probably not literally, as we are not searching for an explanation of
some phenomenon. You are probably suggesting that people dislike the
feature because they see no need for it (as one poster stated it:
I don't use UTF-8, so I don't want that feature).

However, I do believe there is a need for the feature, and that
the gains by far outweigh the costs.

Regards,
Martin

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
  2005-09-17 12:53                           ` "Martin v. Löwis"
@ 2005-09-17 13:05                             ` Martin Mares
  2005-09-17 13:33                               ` "Martin v. Löwis"
  0 siblings, 1 reply; 80+ messages in thread
From: Martin Mares @ 2005-09-17 13:05 UTC (permalink / raw)
  To: "Martin v. Löwis"; +Cc: linux-kernel

Hello!

> With the UTF-8 signature, things become much simpler: editors can
> automatically detect presence of the signature, and need no
> language-specific parsing.

I still think that this does solve only a completely insignificant part
of the problem. Given the zillion existing encodings, you are able to identify
UTF-8, leaving you with zillion-1 other encodings you are unable to deal with.

> Probably not literally, as we are not searching for an explanation of
> some phenomenon.

ACK, not literally.

> You are probably suggesting that people dislike the
> feature because they see no need for it (as one poster stated it:
> I don't use UTF-8, so I don't want that feature).

I see a need for a feature which would help identify the charset of the script,
but the patch in question obviously doesn't offer that -- it solves only a single
special case of the problem in a completely non-systematic way. This does not
sound right.

				Have a nice fortnight
-- 
Martin `MJ' Mares   <mj@ucw.cz>   http://atrey.karlin.mff.cuni.cz/~mj/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
"How I need a drink, alcoholic in nature, after the tough chapters involving quantum mechanics!" = \pi

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
  2005-09-17 13:05                             ` Martin Mares
@ 2005-09-17 13:33                               ` "Martin v. Löwis"
  0 siblings, 0 replies; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-17 13:33 UTC (permalink / raw)
  To: Martin Mares; +Cc: linux-kernel

Martin Mares wrote:
> I still think that this does solve only a completely insignificant part
> of the problem. Given the zillion existing encodings, you are able to identify
> UTF-8, leaving you with zillion-1 other encodings you are unable to deal with.

Correct. This is a special case only. The more general problem is
already solved: both Python and Perl support source encodings in
the entire zillion encodings. As I explained, this general solution,
while being general, is also not very user-friendly.

Now, why does UTF-8 deserve to be a special case? One reason is that it
has the potential to replace the entire zillion of encodings over time.
However, this can only happen if tool support for this encoding is
really good. The patch contributes a (minor) fragment to the support -
it is a small patch only.

The other reason is that UTF-8 defines its own encoding declaration,
unlike most of the other zillion-1 encodings. So naturally, an
implementation that supports UTF-8 in this way cannot extend to other
encodings. hpa suggested that ISO-2022 would be a more general
mechanism, but pointed out that it hasn't been implemented widely in the
last 30 years, so it is unlikely that it will get much better support
in the next thirty years.

> I see a need for a feature which would help identify the charset of the script,
> but the patch in question obviously doesn't offer that -- it solves only a single
> special case of the problem in a completely non-systematic way. This does not
> sound right.

It's not a complete solution, but it *is* part of a general solution.
People have tried in the past to solve the general problem of "identify
the encoding of a text file", both in really general ways (iso-2022)
and in format-specific ways (perl, python). All these solutions are
tedious to use.

There is another general solution: gradually replace the zillion
encodings with a single one, namely Unicode (or, specifically, UTF-8).
This solution will only work when done gradually. Clearly, this
patch doesn't implement this solution entirely, but it contributes
to it, by making usage of UTF-8 in script files simpler. Many
more changes to other software (i.e. non-kernel changes) will be
necessary to implement this solution, as well as (obviously) changes
to existing files.

Regards,
Martin

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
       [not found]             ` <200509170028.59973.dhazelton@enter.net>
  2005-09-17  6:28               ` "Martin v. Löwis"
@ 2005-09-17 17:16               ` Bodo Eggert
  1 sibling, 0 replies; 80+ messages in thread
From: Bodo Eggert @ 2005-09-17 17:16 UTC (permalink / raw)
  To: D. Hazelton; +Cc: 7eggert, H. Peter Anvin, Martin v.Löwis, linux-kernel

On Sat, 17 Sep 2005, D. Hazelton wrote:
> On Friday 16 September 2005 18:02, Bodo Eggert wrote:
> > Bernd Petrovitsch <bernd@firmix.at> wrote:
> > > On Thu, 2005-09-15 at 20:39 +0200, "Martin v. Löwis" wrote:
> > >> H. Peter Anvin wrote:

> > >> > In Unix, it's a hideously bad
> > >> > idea.  The reason is that Unix inherently assumes that text
> > >> > streams can be merged, split, and modified.  In other words,
> > >> > unless you can guarantee that EVERY program can handle BOM
> > >> > EVERYWHERE, it's broken.
> >
> > You can't sort /bin/ls into /tmp/ls and expect /tmp/ls to be
> > meaningful, but /bin/ls works as expected. You can't usually
> > concat perl scripts and shell scripts either, but both kinds of
> > script run quite well.
> >
> > And if you do "cat /bin/cat /bin/cp > /bin/catcp", what's "catcp
> > foo bar" supposed to do? First output foo and bar to stdout, then
> > copy foo to bar? Is execve() broken if it doesn't do what I
> > described? Is the ELF header broken because it's not recognized
> > EVERYWHERE? I don't think so.
> 
> This is a bogus argument. You're comparing the way a _binary_ 
> executable works to the way an interpreted _text_ script works.

You can live with binaries, therefore the features not provided by 
binaries aren't vital for each and every executable.

> execve(), at least on my system, isn't capable of running a script - 
> if I want to do that from a program I have to tell execve() that it's 
> running /bin/sh and the script file is in the parameter list. 

Fix your system, it's broken.

> While I appreciate that the kernel is capable of performing complex 
> actions when execve runs into a file that is not an a.out or elf 
> binary I have yet to see a "binfmt script" option in the kernel 
> config files ever.

Your wish ... but you won't be happy.

--- ../t/linux-2.6.12/fs/Makefile	2005-06-17 21:48:29.000000000 +0200
+++ ./fs/Makefile	2005-09-17 18:02:36.000000000 +0200
@@ -20,9 +20,7 @@ obj-y				+= $(nfsd-y) $(nfsd-m)
 obj-$(CONFIG_BINFMT_AOUT)	+= binfmt_aout.o
 obj-$(CONFIG_BINFMT_EM86)	+= binfmt_em86.o
 obj-$(CONFIG_BINFMT_MISC)	+= binfmt_misc.o
-
-# binfmt_script is always there
-obj-y				+= binfmt_script.o
+obj-$(CONFIG_BINFMT_SCRIPT)	+= binfmt_script.o
 
 obj-$(CONFIG_BINFMT_ELF)	+= binfmt_elf.o
 obj-$(CONFIG_BINFMT_ELF_FDPIC)	+= binfmt_elf_fdpic.o
--- ../t/linux-2.6.12/fs/Kconfig.binfmt	2005-06-17 21:48:29.000000000 +0200
+++ ./fs/Kconfig.binfmt	2005-09-17 17:59:39.000000000 +0200
@@ -42,6 +42,12 @@ config BINFMT_FLAT
 	help
 	  Support uClinux FLAT format binaries.
 
+config BINFMT_SCRIPT
+       bool "Kernel support for script files"
+       default y
+       help
+         Support script files starting with a '#!' marker.
+
 config BINFMT_ZFLAT
 	bool "Enable ZFLAT support"
 	depends on BINFMT_FLAT

> On the other hand, there is the "binfmt_misc" option, which does the 
> work that you seem to be looking for and can, AFAIK, be set to handle 
> both ASCII and UTF-8 scripts. Why add the complexity to the kernel 
> when it's not needed?

Skipping 3 bytes vs. handling tons of binary formats? I bet the memory
required to hold the utf8 binfmt_misc entry alone will be bigger than the
code added by this patch.

> > BTW: I think decent utf-8 capable programs SHOULD ignore extra BOM
> > markers.
> 
> All well and good if you use UTF-8. I, personally, am happy with ASCII 
> and have found no need for the extensive UTF character set (in fact, 
> I despise it when people insist on using UTF-8 in mediums in which 
> the character set is defined in the standards to be ASCII or a subset 
> of ASCII)

I'm not using it, because nobody else is using it, and everybody else does
the same for exactly the same reasons. That's why just using utf-8 does 
not work out.

However, if there were means of using both transparently, people could 
migrate. The editor part is simple, but if you can't use your favorite 
editor to generate shell scripts, it's a showstopper.

> Since I am quite happy with the small subset of ASCII that I use on a 
> regular basis, and since I am always seeking ways to optimize my code 
> and my scripts I don't want the editor I'm using adding extra 
> characters behind my back. 

ACK. But you should be able to edit international text without tons of
helper scripts, so a BOM will be useful to mark utf-8.

> > > And you (or at least I) do `grep`/`egrep`/`fgrep`, `wc` them.
> >
> > You can *grep utf-8 scripts, but you can't *grep binaries.
> > Shouldn't this be fixed by implementing an in-kernel ASCII
> > assembler and convert all binaries to assembler text?
> 
> Bogus argument. Every shell I've ever used has expected the command 
> line to contain only ASCII characters. With that restriction in mind 
> it's clear that it'd be hard to put a UTF8 string as the argument to 
> grep. Although I doubt wc would be buggered by UTF8 input... 

If your shell isn't 8-bit-clean, it should have been replaced in the last 
millennium. Handling combined characters will be the problem.

> > > And
> > > probably with several other tools too - think of `find <dir>
> > > -type f -print0 | xargs -0r <cmd>`.
> >
> > utf-8 filenames will work correctly (unless used as an extended
> > BASIC script with non-ASCII variable names, but that would be
> > insane).
> 
> This is the truth. As I previously mentioned I have yet to find a 
> shell that accepted UTF8 on the command line without choking. And 
> allowing UTF8 for filenames would, I believe, require any number of 
> changes to the kernel,  not the least of which would be changes to 
> the various filesystems to allow for UTF8 and to any number of system 
> calls that would be taking a filename for an argument.

It's not a task of allowing utf-8 filenames, but a task of disallowing
non-canonicalized and non-utf8 filenames if files might be created. Systems
doing that won't be strictly POSIX conformant, but as long as there is a
mounted FAT partition, it can't be anyway.

> > >> just pointless to do that. You create them with text editors,
> > >> and those can handle the UTF-8 signature.
> > >
> > > It is not uncommon to create scripts and the like with other
> > > programs, other scripts, what-else.
> >
> > It's not uncommon to create binaries using other programs. So what?
> 
> Bullsh*t. The case of one binary creating another doesn't apply - 
> because you either enter the data for the binary by hand (tedious and 
> difficult) or you use a binary that takes input and produces the 
> binary you need. And if the binary is missing the proper headers, 
> it's pretty much useless.

And you can live with binaries being non-editable, non-generatable without 
proper tools.

> When a script creates another script it is 
> just creating a text file, putting the data in the file as it reaches 
> those parts and has no way to know that it should be inserting the 
> BOM.

If scripts are just text files, why doesn't sort<script|sh usually do
the right thing?

Scripts are _not_ random text, they have specific structures. They consist 
of well-formed data, and you should better know what kind of script you're 
creating, therefore you should also know whether to write sh_bang or
BOM_sh_bang. If you don't, don't generate the script!

> > > Apart from the fact the a "script" is merely a plain text file with
> > > the eXecutable bit set.
> >
> > And an utf-8 script is a utf-8 encoded text file with its
> > executable bit set.
> 
> And the kernel should have no more code in it to execute them than is 
> already present in the binfmt_misc code. No need for special kernel 
> code when you can simply hand a chunk of parameters regarding the 
> various executable formats to the kernel using a clean, simple and 
> proven interface. And even then I feel it should be limited to 
> binaries - a script is, by definition, interpreted. As such, it 
> belongs in the same place as the interpreter - in userland. (And I 
> fail to see why this is even brought up other than some people being 
> lazy and not wanting to do things _correctly_)

So the binfmt_script code should be completely abandoned in favor of binfmt_misc?

> > > And that is the only difference, so you have to at
> > > least (all instances of) `chmod` to insert and remove the BOM.
> >
> > [...]
> >
> > In order to make it harder for the interpreter to correctly detect
> > utf-8? You can have DOS executables run in dosboxes, windows
> > applications run in windows, java archives run in java, but utf-8
> > scripts should be mangled in order to work "correctly", and mangled
> > back in order to be editable? *That*'s insane!
> >
> > Just make execve ignore the BOM marker before "#!" as the patch
> > does, and you're done. The rest is somebody else's not-a-problem.
> 
> GCC allows for non-ascii input as a formality. The specifications of 
> both C and C++ clearly define the input character set to be limited 
> to an extremely limited subset of ASCII, as do the specifications of 
> most other languages.

This is a userspace problem.

> (Perl 6 is the first language I've ever heard of 
> that directly includes non-ascii characters in the accepted character 
> set)

The MS-DOS 3.3 shell accepted international characters in program names.

> AFAIK, the most common shells don't accept UTF-8 in the command set - 
> they instead see the non-ascii UTF-8 characters as a series of bytes, 
> and if one of them happens to be NULL, you're pretty much screwed.

There is no '\0' in utf-8-encoded data.
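
This follows from how UTF-8 is built: every byte of a multi-byte
sequence has its high bit set, so a zero byte can only come from the
character U+0000 itself. A small Python sketch of that property (the
helper name has_nul_byte and the sample strings are chosen for this
example only):

```python
def has_nul_byte(text: str) -> bool:
    """Return True if the UTF-8 encoding of text contains a 0x00 byte."""
    return b"\x00" in text.encode("utf-8")


# Non-ASCII samples; none of them contains U+0000.
samples = ["Löwis", "Mareš", "日本語", "\u20ac"]
nul_free = not any(has_nul_byte(s) for s in samples)

# Every byte that encodes a non-ASCII character is >= 0x80.
high_bytes_only = all(b >= 0x80 for b in "ö".encode("utf-8"))

# By contrast, UTF-16 does embed zero bytes, even for plain ASCII text.
utf16_has_nul = b"\x00" in "A".encode("utf-16-le")
```

The concern about embedded zero bytes applies to UTF-16 and UTF-32,
which is why those encodings cannot pass through C string interfaces
unchanged while UTF-8 can.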

> > BTW2: However, I don't like the patch.
> 
> Neither do I. such a thing doesn't belong in the kernel.

It's better than 

- using a legacy wrapper script for each script.

- mangling each utf8 file before and after editing it

- forcing the world to convert to utf-8 within two weeks

- using a wrapper script around each and every utf-8 script which would
  unnecessarily throw out pages and waste CPU cycles while requiring
  each user to add several KB of kernel code for binfmt_misc and to
  have the interpreter for the wrapper script installed

I actually created a wrapper script for binfmt_misc and called it a 
hundred times, here is the result:

$ time for((i=0;i<100;i++));do ./foo > /dev/null;done # with wrapper

real    0m2.350s
user    0m1.808s
sys     0m0.476s
$ time for((i=0;i<100;i++));do ./bar > /dev/null;done # without wrapper

real    0m0.461s
user    0m0.232s
sys     0m0.216s

And I'm sure this script has a bug to exploit.

(foo and bar will just print "test\n" to stdout)
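
For a rough sense of scale, the two wall-clock figures above can be
turned into a per-invocation cost. This is plain arithmetic on the
numbers quoted in this mail, written as a short Python sketch; the
variable names are only for illustration.

```python
runs = 100                 # iterations of each loop above
with_wrapper = 2.350       # "real" seconds for ./foo (binfmt_misc wrapper)
without_wrapper = 0.461    # "real" seconds for ./bar (plain #! script)

# Extra wall-clock time per call that is attributable to the wrapper.
overhead_ms = (with_wrapper - without_wrapper) / runs * 1000.0

# How many times slower the wrapped script is overall.
slowdown = with_wrapper / without_wrapper

print("%.2f ms extra per call, %.1fx slower" % (overhead_ms, slowdown))
```

That works out to roughly 19 ms of extra time per call, or about a
factor of five, for a script that does almost nothing. The relative cost
would be smaller for scripts that do real work.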

-- 

"Our parents, worse than our grandparents, gave birth to us who are worse than
they, and we shall in our turn bear offspring still more evil."
	-- Horace (BC 65-8)

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
  2005-09-17  6:20                       ` "Martin v. Löwis"
@ 2005-09-17 22:28                         ` Bernd Petrovitsch
  2005-09-18  7:23                           ` "Martin v. Löwis"
  0 siblings, 1 reply; 80+ messages in thread
From: Bernd Petrovitsch @ 2005-09-17 22:28 UTC (permalink / raw)
  To: "Martin v. Löwis"; +Cc: H. Peter Anvin, linux-kernel

On Sat, 2005-09-17 at 08:20 +0200, "Martin v. Löwis" wrote:
> Bernd Petrovitsch wrote:
> > On Fri, 2005-09-16 at 22:41 +0200, "Martin v. Löwis" wrote:
> > [ Language-specific examples ]
> > 
> > And that's the only working way - the programming languages can actually
> > do it because it defines the syntax and semantics of the contents
> > anyways.
> 
> It works from the programming language point of view, but it is a mess
> from the text editor point of view.

Most of the text editors have ways to markup the source files. Not even
the various editors are able to agree on one method for all, so why
could the (Linux) world agree on one for all text files?

> Even for the programming language, it is a pain to implement: what
> if you have non-ASCII characters before the pragma that declares the
> encoding? and so on.

That's the problem of the language definers who absolutely want such
(IMHO absolutely superfluous) features.

> > With this marker you are interfering with (at least) *all* text files.
> 
> Hmm. What does that have to do with the patch I'm proposing? This
> patch does *not* interfere with all text files. It is only relevant
> for executable files starting with the #! magic.

It *does* interfere since scripts are also text files in every aspect.
So every feature you want for "scripts" you also get for text files (and
vice versa BTW).
If you think "script" and "text file" are different, define both of
them, please, otherwise a discussion is pointless.

> > And there are always tools out there which simply do not understand the
> > generic marker and can not ignore it since these bytes are part of the
> > file.
> 
> This conclusion is false. Many tools that don't understand the file
> structure still can do their job on the files. So the fact that a tool
> does not understand the structure does not necessarily imply that
> the tool breaks when the structure changes.

It *may* break just because of some to-be-ignored inline marking due to
some questionable feature.
And *when* (not if) it breaks, it is probably cumbersome to find since
you have pretty unprintable characters.
Let alone the confusion why the size of a file with `ls -l` is different
from the size in the editor or a marker-aware `wc -c`.
So IMHO either you have a clear and visible marker or you have none at all.
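
The size difference mentioned here is exactly the three signature bytes.
A minimal Python sketch, assuming a made-up two-line shell script as
sample content:

```python
import codecs

script = "#!/bin/sh\necho test\n"

# What ends up on disk when an editor writes the UTF-8 signature first.
with_bom = codecs.BOM_UTF8 + script.encode("utf-8")

# Byte count as reported by `ls -l` or a plain `wc -c`.
size_on_disk = len(with_bom)

# A signature-aware reader (Python's "utf-8-sig" codec) drops the marker.
decoded = with_bom.decode("utf-8-sig")
size_seen_by_reader = len(decoded.encode("utf-8"))
```

Here size_on_disk is 3 larger than size_seen_by_reader, and the decoded
text is identical to the original script.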

> > Or another example: (Try to) start a perl/shell/... script (without
> > parameter on the first line) which was edited on Win* and binary copied
> > to a Unix system. Or at least guess what will happen ....
> 
> For a Python script, I don't need to guess: It will just work.

Then write a short python script (with a "#!/usr/bin/python" line at the
start [without parameters]) natively on a Win*-system, copy it binary
over to an arbitrary Linux system and see what's happening.
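
What goes wrong in that experiment is the carriage return that a
Windows editor leaves at the end of the first line. The sketch below
approximates how a #! line is split into interpreter and optional
argument; it is a simplified user-space model in Python, not the
kernel's code, and it ignores the kernel's buffer size limit. The
function name parse_interpreter_line is hypothetical.

```python
def parse_interpreter_line(head: bytes):
    """Return (interpreter, argument) from a '#!' line, or None.

    Only spaces and tabs are treated as separators, so a trailing
    carriage return stays attached to the last word.
    """
    if not head.startswith(b"#!"):
        return None
    line = head[2:].split(b"\n", 1)[0]   # cut at the first newline
    line = line.strip(b" \t")            # spaces and tabs only, not '\r'
    if not line:
        return None
    for i, ch in enumerate(line):
        if ch in b" \t":
            return line[:i], line[i:].strip(b" \t")
    return line, b""


unix_style = parse_interpreter_line(b"#!/usr/bin/python\nprint 1\n")
dos_style = parse_interpreter_line(b"#!/usr/bin/python\r\nprint 1\r\n")
```

With Unix line endings the interpreter is /usr/bin/python; with DOS line
endings it becomes "/usr/bin/python\r", a path that normally does not
exist. When the first line carries a parameter, the carriage return ends
up on the parameter instead of the path.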

	Bernd
-- 
Firmix Software GmbH                   http://www.firmix.at/
mobil: +43 664 4416156                 fax: +43 1 7890849-55
          Embedded Linux Development and Services




^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
  2005-09-17  6:28               ` "Martin v. Löwis"
@ 2005-09-17 22:31                 ` D. Hazelton
  2005-09-18  3:45                   ` Kyle Moffett
  2005-09-18  6:58                   ` "Martin v. Löwis"
  0 siblings, 2 replies; 80+ messages in thread
From: D. Hazelton @ 2005-09-17 22:31 UTC (permalink / raw)
  To: Martin v. Löwis; +Cc: 7eggert, H. Peter Anvin, linux-kernel

On Saturday 17 September 2005 06:28, "Martin v. Löwis" wrote:
> D. Hazelton wrote:
> > This is a bogus argument. You're comparing the way a _binary_
> > executable works to the way an interpreted _text_ script works.
> > execve(), at least on my system, isn't capable of running a
> > script - if I want to do that from a program I have to tell
> > execve() that it's running /bin/sh and the script file is in the
> > parameter list.
>
> This being the linux-kernel list, I assume your system is Linux,
> no? Well, on Linux, execve *does* support script files. This is the
> whole point of my patch - I would not propose a kernel patch to
> improve this support if it weren't there in the first place.

This is news to me. The last time I handed execve() a script as a 
parameter I had errors returned from execve() -- I must admit that
this was not on my current system and I had assumed that the behavior 
would be consistent.

> > While I appreciate that the kernel is capable of performing
> > complex actions when execve runs into a file that is not an a.out
> > or elf binary I have yet to see a "binfmt script" option in the
> > kernel config files ever.
>
> It's not a config option because it is always enabled. See
> fs/binfmt_script.c for details. It wasn't integrated into the
> binfmt system until I made it so some ten years ago, though.

I haven't gotten into that section of the code yet. I've been slowly 
working my way through the code from the drivers that seem to cause 
strange behavior on my system and then up the tree from there.

> > On the other hand, there is the "binfmt_misc" option, which does
> > the work that you seem to be looking for and can, AFAIK, be set
> > to handle both ASCII and UTF-8 scripts. Why add the complexity to
> > the kernel when it's not needed?
>
> One shouldn't add complexity if it's not needed. However, this patch
> does not add complexity. It is fairly trivial.

You are correct. It is fairly trivial. However my point still is valid 
that the Kernel has the whole binfmt_misc system -- I will admit that 
I have recently been shown numbers that show a noticeable difference 
in the speed of a binary executed using the binfmt_misc system and 
the binfmt_script system, but the fact remains that offering handling 
for UTF8 and ASCII scripts directly in the kernel will likely lead to 
at least one more patch in which the full Unicode standard is
implemented.

That, and my point remains that the kernel should know absolutely 
nothing about how to execute a text file - the kernel should return 
an error to the extent of "I don't know what to do with this file" to 
the shell that tries to execute it, and the shell can then check for 
the sh_bang. I do admit that this change would break a lot of 
existing code, so I'll leave the argument to the experts.


> Regards,
> Martin

DRH

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
       [not found]                         ` <4NTvO-yJ-13@gated-at.bofh.it>
@ 2005-09-18  0:53                           ` Bodo Eggert
  2005-09-18 16:53                             ` Bernd Petrovitsch
       [not found]                           ` <4O1MJ-3Hf-5@gated-at.bofh.it>
  1 sibling, 1 reply; 80+ messages in thread
From: Bodo Eggert @ 2005-09-18  0:53 UTC (permalink / raw)
  To: Bernd Petrovitsch, Martin v. Löwis, H. Peter Anvin,
	linux-kernel

Bernd Petrovitsch <bernd@firmix.at> wrote:
> On Sat, 2005-09-17 at 08:20 +0200, "Martin v. Löwis" wrote:
>> Bernd Petrovitsch wrote:
>> > On Fri, 2005-09-16 at 22:41 +0200, "Martin v. Löwis" wrote:

>> > [ Language-specific examples ]
>> > 
>> > And that's the only working way - the programming languages can actually
>> > do it because it defines the syntax and semantics of the contents
>> > anyways.
>> 
>> It works from the programming language point of view, but it is a mess
>> from the text editor point of view.
> 
> Most of the text editors have ways to markup the source files. Not even
> the various editors are able to agree on one method for all, so why
> could the (Linux) world agree on one for all text files?

You don't need a marker for all text files, but it's legal to have a marker
for utf-8 text files (see the Unicode standard 4.0.0 section 15.9), and
it's handy to use it until you made everybody in the world convert
everything to utf-8 (but not utf-{16,32}{le,be}).

>> > With this marker you are interfering with (at least) *all* text files.
>> 
>> Hmm. What does that have to do with the patch I'm proposing? This
>> patch does *not* interfere with all text files. It is only relevant
>> for executable files starting with the #! magic.
> 
> It *does* interfere since scripts are also text files in every aspect.
> So every feature you want for "scripts" you also get for text files (and
> vice versa BTW).

If utf-8 encoded text files are text files, and text files are scripts,
and all of them shall have the same features, utf-8 encoded text files
with BOM MUST be recognized as legal scripts, too. Therefore this patch
fixes a kernel bug.

BTW: Implementing the other utf signatures from Table 15.3 is left to the
reader as an exercise.-)

> If you think "script" and "text file" are different, define both of
> them, please, otherwise a discussion is pointless.

If all text files are script files, execute this mail.

>> > And there are always tools out there which simply do not understand the
>> > generic marker and can not ignore it since these bytes are part of the
>> > file.
>> 
>> This conclusion is false. Many tools that don't understand the file
>> structure still can do their job on the files. So the fact that a tool
>> does not understand the structure does not necessarily imply that
>> the tool breaks when the structure changes.
> 
> It *may* break just because of some to-be-ignored inline marking due to
> some questionable feature.

How exactly does it break, and what is it? And why must *it* be prevented
from breaking by ignoring script signatures in valid text files?

> And *when* (not if) it breaks, it is probably cumbersome to find since
> you have pretty unprintable characters.

If your tools can't print utf-8 encoded characters, they are broken for
ISO-8859-*, too. Besides that, it's not a kernel problem.

> Let alone the confusion why the size of a file with `ls -l` is different
> from the size in the editor or a marker-aware `wc -c`.
> So IMHO either you have a clear and visible marker or you have none at all.

Like e.g. the "From "-line starting each message in an mbox file? Virtually
no email client will display it. The size of email messages does differ
from its unencoded content size, too! Of course nobody can handle this,
and all users constantly try to kill themselves because of that - NOT.

-- 
Ich danke GMX dafür, die Verwendung meiner Adressen mittels per SPF
verbreiteten Lügen zu sabotieren.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
  2005-09-17 22:31                 ` D. Hazelton
@ 2005-09-18  3:45                   ` Kyle Moffett
  2005-09-19  0:14                     ` D. Hazelton
  2005-09-18  6:58                   ` "Martin v. Löwis"
  1 sibling, 1 reply; 80+ messages in thread
From: Kyle Moffett @ 2005-09-18  3:45 UTC (permalink / raw)
  To: D. Hazelton; +Cc:  Martin v. Löwis , 7eggert, H. Peter Anvin, linux-kernel

On Sep 17, 2005, at 18:31:33, D. Hazelton wrote:
> That, and my point remains that the kernel should know absolutely  
> nothing about how to execute a text file - the kernel should return  
> an error to the extent of "I don't know what to do with this file"  
> to the shell that tries to execute it, and the shell can then check  
> for the sh_bang. I do admit that this change would break a lot of  
> existing code, so I'll leave the argument to the experts.

No, that would not work at all.  We have a very nice system to allow  
set-uid scripts (Specifically, I like my nice secure taint-mode set- 
uid perl scripts).  If you did this, they would break completely, not  
to mention _add_ all sorts of unsolvable race conditions to the few  
ways of working around such a lack of SUID scripts.  Also, it means  
that I can't just "mv /sbin/init /sbin/init.real ; vim /sbin/init" to  
do a simple wrapper around the init program, I would need to write a  
compiled C program to do all sorts of fragile hackish things like  
calling a script /sbin/init.sh.

Cheers,
Kyle Moffett

--
There are two ways of constructing a software design. One way is to  
make it so simple that there are obviously no deficiencies. And the  
other way is to make it so complicated that there are no obvious  
deficiencies.  The first method is far more difficult.
   -- C.A.R. Hoare



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
  2005-09-17 22:31                 ` D. Hazelton
  2005-09-18  3:45                   ` Kyle Moffett
@ 2005-09-18  6:58                   ` "Martin v. Löwis"
  2005-09-19  0:31                     ` D. Hazelton
  1 sibling, 1 reply; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-18  6:58 UTC (permalink / raw)
  To: D. Hazelton; +Cc: 7eggert, H. Peter Anvin, linux-kernel

D. Hazelton wrote:
> This is news to me. The last time I handed execve() a script as a 
> parameter I had errors returned from execve() -- I must admit that
> this was not on my current system and I had assumed that the behavior 
> would be consistent.

The kernel checks for #!<path>, and that <path> is an existing
executable. If not, execve fails.

> You are correct. It is fairly trivial. However my point still is valid 
> that the Kernel has the whole binfmt_misc system -- I will admit that 
> I have recently been shown numbers that show a noticeable difference 
> in the speed of a binary executed using the binfmt_misc system and 
> the binfmt_script system, but the fact remains that offering handling 
> for UTF8 and ASCII scripts directly in the kernel will likely lead to 
> at least one more patch in which the full Unicode standard is
> implemented.

The problem with the binfmt_misc approach is that you need *another*
execve call: with binfmt_misc, you register <utf8sig>#!, and a
generic binary. Then, this generic binary will interpret the #!
signature *again*, and invoke the proper interpreter. This will
interpret the first line *yet again* (finding that it is a comment),
and continue processing the file.

However, this is not the real problem. The real problem is that
the specific binfmt_misc "backend" would not be universally
available, and then the same script would start on some systems,
and break on others. This may be acceptable for large or specific
applications (e.g. you have to set up the ibcs2 module to run
SCO applications); it is not for scripts.

Now, the "universally available" part would not apply right now,
as only the most recent kernels would provide the feature. However,
within a few years, the feature would be part of "Linux" - then
people can start using it extensively.

> That, and my point remains that the kernel should know absolutely 
> nothing about how to execute a text file - the kernel should return 
> an error to the extent of "I don't know what to do with this file" to 
> the shell that tries to execute it, and the shell can then check for 
> the sh_bang. I do admit that this change would break a lot of 
> existing code, so I'll leave the argument to the experts.

The point is that it is not necessarily the shell which starts
programs - the shell is but one creator of new processes. It is
very common today that, say, httpd starts new programs - this
mechanism is called CGI. Your approach was in use until 1985 or
so, when Unix implementations started to support #! natively.
This was done both for convenience and for performance: if
programs would always use system(3) to start new processes,
there would always be a shell that execs the eventual
interpreter.

I'm not sure, but I believe that most current shells have "forgotten"
how to do the #! magic, since, by now, "traditionally" this is
a kernel responsibility.

Regards,
Martin

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
  2005-09-17 22:28                         ` Bernd Petrovitsch
@ 2005-09-18  7:23                           ` "Martin v. Löwis"
  2005-09-18 14:50                             ` Bernd Petrovitsch
  0 siblings, 1 reply; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-18  7:23 UTC (permalink / raw)
  To: Bernd Petrovitsch
  Cc: "Martin v. Löwis", H. Peter Anvin, linux-kernel

Bernd Petrovitsch wrote:
> Most of the text editors have ways to markup the source files. Not even
> the various editors are able to agree on one method for all, so why
> could the (Linux) world agree on one for all text files?

You are ignoring the role of standardization. People invent their own
mechanism if a standard is missing (or virtually unimplementable). For
declaring encodings, there is no standard (except for ISO-2022, which
is really hard to implement correctly). Therefore, editor authors
create their own standards.

At least Python abstained from creating yet another standard, and instead
supports both the declarations from Emacs and vim. To some degree, it
also supports notepad (namely through the UTF-8 signature).

However, people are much more likely to agree on a technology when it
is defined by a recognized standards body. This is the case for the
UTF-8 signature, which is defined by the Unicode consortium, for
precisely this purpose. Therefore, editors *will* agree on that
mechanism, while keeping their own mechanism for the more general
problem.

>>Even for the programming language, it is a pain to implement: what
>>if you have non-ASCII characters before the pragma that declares the
>>encoding? and so on.
> 
> 
> That's the problem of the language definers who absolutely want such
> (IMHO absolutely superfluous) features.

It's not the language designers who absolutely want this feature. It's
the language users. Of course, you'd have to be a language designer to
know that fact - language users go to the language designers asking for
the feature, not to the kernel developers.

>>Hmm. What does that have to do with the patch I'm proposing? This
>>patch does *not* interfere with all text files. It is only relevant
>>for executable files starting with the #! magic.
> 
> 
> It *does* interfere since scripts are also text files in every aspect.
> So every feature you want for "scripts" you also get for text files (and
> vice versa BTW).

The specific feature I get is that when I pass a file starting
with <utf8sig>#! to execve, Linux will execute the file following
the #!. In what way do I get this feature for text in general?
And if I do, why is that a problem?

> If you think "script" and "text file" are different, define both of
> them, please, otherwise a discussion is pointless.

A script file (in the context of this discussion) is a text file
that is executable (i.e. has the appropriate subset of
S_IXUSR|S_IXGRP|S_IXOTH set), starts with #!, and has the path
name of an executable file after the #!.

More generally, a script file is a text file written in a scripting
language. A scripting language is a programming language which
supports "direct" execution of source code. So in the more
general definition, a script file does not need to start with
#!; for the context of this discussion, we should restrict
attention to files actually affected by the patch.
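
As a rough illustration of the narrower definition, the magic check at
issue can be sketched in user-space C. This is only a sketch under
assumptions: the function name script_magic_offset is hypothetical, it
ignores the permission bits and the existence of the interpreter, and it
is not the kernel's code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not kernel code: returns the offset of the first
 * byte after "#!" if buf starts with "#!" or with the UTF-8 signature
 * (EF BB BF) followed by "#!"; returns -1 otherwise. */
int script_magic_offset(const unsigned char *buf, size_t len)
{
    static const unsigned char utf8sig[] = { 0xEF, 0xBB, 0xBF };
    size_t off = 0;

    /* Skip the optional UTF-8 signature. */
    if (len >= sizeof(utf8sig) && memcmp(buf, utf8sig, sizeof(utf8sig)) == 0)
        off = sizeof(utf8sig);

    /* The "#!" magic must follow immediately. */
    if (len - off >= 2 && buf[off] == '#' && buf[off + 1] == '!')
        return (int)(off + 2);
    return -1;
}
```

A file that carries the signature but no #! after it is rejected, so
ordinary UTF-8 text files are not affected by such a check.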

>>This conclusion is false. Many tools that don't understand the file
>>structure still can do their job on the files. So the fact that a tool
>>does not understand the structure does not necessarily imply that
>>the tool breaks when the structure changes.
> 
> 
> It *may* break just because of some to-be-ignored inline marking due to
> some questionable feature.

Be more specific. For what specific kind of file will cat(1) break?
Unless cat(1) has a 2GB limitation, I very much doubt it will break
(i.e. fail to do its job, "concatenate files and print on the standard
output") for any kind of input - whether this is text files, binary
files, images, sound files, HTML files. cat always does what it is
designed to do.

> Let alone the confusion why the size of a file with `ls -l` is different
> from the size in the editor or a marker-aware `wc -c`.

This is true for any UTF-8 file, or any multibyte encoding. For any
multibyte encoding, the number of bytes in the file is different from
the number of characters. That doesn't (and shouldn't) stop people from
using multi-byte encodings.

What the editor displays as the number of "things" is up to its own.
The output of wc -c will always be the same as the one of ls -l,
as wc -c does *not* give you characters:

       -c, --bytes
              print the byte counts

You might have been thinking of 'wc -m'.
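
To make the byte/character distinction concrete, here is a minimal C
sketch that counts UTF-8 code points by skipping continuation bytes. The
function name utf8_codepoints is hypothetical, and the sketch assumes
valid UTF-8 input; the real 'wc -m' depends on the current locale.

```c
#include <stddef.h>

/* Counts UTF-8 code points by skipping continuation bytes (10xxxxxx).
 * Assumes the input is valid UTF-8; no validation is done. */
size_t utf8_codepoints(const unsigned char *s, size_t nbytes)
{
    size_t count = 0;

    for (size_t i = 0; i < nbytes; i++)
        if ((s[i] & 0xC0) != 0x80)   /* not a continuation byte */
            count++;
    return count;
}
```

For the six bytes of "Löwis" encoded as UTF-8 this yields five code
points; the three-byte UTF-8 signature counts as a single code point.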

>>For a Python script, I don't need to guess: It will just work.
> 
> 
> Then write a short python script (with a "#!/usr/bin/python" line at the
> start [without parameters]) natively on a Win*-system, copy it binary
> over to an arbitrary Linux system and see what's happening.

It depends on the editor I use, of course: the kernel will consider any
CR after the final "n" of "python" as part of the interpreter name. Not
sure what this has to do with the specific patch, though.
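
The CR behaviour can be illustrated with a small user-space C sketch. It
only approximates how the interpreter name is cut out of the first line
(end of line at '\n', name ends at a blank); the helper name interp_name
is hypothetical and this is not the kernel's code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the interpreter path from a "#!" line into
 * out (outsz must be at least 1). The name ends only at '\n', space, tab,
 * or the end of the string. A '\r' is not treated specially, so it stays
 * in the name. */
size_t interp_name(const char *line, char *out, size_t outsz)
{
    size_t n = 0;
    const char *p = line + 2;        /* caller guarantees a leading "#!" */

    while (*p == ' ' || *p == '\t')  /* skip blanks after "#!" */
        p++;
    while (*p && *p != '\n' && *p != ' ' && *p != '\t' && n + 1 < outsz)
        out[n++] = *p++;
    out[n] = '\0';
    return n;
}
```

Given a first line ending in CR LF, the extracted name keeps the trailing
CR, so the lookup is for a file called "/usr/bin/python\r" rather than
"/usr/bin/python".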

Regards,
Martin


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
  2005-09-18  7:23                           ` "Martin v. Löwis"
@ 2005-09-18 14:50                             ` Bernd Petrovitsch
  0 siblings, 0 replies; 80+ messages in thread
From: Bernd Petrovitsch @ 2005-09-18 14:50 UTC (permalink / raw)
  To: "Martin v. Löwis"; +Cc: H. Peter Anvin, linux-kernel

On Sun, 2005-09-18 at 09:23 +0200, "Martin v. Löwis" wrote:
[...]
> >>Hmm. What does that have to do with the patch I'm proposing? This
> >>patch does *not* interfere with all text files. It is only relevant
> >>for executable files starting with the #! magic.
> > 
> > It *does* interfere since scripts are also text files in every aspect.
> > So every feature you want for "scripts" you also get for text files (and
> > vice versa BTW).
> 
> The specific feature I get is that when I pass a file starting
> with <utf8sig>#! to execve, Linux will execute the file following
> the #!. In what way do I get this feature for text in general?
> And if I do, why is that a problem?

After applying this patch it seems that "Linux" is supporting this
marker officially in general - especially if the kernel supports it. I
suppose the next kernel patch is to support Win-like CR-LF sequences
(which is not the case AFAIK).
BTW even if some standards body thinks that this is the way to go, it
raises more problems and questions than it resolves.

> > If you think "script" and "text file" are different, define both of
> > them, please, otherwise a discussion is pointless.
> 
> A script file (in the context of this discussion) is a text file
> that is executable (i.e. has the appropriate subset of
> S_IXUSR|S_IXGRP|S_IXOTH set), starts with #!, and has the path
> name of an executable file after the #!.
> 
> More generally, a script file is a text file written in a scripting
> language. A scripting language is a programming language which
> supports "direct" execution of source code. So in the more
> general definition, a script file does not need to start with
> #!; for the context of this discussion, we should restrict
> attention to files actually affected by the patch.

And though scripts are usually edited/changed/"parsed"/... with a text
editor, it is not always the case. Therefore the automatic extension to
*all text files* (especially as the marker basically applies to all text
files, not only scripts).
You want to focus just on your patch and ignore the directly implied
potential problems arising ...

[...]
> > It *may* break just because of some to-be-ignored inline marking due to
> > some questionable feature.
> 
> Be more specific. For what specific kind of file will cat(1) break?

`cat` as such will not break.

> Unless cat(1) has a 2GB limitation, I very much doubt it will break
> (i.e. fail to do its job, "concatenate files and print on the standard
> output") for any kind of input - whether this is text files, binary
> files, images, sound files, HTML files. cat always does what it is
> designed to do.

Apparently I have to repeat: If you do `cat a.txt b.txt >c.txt` where
a.txt and b.txt have this marker, then c.txt has the marker of b.txt
somewhere in the middle. Does this make sense in any way?
How do I get rid of the marker in the middle transparently?

> > Let alone the confusion why the size of a file with `ls -l` is different
> > from the size in the editor or a marker-aware `wc -c`.
> 
> This is true for any UTF-8 file, or any multibyte encoding. For any
> multibyte encoding, the number of bytes in the file is different from
> the number of characters. That doesn't (and shouldn't) stop people from
> using multi-byte encodings.

It is different even if a pure ASCII file is marked as UTF-8.
And sure, the problem exists in general with multi-byte encodings.

> What the editor displays as the number of "things" is up to its own.
> The output of wc -c will always be the same as the one of ls -l,
> as wc -c does *not* give you characters:
> 
>        -c, --bytes
>               print the byte counts
> 
> You might have been thinking of 'wc -m'.

It depends on the definition of "character". There are other standards
which define "character" as "byte".

[...]
> > Then write a short python script (with a "#!/usr/bin/python" line at the
> > start [without parameters]) natively on a Win*-system, copy it binary
> > over to an arbitrary Linux system and see what's happening.
> 
> It depends on the editor I use, of course: the kernel will consider any

No, more on the OS the editor runs on.

> CR after the n as part of the interpreter name. Not sure what this has

ACK.

> to do with the specific patch, though.

It is not supported by the kernel. So either you remove it or you make
some compatibility hack (like an appropriate sym-link, etc.). Since the
kernel can start java classes directly, you can probably make a similar
thing for the UTF-8 stuff.

	Bernd
-- 
Firmix Software GmbH                   http://www.firmix.at/
mobil: +43 664 4416156                 fax: +43 1 7890849-55
          Embedded Linux Development and Services




^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
  2005-09-18  0:53                           ` Bodo Eggert
@ 2005-09-18 16:53                             ` Bernd Petrovitsch
  0 siblings, 0 replies; 80+ messages in thread
From: Bernd Petrovitsch @ 2005-09-18 16:53 UTC (permalink / raw)
  To: 7eggert; +Cc: Martin v. Löwis, H. Peter Anvin, linux-kernel

On Sun, 2005-09-18 at 02:53 +0200, Bodo Eggert wrote:
> Bernd Petrovitsch <bernd@firmix.at> wrote:
[...]
> > Most of the text editors have ways to markup the source files. Not even
> > the various editors are able to agree on one method for all, so why
> > could the (Linux) world agree on one for all text files?
> 
> You don't need a marker for all text files, but it's legal to have a marker
> for utf-8 text files (see the Unicode standard 4.0.0 section 15.9), and
> it's handy to use it until you made everybody in the world convert
> everything to utf-8 (but not utf-{16,32}{le,be}).

Have fun patching almost every text processing tool and concept out
there.
Apart from the fact that the way this marker works is wrong, it seems to
me that the UTF-8 body had no other choice than such an insane "rule" (or
recommendation).

> >> > With this marker you are interfering with (at least) *all* text files.
> >> 
> >> Hmm. What does that have to do with the patch I'm proposing? This
> >> patch does *not* interfere with all text files. It is only relevant
> >> for executable files starting with the #! magic.
> > 
> > It *does* interfere since scripts are also text files in every aspect.
> > So every feature you want for "scripts" you also get for text files (and
> > vice versa BTW).
> 
> If utf-8 encoded text files are text files, and text files are scripts,

No one said all text files are scripts, instead it is the other way
'round.

[ snipped because of ex falso quodlibet ]

> > If you think "script" and "text file" are different, define both of
> > them, please, otherwise a discussion is pointless.
> 
> If all text files are script files, execute this mail.

See above. Obviously you misunderstand something.

> >> > And there are always tools out there which simply do not understand the
> >> > generic marker and can not ignore it since these bytes are part of the
> >> > file.
> >> 
> >> This conclusion is false. Many tools that don't understand the file
> >> structure still can do their job on the files. So the fact that a tool
> >> does not understand the structure does not necessarily imply that
> >> the tool breaks when the structure changes.
> > 
> > It *may* break just because of some to-be-ignored inline marking due to
> > some questionable feature.
> 
> How exactly does it break, and what is it? And why must *it* be prevented
> from breaking by ignoring script signatures in valid text files?

The question was: What happens if this marker is encountered within a file?
To be ignored (by UTF-8 aware tools)? Some other interpretation?
Illegal/Forbidden?

> > And *when* (not if) it breaks, it is probably cumbersome to find since
> > you have pretty unprintable characters.
> 
> If your tools can't print utf-8 encoded characters, they are broken for
> ISO-8859-*, too. Besides that, it's not a kernel problem.

Which is again not true since lots of tools out there printed ISO-8859-*
correctly before UTF-8 was deployed.

[...]

	Bernd
-- 
Firmix Software GmbH                   http://www.firmix.at/
mobil: +43 664 4416156                 fax: +43 1 7890849-55
          Embedded Linux Development and Services




^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
       [not found]                             ` <4O8Oh-5jp-7@gated-at.bofh.it>
@ 2005-09-18 19:23                               ` Bodo Eggert
  2005-09-18 21:03                                 ` Bernd Petrovitsch
                                                   ` (2 more replies)
  2005-09-19  4:54                               ` "Martin v. Löwis"
  1 sibling, 3 replies; 80+ messages in thread
From: Bodo Eggert @ 2005-09-18 19:23 UTC (permalink / raw)
  To: Bernd Petrovitsch, Martin v. Löwis, H. Peter Anvin,
	linux-kernel

Bernd Petrovitsch <bernd@firmix.at> wrote:
> On Sun, 2005-09-18 at 09:23 +0200, "Martin v. Löwis" wrote:
> [...]
>> >>Hmm. What does that have to do with the patch I'm proposing? This
>> >>patch does *not* interfere with all text files. It is only relevant
>> >>for executable files starting with the #! magic.
>> > 
>> > It *does* interfere since scripts are also text files in every aspect.
>> > So every feature you want for "scripts" you also get for text files (and
>> > vice versa BTW).
>> 
>> The specific feature I get is that when I pass a file starting
>> with <utf8sig>#! to execve, Linux will execute the file following
>> the #!. In what way do I get this feature for text in general?
>> And if I do, why is that a problem?
> 
> After applying this patch it seems that "Linux" is supporting this
> marker officially in general - especially if the kernel supports it.

It will be the first POSIX kernel to correctly support utf-8 scripts.
It's 2005, and according to other(?) posters, this should be standard.

> I
> suppose the next kernel patch is to support Win-like CR-LF sequences
> (which is not the case AFAIK).

Maybe it should, maybe it shouldn't. If I used MAC or DOS, I'd be sure it
should.-)

> BTW even some standards body thinks that this is the way to go,

Not surprisingly the Unicode Consortium is one of them.

> it
> raises more problems and questions than resolves anything.

The problem of how to handle the BOM is solved by reading the standard.

> And though scripts are usually edited/changed/"parsed"/... with an text
> editor, it is not always the case. Therefore the automatic extension to
> *all text files* (especially as the marker basically applies to all text
> files, not only scripts).
> You want to focus just on your patch and ignore the directly implied
> potential problems arising ...

There is no problem arising from the patch, it solves one.
To solve the rest, use recode.

[...]
> Apparently I have to repeat: If you do `cat a.txt b.txt >c.txt` where
> a.txt and b.txt have this marker, then c.txt has the marker of b.txt
> somewhere in the middle. Does this make sense in any way?
> How do I get rid of the marker in the middle transparently?

The unicode standard defines how to handle them.

>> > Let alone the confusion why the size of a file with `ls -l` is different
>> > from the size in the editor or a marker-aware `wc -c`.
>> 
>> This is true for any UTF-8 file, or any multibyte encoding. For any
>> multibyte encoding, the number of bytes in the file is different from
>> the number of characters. That doesn't (and shouldn't) stop people from
>> using multi-byte encodings.
> 
> It is different even if a pure ASCII file is marked as UTF-8.

No pure ASCII file will be marked, since a marked file will be no
ASCII file.

> And sure, the problem exists in general with multi-byte encodings.

ACK, but that's not a kernel problem nor a specific unicode problem.
Fix it by making China, Greece and Japan convert to ASCII and by making
all mathematicians stop using strange characters. All other users will
follow.

>> What the editor displays as the number of "things" is up to its own.
>> The output of wc -c will always be the same as the one of ls -l,
>> as wc -c does *not* give you characters:
>> 
>>        -c, --bytes
>>               print the byte counts
>> 
>> You might have been thinking of 'wc -m'.
> 
> It depends on the definition of "character". There are other standards
> which define "character" as "byte".

There are architectures defining a byte to be 32 bit.
They are irrelevant, too.

[...]
>> Not sure what this has
>> to do with the specific patch, though.
> 
> It is not supported by the kernel. So either you remove it or you make
> some compatibility hack (like an appropriate sym-link

-EDOESNOTWORK

#!/usr/bin/perl -T -s -w

>, etc.). Since the
> kernel can start java classes directly, you can probably make a similar
> thing for the UTF-8 stuff.

If MSDOS text files are text files and text files are legal scripts, the
kernel should recognize [\x0D\x0A] as valid line breaks.

(The real reason would be unicode allowing NEL to be encoded as 0x0D
 or 0x0A.)

This compile-tested patch adds 32 bytes to binfmt_script:

--- ./fs/binfmt_script.c.old    2005-09-18 20:28:32.000000000 +0200
+++ ./fs/binfmt_script.c        2005-09-18 20:29:44.000000000 +0200
@@ -18,7 +18,7 @@

 static int load_script(struct linux_binprm *bprm,struct pt_regs *regs)
 {
-       char *cp, *i_name, *i_arg;
+       char *cp, *cp2, *i_name, *i_arg;
        struct file *file;
        char interp[BINPRM_BUF_SIZE];
        int retval;
@@ -47,6 +47,9 @@ static int load_script(struct linux_binp
        bprm->buf[BINPRM_BUF_SIZE - 1] = '\0';
        if ((cp = strchr(bprm->buf, '\n')) == NULL)
                cp = bprm->buf+BINPRM_BUF_SIZE-1;
+       if ((cp2 = strchr(bprm->buf, '\x0D')) != NULL
+       &&  cp2 < cp)
+               cp = cp2;
        *cp = '\0';
        while (cp > bprm->buf) {
                cp--;
-- 
I thank GMX for sabotaging the use of my addresses by means of lies
spread via SPF.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
  2005-09-18 19:23                               ` Bodo Eggert
@ 2005-09-18 21:03                                 ` Bernd Petrovitsch
  2005-09-19 19:37                                   ` Bodo Eggert
  2005-09-18 22:29                                 ` Valdis.Kletnieks
  2005-09-19  6:03                                 ` H. Peter Anvin
  2 siblings, 1 reply; 80+ messages in thread
From: Bernd Petrovitsch @ 2005-09-18 21:03 UTC (permalink / raw)
  To: 7eggert; +Cc: Martin v. Löwis, H. Peter Anvin, linux-kernel

On Sun, 2005-09-18 at 21:23 +0200, Bodo Eggert wrote:
[...]
> >> Not sure what this has
> >> to do with the specific patch, though.
> > 
> > It is not supported by the kernel. So either you remove it or you make
> > some compatibility hack (like an appropriate sym-link
> 
> -EDOESNOTWORK
> 
> #!/usr/bin/perl -T -s -w

It depends on /usr/bin/perl how it handles a white-space character
directly after "-w".

> >, etc.). Since the
> > kernel can start java classes directly, you can probably make a similar
> > thing for the UTF-8 stuff.
> 
> If MSDOS text files are text files and text files are legal scripts, the
> kernel should recognize [\x0D\x0A] as valid line breaks.

The Unix world does recognize the line breaks. It's up to the tool how
to handle the white-space character before it. Especially for C and
similar languages with continuation lines this leads to interesting (or
now more boring) problems.

	Bernd
-- 
Firmix Software GmbH                   http://www.firmix.at/
mobil: +43 664 4416156                 fax: +43 1 7890849-55
          Embedded Linux Development and Services




^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
  2005-09-18 19:23                               ` Bodo Eggert
  2005-09-18 21:03                                 ` Bernd Petrovitsch
@ 2005-09-18 22:29                                 ` Valdis.Kletnieks
  2005-09-19  6:03                                 ` H. Peter Anvin
  2 siblings, 0 replies; 80+ messages in thread
From: Valdis.Kletnieks @ 2005-09-18 22:29 UTC (permalink / raw)
  To: 7eggert
  Cc: Bernd Petrovitsch, Martin v. Löwis, H. Peter Anvin,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1242 bytes --]

On Sun, 18 Sep 2005 21:23:42 +0200, Bodo Eggert said:
> Bernd Petrovitsch <bernd@firmix.at> wrote:
> > Apparently I have to repeat: If you do `cat a.txt b.txt >c.txt` where
> > a.txt and b.txt have this marker, then c.txt has the marker of b.txt
> > somewhere in the middle. Does this make sense in any way?
> > How do I get rid of the marker in the middle transparently?
> 
> The unicode standard defines how to handle them.

For the benefit of those of us who are interested in the problem, but aren't
in the mood to wade through a long standard looking for the answer to a
specific question, can you elaborate?

It isn't as obvious as all that, because of all the nasty corner cases...

> > It is different even if a pure ASCII file is marked as UTF-8.
> 
> No pure ASCII file will be marked, since a marked file will be no
> ASCII file.

Given a file "a.txt" that's pure ASCII, and a file "b.txt" that has the BOM
marker on it, what happens when you do "cat a.txt b.txt > c.txt"?

'cat' doesn't know, and has no way of knowing, that c.txt needs a BOM at the
*front* of the file until it's already written past the point in c.txt where
the BOM has to go.

What does the Unicode standard say to do in this case?

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
  2005-09-18  3:45                   ` Kyle Moffett
@ 2005-09-19  0:14                     ` D. Hazelton
  0 siblings, 0 replies; 80+ messages in thread
From: D. Hazelton @ 2005-09-19  0:14 UTC (permalink / raw)
  To: Kyle Moffett; +Cc: Martin v. Löwis, 7eggert, H. Peter Anvin, linux-kernel

On Sunday 18 September 2005 03:45, Kyle Moffett wrote:
> On Sep 17, 2005, at 18:31:33, D. Hazelton wrote:
> > That, and my point remains that the kernel should know absolutely
> > nothing about how to execute a text file - the kernel should
> > return an error to the extent of "I don't know what to do with
> > this file" to the shell that tries to execute it, and the shell
> > can then check for the sh_bang. I do admit that this change would
> > break a lot of existing code, so I'll leave the argument to the
> > experts.
>
> No, that would not work at all.  We have a very nice system to
> allow set-uid scripts (Specifically, I like my nice secure
> taint-mode set-uid perl scripts).  If you did this, they would
> break completely, not to mention _add_ all sorts of unsolvable race
> conditions to the few ways of working around such a lack of SUID
> scripts.  Also, it means that I can't just "mv /sbin/init
> /sbin/init.real ; vim /sbin/init" to do a simple wrapper around the
> init program, I would need to write a compiled C program to do all
> sorts of fragile hackish things like calling a script
> /sbin/init.sh.

This makes a lot more sense than I expected to hear. This argument 
alone is enough for me to understand the reasoning behind the kernel 
knowing how to interpret a shell script. Problem is, the program 
would not be fragile or hackish - it'd be almost as simple as a 
"hello world" program.

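#include <unistd.h>

int main(void) {
  /* execve() takes NULL-terminated argv and envp arrays */
  char *const argv[] = { "/bin/sh", "/sbin/init.sh", NULL };
  char *const envp[] = { NULL };

  /* if this fails the system is busted anyway */
  execve("/bin/sh", argv, envp);
  return 1; /* only reached if execve() failed */
}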

-- This program would do the trick nicely, and since init is run as 
root, there is no need to worry about the program having to grab 
privs. 

However, the real problem is that this would break the initrd systems 
used by most distributions for installation, and it would probably 
break most of the "early userspace" systems just coming into use. As 
I said originally - my comment about having the shell itself
interpret the sh_bang would break a lot of stuff and I've been shown 
that I have to spend more time in the kernel code (as I haven't 
finished going through the various drivers to see how those have been 
made to work) before I can make a good suggestion in a discussion 
like this.

DRH

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
  2005-09-18  6:58                   ` "Martin v. Löwis"
@ 2005-09-19  0:31                     ` D. Hazelton
  0 siblings, 0 replies; 80+ messages in thread
From: D. Hazelton @ 2005-09-19  0:31 UTC (permalink / raw)
  To: Martin v. Löwis; +Cc: 7eggert, H. Peter Anvin, linux-kernel

On Sunday 18 September 2005 06:58, "Martin v. Löwis" wrote:
> D. Hazelton wrote:
> > This is news to me. The last time I handed execve() a script as a
> > parameter I had errors returned from execve() -- I must admit that
> > this was not on my current system and I had assumed that the
> > behavior would be consistent.
>
> The kernel checks for #!<path>, and that <path> is an existing
> executable. If not, execve fails.
>
> > You are correct. It is fairly trivial. However my point still is
> > valid that the Kernel has the whole binfmt_misc system -- I will
> > admit that I have recently been shown numbers that show a
> > noticeable difference in the speed of a binary executed using the
> > binfmt_misc system and the binfmt_script system, but the fact
> > remains that offering handling for UTF8 and ASCII scripts
> > directly in the kernel will likely lead to at least one more
> > patch in which the full Unicode standard is implemented.
>
> The problem with the binfmt_misc approach is that you need
> *another* execve call: with binfmt_misc, you register <utf8sig>#!,
> and a generic binary. Then, this generic binary will interpret the
> #! signature *again*, and invoke the proper interpreter. This will
> interpret the first line *yet again* (finding that it is a comment),
> and continue processing the file.

True. I had forgotten that for truly generic rules about handling the 
#!  there would be double the overhead for the sh_bang.

> However, this is not the real problem. The real problem is that
> the specific binfmt_misc "backend" would not be universally
> available, and then the same script would start on some systems,
> and break on others. This may be acceptable for large or specific
> applications (e.g. you have to set up the ibcs2 module to run
> SCO applications); it is not for scripts.

Again, this is all too true. Doubly so with the problem of an initrd
that has 'init' as a script.

> Now, the "universally available" part would not apply right now,
> as only the most recent kernels would provide the feature. However,
> within a few years, the feature would be part of "Linux" - then
> people can start using it extensively.

This sounds to me like you're saying in a few years my suggestion of 
using binfmt_misc would be tenable. Unfortunately, unless forced into 
it, no distro would ever use it. As I now see it, binfmt_script is 
pretty much a hard-coded hack that gives the system a bit more speed 
for running scripts. And since I've thought about the consequences of 
ripping it out after the posts yesterday - there is no clean way to 
remove it and have a large number of systems still function.

> > That, and my point remains that the kernel should know absolutely
> > nothing about how to execute a text file - the kernel should
> > return an error to the extent of "I don't know what to do with
> > this file" to the shell that tries to execute it, and the shell
> > can then check for the sh_bang. I do admit that this change would
> > break a lot of existing code, so I'll leave the argument to the
> > experts.
>
> The point is that it is not necessarily the shell which starts
> programs - the shell is but one creator of new processes. It is
> very common today that, say, httpd starts new programs - this
> mechanism is called CGI. Your approach was in use until 1985 or
> so, when Unix implementations started to support #! natively.
> This was done both for convenience and for performance: if
> programs would always use system(3) to start new processes,
> there would always be a shell that execs the eventual
> interpreter.

True. In some cases, though, system(3) is really unusable - like you 
mentioned, httpd often starts new processes. Since daemons don't, 
technically, run on top of a shell, having one use system(3) to start 
a new process would add a lot of unnecessary overhead.

> I'm not sure, but I believe that most current shells have
> "forgotten" how to do the #! magic, since, by now, "traditionally"
> this is a kernel responsibility.

Not true. Bash, at least, still handles the sh_bang. (Provable by 
using it to call a perl script that doesn't have the exec bit set. 
This worked for me just a week ago :)

DRH

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
       [not found]     ` <4Olip-6HH-13@gated-at.bofh.it>
@ 2005-09-19  4:41       ` "Martin v. Löwis"
  0 siblings, 0 replies; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-19  4:41 UTC (permalink / raw)
  To: D. Hazelton, linux-kernel

D. Hazelton wrote:
>>I would need to write a compiled C program to do all
>>sorts of fragile hackish things like calling a script
>>/sbin/init.sh.
> 
> 
> Problem is, the program 
> would not be fragile or hackish - it'd be almost as simple as a 
> "hello world" program.
> 
> #include <unistd.h>
> 
> int main() {
>   /* if this fails the system is busted anyway */
>   return execve( "/bin/sh", "/sbin/init.sh", 0 );
> };

This attempt nicely illustrates Kyle's point. This program *is*
fragile and hackish. It is fragile because, even though it is only
five lines, it contains two major bugs:
1. execve takes an argv array, not a null-terminated list of
   strings. So this compiles with a warning about incompatible
   pointer types; you meant to use execl(3).
2. In the exec family, the path to the program is different from
   argv[0]. So the correct line would be

     return execl("/bin/sh", "sh", "/sbin/init.sh", 0);

It is hackish, because it also lacks a feature commonly
found in such wrappers:
3. arguments passed to the wrapper are not forwarded to the
   executable. In particular, init takes several arguments
   (e.g. the runlevel), which should be forwarded to the
   final executable.

Just try completing the wrapper on your own.
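
For illustration only, here is a rough sketch of how the missing pieces
might be filled in. It assumes the script is /sbin/init.sh and the
interpreter is /bin/sh, as in the example above; the helper names
build_shell_argv and exec_script are made up for this sketch.

```c
#include <stdlib.h>
#include <unistd.h>

/*
 * Build the argument vector for "/bin/sh <script> <args...>".
 * build_shell_argv is a hypothetical helper name for this sketch.
 * The wrapper's own argv[0] is dropped; argv[1..argc-1] are forwarded.
 * Returns NULL if the allocation fails.
 */
char **build_shell_argv(char *script, int argc, char **argv)
{
	char **v;
	int i;

	if (argc < 1)
		argc = 1;	/* tolerate an empty argv */

	/* "sh" + script + (argc - 1) forwarded arguments + NULL */
	v = malloc(((size_t)argc + 2) * sizeof *v);
	if (v == NULL)
		return NULL;

	v[0] = "sh";		/* argv[0] as seen by the shell */
	v[1] = script;		/* the script the shell should run */
	for (i = 1; i < argc; i++)
		v[i + 1] = argv[i];	/* e.g. the runlevel given to init */
	v[argc + 1] = NULL;	/* execv wants a NULL-terminated array */
	return v;
}

/* Replace the current process; this only returns if something failed. */
int exec_script(char *script, int argc, char **argv)
{
	char **v = build_shell_argv(script, argc, argv);

	if (v == NULL)
		return 127;
	execv("/bin/sh", v);
	free(v);
	return 127;
}
```

A wrapper's main(argc, argv) would then simply return
exec_script("/sbin/init.sh", argc, argv). Even this version still
ignores questions such as what to do when /bin/sh is missing.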

Regards,
Martin

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
       [not found]                             ` <4O8Oh-5jp-7@gated-at.bofh.it>
  2005-09-18 19:23                               ` Bodo Eggert
@ 2005-09-19  4:54                               ` "Martin v. Löwis"
  2005-09-19  8:26                                 ` Bernd Petrovitsch
  1 sibling, 1 reply; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-19  4:54 UTC (permalink / raw)
  To: Bernd Petrovitsch, linux-kernel

Bernd Petrovitsch wrote:
>>The specific feature I get is that when I pass a file starting
>>with <utf8sig>#! to execve, Linux will execute the file following
>>the #!. In what way do I get this feature for text in general?
>>And if I do, why is that a problem?
> 
> 
> After applying this patch it seems that "Linux" is supporting this
> marker officially in general - especially if the kernel supports it.

What makes it seem so? That binfmt_script supports a certain convention
doesn't mean that all other programs also somehow need to support that
convention - and certainly not in the same way.

> I suppose the next kernel patch is to support Win-like CR-LF sequences
> (which is not the case AFAIK).

What makes you suppose that? I have no plans to submit such a patch.

> And though scripts are usually edited/changed/"parsed"/... with a text
> editor, it is not always the case. Therefore the automatic extension to
> *all text files* (especially as the marker basically applies to all text
> files, not only scripts).
> You want to focus just on your patch and ignore the directly implied
> potential problems arising ...

Because there are no problems arising. The next time somebody submits
a patch to cat(1) to strip off UTF-8 signatures, you *then* complain
that this is the wrong thing to do, because it violates the
specification of cat.

This reasoning is just flawed: it is like saying to a web browser
developer: "don't _support_ XHTML, because there are so many tools
which use HTML 4".

> Apparently I have to repeat: If you do `cat a.txt b.txt >c.txt` where
> a.txt and b.txt have this marker, then c.txt has the marker of b.txt
> somewhere in the middle. Does this make sense in any way?

Indeed, it does. There is nothing inherently wrong with having
the marker in the middle.

> How do I get rid of the marker in the middle transparently?

http://www.unicode.org/faq/utf_bom.html#38

>>What the editor displays as the number of "things" is up to its own.
>>The output of wc -c will always be the same as the one of ls -l,
>>as wc -c does *not* give you characters:
>>
>>       -c, --bytes
>>              print the byte counts
>>
>>You might have been thinking of 'wc -m'.
> 
> 
> It depends on the definition of "character". There are other standards
> which define "character" as "byte".

Certainly. However, you specifically talked about 'wc -c', and, in
wc(1), at least in the implementation commonly used on Linux, characters
and bytes are not the same.

>>It depends on the editor I use, of course
> 
> 
> No, more on the OS the editor runs on.

You talked about Windows specifically. On Windows, most editors give you
the choice of choosing the line ending, and will preserve whatever line
ending they find when adding new lines to a file.

Regards,
Martin

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
       [not found]                                 ` <4OfZo-7AG-21@gated-at.bofh.it>
@ 2005-09-19  5:11                                   ` "Martin v. Löwis"
  0 siblings, 0 replies; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-19  5:11 UTC (permalink / raw)
  To: Valdis.Kletnieks, linux-kernel

Valdis.Kletnieks@vt.edu wrote:
> For the benefit of those of us who are interested in the problem, but aren't
> in the mood to wade through a long standard looking for the answer to a
> specific question, can you elaborate?

See

http://www.unicode.org/faq/utf_bom.html#38

> It isn't as obvious as all that, because of all the nasty corner cases...

It really depends on the specific structure of the text file. For Python
scripts, the Python interpreter will reject a U+FEFF in the middle of
the file as a syntax error (*). This is, IMO, a reasonable reaction: you
just shouldn't concatenate Python scripts blindly. They may have
different source encodings, so any concatenation of Python scripts
needs to convert them both into a common encoding. The first script
may also fail to terminate with a newline, so concatenating Python
scripts also needs to insert a line break. In addition, you would
also typically want to remove the docstring in the second file.

The same holds for many other formats: for example, you cannot blindly
concatenate XML files, either (the result often won't be an XML file).
So treating the BOM as an error would not cause a problem.

> Given a file "a.txt" that's pure ASCII, and a file "b.txt" that has the BOM
> marker on it, what happens when you do "cat a.txt b.txt > c.txt"?

You answer the question yourself correctly:

> 'cat' doesn't know, and has no way of knowing, that c.txt needs a BOM at the
> *front* of the file until it's already written past the point in c.txt where
> the BOM has to go.
> 
> What does the Unicode standard say to do in this case?

The point is that the BOM *also* is a regular character, U+FEFF. It used
to have a specific function, too, but now U+2060 (WORD JOINER) should
be used for that function. So U+FEFF is exclusively used for the BOM
now. If you see it in the middle of a file, you know it doesn't belong
there (*). In processing the file, you can complain, you can ignore it,
or you can choose to strip it off. Which of these you do depends on
the application; if you don't know better, treating it as ZERO WIDTH
NON-BREAKING SPACE is the recommended reaction.
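
As a rough illustration of the "strip it off" option, a tool that
post-processes concatenated files might do something like the sketch
below. It assumes the input is valid UTF-8, in which case the byte
sequence EF BB BF can only ever be U+FEFF; the function name
strip_feff is made up for this example.

```c
#include <stddef.h>
#include <string.h>

/* UTF-8 encoding of U+FEFF (the signature / BOM). */
static const unsigned char utf8_bom[3] = { 0xEF, 0xBB, 0xBF };

/*
 * Copy len bytes from src to dst, dropping every occurrence of U+FEFF.
 * dst must have room for len bytes. Returns the number of bytes written.
 * Assumes src is valid UTF-8 (hypothetical helper, not from any library).
 */
size_t strip_feff(const unsigned char *src, size_t len, unsigned char *dst)
{
	size_t in = 0, out = 0;

	while (in < len) {
		if (len - in >= 3 && memcmp(src + in, utf8_bom, 3) == 0) {
			in += 3;	/* skip the three BOM bytes */
			continue;
		}
		dst[out++] = src[in++];
	}
	return out;
}
```

An application that prefers to complain instead (as Python does) would
report an error at the point where this sketch skips the bytes.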

Regards,
Martin

(*) unless it occurs in a string literal, in which case it becomes
part of the string. In the case of concatenating two Python files,
it won't be part of a string literal, though, but instead occur
at the beginning of a line.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
  2005-09-18 19:23                               ` Bodo Eggert
  2005-09-18 21:03                                 ` Bernd Petrovitsch
  2005-09-18 22:29                                 ` Valdis.Kletnieks
@ 2005-09-19  6:03                                 ` H. Peter Anvin
  2 siblings, 0 replies; 80+ messages in thread
From: H. Peter Anvin @ 2005-09-19  6:03 UTC (permalink / raw)
  To: 7eggert; +Cc: Bernd Petrovitsch, "Martin v. Löwis", linux-kernel

Bodo Eggert wrote:
> 
> It will be the first POSIX kernel to correctly support utf-8 scripts.
> It's 2005, and according to other(?) posters, this should be standard.
> 

UTF-8, yes.  BOM bullshit, no.

	-hpa

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
  2005-09-17 12:25                       ` "Martin v. Löwis"
  2005-09-17 12:28                         ` Martin Mares
@ 2005-09-19  7:08                         ` Pavel Machek
  2005-09-19  7:18                           ` "Martin v. Löwis"
  1 sibling, 1 reply; 80+ messages in thread
From: Pavel Machek @ 2005-09-19  7:08 UTC (permalink / raw)
  To: Martin v. Löwis; +Cc: Martin Mares, linux-kernel

Hi!

> I just can't understand why (some) people are so opposed to this patch.
> It is a really trivial, straight-forward change. It introduces no
> policy, just a feature: you can put the UTF-8 signature in your script
> file, if you want to (and your scripting language supports it). By
> no means does it force you to put the UTF-8 signature in all your script
> files, let alone all your text files.

Why is binfmt_misc not enough for you?
								Pavel

-- 
if you have sharp zaurus hardware you don't need... you know my address

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
  2005-09-19  7:08                         ` Pavel Machek
@ 2005-09-19  7:18                           ` "Martin v. Löwis"
  2005-09-19  7:24                             ` Pavel Machek
  2005-09-19 23:49                             ` Horst von Brand
  0 siblings, 2 replies; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-19  7:18 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Martin Mares, linux-kernel

Pavel Machek wrote:
> Why is binfmt_misc not enough for you?

For two reasons: for one, it has the overhead of yet another
exec call. This is different from usages for, say, Java byte
code or Python byte code, where the registered interpreter already
is the eventual binary which has to be invoked anyway; for
a binfmt_misc application, you need an additional wrapper
which reinterprets the first line, and then invokes the eventual
interpreter.

The other reason is availability: as an author of a UTF-8
script, you would have to communicate to your users that they
need the right binfmt_misc wrapper installed (which they may
have to build first). While installing additional stuff to
run a single program is acceptable for large applications,
it is likely not for script files. To make the feature useful
in practice, it must be builtin.
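
To make concrete what either a builtin check or such a wrapper has to
recognize, here is a user-space sketch of the test on the first bytes
of the file. It is not the code from the patch; the function name
script_magic_len is invented for this example, and unsigned char is
used so that the comparisons with 0xEF etc. do not depend on whether
char is signed.

```c
#include <stddef.h>

/*
 * Return the offset just past "#!" if buf starts with "#!" or with the
 * UTF-8 signature (EF BB BF) followed by "#!"; return 0 otherwise.
 * script_magic_len is a hypothetical name used only in this sketch.
 */
size_t script_magic_len(const unsigned char *buf, size_t len)
{
	/* Plain "#!" script. */
	if (len >= 2 && buf[0] == '#' && buf[1] == '!')
		return 2;

	/* "#!" preceded by the UTF-8 signature. */
	if (len >= 5 &&
	    buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF &&
	    buf[3] == '#' && buf[4] == '!')
		return 5;

	return 0;	/* not a script by this test */
}
```

With binfmt_misc, a wrapper would perform this test a second time in
user space and then exec the interpreter named after the returned
offset, which is where the additional exec comes from.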

Regards,
Martin

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
  2005-09-19  7:18                           ` "Martin v. Löwis"
@ 2005-09-19  7:24                             ` Pavel Machek
  2005-09-19  7:46                               ` "Martin v. Löwis"
  2005-09-19 10:48                               ` Alan Cox
  2005-09-19 23:49                             ` Horst von Brand
  1 sibling, 2 replies; 80+ messages in thread
From: Pavel Machek @ 2005-09-19  7:24 UTC (permalink / raw)
  To: Martin v. Löwis; +Cc: Martin Mares, linux-kernel

On Po 19-09-05 09:18:33, "Martin v. Löwis" wrote:
> Pavel Machek wrote:
> > Why is binfmt_misc not enough for you?
> 
> For two reasons: for one, it has the overhead of yet another
> exec call. This is different from usages for, say, Java byte
> code or Python byte code, where the registered interpreter already
> is the eventual binary which has to be invoked anyway; for
> a binfmt_misc application, you need an additional wrapper
> which reinterprets the first line, and then invokes the eventual
> interpreter.

Who cares? exec is fast.

> The other reason is availability: as an author of a UTF-8
> script, you would have to communicate to your users that they
> need the right binfmt_misc wrapper installed (which they may
> have to build first). While installing additional stuff to
> run a single program is acceptable for large applications,
> it is likely not for script files. To make the feature useful
> in practice, it must be builtin.

This is a distribution problem, not a kernel problem. "/bin/ls should be
built into kernel, because otherwise you can't call /bin/ls from
script" is not an argument.

If UTF-8 compatibility is important, distros will get it right. If it
is not, you lose, but at least the kernel is not messed up.

								Pavel
-- 
if you have sharp zaurus hardware you don't need... you know my address

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
  2005-09-19  7:24                             ` Pavel Machek
@ 2005-09-19  7:46                               ` "Martin v. Löwis"
  2005-09-19  7:50                                 ` Pavel Machek
  2005-09-19 10:48                               ` Alan Cox
  1 sibling, 1 reply; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-19  7:46 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Martin Mares, linux-kernel

Pavel Machek wrote:
> If UTF-8 compatibility is important, distros will get it right. If it
> is not, you lose, but at least the kernel is not messed up.

The patch doesn't mess up the kernel.

Regards,
Martin

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
  2005-09-19  7:46                               ` "Martin v. Löwis"
@ 2005-09-19  7:50                                 ` Pavel Machek
  0 siblings, 0 replies; 80+ messages in thread
From: Pavel Machek @ 2005-09-19  7:50 UTC (permalink / raw)
  To: Martin v. Löwis; +Cc: Martin Mares, linux-kernel

On Po 19-09-05 09:46:11, "Martin v. Löwis" wrote:
> Pavel Machek wrote:
> > If UTF-8 compatibility is important, distros will get it right. If it
> > is not, you lose, but at least the kernel is not messed up.
> 
> The patch doesn't mess up the kernel.

Every patch does.

Except that yours does not, because it is not going in :-).
								Pavel

-- 
if you have sharp zaurus hardware you don't need... you know my address

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
  2005-09-19  4:54                               ` "Martin v. Löwis"
@ 2005-09-19  8:26                                 ` Bernd Petrovitsch
  2005-09-19  9:00                                   ` Valdis.Kletnieks
  2005-09-19 21:40                                   ` "Martin v. Löwis"
  0 siblings, 2 replies; 80+ messages in thread
From: Bernd Petrovitsch @ 2005-09-19  8:26 UTC (permalink / raw)
  To: "Martin v. Löwis"; +Cc: linux-kernel

On Mon, 2005-09-19 at 06:54 +0200, "Martin v. Löwis" wrote:
> Bernd Petrovitsch wrote:
> >>The specific feature I get is that when I pass a file starting
> >>with <utf8sig>#! to execve, Linux will execute the file following
> >>the #!. In what way do I get this feature for text in general?
> >>And if I do, why is that a problem?
> > 
> > After applying this patch it seems that "Linux" is supporting this
> > marker officially in general - especially if the kernel supports it.
> 
> What makes it seem so? That binfmt_script supports a certain convention
> doesn't mean that all other programs also somehow need to support that
> convention - and certainly not in the same way.

We will see how it develops. Actually the marker could be used to detect
endianness of the file if I read below URL correctly ....

> > I suppose the next kernel patch is to support Win-like CR-LF sequences
> > (which is not the case AFAIK).
> 
> What makes you suppose that? I have no plans to submit such a patch.

No need to. Other people tried already.

> This reasoning is just flawed: it is like saying to a web browser
> developer: "don't _support_ XHTML, because there are so many tools
> which use HTML 4".

No, the saying was more: "don't support XHTML since it may break HTML
compliant browsers."
For XHTML/HTML we all know that this is not the case, so the comparison
is flawed.

> > Apparently I have to repeat: If you do `cat a.txt b.txt >c.txt` where
> > a.txt and b.txt have this marker, then c.txt has the marker of b.txt
> > somewhere in the middle. Does this make sense in any way?
> 
> Indeed, it does. There is nothing inherently wrong with having
> the marker in the middle.
> 
> > How do I get rid of the marker in the middle transparently?
> 
> http://www.unicode.org/faq/utf_bom.html#38

Thanks.
----  snip  ----
In that case, any U+FEFF occurring in the middle of the file can be
ignored, or treated as an error.
----  snip  ----
Well, this doesn't sound like a clear rule stating that it *must* be
ignored.
BTW:
----  snip  ----
Q: How should I deal with BOMs?
[...]
3. Some byte oriented protocols expect ASCII characters at the beginning
of a file. If UTF-8 is used with these protocols, use of the BOM as
encoding form signature should be avoided.
----  snip  ----
Voila. Avoid the BOM in your scripts and be done.

> > It depends on the definition of "character". There are other standards
> > which define "character" as "byte".
> 
> Certainly. However, you specifically talked about 'wc -c', and, in
> wc(1), at least in the implementation commonly used on Linux, characters
> and bytes are not the same.

Yes, now that multi-byte character sets are more commonly used.
However, I don't think you get this into the C standard. But we are now
far off the discussion ....

> >>It depends on the editor I use, of course
> > 
> > No, more on the OS the editor runs on.
> 
> You talked about Windows specifically. On Windows, most editors give you
> the choice of choosing the line ending, and will preserve whatever line
> ending they find when adding new lines to a file.

I believe this for vim, emacs, etc., but I don't believe it for the native
ones ...

	Bernd
-- 
Firmix Software GmbH                   http://www.firmix.at/
mobil: +43 664 4416156                 fax: +43 1 7890849-55
          Embedded Linux Development and Services


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
  2005-09-19  8:26                                 ` Bernd Petrovitsch
@ 2005-09-19  9:00                                   ` Valdis.Kletnieks
  2005-09-19  9:41                                     ` Bernd Petrovitsch
  2005-09-19 21:40                                   ` "Martin v. Löwis"
  1 sibling, 1 reply; 80+ messages in thread
From: Valdis.Kletnieks @ 2005-09-19  9:00 UTC (permalink / raw)
  To: Bernd Petrovitsch; +Cc: "Martin v. Löwis", linux-kernel

[-- Attachment #1: Type: text/plain, Size: 635 bytes --]

On Mon, 19 Sep 2005 10:26:22 +0200, Bernd Petrovitsch said:

> We will see how it develops. Actually the marker could be used to detect
> endianness of the file if I read below URL correctly ....

Text files have endianness????

> ----  snip  ----
> Q: How should I deal with BOMs?
> [...]
> 3. Some byte oriented protocols expect ASCII characters at the beginning
> of a file. If UTF-8 is used with these protocols, use of the BOM as
> encoding form signature should be avoided.
> ----  snip  ----
> Voila. Avoid the BOM in your scripts and be done.

At which point the proposed kernel patch becomes pointless.. ;)


[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
  2005-09-19  9:00                                   ` Valdis.Kletnieks
@ 2005-09-19  9:41                                     ` Bernd Petrovitsch
  0 siblings, 0 replies; 80+ messages in thread
From: Bernd Petrovitsch @ 2005-09-19  9:41 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: "Martin v. Löwis", linux-kernel

On Mon, 2005-09-19 at 05:00 -0400, Valdis.Kletnieks@vt.edu wrote:
> On Mon, 19 Sep 2005 10:26:22 +0200, Bernd Petrovitsch said:
> 
> > We will see how it develops. Actually the marker could be used to detect
> > endianness of the file if I read below URL correctly ....
> 
> Text files have endianness????

UTF-16 ones with 16 bits per character (as in Win NT), yes.
UTF-8 ones do not, AFAIK.

	Bernd
-- 
Firmix Software GmbH                   http://www.firmix.at/
mobil: +43 664 4416156                 fax: +43 1 7890849-55
          Embedded Linux Development and Services


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
  2005-09-19  7:24                             ` Pavel Machek
  2005-09-19  7:46                               ` "Martin v. Löwis"
@ 2005-09-19 10:48                               ` Alan Cox
  1 sibling, 0 replies; 80+ messages in thread
From: Alan Cox @ 2005-09-19 10:48 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Martin v. Löwis, Martin Mares, linux-kernel

On Llu, 2005-09-19 at 09:24 +0200, Pavel Machek wrote:
> > which reinterprets the first line, and then invokes the eventual
> > interpreter.
> 
> Who cares? exec is fast.

It would be nice if it was but exec + user space overhead of startup is
merely "faster than many equivalent systems". It's still slow.



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
  2005-09-18 21:03                                 ` Bernd Petrovitsch
@ 2005-09-19 19:37                                   ` Bodo Eggert
  0 siblings, 0 replies; 80+ messages in thread
From: Bodo Eggert @ 2005-09-19 19:37 UTC (permalink / raw)
  To: Bernd Petrovitsch
  Cc: 7eggert, Martin v. Löwis, H. Peter Anvin, linux-kernel

On Sun, 18 Sep 2005, Bernd Petrovitsch wrote:
> On Sun, 2005-09-18 at 21:23 +0200, Bodo Eggert wrote:

> > >, etc.). Since the
> > > kernel can start java classes directly, you can probably make a similar
> > > thing for the UTF-8 stuff.
> > 
> > If MSDOS text files are legal scripts, the kernel
> > should recognize [\x0D\x0A] as valid line breaks.
> 
> The Unix world does recognize the line breaks.

Create a valid text file with macintosh line breaks (as allowed in unicode 
files) and try it.
-- 
If enough data is collected, a board of inquiry can prove ANYTHING. 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
  2005-09-19  8:26                                 ` Bernd Petrovitsch
  2005-09-19  9:00                                   ` Valdis.Kletnieks
@ 2005-09-19 21:40                                   ` "Martin v. Löwis"
  1 sibling, 0 replies; 80+ messages in thread
From: "Martin v. Löwis" @ 2005-09-19 21:40 UTC (permalink / raw)
  To: Bernd Petrovitsch; +Cc: "Martin v. Löwis", linux-kernel

Bernd Petrovitsch wrote:
>>>It depends on the definition of "character". There are other standards
>>>which define "character" as "byte".
>>
>>Certainly. However, you specifically talked about 'wc -c', and, in
>>wc(1), at least in the implementation commonly used on Linux, characters
>>and bytes are not the same.
> 
> 
> Yes, now that multi-byte character sets are more commonly used.
> However, I don't think you get this into the C standard. But we are now
> far off the discussion ....

It does indeed, so just one final clarification. wc(1) is not part
of the C standard - ISO 9899 does not talk about command line utilities
at all. The relevant standard is POSIX; IEEE Std 1003.1, 2004 Edition
says, in

http://www.opengroup.org/onlinepubs/009695399/utilities/wc.html

-c
    Write to the standard output the number of bytes in each input file.
[...]
-m
    Write to the standard output the number of characters in each input
file.

[...]
RATIONALE
[...]
The -c option stands for "character" count, even though it counts bytes.
This stems from the sometimes erroneous historical view that bytes and
characters are the same size. Due to international requirements, the -m
option (reminiscent of "multi-byte") was added to obtain actual
character counts.
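
To illustrate the difference between the two counts, here is a small
sketch. It assumes the input is valid UTF-8; a real wc -m decodes
according to the current locale instead, and the function name
utf8_char_count is made up for this example.

```c
#include <stddef.h>

/*
 * Count the characters (code points) in a UTF-8 buffer of nbytes bytes.
 * Every character has exactly one byte that is not a continuation byte
 * (continuation bytes look like 10xxxxxx), so counting those is enough.
 * Assumes valid UTF-8; utf8_char_count is a hypothetical helper.
 */
size_t utf8_char_count(const unsigned char *s, size_t nbytes)
{
	size_t i, n = 0;

	for (i = 0; i < nbytes; i++)
		if ((s[i] & 0xC0) != 0x80)
			n++;
	return n;
}
```

For the six bytes of "Löwis" encoded in UTF-8, the byte count (what
-c reports) is 6, while this function returns 5 (what -m would report
in a UTF-8 locale).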

Regards,
Martin

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Patch] Support UTF-8 scripts
  2005-09-19  7:18                           ` "Martin v. Löwis"
  2005-09-19  7:24                             ` Pavel Machek
@ 2005-09-19 23:49                             ` Horst von Brand
  1 sibling, 0 replies; 80+ messages in thread
From: Horst von Brand @ 2005-09-19 23:49 UTC (permalink / raw)
  To: "Martin v. Löwis"; +Cc: Pavel Machek, Martin Mares, linux-kernel

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 969 bytes --]

Martin v. Löwis <martin@v.loewis.de> wrote:
> Pavel Machek wrote:
> > Why is binfmt_misc not enough for you?

> For two reasons: for one, it has the overhead of yet another
> exec call.

For an interpreted language this is surely irrelevant.

[...]

> The other reason is availability: as an author of a UTF-8
> script, you would have to communicate to your users that they
> need the right binfmt_misc wrapper installed (which they may
> have to build first). While installing additional stuff to
> run a single program is acceptable for large applications,
> it is likely not for script files. To make the feature useful
> in practice, it must be builtin.

That is a distribution problem.
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513

^ permalink raw reply	[flat|nested] 80+ messages in thread

2005-09-19 21:40                                   ` "Martin v. Löwis"
     [not found] <4B2ZV-2dl-7@gated-at.bofh.it>
     [not found] ` <4HKbZ-Cx-37@gated-at.bofh.it>
2005-09-15 18:24   ` "Martin v. Löwis"
2005-09-15 18:25     ` H. Peter Anvin
2005-09-15 18:39       ` "Martin v. Löwis"
2005-09-15 19:20         ` H. Peter Anvin
2005-09-16  8:13         ` Bernd Petrovitsch
2005-08-13 12:07 "Martin v. Löwis"
2005-08-13 16:35 ` Stephen Pollei
2005-08-13 18:42   ` Lee Revell
2005-08-13 18:49     ` Hugo Mills
2005-08-13 18:53       ` Lee Revell
2005-08-14  0:57         ` Alan Cox
2005-08-14  1:19           ` Kyle Moffett
2005-08-14  1:40             ` Lee Revell
2005-08-14 10:40               ` Wichert Akkerman
2005-08-13 19:20       ` Lee Revell
2005-08-16  9:46       ` Jan Engelhardt
2005-08-14  0:53     ` Alan Cox
2005-08-14  4:10       ` James Cloos
2005-08-14  6:18     ` Jason L Tibbitts III
     [not found]       ` <feed8cdd050814125845fe4e2e@mail.gmail.com>
2005-08-14 19:59         ` Lee Revell
2005-08-14 20:13           ` Stephen Pollei
2005-08-14 20:22             ` Lee Revell
2005-08-14 22:10               ` "Martin v. Löwis"
2005-08-14 23:55           ` Alan Cox
2005-08-16 13:56           ` David Madore
     [not found]           ` <mailman.1124063520.13257.linux-kernel2news@redhat.com>
2005-08-16 20:17             ` Pete Zaitcev
2005-08-14 21:52       ` Kyle Moffett
2005-08-14 22:12         ` Valdis.Kletnieks
2005-08-15  8:01     ` Helge Hafting
2005-08-31 23:27 ` H. Peter Anvin
