From: "Martin v. Löwis" <martin@v.loewis.de>
To: Valdis.Kletnieks@vt.edu, linux-kernel@vger.kernel.org
Subject: Re: [Patch] Support UTF-8 scripts
Date: Mon, 19 Sep 2005 07:11:01 +0200
Message-ID: <432E4865.3000109@v.loewis.de>
In-Reply-To: <4OfZo-7AG-21@gated-at.bofh.it>
Valdis.Kletnieks@vt.edu wrote:
> For the benefit of those of us who are interested in the problem, but aren't
> in the mood to wade through a long standard looking for the answer to a
> specific question, can you elaborate?
See
http://www.unicode.org/faq/utf_bom.html#38
> It isn't as obvious as all that, because of all the nasty corner cases...
It really depends on the specific structure of the text file. For Python
scripts, the Python interpreter will reject a U+FEFF in the middle of
the file as a syntax error (*). This is, IMO, a reasonable reaction: you
just shouldn't concatenate Python scripts blindly. They may have
different source encodings, so any concatenation of Python scripts
needs to convert them both into a common encoding. The first script
may also fail to terminate with a newline, so concatenating Python
scripts also needs to insert a line break. In addition, you would
also typically want to remove the docstring in the second file.
The same holds for many other formats: for example, you cannot blindly
concatenate XML files, either (the result often won't be an XML file).
So treating the BOM as an error would cause no problem.
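To make the steps above concrete, here is a minimal sketch of a non-blind concatenation. The function name concat_sources is hypothetical, and the sketch assumes the caller already knows each file's source encoding; it does not parse coding declarations and does not remove docstrings.

```python
BOM = "\ufeff"


def concat_sources(parts):
    """Join Python sources given as (raw_bytes, encoding) pairs.

    Hypothetical helper: the caller is assumed to know each encoding.
    """
    pieces = []
    for raw, encoding in parts:
        # Convert every script into a common representation (str).
        text = raw.decode(encoding)
        # A leading U+FEFF is only a signature; drop it so it cannot
        # end up in the middle of the result.
        if text.startswith(BOM):
            text = text[1:]
        # The script may fail to terminate with a newline.
        if text and not text.endswith("\n"):
            text += "\n"
        pieces.append(text)
    return "".join(pieces)


if __name__ == "__main__":
    first = b"x = 1"                          # ASCII, no final newline
    second = b"\xef\xbb\xbfy = 2\n"           # UTF-8 with a BOM
    third = "z = '\xe9'".encode("latin-1")    # Latin-1 source
    joined = concat_sources(
        [(first, "ascii"), (second, "utf-8"), (third, "latin-1")]
    )
    print(repr(joined))
```

The result is a single str; writing it back out in one encoding (UTF-8, say) is left to the caller.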
> Given a file "a.txt" that's pure ASCII, and a file "b.txt" that has the BOM
> marker on it, what happens when you do "cat a.txt b.txt > c.txt"?
You answer the question yourself correctly:
> 'cat' doesn't know, and has no way of knowing, that c.txt needs a BOM at the
> *front* of the file until it's already written past the point in c.txt where
> the BOM has to go.
>
> What does the Unicode standard say to do in this case?
The point is that the BOM *also* is a regular character, U+FEFF. It used
to have a specific function, too, but now U+2060 (WORD JOINER) should
be used for that function. So U+FEFF is exclusively used for the BOM
now. If you see it in the middle of a file, you know it doesn't belong
there (*). In processing the file, you can complain, you can ignore it,
or you can choose to strip it off. Which of these you do depends on
the application; if you don't know better, treating it as ZERO WIDTH
NON-BREAKING SPACE is the recommended reaction.
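As an illustration of the "cat a.txt b.txt" case and of the three reactions, here is a small sketch. The function handle_interior_bom and its policy names are made up for this example; they are not part of any standard or library.

```python
import codecs

BOM = "\ufeff"


def handle_interior_bom(text, policy="ignore"):
    """Apply one of three reactions to U+FEFF found after position 0.

    Hypothetical helper; the policy names are chosen for this sketch.
    """
    head, body = text[:1], text[1:]
    if BOM not in body:
        return text                      # nothing in the middle
    if policy == "complain":
        raise ValueError("U+FEFF in the middle of the text")
    if policy == "strip":
        return head + body.replace(BOM, "")
    if policy == "ignore":
        return text                      # leave it as ZWNBSP
    raise ValueError("unknown policy: %r" % (policy,))


if __name__ == "__main__":
    a = b"hello\n"                       # pure ASCII, like a.txt
    b = codecs.BOM_UTF8 + b"world\n"     # BOM-prefixed, like b.txt
    c = a + b                            # what cat would write to c.txt
    text = c.decode("utf-8")
    print(repr(text))
    print(repr(handle_interior_bom(text, "strip")))
```

Note that decoding c with "utf-8-sig" instead would not help here, since that codec only drops a signature at the very start of the data.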
Regards,
Martin
(*) unless it occurs in a string literal, in which case it becomes
part of the string. In the case of concatenating two Python files,
it won't be part of a string literal, though, but instead occur
at the beginning of a line.
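
The footnote can be checked with a short sketch. It assumes a current Python 3 interpreter (the interpreter under discussion in 2005 was Python 2.x, whose error messages differ), and the helper name compiles is made up for this example.

```python
def compiles(source):
    """Return True if the source text compiles, False on a syntax error."""
    try:
        compile(source, "<concatenated>", "exec")
    except SyntaxError:
        return False
    return True


# Two scripts joined blindly: U+FEFF lands at the beginning of a line.
joined = "x = 1\n" + "\ufeffy = 2\n"

# U+FEFF inside a string literal simply becomes part of the string.
in_literal = 's = "\ufeff"\n'

if __name__ == "__main__":
    print(compiles(joined))
    print(compiles(in_literal))
    namespace = {}
    exec(in_literal, namespace)
    print(len(namespace["s"]))
```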