Re: OT: character encodings (was: Linux 2.6.20-rc4)


* Re: OT: character encodings (was: Linux 2.6.20-rc4)
@ 2007-01-08 10:24 Nicolas Mailhot
  2007-01-08 10:44 ` Alan
  0 siblings, 1 reply; 39+ messages in thread
From: Nicolas Mailhot @ 2007-01-08 10:24 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: linux-kernel

>> How would you do this technically in a way that it's significantly
>> easier than simply finishing the UTF-8 transition?

> In how many decades do you think the transition will be finished ?

Right now it looks like it will be finished way earlier than apps bother
supporting the later 8-bit encodings such as iso-8859-15.

(case in point: Russell's system. I was ROTFL when he proudly announced he
was running a full iso-8859-1 system after dissing UTF-8. Last I've seen
the official 8bit EU encoding was iso-8859-15, and UK is part of the EU)

-- 
Nicolas Mailhot



* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-08 10:24 OT: character encodings (was: Linux 2.6.20-rc4) Nicolas Mailhot
@ 2007-01-08 10:44 ` Alan
  2007-01-08 10:44   ` Nicolas Mailhot
  0 siblings, 1 reply; 39+ messages in thread
From: Alan @ 2007-01-08 10:44 UTC (permalink / raw)
  To: Nicolas Mailhot; +Cc: Willy Tarreau, linux-kernel

> (case in point: Russell's system. I was ROTFL when he proudly announced he
> was running a full iso-8859-1 system after dissing UTF-8. Last I've seen
> the official 8bit EU encoding was iso-8859-15, and UK is part of the EU)

There is no correct UK encoding. You need -14 or -15 depending upon
language and can come horribly unstuck the moment a name is involved.
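
A minimal Python sketch of the point above, not part of the original mail. It uses only the standard codecs; the helper name `encodable_in` and the sample strings are assumptions chosen for illustration. It checks which of the three 8-bit charsets mentioned in this thread can represent a given string.

```python
# Hypothetical helper: list the candidate 8-bit charsets that can
# represent every character of a string.
CANDIDATES = ("iso-8859-1", "iso-8859-14", "iso-8859-15")


def encodable_in(text, encodings=CANDIDATES):
    usable = []
    for name in encodings:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # at least one character is missing from this charset
        usable.append(name)
    return usable


print(encodable_in("\u20ac5"))           # a price with a euro sign
print(encodable_in("d\u0175r"))          # Welsh word with w-circumflex
print(encodable_in("\u20ac5 d\u0175r"))  # both in the same text
```

The third call is the "horribly unstuck" case: once a euro sign and a Welsh name share a line, none of the three charsets covers it.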

Alan


* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-08 10:44 ` Alan
@ 2007-01-08 10:44   ` Nicolas Mailhot
  0 siblings, 0 replies; 39+ messages in thread
From: Nicolas Mailhot @ 2007-01-08 10:44 UTC (permalink / raw)
  To: Alan; +Cc: Willy Tarreau, linux-kernel


On Mon, 8 January 2007 11:44, Alan wrote:
>> (case in point: Russell's system. I was ROTFL when he proudly announced he
>> was running a full iso-8859-1 system after dissing UTF-8. Last I've seen
>> the official 8bit EU encoding was iso-8859-15, and UK is part of the EU)
>
> There is no correct UK encoding. You need -14 or -15 depending upon
> language and can come horribly unstuck the moment a name is involved.

Either way it's not iso-8859-1 :)

-- 
Nicolas Mailhot



* Re: OT: character encodings (was: Linux 2.6.20-rc4)
@ 2007-01-08 10:13 Nicolas Mailhot
  0 siblings, 0 replies; 39+ messages in thread
From: Nicolas Mailhot @ 2007-01-08 10:13 UTC (permalink / raw)
  To: linux-kernel; +Cc: Russell King

> elinks is one such program.  It now assumes UTF-8 _only_ displays.
> That's no better than programs which assume ISO-8859-1 only or US-ASCII
> only.

That's way better than programs:
- which assume an encoding you can't write most world languages in (BTW
ISO-8859-1 & US-ASCII are broken by design for Western Europe since at
least the Euro creation)
- which perpetuate the myth that local 8-bit encodings are manageable (they
aren't, people spent decades trying to limp along with them, Unicode &
UTF-8 were not created just to make your life miserable)

Show me one program that spurns Unicode and I'll show you one that "passed on"
iso-8859-15 (typically, though it's the easiest non-iso-8859-1 to do)

The only reason you have the UTF-8 big stick approach nowadays is that people
have tried for years to get app writers to manage 8-bit locales properly, with
dismal results. The old system only worked for en_US users (and
perhaps for .uk people)

-- 
Nicolas Mailhot



* Linux 2.6.20-rc4
@ 2007-01-07  6:19 Linus Torvalds
  2007-01-07 10:56 ` Jan Engelhardt
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2007-01-07  6:19 UTC (permalink / raw)
  To: Linux Kernel Mailing List



There's absolutely nothing interesting here, unless you want to play with 
KVM, or happened to be bitten by the bug with really old versions of the 
linker that made parts of entry.S just go away.

But check it out anyway, and the shortlog gives more details on the 
various minor fixes that have accumulated this week. Mostly in random 
device drivers.

		Linus

---
Adam Megacz (1):
      Add AFS_SUPER_MAGIC to magic.h

Adrian Bunk (2):
      [NET] drivers/net/loopback.c: convert to module_init()
      [X25]: proper prototype for x25_init_timers()

Alan (3):
      libata: fix combined mode
      atiixp: Old drivers/ide layer driver for the ATIIXP hang fix
      hpt37x: Two important bug fixes

Alan Stern (2):
      UHCI: make test for ASUS motherboard more specific
      UHCI: support device_may_wakeup

Alexey Dobriyan (2):
      [NETFILTER] xt_hashlimit.c: fix typo
      pata_optidma: typo in Kconfig

Andrew Morton (5):
      USB: funsoft is borken on sparc
      sisusb_con warning fixes
      PCI: disable PCI_MULTITHREAD_PROBE
      ip2 warning fix
      shrink_all_memory(): fix lru_pages handling

Ard van Breemen (3):
      start_kernel: test if irq's got enabled early, barf, and disable them again
      kernelparams: detect if and which parameter parsing enabled irq's
      PCI: prevent down_read when pci_devices is empty

Arnaud Patard (2):
      [ARM] 4065/1: S3C24XX: dma printk fixes
      [ARM] 4073/1: Prevent s3c24xx drivers from including asm/arch/hardware.h and asm/arch/irqs.h

Avi Kivity (39):
      KVM: Prevent stale bits in cr0 and cr4
      KVM: MMU: Implement simple reverse mapping
      KVM: MMU: Teach the page table walker to track guest page table gfns
      KVM: MMU: Load the pae pdptrs on cr3 change like the processor does
      KVM: MMU: Fold fetch_guest() into init_walker()
      KVM: MU: Special treatment for shadow pae root pages
      KVM: MMU: Use the guest pdptrs instead of mapping cr3 in pae mode
      KVM: MMU: Make the shadow page tables also special-case pae
      KVM: MMU: Make kvm_mmu_alloc_page() return a kvm_mmu_page pointer
      KVM: MMU: Shadow page table caching
      KVM: MMU: Write protect guest pages when a shadow is created for them
      KVM: MMU: Let the walker extract the target page gfn from the pte
      KVM: MMU: Support emulated writes into RAM
      KVM: MMU: Zap shadow page table entries on writes to guest page tables
      KVM: MMU: If emulating an instruction fails, try unprotecting the page
      KVM: MMU: Implement child shadow unlinking
      KVM: MMU: kvm_mmu_put_page() only removes one link to the page
      KVM: MMU: oom handling
      KVM: MMU: Remove invlpg interception
      KVM: MMU: Remove release_pt_page_64()
      KVM: MMU: Handle misaligned accesses to write protected guest page tables
      KVM: MMU: <ove is_empty_shadow_page() above kvm_mmu_free_page()
      KVM: MMU: Ensure freed shadow pages are clean
      KVM: MMU: If an empty shadow page is not empty, report more info
      KVM: MMU: Page table write flood protection
      KVM: MMU: Never free a shadow page actively serving as a root
      KVM: MMU: Fix cmpxchg8b emulation
      KVM: MMU: Treat user-mode faults as a hint that a page is no longer a page table
      KVM: MMU: Free pages on kvm destruction
      KVM: MMU: Replace atomic allocations by preallocated objects
      KVM: MMU: Detect oom conditions and propagate error to userspace
      KVM: MMU: Flush guest tlb when reducing permissions on a pte
      KVM: MMU: Destroy mmu while we still have a vcpu left
      KVM: MMU: add audit code to check mappings, etc are correct
      KVM: Improve reporting of vmwrite errors
      KVM: Initialize vcpu->kvm a little earlier
      KVM: Add missing 'break'
      KVM: Don't set guest cr3 from vmx_vcpu_setup()
      KVM: MMU: Add missing dirty bit

Bartlomiej Zolnierkiewicz (1):
      via82cxxx: fix cable detection

Ben Dooks (1):
      [ARM] 4071/1: S3C24XX: Documentation update

Benjamin Herrenschmidt (1):
      [SUNGEM]: PHY updates & pause fixes (#2)

Brice Goglin (1):
      [CPUFREQ] speedstep-centrino: missing space and bracket

Christoph Hellwig (2):
      [XFRM_USER]: avoid pointless void ** casts
      Fix BUG at drivers/scsi/scsi_lib.c:1118 caused by "pktsetup dvd /dev/sr0"

Christoph Lameter (1):
      Check for populated zone in __drain_pages

Chuck Ebbert (1):
      [NETFILTER]: ebtables: don't compute gap before checking struct type

Cyrill V. Gorcunov (1):
      qconf: fix SIGSEGV on empty menu items

Dan Williams (1):
      [ARM] 4077/1: iop13xx: fix __io() macro

Dave Jones (3):
      [CPUFREQ] longhaul: Fix up unreachable code.
      [CPUFREQ] longhaul: Kill off warnings introduced by recent changes.
      Fix implicit declarations in via-pmu

David Brownell (4):
      i2c: Migration aids for i2c_adapter.dev removal
      USB: omap_udc build fixes (sync with linux-omap)
      rtc-at91rm9200 build fix
      Update the rtc-rs5c372 driver

David Hollis (1):
      USB: asix: Fix AX88772 device PHY selection

David L Stevens (1):
      [IPV4/IPV6]: Fix inet{,6} device initialization order.

David S. Miller (2):
      [PKTGEN]: Convert to kthread API.
      [SOUND] Sparc CS4231: Use 64 for period_bytes_min

Dmitry Mishin (1):
      [NETFILTER]: compat offsets size change

Dor Laor (2):
      KVM: Improve interrupt response
      KVM: Simplify test for interrupt window

Doug Chapman (1):
      ACPI: increase ACPI_MAX_REFERENCE_COUNT for larger systems

Eric Anholt (1):
      [AGPGART] fix detection of aperture size versus GTT size on G965

Eric Sandeen (1):
      fix memory corruption from misinterpreted bad_inode_ops return values

Erik Jacobson (1):
      connector: some fixes for ia64 unaligned access errors

Evgeniy Dushistov (1):
      fix garbage instead of zeroes in UFS

Gabriel Mansi (1):
      [AGPGART] K8M890 support for amd-k8.

Georg Chini (1):
      [SOUND] Sparc CS4231: Fix IRQ return value and initialization.

Gerrit Renker (1):
      [TCP]: Use old definition of before

Guillaume Chazarain (2):
      ACPI: EC: move verbose printk to debug build only
      [CPUFREQ] Uninitialized use of cmd.val in arch/i386/kernel/cpu/cpufreq/acpi-cpufreq.c:acpi_cpufreq_target()

Hugh Dickins (2):
      fix BUG_ON(!PageSlab) from fallback_alloc
      fix OOM killing of swapoff

Ingo Molnar (6):
      KVM: Fix GFP_KERNEL alloc in atomic section bug
      KVM: Use raw_smp_processor_id() instead of smp_processor_id() where applicable
      profiling: fix sched profiling typo
      KVM: Avoid oom on cr3 switch
      KVM: Make loading cr3 more robust
      KVM: Simplify mmu_alloc_roots()

James Bursa (1):
      adfs: fix filename handling

Jens Axboe (3):
      cfq-iosched: merging problem
      cdrom: set default timeout to 7 seconds
      ide-cd maintainer

Jiri Kosina (1):
      HID: fix help texts in Kconfig

Kay Sievers (1):
      Driver core: Fix prefix driver links in /sys/module by bus-name

Len Brown (2):
      ACPI: fix section mis-match build warning
      ACPI: asus_acpi: new MAINTAINER

Lennert Buytenhek (1):
      [ARM] 4063/1: ep93xx: fix IRQ_EP93XX_GPIO?MUX numbering

Leonard NorrgÃ¥rd (1):
      sound: hda: detect ALC883 on MSI K9A Platinum motherboards (MS-7280)

Linus Torvalds (3):
      Revert "[PATCH] x86_64: fix boot hang caused by CALGARY_IOMMU_ENABLED_BY_DEFAULT"
      Revert "[PATCH] binfmt_elf: randomize PIE binaries (2nd try)"
      Linux 2.6.20-rc4

Mariusz Kozlowski (1):
      [AF_NETLINK]: module_put cleanup

Martin Josefsson (1):
      [NETFILTER]: nf_nat: fix MASQUERADE crash on device down

Martin Williges (1):
      USB: usblp.c - add Kyocera Mita FS 820 to list of "quirky" printers

Matthijs van Otterdijk (1):
      fix the toshiba_acpi write_lcd return value

Maxime Bizon (1):
      i2c-mv64xxx: Fix random oops at boot

Miguel Angel Alvarez (1):
      USB: fix interaction between different interfaces in an "Option" usb device

Nicolas Pitre (2):
      [ARM] 4064/1: make pxa_get_cycles() static
      [ARM] 4066/1: correct a comment about PXA's sched_clock range

OGAWA Hirofumi (1):
      x86_64: Fix dump_trace()

Oliver Neukum (1):
      USB: small update to Documentation/usb/acm.txt

Parag Warudkar (1):
      selinux: fix selinux_netlbl_inode_permission() locking

Patrick McHardy (2):
      [NETFILTER]: Fix routing of REJECT target generated packets in output chain
      [NETFILTER]: New connection tracking is not EXPERIMENTAL anymore

Paul Brook (1):
      [ARM] 4074/1: Flat loader stack alignment

Paul Mundt (1):
      Sanely size hash tables when using large base pages

Pete Zaitcev (1):
      USB storage: fix ipod ejecting issue

Phil Dibowitz (1):
      USB Storage: unusual_devs: add supertop drives

Philipp Zabel (2):
      [ARM] 4080/1: Fix for the SSCR0_SlotsPerFrm macro
      [ARM] 4081/1: Add definition for TI Sync Serial Protocol

Philippe De Muyter (1):
      i2c/m41t00: Do not forget to write year

Rafael J. Wysocki (1):
      swsusp: Do not fail if resume device is not set

Rafa³ Bilski (2):
      [CPUFREQ] Longhaul - Fix up powersaver assumptions.
      [CPUFREQ] Longhaul - Always guess FSB

Randy Dunlap (1):
      [CPUFREQ] select consistently

Richard Purdie (3):
      [ARM] 4078/1: Fix ARM copypage cache coherency problems
      backlight: fix backlight_device_register compile failures
      Fix leds-s3c24xx hardware.h reference

Russell King (2):
      [ARM] Fix VFP initialisation issue for SMP systems
      Fix some ARM builds due to HID brokenness

Sarah Bailey (1):
      USB: Fixed bug in endpoint release function.

Segher Boessenkool (1):
      Fix insta-reboot with "i386: Relocatable kernel support"

Thomas Hellstrom (2):
      [AGPGART] Remove unnecessary flushes when inserting and removing pages.
      [AGPGART] Fix PCI-posting flush typo.

Venkatesh Pallipadi (1):
      [CPUFREQ] Bug fix for acpi-cpufreq and cpufreq_stats oops on frequency change notification

Vitaly Wool (2):
      i2c-pnx: Fix interrupt handler, get rid of EARLY config option
      i2c-pnx: Add entry to MAINTAINERS

Vivek Goyal (4):
      i386: Restore CONFIG_PHYSICAL_START option
      i386: fix modpost warning in SMP trampoline code
      i386: fix another modpost warning
      i386: modpost smpboot code warning fix

Yoshimi Ichiyanagi (1):
      KVM: Recover after an arch module load failure

akpm@osdl.org (1):
      [AGPGART] drivers/char/agp/sgi-agp.c: check kmalloc() return value

dean gaudet (1):
      [NET]: ifb double-counts packets


* Re: Linux 2.6.20-rc4
  2007-01-07  6:19 Linux 2.6.20-rc4 Linus Torvalds
@ 2007-01-07 10:56 ` Jan Engelhardt
  2007-01-07 11:44   ` Russell King
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Engelhardt @ 2007-01-07 10:56 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel Mailing List




On Jan 6 2007 22:19, Linus Torvalds wrote:

>Leonard NorrgÃ¥rd (1):
>      sound: hda: detect ALC883 on MSI K9A Platinum motherboards (MS-7280)

Something seems to have mangled the name, that should have
been an å not A¥. (Something reencoded it). A gitlog problem?


	-`J'
-- 


* Re: Linux 2.6.20-rc4
  2007-01-07 10:56 ` Jan Engelhardt
@ 2007-01-07 11:44   ` Russell King
  2007-01-07 13:06     ` OT: character encodings (was: Linux 2.6.20-rc4) Tilman Schmidt
  0 siblings, 1 reply; 39+ messages in thread
From: Russell King @ 2007-01-07 11:44 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Sun, Jan 07, 2007 at 11:56:01AM +0100, Jan Engelhardt wrote:
> On Jan 6 2007 22:19, Linus Torvalds wrote:
> 
> >Leonard NorrgÃ¥rd (1):
> >      sound: hda: detect ALC883 on MSI K9A Platinum motherboards (MS-7280)
> 
> Something seems to have mangled the name, that should have
> been an å not A¥. (Something reencoded it). A gitlog problem?

That is an å if you look at the raw message in UTF-8.  However, Linus
sends mail with a charset of ISO-8859-1, and if you place UTF-8
encoded text in such a message body, you will see A¥.
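
A minimal Python sketch of the effect described above, not part of the original mail. The variable names are illustrative; the codecs are from the standard library. It encodes the name as UTF-8 and then decodes the same bytes as if they were ISO-8859-1.

```python
# The name as intended, with U+00E5 (a with ring above).
name = "Norrg\u00e5rd"

# UTF-8 stores U+00E5 as the two bytes 0xC3 0xA5.
raw = name.encode("utf-8")

# A reader that trusts an ISO-8859-1 label maps each byte to one character.
mislabelled = raw.decode("iso-8859-1")
print(mislabelled)

# The bytes themselves are intact, so reversing the wrong decode recovers the name.
repaired = mislabelled.encode("iso-8859-1").decode("utf-8")
print(repaired == name)
```

The damage here is in the label, not in the bytes, which is why the round trip back through ISO-8859-1 restores the original text.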

Welcome to the mess which the UTF-8 charset creates.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:


* OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-07 11:44   ` Russell King
@ 2007-01-07 13:06     ` Tilman Schmidt
  2007-01-07 15:13       ` David Woodhouse
  0 siblings, 1 reply; 39+ messages in thread
From: Tilman Schmidt @ 2007-01-07 13:06 UTC (permalink / raw)
  To: Russell King; +Cc: Linux Kernel Mailing List

Russell King wrote:
[Leonard NorrgÃ¥rd (1):]
> That is an å if you look at the raw message in UTF-8.  However, Linus
> sends mail with a charset of ISO-8859-1, and if you place UTF-8
> encoded text in such a message body, you will see A¥.

Only if the mechanism used for placing it there ignores the different
encodings.

> Welcome to the mess which the UTF-8 charset creates.

The problem of different character encodings coexisting on the same
platform, and the resulting occasional messing-up, far predates Unicode.
I distinctly remember one case of being bitten by this myself in 1977
when Unicode wasn't even on the horizon yet, and I don't think that was
the first time.

Tilman


* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-07 13:06     ` OT: character encodings (was: Linux 2.6.20-rc4) Tilman Schmidt
@ 2007-01-07 15:13       ` David Woodhouse
  2007-01-07 15:38         ` Russell King
  0 siblings, 1 reply; 39+ messages in thread
From: David Woodhouse @ 2007-01-07 15:13 UTC (permalink / raw)
  To: Tilman Schmidt; +Cc: Russell King, Linux Kernel Mailing List

On Sun, 2007-01-07 at 14:06 +0100, Tilman Schmidt wrote:
> Russell King wrote:
> > Welcome to the mess which the UTF-8 charset creates.

Utter bollocks.

> The problem of different character encodings coexisting on the same
> platform, and the resulting occasional messing-up, far predates Unicode.
> I distinctly remember one case of being bitten by this myself in 1977
> when Unicode wasn't even on the horizon yet, and I don't think that was
> the first time.

Indeed. If you take arbitrary content and send it out to the world
labelled as ISO8859-1, of _course_ you're likely to be corrupting it.

Far from being the cause of the problem, UTF-8 actually offers the
chance of a _solution_. Because once the Luddites catch up, it'll
largely eliminate the need for using the multitude of legacy character
sets and converting between them -- and the problem of mislabelling will
fairly much go away.

-- 
dwmw2


* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-07 15:13       ` David Woodhouse
@ 2007-01-07 15:38         ` Russell King
  2007-01-07 16:29           ` David Woodhouse
  2007-01-07 18:21           ` Alan
  0 siblings, 2 replies; 39+ messages in thread
From: Russell King @ 2007-01-07 15:38 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Tilman Schmidt, Linux Kernel Mailing List

On Sun, Jan 07, 2007 at 11:13:57PM +0800, David Woodhouse wrote:
> On Sun, 2007-01-07 at 14:06 +0100, Tilman Schmidt wrote:
> > Russell King wrote:
> > > Welcome to the mess which the UTF-8 charset creates.
> 
> Utter bollocks.

Wrong.  The problem is partly caused by not everything understanding
multi-byte character encodings, and text files containing absolutely
_no_ information about their character encodings.

When a text file is stored on disk, there's no way to tell what
character set the characters in that file belong to.  As a result,
ISO-8859-1 folk assume that all text files are ISO-8859-1 encoded.
UTF-8 folk assume all text files are UTF-8 encoded.  This leads to
utter confusion.

To see what I mean, try the following:

$ git log | head -n 1000 > o
$ file -i o
o: text/x-c; charset=iso-8859-1

According to that, the charset of the 'git log' output (which on that
test included Leonard's entry) is iso-8859-1, and by that Linus' mailer
was right to include it as ISO-8859-1.

In reality, the output from git log contains an ad-hoc collection of
character sets making its interpretation under any one character set
incorrect.

> > The problem of different character encodings coexisting on the same
> > platform, and the resulting occasional messing-up, far predates Unicode.
> > I distinctly remember one case of being bitten by this myself in 1977
> > when Unicode wasn't even on the horizon yet, and I don't think that was
> > the first time.
> 
> Indeed. If you take arbitrary content and send it out to the world
> labelled as ISO8859-1, of _course_ you're likely to be corrupting it.
> 
> Far from being the cause of the problem, UTF-8 actually offers the
> chance of a _solution_. Because once the Luddites catch up, it'll
> largely eliminate the need for using the multitude of legacy character
> sets and converting between them -- and the problem of mislabelling will
> fairly much go away.

In other words, the UTF-8 luddites require the entire Internet to
upgrade to UTF-8 for UTF-8 to work properly.

I _regularly_ struggle with idiotic programs that assume that the world
is UTF-8 and nothing else.  UTF-8 does _not_ solve these inter-operability
problems - it only makes the entire situation worse by introducing yet
another different charset.  (Yes, it's also true that there are programs
which assume the world is only another, different, character set.)

Rather than having these problems fixed properly (by looking at the LANG
environment variable) many of these programs now assume that the world
is UTF-8.  It isn't.
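
A minimal Python sketch of what "looking at the environment" could mean for a program, not part of the original mail. It assumes a POSIX system where `locale.nl_langinfo` is available; the function name `locale_charset` is hypothetical.

```python
import locale


def locale_charset() -> str:
    """Return the charset of the user's LC_CTYPE locale (from LC_ALL, LC_CTYPE or LANG)."""
    try:
        # An empty string means "use whatever the environment variables say".
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # The environment names a locale this system does not have;
        # keep the locale that was already active.
        pass
    return locale.nl_langinfo(locale.CODESET)


print(locale_charset())
```

A program following this approach would use the returned name when decoding terminal input and encoding output, instead of hard-coding UTF-8 or ISO-8859-1.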

elinks is one such program.  It now assumes UTF-8 _only_ displays.
That's no better than programs which assume ISO-8859-1 only or US-ASCII
only.

So, in short, UTF-8 is all fine and dandy if your _entire_ universe
is UTF-8 enabled.  If you're operating in a mixed charset environment
it's one bloody big pain in the butt.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:


* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-07 15:38         ` Russell King
@ 2007-01-07 16:29           ` David Woodhouse
  2007-01-07 17:06             ` Russell King
  2007-01-07 18:21           ` Alan
  1 sibling, 1 reply; 39+ messages in thread
From: David Woodhouse @ 2007-01-07 16:29 UTC (permalink / raw)
  To: Russell King; +Cc: Tilman Schmidt, Linux Kernel Mailing List

On Sun, 2007-01-07 at 15:38 +0000, Russell King wrote:
> On Sun, Jan 07, 2007 at 11:13:57PM +0800, David Woodhouse wrote:
> > On Sun, 2007-01-07 at 14:06 +0100, Tilman Schmidt wrote:
> > > Russell King wrote:
> > > > Welcome to the mess which the UTF-8 charset creates.
> > 
> > Utter bollocks.
> 
> Wrong.  The problem is partly caused by not everything understanding
> multi-byte character encodings, 

No, that's a different problem; not the one you were referring to above.
And it's a problem which is rapidly diminishing, too.

> and text files containing absolutely
> _no_ information about their character encodings.

That's a real problem, yes -- but it was a problem long before UTF-8 was
added to the collection of character sets in use. Even within the UK, we
had to choose between ISO8859-1 and ISO8859-15.

> When a text file is stored on disk, there's no way to tell what
> character set the characters in that file belong to.  As a result,
> ISO-8859-1 folk assume that all text files are ISO-8859-1 encoded.
> UTF-8 folk assume all text files are UTF-8 encoded.  This leads to
> utter confusion.

Only if you are making different assumptions about the _same_ set of
files, on the _same_ system. But that would be silly.

If I suddenly "assume" that my laptop has a Dvorak keyboard layout
despite that blatantly not being true, I'll get the same kind of
confusion. That isn't Dvorak's fault, either.

If, on the other hand, I have one system which is entirely ISO8859-1 and
a separate system which is entirely UTF-8, each of those is _fine_ and
unconfusing. Obviously I have to make sure files are properly labelled
and converted in transport between different systems -- but that's
nothing new.

> To see what I mean, try the following:
> 
> $ git log | head -n 1000 > o
> $ file -i o
> o: text/x-c; charset=iso-8859-1
> 
> According to that, the charset of the 'git log' output (which on that
> test included Leonard's entry) is iso-8859-1, and by that Linus' mailer
> was right to include it as ISO-8859-1.

Yes. When you stored it on disk, the character set information was lost.
If you were running a mixed-charset system then attempting to recreate
the lost information with heuristics and assumptions is obviously going
to be problematic.

Actually, because UTF-8 allows me to run a system which is purely based
on a single character set, I get better results when I try the same
trick:
	shinybook /shiny/git/mtd-2.6 $ git log | head -n 1000 > o
	shinybook /shiny/git/mtd-2.6 $ file -i o
	o: text/plain; charset=utf-8

Again, the problem of labelling isn't at all new to UTF-8. The only
thing that's new with UTF-8 is that it's now actually _practical_ to
have a system which only uses one character set throughout, and which
thus _can_ get its 'guess' right when you don't bother to label
everything.

> In reality, the output from git log contains an ad-hoc collection of
> character sets making its interpretation under any one character set
> incorrect.

No, the contents of the git log ought to be UTF-8, unless people have
been misusing it. Git stores its text in UTF-8 (by default), and is
capable of converting to and from legacy character sets on input
(git-commit) and output (git-log).

(Obviously, that's likely to be lossy if you convert it to any given
legacy character set, because ∀ legacy character set, ∃ characters
within UTF-8 that aren't in that legacy character set.)
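
A minimal Python sketch of that lossiness, not part of the original mail. The sample name is taken from the shortlog in this thread, and the variable names are illustrative. It tries to convert a string containing U+0142 (l with stroke) to ISO-8859-1.

```python
name = "Rafa\u0142 Bilski"

# A strict conversion refuses, because ISO-8859-1 has no l-with-stroke.
try:
    name.encode("iso-8859-1")
    lossless = True
except UnicodeEncodeError:
    lossless = False

# A forgiving conversion substitutes a placeholder and loses the character.
lossy = name.encode("iso-8859-1", errors="replace")

print(lossless)
print(lossy)
```

The same string converts cleanly to ISO-8859-2, which is the sense in which the loss depends on the particular legacy charset chosen.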

> > Far from being the cause of the problem, UTF-8 actually offers the
> > chance of a _solution_. Because once the Luddites catch up, it'll
> > largely eliminate the need for using the multitude of legacy character
> > sets and converting between them -- and the problem of mislabelling will
> > fairly much go away.
> 
> In other words, the UTF-8 luddites require the entire Internet to
> upgrade to UTF-8 for UTF-8 to work properly.

Not at all. The problems arise when character set information is lost,
which can happen at any point during the flow of information.

Anything we can do to reduce the likelihood of charset information being
lost is an overall improvement. We already demonstrated an example
(git-log > o; file -i o) of a case where a _consistent_ system gets it
right, while an inconsistent system introduces an error.

If any individual system processes all text in a single character set,
then that system is no longer a likely source of corruption due to
labelling errors. And because UTF-8 fully covers the set of characters
which can be represented in the legacy character sets, it allows us to
deploy systems which do just that.

> I _regularly_ struggle with idiotic programs that assume that the world
> is UTF-8 and nothing else. 

I don't think I've encountered such a program in my distribution of
choice. If I had, I would have filed a bug. Making assumptions about
character sets, outside of the locally-controlled environment, is
invalid. That's been true since the first 8-bit character sets, if not
longer.

> So, in short, UTF-8 is all fine and dandy if your _entire_ universe
> is UTF-8 enabled.  If you're operating in a mixed charset environment
> it's one bloody big pain in the butt.

A mixed charset environment was _already_ a pain in the butt, because
almost nobody got labelling right. It's wrong to blame that on UTF-8.

-- 
dwmw2


* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-07 16:29           ` David Woodhouse
@ 2007-01-07 17:06             ` Russell King
  2007-01-07 19:11               ` Jan Engelhardt
  0 siblings, 1 reply; 39+ messages in thread
From: Russell King @ 2007-01-07 17:06 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Tilman Schmidt, Linux Kernel Mailing List

On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote:
> On Sun, 2007-01-07 at 15:38 +0000, Russell King wrote:
> > When a text file is stored on disk, there's no way to tell what
> > character set the characters in that file belong to.  As a result,
> > ISO-8859-1 folk assume that all text files are ISO-8859-1 encoded.
> > UTF-8 folk assume all text files are UTF-8 encoded.  This leads to
> > utter confusion.
> 
> Only if you are making different assumptions about the _same_ set of
> files, on the _same_ system. But that would be silly.

$ git log | head -n 1000 | tail -n 200 > o
$ file -i o
o: text/plain; charset=us-ascii
$ git log | head -n 1000 | tail -n 300 > o
$ file -i o
o: text/plain; charset=us-ascii
$ git log | head -n 1000 | tail -n 400 > o
$ file -i o
o: text/plain; charset=utf-8

(and you know what charset the file is thought to have with all 1000
lines in it.)

All on a system with LANG set to en_GB (iow ISO-8859-1).
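
A minimal Python sketch of why the guess changes with the slice, not part of the original mail. The function `guess_charset` is a hypothetical stand-in for a content-based detector and is not claimed to match what file(1) does internally; the sample bytes are assumptions modelled on the names in the shortlog.

```python
def guess_charset(data: bytes) -> str:
    """Guess a charset from content alone: the first candidate that decodes wins."""
    for name in ("us-ascii", "utf-8"):
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            pass
    # ISO-8859-1 assigns a character to every byte, so it never fails
    # and ends up as the answer for anything that is not valid UTF-8.
    return "iso-8859-1"


log = [
    b"Author: Adrian Bunk",
    b"Author: Leonard Norrg\xc3\xa5rd",  # UTF-8 encoded name
    b"Author: Rafa\xb3 Bilski",          # ISO-8859-2 encoded name
]

for count in (1, 2, 3):
    print(count, guess_charset(b"\n".join(log[:count])))
```

The answer depends only on which bytes happen to fall inside the slice, so the same log can be reported as three different charsets.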

> > To see what I mean, try the following:
> > 
> > $ git log | head -n 1000 > o
> > $ file -i o
> > o: text/x-c; charset=iso-8859-1
> > 
> > According to that, the charset of the 'git log' output (which on that
> > test included Leonard's entry) is iso-8859-1, and by that Linus' mailer
> > was right to include it as ISO-8859-1.
> 
> Yes. When you stored it on disk, the character set information was lost.

The same thing actually happens when I look at it via:

  $ git log | head -n 1000 | less

but in this case the output is always interpreted by the terminal to be
in its character set.

> If you were running a mixed-charset system then attempting to recreate
> the lost information with heuristics and assumptions is obviously going
> to be problematic.

I'm not - I'm running a pure ISO-8859-1 system:

$ echo $LANG
en_GB
$ locale -k LC_CTYPE | grep charmap
charmap="ISO-8859-1"

> Actually, because UTF-8 allows me to run a system which is purely based
> on a single character set, I get better results when I try the same
> trick:
> 	shinybook /shiny/git/mtd-2.6 $ git log | head -n 1000 > o
> 	shinybook /shiny/git/mtd-2.6 $ file -i o
> 	o: text/plain; charset=utf-8

$ LANG=en_GB.UTF-8 locale -k LC_CTYPE | grep charmap
charmap="UTF-8"
$ LANG=en_GB.UTF-8 git log | head -n 1000 > o
$ LANG=en_GB.UTF-8 file -i o
o: text/x-c; charset=iso-8859-1
$ git version
git version 1.4.4.2

Looks like the output is iso-8859-1 even with UTF-8!

> > In reality, the output from git log contains an ad-hoc collection of
> > character sets making its interpretation under any one character set
> > incorrect.
> 
> No, the contents of the git log ought to be UTF-8, unless people have
> been misusing it. Git stores its text in UTF-8 (by default), and is
> capable of converting to and from legacy character sets on input
> (git-commit) and output (git-log).

Git may store its text internally in UTF-8 (I don't know but I have no
evidence to suggest it does - in fact I have some evidence in this test
that it doesn't care about charsets.)  git log output on a non-UTF-8
system certainly is not in the host's character set.  For example:

$ LANG=en_GB.UTF-8 git log | head -n 1000 > o
$ LANG=en_GB git log | head -n 1000 > o2
$ diff -u o o2

That includes the UTF-8 encoded part of Leonard's name.  It also includes
Rafa? Bilski's name which is non-UTF-8 encoded.

So, in both cases, exactly the same output bytestream was created
independent of the character set _actually_ being used, which both
includes untranslated UTF-8 and non-UTF-8 sequences.

There is obviously no character set translation going on with the output.
So we can add 'git' to my list of charset-broken programs.
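
A minimal sketch of the manual inspection described above, in Python. It
is not part of the original test: the function names and the sample
bytes are made up for illustration. It classifies each line of a byte
stream as pure ASCII, valid UTF-8, or neither, which is enough to show
that one stream can carry both kinds of sequence.

```python
def classify_line(raw: bytes) -> str:
    """Return 'ascii', 'utf-8' or 'other' for one line of raw bytes."""
    try:
        raw.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: some 8-bit charset, but we cannot tell which.
        return "other"


def summarize(stream: bytes) -> dict:
    """Count how many lines fall into each class."""
    counts = {"ascii": 0, "utf-8": 0, "other": 0}
    for line in stream.split(b"\n"):
        counts[classify_line(line)] += 1
    return counts


# Hypothetical sample: the same name committed from three different setups.
sample = (
    b"Author: Plain Ascii <a@example.org>\n"
    + "Author: Andr\u00e9 <b@example.org>\n".encode("utf-8")
    + "Author: Andr\u00e9 <c@example.org>".encode("iso-8859-1")
)
print(summarize(sample))
```

A line landing in "other" only says it is not UTF-8; which legacy
charset it was written in cannot be recovered from the bytes alone.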

Also, since we have recent data in the git repository which is non-UTF-8
as well, it is clear that there is no character set translation going on
at input time either.

Looking at the git-commit script, there appears to be no character set
conversion going on in there either.

So, I think you'll find that the contents of git _is_ an ad-hoc collection
of character sets which people happen to have in use on their machines.

> > So, in short, UTF-8 is all fine and dandy if your _entire_ universe
> > is UTF-8 enabled.  If you're operating in a mixed charset environment
> > it's one bloody big pain in the butt.
> 
> A mixed charset environment was _already_ a pain in the butt, because
> almost nobody got labelling right. It's wrong to blame that on UTF-8.

I'm not talking about a mixed charset environment.  I'm talking about
non-UTF-8 single charset environments being broken by programs which
universally think the universe is UTF-8 only.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:


* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-07 17:06             ` Russell King
@ 2007-01-07 19:11               ` Jan Engelhardt
  2007-01-07 19:20                 ` Russell King
  2007-01-07 20:48                 ` Willy Tarreau
  0 siblings, 2 replies; 39+ messages in thread
From: Jan Engelhardt @ 2007-01-07 19:11 UTC (permalink / raw)
  To: Russell King; +Cc: David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List

On Jan 7 2007 17:06, Russell King wrote:
>On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote:
>
>$ git log | head -n 1000 | tail -n 200 > o
>$ file -i o
>o: text/plain; charset=us-ascii
>$ git log | head -n 1000 | tail -n 300 > o
>$ file -i o
>o: text/plain; charset=us-ascii
>$ git log | head -n 1000 | tail -n 400 > o
>$ file -i o
>o: text/plain; charset=utf-8

I am inclined to say that "file" does not count, because it tries to guess an
ambiguous mapping from bytes to character set. Even more, file should be
_unable at all_ to distinguish an iso-8859-1 from an iso-8859-2 (or worse: 15)
file. This program is soo... forget it, it's not an argument. It works well for
headerful files, but text files don't really contain one. The next best thing
would be html, with a proper <meta http-equiv=Content> tag.
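
To illustrate why the bytes alone cannot settle this, here is a small
Python sketch (the three sample bytes are chosen for illustration, not
taken from the git log). Every byte string is valid in all of the
ISO-8859 variants, so the same bytes decode without error under each of
them and only the resulting characters differ.

```python
# Three bytes above 0x7F; valid in every ISO-8859 variant.
raw = bytes([0xA4, 0xB3, 0xE9])

decoded = {}
for charset in ("iso-8859-1", "iso-8859-2", "iso-8859-15"):
    # None of these decodes can fail: each charset maps these bytes.
    decoded[charset] = raw.decode(charset)
    print(charset, [hex(ord(ch)) for ch in decoded[charset]])

# 0xA4 is the currency sign in -1 and -2 but the euro sign in -15;
# 0xB3 is superscript three in -1 and -15 but l-with-stroke in -2;
# 0xE9 is e-acute in all three.
```

A guesser can therefore only rank the candidates by plausibility, for
example with letter-frequency statistics. It cannot prove which one was
meant.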

	-`J'
-- 


* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-07 19:11               ` Jan Engelhardt
@ 2007-01-07 19:20                 ` Russell King
  2007-01-07 20:48                 ` Willy Tarreau
  1 sibling, 0 replies; 39+ messages in thread
From: Russell King @ 2007-01-07 19:20 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List

On Sun, Jan 07, 2007 at 08:11:38PM +0100, Jan Engelhardt wrote:
> 
> On Jan 7 2007 17:06, Russell King wrote:
> >On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote:
> >
> >$ git log | head -n 1000 | tail -n 200 > o
> >$ file -i o
> >o: text/plain; charset=us-ascii
> >$ git log | head -n 1000 | tail -n 300 > o
> >$ file -i o
> >o: text/plain; charset=us-ascii
> >$ git log | head -n 1000 | tail -n 400 > o
> >$ file -i o
> >o: text/plain; charset=utf-8
> 
> I am inclined to say that "file" does not count, because it tries to guess an
> ambiguous mapping from bytes to character set. Even more, file should be
> _unable at all_ to distinguish an iso-8859-1 from an iso-8859-2 (or worse: 15)
> file. This program is soo... forget it, it's not an argument. It works well for
> headerful files, but text files don't really contain one. The next best thing
> would be html, with a proper <meta http-equiv=Content> tag.

You're discarding a perfectly reasonable argument - file itself obviously
is not good at guessing the charset, but inspecting the resulting file
manually and identifying *both* ISO-8859 and UTF-8 character sequences
in there is pretty conclusive.  As I did indeed do prior to sending
that message.

In this case, 'file' was doing a remarkably accurate job.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:


* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-07 19:11               ` Jan Engelhardt
  2007-01-07 19:20                 ` Russell King
@ 2007-01-07 20:48                 ` Willy Tarreau
  2007-01-07 23:37                   ` Adrian Bunk
  1 sibling, 1 reply; 39+ messages in thread
From: Willy Tarreau @ 2007-01-07 20:48 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Russell King, David Woodhouse, Tilman Schmidt,
	Linux Kernel Mailing List

On Sun, Jan 07, 2007 at 08:11:38PM +0100, Jan Engelhardt wrote:
> 
> On Jan 7 2007 17:06, Russell King wrote:
> >On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote:
> >
> >$ git log | head -n 1000 | tail -n 200 > o
> >$ file -i o
> >o: text/plain; charset=us-ascii
> >$ git log | head -n 1000 | tail -n 300 > o
> >$ file -i o
> >o: text/plain; charset=us-ascii
> >$ git log | head -n 1000 | tail -n 400 > o
> >$ file -i o
> >o: text/plain; charset=utf-8
> 
> I am inclined to say that "file" does not count, because it tries to guess an
> ambiguous mapping from bytes to character set. Even more, file should be
> _unable at all_ to distinguish an iso-8859-1 from an iso-8859-2 (or worse: 15)
> file. This program is soo... forget it, it's not an argument. It works well for
> headerful files, but text files don't really contain one. The next best thing
> would be html, with a proper <meta http-equiv=Content> tag.

The stupidity from the start with those character sets is that they
assume that a whole file is written in a single set. In fact, the
charset should apply to the characters themselves. At least the
quoted-printable encoding, though not human-friendly, was the least stupid.

Now that UTF8 comes everywhere, everyone receives tons of mangled mails,
and even mailers which correctly support UTF8 and use it by default manage
to shoot themselves in the foot when they reply to, or forward a mail. The
system is completely broken because limited by design, and we have to learn
to live with this brokenness.

Willy



* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-07 20:48                 ` Willy Tarreau
@ 2007-01-07 23:37                   ` Adrian Bunk
  2007-01-08  0:38                     ` Willy Tarreau
  0 siblings, 1 reply; 39+ messages in thread
From: Adrian Bunk @ 2007-01-07 23:37 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Jan Engelhardt, Russell King, David Woodhouse, Tilman Schmidt,
	Linux Kernel Mailing List

On Sun, Jan 07, 2007 at 09:48:34PM +0100, Willy Tarreau wrote:
> On Sun, Jan 07, 2007 at 08:11:38PM +0100, Jan Engelhardt wrote:
> > 
> > On Jan 7 2007 17:06, Russell King wrote:
> > >On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote:
> > >
> > >$ git log | head -n 1000 | tail -n 200 > o
> > >$ file -i o
> > >o: text/plain; charset=us-ascii
> > >$ git log | head -n 1000 | tail -n 300 > o
> > >$ file -i o
> > >o: text/plain; charset=us-ascii
> > >$ git log | head -n 1000 | tail -n 400 > o
> > >$ file -i o
> > >o: text/plain; charset=utf-8
> > 
> > I am inclined to say that "file" does not count, because it tries to guess an
> > ambiguous mapping from bytes to character set. Even more, file should be
> > _unable at all_ to distinguish an iso-8859-1 from an iso-8859-2 (or worse: 15)
> > file. This program is soo... forget it, it's not an argument. It works well for
> > headerful files, but text files don't really contain one. The next best thing
> > would be html, with a proper <meta http-equiv=Content> tag.
> 
> The stupidity from the start up with those character sets is that they
> consider that a whole file is written with a given set. In fact, the
> charset should apply to characters themselves. At least, the
> quoted-printable, non-human friendly, encoding was the least stupid.

I doubt doing this would really be worth the effort.

In the 21st century, people should simply use UTF-8.

> Now that UTF8 comes everywhere, everyone receives tons of mangled mails,
> and even mailers which correctly support UTF8 and use it by default manage
> to shoot themselves in the foot when they reply to, or forward a mail. The
> system is completely broken because limited by design, and we have to learn
> to live with this brokenness.

Only if MUAs have broken charset support or don't set a correct 
"charset" header in the mails they are sending.

If some software still can't handle UTF-8 correctly more than 10 years 
after it was introduced, that's not a brokenness you can blame on UTF-8.

> Willy

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed



* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-07 23:37                   ` Adrian Bunk
@ 2007-01-08  0:38                     ` Willy Tarreau
  2007-01-08  1:03                       ` Adrian Bunk
  2007-01-08 19:53                       ` Valdis.Kletnieks
  0 siblings, 2 replies; 39+ messages in thread
From: Willy Tarreau @ 2007-01-08  0:38 UTC (permalink / raw)
  To: Adrian Bunk
  Cc: Jan Engelhardt, Russell King, David Woodhouse, Tilman Schmidt,
	Linux Kernel Mailing List

On Mon, Jan 08, 2007 at 12:37:50AM +0100, Adrian Bunk wrote:
> On Sun, Jan 07, 2007 at 09:48:34PM +0100, Willy Tarreau wrote:
> > On Sun, Jan 07, 2007 at 08:11:38PM +0100, Jan Engelhardt wrote:
> > > 
> > > On Jan 7 2007 17:06, Russell King wrote:
> > > >On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote:
> > > >
> > > >$ git log | head -n 1000 | tail -n 200 > o
> > > >$ file -i o
> > > >o: text/plain; charset=us-ascii
> > > >$ git log | head -n 1000 | tail -n 300 > o
> > > >$ file -i o
> > > >o: text/plain; charset=us-ascii
> > > >$ git log | head -n 1000 | tail -n 400 > o
> > > >$ file -i o
> > > >o: text/plain; charset=utf-8
> > > 
> > > I am inclined to say that "file" does not count, because it tries to guess an
> > > ambiguous mapping from bytes to character set. Even more, file should be
> > > _unable at all_ to distinguish an iso-8859-1 from an iso-8859-2 (or worse: 15)
> > > file. This program is soo... forget it, it's not an argument. It works well for
> > > headerful files, but text files don't really contain one. The next best thing
> > > would be html, with a proper <meta http-equiv=Content> tag.
> > 
> > The stupidity from the start up with those character sets is that they
> > consider that a whole file is written with a given set. In fact, the
> > charset should apply to characters themselves. At least, the
> > quoted-printable, non-human friendly, encoding was the least stupid.
> 
> I doubt doing this would really be worth the effort.
> 
> In the 21st century, people should simply use UTF-8.
> 
> > Now that UTF8 comes everywhere, everyone receives tons of mangled mails,
> > and even mailers which correctly support UTF8 and use it by default manage
> > to shoot themselves in the foot when they reply to, or forward a mail. The
> > system is completely broken because limited by design, and we have to learn
> > to live with this brokenness.
> 
> Only if MUAs have broken charset support or don't set a correct 
> "charset" header in the mails they are sending.
> 
> If some software still can't handle UTF-8 correctly more than 10 years 
> after it was introduced, that's not a brokenness you can blame on UTF-8.

I'm not blaming UTF-8 per se, but people who still believe in encoding
*whole documents*. Copy-paste, text insertion, git output, etc... everything
has a good reason not to be in the same encoding as what your MUA believes.
If major MUAs still have problems with UTF-8 10 years after it was introduced,
it's clearly the proof of a flaw in the initial design. And I'm not even
discussing the stupidity which requires that you read a whole text to get
its number of characters !
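
For reference, a short Python sketch of the scan being complained about
(the helper name and sample string are illustrative only). In UTF-8 the
byte length and the character count differ, and the count is obtained by
walking the bytes and skipping continuation bytes, which all have the
bit pattern 10xxxxxx. The sketch assumes the input is valid UTF-8.

```python
def utf8_length(raw: bytes) -> int:
    """Count code points in valid UTF-8 by skipping continuation bytes."""
    # Continuation bytes match 10xxxxxx; every other byte starts a character.
    return sum(1 for b in raw if (b & 0xC0) != 0x80)


text = "Rafa\u0142 \u20ac5"      # l-with-stroke is 2 bytes, the euro sign is 3
raw = text.encode("utf-8")

print(len(raw))                  # number of bytes
print(utf8_length(raw))          # number of code points
```

The scan is linear in the byte length. A fixed-width 8-bit charset gets
the count for free, at the price of a repertoire of at most 256
characters.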

Willy



* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-08  0:38                     ` Willy Tarreau
@ 2007-01-08  1:03                       ` Adrian Bunk
  2007-01-08  1:14                         ` Willy Tarreau
  2007-01-08  6:52                         ` Jan Engelhardt
  2007-01-08 19:53                       ` Valdis.Kletnieks
  1 sibling, 2 replies; 39+ messages in thread
From: Adrian Bunk @ 2007-01-08  1:03 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Jan Engelhardt, Russell King, David Woodhouse, Tilman Schmidt,
	Linux Kernel Mailing List

On Mon, Jan 08, 2007 at 01:38:57AM +0100, Willy Tarreau wrote:
> On Mon, Jan 08, 2007 at 12:37:50AM +0100, Adrian Bunk wrote:
> > On Sun, Jan 07, 2007 at 09:48:34PM +0100, Willy Tarreau wrote:
> > > On Sun, Jan 07, 2007 at 08:11:38PM +0100, Jan Engelhardt wrote:
> > > > 
> > > > On Jan 7 2007 17:06, Russell King wrote:
> > > > >On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote:
> > > > >
> > > > >$ git log | head -n 1000 | tail -n 200 > o
> > > > >$ file -i o
> > > > >o: text/plain; charset=us-ascii
> > > > >$ git log | head -n 1000 | tail -n 300 > o
> > > > >$ file -i o
> > > > >o: text/plain; charset=us-ascii
> > > > >$ git log | head -n 1000 | tail -n 400 > o
> > > > >$ file -i o
> > > > >o: text/plain; charset=utf-8
> > > > 
> > > > I am inclined to say that "file" does not count, because it tries to guess an
> > > > ambiguous mapping from bytes to character set. Even more, file should be
> > > > _unable at all_ to distinguish an iso-8859-1 from an iso-8859-2 (or worse: 15)
> > > > file. This program is soo... forget it, it's not an argument. It works well for
> > > > headerful files, but text files don't really contain one. The next best thing
> > > > would be html, with a proper <meta http-equiv=Content> tag.
> > > 
> > > The stupidity from the start up with those character sets is that they
> > > consider that a whole file is written with a given set. In fact, the
> > > charset should apply to characters themselves. At least, the
> > > quoted-printable, non-human friendly, encoding was the least stupid.
> > 
> > I doubt doing this would really be worth the effort.
> > 
> > In the 21st century, people should simply use UTF-8.
> > 
> > > Now that UTF8 comes everywhere, everyone receives tons of mangled mails,
> > > and even mailers which correctly support UTF8 and use it by default manage
> > > to shoot themselves in the foot when they reply to, or forward a mail. The
> > > system is completely broken because limited by design, and we have to learn
> > > to live with this brokenness.
> > 
> > Only if MUAs have broken charset support or don't set a correct 
> > "charset" header in the mails they are sending.
> > 
> > If some software still can't handle UTF-8 correctly more than 10 years 
> > after it was introduced, that's not a brokenness you can blame on UTF-8.
> 
> I'm not blaming UTF-8 per se, but people who still believe in encoding
> *whole documents*. Copy-paste, text insertion, git output, etc... everything
> has a good reason not to be in the same encoding as what your MUA believes.

How would you do this technically in a way that it's significantly 
easier than simply finishing the UTF-8 transition?

> If major MUAs still have problems with UTF-8 10 years after it was introduced,
> it's clearly the proof of a flaw in the initial design. And I'm not even
> discussing the stupidity which requires that you read a whole text to get
> its number of characters !

The only major MUA not supporting UTF-8 is Eudora.

And if you are talking about buggy old pine, in the latest development 
version [1] it does not only become open source, it also got some 
working Unicode support.

> Willy

cu
Adrian

[1] Alpine

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed



* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-08  1:03                       ` Adrian Bunk
@ 2007-01-08  1:14                         ` Willy Tarreau
  2007-01-08  1:45                           ` Adrian Bunk
  2007-01-08  6:52                         ` Jan Engelhardt
  1 sibling, 1 reply; 39+ messages in thread
From: Willy Tarreau @ 2007-01-08  1:14 UTC (permalink / raw)
  To: Adrian Bunk
  Cc: Jan Engelhardt, Russell King, David Woodhouse, Tilman Schmidt,
	Linux Kernel Mailing List

On Mon, Jan 08, 2007 at 02:03:37AM +0100, Adrian Bunk wrote:
> On Mon, Jan 08, 2007 at 01:38:57AM +0100, Willy Tarreau wrote:
> > On Mon, Jan 08, 2007 at 12:37:50AM +0100, Adrian Bunk wrote:
> > > On Sun, Jan 07, 2007 at 09:48:34PM +0100, Willy Tarreau wrote:
> > > > On Sun, Jan 07, 2007 at 08:11:38PM +0100, Jan Engelhardt wrote:
> > > > > 
> > > > > On Jan 7 2007 17:06, Russell King wrote:
> > > > > >On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote:
> > > > > >
> > > > > >$ git log | head -n 1000 | tail -n 200 > o
> > > > > >$ file -i o
> > > > > >o: text/plain; charset=us-ascii
> > > > > >$ git log | head -n 1000 | tail -n 300 > o
> > > > > >$ file -i o
> > > > > >o: text/plain; charset=us-ascii
> > > > > >$ git log | head -n 1000 | tail -n 400 > o
> > > > > >$ file -i o
> > > > > >o: text/plain; charset=utf-8
> > > > > 
> > > > > I am inclined to say that "file" does not count, because it tries to guess an
> > > > > ambiguous mapping from bytes to character set. Even more, file should be
> > > > > _unable at all_ to distinguish an iso-8859-1 from an iso-8859-2 (or worse: 15)
> > > > > file. This program is soo... forget it, it's not an argument. It works well for
> > > > > headerful files, but text files don't really contain one. The next best thing
> > > > > would be html, with a proper <meta http-equiv=Content> tag.
> > > > 
> > > > The stupidity from the start up with those character sets is that they
> > > > consider that a whole file is written with a given set. In fact, the
> > > > charset should apply to characters themselves. At least, the
> > > > quoted-printable, non-human friendly, encoding was the least stupid.
> > > 
> > > I doubt doing this would really be worth the effort.
> > > 
> > > In the 21st century, people should simply use UTF-8.
> > > 
> > > > Now that UTF8 comes everywhere, everyone receives tons of mangled mails,
> > > > and even mailers which correctly support UTF8 and use it by default manage
> > > > to shoot themselves in the foot when they reply to, or forward a mail. The
> > > > system is completely broken because limited by design, and we have to learn
> > > > to live with this brokenness.
> > > 
> > > Only if MUAs have broken charset support or don't set a correct 
> > > "charset" header in the mails they are sending.
> > > 
> > > If some software still can't handle UTF-8 correctly more than 10 years 
> > > after it was introduced, that's not a brokenness you can blame on UTF-8.
> > 
> > I'm not blaming UTF-8 per se, but people who still believe in encoding
> > *whole documents*. Copy-paste, text insertion, git output, etc... everything
> > has a good reason not to be in the same encoding as what your MUA believes.
> 
> How would you do this technically in a way that it's significantly 
> easier than simply finishing the UTF-8 transition?

In how many decades do you think the transition will be finished ?

> > If major MUAs still have problems with UTF-8 10 years after it was introduced,
> > it's clearly the proof of a flaw in the initial design. And I'm not even
> > discussing the stupidity which requires that you read a whole text to get
> > its number of characters !
> 
> The only major MUA not supporting UTF-8 is Eudora.
> 
> And if you are talking about buggy old pine, in the latest development 
> version [1] it does not only become open source, it also got some 
> working Unicode support.

No, I'm not speaking about "not supporting", but "having problems". Every
one of us has already received mails from Thunderbird, Outlook, Notes, etc...
with erroneously encoded characters because of this :

  - a UTF8 MUA sends a mail to a non-UTF8 aware one.

  - this last one only sees double chars. When it wants to forward the mail
    to someone else, it keeps the chars verbatim, and sets the encoding type
    to its own, something like iso8859-1 for instance.

  - the final MUA, which is UTF8-aware, is very happy to detect lots of UTF8
    combinations in the forwarded mail and decides that everything in it is
    UTF8, then you get lots of chars mangled in the mail, in the middle of
    UTF8 combinations. Then, this crappy mail can be forwarded as long as
    you want between UTF8 MUAs, they will all apply heuristics and do the
    wrong thing : consider the *whole* document with *one* type.
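
The three steps above can be sketched in Python as follows. The message
text is invented for illustration and no real MUA is being modelled;
the point is only that neither single-charset reading of the forwarded
body recovers both parts.

```python
# Step 1: a UTF-8 MUA sends some text.
original = "caf\u00e9".encode("utf-8")            # e-acute becomes 2 bytes

# Step 2: a latin-1 MUA forwards it, keeping those bytes verbatim and
# adding its own comment encoded in iso-8859-1.
comment = "na\u00efve".encode("iso-8859-1")       # i-diaeresis is 1 byte
forwarded = original + b" / " + comment

# Reading the whole body with the declared charset mangles the UTF-8 part:
as_latin1 = forwarded.decode("iso-8859-1")

# Step 3: guessing UTF-8 for the whole body mangles the latin-1 part:
as_utf8 = forwarded.decode("utf-8", errors="replace")

print(repr(as_latin1))
print(repr(as_utf8))
```

In the first reading the two UTF-8 bytes show up as two separate
latin-1 characters. In the second, the lone latin-1 byte is not a valid
UTF-8 sequence and is replaced by U+FFFD.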

What I find even funnier is when, for no apparent reason, the same MUA is used
on both ends and the contents get mangled because the sender copies a portion
of text from somewhere else.

Anyway, I don't want to follow up on this thread, it's *highly* off-topic here.

Cheers,
Willy



* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-08  1:14                         ` Willy Tarreau
@ 2007-01-08  1:45                           ` Adrian Bunk
  0 siblings, 0 replies; 39+ messages in thread
From: Adrian Bunk @ 2007-01-08  1:45 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Jan Engelhardt, Russell King, David Woodhouse, Tilman Schmidt,
	Linux Kernel Mailing List

On Mon, Jan 08, 2007 at 02:14:41AM +0100, Willy Tarreau wrote:
> On Mon, Jan 08, 2007 at 02:03:37AM +0100, Adrian Bunk wrote:
> > On Mon, Jan 08, 2007 at 01:38:57AM +0100, Willy Tarreau wrote:
> > > On Mon, Jan 08, 2007 at 12:37:50AM +0100, Adrian Bunk wrote:
> > > > On Sun, Jan 07, 2007 at 09:48:34PM +0100, Willy Tarreau wrote:
> > > > > On Sun, Jan 07, 2007 at 08:11:38PM +0100, Jan Engelhardt wrote:
> > > > > > 
> > > > > > On Jan 7 2007 17:06, Russell King wrote:
> > > > > > >On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote:
> > > > > > >
> > > > > > >$ git log | head -n 1000 | tail -n 200 > o
> > > > > > >$ file -i o
> > > > > > >o: text/plain; charset=us-ascii
> > > > > > >$ git log | head -n 1000 | tail -n 300 > o
> > > > > > >$ file -i o
> > > > > > >o: text/plain; charset=us-ascii
> > > > > > >$ git log | head -n 1000 | tail -n 400 > o
> > > > > > >$ file -i o
> > > > > > >o: text/plain; charset=utf-8
> > > > > > 
> > > > > > I am inclined to say that "file" does not count, because it tries to guess an
> > > > > > ambiguous mapping from bytes to character set. Even more, file should be
> > > > > > _unable at all_ to distinguish an iso-8859-1 from an iso-8859-2 (or worse: 15)
> > > > > > file. This program is soo... forget it, it's not an argument. It works well for
> > > > > > headerful files, but text files don't really contain one. The next best thing
> > > > > > would be html, with a proper <meta http-equiv=Content> tag.
> > > > > 
> > > > > The stupidity from the start up with those character sets is that they
> > > > > consider that a whole file is written with a given set. In fact, the
> > > > > charset should apply to characters themselves. At least, the
> > > > > quoted-printable, non-human friendly, encoding was the least stupid.
> > > > 
> > > > I doubt doing this would really be worth the effort.
> > > > 
> > > > In the 21st century, people should simply use UTF-8.
> > > > 
> > > > > Now that UTF8 comes everywhere, everyone receives tons of mangled mails,
> > > > > and even mailers which correctly support UTF8 and use it by default manage
> > > > > to shoot themselves in the foot when they reply to, or forward a mail. The
> > > > > system is completely broken because limited by design, and we have to learn
> > > > > to live with this brokenness.
> > > > 
> > > > Only if MUAs have broken charset support or don't set a correct 
> > > > "charset" header in the mails they are sending.
> > > > 
> > > > If some software still can't handle UTF-8 correctly more than 10 years 
> > > > after it was introduced, that's not a brokenness you can blame on UTF-8.
> > > 
> > > I'm not blaming UTF-8 per se, but people who still believe in encoding
> > > *whole documents*. Copy-paste, text insertion, git output, etc... everything
> > > has a good reason not to be in the same encoding as what your MUA believes.
> > 
> > How would you do this technically in a way that it's significantly 
> > easier than simply finishing the UTF-8 transition?
> 
> In how many decades do you think the transition will be finished ?
> 
> > > If major MUAs still have problems with UTF-8 10 years after it was introduced,
> > > it's clearly the proof of a flaw in the initial design. And I'm not even
> > > discussing the stupidity which requires that you read a whole text to get
> > > its number of characters !
> > 
> > The only major MUA not supporting UTF-8 is Eudora.
> > 
> > And if you are talking about buggy old pine, in the latest development 
> > version [1] it does not only become open source, it also got some 
> > working Unicode support.
> 
> No, I'm not speaking about "not supporting", but "having problems". Every
> one of us has already received mails from Thunderbird, Outlook, Notes, etc...
> with erroneously encoded characters because of this :
> 
>   - an UTF8 MUA sends a mail to a non-UTF8 aware one.

"non-UTF8 aware one" = Eudora (BTW: there's no Linux version)

>   - this last one only sees double chars. When it wants to forward the mail
>     to someone else, it keeps the chars verbatim, and sets the encoding type
>     to its own, something like iso8859-1 for instance.

Let's not base everything on the one broken non-Linux MUA.

>   - the final MUA, which is UTF8-aware, is very happy to detect lots of UTF8
>     combinations in the forwarded mail and decides that everything in it is
>     UTF8, then you get lots of chars mangled in the mail, in the middle of
>     UTF8 combinations. Then, this crappy mail can be forwarded as long as
>     you want between UTF8 MUAs, they will all apply heuristics and do the
>     wrong thing : consider the *whole* document with *one* type.

Which MUAs exactly do ignore the "charset" of an email and try their own 
guessing instead?

Or which MUAs exactly do not set a "charset" so that the receiving MUA 
might have a reason for guessing?

> What I find even funnier is when, for no apparent reason, the same MUA is used
> on both ends and the contents get mangled because the sender copies a portion
> of text from somewhere else.

With which MUA and which charset settings of the users?

> Anyway, I don't want to follow up on this thread, it's *highly* off-topic here.

People want their names written correctly in changelogs.

It is therefore on-topic if the result is something like "kernel 
maintainers shouldn't be using Eudora" or "kernel maintainers using pine 
should upgrade to Alpine" or something similar.

> Cheers,
> Willy

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed



* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-08  1:03                       ` Adrian Bunk
  2007-01-08  1:14                         ` Willy Tarreau
@ 2007-01-08  6:52                         ` Jan Engelhardt
  2007-01-08  8:02                           ` Adrian Bunk
  1 sibling, 1 reply; 39+ messages in thread
From: Jan Engelhardt @ 2007-01-08  6:52 UTC (permalink / raw)
  To: Adrian Bunk
  Cc: Willy Tarreau, Russell King, David Woodhouse, Tilman Schmidt,
	Linux Kernel Mailing List


On Jan 8 2007 02:03, Adrian Bunk wrote:
>
>The only major MUA not supporting UTF-8 is Eudora.
>
>And if you are talking about buggy old pine, in the latest development 
>version [1] it does not only become open source, it also got some 
>working Unicode support.

Uhm, just for the record, I run pine 4.61 where my mail delivers to,
and Unicode works, yes, including the spam.


	-`J'
-- 


* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-08  6:52                         ` Jan Engelhardt
@ 2007-01-08  8:02                           ` Adrian Bunk
  0 siblings, 0 replies; 39+ messages in thread
From: Adrian Bunk @ 2007-01-08  8:02 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Willy Tarreau, Russell King, David Woodhouse, Tilman Schmidt,
	Linux Kernel Mailing List

On Mon, Jan 08, 2007 at 07:52:48AM +0100, Jan Engelhardt wrote:
> 
> On Jan 8 2007 02:03, Adrian Bunk wrote:
> >
> >The only major MUA not supporting UTF-8 is Eudora.
> >
> >And if you are talking about buggy old pine, in the latest development 
> >version [1] it does not only become open source, it also got some 
> >working Unicode support.
> 
> Uhm, just for the record, I run pine 4.61 where my mail delivers to,
> and Unicode works, yes, including the spam.

For some years I've been using pine only as a newsreader, and I remember some 
display problems with Unicode characters that are fixed in Alpine.

It might be that the support in pine was already better than I thought
(but my switch to MUA was so many years ago...).

> 	-`J'

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed



* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-08  0:38                     ` Willy Tarreau
  2007-01-08  1:03                       ` Adrian Bunk
@ 2007-01-08 19:53                       ` Valdis.Kletnieks
  1 sibling, 0 replies; 39+ messages in thread
From: Valdis.Kletnieks @ 2007-01-08 19:53 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Adrian Bunk, Jan Engelhardt, Russell King, David Woodhouse,
	Tilman Schmidt, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 519 bytes --]

On Mon, 08 Jan 2007 01:38:57 +0100, Willy Tarreau said:
> it's clearly the proof of a flaw in the initial design. And I'm not even
> discussing the stupidity which requires that you read a whole text to get
> its number of characters !

It's no more stupid than the *current* situation with Linux kernel code, where
the stupidity actually requires that even if you know that there are only 60
characters on a given line, you actually have to look at each one in order to
figure out if the line goes past column 80....
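
A small Python sketch of that column computation, assuming 8-column tab
stops as in kernel coding style and one column per non-tab character
(wide and combining characters are ignored here). The helper name and
the sample line are illustrative.

```python
def display_width(line: str, tabstop: int = 8) -> int:
    """Return the column reached after printing `line` with hard tabs."""
    col = 0
    for ch in line:
        if ch == "\t":
            # A tab advances to the next multiple of `tabstop`.
            col += tabstop - (col % tabstop)
        else:
            col += 1
    return col


line = "\t\tif (x)\treturn;"
print(len(line))             # characters in the line
print(display_width(line))   # columns it occupies
```

Even with a known character count, the width depends on where each tab
falls, so every character has to be visited.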


[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]


* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-07 15:38         ` Russell King
  2007-01-07 16:29           ` David Woodhouse
@ 2007-01-07 18:21           ` Alan
  2007-01-07 19:12             ` Jan Engelhardt
  2007-01-07 19:17             ` Russell King
  1 sibling, 2 replies; 39+ messages in thread
From: Alan @ 2007-01-07 18:21 UTC (permalink / raw)
  To: Russell King; +Cc: David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List

> So, in short, UTF-8 is all fine and dandy if your _entire_ universe
> is UTF-8 enabled.  If you're operating in a mixed charset environment
> it's one bloody big pain in the butt.

Net ASCII is 7bit and is 1:1 mapped with UTF-8 unicode. It's just old
broken 8bit encodings that are problematic.
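A minimal sketch of that claim, using Python's built-in codecs: pure 7-bit ASCII text produces the same bytes under ASCII and UTF-8, while a character outside ASCII is one byte in ISO-8859-1 and two bytes in UTF-8. The sample strings are only illustrations.

```python
ascii_text = "Linux 2.6.20-rc4"
ascii_bytes = ascii_text.encode("ascii")
utf8_bytes = ascii_text.encode("utf-8")
# For 7-bit text the two encodings are byte-for-byte identical.
print(ascii_bytes == utf8_bytes)

accented = "Mönkeberg"
latin1 = accented.encode("iso-8859-1")  # one byte per character
utf8 = accented.encode("utf-8")         # 'ö' becomes a two-byte sequence
print(latin1, len(latin1))
print(utf8, len(utf8))
```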

The kernel maintainers/help/config pretty consistently use UTF8


* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-07 18:21           ` Alan
@ 2007-01-07 19:12             ` Jan Engelhardt
  2007-01-07 22:30               ` Alan
  2007-01-07 19:17             ` Russell King
  1 sibling, 1 reply; 39+ messages in thread
From: Jan Engelhardt @ 2007-01-07 19:12 UTC (permalink / raw)
  To: Alan
  Cc: Russell King, David Woodhouse, Tilman Schmidt,
	Linux Kernel Mailing List


On Jan 7 2007 18:21, Alan wrote:
>
>> So, in short, UTF-8 is all fine and dandy if your _entire_ universe
>> is UTF-8 enabled.  If you're operating in a mixed charset environment
>> it's one bloody big pain in the butt.
>
>Net ASCII is 7bit and is 1:1 mapped with UTF-8 unicode. It's just old
>broken 8bit encodings that are problematic.
>
>The kernel maintainers/help/config pretty consistently use UTF8

I've seen a lot of places that don't do so. Want a patch?


	-`J'
-- 


* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-07 19:12             ` Jan Engelhardt
@ 2007-01-07 22:30               ` Alan
  2007-01-08  1:22                 ` Jan Engelhardt
  2007-01-08 16:14                 ` Pavel Machek
  0 siblings, 2 replies; 39+ messages in thread
From: Alan @ 2007-01-07 22:30 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Russell King, David Woodhouse, Tilman Schmidt,
	Linux Kernel Mailing List

> >The kernel maintainers/help/config pretty consistently use UTF8
> 
> I've seen a lot of places that don't do so. Want a patch?

I think that would be a good idea - and add it to the coding/docs specs
that documentation is UTF-8. Code should IMHO say 7bit though.

Alan


* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-07 22:30               ` Alan
@ 2007-01-08  1:22                 ` Jan Engelhardt
  2007-01-08 20:17                   ` Jan Engelhardt
  2007-01-08 16:14                 ` Pavel Machek
  1 sibling, 1 reply; 39+ messages in thread
From: Jan Engelhardt @ 2007-01-08  1:22 UTC (permalink / raw)
  To: Alan
  Cc: Russell King, David Woodhouse, Tilman Schmidt,
	Linux Kernel Mailing List


On Jan 7 2007 22:30, Alan wrote:
>
>> >The kernel maintainers/help/config pretty consistently use UTF8
>> 
>> I've seen a lot of places that don't do so. Want a patch?
>
>I think that would be a good idea - and add it to the coding/docs specs
>that documentation is UTF-8. Code should IMHO say 7bit though.

Hm, what do the lists of authors in .c/.h files and the kerneldoc
comments in .c/.h files count as? doc or code?


	-`J'
-- 


* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-08  1:22                 ` Jan Engelhardt
@ 2007-01-08 20:17                   ` Jan Engelhardt
  2007-01-08 22:00                     ` Ken Moffat
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Engelhardt @ 2007-01-08 20:17 UTC (permalink / raw)
  To: Alan
  Cc: Russell King, David Woodhouse, Tilman Schmidt,
	Linux Kernel Mailing List

[-- Attachment #1: Type: TEXT/PLAIN, Size: 985 bytes --]


On Jan 8 2007 02:22, Jan Engelhardt wrote:
>On Jan 7 2007 22:30, Alan wrote:
>>
>>> >The kernel maintainers/help/config pretty consistently use UTF8
>>> 
>>> I've seen a lot of places that don't do so. Want a patch?
>>
>>I think that would be a good idea - and add it to the coding/docs specs
>>that documentation is UTF-8. Code should IMHO say 7bit though.

Most memorable issues:

* "don<decimal-180>t" (standalone acute accent) rather than "don't" (apostrophe)
* "<decimal-160>", non breaking spaces
* cp437 encoding in some files (heh, heh, DOS!)
* iso8859-1/utf-8 mixed in some files
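A rough sketch of how such files can be spotted, not the method actually used for the patch: classify a chunk of bytes as pure ASCII, valid UTF-8, or some other 8-bit encoding, and list the offsets of the bytes above 0x7f. The function names are made up for illustration.

```python
def classify(data):
    """Classify bytes as 'ascii', 'utf-8' or 'unknown-8bit'."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Could be iso8859-1, cp437, or a mix; the bytes alone do not say.
        return "unknown-8bit"


def non_ascii_offsets(data):
    """Return the offsets of all bytes with the high bit set."""
    return [i for i, b in enumerate(data) if b >= 0x80]


print(classify(b"don't"))                      # plain apostrophe
print(classify(b"don\xb4t"))                   # decimal 180 as a single byte
print(classify("Mönkeberg".encode("utf-8")))   # properly encoded UTF-8
print(non_ascii_offsets(b"don\xb4t"))
```

A file that mixes iso8859-1 and utf-8 comes out as "unknown-8bit" as a whole, so running the check line by line is the more useful variant for that case.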

My compose key is hot now...

None of you people screw that patch with your buggy MUAs! I'll pack
it up into a .bz2 to get it marked as application/octet-stream to
not even give your MUA the chance to. ;-) [and because it's 221 K 
uncompressed and I am not sure if splitting it up makes much sense for 
such 'trivial' changes, or not?]

Signed-off-by: Jan Engelhardt <jengelh@gmx.de>


	-`J'
-- 

[-- Attachment #2: Type: APPLICATION/x-bzip2, Size: 42588 bytes --]


* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-08 20:17                   ` Jan Engelhardt
@ 2007-01-08 22:00                     ` Ken Moffat
  2007-01-08 23:21                       ` Jan Engelhardt
  0 siblings, 1 reply; 39+ messages in thread
From: Ken Moffat @ 2007-01-08 22:00 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Alan, Russell King, David Woodhouse, Tilman Schmidt,
	Linux Kernel Mailing List

On Mon, Jan 08, 2007 at 09:17:06PM +0100, Jan Engelhardt wrote:
> 
> On Jan 8 2007 02:22, Jan Engelhardt wrote:
> >On Jan 7 2007 22:30, Alan wrote:
> >>
> >>> >The kernel maintainers/help/config pretty consistently use UTF8
> >>> 
> >>> I've seen a lot of places that don't do so. Want a patch?
> >>
> >>I think that would be a good idea - and add it to the coding/docs specs
> >>that documentation is UTF-8. Code should IMHO say 7bit though.
> 
> Most memorable issues:
> 
> * "don<decimal-180>t" (standalone acute accent) rather than "don't" (apostrophe)
> * "<decimal-160>", non breaking spaces
> * cp437 encoding in some files (heh, heh, DOS!)
> * iso8859-1/utf-8 mixed in some files
 Looks nicely done, but I query the postal address changes in
Documentation/cdrom/sbpcd - that seems to be a change of address
(without anything to explain it).  Everything else seems to be just
character-set conversion or the occasional translation of comments
into English.  (And no, I didn't attempt to review the character-set
changes; even if there is an occasional error it will be better than
where we are now, and easy to patch.)
> 
> My compose key is hot now...
 I prefer the AltGr dead keys in X (they seem to work more reliably
for me), but I guess I'm straying OT.
> 
> None of you people screw that patch with your buggy MUAs! I'll pack
> it up into a .bz2 to get it marked as application/octet-stream to
> not even give your MUA the chance to. ;-) [and because it's 221 K 
> uncompressed and I am not sure if splitting it up makes much sense for 
> such 'trivial' changes, or not?]
> 
> Signed-off-by: Jan Engelhardt <jengelh@gmx.de>
> 
> 
> 	-`J'
> -- 

 Thanks for doing this, I hope it wasn't in vain.

Ken
-- 
the first time as tragedy, the second time as farce


* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-08 22:00                     ` Ken Moffat
@ 2007-01-08 23:21                       ` Jan Engelhardt
  2007-01-08 23:34                         ` Eberhard Moenkeberg
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Engelhardt @ 2007-01-08 23:21 UTC (permalink / raw)
  To: Eberhard Mönkeberg
  Cc: Alan, Russell King, David Woodhouse, Tilman Schmidt,
	Linux Kernel Mailing List, Ken Moffat


On Jan 8 2007 22:00, Ken Moffat wrote:

> Looks nicely done, but I query the postal address changes in
>Documentation/cdrom/sbpcd - that seems to be a change of address
>(without anything to explain it).

Eberhard [cc], please attach an Acked-by: YourName <emailaddress>
keep Ccs, thanks ;-)

[thread/patch: http://lkml.org/lkml/2007/1/8/222 ]

	-`J'
-- 


* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-08 23:21                       ` Jan Engelhardt
@ 2007-01-08 23:34                         ` Eberhard Moenkeberg
  0 siblings, 0 replies; 39+ messages in thread
From: Eberhard Moenkeberg @ 2007-01-08 23:34 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Alan, Russell King, David Woodhouse, Tilman Schmidt,
	Linux Kernel Mailing List, Ken Moffat

Hi,

On Tue, 9 Jan 2007, Jan Engelhardt wrote:
> On Jan 8 2007 22:00, Ken Moffat wrote:

> > Looks nicely done, but I query the postal address changes in
> >Documentation/cdrom/sbpcd - that seems to be a change of address
> >(without anything to explain it).
> 
> Eberhard [cc], please attach an Acked-by: YourName <emailaddress>
> keep Ccs, thanks ;-)
> 
> [thread/patch: http://lkml.org/lkml/2007/1/8/222 ]

Acked-by: Eberhard Moenkeberg <emoenke@gwdg.de>

Jan had contacted me before, and I had sent him my new address data.

This very young guy is doing a really good job. ;-))

Cheers -e
-- 
Eberhard Moenkeberg (emoenke@gwdg.de, em@kki.org)


* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-07 22:30               ` Alan
  2007-01-08  1:22                 ` Jan Engelhardt
@ 2007-01-08 16:14                 ` Pavel Machek
  2007-01-08 22:17                   ` Tim Pepper
  1 sibling, 1 reply; 39+ messages in thread
From: Pavel Machek @ 2007-01-08 16:14 UTC (permalink / raw)
  To: Alan
  Cc: Jan Engelhardt, Russell King, David Woodhouse, Tilman Schmidt,
	Linux Kernel Mailing List

On Sun 2007-01-07 22:30:55, Alan wrote:
> > >The kernel maintainers/help/config pretty consistently use UTF8
> > 
> > I've seen a lot of places that don't do so. Want a patch?
> 
> I think that would be a good idea - and add it to the coding/docs specs
> that documentation is UTF-8. Code should IMHO say 7bit though.

Yes, yes, please.

I have been flamed when someone tried to do 8bit patch, and I was
trying to NAK it...

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-08 16:14                 ` Pavel Machek
@ 2007-01-08 22:17                   ` Tim Pepper
  2007-01-08 23:30                     ` Jan Engelhardt
  0 siblings, 1 reply; 39+ messages in thread
From: Tim Pepper @ 2007-01-08 22:17 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Alan, Jan Engelhardt, Russell King, David Woodhouse,
	Tilman Schmidt, Linux Kernel Mailing List

On 1/8/07, Pavel Machek <pavel@ucw.cz> wrote:
> On Sun 2007-01-07 22:30:55, Alan wrote:
> > I think that would be a good idea - and add it to the coding/docs specs
> > that documentation is UTF-8. Code should IMHO say 7bit though.
>
> Yes, yes, please.
>
> I have been flamed when someone tried to do 8bit patch, and I was
> trying to NAK it...

Could this get put in Documentation/CodingStyle?  And an item added to
the kernel janitors' list to fix up 8bit files?  Last I looked trying
to decide if there was a standard here I found a mish-mash of
encodings based on the output of file vs Linus' git tree.


* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-08 22:17                   ` Tim Pepper
@ 2007-01-08 23:30                     ` Jan Engelhardt
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Engelhardt @ 2007-01-08 23:30 UTC (permalink / raw)
  To: Tim Pepper
  Cc: Pavel Machek, Alan, Russell King, David Woodhouse, Tilman Schmidt,
	Linux Kernel Mailing List


On Jan 8 2007 14:17, Tim Pepper wrote:
> On 1/8/07, Pavel Machek <pavel@ucw.cz> wrote:
>> On Sun 2007-01-07 22:30:55, Alan wrote:
>> > I think that would be a good idea - and add it to the coding/docs
>> > specs
>> > that documentation is UTF-8. Code should IMHO say 7bit though.
>> 
>> Yes, yes, please.
>> 
>> I have been flamed when someone tried to do 8bit patch, and I was
>> trying to NAK it...
>
> Could this get put in Documentation/CodingStyle?

Someone do that.

> And an item added to
> the kernel janitors' list to fix up 8bit files?  Last I looked trying

I have just done that: http://lkml.org/lkml/2007/1/8/222

> to decide if there was a standard here I found a mish-mash of
> encodings based on the output of file vs Linus' git tree.

	-`J'
-- 


* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-07 18:21           ` Alan
  2007-01-07 19:12             ` Jan Engelhardt
@ 2007-01-07 19:17             ` Russell King
  2007-01-07 19:58               ` Robin Rosenberg
                                 ` (2 more replies)
  1 sibling, 3 replies; 39+ messages in thread
From: Russell King @ 2007-01-07 19:17 UTC (permalink / raw)
  To: Alan; +Cc: David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List

On Sun, Jan 07, 2007 at 06:21:51PM +0000, Alan wrote:
> > So, in short, UTF-8 is all fine and dandy if your _entire_ universe
> > is UTF-8 enabled.  If you're operating in a mixed charset environment
> > it's one bloody big pain in the butt.
> 
> Net ASCII is 7bit and is 1:1 mapped with UTF-8 unicode.

The same is true of ISO-8859-1.

> It's just old broken 8bit encodings that are problematic.
> 
> The kernel maintainers/help/config pretty consistently use UTF8

As I've tried to point out, that's not universally true.  For instance:

commit 24ebead82bbf9785909d4cf205e2df5e9ff7da32
tree 921f686860e918a01c3d3fb6cd106ba82bf4ace6
parent 264166e604a7e14c278e31cadd1afb06a7d51a11
author Rafa³ Bilski <rafalbilski@interia.pl> 1167691774 +0100
committer Dave Jones <davej@redhat.com> 1167799119 -0500

and looking at that "author" closer with od:

0000140 74 68 6f 72 20 52 61 66 61 b3 20 42 69 6c 73 6b
          t   h   o   r       R   a   f   a   ³       B   i   l   s   k

clearly not UTF-8.  I doubt whether any of the commits I do on my
en_GB ISO-8859-1 systems end up being UTF-8 encoded.
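The od output above can be checked mechanically. The sketch below feeds the same 16 bytes to Python's UTF-8 decoder and then shows what the 0xb3 byte means under two single-byte charsets. That the author's mail was originally ISO-8859-2 is an assumption here, not something the commit records.

```python
# The 16 bytes from the od dump above.
raw = bytes.fromhex("74 68 6f 72 20 52 61 66 61 b3 20 42 69 6c 73 6b")


def is_valid_utf8(data):
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(raw))
# The same byte sequence read under two different single-byte charsets:
print(raw.decode("iso-8859-1"))
print(raw.decode("iso-8859-2"))  # assumed original charset
```

The bytes are identical in both readings; only the charset chosen by the reader differs, which is why an untagged repository cannot tell the two apart.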

And _this_ is the problem when it comes to generating the logs,
irrespective of whether or not Linus loads UTF-8 data into an
ISO-8859-1 message.  For all we know, Linus' system could be using
an ISO-8859 charset rather than UTF-8.

But the point is there is charset damage which has happened _long_ before
Linus' action.  There is no character set defined for the contents of git
repositories, and as such the output of the git tools can not be
interpreted as any one single character set.

All that UTF-8 has done is added to the "which charset is this data"
problem rather than actually solving any proper real life problem.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:


* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-07 19:17             ` Russell King
@ 2007-01-07 19:58               ` Robin Rosenberg
  2007-01-07 20:05               ` Dave Jones
  2007-01-08  1:40               ` Horst H. von Brand
  2 siblings, 0 replies; 39+ messages in thread
From: Robin Rosenberg @ 2007-01-07 19:58 UTC (permalink / raw)
  To: Kernel Mailing List; +Cc: Russell King, Alan, David Woodhouse, Tilman Schmidt

On Sunday 07 January 2007 20:17, Russell King wrote:
[...]
> clearly not UTF-8.  I doubt whether any of the commits I do on my
> en_GB ISO-8859-1 systems end up being UTF-8 encoded.

They don't. Git doesn't convert, with the exception of two mail-related tools,
which is the reason the commit being discussed ended up as UTF-8
in Git. The mail containing the patch was in ISO-8859-1. All other git tools
just store whatever byte sequence they are fed, be it ISO-latin, utf-8 or
something (to westerners) more exotic.

-- robin


* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-07 19:17             ` Russell King
  2007-01-07 19:58               ` Robin Rosenberg
@ 2007-01-07 20:05               ` Dave Jones
  2007-01-07 20:15                 ` Sean
  2007-01-08  4:42                 ` David Woodhouse
  2007-01-08  1:40               ` Horst H. von Brand
  2 siblings, 2 replies; 39+ messages in thread
From: Dave Jones @ 2007-01-07 20:05 UTC (permalink / raw)
  To: Alan, David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List

On Sun, Jan 07, 2007 at 07:17:30PM +0000, Russell King wrote:

 > commit 24ebead82bbf9785909d4cf205e2df5e9ff7da32
 > tree 921f686860e918a01c3d3fb6cd106ba82bf4ace6
 > parent 264166e604a7e14c278e31cadd1afb06a7d51a11
 > author Rafa³ Bilski <rafalbilski@interia.pl> 1167691774 +0100
 > committer Dave Jones <davej@redhat.com> 1167799119 -0500
 > 
 > and looking at that "author" closer with od:
 > 
 > 0000140 74 68 6f 72 20 52 61 66 61 b3 20 42 69 6c 73 6b
 >           t   h   o   r       R   a   f   a   ³       B   i   l   s   k
 > 
 > clearly not UTF-8.  I doubt whether any of the commits I do on my
 > en_GB ISO-8859-1 systems end up being UTF-8 encoded.

This has been bugging me for a while.
Viewing the mail I applied in mutt shows his name correctly as Rafał
Applying it with git-applymbox and viewing the log on master.kernel.org
with git log shows Rafa<B3>   And then later when put into email
it turns into Rafa³

 > But the point is there is charset damage which has happened _long_ before
 > Linus' action.  There is no character set defined for the contents of git
 > repositories, and as such the output of the git tools can not be
 > interpreted as any one single character set.

If there's something I should be doing when I commit that I'm not,
I'll be happy to change my scripts.  My $LANG is set to en_US.UTF-8
which should DTRT to the best of my knowledge, but clearly, that isn't
the case.

		Dave

-- 
http://www.codemonkey.org.uk


* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-07 20:05               ` Dave Jones
@ 2007-01-07 20:15                 ` Sean
  2007-01-07 20:40                   ` Jan Engelhardt
  2007-01-08  4:42                 ` David Woodhouse
  1 sibling, 1 reply; 39+ messages in thread
From: Sean @ 2007-01-07 20:15 UTC (permalink / raw)
  To: Dave Jones
  Cc: Alan, David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List,
	git

On Sun, 7 Jan 2007 15:05:53 -0500
Dave Jones <davej@redhat.com> wrote:

Including the Git list...

> On Sun, Jan 07, 2007 at 07:17:30PM +0000, Russell King wrote:
> 
>  > commit 24ebead82bbf9785909d4cf205e2df5e9ff7da32
>  > tree 921f686860e918a01c3d3fb6cd106ba82bf4ace6
>  > parent 264166e604a7e14c278e31cadd1afb06a7d51a11
>  > author Rafa³ Bilski <rafalbilski@interia.pl> 1167691774 +0100
>  > committer Dave Jones <davej@redhat.com> 1167799119 -0500
>  > 
>  > and looking at that "author" closer with od:
>  > 
>  > 0000140 74 68 6f 72 20 52 61 66 61 b3 20 42 69 6c 73 6b
>  >           t   h   o   r       R   a   f   a   ³       B   i   l   s   k
>  > 
>  > clearly not UTF-8.  I doubt whether any of the commits I do on my
>  > en_GB ISO-8859-1 systems end up being UTF-8 encoded.
> 
> This has been bugging me for a while.
> Viewing the mail I applied in mutt shows his name correctly as Rafał
> Applying it with git-applymbox and viewing the log on master.kernel.org
> with git log shows Rafa<B3>   And then later when put into email
> it turns into Rafa³
> 
>  > But the point is there is charset damage which has happened _long_ before
>  > Linus' action.  There is no character set defined for the contents of git
>  > repositories, and as such the output of the git tools can not be
>  > interpreted as any one single character set.
> 
> If there's something I should be doing when I commit that I'm not,
> I'll be happy to change my scripts.  My $LANG is set to en_US.UTF-8
> which should DTRT to the best of my knowledge, but clearly, that isn't
> the case.



* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-07 20:15                 ` Sean
@ 2007-01-07 20:40                   ` Jan Engelhardt
  2007-01-07 21:07                     ` Xavier Bestel
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Engelhardt @ 2007-01-07 20:40 UTC (permalink / raw)
  To: Sean
  Cc: Dave Jones, Alan, David Woodhouse, Tilman Schmidt,
	Linux Kernel Mailing List, git


>On Sun, 7 Jan 2007 15:05:53 -0500
>Dave Jones <davej@redhat.com> wrote:
>
>> If there's something I should be doing when I commit that I'm not,
>> I'll be happy to change my scripts.  My $LANG is set to en_US.UTF-8
>> which should DTRT to the best of my knowledge, but clearly, that isn't
>> the case.

No, LC_CTYPE defines what charset you use. (I may be wrong, though.)


	-`J'
-- 


* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-07 20:40                   ` Jan Engelhardt
@ 2007-01-07 21:07                     ` Xavier Bestel
  0 siblings, 0 replies; 39+ messages in thread
From: Xavier Bestel @ 2007-01-07 21:07 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Sean, Dave Jones, Alan, David Woodhouse, Tilman Schmidt,
	Linux Kernel Mailing List, git

On Sunday 07 January 2007 at 21:40 +0100, Jan Engelhardt wrote:
> >On Sun, 7 Jan 2007 15:05:53 -0500
> >Dave Jones <davej@redhat.com> wrote:
> >
> >> If there's something I should be doing when I commit that I'm not,
> >> I'll be happy to change my scripts.  My $LANG is set to en_US.UTF-8
> >> which should DTRT to the best of my knowledge, but clearly, that isn't
> >> the case.
> 
> No, LC_CTYPE defines what charset you use. (I may be wrong, though.)

IIRC LANG is the fallback for all LC_* - i.e. if only LANG is defined, it
sets all your locale categories, but you can individually override the
charset (LC_CTYPE), numeric format, date format, etc.
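The lookup order can be sketched as follows, assuming the usual POSIX precedence (LC_ALL first, then the category variable, then LANG) and treating an empty value like an unset one. The function name and the final "C" default are illustrative assumptions; this does not call into the C library.

```python
def effective_ctype(env):
    """Return the locale name that governs LC_CTYPE for the given environment."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:  # unset and empty are both treated as "not set"
            return value
    # Assumption: fall back to the "C" locale when nothing is set.
    return "C"


print(effective_ctype({"LANG": "en_US.UTF-8"}))
print(effective_ctype({"LANG": "en_US.UTF-8", "LC_CTYPE": "en_GB.ISO-8859-1"}))
```

With only LANG set, as in Dave's case, the charset does come from LANG; LC_CTYPE matters only if it is set as well.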

	Xav




* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-07 20:05               ` Dave Jones
  2007-01-07 20:15                 ` Sean
@ 2007-01-08  4:42                 ` David Woodhouse
  1 sibling, 0 replies; 39+ messages in thread
From: David Woodhouse @ 2007-01-08  4:42 UTC (permalink / raw)
  To: Dave Jones; +Cc: Alan, Tilman Schmidt, Linux Kernel Mailing List, rmk+lkml

On Sun, 2007-01-07 at 15:05 -0500, Dave Jones wrote:
> This has been bugging me for a while.
> Viewing the mail I applied in mutt shows his name correctly as Rafał
> Applying it with git-applymbox and viewing the log on master.kernel.org
> with git log shows Rafa<B3>   And then later when put into email
> it turns into Rafa³ 

I believe you need to use the misnamed '-u' option to git-applymbox,
which _really_ ought to be the default behaviour. Otherwise, it fails to
pay any attention to the character set tags in the mail it's decoding --
it commits the sin which rmk was whining about; assuming the input data
is of a given type and ignoring the explicit tags which indicate the
contrary.

The '-u' option is misdocumented as 'causes the resulting commit to be
encoded in utf-8', but in fact I believe it doesn't necessarily do that
-- it actually causes the resulting commit to be encoded in the
configured storage charset for the repository, which just _happens_ to
default to UTF-8 unless otherwise specified. That is something which
should definitely be the _default_ behaviour.

We should make the '-u' behaviour the default, and if anyone really
wants the old behaviour of importing arbitrary data in untagged 
binary form overriding its labelling then they can have a separate
option which does that.
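To illustrate the difference between honouring and ignoring the charset tags, here is a minimal sketch using Python's standard email package rather than git itself. The header value is a hypothetical RFC 2047 encoding of the author name discussed above (the real mail is not shown in this thread), and the function name is made up.

```python
from email.header import decode_header, make_header


def header_to_utf8(raw_header):
    """Decode RFC 2047 encoded words by their declared charset, return UTF-8."""
    text = str(make_header(decode_header(raw_header)))
    return text.encode("utf-8")


# Hypothetical From: display name, tagged as ISO-8859-2 by the sending MUA.
tagged = "=?iso-8859-2?q?Rafa=B3_Bilski?="

# Honouring the tag: the name is transcoded into the storage charset.
print(header_to_utf8(tagged))

# Ignoring the tag: the raw 0xb3 byte would be stored as-is, and later
# readers have to guess which charset it was meant to be.
print(b"Rafa\xb3 Bilski")
```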

-- 
dwmw2


* Re: OT: character encodings (was: Linux 2.6.20-rc4)
  2007-01-07 19:17             ` Russell King
  2007-01-07 19:58               ` Robin Rosenberg
  2007-01-07 20:05               ` Dave Jones
@ 2007-01-08  1:40               ` Horst H. von Brand
  2 siblings, 0 replies; 39+ messages in thread
From: Horst H. von Brand @ 2007-01-08  1:40 UTC (permalink / raw)
  To: Alan, David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List

Russell King <rmk+lkml@arm.linux.org.uk> wrote:

[...]

> All that UTF-8 has done is added to the "which charset is this data"
> problem rather than actually solving any proper real life problem.

It solves real-world problems; the pain is that it is not (yet) universally
used. The charset problems are just much more visible today than, say, 15
years back, that is all.
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                    Fono: +56 32 2654431
Universidad Tecnica Federico Santa Maria             +56 32 2654239
Casilla 110-V, Valparaiso, Chile               Fax:  +56 32 2797513


end of thread, other threads:[~2007-01-08 23:36 UTC | newest]

Thread overview: 39+ messages
2007-01-08 10:24 OT: character encodings (was: Linux 2.6.20-rc4) Nicolas Mailhot
2007-01-08 10:44 ` Alan
2007-01-08 10:44   ` Nicolas Mailhot
  -- strict thread matches above, loose matches on Subject: below --
2007-01-08 10:13 Nicolas Mailhot
2007-01-07  6:19 Linux 2.6.20-rc4 Linus Torvalds
2007-01-07 10:56 ` Jan Engelhardt
2007-01-07 11:44   ` Russell King
2007-01-07 13:06     ` OT: character encodings (was: Linux 2.6.20-rc4) Tilman Schmidt
2007-01-07 15:13       ` David Woodhouse
2007-01-07 15:38         ` Russell King
2007-01-07 16:29           ` David Woodhouse
2007-01-07 17:06             ` Russell King
2007-01-07 19:11               ` Jan Engelhardt
2007-01-07 19:20                 ` Russell King
2007-01-07 20:48                 ` Willy Tarreau
2007-01-07 23:37                   ` Adrian Bunk
2007-01-08  0:38                     ` Willy Tarreau
2007-01-08  1:03                       ` Adrian Bunk
2007-01-08  1:14                         ` Willy Tarreau
2007-01-08  1:45                           ` Adrian Bunk
2007-01-08  6:52                         ` Jan Engelhardt
2007-01-08  8:02                           ` Adrian Bunk
2007-01-08 19:53                       ` Valdis.Kletnieks
2007-01-07 18:21           ` Alan
2007-01-07 19:12             ` Jan Engelhardt
2007-01-07 22:30               ` Alan
2007-01-08  1:22                 ` Jan Engelhardt
2007-01-08 20:17                   ` Jan Engelhardt
2007-01-08 22:00                     ` Ken Moffat
2007-01-08 23:21                       ` Jan Engelhardt
2007-01-08 23:34                         ` Eberhard Moenkeberg
2007-01-08 16:14                 ` Pavel Machek
2007-01-08 22:17                   ` Tim Pepper
2007-01-08 23:30                     ` Jan Engelhardt
2007-01-07 19:17             ` Russell King
2007-01-07 19:58               ` Robin Rosenberg
2007-01-07 20:05               ` Dave Jones
2007-01-07 20:15                 ` Sean
2007-01-07 20:40                   ` Jan Engelhardt
2007-01-07 21:07                     ` Xavier Bestel
2007-01-08  4:42                 ` David Woodhouse
2007-01-08  1:40               ` Horst H. von Brand
