* Re: OT: character encodings (was: Linux 2.6.20-rc4)
@ 2007-01-08 10:24 Nicolas Mailhot
2007-01-08 10:44 ` Alan
0 siblings, 1 reply; 39+ messages in thread
From: Nicolas Mailhot @ 2007-01-08 10:24 UTC (permalink / raw)
To: Willy Tarreau; +Cc: linux-kernel
>> How would you do this technically in a way that it's significantly
>> easier than simply finishing the UTF-8 transition?
> In how many decades do you think the transition will be finished ?
Right now it looks like it will be finished way earlier than apps bother
supporting the later 8-bit encodings such as iso-8859-15
(case in point: Russell's system. I was ROTFL when he proudly announced he
was running a full iso-8859-1 system after dissing UTF-8. Last I've seen
the official 8-bit EU encoding was iso-8859-15, and UK is part of the EU)
--
Nicolas Mailhot
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: OT: character encodings (was: Linux 2.6.20-rc4)
2007-01-08 10:24 OT: character encodings (was: Linux 2.6.20-rc4) Nicolas Mailhot
@ 2007-01-08 10:44 ` Alan
2007-01-08 10:44 ` Nicolas Mailhot
0 siblings, 1 reply; 39+ messages in thread
From: Alan @ 2007-01-08 10:44 UTC (permalink / raw)
To: Nicolas Mailhot; +Cc: Willy Tarreau, linux-kernel
> (case in point: Russell's system. I was ROTFL when he proudly announced he
> was running a full iso-8859-1 system after dissing UTF-8. Last I've seen
> the official 8-bit EU encoding was iso-8859-15, and UK is part of the EU)
There is no correct UK encoding. You need -14 or -15 depending upon
language and can come horribly unstuck the moment a name is involved.
Alan
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: OT: character encodings (was: Linux 2.6.20-rc4)
2007-01-08 10:44 ` Alan
@ 2007-01-08 10:44 ` Nicolas Mailhot
0 siblings, 0 replies; 39+ messages in thread
From: Nicolas Mailhot @ 2007-01-08 10:44 UTC (permalink / raw)
To: Alan; +Cc: Willy Tarreau, linux-kernel
On Mon, 8 January 2007 11:44, Alan wrote:
>> (case in point: Russell's system. I was ROTFL when he proudly announced
>> he was running a full iso-8859-1 system after dissing UTF-8. Last I've seen
>> the official 8-bit EU encoding was iso-8859-15, and UK is part of the EU)
>
> There is no correct UK encoding. You need -14 or -15 depending upon
> language and can come horribly unstuck the moment a name is involved.
Either way it's not iso-8859-1 :)
--
Nicolas Mailhot
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: OT: character encodings (was: Linux 2.6.20-rc4)
@ 2007-01-08 10:13 Nicolas Mailhot
0 siblings, 0 replies; 39+ messages in thread
From: Nicolas Mailhot @ 2007-01-08 10:13 UTC (permalink / raw)
To: linux-kernel; +Cc: Russell King
> elinks is one such program. It now assumes UTF-8 _only_ displays.
> That's no better than programs which assume ISO-8859-1 only or US-ASCII
> only.
That's way better than programs:
- which assume an encoding you can't write most world languages in (BTW
ISO-8859-1 & US-ASCII have been broken by design for Western Europe since at
least the creation of the Euro)
- which perpetuate the myth that local 8-bit encodings are manageable (they
aren't; people spent decades trying to limp along with them. Unicode &
UTF-8 were not created just to make your life miserable)
Show me one program that spurns Unicode and I'll show you one that "passed on"
iso-8859-15 (typically, though it's the easiest non-iso-8859-1 to do)
The only reason you have the UTF-8 big stick approach nowadays is that people
have tried for years to get app writers to manage 8-bit locales properly, with
dismal results. The old system only worked for en_US users (and
perhaps for .uk people)
--
Nicolas Mailhot
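
The claim about the Euro sign in the message above can be checked mechanically. What follows is a minimal sketch in Python, not part of the original mail; the helper name `can_encode` is hypothetical, and the codec names are the ones shipped with Python's standard library.

```python
def can_encode(text, charset):
    """Return True if every character of `text` exists in `charset`."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


euro = "\u20ac"  # EURO SIGN

# Only the last two charsets have a slot for the Euro sign.
for charset in ("us-ascii", "iso-8859-1", "iso-8859-15", "utf-8"):
    print(charset, can_encode(euro, charset))
```

ISO-8859-15 reuses byte 0xA4 (the generic currency sign in ISO-8859-1) for the Euro, which is why a file written under one of the two cannot be told apart from the other by its bytes alone.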
^ permalink raw reply [flat|nested] 39+ messages in thread
* Linux 2.6.20-rc4
@ 2007-01-07 6:19 Linus Torvalds
2007-01-07 10:56 ` Jan Engelhardt
0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2007-01-07 6:19 UTC (permalink / raw)
To: Linux Kernel Mailing List
[-- Attachment #1: Type: TEXT/PLAIN, Size: 10496 bytes --]
There's absolutely nothing interesting here, unless you want to play with
KVM, or happened to be bitten by the bug with really old versions of the
linker that made parts of entry.S just go away.
But check it out anyway, and the shortlog gives more details on the
various minor fixes that have accumulated this week. Mostly in random
device drivers.
Linus
---
Adam Megacz (1):
Add AFS_SUPER_MAGIC to magic.h
Adrian Bunk (2):
[NET] drivers/net/loopback.c: convert to module_init()
[X25]: proper prototype for x25_init_timers()
Alan (3):
libata: fix combined mode
atiixp: Old drivers/ide layer driver for the ATIIXP hang fix
hpt37x: Two important bug fixes
Alan Stern (2):
UHCI: make test for ASUS motherboard more specific
UHCI: support device_may_wakeup
Alexey Dobriyan (2):
[NETFILTER] xt_hashlimit.c: fix typo
pata_optidma: typo in Kconfig
Andrew Morton (5):
USB: funsoft is borken on sparc
sisusb_con warning fixes
PCI: disable PCI_MULTITHREAD_PROBE
ip2 warning fix
shrink_all_memory(): fix lru_pages handling
Ard van Breemen (3):
start_kernel: test if irq's got enabled early, barf, and disable them again
kernelparams: detect if and which parameter parsing enabled irq's
PCI: prevent down_read when pci_devices is empty
Arnaud Patard (2):
[ARM] 4065/1: S3C24XX: dma printk fixes
[ARM] 4073/1: Prevent s3c24xx drivers from including asm/arch/hardware.h and asm/arch/irqs.h
Avi Kivity (39):
KVM: Prevent stale bits in cr0 and cr4
KVM: MMU: Implement simple reverse mapping
KVM: MMU: Teach the page table walker to track guest page table gfns
KVM: MMU: Load the pae pdptrs on cr3 change like the processor does
KVM: MMU: Fold fetch_guest() into init_walker()
KVM: MMU: Special treatment for shadow pae root pages
KVM: MMU: Use the guest pdptrs instead of mapping cr3 in pae mode
KVM: MMU: Make the shadow page tables also special-case pae
KVM: MMU: Make kvm_mmu_alloc_page() return a kvm_mmu_page pointer
KVM: MMU: Shadow page table caching
KVM: MMU: Write protect guest pages when a shadow is created for them
KVM: MMU: Let the walker extract the target page gfn from the pte
KVM: MMU: Support emulated writes into RAM
KVM: MMU: Zap shadow page table entries on writes to guest page tables
KVM: MMU: If emulating an instruction fails, try unprotecting the page
KVM: MMU: Implement child shadow unlinking
KVM: MMU: kvm_mmu_put_page() only removes one link to the page
KVM: MMU: oom handling
KVM: MMU: Remove invlpg interception
KVM: MMU: Remove release_pt_page_64()
KVM: MMU: Handle misaligned accesses to write protected guest page tables
KVM: MMU: Move is_empty_shadow_page() above kvm_mmu_free_page()
KVM: MMU: Ensure freed shadow pages are clean
KVM: MMU: If an empty shadow page is not empty, report more info
KVM: MMU: Page table write flood protection
KVM: MMU: Never free a shadow page actively serving as a root
KVM: MMU: Fix cmpxchg8b emulation
KVM: MMU: Treat user-mode faults as a hint that a page is no longer a page table
KVM: MMU: Free pages on kvm destruction
KVM: MMU: Replace atomic allocations by preallocated objects
KVM: MMU: Detect oom conditions and propagate error to userspace
KVM: MMU: Flush guest tlb when reducing permissions on a pte
KVM: MMU: Destroy mmu while we still have a vcpu left
KVM: MMU: add audit code to check mappings, etc are correct
KVM: Improve reporting of vmwrite errors
KVM: Initialize vcpu->kvm a little earlier
KVM: Add missing 'break'
KVM: Don't set guest cr3 from vmx_vcpu_setup()
KVM: MMU: Add missing dirty bit
Bartlomiej Zolnierkiewicz (1):
via82cxxx: fix cable detection
Ben Dooks (1):
[ARM] 4071/1: S3C24XX: Documentation update
Benjamin Herrenschmidt (1):
[SUNGEM]: PHY updates & pause fixes (#2)
Brice Goglin (1):
[CPUFREQ] speedstep-centrino: missing space and bracket
Christoph Hellwig (2):
[XFRM_USER]: avoid pointless void ** casts
Fix BUG at drivers/scsi/scsi_lib.c:1118 caused by "pktsetup dvd /dev/sr0"
Christoph Lameter (1):
Check for populated zone in __drain_pages
Chuck Ebbert (1):
[NETFILTER]: ebtables: don't compute gap before checking struct type
Cyrill V. Gorcunov (1):
qconf: fix SIGSEGV on empty menu items
Dan Williams (1):
[ARM] 4077/1: iop13xx: fix __io() macro
Dave Jones (3):
[CPUFREQ] longhaul: Fix up unreachable code.
[CPUFREQ] longhaul: Kill off warnings introduced by recent changes.
Fix implicit declarations in via-pmu
David Brownell (4):
i2c: Migration aids for i2c_adapter.dev removal
USB: omap_udc build fixes (sync with linux-omap)
rtc-at91rm9200 build fix
Update the rtc-rs5c372 driver
David Hollis (1):
USB: asix: Fix AX88772 device PHY selection
David L Stevens (1):
[IPV4/IPV6]: Fix inet{,6} device initialization order.
David S. Miller (2):
[PKTGEN]: Convert to kthread API.
[SOUND] Sparc CS4231: Use 64 for period_bytes_min
Dmitry Mishin (1):
[NETFILTER]: compat offsets size change
Dor Laor (2):
KVM: Improve interrupt response
KVM: Simplify test for interrupt window
Doug Chapman (1):
ACPI: increase ACPI_MAX_REFERENCE_COUNT for larger systems
Eric Anholt (1):
[AGPGART] fix detection of aperture size versus GTT size on G965
Eric Sandeen (1):
fix memory corruption from misinterpreted bad_inode_ops return values
Erik Jacobson (1):
connector: some fixes for ia64 unaligned access errors
Evgeniy Dushistov (1):
fix garbage instead of zeroes in UFS
Gabriel Mansi (1):
[AGPGART] K8M890 support for amd-k8.
Georg Chini (1):
[SOUND] Sparc CS4231: Fix IRQ return value and initialization.
Gerrit Renker (1):
[TCP]: Use old definition of before
Guillaume Chazarain (2):
ACPI: EC: move verbose printk to debug build only
[CPUFREQ] Uninitialized use of cmd.val in arch/i386/kernel/cpu/cpufreq/acpi-cpufreq.c:acpi_cpufreq_target()
Hugh Dickins (2):
fix BUG_ON(!PageSlab) from fallback_alloc
fix OOM killing of swapoff
Ingo Molnar (6):
KVM: Fix GFP_KERNEL alloc in atomic section bug
KVM: Use raw_smp_processor_id() instead of smp_processor_id() where applicable
profiling: fix sched profiling typo
KVM: Avoid oom on cr3 switch
KVM: Make loading cr3 more robust
KVM: Simplify mmu_alloc_roots()
James Bursa (1):
adfs: fix filename handling
Jens Axboe (3):
cfq-iosched: merging problem
cdrom: set default timeout to 7 seconds
ide-cd maintainer
Jiri Kosina (1):
HID: fix help texts in Kconfig
Kay Sievers (1):
Driver core: Fix prefix driver links in /sys/module by bus-name
Len Brown (2):
ACPI: fix section mis-match build warning
ACPI: asus_acpi: new MAINTAINER
Lennert Buytenhek (1):
[ARM] 4063/1: ep93xx: fix IRQ_EP93XX_GPIO?MUX numbering
Leonard Norrgård (1):
sound: hda: detect ALC883 on MSI K9A Platinum motherboards (MS-7280)
Linus Torvalds (3):
Revert "[PATCH] x86_64: fix boot hang caused by CALGARY_IOMMU_ENABLED_BY_DEFAULT"
Revert "[PATCH] binfmt_elf: randomize PIE binaries (2nd try)"
Linux 2.6.20-rc4
Mariusz Kozlowski (1):
[AF_NETLINK]: module_put cleanup
Martin Josefsson (1):
[NETFILTER]: nf_nat: fix MASQUERADE crash on device down
Martin Williges (1):
USB: usblp.c - add Kyocera Mita FS 820 to list of "quirky" printers
Matthijs van Otterdijk (1):
fix the toshiba_acpi write_lcd return value
Maxime Bizon (1):
i2c-mv64xxx: Fix random oops at boot
Miguel Angel Alvarez (1):
USB: fix interaction between different interfaces in an "Option" usb device
Nicolas Pitre (2):
[ARM] 4064/1: make pxa_get_cycles() static
[ARM] 4066/1: correct a comment about PXA's sched_clock range
OGAWA Hirofumi (1):
x86_64: Fix dump_trace()
Oliver Neukum (1):
USB: small update to Documentation/usb/acm.txt
Parag Warudkar (1):
selinux: fix selinux_netlbl_inode_permission() locking
Patrick McHardy (2):
[NETFILTER]: Fix routing of REJECT target generated packets in output chain
[NETFILTER]: New connection tracking is not EXPERIMENTAL anymore
Paul Brook (1):
[ARM] 4074/1: Flat loader stack alignment
Paul Mundt (1):
Sanely size hash tables when using large base pages
Pete Zaitcev (1):
USB storage: fix ipod ejecting issue
Phil Dibowitz (1):
USB Storage: unusual_devs: add supertop drives
Philipp Zabel (2):
[ARM] 4080/1: Fix for the SSCR0_SlotsPerFrm macro
[ARM] 4081/1: Add definition for TI Sync Serial Protocol
Philippe De Muyter (1):
i2c/m41t00: Do not forget to write year
Rafael J. Wysocki (1):
swsusp: Do not fail if resume device is not set
Rafa³ Bilski (2):
[CPUFREQ] Longhaul - Fix up powersaver assumptions.
[CPUFREQ] Longhaul - Always guess FSB
Randy Dunlap (1):
[CPUFREQ] select consistently
Richard Purdie (3):
[ARM] 4078/1: Fix ARM copypage cache coherency problems
backlight: fix backlight_device_register compile failures
Fix leds-s3c24xx hardware.h reference
Russell King (2):
[ARM] Fix VFP initialisation issue for SMP systems
Fix some ARM builds due to HID brokenness
Sarah Bailey (1):
USB: Fixed bug in endpoint release function.
Segher Boessenkool (1):
Fix insta-reboot with "i386: Relocatable kernel support"
Thomas Hellstrom (2):
[AGPGART] Remove unnecessary flushes when inserting and removing pages.
[AGPGART] Fix PCI-posting flush typo.
Venkatesh Pallipadi (1):
[CPUFREQ] Bug fix for acpi-cpufreq and cpufreq_stats oops on frequency change notification
Vitaly Wool (2):
i2c-pnx: Fix interrupt handler, get rid of EARLY config option
i2c-pnx: Add entry to MAINTAINERS
Vivek Goyal (4):
i386: Restore CONFIG_PHYSICAL_START option
i386: fix modpost warning in SMP trampoline code
i386: fix another modpost warning
i386: modpost smpboot code warning fix
Yoshimi Ichiyanagi (1):
KVM: Recover after an arch module load failure
akpm@osdl.org (1):
[AGPGART] drivers/char/agp/sgi-agp.c: check kmalloc() return value
dean gaudet (1):
[NET]: ifb double-counts packets
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Linux 2.6.20-rc4
2007-01-07 6:19 Linux 2.6.20-rc4 Linus Torvalds
@ 2007-01-07 10:56 ` Jan Engelhardt
2007-01-07 11:44 ` Russell King
0 siblings, 1 reply; 39+ messages in thread
From: Jan Engelhardt @ 2007-01-07 10:56 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Linux Kernel Mailing List
[-- Attachment #1: Type: TEXT/PLAIN, Size: 283 bytes --]
On Jan 6 2007 22:19, Linus Torvalds wrote:
>Leonard NorrgÃ¥rd (1):
> sound: hda: detect ALC883 on MSI K9A Platinum motherboards (MS-7280)
Something seems to have mangled the name, that should have
been an å not Ã¥. (Something reencoded it). A gitlog problem?
-`J'
--
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Linux 2.6.20-rc4
2007-01-07 10:56 ` Jan Engelhardt
@ 2007-01-07 11:44 ` Russell King
2007-01-07 13:06 ` OT: character encodings (was: Linux 2.6.20-rc4) Tilman Schmidt
0 siblings, 1 reply; 39+ messages in thread
From: Russell King @ 2007-01-07 11:44 UTC (permalink / raw)
To: Jan Engelhardt; +Cc: Linus Torvalds, Linux Kernel Mailing List
On Sun, Jan 07, 2007 at 11:56:01AM +0100, Jan Engelhardt wrote:
> On Jan 6 2007 22:19, Linus Torvalds wrote:
>
> >Leonard NorrgÃ¥rd (1):
> > sound: hda: detect ALC883 on MSI K9A Platinum motherboards (MS-7280)
>
> Something seems to have mangled the name, that should have
> been an å not Ã¥. (Something reencoded it). A gitlog problem?
That is an å if you look at the raw message in UTF-8. However, Linus
sends mail with a charset of ISO-8859-1, and if you place UTF-8
encoded text in such a message body, you will see Ã¥.
Welcome to the mess which the UTF-8 charset creates.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
^ permalink raw reply [flat|nested] 39+ messages in thread
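
The mangling described in the message above can be reproduced in a few lines. This is an illustrative sketch in Python, not something from the thread; it assumes the name left git as UTF-8 bytes and was then read by a client that trusted the ISO-8859-1 label on the mail.

```python
name = "Leonard Norrg\u00e5rd"  # U+00E5 is LATIN SMALL LETTER A WITH RING ABOVE

# Step 1: the text is serialised as UTF-8 (two bytes, 0xC3 0xA5, for the one letter).
wire_bytes = name.encode("utf-8")

# Step 2: the reader decodes those bytes using the declared charset instead.
mislabelled = wire_bytes.decode("iso-8859-1")
print(mislabelled)

# The damage is reversible as long as nothing re-encoded the text in between:
repaired = mislabelled.encode("iso-8859-1").decode("utf-8")
print(repaired)
```

The round trip works here only because ISO-8859-1 maps every byte value to some character; with a charset that has undefined positions, the second step could fail or lose data.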
* OT: character encodings (was: Linux 2.6.20-rc4)
2007-01-07 11:44 ` Russell King
@ 2007-01-07 13:06 ` Tilman Schmidt
2007-01-07 15:13 ` David Woodhouse
0 siblings, 1 reply; 39+ messages in thread
From: Tilman Schmidt @ 2007-01-07 13:06 UTC (permalink / raw)
To: Russell King; +Cc: Linux Kernel Mailing List
Russell King wrote:
[Leonard NorrgÃ¥rd (1):]
> That is an å if you look at the raw message in UTF-8. However, Linus
> sends mail with a charset of ISO-8859-1, and if you place UTF-8
> encoded text in such a message body, you will see Ã¥.
Only if the mechanism used for placing it there ignores the different
encodings.
> Welcome to the mess which the UTF-8 charset creates.
The problem of different character encodings coexisting on the same
platform, and the resulting occasional messing-up, far predates Unicode.
I distinctly remember one case of being bitten by this myself in 1977
when Unicode wasn't even on the horizon yet, and I don't think that was
the first time.
Tilman
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: OT: character encodings (was: Linux 2.6.20-rc4)
2007-01-07 13:06 ` OT: character encodings (was: Linux 2.6.20-rc4) Tilman Schmidt
@ 2007-01-07 15:13 ` David Woodhouse
2007-01-07 15:38 ` Russell King
0 siblings, 1 reply; 39+ messages in thread
From: David Woodhouse @ 2007-01-07 15:13 UTC (permalink / raw)
To: Tilman Schmidt; +Cc: Russell King, Linux Kernel Mailing List
On Sun, 2007-01-07 at 14:06 +0100, Tilman Schmidt wrote:
> Russell King wrote:
> > Welcome to the mess which the UTF-8 charset creates.
Utter bollocks.
> The problem of different character encodings coexisting on the same
> platform, and the resulting occasional messing-up, far predates Unicode.
> I distinctly remember one case of being bitten by this myself in 1977
> when Unicode wasn't even on the horizon yet, and I don't think that was
> the first time.
Indeed. If you take arbitrary content and send it out to the world
labelled as ISO8859-1, of _course_ you're likely to be corrupting it.
Far from being the cause of the problem, UTF-8 actually offers the
chance of a _solution_. Because once the Luddites catch up, it'll
largely eliminate the need for using the multitude of legacy character
sets and converting between them -- and the problem of mislabelling will
fairly much go away.
--
dwmw2
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: OT: character encodings (was: Linux 2.6.20-rc4)
2007-01-07 15:13 ` David Woodhouse
@ 2007-01-07 15:38 ` Russell King
2007-01-07 16:29 ` David Woodhouse
2007-01-07 18:21 ` Alan
0 siblings, 2 replies; 39+ messages in thread
From: Russell King @ 2007-01-07 15:38 UTC (permalink / raw)
To: David Woodhouse; +Cc: Tilman Schmidt, Linux Kernel Mailing List
On Sun, Jan 07, 2007 at 11:13:57PM +0800, David Woodhouse wrote:
> On Sun, 2007-01-07 at 14:06 +0100, Tilman Schmidt wrote:
> > Russell King wrote:
> > > Welcome to the mess which the UTF-8 charset creates.
>
> Utter bollocks.
Wrong. The problem is partly caused by not everything understanding
multi-byte character encodings, and text files containing absolutely
_no_ information about their character encodings.
When a text file is stored on disk, there's no way to tell what
character set the characters in that file belong to. As a result,
ISO-8859-1 folk assume that all text files are ISO-8859-1 encoded.
UTF-8 folk assume all text files are UTF-8 encoded. This leads to
utter confusion.
To see what I mean, try the following:
$ git log | head -n 1000 > o
$ file -i o
o: text/x-c; charset=iso-8859-1
According to that, the charset of the 'git log' output (which on that
test included Leonard's entry) is iso-8859-1, and by that Linus' mailer
was right to include it as ISO-8859-1.
In reality, the output from git log contains an ad-hoc collection of
character sets making its interpretation under any one character set
incorrect.
> > The problem of different character encodings coexisting on the same
> > platform, and the resulting occasional messing-up, far predates Unicode.
> > I distinctly remember one case of being bitten by this myself in 1977
> > when Unicode wasn't even on the horizon yet, and I don't think that was
> > the first time.
>
> Indeed. If you take arbitrary content and send it out to the world
> labelled as ISO8859-1, of _course_ you're likely to be corrupting it.
>
> Far from being the cause of the problem, UTF-8 actually offers the
> chance of a _solution_. Because once the Luddites catch up, it'll
> largely eliminate the need for using the multitude of legacy character
> sets and converting between them -- and the problem of mislabelling will
> fairly much go away.
In other words, the UTF-8 luddites require the entire Internet to
upgrade to UTF-8 for UTF-8 to work properly.
I _regularly_ struggle with idiotic programs that assume that the world
is UTF-8 and nothing else. UTF-8 does _not_ solve these inter-operability
problems - it only makes the entire situation worse by introducing yet
another different charset. (Yes, it's also true that there are programs
which assume the world is only another, different, character set.)
Rather than having these problems fixed properly (by looking at the LANG
environment variable) many of these programs now assume that the world
is UTF-8. It isn't.
elinks is one such program. It now assumes UTF-8 _only_ displays.
That's no better than programs which assume ISO-8859-1 only or US-ASCII
only.
So, in short, UTF-8 is all fine and dandy if your _entire_ universe
is UTF-8 enabled. If you're operating in a mixed charset environment
it's one bloody big pain in the butt.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: OT: character encodings (was: Linux 2.6.20-rc4)
2007-01-07 15:38 ` Russell King
@ 2007-01-07 16:29 ` David Woodhouse
2007-01-07 17:06 ` Russell King
2007-01-07 18:21 ` Alan
1 sibling, 1 reply; 39+ messages in thread
From: David Woodhouse @ 2007-01-07 16:29 UTC (permalink / raw)
To: Russell King; +Cc: Tilman Schmidt, Linux Kernel Mailing List
On Sun, 2007-01-07 at 15:38 +0000, Russell King wrote:
> On Sun, Jan 07, 2007 at 11:13:57PM +0800, David Woodhouse wrote:
> > On Sun, 2007-01-07 at 14:06 +0100, Tilman Schmidt wrote:
> > > Russell King wrote:
> > > > Welcome to the mess which the UTF-8 charset creates.
> >
> > Utter bollocks.
>
> Wrong. The problem is partly caused by not everything understanding
> multi-byte character encodings,
No, that's a different problem; not the one you were referring to above.
And it's a problem which is rapidly diminishing, too.
> and text files containing absolutely
> _no_ information about their character encodings.
That's a real problem, yes -- but it was a problem long before UTF-8 was
added to the collection of character sets in use. Even within the UK, we
had to choose between ISO8859-1 and ISO8859-15.
> When a text file is stored on disk, there's no way to tell what
> character set the characters in that file belong to. As a result,
> ISO-8859-1 folk assume that all text files are ISO-8859-1 encoded.
> UTF-8 folk assume all text files are UTF-8 encoded. This leads to
> utter confusion.
Only if you are making different assumptions about the _same_ set of
files, on the _same_ system. But that would be silly.
If I suddenly "assume" that my laptop has a Dvorak keyboard layout
despite that blatantly not being true, I'll get the same kind of
confusion. That isn't Dvorak's fault, either.
If, on the other hand, I have one system which is entirely ISO8859-1 and
a separate system which is entirely UTF-8, each of those are _fine_ and
unconfusing. Obviously I have to make sure files are properly labelled
and converted in transport between different systems -- but that's
nothing new.
> To see what I mean, try the following:
>
> $ git log | head -n 1000 > o
> $ file -i o
> o: text/x-c; charset=iso-8859-1
>
> According to that, the charset of the 'git log' output (which on that
> test included Leonard's entry) is iso-8859-1, and by that Linus' mailer
> was right to include it as ISO-8859-1.
Yes. When you stored it on disk, the character set information was lost.
If you were running a mixed-charset system then attempting to recreate
the lost information with heuristics and assumptions is obviously going
to be problematic.
Actually, because UTF-8 allows me to run a system which is purely based
on a single character set, I get better results when I try the same
trick:
shinybook /shiny/git/mtd-2.6 $ git log | head -n 1000 > o
shinybook /shiny/git/mtd-2.6 $ file -i o
o: text/plain; charset=utf-8
Again, the problem of labelling isn't at all new to UTF-8. The only
thing that's new with UTF-8 is that it's now actually _practical_ to
have a system which only uses one character set throughout, and which
thus _can_ get its 'guess' right when you don't bother to label
everything.
> In reality, the output from git log contains an ad-hoc collection of
> character sets making its interpretation under any one character set
> incorrect.
No, the contents of the git log ought to be UTF-8, unless people have
been misusing it. Git stores its text in UTF-8 (by default), and is
capable of converting to and from legacy character sets on input
(git-commit) and output (git-log).
(Obviously, that's likely to be lossy if you convert it to any given
legacy character set, because ∀ legacy character set, ∃ characters
within UTF-8 that aren't in that legacy character set.)
> > Far from being the cause of the problem, UTF-8 actually offers the
> > chance of a _solution_. Because once the Luddites catch up, it'll
> > largely eliminate the need for using the multitude of legacy character
> > sets and converting between them -- and the problem of mislabelling will
> > fairly much go away.
>
> In other words, the UTF-8 luddites require the entire Internet to
> upgrade to UTF-8 for UTF-8 to work properly.
Not at all. The problems arise when character set information is lost,
which can happen at any point during the flow of information. Anything
we can do to reduce the likelihood of charset information being lost is
an overall improvement. We already demonstrated an example (git-log > o;
file -i o) of a case where a _consistent_ system gets it right, while an
inconsistent system introduces an error.
If any individual system processes all text in a single character set,
then that system is no longer a likely source of corruption due to
labelling errors. And because UTF-8 fully covers the set of characters
which can be represented in the legacy character sets, it allows us to
deploy systems which do just that.
> I _regularly_ struggle with idiotic programs that assume that the world
> is UTF-8 and nothing else.
I don't think I've encountered such a program in my distribution of
choice. If I had, I would have filed a bug. Making assumptions about
character sets, outside of the locally-controlled environment, is
invalid. That's been true since the first 8-bit character sets, if not
longer.
> So, in short, UTF-8 is all fine and dandy if your _entire_ universe
> is UTF-8 enabled. If you're operating in a mixed charset environment
> it's one bloody big pain in the butt.
A mixed charset environment was _already_ a pain in the butt, because
almost nobody got labelling right. It's wrong to blame that on UTF-8.
--
dwmw2
^ permalink raw reply [flat|nested] 39+ messages in thread
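
Two claims in the message above lend themselves to a quick check: converting to a legacy character set can be lossy, while UTF-8 can carry everything the legacy sets can. The following is a rough sketch in Python, not anything from the thread; the helper `round_trips` and the sample string are made up for illustration.

```python
def round_trips(text, charset):
    """Encode `text` into `charset` (substituting '?' for anything the
    charset lacks), decode it again, and report whether it survived."""
    encoded = text.encode(charset, errors="replace")
    return encoded.decode(charset) == text


# Sample mixing a Latin-1 letter, a maths symbol and a Latin-2 letter.
sample = "Norrg\u00e5rd \u2200 Rafa\u0142"

for charset in ("iso-8859-1", "iso-8859-2", "utf-8"):
    print(charset, round_trips(sample, charset))
```

Each legacy charset drops at least one character of the sample, whereas UTF-8 preserves all of them. This says nothing about what git itself does on input or output, which is the point disputed in the replies.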
* Re: OT: character encodings (was: Linux 2.6.20-rc4)
2007-01-07 16:29 ` David Woodhouse
@ 2007-01-07 17:06 ` Russell King
2007-01-07 19:11 ` Jan Engelhardt
0 siblings, 1 reply; 39+ messages in thread
From: Russell King @ 2007-01-07 17:06 UTC (permalink / raw)
To: David Woodhouse; +Cc: Tilman Schmidt, Linux Kernel Mailing List
On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote:
> On Sun, 2007-01-07 at 15:38 +0000, Russell King wrote:
> > When a text file is stored on disk, there's no way to tell what
> > character set the characters in that file belong to. As a result,
> > ISO-8859-1 folk assume that all text files are ISO-8859-1 encoded.
> > UTF-8 folk assume all text files are UTF-8 encoded. This leads to
> > utter confusion.
>
> Only if you are making different assumptions about the _same_ set of
> files, on the _same_ system. But that would be silly.
$ git log | head -n 1000 | tail -n 200 > o
$ file -i o
o: text/plain; charset=us-ascii
$ git log | head -n 1000 | tail -n 300 > o
$ file -i o
o: text/plain; charset=us-ascii
$ git log | head -n 1000 | tail -n 400 > o
$ file -i o
o: text/plain; charset=utf-8
(and you know what charset the file is thought to have with all 1000
lines in it.)
All on a system with LANG set to en_GB (iow ISO-8859-1).
> > To see what I mean, try the following:
> >
> > $ git log | head -n 1000 > o
> > $ file -i o
> > o: text/x-c; charset=iso-8859-1
> >
> > According to that, the charset of the 'git log' output (which on that
> > test included Leonard's entry) is iso-8859-1, and by that Linus' mailer
> > was right to include it as ISO-8859-1.
>
> Yes. When you stored it on disk, the character set information was lost.
The same thing actually happens when I look at it via:
$ git log | head -n 1000 | less
but in this case the output is always interpreted by the terminal to be
in its character set.
> If you were running a mixed-charset system then attempting to recreate
> the lost information with heuristics and assumptions is obviously going
> to be problematic.
I'm not - I'm running a pure ISO-8859-1 system:
$ echo $LANG
en_GB
$ locale -k LC_CTYPE | grep charmap
charmap="ISO-8859-1"
> Actually, because UTF-8 allows me to run a system which is purely based
> on a single character set, I get better results when I try the same
> trick:
> shinybook /shiny/git/mtd-2.6 $ git log | head -n 1000 > o
> shinybook /shiny/git/mtd-2.6 $ file -i o
> o: text/plain; charset=utf-8
$ LANG=en_GB.UTF-8 locale -k LC_CTYPE | grep charmap
charmap="UTF-8"
$ LANG=en_GB.UTF-8 git log | head -n 1000 > o
$ LANG=en_GB.UTF-8 file -i o
o: text/x-c; charset=iso-8859-1
$ git version
git version 1.4.4.2
Looks like the output is iso-8859-1 even with UTF-8!
> > In reality, the output from git log contains an ad-hoc collection of
> > character sets making its interpretation under any one character set
> > incorrect.
>
> No, the contents of the git log ought to be UTF-8, unless people have
> been misusing it. Git stores its text in UTF-8 (by default), and is
> capable of converting to and from legacy character sets on input
> (git-commit) and output (git-log).
Git may store its text internally in UTF-8 (I don't know but I have no
evidence to suggest it does - in fact I have some evidence in this test
that it doesn't care about charsets.)
git log output on a non-UTF-8 system certainly is not in the host's
character set. For example:
$ LANG=en_GB.UTF-8 git log | head -n 1000 > o
$ LANG=en_GB git log | head -n 1000 > o2
$ diff -u o o2
That includes the UTF-8 encoded part of Leonard's name. It also includes
Rafa? Bilski's name which is non-UTF-8 encoded.
So, in both cases, exactly the same output bytestream was created
independent of the character set _actually_ being used, which both
includes untranslated UTF-8 and non-UTF-8 sequences. There is obviously
no character set translation going on with the output.
So we can add 'git' to my list of charset-broken programs.
Also, since we have recent data in the git repository which is non-UTF-8
as well, it is clear that there is no character set translation going
on at input time either. Looking at the git-commit script, there appears
to be no character set conversion going on in there either.
So, I think you'll find that the contents of git _is_ an ad-hoc collection
of character sets which people happen to have in use on their machines.
> > So, in short, UTF-8 is all fine and dandy if your _entire_ universe
> > is UTF-8 enabled. If you're operating in a mixed charset environment
> > it's one bloody big pain in the butt.
>
> A mixed charset environment was _already_ a pain in the butt, because
> almost nobody got labelling right. It's wrong to blame that on UTF-8.
I'm not talking about a mixed charset environment. I'm talking about
non-UTF-8 single charset environments being broken by programs which
universally think the universe is UTF-8 only.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
^ permalink raw reply [flat|nested] 39+ messages in thread
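
The manual inspection described in the message above (finding both UTF-8 and non-UTF-8 sequences in one byte stream) can be approximated with a small script. This is only a sketch in Python under stated assumptions: the function names `classify_line` and `summarise` are hypothetical, and the three sample lines are hand-built stand-ins, with the byte 0xB3 assumed for the non-UTF-8 name, rather than real `git log` output.

```python
def classify_line(raw):
    """Label one line of bytes as 'ascii', 'utf-8' or 'other'."""
    try:
        raw.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Some 8-bit charset; the bytes alone do not say which one.
        return "other"


def summarise(data):
    """Count how many lines of `data` fall into each class."""
    counts = {"ascii": 0, "utf-8": 0, "other": 0}
    for line in data.splitlines():
        counts[classify_line(line)] += 1
    return counts


sample = (
    b"Leonard Norrg\xc3\xa5rd (1):\n"  # UTF-8 encoded name
    b"Rafa\xb3 Bilski (2):\n"          # single 8-bit byte, not valid UTF-8
    b"Linus Torvalds (3):\n"           # plain ASCII
)
print(summarise(sample))
```

A whole-file guesser is in the same position as this script: its verdict depends on which non-ASCII bytes happen to fall inside the window it examines, which is consistent with the differing answers for the 200-, 300- and 400-line tails shown in the message.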
* Re: OT: character encodings (was: Linux 2.6.20-rc4)
2007-01-07 17:06 ` Russell King
@ 2007-01-07 19:11 ` Jan Engelhardt
2007-01-07 19:20 ` Russell King
2007-01-07 20:48 ` Willy Tarreau
0 siblings, 2 replies; 39+ messages in thread
From: Jan Engelhardt @ 2007-01-07 19:11 UTC (permalink / raw)
To: Russell King; +Cc: David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List
On Jan 7 2007 17:06, Russell King wrote:
>On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote:
>
>$ git log | head -n 1000 | tail -n 200 > o
>$ file -i o
>o: text/plain; charset=us-ascii
>$ git log | head -n 1000 | tail -n 300 > o
>$ file -i o
>o: text/plain; charset=us-ascii
>$ git log | head -n 1000 | tail -n 400 > o
>$ file -i o
>o: text/plain; charset=utf-8
I am inclined to say that "file" does not count, because it tries to guess an
ambiguous mapping from bytes to character set. Even more, file should be
_unable at all_ to distinguish an iso-8859-1 from an iso-8859-2 (or worse: 15)
file. This program is soo... forget it, it's not an argument. It works well for
headerful files, but text files don't really contain one. The next best thing
would be html, with a proper <meta http-equiv=Content> tag.
-`J'
--
^ permalink raw reply [flat|nested] 39+ messages in thread
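
The ambiguity described in the message above is easy to demonstrate: the same byte is a perfectly valid character in several ISO-8859 variants. A minimal sketch in Python, not from the thread, assuming that the mangled name in the shortlog was stored with the single byte 0xB3:

```python
raw = b"Rafa\xb3 Bilski"  # assumed byte value; not confirmed by the thread

# Every ISO-8859 variant below decodes the byte without complaint,
# so nothing in the data itself says which reading was intended.
for charset in ("iso-8859-1", "iso-8859-2", "iso-8859-15"):
    print(charset, raw.decode(charset))

# UTF-8 is stricter: a lone 0xB3 is not a valid sequence, so it is rejected.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print("utf-8 valid:", valid_utf8)
```

This asymmetry is why a guesser can often rule UTF-8 in or out but cannot choose between 8-bit charsets without outside information such as a header or label.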
* Re: OT: character encodings (was: Linux 2.6.20-rc4)
From: Russell King @ 2007-01-07 19:20 UTC
To: Jan Engelhardt
Cc: David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List

On Sun, Jan 07, 2007 at 08:11:38PM +0100, Jan Engelhardt wrote:
>
> On Jan 7 2007 17:06, Russell King wrote:
> >On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote:
> >
> >$ git log | head -n 1000 | tail -n 200 > o
> >$ file -i o
> >o: text/plain; charset=us-ascii
> >$ git log | head -n 1000 | tail -n 300 > o
> >$ file -i o
> >o: text/plain; charset=us-ascii
> >$ git log | head -n 1000 | tail -n 400 > o
> >$ file -i o
> >o: text/plain; charset=utf-8
>
> I am inclined to say that "file" does not count, because it tries to guess an
> ambiguous mapping from bytes to character set. Even more, file should be
> _unable at all_ to distinguish an iso-8859-1 from an iso-8859-2 (or worse: 15)
> file. This program is soo... forget it, it's not an argument. It works well for
> headerful files, but text files don't really contain one. The next best thing
> would be html, with a proper <meta http-equiv=Content> tag.

You're discarding a perfectly reasonable argument - file itself obviously
is not good at guessing the charset, but inspecting the resulting file
manually and identifying *both* ISO-8859 and UTF-8 character sequences
in there is pretty conclusive. As I did indeed do prior to sending that
message.

In this case, 'file' was doing a remarkably accurate job.

--
Russell King
 Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
 maintainer of:
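
The manual check described in this message (finding both ISO-8859 and UTF-8
byte sequences in the same log output) could be approximated line by line
along the following lines. This is only a rough sketch and not what 'file'
does internally: the function names classify_line and summarize and the
sample data are made up for the illustration, and a line that fails UTF-8
decoding is just reported as "8-bit", because the bytes alone cannot say
which ISO-8859 variant was meant.

```python
def classify_line(raw: bytes) -> str:
    """Return 'ascii', 'utf-8' or '8-bit' for one line of raw bytes."""
    try:
        raw.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: some legacy 8-bit charset, which one is unknown.
        return "8-bit"


def summarize(data: bytes) -> dict:
    """Count how many lines fall into each class."""
    counts = {"ascii": 0, "utf-8": 0, "8-bit": 0}
    for line in data.splitlines():
        counts[classify_line(line)] += 1
    return counts


# Hypothetical log excerpt: one plain line, one UTF-8 line, one Latin-1 line.
sample = (
    b"Author: Plain Name\n"
    + "Author: Ren\u00e9 One\n".encode("utf-8")
    + "Author: Ren\u00e9 Two\n".encode("iso-8859-1")
)

print(summarize(sample))
```

A result with non-zero counts in both the "utf-8" and "8-bit" buckets is the
mixed situation argued about here; a whole-file guess can only ever report
one of them.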
* Re: OT: character encodings (was: Linux 2.6.20-rc4) 2007-01-07 19:11 ` Jan Engelhardt 2007-01-07 19:20 ` Russell King @ 2007-01-07 20:48 ` Willy Tarreau 2007-01-07 23:37 ` Adrian Bunk 1 sibling, 1 reply; 39+ messages in thread From: Willy Tarreau @ 2007-01-07 20:48 UTC (permalink / raw) To: Jan Engelhardt Cc: Russell King, David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List On Sun, Jan 07, 2007 at 08:11:38PM +0100, Jan Engelhardt wrote: > > On Jan 7 2007 17:06, Russell King wrote: > >On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote: > > > >$ git log | head -n 1000 | tail -n 200 > o > >$ file -i o > >o: text/plain; charset=us-ascii > >$ git log | head -n 1000 | tail -n 300 > o > >$ file -i o > >o: text/plain; charset=us-ascii > >$ git log | head -n 1000 | tail -n 400 > o > >$ file -i o > >o: text/plain; charset=utf-8 > > I am inclined to say that "file" does not count, because it tries to guess an > ambiguous mapping from bytes to character set. Even more, file should be > _unable at all_ to distinguish an iso-8859-1 from an iso-8859-2 (or worse: 15) > file. This program is soo... forget it, it's not an argument. It works well for > headerful files, but text files don't really contain one. The next best thing > would be html, with a proper <meta http-equiv=Content> tag. The stupidity from the start up with those character sets is that they consider that a whole file is written with a given set. In fact, the charset should apply to characters themselves. At least, the quoted-printable, non-human friendly, encoding was the least stupid. Now that UTF8 comes everywhere, everyone receives tons of mangled mails, and even mailers which correctly support UTF8 and use it by default manage to shoot themselves in the foot when they reply to, or forward a mail. The system is completely broken because limited by design, and we have to learn to live with this brokenness. Willy ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: OT: character encodings (was: Linux 2.6.20-rc4) 2007-01-07 20:48 ` Willy Tarreau @ 2007-01-07 23:37 ` Adrian Bunk 2007-01-08 0:38 ` Willy Tarreau 0 siblings, 1 reply; 39+ messages in thread From: Adrian Bunk @ 2007-01-07 23:37 UTC (permalink / raw) To: Willy Tarreau Cc: Jan Engelhardt, Russell King, David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List On Sun, Jan 07, 2007 at 09:48:34PM +0100, Willy Tarreau wrote: > On Sun, Jan 07, 2007 at 08:11:38PM +0100, Jan Engelhardt wrote: > > > > On Jan 7 2007 17:06, Russell King wrote: > > >On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote: > > > > > >$ git log | head -n 1000 | tail -n 200 > o > > >$ file -i o > > >o: text/plain; charset=us-ascii > > >$ git log | head -n 1000 | tail -n 300 > o > > >$ file -i o > > >o: text/plain; charset=us-ascii > > >$ git log | head -n 1000 | tail -n 400 > o > > >$ file -i o > > >o: text/plain; charset=utf-8 > > > > I am inclined to say that "file" does not count, because it tries to guess an > > ambiguous mapping from bytes to character set. Even more, file should be > > _unable at all_ to distinguish an iso-8859-1 from an iso-8859-2 (or worse: 15) > > file. This program is soo... forget it, it's not an argument. It works well for > > headerful files, but text files don't really contain one. The next best thing > > would be html, with a proper <meta http-equiv=Content> tag. > > The stupidity from the start up with those character sets is that they > consider that a whole file is written with a given set. In fact, the > charset should apply to characters themselves. At least, the > quoted-printable, non-human friendly, encoding was the least stupid. I doubt doing this would really be worth the effort. In the 21st century, people should simply use UTF-8. 
> Now that UTF8 comes everywhere, everyone receives tons of mangled mails, > and even mailers which correctly support UTF8 and use it by default manage > to shoot themselves in the foot when they reply to, or forward a mail. The > system is completely broken because limited by design, and we have to learn > to live with this brokenness. Only if MUAs have broken charset support or don't set a correct "charset" header in the mails they are sending. If some software still can't handle UTF-8 correctly more than 10 years after it was introduced, that's not a brokenness you can blame on UTF-8. > Willy cu Adrian -- "Is there not promise of rain?" Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. "Only a promise," Lao Er said. Pearl S. Buck - Dragon Seed ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: OT: character encodings (was: Linux 2.6.20-rc4) 2007-01-07 23:37 ` Adrian Bunk @ 2007-01-08 0:38 ` Willy Tarreau 2007-01-08 1:03 ` Adrian Bunk 2007-01-08 19:53 ` Valdis.Kletnieks 0 siblings, 2 replies; 39+ messages in thread From: Willy Tarreau @ 2007-01-08 0:38 UTC (permalink / raw) To: Adrian Bunk Cc: Jan Engelhardt, Russell King, David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List On Mon, Jan 08, 2007 at 12:37:50AM +0100, Adrian Bunk wrote: > On Sun, Jan 07, 2007 at 09:48:34PM +0100, Willy Tarreau wrote: > > On Sun, Jan 07, 2007 at 08:11:38PM +0100, Jan Engelhardt wrote: > > > > > > On Jan 7 2007 17:06, Russell King wrote: > > > >On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote: > > > > > > > >$ git log | head -n 1000 | tail -n 200 > o > > > >$ file -i o > > > >o: text/plain; charset=us-ascii > > > >$ git log | head -n 1000 | tail -n 300 > o > > > >$ file -i o > > > >o: text/plain; charset=us-ascii > > > >$ git log | head -n 1000 | tail -n 400 > o > > > >$ file -i o > > > >o: text/plain; charset=utf-8 > > > > > > I am inclined to say that "file" does not count, because it tries to guess an > > > ambiguous mapping from bytes to character set. Even more, file should be > > > _unable at all_ to distinguish an iso-8859-1 from an iso-8859-2 (or worse: 15) > > > file. This program is soo... forget it, it's not an argument. It works well for > > > headerful files, but text files don't really contain one. The next best thing > > > would be html, with a proper <meta http-equiv=Content> tag. > > > > The stupidity from the start up with those character sets is that they > > consider that a whole file is written with a given set. In fact, the > > charset should apply to characters themselves. At least, the > > quoted-printable, non-human friendly, encoding was the least stupid. > > I doubt doing this would really be worth the effort. > > In the 21st century, people should simply use UTF-8. 
> > > Now that UTF8 comes everywhere, everyone receives tons of mangled mails, > > and even mailers which correctly support UTF8 and use it by default manage > > to shoot themselves in the foot when they reply to, or forward a mail. The > > system is completely broken because limited by design, and we have to learn > > to live with this brokenness. > > Only if MUAs have broken charset support or don't set a correct > "charset" header in the mails they are sending. > > If some software still can't handle UTF-8 correctly more than 10 years > after it was introduced, that's not a brokenness you can blame on UTF-8. I'm not blaming UTF-8 per se, but people who still believe in encoding *whole documents*. Copy-paste, text insertion, git output, etc... everything has a good reason not to be in the same encoding as what your MUA believes. If major MUAs still have problems with UTF-8 10 years after it was introduced, it's clearly the proof of a flaw in the initial design. And I'm not even discussing the stupidity which requires that you read a whole text to get its number of characters ! Willy ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: OT: character encodings (was: Linux 2.6.20-rc4) 2007-01-08 0:38 ` Willy Tarreau @ 2007-01-08 1:03 ` Adrian Bunk 2007-01-08 1:14 ` Willy Tarreau 2007-01-08 6:52 ` Jan Engelhardt 2007-01-08 19:53 ` Valdis.Kletnieks 1 sibling, 2 replies; 39+ messages in thread From: Adrian Bunk @ 2007-01-08 1:03 UTC (permalink / raw) To: Willy Tarreau Cc: Jan Engelhardt, Russell King, David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List On Mon, Jan 08, 2007 at 01:38:57AM +0100, Willy Tarreau wrote: > On Mon, Jan 08, 2007 at 12:37:50AM +0100, Adrian Bunk wrote: > > On Sun, Jan 07, 2007 at 09:48:34PM +0100, Willy Tarreau wrote: > > > On Sun, Jan 07, 2007 at 08:11:38PM +0100, Jan Engelhardt wrote: > > > > > > > > On Jan 7 2007 17:06, Russell King wrote: > > > > >On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote: > > > > > > > > > >$ git log | head -n 1000 | tail -n 200 > o > > > > >$ file -i o > > > > >o: text/plain; charset=us-ascii > > > > >$ git log | head -n 1000 | tail -n 300 > o > > > > >$ file -i o > > > > >o: text/plain; charset=us-ascii > > > > >$ git log | head -n 1000 | tail -n 400 > o > > > > >$ file -i o > > > > >o: text/plain; charset=utf-8 > > > > > > > > I am inclined to say that "file" does not count, because it tries to guess an > > > > ambiguous mapping from bytes to character set. Even more, file should be > > > > _unable at all_ to distinguish an iso-8859-1 from an iso-8859-2 (or worse: 15) > > > > file. This program is soo... forget it, it's not an argument. It works well for > > > > headerful files, but text files don't really contain one. The next best thing > > > > would be html, with a proper <meta http-equiv=Content> tag. > > > > > > The stupidity from the start up with those character sets is that they > > > consider that a whole file is written with a given set. In fact, the > > > charset should apply to characters themselves. At least, the > > > quoted-printable, non-human friendly, encoding was the least stupid. 
> > > > I doubt doing this would really be worth the effort. > > > > In the 21st century, people should simply use UTF-8. > > > > > Now that UTF8 comes everywhere, everyone receives tons of mangled mails, > > > and even mailers which correctly support UTF8 and use it by default manage > > > to shoot themselves in the foot when they reply to, or forward a mail. The > > > system is completely broken because limited by design, and we have to learn > > > to live with this brokenness. > > > > Only if MUAs have broken charset support or don't set a correct > > "charset" header in the mails they are sending. > > > > If some software still can't handle UTF-8 correctly more than 10 years > > after it was introduced, that's not a brokenness you can blame on UTF-8. > > I'm not blaming UTF-8 per se, but people who still believe in encoding > *whole documents*. Copy-paste, text insertion, git output, etc... everything > has a good reason not to be in the same encoding as what your MUA believes. How would you do this technically in a way that it's significantely easier than simply finishing the UTF=8 transition? > If major MUAs still have problems with UTF-8 10 years after it was introduced, > it's clearly the proof of a flaw in the initial design. And I'm not even > discussing the stupidity which requires that you read a whole text to get > its number of characters ! The only major MUA not supporting UTF-8 is Eudora. And if you are talking about buggy old pine, in the latest development version [1] it does not only become open source, it also got some working Unicode support. > Willy cu Adrian [1] Alpine -- "Is there not promise of rain?" Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. "Only a promise," Lao Er said. Pearl S. Buck - Dragon Seed ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: OT: character encodings (was: Linux 2.6.20-rc4) 2007-01-08 1:03 ` Adrian Bunk @ 2007-01-08 1:14 ` Willy Tarreau 2007-01-08 1:45 ` Adrian Bunk 2007-01-08 6:52 ` Jan Engelhardt 1 sibling, 1 reply; 39+ messages in thread From: Willy Tarreau @ 2007-01-08 1:14 UTC (permalink / raw) To: Adrian Bunk Cc: Jan Engelhardt, Russell King, David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List On Mon, Jan 08, 2007 at 02:03:37AM +0100, Adrian Bunk wrote: > On Mon, Jan 08, 2007 at 01:38:57AM +0100, Willy Tarreau wrote: > > On Mon, Jan 08, 2007 at 12:37:50AM +0100, Adrian Bunk wrote: > > > On Sun, Jan 07, 2007 at 09:48:34PM +0100, Willy Tarreau wrote: > > > > On Sun, Jan 07, 2007 at 08:11:38PM +0100, Jan Engelhardt wrote: > > > > > > > > > > On Jan 7 2007 17:06, Russell King wrote: > > > > > >On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote: > > > > > > > > > > > >$ git log | head -n 1000 | tail -n 200 > o > > > > > >$ file -i o > > > > > >o: text/plain; charset=us-ascii > > > > > >$ git log | head -n 1000 | tail -n 300 > o > > > > > >$ file -i o > > > > > >o: text/plain; charset=us-ascii > > > > > >$ git log | head -n 1000 | tail -n 400 > o > > > > > >$ file -i o > > > > > >o: text/plain; charset=utf-8 > > > > > > > > > > I am inclined to say that "file" does not count, because it tries to guess an > > > > > ambiguous mapping from bytes to character set. Even more, file should be > > > > > _unable at all_ to distinguish an iso-8859-1 from an iso-8859-2 (or worse: 15) > > > > > file. This program is soo... forget it, it's not an argument. It works well for > > > > > headerful files, but text files don't really contain one. The next best thing > > > > > would be html, with a proper <meta http-equiv=Content> tag. > > > > > > > > The stupidity from the start up with those character sets is that they > > > > consider that a whole file is written with a given set. In fact, the > > > > charset should apply to characters themselves. 
At least, the > > > > quoted-printable, non-human friendly, encoding was the least stupid. > > > > > > I doubt doing this would really be worth the effort. > > > > > > In the 21st century, people should simply use UTF-8. > > > > > > > Now that UTF8 comes everywhere, everyone receives tons of mangled mails, > > > > and even mailers which correctly support UTF8 and use it by default manage > > > > to shoot themselves in the foot when they reply to, or forward a mail. The > > > > system is completely broken because limited by design, and we have to learn > > > > to live with this brokenness. > > > > > > Only if MUAs have broken charset support or don't set a correct > > > "charset" header in the mails they are sending. > > > > > > If some software still can't handle UTF-8 correctly more than 10 years > > > after it was introduced, that's not a brokenness you can blame on UTF-8. > > > > I'm not blaming UTF-8 per se, but people who still believe in encoding > > *whole documents*. Copy-paste, text insertion, git output, etc... everything > > has a good reason not to be in the same encoding as what your MUA believes. > > How would you do this technically in a way that it's significantely > easier than simply finishing the UTF=8 transition? In how many decades do you think the transition will be finished ? > > If major MUAs still have problems with UTF-8 10 years after it was introduced, > > it's clearly the proof of a flaw in the initial design. And I'm not even > > discussing the stupidity which requires that you read a whole text to get > > its number of characters ! > > The only major MUA not supporting UTF-8 is Eudora. > > And if you are talking about buggy old pine, in the latest development > version [1] it does not only become open source, it also got some > working Unicode support. No, I'm not speaking about "not supporting", but "having problems". Every one of us has already received mails from Thunderbird, Outlook, Notes, etc... 
with erroneously encoded characters because of this:

  - a UTF8 MUA sends a mail to a non-UTF8 aware one.
  - this last one only sees double chars. When it wants to forward the mail
    to someone else, it keeps the chars verbatim, and sets the encoding type
    to its own, something like iso8859-1 for instance.
  - the final MUA, which is UTF8-aware, is very happy to detect lots of UTF8
    combinations in the forwarded mail and decides that everything in it is
    UTF8, then you get lots of chars mangled in the mail, in the middle of
    UTF8 combinations. Then, this crappy mail can be forwarded as long as
    you want between UTF8 MUAs, they will all apply heuristics and do the
    wrong thing: consider the *whole* document with *one* type.

What I find even funnier is when, for no apparent reason, the same MUA is used
on both ends and the contents get mangled because the sender copies a portion
of text from somewhere else.

Anyway, I don't want to follow up on this thread, it's *highly* off-topic here.

Cheers,
Willy
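
The forwarding chain described in this message can be reproduced with a few
lines of Python. This is a sketch under assumptions: the two sample words
and the variable names are invented for the illustration, and the "final
MUA" is modelled as simply decoding the whole body as UTF-8 with replacement
characters, which real mailers may or may not do.

```python
# Hypothetical message parts; the words are only sample data.
quoted = "Trag\u00f6die".encode("utf-8")          # sent by the UTF-8 MUA
added = "Gr\u00f6\u00dfe".encode("iso-8859-1")    # typed in the legacy MUA

# The legacy MUA forwards both parts verbatim in one body labelled iso-8859-1.
forwarded = quoted + b" / " + added

# Reading the whole body as labelled mangles the quoted UTF-8 part.
as_labelled = forwarded.decode("iso-8859-1")

# Guessing UTF-8 for the whole body mangles the legacy part instead.
as_guessed = forwarded.decode("utf-8", errors="replace")

# Only a per-part decode recovers both halves, and a single charset label
# for the whole document cannot express that.
per_part = quoted.decode("utf-8") + " / " + added.decode("iso-8859-1")

print(as_labelled)
print(as_guessed)
print(per_part)
```

Whichever single charset is applied to the whole body, one of the two parts
comes out damaged; only the third, per-part reading is clean.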
* Re: OT: character encodings (was: Linux 2.6.20-rc4) 2007-01-08 1:14 ` Willy Tarreau @ 2007-01-08 1:45 ` Adrian Bunk 0 siblings, 0 replies; 39+ messages in thread From: Adrian Bunk @ 2007-01-08 1:45 UTC (permalink / raw) To: Willy Tarreau Cc: Jan Engelhardt, Russell King, David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List On Mon, Jan 08, 2007 at 02:14:41AM +0100, Willy Tarreau wrote: > On Mon, Jan 08, 2007 at 02:03:37AM +0100, Adrian Bunk wrote: > > On Mon, Jan 08, 2007 at 01:38:57AM +0100, Willy Tarreau wrote: > > > On Mon, Jan 08, 2007 at 12:37:50AM +0100, Adrian Bunk wrote: > > > > On Sun, Jan 07, 2007 at 09:48:34PM +0100, Willy Tarreau wrote: > > > > > On Sun, Jan 07, 2007 at 08:11:38PM +0100, Jan Engelhardt wrote: > > > > > > > > > > > > On Jan 7 2007 17:06, Russell King wrote: > > > > > > >On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote: > > > > > > > > > > > > > >$ git log | head -n 1000 | tail -n 200 > o > > > > > > >$ file -i o > > > > > > >o: text/plain; charset=us-ascii > > > > > > >$ git log | head -n 1000 | tail -n 300 > o > > > > > > >$ file -i o > > > > > > >o: text/plain; charset=us-ascii > > > > > > >$ git log | head -n 1000 | tail -n 400 > o > > > > > > >$ file -i o > > > > > > >o: text/plain; charset=utf-8 > > > > > > > > > > > > I am inclined to say that "file" does not count, because it tries to guess an > > > > > > ambiguous mapping from bytes to character set. Even more, file should be > > > > > > _unable at all_ to distinguish an iso-8859-1 from an iso-8859-2 (or worse: 15) > > > > > > file. This program is soo... forget it, it's not an argument. It works well for > > > > > > headerful files, but text files don't really contain one. The next best thing > > > > > > would be html, with a proper <meta http-equiv=Content> tag. > > > > > > > > > > The stupidity from the start up with those character sets is that they > > > > > consider that a whole file is written with a given set. 
In fact, the > > > > > charset should apply to characters themselves. At least, the > > > > > quoted-printable, non-human friendly, encoding was the least stupid. > > > > > > > > I doubt doing this would really be worth the effort. > > > > > > > > In the 21st century, people should simply use UTF-8. > > > > > > > > > Now that UTF8 comes everywhere, everyone receives tons of mangled mails, > > > > > and even mailers which correctly support UTF8 and use it by default manage > > > > > to shoot themselves in the foot when they reply to, or forward a mail. The > > > > > system is completely broken because limited by design, and we have to learn > > > > > to live with this brokenness. > > > > > > > > Only if MUAs have broken charset support or don't set a correct > > > > "charset" header in the mails they are sending. > > > > > > > > If some software still can't handle UTF-8 correctly more than 10 years > > > > after it was introduced, that's not a brokenness you can blame on UTF-8. > > > > > > I'm not blaming UTF-8 per se, but people who still believe in encoding > > > *whole documents*. Copy-paste, text insertion, git output, etc... everything > > > has a good reason not to be in the same encoding as what your MUA believes. > > > > How would you do this technically in a way that it's significantely > > easier than simply finishing the UTF=8 transition? > > In how many decades do you think the transition will be finished ? > > > > If major MUAs still have problems with UTF-8 10 years after it was introduced, > > > it's clearly the proof of a flaw in the initial design. And I'm not even > > > discussing the stupidity which requires that you read a whole text to get > > > its number of characters ! > > > > The only major MUA not supporting UTF-8 is Eudora. > > > > And if you are talking about buggy old pine, in the latest development > > version [1] it does not only become open source, it also got some > > working Unicode support. 
>
> No, I'm not speaking about "not supporting", but "having problems". Every
> one of us has already received mails from Thunderbird, Outlook, Notes, etc...
> with erroneously encoded characters because of this:
>
>   - a UTF8 MUA sends a mail to a non-UTF8 aware one.

"non-UTF8 aware one" = Eudora
(BTW: there's no Linux version)

>   - this last one only sees double chars. When it wants to forward the mail
>     to someone else, it keeps the chars verbatim, and sets the encoding type
>     to its own, something like iso8859-1 for instance.

Let's not base everything on the one broken non-Linux MUA.

>   - the final MUA, which is UTF8-aware, is very happy to detect lots of UTF8
>     combinations in the forwarded mail and decides that everything in it is
>     UTF8, then you get lots of chars mangled in the mail, in the middle of
>     UTF8 combinations. Then, this crappy mail can be forwarded as long as
>     you want between UTF8 MUAs, they will all apply heuristics and do the
>     wrong thing: consider the *whole* document with *one* type.

Which MUAs exactly ignore the "charset" of an email and try their own
guessing instead?

Or which MUAs exactly do not set a "charset" so that the receiving MUA
might have a reason for guessing?

> What I find even funnier is when, for no apparent reason, the same MUA is used
> on both ends and the contents get mangled because the sender copies a portion
> of text from somewhere else.

With which MUA and which charset settings of the users?

> Anyway, I don't want to follow up on this thread, it's *highly* off-topic here.

People want their names written correctly in changelogs.

It is therefore on-topic if the result is something like "kernel
maintainers shouldn't be using Eudora" or "kernel maintainers using pine
should upgrade to Alpine" or something similar.

> Cheers,
> Willy

cu
Adrian

--
       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed
* Re: OT: character encodings (was: Linux 2.6.20-rc4)
From: Jan Engelhardt @ 2007-01-08 6:52 UTC
To: Adrian Bunk
Cc: Willy Tarreau, Russell King, David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List

On Jan 8 2007 02:03, Adrian Bunk wrote:
>
>The only major MUA not supporting UTF-8 is Eudora.
>
>And if you are talking about buggy old pine, in the latest development
>version [1] it does not only become open source, it also got some
>working Unicode support.

Uhm, just for the record, I run pine 4.61 where my mail delivers to,
and Unicode works, yes, including the spam.

	-`J'
--
* Re: OT: character encodings (was: Linux 2.6.20-rc4)
From: Adrian Bunk @ 2007-01-08 8:02 UTC
To: Jan Engelhardt
Cc: Willy Tarreau, Russell King, David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List

On Mon, Jan 08, 2007 at 07:52:48AM +0100, Jan Engelhardt wrote:
>
> On Jan 8 2007 02:03, Adrian Bunk wrote:
> >
> >The only major MUA not supporting UTF-8 is Eudora.
> >
> >And if you are talking about buggy old pine, in the latest development
> >version [1] it does not only become open source, it also got some
> >working Unicode support.
>
> Uhm, just for the record, I run pine 4.61 where my mail delivers to,
> and Unicode works, yes, including the spam.

For some years I've been using pine only as a newsreader, and I remember
some display problems with Unicode characters that are fixed in Alpine.

It might be that the support in pine was already better than I thought
(but my switch to MUA was so many years ago...).

> 	-`J'

cu
Adrian

--
       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed
* Re: OT: character encodings (was: Linux 2.6.20-rc4)
From: Valdis.Kletnieks @ 2007-01-08 19:53 UTC
To: Willy Tarreau
Cc: Adrian Bunk, Jan Engelhardt, Russell King, David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List

On Mon, 08 Jan 2007 01:38:57 +0100, Willy Tarreau said:

> it's clearly the proof of a flaw in the initial design. And I'm not even
> discussing the stupidity which requires that you read a whole text to get
> its number of characters !

It's no more stupid than the *current* situation with Linux kernel code, where
the stupidity actually requires that even if you know that there are only 60
characters on a given line, you actually have to look at each one in order to
figure out if the line goes past column 80....
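
Both complaints in this exchange come down to a linear scan, which can be
sketched as follows. The function names utf8_char_count and display_width
are hypothetical, the tab stop of 8 is an assumption matching the usual
kernel convention, and the sketch counts code points in valid UTF-8 and
treats every non-tab character as one column wide, which ignores combining
and double-width characters.

```python
def utf8_char_count(raw: bytes) -> int:
    """Count code points in valid UTF-8 by skipping continuation bytes.

    Continuation bytes have the bit pattern 10xxxxxx; every other byte
    starts a new code point, so each byte has to be looked at once.
    """
    return sum(1 for b in raw if (b & 0xC0) != 0x80)


def display_width(line: str, tabstop: int = 8) -> int:
    """Return the column reached after the line, expanding tabs."""
    col = 0
    for ch in line:
        if ch == "\t":
            # A tab advances to the next multiple of the tab stop.
            col += tabstop - (col % tabstop)
        else:
            col += 1
    return col


word = "Trag\u00f6die".encode("utf-8")
print(len(word), utf8_char_count(word))              # bytes vs. characters

code_line = "\t\tint x;"
print(len(code_line), display_width(code_line))      # characters vs. columns
```

In both cases the stored length (bytes, or characters) differs from the
quantity of interest (characters, or columns), and the only way to get the
latter is to walk the data.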
* Re: OT: character encodings (was: Linux 2.6.20-rc4)
From: Alan @ 2007-01-07 18:21 UTC
To: Russell King
Cc: David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List

> So, in short, UTF-8 is all fine and dandy if your _entire_ universe
> is UTF-8 enabled. If you're operating in a mixed charset environment
> it's one bloody big pain in the butt.

Net ASCII is 7bit and is 1:1 mapped with UTF-8 unicode. It's just old
broken 8bit encodings that are problematic.

The kernel maintainers/help/config pretty consistently use UTF8
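
The 1:1 mapping claimed here, and the ambiguity of the 8-bit range, can be
checked with a short sketch. The helper names ascii_is_utf8_subset and
readings_of are invented for the example, and byte 0xA4 is picked only
because it is a well-known point where iso-8859-1 and iso-8859-15 differ.

```python
def ascii_is_utf8_subset() -> bool:
    """Check that all 128 ASCII bytes decode identically as ASCII and UTF-8."""
    ascii_bytes = bytes(range(128))
    text = ascii_bytes.decode("ascii")
    return text == ascii_bytes.decode("utf-8") and text.encode("utf-8") == ascii_bytes


def readings_of(byte: bytes) -> dict:
    """Decode one high byte under several legacy 8-bit charsets."""
    return {codec: byte.decode(codec) for codec in ("iso-8859-1", "iso-8859-15", "cp437")}


def is_valid_utf8(raw: bytes) -> bool:
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(ascii_is_utf8_subset())
print(readings_of(b"\xa4"))
print(is_valid_utf8(b"\xa4"))
```

The same single byte reads as a currency sign, a euro sign or an n with
tilde depending on the assumed charset, and is not valid UTF-8 at all, which
is why unlabelled 8-bit text is the problematic case.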
* Re: OT: character encodings (was: Linux 2.6.20-rc4)
From: Jan Engelhardt @ 2007-01-07 19:12 UTC
To: Alan
Cc: Russell King, David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List

On Jan 7 2007 18:21, Alan wrote:
>
>> So, in short, UTF-8 is all fine and dandy if your _entire_ universe
>> is UTF-8 enabled. If you're operating in a mixed charset environment
>> it's one bloody big pain in the butt.
>
>Net ASCII is 7bit and is 1:1 mapped with UTF-8 unicode. It's just old
>broken 8bit encodings that are problematic.
>
>The kernel maintainers/help/config pretty consistently use UTF8

I've seen a lot of places that don't do so. Want a patch?

	-`J'
--
* Re: OT: character encodings (was: Linux 2.6.20-rc4)
From: Alan @ 2007-01-07 22:30 UTC
To: Jan Engelhardt
Cc: Russell King, David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List

> >The kernel maintainers/help/config pretty consistently use UTF8
>
> I've seen a lot of places that don't do so. Want a patch?

I think that would be a good idea - and add it to the coding/docs specs
that documentation is UTF-8. Code should IMHO say 7bit though.

Alan
* Re: OT: character encodings (was: Linux 2.6.20-rc4)
From: Jan Engelhardt @ 2007-01-08 1:22 UTC
To: Alan
Cc: Russell King, David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List

On Jan 7 2007 22:30, Alan wrote:
>
>> >The kernel maintainers/help/config pretty consistently use UTF8
>>
>> I've seen a lot of places that don't do so. Want a patch?
>
>I think that would be a good idea - and add it to the coding/docs specs
>that documentation is UTF-8. Code should IMHO say 7bit though.

Hm, what do the lists of authors in .c/.h files and the kerneldoc comments
in .c/.h files belong to: doc or code?

	-`J'
--
* Re: OT: character encodings (was: Linux 2.6.20-rc4)
From: Jan Engelhardt @ 2007-01-08 20:17 UTC
To: Alan
Cc: Russell King, David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List

On Jan 8 2007 02:22, Jan Engelhardt wrote:
>On Jan 7 2007 22:30, Alan wrote:
>>
>>> >The kernel maintainers/help/config pretty consistently use UTF8
>>>
>>> I've seen a lot of places that don't do so. Want a patch?
>>
>>I think that would be a good idea - and add it to the coding/docs specs
>>that documentation is UTF-8. Code should IMHO say 7bit though.

Most memorable issues:

 * "don<decimal-180>t" (standalone accent aigu) rather than "don't" (apostrophe)
 * "<decimal-160>", non breaking spaces
 * cp437 encoding in some files (heh, heh, DOS!)
 * iso8859-1/utf-8 mixed in some files

My compose key is hot now...

None of you people screw that patch with your buggy MUAs! I'll pack
it up into a .bz2 to get it marked as application/octet-stream to
not even give your MUA the chance to. ;-) [and because it's 221 K
uncompressed and I am not sure if splitting it up makes much sense for
such 'trivial' changes, or not?]

Signed-off-by: Jan Engelhardt <jengelh@gmx.de>

	-`J'
--

[-- Attachment #2: Type: APPLICATION/x-bzip2, Size: 42588 bytes --]
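
A scan for the first two issues in the list above could look roughly like
the following. This is a hypothetical sketch and not the method used to
produce the attached patch: the name find_suspects, the SUSPECT table and
the sample bytes are invented, and falling back to iso-8859-1 for lines that
are not valid UTF-8 is an assumption (a cp437 file would need its own
decoding step).

```python
# Code points from the list: decimal 180 (acute accent) and 160 (NBSP).
SUSPECT = {
    0xB4: "acute accent used as apostrophe",
    0xA0: "non-breaking space",
}


def find_suspects(data: bytes) -> list:
    """Return (line number, description) pairs for questionable lines."""
    hits = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        try:
            text = line.decode("utf-8")
        except UnicodeDecodeError:
            # Assumption: treat undecodable lines as Latin-1 to inspect them.
            text = line.decode("iso-8859-1")
            hits.append((lineno, "not valid UTF-8"))
        for ch in text:
            if ord(ch) in SUSPECT:
                hits.append((lineno, SUSPECT[ord(ch)]))
    return hits


# Hypothetical file contents: a Latin-1 accent, a clean line, a UTF-8 NBSP.
sample = b"don\xb4t do this\nplain ascii\nfoo\xc2\xa0bar\n"

for lineno, problem in find_suspects(sample):
    print(lineno, problem)
```

The output only points at candidate lines; deciding what each byte was meant
to be still needs a human, as the description of the patch suggests.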
* Re: OT: character encodings (was: Linux 2.6.20-rc4) 2007-01-08 20:17 ` Jan Engelhardt @ 2007-01-08 22:00 ` Ken Moffat 2007-01-08 23:21 ` Jan Engelhardt 0 siblings, 1 reply; 39+ messages in thread From: Ken Moffat @ 2007-01-08 22:00 UTC (permalink / raw) To: Jan Engelhardt Cc: Alan, Russell King, David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List On Mon, Jan 08, 2007 at 09:17:06PM +0100, Jan Engelhardt wrote: > > On Jan 8 2007 02:22, Jan Engelhardt wrote: > >On Jan 7 2007 22:30, Alan wrote: > >> > >>> >The kernel maintainers/help/config pretty consistently use UTF8 > >>> > >>> I've seen a lot of places that don't do so. Want a patch? > >> > >>I think that would be a good idea - and add it to the coding/docs specs > >>that documentation is UTF-8. Code should IMHO say 7bit though. > > Most memorable issues: > > * "don<decimal-180>t" (standalone accent aigu) rather than "don't" (apostrophe) > * "<decimal-160>", non breaking spaces > * cp437 encoding in some files (heh, heh, DOS!) > * iso8859-1/utf-8 mixed in some files Looks nicely done, but I query the postal address changes in Documentation/cdrom/sbpcd - that seems to be a change of address (without anything to explain it). Everything else seems to be just character-set conversion or the occasional translation of comments into English. (And no, I didn't attempt to review the character-set changes; even if there is an occasional error, it will be better than where we are now, and easy to patch.) > > My compose key is hot now... I prefer the AltGr dead keys in X (they seem to work more reliably for me), but I guess I'm straying OT. > > None of you people screw that patch with your buggy MUAs! I'll pack > it up into a .bz2 to get it marked as application/octet-stream to > not even give your MUA the chance to. ;-) [and because it's 221 K > uncompressed and I am not sure if splitting it up makes much sense for > such 'trivial' changes, or not?] 
> > Signed-off-by: Jan Engelhardt <jengelh@gmx.de> > > > -`J' > -- Thanks for doing this, I hope it wasn't in vain. Ken -- das eine Mal als Tragödie, das andere Mal als Farce ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: OT: character encodings (was: Linux 2.6.20-rc4) 2007-01-08 22:00 ` Ken Moffat @ 2007-01-08 23:21 ` Jan Engelhardt 2007-01-08 23:34 ` Eberhard Moenkeberg 0 siblings, 1 reply; 39+ messages in thread From: Jan Engelhardt @ 2007-01-08 23:21 UTC (permalink / raw) To: Eberhard Mönkeberg Cc: Alan, Russell King, David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List, Ken Moffat On Jan 8 2007 22:00, Ken Moffat wrote: > Looks nicely done, but I query the postal address changes in >Documentation/cdrom/sbpcd - that seems to be a change of address >(without anything to explain it). Eberhard [cc], please attach an Acked-by: YourName <emailaddress> keep Ccs, thanks ;-) [thread/patch: http://lkml.org/lkml/2007/1/8/222 ] -`J' -- ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: OT: character encodings (was: Linux 2.6.20-rc4) 2007-01-08 23:21 ` Jan Engelhardt @ 2007-01-08 23:34 ` Eberhard Moenkeberg 0 siblings, 0 replies; 39+ messages in thread From: Eberhard Moenkeberg @ 2007-01-08 23:34 UTC (permalink / raw) To: Jan Engelhardt Cc: Alan, Russell King, David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List, Ken Moffat Hi, On Tue, 9 Jan 2007, Jan Engelhardt wrote: > On Jan 8 2007 22:00, Ken Moffat wrote: > > Looks nicely done, but I query the postal address changes in > >Documentation/cdrom/sbpcd - that seems to be a change of address > >(without anything to explain it). > > Eberhard [cc], please attach an Acked-by: YourName <emailaddress> > keep Ccs, thanks ;-) > > [thread/patch: http://lkml.org/lkml/2007/1/8/222 ] Acked-by: Eberhard Moenkeberg <emoenke@gwdg.de> Jan had contacted me before, and I had sent him my new address data. This very young guy is doing a really good job. ;-)) Cheers -e -- Eberhard Moenkeberg (emoenke@gwdg.de, em@kki.org) ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: OT: character encodings (was: Linux 2.6.20-rc4) 2007-01-07 22:30 ` Alan 2007-01-08 1:22 ` Jan Engelhardt @ 2007-01-08 16:14 ` Pavel Machek 2007-01-08 22:17 ` Tim Pepper 1 sibling, 1 reply; 39+ messages in thread From: Pavel Machek @ 2007-01-08 16:14 UTC (permalink / raw) To: Alan Cc: Jan Engelhardt, Russell King, David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List On Sun 2007-01-07 22:30:55, Alan wrote: > > >The kernel maintainers/help/config pretty consistently use UTF8 > > > > I've seen a lot of places that don't do so. Want a patch? > > I think that would be a good idea - and add it to the coding/docs specs > that documentation is UTF-8. Code should IMHO say 7bit though. Yes, yes, please. I have been flamed when someone tried to do 8bit patch, and I was trying to NAK it... Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: OT: character encodings (was: Linux 2.6.20-rc4) 2007-01-08 16:14 ` Pavel Machek @ 2007-01-08 22:17 ` Tim Pepper 2007-01-08 23:30 ` Jan Engelhardt 0 siblings, 1 reply; 39+ messages in thread From: Tim Pepper @ 2007-01-08 22:17 UTC (permalink / raw) To: Pavel Machek Cc: Alan, Jan Engelhardt, Russell King, David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List On 1/8/07, Pavel Machek <pavel@ucw.cz> wrote: > On Sun 2007-01-07 22:30:55, Alan wrote: > > I think that would be a good idea - and add it to the coding/docs specs > > that documentation is UTF-8. Code should IMHO say 7bit though. > > Yes, yes, please. > > I have been flamed when someone tried to do 8bit patch, and I was > trying to NAK it... Could this get put in Documentation/CodingStyle? And an item added to the kernel janitors' list to fix up 8bit files? Last I looked, trying to decide if there was a standard here, I found a mish-mash of encodings based on the output of file vs Linus' git tree. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: OT: character encodings (was: Linux 2.6.20-rc4) 2007-01-08 22:17 ` Tim Pepper @ 2007-01-08 23:30 ` Jan Engelhardt 0 siblings, 0 replies; 39+ messages in thread From: Jan Engelhardt @ 2007-01-08 23:30 UTC (permalink / raw) To: Tim Pepper Cc: Pavel Machek, Alan, Russell King, David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List On Jan 8 2007 14:17, Tim Pepper wrote: > On 1/8/07, Pavel Machek <pavel@ucw.cz> wrote: >> On Sun 2007-01-07 22:30:55, Alan wrote: >> > I think that would be a good idea - and add it to the coding/docs >> > specs >> > that documentation is UTF-8. Code should IMHO say 7bit though. >> >> Yes, yes, please. >> >> I have been flamed when someone tried to do 8bit patch, and I was >> trying to NAK it... > > Could this get put in Documentation/CodingStyle? Someone do that. > And an item added to > the kernel janitors' list to fix up 8bit files? Last I looked trying That's already been just done by me. http://lkml.org/lkml/2007/1/8/222 > to decided if there was a standard here I found a mish-mash of > encodings based output of file vs Linus' git tree. -`J' -- ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: OT: character encodings (was: Linux 2.6.20-rc4) 2007-01-07 18:21 ` Alan 2007-01-07 19:12 ` Jan Engelhardt @ 2007-01-07 19:17 ` Russell King 2007-01-07 19:58 ` Robin Rosenberg ` (2 more replies) 1 sibling, 3 replies; 39+ messages in thread From: Russell King @ 2007-01-07 19:17 UTC (permalink / raw) To: Alan; +Cc: David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List On Sun, Jan 07, 2007 at 06:21:51PM +0000, Alan wrote: > > So, in short, UTF-8 is all fine and dandy if your _entire_ universe > > is UTF-8 enabled. If you're operating in a mixed charset environment > > it's one bloody big pain in the butt. > > Net ASCII is 7bit and is 1:1 mapped with UTF-8 unicode. The same is true of ISO-8859-1. > It's just old broken 8bit encodings that are problematic. > > The kernel maintainers/help/config pretty consistently use UTF8 As I've tried to point out, that's not universally true. For instance: commit 24ebead82bbf9785909d4cf205e2df5e9ff7da32 tree 921f686860e918a01c3d3fb6cd106ba82bf4ace6 parent 264166e604a7e14c278e31cadd1afb06a7d51a11 author Rafa³ Bilski <rafalbilski@interia.pl> 1167691774 +0100 committer Dave Jones <davej@redhat.com> 1167799119 -0500 and looking at that "author" closer with od: 0000140 74 68 6f 72 20 52 61 66 61 b3 20 42 69 6c 73 6b t h o r R a f a ³ B i l s k clearly not UTF-8. I doubt whether any of the commits I do on my en_GB ISO-8859-1 systems end up being UTF-8 encoded. And _this_ is the problem when it comes to generating the logs, irrespective of whether or not Linus loads UTF-8 data into an ISO-8859-1 message. For all we know, Linus' system could be using an ISO-8859 charset rather than UTF-8. But the point is there is charset damage which has happened _long_ before Linus' action. There is no character set defined for the contents of git repositories, and as such the output of the git tools can not be interpreted as any one single character set. 
All that UTF-8 has done is added to the "which charset is this data" problem rather than actually solving any proper real life problem. -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: ^ permalink raw reply [flat|nested] 39+ messages in thread
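The od check in this message can be reproduced programmatically. Below is a minimal Python sketch that feeds the sixteen bytes from the dump to a strict UTF-8 decoder; `is_valid_utf8` and `first_bad_offset` are hypothetical helper names, and only the byte values come from the message itself.

```python
# The sixteen bytes shown in the od dump above.
DUMP = bytes.fromhex("74 68 6f 72 20 52 61 66 61 b3 20 42 69 6c 73 6b")


def is_valid_utf8(data):
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def first_bad_offset(data):
    """Return the offset of the first byte that breaks UTF-8 decoding,
    or None if the data is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


print(is_valid_utf8(DUMP))      # False
print(first_bad_offset(DUMP))   # 9, the position of byte 0xb3
print(is_valid_utf8(DUMP[:9]))  # True: the 7-bit ASCII prefix is valid UTF-8
```

Byte 0xb3 has the bit pattern of a UTF-8 continuation byte but follows no lead byte, which is why a strict decoder rejects it. The check can say that the data is not UTF-8; it cannot say which 8-bit charset was intended.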
* Re: OT: character encodings (was: Linux 2.6.20-rc4) 2007-01-07 19:17 ` Russell King @ 2007-01-07 19:58 ` Robin Rosenberg 2007-01-07 20:05 ` Dave Jones 2007-01-08 1:40 ` Horst H. von Brand 2 siblings, 0 replies; 39+ messages in thread From: Robin Rosenberg @ 2007-01-07 19:58 UTC (permalink / raw) To: Kernel Mailing List; +Cc: Russell King, Alan, David Woodhouse, Tilman Schmidt On Sunday 07 January 2007 20:17, Russell King wrote: [...] > clearly not UTF-8. I doubt whether any of the commits I do on my > en_GB ISO-8859-1 systems end up being UTF-8 encoded. They don't. Git doesn't convert, with the exception of two mail-related tools, which is the reason the commit being discussed ended up as UTF-8 in GIT. The mail containing the patch was in ISO-8859-1. All other git tools just store whatever byte sequence they are fed, be it ISO-latin, utf-8 or something (to westerners) more exotic. -- robin ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: OT: character encodings (was: Linux 2.6.20-rc4) 2007-01-07 19:17 ` Russell King 2007-01-07 19:58 ` Robin Rosenberg @ 2007-01-07 20:05 ` Dave Jones 2007-01-07 20:15 ` Sean 2007-01-08 4:42 ` David Woodhouse 2007-01-08 1:40 ` Horst H. von Brand 2 siblings, 2 replies; 39+ messages in thread From: Dave Jones @ 2007-01-07 20:05 UTC (permalink / raw) To: Alan, David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List On Sun, Jan 07, 2007 at 07:17:30PM +0000, Russell King wrote: > commit 24ebead82bbf9785909d4cf205e2df5e9ff7da32 > tree 921f686860e918a01c3d3fb6cd106ba82bf4ace6 > parent 264166e604a7e14c278e31cadd1afb06a7d51a11 > author Rafa³ Bilski <rafalbilski@interia.pl> 1167691774 +0100 > committer Dave Jones <davej@redhat.com> 1167799119 -0500 > > and looking at that "author" closer with od: > > 0000140 74 68 6f 72 20 52 61 66 61 b3 20 42 69 6c 73 6b > t h o r R a f a ³ B i l s k > > clearly not UTF-8. I doubt whether any of the commits I do on my > en_GB ISO-8859-1 systems end up being UTF-8 encoded. This has been bugging me for a while. Viewing the mail I applied in mutt shows his name correctly as Rafał Applying it with git-applymbox and viewing the log on master.kernel.org with git log shows Rafa<B3> And then later when put into email it turns into Rafa³ > But the point is there is charset damage which has happened _long_ before > Linus' action. There is no character set defined for the contents of git > repositories, and as such the output of the git tools can not be > interpreted as any one single character set. If there's something I should be doing when I commit that I'm not, I'll be happy to change my scripts. My $LANG is set to en_US.UTF-8 which should DTRT to the best of my knowledge, but clearly, that isn't the case. Dave -- http://www.codemonkey.org.uk ^ permalink raw reply [flat|nested] 39+ messages in thread
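The three renderings described in this message (Rafał, Rafa<B3>, Rafa³) are consistent with one unchanged byte being interpreted under three different charsets. The Python sketch below illustrates that reading. It assumes the original mail encoded the name in ISO-8859-2, where byte 0xB3 is "ł"; the thread does not confirm which charset the mail actually declared.

```python
NAME = "Rafa\u0142"  # "Rafał", with U+0142 LATIN SMALL LETTER L WITH STROKE

# Assumption: the mail carried the name as ISO-8859-2 bytes.
as_latin2 = NAME.encode("iso-8859-2")
print(as_latin2)  # b'Rafa\xb3'

# What a UTF-8 commit would have stored instead.
as_utf8 = NAME.encode("utf-8")
print(as_utf8)  # b'Rafa\xc5\x82'

# A UTF-8 terminal cannot decode the lone 0xb3 and shows an escape,
# comparable to the "Rafa<B3>" seen in git log.
print(as_latin2.decode("utf-8", errors="backslashreplace"))  # Rafa\xb3

# A reader assuming ISO-8859-1 maps the same byte to SUPERSCRIPT THREE.
print(as_latin2.decode("iso-8859-1"))  # Rafa³
```

Nothing in this chain modifies the stored byte; only the charset assumed by each viewer changes, which matches the point made in the thread that the repository contents carry no charset label.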
* Re: OT: character encodings (was: Linux 2.6.20-rc4) 2007-01-07 20:05 ` Dave Jones @ 2007-01-07 20:15 ` Sean 2007-01-07 20:40 ` Jan Engelhardt 2007-01-08 4:42 ` David Woodhouse 1 sibling, 1 reply; 39+ messages in thread From: Sean @ 2007-01-07 20:15 UTC (permalink / raw) To: Dave Jones Cc: Alan, David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List, git On Sun, 7 Jan 2007 15:05:53 -0500 Dave Jones <davej@redhat.com> wrote: Including the Git list... > On Sun, Jan 07, 2007 at 07:17:30PM +0000, Russell King wrote: > > > commit 24ebead82bbf9785909d4cf205e2df5e9ff7da32 > > tree 921f686860e918a01c3d3fb6cd106ba82bf4ace6 > > parent 264166e604a7e14c278e31cadd1afb06a7d51a11 > > author Rafa³ Bilski <rafalbilski@interia.pl> 1167691774 +0100 > > committer Dave Jones <davej@redhat.com> 1167799119 -0500 > > > > and looking at that "author" closer with od: > > > > 0000140 74 68 6f 72 20 52 61 66 61 b3 20 42 69 6c 73 6b > > t h o r R a f a ³ B i l s k > > > > clearly not UTF-8. I doubt whether any of the commits I do on my > > en_GB ISO-8859-1 systems end up being UTF-8 encoded. > > This has been bugging me for a while. > Viewing the mail I applied in mutt shows his name correctly as Rafał > Applying it with git-applymbox and viewing the log on master.kernel.org > with git log shows Rafa<B3> And then later when put into email > it turns into Rafa³ > > > But the point is there is charset damage which has happened _long_ before > > Linus' action. There is no character set defined for the contents of git > > repositories, and as such the output of the git tools can not be > > interpreted as any one single character set. > > If there's something I should be doing when I commit that I'm not, > I'll be happy to change my scripts. My $LANG is set to en_US.UTF-8 > which should DTRT to the best of my knowledge, but clearly, that isn't > the case. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: OT: character encodings (was: Linux 2.6.20-rc4) 2007-01-07 20:15 ` Sean @ 2007-01-07 20:40 ` Jan Engelhardt 2007-01-07 21:07 ` Xavier Bestel 0 siblings, 1 reply; 39+ messages in thread From: Jan Engelhardt @ 2007-01-07 20:40 UTC (permalink / raw) To: Sean Cc: Dave Jones, Alan, David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List, git >On Sun, 7 Jan 2007 15:05:53 -0500 >Dave Jones <davej@redhat.com> wrote: > >> If there's something I should be doing when I commit that I'm not, >> I'll be happy to change my scripts. My $LANG is set to en_US.UTF-8 >> which should DTRT to the best of my knowledge, but clearly, that isn't >> the case. No, LC_CTYPE defines what charset you use. (I may be wrong, though.) -`J' -- ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: OT: character encodings (was: Linux 2.6.20-rc4) 2007-01-07 20:40 ` Jan Engelhardt @ 2007-01-07 21:07 ` Xavier Bestel 0 siblings, 0 replies; 39+ messages in thread From: Xavier Bestel @ 2007-01-07 21:07 UTC (permalink / raw) To: Jan Engelhardt Cc: Sean, Dave Jones, Alan, David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List, git On Sunday 07 January 2007 at 21:40 +0100, Jan Engelhardt wrote: > >On Sun, 7 Jan 2007 15:05:53 -0500 > >Dave Jones <davej@redhat.com> wrote: > > > >> If there's something I should be doing when I commit that I'm not, > >> I'll be happy to change my scripts. My $LANG is set to en_US.UTF-8 > >> which should DTRT to the best of my knowledge, but clearly, that isn't > >> the case. > > No, LC_CTYPE defines what charset you use. (I may be wrong, though.) IIRC LANG is a superset for all LC_* - i.e. if only LANG is defined, it sets all your locales, but you can individually set the charset, numeric format, date format, etc. Xav ^ permalink raw reply [flat|nested] 39+ messages in thread
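The relationship between LANG and LC_CTYPE discussed here follows the POSIX lookup order: LC_ALL wins if set and non-empty, then the category variable (LC_CTYPE for the character set), then LANG. A minimal Python sketch of that lookup follows; `effective_ctype` and `codeset` are hypothetical names, and the fallback to "C" stands in for the implementation-defined default.

```python
def effective_ctype(env):
    """Return the locale name that governs character classification.

    POSIX order: LC_ALL, then LC_CTYPE, then LANG. Empty values count
    as unset. "C" stands in for the implementation-defined default.
    """
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            return value
    return "C"


def codeset(locale_name):
    """Extract the codeset from a name like 'en_US.UTF-8' or
    'de_DE.ISO-8859-15@euro'. Returns None when the name has no
    explicit codeset (the system's locale database then decides)."""
    _, dot, rest = locale_name.partition(".")
    if not dot:
        return None
    return rest.partition("@")[0]


# Only LANG set, as in Dave Jones' report: it applies to every category.
print(codeset(effective_ctype({"LANG": "en_US.UTF-8"})))  # UTF-8

# LC_CTYPE overrides LANG for the character set only.
env = {"LANG": "en_US.UTF-8", "LC_CTYPE": "en_GB.ISO-8859-1"}
print(codeset(effective_ctype(env)))  # ISO-8859-1
```

On this reading both replies are compatible: LC_CTYPE is the variable that determines the charset, and LANG supplies its value whenever neither LC_ALL nor LC_CTYPE is set.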
* Re: OT: character encodings (was: Linux 2.6.20-rc4) 2007-01-07 20:05 ` Dave Jones 2007-01-07 20:15 ` Sean @ 2007-01-08 4:42 ` David Woodhouse 1 sibling, 0 replies; 39+ messages in thread From: David Woodhouse @ 2007-01-08 4:42 UTC (permalink / raw) To: Dave Jones; +Cc: Alan, Tilman Schmidt, Linux Kernel Mailing List, rmk+lkml On Sun, 2007-01-07 at 15:05 -0500, Dave Jones wrote: > This has been bugging me for a while. > Viewing the mail I applied in mutt shows his name correctly as Rafał > Applying it with git-applymbox and viewing the log on master.kernel.org > with git log shows Rafa<B3> And then later when put into email > it turns into Rafa³ I believe you need to use the misnamed '-u' option to git-applymbox, which _really_ ought to be the default behaviour. Otherwise, it fails to pay any attention to the character set tags in the mail it's decoding -- it commits the sin which rmk was whining about; assuming the input data is of a given type and ignoring the explicit tags which indicate the contrary. The '-u' option is misdocumented as 'causes the resulting commit to be encoded in utf-8', but in fact I believe it doesn't necessarily do that -- it actually causes the resulting commit to be encoded in the configured storage charset for the repository, which just _happens_ to default to UTF-8 unless otherwise specified. That is something which should definitely be the _default_ behaviour. We should make the '-u' behaviour the default, and if anyone really wants the old behaviour of importing arbitrary data in untagged binary form overriding its labelling then they can have a separate option which does that. -- dwmw2 ^ permalink raw reply [flat|nested] 39+ messages in thread
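The behaviour argued for here, honouring the charset tags in the mail and converting to the repository's storage charset, can be illustrated with Python's standard `email.header` module. This is only a sketch of the idea, not how git-applymbox is implemented; the sample header, the `decode_author` helper and the `STORAGE_CHARSET` constant are assumptions made for the example.

```python
from email.header import decode_header

# Hypothetical stand-in for the repository's configured storage charset.
STORAGE_CHARSET = "utf-8"

# Assumed sample: a From header whose display name is tagged as
# ISO-8859-2 using an RFC 2047 encoded word.
RAW_FROM = "=?iso-8859-2?q?Rafa=B3_Bilski?= <rafalbilski@interia.pl>"


def decode_author(header):
    """Decode a mail header to text, honouring its charset tags."""
    pieces = []
    for part, charset in decode_header(header):
        if isinstance(part, bytes):
            # Untagged parts are treated as ASCII; undecodable bytes
            # are replaced rather than passed through unlabelled.
            part = part.decode(charset or "ascii", errors="replace")
        pieces.append(part)
    return "".join(pieces)


author = decode_author(RAW_FROM)
print(author)

# Bytes as they would be stored under the storage charset.
stored = author.encode(STORAGE_CHARSET)
print(stored)
```

The contrast with the behaviour criticised in the message is that the tag (`iso-8859-2` in this sample) drives the decoding, so the stored bytes are in one known charset instead of whatever the sender's mail client happened to emit.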
* Re: OT: character encodings (was: Linux 2.6.20-rc4) 2007-01-07 19:17 ` Russell King 2007-01-07 19:58 ` Robin Rosenberg 2007-01-07 20:05 ` Dave Jones @ 2007-01-08 1:40 ` Horst H. von Brand 2 siblings, 0 replies; 39+ messages in thread From: Horst H. von Brand @ 2007-01-08 1:40 UTC (permalink / raw) To: Alan, David Woodhouse, Tilman Schmidt, Linux Kernel Mailing List Russell King <rmk+lkml@arm.linux.org.uk> wrote: [...] > All that UTF-8 has done is added to the "which charset is this data" > problem rather than actually solving any proper real life problem. It solves real-world problems; the pain is that it is not (yet) universally used. The charset problems are much more visible today than, say, 15 years back, that is all. -- Dr. Horst H. von Brand User #22616 counter.li.org Departamento de Informatica Fono: +56 32 2654431 Universidad Tecnica Federico Santa Maria +56 32 2654239 Casilla 110-V, Valparaiso, Chile Fax: +56 32 2797513 ^ permalink raw reply [flat|nested] 39+ messages in thread