From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jakub Narebski
Subject: Re: git on MacOSX and files with decomposed utf-8 file names
Date: Wed, 16 Jan 2008 22:51:53 +0100
Message-ID: <200801162251.54219.jnareb@gmail.com>
References: <478E1FED.5010801@web.de> <65026F2B-5CE8-4238-A9AB-D3545D336B41@sb.org>
In-Reply-To: <65026F2B-5CE8-4238-A9AB-D3545D336B41@sb.org>
To: Kevin Ballard
Cc: Johannes Schindelin, Mark Junker, git@vger.kernel.org
X-Mailing-List: git@vger.kernel.org

On Wed, 16 Jan 2008, Kevin Ballard wrote:
> On Jan 16, 2008, at 11:46 AM, Jakub Narebski wrote:

>>>> More like, Mac OS X has standardized on Unicode and the rest of the
>>>> world hasn't caught up yet. Git is the only tool I've ever heard
>>>> which has a problem with OS X using Unicode.
>>>
>>> No. That's not at all the problem. Mac OS X insists on storing
>>> _another_ encoding of your filename. Both are UTF-8. Both encode
>>> the _same_ string. Yet they are different, bytewise. For no good
>>> reason.
>>
>> To be more exact encoding used to _create_ file differs from encoding
>> returned when _reading directory_...
>>
>>> Stop spreading FUD. Git can handle Unicode just fine. In fact,
>>> Git does not _care_ how the filename is encoded, it _respects_ the
>>> user's choice, not only of the encoding _type_, but the _encoding_,
>>> too.
>>
>> ...which means that sequence of bytes differ. And Git by design is
>> (both for filenames and for blob contents) encoding agnostic.
>>
>> HFS+ is just _stupid_. And unfortunately Git doesn't support stupid
>> filesystems (e.g. case insensitive filesystems) well.
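
To make the quoted point concrete ("Both are UTF-8. Both encode the
_same_ string. Yet they are different, bytewise."), here is a minimal
Python sketch. It is an illustration only, not something posted in the
thread: the sample name "föo" is borrowed from the discussion, and NFD
is used as a stand-in for the decomposed form the subject line refers
to (the exact normalization HFS+ applies is an assumption here).

```python
import unicodedata

# The same visible name, written two ways (sample name is illustrative).
composed = "f\u00f6o"      # 'ö' as one precomposed code point (NFC)
decomposed = "fo\u0308o"   # 'o' + U+0308 COMBINING DIAERESIS (NFD)

composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

# An encoding-agnostic tool compares raw bytes, so these differ.
bytes_equal = composed_bytes == decomposed_bytes

# After normalizing both to one form, they compare equal.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

# Decomposing the composed name yields the other spelling: a name
# created in one form can come back in the other when listed.
nfd_of_composed = unicodedata.normalize("NFD", composed)

print(composed_bytes, decomposed_bytes)
print("bytes equal:", bytes_equal, "| equal after NFC:", nfc_equal)
```

A tool that treats filenames as opaque byte sequences sees two
different names here, which is the mismatch described above.
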

By the way, calling HFS+ stupid, or rather calling the use of at least
two different normalizations of UTF-8 (two different encodings) for
writing and reading filenames stupid, was wrong _of me_. I was quoting
Linus here, when I think I should have used a different description.

> There's two different ways to do filesystem encodings. One is to have
> the fs simply not care about encoding, which is what the linux world
> seems to prefer. Sure, this is great in that what you create the file
> with is what you get back, but on the other hand, given an arbitrary
> non-ASCII file on disk, you have absolutely no idea what the encoding
> should be and you can't display it without making assumptions (yes you
> can use heuristics, but you're still making assumptions). Filesystems
> like HFS+ that standardize the encoding, on the other hand, make it
> such that you always know what the encoding of a file should be, so
> you can always display and use the filename intelligently. It also
> means it plays much nicer in a non-ASCII world, since you don't have
> to worry about different normalizations of a given string referring to
> different files (it's one thing to be case-sensitive, but claiming
> that "föo" and "föo" are different files just because one uses a
> composed character and the other doesn't is extremely user-
> unfriendly).

To me it looks like a layering violation... but my knowledge of
filesystems is close to nil. IMHO it is the VFS and libc which should
do the translating.

> On the other hand, what you create the file with may not
> be what you read back later, since the name has been standardized.
> It's hard to say one is better than the other, they're just different
> ways of doing it.

But using one encoding to create a file, and another when reading
filenames, is strange.
It is IMHO better to simply refuse to create filenames which are
outside the chosen encoding / normalization. But having different
encodings used for reading and writing at the level of filesystem
access (not at the level of the UI) is strange.

> However, I have noticed that everybody who's voiced
> an opinion on this list in favor of the encoding-agnostic approach
> seem to be unwilling to accept that any other approach might have
> validity, to the extent of calling an OS/filesystem that does things
> different stupid or insane. This strikes me as extremely elitist and
> risks alienating what I expect to be a fast-growing group of users
> (i.e. OS X users).

First, it is Git's philosophy and the very core of its design to be
encoding agnostic (to be a "content tracker"). Second, using the same
sequence of bytes on the filesystem, in the index, and in 'tree'
objects ensures good performance... this is something to think about
if you want to add patches which would deal with HFS+ API/UI quirks.

[cut]

-- 
Jakub Narebski
Poland