From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from ring.crustytoothpaste.net (ring.crustytoothpaste.net [172.105.110.227]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 50F1038C for ; Tue, 16 Jan 2024 02:06:56 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=crustytoothpaste.net Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=crustytoothpaste.net Authentication-Results: smtp.subspace.kernel.org; dkim=pass (3072-bit key) header.d=crustytoothpaste.net header.i=@crustytoothpaste.net header.b="DD5o1H55" Received: from tapette.crustytoothpaste.net (unknown [IPv6:2001:470:b056:101:e59a:3ed0:5f5c:31f3]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange ECDHE (P-256) server-signature RSA-PSS (3072 bits) server-digest SHA256) (No client certificate requested) by ring.crustytoothpaste.net (Postfix) with ESMTPSA id B39945A3C4; Tue, 16 Jan 2024 02:06:50 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=crustytoothpaste.net; s=default; t=1705370810; bh=9IeMbib/kaiuKqYVspp3SDTMgnepOY5moNTY+lAj3vU=; h=Date:From:To:Cc:Subject:References:Content-Type: Content-Disposition:In-Reply-To:From:Reply-To:Subject:Date:To:CC: Resent-Date:Resent-From:Resent-To:Resent-Cc:In-Reply-To:References: Content-Type:Content-Disposition; b=DD5o1H55Wt5abV4kllyM9SrCbsA4IHwWXuiuLLvLZqdz/WKvcTcAN1kGg3xFgFHVo SBGNiCUgYIU9A7NUZwr3Efek4OMrNMQAYxg9EfQX+yAOhtmFxJvUm9hEHE2aHMciBe 2LP1SqsHyFkGZZWgDphBO/I1/pCSa0lmdkux6oIVRhQNpLECEm5kcrPe7+iMBWazpF qwgrsYeutIwZ6ow1LO98IqPqtESkaMwgXVcCLfUu+UARJ1okX3xWerfjpxb2p+WK1h T2HrTKsEosoAZvx2WRYe5inCorBebbid3pULRJLm4oQ0/r3bY0k8uRnYPO1iXBi9V3 I6PbWCIOvpVh4rFGeEnCqhGrBniHt55pdvqgK+b7UfomV6gnQjf/2YlzaZm2mjUgs8 REfC0EdoO6zDTpV99oifijOAvfkt6KMmx3iazrGafRk1qxKMm44YkBJGOD43yiTHRg Mk12gzIb/A7bChOqq/LNQvts/+0ZCiAbkkKNlFssBbczioIV5fu Date: Tue, 16 Jan 2024 02:06:47 +0000 From: "brian m. carlson" To: Michael Litwak Cc: Torsten =?utf-8?Q?B=C3=B6gershausen?= , Matthias =?utf-8?Q?A=C3=9Fhauer?= , "git@vger.kernel.org" Subject: Re: Suggested clarification for .gitattributes reference documentation Message-ID: Mail-Followup-To: "brian m. carlson" , Michael Litwak , Torsten =?utf-8?Q?B=C3=B6gershausen?= , Matthias =?utf-8?Q?A=C3=9Fhauer?= , "git@vger.kernel.org" References: <20240113074323.GA6819@tb-raspi4> Precedence: bulk X-Mailing-List: git@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="bVYFgFSQZMWtfXh7" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/2.2.12 (2023-09-09) --bVYFgFSQZMWtfXh7 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On 2024-01-16 at 00:19:20, Michael Litwak wrote: > As for documentation clarifications for the .gitattributes manpage at > https://git-scm.com/docs/gitattributes, I still suggest adding an > explicit example for UTF-16LE with BOM, and/or adding a table listing > which working-tree-encoding value to use for each of the following > UTF-16 text encodings: >=20 > ENCODING 'working-tree-encoding' VALUE > ------------------- ----------------------------- > UTF-16LE with BOM UTF-16LE-BOM I should point out that this encoding, while very common on Windows, is also nonstandard. The standard says that UTF-16LE and UTF-16BE don't include a BOM and are always the respective endianness. UTF-16 can have a BOM or not, and if it doesn't, it's big-endian. There is no standard-conforming way to force the use of little-endian with a BOM. The problem is that many Windows programs insist on the BOM, but also refuse to read big-endian data in violation of the standard[0]. That's why this nonstandard variant exists in Git. I'll also note that this particular nonstandard variant is essentially impossible to encode reliably on Unix outside of Git because it's nonstandard, so it's an extremely unportable choice. In fact, I'm not aware of _any_ tool on my Debian system other than Git that will guarantee a UTF-16 little-endian stream with BOM. My editor (Neovim) certainly doesn't. (Apparently Emacs, which is not on my system, may permit that, which does not surprise me in the least.) > UTF-16BE with BOM UTF-16 It's a little more complicated than that. "UTF-16" would allow UTF-16 big-endian with BOM, UTF-16 little-endian with BOM, or UTF-16 big-endian without BOM. In other words, UTF-16 is big-endian by default and otherwise requires a BOM, which may be included even if not required. A reader must handle every variant of this, and must honour the BOM if set and default to big-endian if not. A writer may write whichever variant pleases it most as long as it's consistent within the same message. > UTF-16LE no BOM UTF-16LE > UTF-16BE no BOM UTF-16BE I think the addition of this table is too much. UTF-16LE-BOM is common on Windows, and the rest are substantially less common. It's also very difficult to explain in a table what "UTF-16" means in an understandable way. And I also think it's also pretty clear that users should be using UTF-8 without BOM where possible. We do already mention both UTF-16, UTF-16LE, and UTF-16LE-BOM as options in the gitattributes manual page, and it's up to the user to know what their program wants and supports if that's not UTF-8. (I would say that the user wants a new program that _does_ support UTF-8, but perhaps I'm being unrealistically harsh.) I agree it's difficult because the documentation usually doesn't indicate what's supported and all the variants are hard to understand, but that's a huge part of the reason that we recommend UTF-8. I'll also add that in general, when you do have Unix systems that read or write data in UTF-16, they handle every variant correctly. Thus, the practical choice if you steadfastly refuse to use UTF-8 is either UTF-16LE-BOM (if your Windows program has the bug I mentioned above) or UTF-16, both of which we mention already in the manual page. I'm explicitly ignoring non-file contexts here, where one may use UTF-16LE or UTF-16BE, but those are substantially less common in actual files, which is what this feature describes. > Why bother clarifying the documentation? Because These UTF-16 > encodings are commonly found on Windows systems. Notepad supports the > first two, and many Visual Studio project wizards add various files > using these encodings as well. Older versions of PowerShell saved new > .ps1 scripts using UTF-16BE with BOM as the default encoding. True, but Notepad also supports UTF-8 and has for quite a while. According to the Powershell documentation[1], there is no portable character set option for non-ASCII characters, so in general it's impossible to know. I suspect that a simple "UTF-16" will be fine here, though, since it clearly doesn't have the bug mentioned above. > Also, the current .gitattributes documentation makes frequent > reference to "UTF-16" as an encoding but fails to be clear that the > working-tree-encoding value "UTF-16" is now only for UTF-16BE with > BOM. It would be easy to assume that the working-tree-encoding value > "UTF-16" meant any UTF-16 file with a BOM (either LE or BE), which was > the original meaning of this value before UTF-16LE-BOM was added to > Git. As I said, your statement isn't correct. That's what libiconv does on Windows. On Linux, glibc uses a little-endian variant with BOM on little-endian machines. musl, if memory serves me, always uses big-endian without a BOM. All of those are valid encodings, and a UTF-16 reader must handle all of them. > Finally, I am not sure how to use git add --renormalize to correct a > UTF-16 file that was previously added incorrectly (i.e. with a missing > or incorrect working-tree-encoding entry in .gitattributes). The git > add documentation at https://git-scm.com/docs/git-add implies > 'renormalize' resets only the end-of-line values; however, I suspect > it also re-converts text encoding when a working-tree-encoding > property is set. It would be helpful to know one way or the other. It does indeed affect the working-tree-encoding. If you wanted to send an inline patch created with git format-patch, it would probably be welcome to mention that. However, because in this project we typically scratch our own itch, if you don't send one, it's likely nobody else will, either. [0] https://datatracker.ietf.org/doc/html/rfc2781 =C2=A7 4.1: =E2=80=9CAll applications that process text with the "UTF-16" charset label MUST be able to interpret both big- endian and little-endian text.=E2= =80=9D [1] https://learn.microsoft.com/en-us/powershell/module/microsoft.powershel= l.core/about/about_character_encoding?view=3Dpowershell-7.4 --=20 brian m. carlson (he/him or they/them) Toronto, Ontario, CA --bVYFgFSQZMWtfXh7 Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.4.3 (GNU/Linux) iHUEABYKAB0WIQQILOaKnbxl+4PRw5F8DEliiIeigQUCZaXktwAKCRB8DEliiIei geQ4AQClGBbbJVtMIGWAujd4A1geY7gz4q0EJxp2jDf5CEjCtwEAqMtkVxOUANvb GSiIGBUuSB+GF1Ie9cnx+hcFvJhaIAE= =E34N -----END PGP SIGNATURE----- --bVYFgFSQZMWtfXh7--