* Suggested clarification for .gitattributes reference documentation
@ 2024-01-12 21:25 Michael Litwak
  2024-01-12 21:50 ` brian m. carlson
  0 siblings, 1 reply; 10+ messages in thread
From: Michael Litwak @ 2024-01-12 21:25 UTC (permalink / raw)
  To: git@vger.kernel.org

The .gitattributes documentation should be clarified to ensure files encoded as UTF-16 are properly accounted for, in particular for Windows users.

Specifically, within the working-tree encoding topic https://git-scm.com/docs/gitattributes#_working_tree_encoding, I suggest the following edits:

NEW BULLETED PARAGRAPH UNDER THE HEADING "Please note that using the working-tree-encoding attribute may have a number of pitfalls:"

* Git for Windows is not able to access the iconv.exe text conversion program from an ordinary Command Prompt. Be sure to run 'git clone' or 'git add' from a git bash console or a Git GUI.

OLD TEXT

As an example, use the following attributes if your *.ps1 files are UTF-16 encoded with byte order mark (BOM) and you want Git to perform automatic line ending conversion based on your platform.

    *.ps1 text working-tree-encoding=UTF-16

Use the following attributes if your *.ps1 files are UTF-16 little endian encoded without BOM and you want Git to use Windows line endings in the working directory (use UTF-16LE-BOM instead of UTF-16LE if you want UTF-16 little endian with BOM). Please note, it is highly recommended to explicitly define the line endings with eol if the working-tree-encoding attribute is used to avoid ambiguity.

    *.ps1 text working-tree-encoding=UTF-16LE eol=CRLF

NEW TEXT (SPECIFYING UTF-16BE EXPLICITLY IN THE FIRST EXAMPLE, AND WITH A NEW SEPARATE EXAMPLE FOR UTF-16LE WITH BOM)

As an example, use the following attributes if your *.ps1 files are UTF-16 big endian encoded with byte order mark (BOM) and you want Git to perform automatic line ending conversion based on your platform.

    *.ps1 text working-tree-encoding=UTF-16

Use the following attributes if your *.ps1 files are UTF-16 little endian encoded without BOM and you want Git to use Windows line endings in the working directory.

    *.ps1 text working-tree-encoding=UTF-16LE eol=CRLF

Use the following attributes if your *.ps1 files are UTF-16 little endian encoded with BOM and you want Git to use Windows line endings in the working directory.

    *.ps1 text working-tree-encoding=UTF-16LE-BOM eol=CRLF

Please note, it is highly recommended to explicitly define the line endings with eol if the working-tree-encoding attribute is used to avoid ambiguity.

Please note, Git for Windows does not support UTF-16LE encoding when running git commands from an ordinary Command Prompt. Use a git bash console instead.

OLD TEXT:

You can get a list of all available encodings on your platform with the following command:

    iconv --list

NEW TEXT:

You can get a list of all available encodings on your platform with the following command:

    iconv --list

For Git for Windows users the command, above, is only supported when running in a 'git bash' console.

In the thread "help request: unable to merge UTF-16-LE "text" file" at https://lore.kernel.org/git/Yl8uiflurfjuLIvD@camp.crustytoothpaste.net/, brian m. carlson, Chris Torek and others describe tips for dealing with improper encoding, such as the following:

if you have already checked the file in without an appropriate working-tree-encoding, you should run `git add --renormalize .` and then commit. You'll need to do that (or merge in a commit that does that) on every branch you want to work with.

> For that to work, it is likely that you'd need to convert not just
> the tips of two branches getting merged, but also the merge base
> commit, so that all three trees involved in the 3-way merge are in
> the same text encoding.

The old merge-recursive has `-X renormalize` that I believe would do this for you. (I see code in merge-ort for this as well, but have no handy means to test it myself.)

So a NEW SECTION describing ways to deal with improper text file encoding could be added under the working-tree-encoding topic, specifically a description of what the following two commands can do to remedy improper encoding:

    git add --renormalize
    git merge-recursive -X renormalize

CONCLUSION:

Text files encoded with UTF-16LE with BOM are common in the Windows world, as some versions of Visual Studio will use this as the default encoding for .rc or .mc files. Solution files, project files and other Visual Studio files can also be in this format. Other encodings are common, too, e.g. some older versions of PowerShell defaulted to UTF-16BE with BOM for new .ps1 files. Yet users continue to experience encoding errors even when they are using the proper working-tree-encoding in their .gitattributes file. Part of this is due to the complexity of Git and the number of different platforms it supports.

Ideally Git would automatically detect the most common UTF encodings and treat these files as diffable text files on all platforms -- without the need for entries in .gitattributes. And it would be great if Git for Windows could handle common UTF text encodings when executed in an ordinary Command Prompt.

Until then, clarifying and enhancing the documentation for .gitattributes could go a long way in making text encoding easier for Git users.

Thanks for considering these revisions.

- Michael

^ permalink raw reply [flat|nested] 10+ messages in thread
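A minimal sketch of the `git add --renormalize` remedy discussed in the message above, assuming a POSIX shell, a Git build with iconv support, and a throwaway repository. The file name script.ps1, the sample content, the identity settings and the commit messages are all hypothetical, and `eol=lf` is chosen only to keep the sample bytes simple. The `-X renormalize` strategy option is given to `git merge`; it appears only in a trailing comment and is not executed here.

```shell
set -e

# Throwaway repository; every name below is illustrative.
repo=$(mktemp -d)
cd "$repo"
git init -q .

# "hi" plus a line feed, encoded as UTF-16LE with a BOM (FF FE).
printf '\377\376h\000i\000\n\000' > script.ps1

# Commit the file before any attribute exists, so the raw UTF-16 bytes are stored.
git add script.ps1
git -c user.name=Example -c user.email=example@example.invalid \
    commit -q -m 'Add script without working-tree-encoding'

# Declare the encoding, then re-run the clean conversion on tracked files.
printf '*.ps1 text working-tree-encoding=UTF-16LE-BOM eol=lf\n' > .gitattributes
git add --renormalize .
git add .gitattributes
git -c user.name=Example -c user.email=example@example.invalid \
    commit -q -m 'Renormalize UTF-16 script'

# The committed blob should now be UTF-8, while the working file stays UTF-16.
git cat-file -p HEAD:script.ps1

# For merges across branches that have not been renormalized, the strategy
# option is passed through git merge, for example:
#   git merge -X renormalize <other-branch>
```

If the last command prints the text as plain UTF-8, the stored blob has been converted. As the quoted advice says, every branch that still carries the old blob needs the same treatment.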
* Re: Suggested clarification for .gitattributes reference documentation
  2024-01-12 21:25 Suggested clarification for .gitattributes reference documentation Michael Litwak
@ 2024-01-12 21:50 ` brian m. carlson
  2024-01-12 22:36 ` Michael Litwak
  0 siblings, 1 reply; 10+ messages in thread
From: brian m. carlson @ 2024-01-12 21:50 UTC (permalink / raw)
  To: Michael Litwak; +Cc: git@vger.kernel.org

On 2024-01-12 at 21:25:19, Michael Litwak wrote:
> Please note, Git for Windows does not support UTF-16LE encoding when running git
> commands from an ordinary Command Prompt. Use a git bash console instead.

This sounds like a Git for Windows bug. Rather than documenting it, could you open an issue for it on their project?

> NEW TEXT:
>
> You can get a list of all available encodings on your platform with the following command:
>
> iconv --list
>
> For Git for Windows users the command, above, is only supported when running in a 'git bash' console.

That sounds like a PATH misconfiguration on your part. Have you checked your PATH settings to make sure that the path including the binary is included?

> CONCLUSION:
>
> Text files encoded with UTF-16LE with BOM are common in the Windows
> world, as some versions of Visual Studio will use this as the default
> encoding for .rc or .mc files. Solution files, project files and
> other Visual Studio files can also be in this format. Other encodings
> are common, too, e.g. some older versions of PowerShell defaulted to
> UTF-16BE with BOM for new .ps1 files. Yet users continue to experience
> encoding errors even when they are using the proper
> working-tree-encoding in their .gitattributes file. Part of this is
> due to the complexity of Git and the number of different platforms it
> supports.

I should point out that UTF-8 is pretty much the standard these days in many domains, even on Windows. For example, nobody is going to be pleased if you write a web page in any variant of UTF-16, and some languages, such as Rust, are simply defined to be in UTF-8 and won't work if you put them in any other encoding. Almost all editors these days do support UTF-8 (without BOM), even on Windows, so we do want to strongly encourage that rather than having people use UTF-16. The Git FAQ specifically outlines UTF-8 as the recommended way, which is most portable and most functional.

We have also documented the UTF-16LE-BOM case specifically in the Git FAQ (git help gitfaq) under "I'm on Windows and my text files are detected as binary". Answering questions on Stack Overflow, I realize that nobody actually reads the FAQ, but we did clearly document how to do it.

That being said, I'm not opposed to an additional mention in the gitattributes(5) page if you want to send an actual patch.
--
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Suggested clarification for .gitattributes reference documentation
  2024-01-12 21:50 ` brian m. carlson
@ 2024-01-12 22:36 ` Michael Litwak
  2024-01-13 2:56 ` Michael Litwak
  0 siblings, 1 reply; 10+ messages in thread
From: Michael Litwak @ 2024-01-12 22:36 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git@vger.kernel.org

Regarding Git for Windows requiring 'git bash' console to perform text conversion...

>This sounds like a Git for Windows bug. Rather than documenting it, could you open an issue for it on their project?

>That sounds like a PATH misconfiguration on your part. Have you checked your PATH settings to make sure that the path including the binary is included?

If it is merely that I need to adjust my PATH so iconv.exe is accessible, that simplifies everything; however, it could possibly be a Git for Windows installer bug (or perhaps the installer offered to change my PATH and I declined). I'll check out both possibilities.

>We have also documented the UTF-16LE-BOM case specifically in the Git FAQ (git help gitfaq) under "I'm on Windows and my text files are detected as binary".

I'll take a look and perhaps revise my suggested documentation edits.

Thanks,
- Michael

^ permalink raw reply [flat|nested] 10+ messages in thread
* RE: Suggested clarification for .gitattributes reference documentation
  2024-01-12 22:36 ` Michael Litwak
@ 2024-01-13 2:56 ` Michael Litwak
  2024-01-13 7:43 ` Torsten Bögershausen
  0 siblings, 1 reply; 10+ messages in thread
From: Michael Litwak @ 2024-01-13 2:56 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git@vger.kernel.org

I just installed Git for Windows 2.43.0 and noticed the installer offers three options for altering the PATH:

1) Run git from git bash only

2) Run git from git bash, cmd.exe and PowerShell (RECOMMENDED)

3) Run git from git bash, cmd.exe and PowerShell with optional utilities (warning: will override find, sort and other system utilities).

It turns out iconv.exe is accessible from cmd.exe (Command Prompt) only when you take the third option. But iconv.exe is NOT optional. It is required for git to deal with UTF-16LE with BOM text conversions (and probably for numerous other encoding conversions).

But when PATH option #2 is chosen, and iconv.exe is unreachable from a Windows Command Prompt, the git commands which call upon iconv.exe do NOT indicate the error. The call to iconv.exe fails silently. It is only later after you commit, push and clone the repo again that you see the encoding failures.

And the warning about overriding find and sort must be taken with a grain of salt, since the Windows versions of those programs are accessed via a Windows folder which appears earlier in the PATH.

So this Git for Windows installer screen is misleading. And perhaps iconv.exe should be relocated so it is accessible even when PATH option #2 is chosen. I intend to submit an issue on the Git for Windows issue tracker regarding this. I'll also submit an issue about the lack of an error when running 'git add' for a UTF-16LE with BOM file under PATH option #2.

Thanks,
- Michael

^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Suggested clarification for .gitattributes reference documentation
  2024-01-13 2:56 ` Michael Litwak
@ 2024-01-13 7:43 ` Torsten Bögershausen
  2024-01-13 9:24 ` Matthias Aßhauer
  2024-01-16 0:19 ` Michael Litwak
  0 siblings, 2 replies; 10+ messages in thread
From: Torsten Bögershausen @ 2024-01-13 7:43 UTC (permalink / raw)
  To: Michael Litwak; +Cc: brian m. carlson, git@vger.kernel.org

On Sat, Jan 13, 2024 at 02:56:27AM +0000, Michael Litwak wrote:
> I just installed Git for Windows 2.43.0 and noticed the installer offers three options for altering the PATH:
>
> 1) Run git from git bash only
>
> 2) Run git from git bash, cmd.exe and PowerShell (RECOMMENDED)
>
> 3) Run git from git bash, cmd.exe and PowerShell with optional utilities (warning: will override find, sort and other system utilities).
>
> It turns out iconv.exe is accessible from cmd.exe (Command Prompt) only when you take the third option. But iconv.exe is NOT optional. It is required for git to deal with UTF-16LE with BOM text conversions (and probably for numerous other encoding conversions).

Please wait a second - and thanks for bringing this up.
To my knowledge the binary iconv.exe (or just iconv under non-Windows)
is never called from Git itself.
Git is using iconv_open() and friends, which are all inside
a library, either the C-library "libc", or "libiconv"
(not 100% sure about the naming here)

iconv.exe is not needed in everyday life, or is it?
If yes, when?
iconv.exe is used when you run the test-suite, to verify
what Git is doing.

Could you elaborate a little bit more,
when iconv.exe is missing, and what is happening, please?

>
> But when PATH option #2 is chosen, and iconv.exe is unreachable from a Windows Command Prompt, the git commands which call upon iconv.exe do NOT indicate the error. The call to iconv.exe fails silently. It is only later after you commit, push and clone the repo again that you see the encoding failures.
>
> And the warning about overriding find and sort must be taken with a grain of salt, since the Windows versions of those programs are accessed via a Windows folder which appears earlier in the PATH.
>
> So this Git for Windows installer screen is misleading. And perhaps iconv.exe should be relocated so it is accessible even when PATH option #2 is chosen. I intend to submit an issue on the Git for Windows issue tracker regarding this. I'll also submit an issue about the lack of an error when running 'git add' for a UTF-16LE with BOM file under PATH option #2.
>
> Thanks,
> - Michael
>
[]

^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Suggested clarification for .gitattributes reference documentation
  2024-01-13 7:43 ` Torsten Bögershausen
@ 2024-01-13 9:24 ` Matthias Aßhauer
  2024-02-18 23:12 ` Johannes Schindelin
  2024-01-16 0:19 ` Michael Litwak
  1 sibling, 1 reply; 10+ messages in thread
From: Matthias Aßhauer @ 2024-01-13 9:24 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Michael Litwak, brian m. carlson, git@vger.kernel.org

On Sat, 13 Jan 2024, Torsten Bögershausen wrote:

> On Sat, Jan 13, 2024 at 02:56:27AM +0000, Michael Litwak wrote:
>> I just installed Git for Windows 2.43.0 and noticed the installer offers three options for altering the PATH:
>>
>> 1) Run git from git bash only
>>
>> 2) Run git from git bash, cmd.exe and PowerShell (RECOMMENDED)
>>
>> 3) Run git from git bash, cmd.exe and PowerShell with optional utilities (warning: will override find, sort and other system utilities).
>>
>> It turns out iconv.exe is accessible from cmd.exe (Command Prompt) only when you take the third option. But iconv.exe is NOT optional. It is required for git to deal with UTF-16LE with BOM text conversions (and probably for numerous other encoding conversions).

For end users directly calling iconv.exe is definitely optional.

> Please wait a second - and thanks for bringing this up.
> To my knowledge the binary iconv.exe (or just iconv under non-Windows)
> is never called from Git itself.
> Git is using iconv_open() and friends, which are all inside
> a library, either the C-library "libc", or "libiconv"
> (not 100% sure about the naming here)

Exactly. I can't find a single instance of Git for Windows calling iconv.exe instead of using the corresponding library functions. [1]

And even if it did, iconv.exe is definitely on the path for git.exe [2] unless you're calling /(mingw|clangarm)(64|32)/bin/git.exe directly, in which case the solution is to call /cmd/git.exe instead.

[1] https://github.com/search?q=repo%3Agit-for-windows%2Fgit%20iconv%20NOT%20path%3A%2F%5Et%5C%2F%2F%20NOT%20path%3A%2F%5EDocumentation%5C%2F%2F&type=code
[2] https://github.com/git-for-windows/MINGW-packages/blob/0c91cf2079184ae6a604e8f7a406a47d39305e72/mingw-w64-git/git-wrapper.c#L166-L258

> iconv.exe is not needed in everyday life, or is it?
> If yes, when?
> iconv.exe is used when you run the test-suite, to verify
> what Git is doing.
>
> Could you elaborate a little bit more,
> when iconv.exe is missing, and what is happening, please?
>
>>
>> But when PATH option #2 is chosen, and iconv.exe is unreachable from a Windows Command Prompt, the git commands which call upon iconv.exe do NOT indicate the error. The call to iconv.exe fails silently. It is only later after you commit, push and clone the repo again that you see the encoding failures.
>>
>> And the warning about overriding find and sort must be taken with a grain of salt, since the Windows versions of those programs are accessed via a Windows folder which appears earlier in the PATH.

We should probably consider rewording that warning.

>> So this Git for Windows installer screen is misleading. And perhaps iconv.exe should be relocated so it is accessible even when PATH option #2 is chosen. I intend to submit an issue on the Git for Windows issue tracker regarding this. I'll also submit an issue about the lack of an error when running 'git add' for a UTF-16LE with BOM file under PATH option #2.
>>
>> Thanks,
>> - Michael
>>
> []
>

^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Suggested clarification for .gitattributes reference documentation
  2024-01-13 9:24 ` Matthias Aßhauer
@ 2024-02-18 23:12 ` Johannes Schindelin
  0 siblings, 0 replies; 10+ messages in thread
From: Johannes Schindelin @ 2024-02-18 23:12 UTC (permalink / raw)
  To: Matthias Aßhauer
  Cc: Torsten Bögershausen, Michael Litwak, brian m. carlson, git@vger.kernel.org

Hi,

On Sat, 13 Jan 2024, Matthias Aßhauer wrote:

> On Sat, 13 Jan 2024, Torsten Bögershausen wrote:
>
> > On Sat, Jan 13, 2024 at 02:56:27AM +0000, Michael Litwak wrote:
> > > I just installed Git for Windows 2.43.0 and noticed the installer offers
> > > three options for altering the PATH:
> > >
> > > 1) Run git from git bash only
> > >
> > > 2) Run git from git bash, cmd.exe and PowerShell (RECOMMENDED)
> > >
> > > 3) Run git from git bash, cmd.exe and PowerShell with optional utilities
> > > (warning: will override find, sort and other system utilities).
> > >
> > > It turns out iconv.exe is accessible from cmd.exe (Command Prompt) only
> > > when you take the third option. But iconv.exe is NOT optional. It is
> > > required for git to deal with UTF-16LE with BOM text conversions (and
> > > probably for numerous other encoding conversions).
>
> For end users directly calling iconv.exe is definitely optional.
>
> > Please wait a second - and thanks for bringing this up.
> > To my knowledge the binary iconv.exe (or just iconv under non-Windows)
> > is never called from Git itself.
> > Git is using iconv_open() and friends, which are all inside
> > a library, either the C-library "libc", or "libiconv"
> > (not 100% sure about the naming here)
>
> Exactly. I can't find a single instance of Git for Windows calling iconv.exe
> instead of using the corresponding library functions. [1]
>
> And even if it did, iconv.exe is definitely on the path for git.exe [2] unless
> you're calling /(mingw|clangarm)(64|32)/bin/git.exe directly, in which case
> the solution is to call /cmd/git.exe instead.

Just a quick addition to this: _even if_ you happen to call `/*[63][42]/bin/git.exe` directly, the `PATH` is adjusted (unless `MSYSTEM` is set, which would indicate a pilot error without the corresponding `PATH` components such as `/usr/bin`):
https://github.com/git-for-windows/git/blob/v2.43.0.windows.1/compat/mingw.c#L3529

Ciao,
Johannes

> [1]
> https://github.com/search?q=repo%3Agit-for-windows%2Fgit%20iconv%20NOT%20path%3A%2F%5Et%5C%2F%2F%20NOT%20path%3A%2F%5EDocumentation%5C%2F%2F&type=code
> [2]
> https://github.com/git-for-windows/MINGW-packages/blob/0c91cf2079184ae6a604e8f7a406a47d39305e72/mingw-w64-git/git-wrapper.c#L166-L258
>
> > iconv.exe is not needed in everyday life, or is it?
> > If yes, when?
> > iconv.exe is used when you run the test-suite, to verify
> > what Git is doing.
> >
> > Could you elaborate a little bit more,
> > when iconv.exe is missing, and what is happening, please?
> >
> > >
> > > But when PATH option #2 is chosen, and iconv.exe is unreachable from a
> > > Windows Command Prompt, the git commands which call upon iconv.exe do NOT
> > > indicate the error. The call to iconv.exe fails silently. It is only
> > > later after you commit, push and clone the repo again that you see the
> > > encoding failures.
> > >
> > > And the warning about overriding find and sort must be taken with a grain
> > > of salt, since the Windows versions of those programs are accessed via a
> > > Windows folder which appears earlier in the PATH.
>
> We should probably consider rewording that warning.
>
> > > So this Git for Windows installer screen is misleading. And perhaps
> > > iconv.exe should be relocated so it is accessible even when PATH option #2
> > > is chosen. I intend to submit an issue on the Git for Windows issue
> > > tracker regarding this. I'll also submit an issue about the lack of an
> > > error when running 'git add' for a UTF-16LE with BOM file under PATH
> > > option #2.
> > >
> > > Thanks,
> > > - Michael
> > >
> > []
> >
>

^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Suggested clarification for .gitattributes reference documentation
  2024-01-13 7:43 ` Torsten Bögershausen
  2024-01-13 9:24 ` Matthias Aßhauer
@ 2024-01-16 0:19 ` Michael Litwak
  2024-01-16 2:06 ` brian m. carlson
  1 sibling, 1 reply; 10+ messages in thread
From: Michael Litwak @ 2024-01-16 0:19 UTC (permalink / raw)
  To: Torsten Bögershausen, Matthias Aßhauer
  Cc: brian m. carlson, git@vger.kernel.org

> To my knowledge the binary iconv.exe (or just iconv under non-Windows) is never called from Git itself.

> I can't find a single instance of Git for Windows calling iconv.exe instead of using the corresponding library functions.

Thank you for your responses. I think you are both right. Git must instead call methods in libiconv-2.dll to do encoding conversions. I have no idea why my Windows 10 PC could add a UTF-16LE with BOM file, but then fail to later successfully "decode" it, when running Git from an ordinary Command Prompt (cmd.exe). I assume this failure was a fluke, since I cannot replicate the failure on my other (Windows 11) PC.

So I am withdrawing my concerns about:

1) Git for Windows failing to support UTF-16LE with BOM.
2) Git for Windows installer being misleading in its "recommended" PATH modification option.

As for documentation clarifications for the .gitattributes manpage at https://git-scm.com/docs/gitattributes, I still suggest adding an explicit example for UTF-16LE with BOM, and/or adding a table listing which working-tree-encoding value to use for each of the following UTF-16 text encodings:

ENCODING             'working-tree-encoding' VALUE
-------------------  -----------------------------
UTF-16LE with BOM    UTF-16LE-BOM
UTF-16BE with BOM    UTF-16
UTF-16LE no BOM      UTF-16LE
UTF-16BE no BOM      UTF-16BE

Why bother clarifying the documentation? Because these UTF-16 encodings are commonly found on Windows systems. Notepad supports the first two, and many Visual Studio project wizards add various files using these encodings as well.
Older versions of PowerShell saved new .ps1 scripts using UTF-16BE with BOM as the default encoding.

Also, the current .gitattributes documentation makes frequent reference to "UTF-16" as an encoding but fails to be clear that the working-tree-encoding value "UTF-16" is now only for UTF-16BE with BOM. It would be easy to assume that the working-tree-encoding value "UTF-16" meant any UTF-16 file with a BOM (either LE or BE), which was the original meaning of this value before UTF-16LE-BOM was added to Git.

Finally, I am not sure how to use git add --renormalize to correct a UTF-16 file that was previously added incorrectly (i.e. with a missing or incorrect working-tree-encoding entry in .gitattributes). The git add documentation at https://git-scm.com/docs/git-add implies 'renormalize' resets only the end-of-line values; however, I suspect it also re-converts text encoding when a working-tree-encoding property is set. It would be helpful to know one way or the other.

- Michael Litwak

^ permalink raw reply [flat|nested] 10+ messages in thread
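To make the four rows of the table in the message above concrete, here is a small sketch, assuming a POSIX shell with od available, that prints the bytes of the single character A in each variant. The function name utf16_sample and the sample character are made up for illustration. The UTF-16 row follows the table as proposed (big-endian with BOM), although a conforming UTF-16 reader is expected to accept other forms as well.

```shell
# Emit the letter A (U+0041) in the on-disk form that each
# working-tree-encoding value in the table is meant to describe.
utf16_sample() {
    case "$1" in
        UTF-16LE-BOM) printf '\377\376A\000' ;;  # BOM FF FE, then 41 00
        UTF-16)       printf '\376\377\000A' ;;  # BOM FE FF, then 00 41
        UTF-16LE)     printf 'A\000' ;;          # no BOM, 41 00
        UTF-16BE)     printf '\000A' ;;          # no BOM, 00 41
        *)            return 1 ;;
    esac
}

# Print one hex dump per variant.
for enc in UTF-16LE-BOM UTF-16 UTF-16LE UTF-16BE; do
    printf '%-13s' "$enc"
    utf16_sample "$enc" | od -An -tx1
done
```

Running a similar hex dump on the first few bytes of a real file is a quick way to decide which row applies before writing the .gitattributes entry.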
* Re: Suggested clarification for .gitattributes reference documentation 2024-01-16 0:19 ` Michael Litwak @ 2024-01-16 2:06 ` brian m. carlson 2024-01-16 17:44 ` Torsten Bögershausen 0 siblings, 1 reply; 10+ messages in thread From: brian m. carlson @ 2024-01-16 2:06 UTC (permalink / raw) To: Michael Litwak Cc: Torsten Bögershausen, Matthias Aßhauer, git@vger.kernel.org [-- Attachment #1: Type: text/plain, Size: 6194 bytes --] On 2024-01-16 at 00:19:20, Michael Litwak wrote: > As for documentation clarifications for the .gitattributes manpage at > https://git-scm.com/docs/gitattributes, I still suggest adding an > explicit example for UTF-16LE with BOM, and/or adding a table listing > which working-tree-encoding value to use for each of the following > UTF-16 text encodings: > > ENCODING 'working-tree-encoding' VALUE > ------------------- ----------------------------- > UTF-16LE with BOM UTF-16LE-BOM I should point out that this encoding, while very common on Windows, is also nonstandard. The standard says that UTF-16LE and UTF-16BE don't include a BOM and are always the respective endianness. UTF-16 can have a BOM or not, and if it doesn't, it's big-endian. There is no standard-conforming way to force the use of little-endian with a BOM. The problem is that many Windows programs insist on the BOM, but also refuse to read big-endian data in violation of the standard[0]. That's why this nonstandard variant exists in Git. I'll also note that this particular nonstandard variant is essentially impossible to encode reliably on Unix outside of Git because it's nonstandard, so it's an extremely unportable choice. In fact, I'm not aware of _any_ tool on my Debian system other than Git that will guarantee a UTF-16 little-endian stream with BOM. My editor (Neovim) certainly doesn't. (Apparently Emacs, which is not on my system, may permit that, which does not surprise me in the least.) > UTF-16BE with BOM UTF-16 It's a little more complicated than that. 
"UTF-16" would allow UTF-16 big-endian with BOM, UTF-16 little-endian with BOM, or UTF-16 big-endian without BOM. In other words, UTF-16 is big-endian by default and otherwise requires a BOM, which may be included even if not required. A reader must handle every variant of this, and must honour the BOM if set and default to big-endian if not. A writer may write whichever variant pleases it most as long as it's consistent within the same message. > UTF-16LE no BOM UTF-16LE > UTF-16BE no BOM UTF-16BE I think the addition of this table is too much. UTF-16LE-BOM is common on Windows, and the rest are substantially less common. It's also very difficult to explain in a table what "UTF-16" means in an understandable way. And I also think it's also pretty clear that users should be using UTF-8 without BOM where possible. We do already mention both UTF-16, UTF-16LE, and UTF-16LE-BOM as options in the gitattributes manual page, and it's up to the user to know what their program wants and supports if that's not UTF-8. (I would say that the user wants a new program that _does_ support UTF-8, but perhaps I'm being unrealistically harsh.) I agree it's difficult because the documentation usually doesn't indicate what's supported and all the variants are hard to understand, but that's a huge part of the reason that we recommend UTF-8. I'll also add that in general, when you do have Unix systems that read or write data in UTF-16, they handle every variant correctly. Thus, the practical choice if you steadfastly refuse to use UTF-8 is either UTF-16LE-BOM (if your Windows program has the bug I mentioned above) or UTF-16, both of which we mention already in the manual page. I'm explicitly ignoring non-file contexts here, where one may use UTF-16LE or UTF-16BE, but those are substantially less common in actual files, which is what this feature describes. > Why bother clarifying the documentation? Because These UTF-16 > encodings are commonly found on Windows systems. 
Notepad supports the > first two, and many Visual Studio project wizards add various files > using these encodings as well. Older versions of PowerShell saved new > .ps1 scripts using UTF-16BE with BOM as the default encoding. True, but Notepad also supports UTF-8 and has for quite a while. According to the Powershell documentation[1], there is no portable character set option for non-ASCII characters, so in general it's impossible to know. I suspect that a simple "UTF-16" will be fine here, though, since it clearly doesn't have the bug mentioned above. > Also, the current .gitattributes documentation makes frequent > reference to "UTF-16" as an encoding but fails to be clear that the > working-tree-encoding value "UTF-16" is now only for UTF-16BE with > BOM. It would be easy to assume that the working-tree-encoding value > "UTF-16" meant any UTF-16 file with a BOM (either LE or BE), which was > the original meaning of this value before UTF-16LE-BOM was added to > Git. As I said, your statement isn't correct. That's what libiconv does on Windows. On Linux, glibc uses a little-endian variant with BOM on little-endian machines. musl, if memory serves me, always uses big-endian without a BOM. All of those are valid encodings, and a UTF-16 reader must handle all of them. > Finally, I am not sure how to use git add --renormalize to correct a > UTF-16 file that was previously added incorrectly (i.e. with a missing > or incorrect working-tree-encoding entry in .gitattributes). The git > add documentation at https://git-scm.com/docs/git-add implies > 'renormalize' resets only the end-of-line values; however, I suspect > it also re-converts text encoding when a working-tree-encoding > property is set. It would be helpful to know one way or the other. It does indeed affect the working-tree-encoding. If you wanted to send an inline patch created with git format-patch, it would probably be welcome to mention that. 
However, because in this project we typically scratch our own itch, if you don't send one, it's likely nobody else will, either. [0] https://datatracker.ietf.org/doc/html/rfc2781 § 4.1: “All applications that process text with the "UTF-16" charset label MUST be able to interpret both big- endian and little-endian text.” [1] https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_character_encoding?view=powershell-7.4 -- brian m. carlson (he/him or they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 10+ messages in thread
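The reader rules described in the message above (honour a BOM if present, otherwise assume big-endian) can be sketched in a few lines of Python. This is only an illustration of the RFC 2781 behaviour quoted in footnote [0], not what Git or any iconv implementation actually runs; the function name decode_utf16 and the sample text are made up for this sketch.

```python
import codecs


def decode_utf16(data: bytes) -> str:
    """Decode bytes labelled "UTF-16" using the RFC 2781 reader rules.

    Hypothetical helper for illustration only.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        # BOM FE FF: big-endian, skip the two BOM bytes.
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        # BOM FF FE: little-endian, skip the two BOM bytes.
        return data[2:].decode("utf-16-le")
    # No BOM: the plain "UTF-16" label defaults to big-endian.
    return data.decode("utf-16-be")


text = "Write-Host 'hi'\n"

# The three byte streams a "UTF-16" writer is allowed to produce.
variants = {
    "big-endian with BOM": codecs.BOM_UTF16_BE + text.encode("utf-16-be"),
    "little-endian with BOM": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
    "big-endian without BOM": text.encode("utf-16-be"),
}

decoded = {name: decode_utf16(raw) for name, raw in variants.items()}
```

All three byte streams decode to the same text, which is why a conforming reader does not care which variant a writer picked. A program that accepts only the little-endian-with-BOM stream is the case UTF-16LE-BOM exists for.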
* Re: Suggested clarification for .gitattributes reference documentation 2024-01-16 2:06 ` brian m. carlson @ 2024-01-16 17:44 ` Torsten Bögershausen 0 siblings, 0 replies; 10+ messages in thread From: Torsten Bögershausen @ 2024-01-16 17:44 UTC (permalink / raw) To: brian m. carlson, Michael Litwak, Matthias Aßhauer, git@vger.kernel.org On Tue, Jan 16, 2024 at 02:06:47AM +0000, brian m. carlson wrote: > On 2024-01-16 at 00:19:20, Michael Litwak wrote: > > As for documentation clarifications for the .gitattributes manpage at > > https://git-scm.com/docs/gitattributes, I still suggest adding an > > explicit example for UTF-16LE with BOM, and/or adding a table listing > > which working-tree-encoding value to use for each of the following > > UTF-16 text encodings: > > > > ENCODING 'working-tree-encoding' VALUE > > ------------------- ----------------------------- > > UTF-16LE with BOM UTF-16LE-BOM > > I should point out that this encoding, while very common on Windows, is > also nonstandard. In general, I agree with everything that is snipped, thanks for the long wordings. [] > (Apparently Emacs, which is not on my system, may > permit that, which does not surprise me in the least.) emacs seems to handle UTF-16LE-BOM just fine. > > > UTF-16BE with BOM UTF-16 > [] > I think the addition of this table is too much. UTF-16LE-BOM is common > on Windows, and the rest are substantially less common. It's also very > difficult to explain in a table what "UTF-16" means in an understandable > way. And I also think it's also pretty clear that users should be using > UTF-8 without BOM where possible. > > We do already mention both UTF-16, UTF-16LE, and UTF-16LE-BOM as options > in the gitattributes manual page, and it's up to the user to know what > their program wants and supports if that's not UTF-8. What exactly is missing in the documentation ? Could you please try to send us a diff (or even better a patch), so that we can get an idea, of what can be improved ?
From my reading UTF-16LE-BOM is already mentioned. It would be nice to see (from a user), what is probably missing. > > Finally, I am not sure how to use git add --renormalize to correct a > > UTF-16 file that was previously added incorrectly (i.e. with a missing > > or incorrect working-tree-encoding entry in .gitattributes). The git > > add documentation at https://git-scm.com/docs/git-add implies > > 'renormalize' resets only the end-of-line values; however, I suspect > > it also re-converts text encoding when a working-tree-encoding > > property is set. It would be helpful to know one way or the other. > > It does indeed affect the working-tree-encoding. If you wanted to send > an inline patch created with git format-patch, it would probably be > welcome to mention that. However, because in this project we typically > scratch our own itch, if you don't send one, it's likely nobody else > will, either. For the record: It will even run the "clean" filter, if it has changed or has been freshly enabled. So yes, a patch would be appreciated. Thanks for bringing this up. ^ permalink raw reply [flat|nested] 10+ messages in thread
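Since the thread leaves it to the user to know which UTF-16 variant their files use, a quick look at the first two bytes can help before picking an attribute value. The sketch below is a hypothetical helper (sniff_utf16_bom is not part of Git), it assumes the input is already known to be UTF-16, and the value it returns is only a starting point based on the values discussed in this thread.

```python
import codecs
from typing import Optional


def sniff_utf16_bom(data: bytes) -> Optional[str]:
    """Suggest a working-tree-encoding value from a file's leading BOM.

    Assumption: the caller already knows the bytes are UTF-16.
    Returns None when there is no BOM, because byte order cannot be
    read off the BOM in that case.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        # FF FE: little-endian with BOM, the variant common on Windows.
        return "UTF-16LE-BOM"
    if data.startswith(codecs.BOM_UTF16_BE):
        # FE FF: big-endian with BOM, which plain "UTF-16" readers accept.
        return "UTF-16"
    return None
```

For a file without a BOM the helper deliberately returns nothing: choosing between UTF-16LE and UTF-16BE then requires knowing what the producing program writes.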
end of thread, newest message 2024-02-18 23:12 UTC

Thread overview: 10+ messages
2024-01-12 21:25 Suggested clarification for .gitattributes reference documentation Michael Litwak
2024-01-12 21:50 ` brian m. carlson
2024-01-12 22:36 ` Michael Litwak
2024-01-13 2:56 ` Michael Litwak
2024-01-13 7:43 ` Torsten Bögershausen
2024-01-13 9:24 ` Matthias Aßhauer
2024-02-18 23:12 ` Johannes Schindelin
2024-01-16 0:19 ` Michael Litwak
2024-01-16 2:06 ` brian m. carlson
2024-01-16 17:44 ` Torsten Bögershausen