* [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?
From: Drew Northup @ 2010-10-22 16:06 UTC
To: Git mailing list; +Cc: Junio C Hamano

I am currently thinking about what the best way to present readable (and
safely email-able) patches to the user may be when the content is UTF-16.
This is part of my ongoing work to treat UTF-16 as text (in other words,
the crlf options will work and .gitattributes hacks won't be required to
display diffs, etc). I was also concerned that the result be
re-importable to valid UTF-16 in the end.

This has led me to consider printing diffs as UTF-8 (no data loss, at
least 16->8) when the source text is UTF-16. This should also be git-gui
/ gitk friendly (in theory). I would favorably consider this as a
configurable option (export_unicode_diff_as_utf8 ?) leaving plain UTF-16
output as the standard output from "git diff" (once I convince it that
UTF-16 is indeed text).

Also, there is the issue of being able to recognize UTF-16 as UTF-16 in
diffs/patches. Is there a precedent/standard I should be aware of with
respect to BOMs and patches? I would think that adhering to the UTF-16
standard with respect to whole text files would make sense here (no BOM
== Big Endian, BOM used to match LE/BE otherwise).

Comments welcome!

--
-Drew Northup N1XIM
   AKA RvnPhnx on OPN
________________________________________________
"As opposed to vegetable or mineral error?"
-John Pescatore, SANS NewsBites Vol. 12 Num. 59

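A minimal sketch of the BOM rule proposed above (no BOM means big-endian,
otherwise the BOM selects LE/BE), together with a lossless round trip
through UTF-8. The function names are hypothetical and this is not code
from Git; it only illustrates that remembering the byte order and the BOM
is enough to re-import the UTF-8 form as the original UTF-16 bytes.

```python
import codecs
from typing import Tuple


def detect_utf16(data: bytes) -> Tuple[str, bytes]:
    """Return (codec name, BOM bytes found) for UTF-16 input."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le", codecs.BOM_UTF16_LE
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be", codecs.BOM_UTF16_BE
    # Assumption taken from the message above: no BOM means big-endian.
    return "utf-16-be", b""


def to_utf8(data: bytes) -> Tuple[bytes, str, bytes]:
    """Transcode UTF-16 to UTF-8, returning what is needed to undo it."""
    codec, bom = detect_utf16(data)
    text = data[len(bom):].decode(codec)
    return text.encode("utf-8"), codec, bom


def from_utf8(utf8: bytes, codec: str, bom: bytes) -> bytes:
    """Rebuild the original UTF-16 bytes from the UTF-8 form."""
    return bom + utf8.decode("utf-8").encode(codec)
```

Line endings pass through untouched here, so CRLF handling would remain
a separate step.
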
* Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?
From: Jonathan Nieder @ 2010-10-22 16:18 UTC
To: Drew Northup; +Cc: Git mailing list, Junio C Hamano

Hi,

Drew Northup wrote:

> This is part of my ongoing work to treat UTF-16 as text (in
> other words, the crlf options will work and .gitattributes hacks won't
> be required to display diffs, etc).

What's wrong with .gitattributes for this use case? I would think a
clean/smudge filter would produce very good behavior from most git
commands.

If speed is the issue, maybe a built-in clean/smudge filter would
address that?

* Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?
From: Drew Northup @ 2010-10-22 17:01 UTC
To: Jonathan Nieder; +Cc: Git mailing list, Junio C Hamano

On Fri, 2010-10-22 at 11:18 -0500, Jonathan Nieder wrote:
> Hi,
>
> Drew Northup wrote:
>
> > This is part of my ongoing work to treat UTF-16 as text (in
> > other words, the crlf options will work and .gitattributes hacks won't
> > be required to display diffs, etc).
>
> What's wrong with .gitattributes for this use case? I would think a
> clean/smudge filter would produce very good behavior from most git
> commands.
>
> If speed is the issue, maybe a built-in clean/smudge filter would
> address that?

That still doesn't fix the crlf issue, for starters. Also, I would like
to be able to email patches for files that are in UTF-16 and properly
re-import them. Unless I'm missing something really big there's not much
that a display filter is going to do for me there. I want this to be
two-way.

To be honest, not needing a display filter goes along with treating
UTF-16 patches as text. The same sort of code will be required for both,
so why not get predictable behavior for another text encoding at the
same time?

--
-Drew Northup N1XIM
   AKA RvnPhnx on OPN

* Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?
From: Jonathan Nieder @ 2010-10-22 17:12 UTC
To: Drew Northup; +Cc: Git mailing list, Junio C Hamano

Hi again,

Drew Northup wrote:

> That still doesn't fix the crlf issue, for starters. Also, I would like
> to be able to email patches for files that are in UTF-16 and properly
> re-import them. Unless I'm missing something really big there's not much
> that a display filter is going to do for me there.

Right, I think you're missing something big. textconv is a display
filter. clean/smudge convert between internal and external
representation (and your clean/smudge scripts could take care of CRLF
themselves if desired).

That said, I wouldn't be surprised if clean/smudge filters don't do
everything you want. If you do go that way, please keep the list
posted so the mechanism can be improved.

And longer term, maybe people will want something tailor-made after
all? I just imagine it would be more productive to try out the
generic mechanisms first.

* Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?
From: Drew Northup @ 2010-10-22 17:27 UTC
To: Jonathan Nieder; +Cc: Git mailing list, Junio C Hamano

On Fri, 2010-10-22 at 12:12 -0500, Jonathan Nieder wrote:
> Hi again,
>
> Drew Northup wrote:
>
> > That still doesn't fix the crlf issue, for starters. Also, I would like
> > to be able to email patches for files that are in UTF-16 and properly
> > re-import them. Unless I'm missing something really big there's not much
> > that a display filter is going to do for me there.
>
> Right, I think you're missing something big. textconv is a display
> filter. clean/smudge convert between internal and external
> representation (and your clean/smudge scripts could take care of CRLF
> themselves if desired).
>
> That said, I wouldn't be surprised if clean/smudge filters don't do
> everything you want. If you do go that way, please keep the list
> posted so the mechanism can be improved.

Well I shall plumb the documentation again.... just in case. I'm not
holding my breath that it will do what I (and frankly a fair number of
other people) want. We just want version control that treats text like
text. FULL STOP. Why isn't UTF-16 text???????

> And longer term, maybe people will want something tailor-made after
> all? I just imagine it would be more productive to try out the
> generic mechanisms first.

Please forgive me for being offended that UTF-16 text is not "generic"
enough.

--
-Drew Northup N1XIM
   AKA RvnPhnx on OPN

* Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?
From: Jonathan Nieder @ 2010-10-22 17:30 UTC
To: Drew Northup; +Cc: Git mailing list, Junio C Hamano

Drew Northup wrote:

> Please forgive me for being offended that UTF-16 text is not "generic"
> enough.

First some words of explanation.

By "generic" I did not mean ubiquitous, unbranded, popular, or some
other almost-synonym. What I actually meant is that it is not obvious
what to do with UTF-16. Should it be converted to UTF-8 for output?
Should it always be normalized when added to the index, so that
switching between canonically equivalent sequences does not result
in spurious diffs? Should the byte-for-byte representation be
faithfully preserved, even when it is not valid UTF-16?

When in such a situation, often a good approach is the following:
take care of mechanism first, then policy. So the first thing to do
is to make sure that the code is _capable_ of what people are trying
to do; then one can try various configurations and see what is most
convenient; and finally, one can make sure the program behaves in an
intuitive way by setting a reasonable default.

So by "generic" I meant those mechanisms that can be used in the
context of multiple policies.

Apologies; I never meant to offend; please carry on and I will leave
you in peace.

Jonathan

* Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?
From: Jakub Narebski @ 2010-10-22 17:58 UTC
To: Jonathan Nieder; +Cc: Drew Northup, Git mailing list, Junio C Hamano

Jonathan Nieder <jrnieder@gmail.com> writes:
> Drew Northup wrote:
>
> > Please forgive me for being offended that UTF-16 text is not "generic"
> > enough.
>
> First some words of explanation.
>
> By "generic" I did not mean ubiquitous, unbranded, popular, or some
> other almost-synonym. What I actually meant is that it is not obvious
> what to do with UTF-16. Should it be converted to UTF-8 for output?
> Should it always be normalized when added to the index, so that
> switching between canonically equivalent sequences does not result
> in spurious diffs? Should the byte-for-byte representation be
> faithfully preserved, even when it is not valid UTF-16?
>
> When in such a situation, often a good approach is the following:
> take care of mechanism first, then policy. So the first thing to do
> is to make sure that the code is _capable_ of what people are trying
> to do; then one can try various configurations and see what is most
> convenient; and finally, one can make sure the program behaves in an
> intuitive way by setting a reasonable default.
>
> So by "generic" I meant those mechanisms that can be used in the
> context of multiple policies.

It would be nice if there was a way (perhaps steerable via
gitattributes) to change whether Git treats a file as a sequence of
bytes (as it does now) or as a sequence of characters (probably like
Perl 6, i.e. as a sequence of graphemes), though this would require
specifying the encoding (and normalization) used.

Wishful thinking

--
Jakub Narebski
Poland
ShadeHawk on #git

* Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?
From: Jakub Narebski @ 2010-10-22 17:48 UTC
To: Drew Northup; +Cc: Jonathan Nieder, Git mailing list, Junio C Hamano

Drew Northup <drew.northup@maine.edu> writes:
> On Fri, 2010-10-22 at 12:12 -0500, Jonathan Nieder wrote:
> >
> > Drew Northup wrote:
> >
> > > That still doesn't fix the crlf issue, for starters. Also, I would like
> > > to be able to email patches for files that are in UTF-16 and properly
> > > re-import them. Unless I'm missing something really big there's not much
> > > that a display filter is going to do for me there.
> >
> > Right, I think you're missing something big. textconv is a display
> > filter. clean/smudge convert between internal and external
> > representation (and your clean/smudge scripts could take care of CRLF
> > themselves if desired).
> >
> > That said, I wouldn't be surprised if clean/smudge filters don't do
> > everything you want. If you do go that way, please keep the list
> > posted so the mechanism can be improved.
>
> Well I shall plumb the documentation again.... just in case. I'm not
> holding my breath that it will do what I (and frankly a fair number of
> other people) want. We just want version control that treats text like
> text. FULL STOP. Why isn't UTF-16 text???????

If you are asking why Git detects files with text in UTF-16 / UCS-2 as
binary, it is because Git (re)uses the same heuristic as e.g. GNU diff
(and probably also the -T file test in Perl), and one of the heuristics
is that if a file contains a NUL ("\0") character, then it is most
probably binary (because legacy C programs for text would have trouble
with NUL characters).

That probably doesn't help you any...

--
Jakub Narebski
Poland
ShadeHawk on #git

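The NUL heuristic described above can be sketched in a few lines. This
is an illustration only, not Git's actual code: the function name is
hypothetical and the 8000-byte window is an assumption chosen for the
example. It shows why mostly-ASCII text encoded as UTF-16 trips the
check: every ASCII character carries a zero byte.

```python
def looks_binary(data: bytes, window: int = 8000) -> bool:
    """Guess "binary" if a NUL byte appears in the first `window` bytes.

    `window` is an illustrative assumption, not a documented Git value.
    """
    return b"\0" in data[:window]


sample = "hello\n"
# The same text passes as UTF-8 but is flagged once encoded as UTF-16,
# because "h" becomes the two bytes 0x68 0x00 (little-endian).
utf8_flagged = looks_binary(sample.encode("utf-8"))
utf16_flagged = looks_binary(sample.encode("utf-16-le"))
```

Any Unicode detection would therefore have to run before, or instead
of, a check of this kind.
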
* Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?
From: Drew Northup @ 2010-10-22 18:06 UTC
To: Jakub Narebski; +Cc: Jonathan Nieder, Git mailing list, Junio C Hamano

On Fri, 2010-10-22 at 10:48 -0700, Jakub Narebski wrote:
> Drew Northup <drew.northup@maine.edu> writes:

> > Well I shall plumb the documentation again.... just in case. I'm not
> > holding my breath that it will do what I (and frankly a fair number of
> > other people) want. We just want version control that treats text like
> > text. FULL STOP. Why isn't UTF-16 text???????
>
> If you are asking why Git detects files with text in UTF-16 / UCS-2 as
> binary, it is because Git (re)uses the same heuristic as e.g. GNU
> diff (and probably also the -T file test in Perl), and one of the
> heuristics is that if a file contains a NUL ("\0") character, then it
> is most probably binary (because legacy C programs for text would have
> trouble with NUL characters).
>
> That probably doesn't help you any...

I did find that already. I still have not decided the correct place to
shoehorn in Unicode detection, but I'll be sure to do that before I
bother anybody else with it. I already wrote code to detect (reasonably)
valid UTF-16 (if it isn't obviously valid then I'll just as soon deal
with it as binary data, so as to avoid a foot-shooting exercise).
My main motivation here has been to get some feedback as I write stuff
so as to not waste a lot of time writing something that could be done
better.

(As opposed to not done at all, which is the feeling I'm getting from a
few people around here...)

--
-Drew Northup N1XIM
   AKA RvnPhnx on OPN

* Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?
From: Jakub Narebski @ 2010-10-22 19:18 UTC
To: Drew Northup; +Cc: Jonathan Nieder, Git mailing list, Junio C Hamano

On Fri, 22 Oct 2010, Drew Northup wrote:
> On Fri, 2010-10-22 at 10:48 -0700, Jakub Narebski wrote:
> > Drew Northup <drew.northup@maine.edu> writes:
>
> > > Well I shall plumb the documentation again.... just in case. I'm not
> > > holding my breath that it will do what I (and frankly a fair number of
> > > other people) want. We just want version control that treats text like
> > > text. FULL STOP. Why isn't UTF-16 text???????
> >
> > If you are asking why Git detects files with text in UTF-16 / UCS-2 as
> > binary, it is because Git (re)uses the same heuristic as e.g. GNU
> > diff (and probably also the -T file test in Perl), and one of the
> > heuristics is that if a file contains a NUL ("\0") character, then it
> > is most probably binary (because legacy C programs for text would have
> > trouble with NUL characters).
> >
> > That probably doesn't help you any...
>
> I did find that already. I still have not decided the correct place to
> shoehorn in Unicode detection, but I'll be sure to do that before I
> bother anybody else with it. I already wrote code to detect (reasonably)
> valid UTF-16 (if it isn't obviously valid then I'll just as soon deal
> with it as binary data, so as to avoid a foot-shooting exercise).
> My main motivation here has been to get some feedback as I write stuff
> so as to not waste a lot of time writing something that could be done
> better.
>
> (As opposed to not done at all, which is the feeling I'm getting from a
> few people around here...)

Git has good support for different encodings in commit messages (which
are always text, as opposed to file contents, which might be binary or
text). You specify what encoding you use for commit messages with
i18n.commitEncoding (defaults to 'utf-8'); if it is different from
utf-8 it gets saved in the 'encoding' header. You can even specify that
the encoding your terminal uses is different from i18n.commitEncoding
with i18n.logOutputEncoding.

The only support for different encodings of file contents is in
git-gui. You provide the encoding that a file uses via .gitattributes
(the `encoding` attribute). You specify what output encoding git-gui
(Tcl/Tk) uses with the `gui.encoding` config variable.

I guess that what you need to support for diffs and 'git show <file>'
etc. is respecting the `encoding` gitattribute, and providing the
encoding that the console uses with e.g. i18n.blobOutputEncoding (or
something like that).

HTH
--
Jakub Narebski
Poland

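The suggestion above (look up a per-file `encoding` attribute, then
convert to the console encoding) could look roughly like the sketch
below. Everything here is hypothetical: the function names, the use of
Python's fnmatch as a stand-in for real gitattributes pattern matching
(which has richer rules), and the `output_encoding` parameter standing
in for the proposed i18n.blobOutputEncoding setting.

```python
import fnmatch
from typing import Iterable


def declared_encoding(path: str, attr_lines: Iterable[str],
                      default: str = "utf-8") -> str:
    """Return the encoding declared for `path`; the last match wins."""
    encoding = default
    for line in attr_lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comments
        pattern, attrs = fields[0], fields[1:]
        # Simplification: fnmatch is only an approximation of the
        # pattern rules used by real .gitattributes files.
        if fnmatch.fnmatchcase(path, pattern):
            for attr in attrs:
                if attr.startswith("encoding="):
                    encoding = attr.split("=", 1)[1]
    return encoding


def blob_for_console(blob: bytes, path: str, attr_lines: Iterable[str],
                     output_encoding: str = "utf-8") -> bytes:
    """Re-encode a blob from its declared encoding for display."""
    text = blob.decode(declared_encoding(path, attr_lines))
    # Characters the console cannot represent are replaced, not fatal.
    return text.encode(output_encoding, errors="replace")
```

With an attribute line such as `*.txt encoding=utf-16`, a UTF-16 blob
would reach a UTF-8 console as UTF-8, while files without a declaration
pass through under the default.
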
* Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?
From: Joshua Juran @ 2010-10-22 18:28 UTC
To: Jonathan Nieder; +Cc: Drew Northup, Git mailing list, Junio C Hamano

On Oct 22, 2010, at 9:18 AM, Jonathan Nieder wrote:

> Drew Northup wrote:
>
>> This is part of my ongoing work to treat UTF-16 as text (in
>> other words, the crlf options will work and .gitattributes hacks
>> won't be required to display diffs, etc).

I would like to see the same thing for MacRoman-encoded text.[1]
This is the encoding used by classic Mac development tools such as
Metrowerks C/C++ (packaged as CodeWarrior) and Apple's Rez resource
compiler (even the version in OS X). Clearly, UTF-8 checkouts are
not an option here.

> What's wrong with .gitattributes for this use case? I would think a
> clean/smudge filter would produce very good behavior from most git
> commands.

I wrote a Mac<->UTF-8 converter in C++ and set it as the clean/smudge
filter for .r (Rez) files. Checkouts were noticeably slower (on a
real machine, not one of my antiques). This would be much worse if I
also applied it to C and C++ source files (most, but not all, of
which are ASCII anyway).

> If speed is the issue, maybe a built-in clean/smudge filter would
> address that?

While the performance cost could be overlooked, a worse problem
occurred when I checked out a branch into which the conversion of
files from MacRoman to UTF-8 hadn't occurred. It automatically
dirtied my working tree, requiring me to temporarily disable the
filter attribute and reset --hard. I also resorted to checkout -f a
number of times -- a bad habit, I'm sure.

In the end I concluded that (a) these files are definitely text, and
(b) they are natively MacRoman and should be stored that way. There
is no advantage to using UTF-8 since the tools can't handle it, and
even were one to write a UTF-8-capable Rez compiler, the resources
it outputs are still MacRoman-encoded, so no Unicode support is
possible.

Finally, (c) the end-to-end principle applies. Don't convert data en
route, but wait until it's necessary. Premature conversion was the
curse of FTP; let's not repeat it. But Git should definitely
convert data to match the encoding of the display device; writing
anything but valid UTF-8 to a UTF-8 terminal is in error. The same
applies in gitk.

Another issue (on which my thoughts are less clear) is the use of CR
newlines. CR is also native to classic Mac OS, but in contrast to
UTF-8, Mac developer tools are generally newline-agnostic, whereas
typical Unix programs are less forgiving in expecting LF -- so I've
been using linefeeds in my source code. However, it might be useful
to retain platform-customary newlines for the purpose of guessing
non-UTF character encodings: Carriage returns would almost certainly
indicate MacRoman rather than ISO-8859-1. But a more complete and
robust solution would be to store the encoding somewhere, possibly in
the blob itself, or in the tree storing the filename.

Josh

[1] MacRoman is an extended ASCII character set native to classic Mac
OS. <http://en.wikipedia.org/wiki/Mac_OS_Roman>

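For readers unfamiliar with the filter pair described above, here is a
minimal sketch of a MacRoman<->UTF-8 clean/smudge conversion. It is not
the C++ converter mentioned in the message; the function names are
hypothetical, and Python's built-in "mac_roman" codec is assumed to be
an adequate stand-in for the real character set.

```python
def clean(worktree_bytes: bytes) -> bytes:
    """Working tree (MacRoman) -> repository (UTF-8), run on check-in."""
    return worktree_bytes.decode("mac_roman").encode("utf-8")


def smudge(repo_bytes: bytes) -> bytes:
    """Repository (UTF-8) -> working tree (MacRoman), run on checkout."""
    return repo_bytes.decode("utf-8").encode("mac_roman")


def smudge_accepts(repo_bytes: bytes) -> bool:
    """Report whether a stored blob is something smudge can convert."""
    try:
        smudge(repo_bytes)
        return True
    except UnicodeError:
        return False
```

The round trip is lossless for content that was converted on check-in,
but a blob committed before the conversion still holds raw MacRoman
bytes. Such a blob is usually not valid UTF-8, so `smudge_accepts`
returns False for it. That mismatch is one way to picture the
dirty-working-tree problem reported above.
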
* Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?
From: Jeff King @ 2010-10-22 19:13 UTC
To: Joshua Juran
Cc: Jonathan Nieder, Drew Northup, Git mailing list, Junio C Hamano

On Fri, Oct 22, 2010 at 11:28:44AM -0700, Joshua Juran wrote:

> >What's wrong with .gitattributes for this use case? I would think a
> >clean/smudge filter would produce very good behavior from most git
> >commands.
>
> I wrote a Mac<->UTF-8 converter in C++ and set it as the clean/smudge
> filter for .r (Rez) files. Checkouts were noticeably slower (on a
> real machine, not one of my antiques). This would be much worse if I
> also applied it to C and C++ source files (most, but not all, of
> which are ASCII anyway).

Not surprising, as you were probably running your filter a lot. Clean
and smudge could perhaps benefit from the same notes-caching layer that
textconv uses (caching the "smudged" version of each clean file). But
that would only impact checkout. Most other operations use the "clean"
representation already, so they should be full-speed.

You could also cache the other way (mapping smudged sha1's into clean
sha1's). But I doubt that would do you any good. We generally see those
when updating the index with "git add", which means either the stat
information is clean (and we don't have to clean the file) or it isn't
(in which case you probably have new content that has not been seen
before, which means a cache miss).

And of course it doesn't help the other clean/smudge inconveniences you
ran into.

-Peff

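The caching idea above (remember the smudged form of each clean blob)
can be pictured with an in-memory sketch. This is not the notes-based
cache that textconv uses; the class and its fields are hypothetical.
The only Git-specific fact relied on is that a blob's SHA-1 is computed
over the header "blob <size>\0" followed by the content.

```python
import hashlib
from typing import Callable, Dict


def blob_sha1(content: bytes) -> str:
    """Compute the object name Git assigns to a blob with this content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


class SmudgeCache:
    """Memoize a smudge filter, keyed by the clean blob's SHA-1."""

    def __init__(self, smudge: Callable[[bytes], bytes]) -> None:
        self._smudge = smudge
        self._cache: Dict[str, bytes] = {}
        self.misses = 0  # number of times the real filter had to run

    def get(self, clean_content: bytes) -> bytes:
        key = blob_sha1(clean_content)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._smudge(clean_content)
        return self._cache[key]
```

Checking out the same blob twice runs the filter once. As the message
notes, the reverse mapping would rarely hit, since content seen by
"git add" is typically new.
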
* Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?
From: Jonathan Nieder @ 2010-10-22 19:53 UTC
To: Joshua Juran; +Cc: Drew Northup, Git mailing list, Junio C Hamano, Jeff King

Joshua Juran wrote:

> I would like to see the same thing for MacRoman-encoded text.[1]
> This is the encoding used by classic Mac development tools such as
> Metrowerks C/C++ (packaged as CodeWarrior) and Apple's Rez resource
> compiler (even the version in OS X). Clearly, UTF-8 checkouts are
> not an option here.

Yes, makes sense. There are (at least) two approaches you could use
here: treat the content as precious and use e.g. textconv for readable
diffs, or treat the content as UTF-8 text and use clean/smudge to
ensure the checkout has the right encoding.

So let's see what happens with the latter:

> I wrote a Mac<->UTF-8 converter in C++ and set it as the
> clean/smudge filter for .r (Rez) files. Checkouts were noticeably
> slower (on a real machine, not one of my antiques).

Vague ideas to mitigate that:

 a) allow a single clean/smudge filter invocation for a batch of files
 b) cache, as Jeff hinted
 c) allow custom "native" clean/smudge filters, executed using dlopen()

> While the performance cost could be overlooked, a worse problem
> occurred when I checked out a branch into which the conversion of
> files from MacRoman to UTF-8 hadn't occurred. It automatically
> dirtied my working tree, requiring me to temporarily disable the
> filter attribute and reset --hard. I also resorted to checkout -f a
> number of times -- a bad habit, I'm sure.

The jn/merge-renormalize topic from pu might help somewhat (or might
not). In any event, if you have a test case, I would be happy to look
at it.

> In the end I concluded that (a) these files are definitely text, and
> (b) they are natively MacRoman and should be stored that way. There
> is no advantage to using UTF-8 since the tools can't handle it, and
> even were one to write a UTF-8-capable Rez compiler, the resources
> it outputs are still MacRoman-encoded, so no Unicode support is
> possible.
>
> Finally, (c) the end-to-end principle applies.

Yep. Although "definitely text" seems somewhat abstract to me. Is
the problem that "git diff" fails to default to --text in some
situation?

> But Git should definitely
> convert data to match the encoding of the display device; writing
> anything but valid UTF-8 to a UTF-8 terminal is in error.

Oh, this is what you mean. Except for log encoding, git is not paying
attention to the display encoding at all.

[...]
> But a more
> complete and robust solution would be to store the encoding
> somewhere, possibly in the blob itself, or in the tree storing the
> filename.

How about Jakub's idea of keeping it in .gitattributes (or some
similarly visible key/value store)? Two reasons:

 1. When asked to declare encoding, half the time people will be
    wrong. So it seems worthwhile to make the declared encoding
    visible enough to fix.

 2. Two ASCII files identical except that one is declared as
    latin1 and the other utf8 should be considered identical.

Thanks for some food for thought.

* Git Attribute: File Text Encoding {WAS: Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?}
From: Drew Northup @ 2010-10-22 20:18 UTC
To: Jonathan Nieder; +Cc: Joshua Juran, Git mailing list, Junio C Hamano, Jeff King

On Fri, 2010-10-22 at 14:53 -0500, Jonathan Nieder wrote:
> Joshua Juran wrote:

> > But a more
> > complete and robust solution would be to store the encoding
> > somewhere, possibly in the blob itself, or in the tree storing the
> > filename.
>
> How about Jakub's idea of keeping it in .gitattributes (or some
> similarly visible key/value store)? Two reasons:
>
>  1. When asked to declare encoding, half the time people will be
>     wrong. So it seems worthwhile to make the declared encoding
>     visible enough to fix.
>
>  2. Two ASCII files identical except that one is declared as
>     latin1 and the other utf8 should be considered identical.
>
> Thanks for some food for thought.

I think that's a fine place to start. Shall I start a branch for it when
I get home (where the code I'm working on is located)? It would be good
practice if nothing else...

--
-Drew Northup N1XIM
   AKA RvnPhnx on OPN

* Re: Git Attribute: File Text Encoding {WAS: Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?}
From: Jakub Narebski @ 2010-10-22 21:49 UTC
To: Drew Northup
Cc: Jonathan Nieder, Joshua Juran, Git mailing list, Junio C Hamano, Jeff King

Drew Northup <drew.northup@maine.edu> writes:
> On Fri, 2010-10-22 at 14:53 -0500, Jonathan Nieder wrote:
> > Joshua Juran wrote:
>
> > > But a more
> > > complete and robust solution would be to store the encoding
> > > somewhere, possibly in the blob itself, or in the tree storing the
> > > filename.
> >
> > How about Jakub's idea of keeping it in .gitattributes (or some
> > similarly visible key/value store)? Two reasons:
> >
> >  1. When asked to declare encoding, half the time people will be
> >     wrong. So it seems worthwhile to make the declared encoding
> >     visible enough to fix.
> >
> >  2. Two ASCII files identical except that one is declared as
> >     latin1 and the other utf8 should be considered identical.
> >
> > Thanks for some food for thought.
>
> I think that's a fine place to start. Shall I start a branch for it when
> I get home (where the code I'm working on is located)? It would be good
> practice if nothing else...

Just for clarification: Git supports and uses the `encoding`
gitattribute, but it is currently used _only_ by git-gui (by converting
from the `encoding` gitattribute to the `gui.encoding` given by config).

The places that are missing:

1. A way to check an attribute of a file in a given tree. Currently
   git-check-attr checks for .gitattributes _only_ from the working
   area (in addition to the unversioned .git/info/attributes, and
   perhaps in the future core.attributesFile a la core.excludesFile).

2. A consensus on where conversion should take place: at which level in
   the stack, for what output destination, etc.

3. Support for i18n.blobOutputEncoding, which would convert (unless
   overridden) e.g. in 'git show <blob>', and perhaps also in diff.

--
Jakub Narebski
Poland
ShadeHawk on #git