* Re: [PATCH 00/22] Refactor to accept NUL in commit messages [not found] <1319277881-4128-1-git-send-email-pclouds@gmail.com> @ 2011-10-22 19:09 ` Jeff King 2011-10-23 10:44 ` Robin Rosenberg 2011-10-22 22:47 ` Junio C Hamano 1 sibling, 1 reply; 26+ messages in thread From: Jeff King @ 2011-10-22 19:09 UTC (permalink / raw) To: Nguyễn Thái Ngọc Duy Cc: git, Junio C Hamano, Ævar Arnfjörð On Sat, Oct 22, 2011 at 09:04:19PM +1100, Nguyen Thai Ngoc Duy wrote: > This series helps pass commit message size up to output functions, > though it does not change any output functions to print ^@. Can we take a step back for a second and discuss what git _should_ do with commits that contain NUL? If all of the pretty-print functions are just going consider "foo\0bar" to be "foo^@bar", then maybe it would be much simpler to just "normalize" the commit message into a C string at a lower level, and pass it around as a string as we currently do. On the other hand, if we are eventually looking to add an option like "--include-NUL-in-commit-message", then it would make sense for the real contents and size to get passed around. > All functions up to the last patch learn to accept a string as a pair > <const char *start, const char *end> as a preparation step. These > changes are relatively simple. Or it could have been so if I did not > attempt to reduce some code duplication found while working on this > series. Great. Reducing code duplication is always a plus. > The last patch turns commit_buffer field in struct commit to "struct > strbuf *". This approach costs us 12 bytes more each commit. We can > choose not to use strbuf to save memory. I think 12 bytes in the commit struct might be noticeable. But it looks like you've done the sane thing, and replaced the pointer-to-char with a pointer-to-strbuf. And that I don't think should be a big deal. The buffer itself is way bigger than 12 bytes, so we don't care so much about the "we have a buffer" case, but more about the 100,000 other commits that we're not currently printing right now. Of course, some timings on things like "rev-list" and "pack-objects" would be nice to double-check. -Peff ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 00/22] Refactor to accept NUL in commit messages 2011-10-22 19:09 ` [PATCH 00/22] Refactor to accept NUL in commit messages Jeff King @ 2011-10-23 10:44 ` Robin Rosenberg 2011-10-23 16:09 ` Jeff King 0 siblings, 1 reply; 26+ messages in thread From: Robin Rosenberg @ 2011-10-23 10:44 UTC (permalink / raw) To: Jeff King Cc: Nguyễn Thái Ngọc Duy, git, Junio C Hamano, Ævar Arnfjörð Jeff King skrev 2011-10-22 21.09: > On Sat, Oct 22, 2011 at 09:04:19PM +1100, Nguyen Thai Ngoc Duy wrote: > >> This series helps pass commit message size up to output functions, >> though it does not change any output functions to print ^@. > Can we take a step back for a second and discuss what git _should_ do > with commits that contain NUL? Yes please. I don't think allowing NUL makes sense, but it makes sense to state how NUL should be handled when anyone attempt it, so there might be things to fix even if NUL is banned. Are there any such commits in the wild? -- robin ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 00/22] Refactor to accept NUL in commit messages 2011-10-23 10:44 ` Robin Rosenberg @ 2011-10-23 16:09 ` Jeff King 0 siblings, 0 replies; 26+ messages in thread From: Jeff King @ 2011-10-23 16:09 UTC (permalink / raw) To: Robin Rosenberg Cc: Nguyễn Thái Ngọc Duy, git, Junio C Hamano, Ævar Arnfjörð On Sun, Oct 23, 2011 at 12:44:30PM +0200, Robin Rosenberg wrote: > Jeff King skrev 2011-10-22 21.09: > >On Sat, Oct 22, 2011 at 09:04:19PM +1100, Nguyen Thai Ngoc Duy wrote: > > > >>This series helps pass commit message size up to output functions, > >>though it does not change any output functions to print ^@. > >Can we take a step back for a second and discuss what git _should_ do > >with commits that contain NUL? > Yes please. I don't think allowing NUL makes sense, but it makes sense > to state how NUL should be handled when anyone attempt it, so there > might be things to fix even if NUL is banned. > > Are there any such commits in the wild? Adding an arbitrary NUL, no, I don't think I've ever seen it outside of people (myself included) trying to break git in interesting ways. But utf16 may contains NUL bytes, so I expect configuring your editor to output utf16 and running "git commit" would give you the most likely example. -Peff ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 00/22] Refactor to accept NUL in commit messages [not found] <1319277881-4128-1-git-send-email-pclouds@gmail.com> 2011-10-22 19:09 ` [PATCH 00/22] Refactor to accept NUL in commit messages Jeff King @ 2011-10-22 22:47 ` Junio C Hamano 2011-10-23 1:24 ` Nguyen Thai Ngoc Duy 1 sibling, 1 reply; 26+ messages in thread From: Junio C Hamano @ 2011-10-22 22:47 UTC (permalink / raw) To: Nguyễn Thái Ngọc Duy Cc: git, peff, Ævar Arnfjörð I do not think we want to go this route. There are two possible approaches to attack this. - If we want to show everything after a potential and rare NUL in the log message most of the time, then "struct commit" should just store <ptr,len> pair. This grows "struct commit" with one extra ulong. - If we want to give us a way to notice and show these "funnily, this commit log message has a NUL in it" case as an exception in only selected codepaths, then "struct commit" should just gain "flags" 4-byte int field between "indegree" and "date", and parse_commit_buffer() should set one bit in the flags when the log message has NUL in it. And teach only these selected codepaths to find the length from the object name with sha1_object_info() as needed. This grows "struct commit" with one 4-byte int, with runtime overhead only where it matters. The approach taken by the patch wastes two malloc() blocks with their own allocation overhead, and unused "alloc" field in the strbuf that does not have to be there. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 00/22] Refactor to accept NUL in commit messages 2011-10-22 22:47 ` Junio C Hamano @ 2011-10-23 1:24 ` Nguyen Thai Ngoc Duy 2011-10-23 5:51 ` Junio C Hamano 0 siblings, 1 reply; 26+ messages in thread From: Nguyen Thai Ngoc Duy @ 2011-10-23 1:24 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, peff, Ævar Arnfjörð 2011/10/23 Junio C Hamano <gitster@pobox.com>: > I do not think we want to go this route. > > There are two possible approaches to attack this. > > - If we want to show everything after a potential and rare NUL in the log > message most of the time, then "struct commit" should just store > <ptr,len> pair. This grows "struct commit" with one extra ulong. > > - If we want to give us a way to notice and show these "funnily, this > commit log message has a NUL in it" case as an exception in only > selected codepaths, then "struct commit" should just gain "flags" > 4-byte int field between "indegree" and "date", and > parse_commit_buffer() should set one bit in the flags when the log > message has NUL in it. And teach only these selected codepaths to find > the length from the object name with sha1_object_info() as needed. This > grows "struct commit" with one 4-byte int, with runtime overhead only > where it matters. > > The approach taken by the patch wastes two malloc() blocks with their own > allocation overhead, and unused "alloc" field in the strbuf that does not > have to be there. We could allocate just one block with length as the first field: struct commit_buffer { unsigned long len; char buf[FLEX_ARRAY]; }; The downside is commit_buffer field type in struct commit changes, which impacts many codepaths. If we agree to allow NUL in commit objects, then all codepaths should be aware of that fact, otherwise funny things may happen because string processing in this function stops early due to NUL, but others run fine.. Jeff's low-level normalization approach sounds much simpler, but probably trickier because we need to identify where to normalize and denormalize. At least with type change, the compiler spots all the places for me. I would not worry about runtime processing overhead. The string end check is basically converted from "if (*msg)" to "if (msg < msg_end)". There will one more pointer (msg_end) in stack for each call. Unless we do deep recursion, we should be fine. Memory overhead is still something to profile. -- Duy ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 00/22] Refactor to accept NUL in commit messages 2011-10-23 1:24 ` Nguyen Thai Ngoc Duy @ 2011-10-23 5:51 ` Junio C Hamano 2011-10-23 6:37 ` Nguyen Thai Ngoc Duy 0 siblings, 1 reply; 26+ messages in thread From: Junio C Hamano @ 2011-10-23 5:51 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy; +Cc: git, peff, Ævar Arnfjörð Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes: > We could allocate just one block with length as the first field: > > struct commit_buffer { > unsigned long len; > char buf[FLEX_ARRAY]; > }; > > The downside is commit_buffer field type in struct commit changes, > which impacts many codepaths. I think that is a good thing overall to _force_ us to audit all the code, *if* our goal were to avoid losing bytes. And the solution above is better than adding a length field to "struct commit". It certainly is better than quoting NUL byte to ^@, keep using the "char *" field and risking some codepaths forget to convert it back to NUL. For types of payloads for which losing everything after the first NUL matters, converting NUL to ^@ and then forgetting to convert it back to NUL is equally bad breakage to the payload anyway, so such a conversion would not be a particularly good approach to avoid losing bytes. But as Jeff suggested, we should step back a bit and think what our goal is. The low level object format of our commit is textual header fields, each of which is terminated with a LF, followed by a LF to mark the end of header fields, and then opaque payload that can contain any bytes. It does not forbid a non-Git application to reuse the object store infrastructure to store ASN.1 binary goo there, and the low level interface we give such as cat-file is a perfectly valid way to inspect such a "commit" object. But when it comes to "Git" Porcelains (e.g. the log family of commands), we do assume people do not store random binary byte sequences in commits, and we do take advantage of that assumption by splitting each "line" at LF, indenting them with 4 spaces, etc. In other words, a commit log in the Git context _is_ pretty much text and not arbitrary byte sequence. Even the "--pretty=raw" option for "log" family is not about the "raw" body; the "raw"-ness applies only to the header fields. So even if we _were_ to update the codepaths involved to avoid losing bytes, the end result will not be useful for users to whom ability to include NUL matters. So in that sense, I do not think it is unreasonable to chop it off at the first NUL, which is the current behaviour. IOW, it is entirely sane to argue that there is nothing to fix. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 00/22] Refactor to accept NUL in commit messages 2011-10-23 5:51 ` Junio C Hamano @ 2011-10-23 6:37 ` Nguyen Thai Ngoc Duy 2011-10-23 9:46 ` Junio C Hamano 0 siblings, 1 reply; 26+ messages in thread From: Nguyen Thai Ngoc Duy @ 2011-10-23 6:37 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, peff, Ævar Arnfjörð On Sun, Oct 23, 2011 at 4:51 PM, Junio C Hamano <gitster@pobox.com> wrote: > So in that sense, I do not think it is unreasonable to chop it off at the > first NUL, which is the current behaviour. IOW, it is entirely sane to > argue that there is nothing to fix. Good for me too if we go this way. I was just curious what if we changed commit_buffer field and got hooked in. > But as Jeff suggested, we should step back a bit and think what our goal > is. > > The low level object format of our commit is textual header fields, each > of which is terminated with a LF, followed by a LF to mark the end of > header fields, and then opaque payload that can contain any bytes. It does > not forbid a non-Git application to reuse the object store infrastructure > to store ASN.1 binary goo there, and the low level interface we give such > as cat-file is a perfectly valid way to inspect such a "commit" object. cat-file is fine, commit-tree (or any commands that call commit_tree()) cuts at NUL though. I wonder how git processes commit messages in utf-16. Did a quick test, did not look good. But that's because git-commit cuts at NUL. But even if git-commit makes a good object, I doubt if git-log shows it right. -- Duy ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 00/22] Refactor to accept NUL in commit messages 2011-10-23 6:37 ` Nguyen Thai Ngoc Duy @ 2011-10-23 9:46 ` Junio C Hamano 2011-10-23 10:17 ` Nguyen Thai Ngoc Duy 2011-10-23 16:07 ` Jeff King 0 siblings, 2 replies; 26+ messages in thread From: Junio C Hamano @ 2011-10-23 9:46 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy; +Cc: git, peff, Ævar Arnfjörð Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes: > On Sun, Oct 23, 2011 at 4:51 PM, Junio C Hamano <gitster@pobox.com> wrote: > ... >> The low level object format of our commit is textual header fields, each >> of which is terminated with a LF, followed by a LF to mark the end of >> header fields, and then opaque payload that can contain any bytes. It does >> not forbid a non-Git application to reuse the object store infrastructure >> to store ASN.1 binary goo there, and the low level interface we give such >> as cat-file is a perfectly valid way to inspect such a "commit" object. > > cat-file is fine, commit-tree (or any commands that call > commit_tree()) cuts at NUL though. > I wonder how git processes commit messages in utf-16. That is exactly what I am saying. Perhaps you didn't either read or understand what you omitted from your quoting; otherwise you even wouldn't have brought up utf-16. Let me requote that part for you. > But when it comes to "Git" Porcelains (e.g. the log family of commands), > we do assume people do not store random binary byte sequences in commits, > and we do take advantage of that assumption by splitting each "line" at > LF, indenting them with 4 spaces, etc. In other words, a commit log in the > Git context _is_ pretty much text and not arbitrary byte sequence. Think what would cutting at a byte whose value is 012 and adding four bytes whose values are 040 to each of "lines" that formed with such cutting do to UTF-16 goo, even if it does not contain any NUL byte. As far as Git Porcelains are concerned, it is no different from random binary byte sequences. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 00/22] Refactor to accept NUL in commit messages 2011-10-23 9:46 ` Junio C Hamano @ 2011-10-23 10:17 ` Nguyen Thai Ngoc Duy 2011-10-23 16:07 ` Jeff King 1 sibling, 0 replies; 26+ messages in thread From: Nguyen Thai Ngoc Duy @ 2011-10-23 10:17 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, peff, Ævar Arnfjörð On Sun, Oct 23, 2011 at 8:46 PM, Junio C Hamano <gitster@pobox.com> wrote: > Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes: > >> On Sun, Oct 23, 2011 at 4:51 PM, Junio C Hamano <gitster@pobox.com> wrote: >> ... >>> The low level object format of our commit is textual header fields, each >>> of which is terminated with a LF, followed by a LF to mark the end of >>> header fields, and then opaque payload that can contain any bytes. It does >>> not forbid a non-Git application to reuse the object store infrastructure >>> to store ASN.1 binary goo there, and the low level interface we give such >>> as cat-file is a perfectly valid way to inspect such a "commit" object. >> >> cat-file is fine, commit-tree (or any commands that call >> commit_tree()) cuts at NUL though. >> I wonder how git processes commit messages in utf-16. > > That is exactly what I am saying. > > Perhaps you didn't either read or understand what you omitted from your > quoting; otherwise you even wouldn't have brought up utf-16. > > Let me requote that part for you. > >> But when it comes to "Git" Porcelains (e.g. the log family of commands), >> we do assume people do not store random binary byte sequences in commits, >> and we do take advantage of that assumption by splitting each "line" at >> LF, indenting them with 4 spaces, etc. In other words, a commit log in the >> Git context _is_ pretty much text and not arbitrary byte sequence. > > Think what would cutting at a byte whose value is 012 and adding four > bytes whose values are 040 to each of "lines" that formed with such > cutting do to UTF-16 goo, even if it does not contain any NUL byte. As far > as Git Porcelains are concerned, it is no different from random binary > byte sequences. > I'm sorry. The utf-16 was an afterthought when I was nearly finished with the reply and already cut that quote. The assumption that people do not store random binary byte sequences in commits sort of conflicts with "encoding" field in the commit header though. The assumption is documented in i18n.txt. I guess it's just me who did not read document carefully. But maybe it's good to stop people from shooting themselves in this case (i.e. setting encoding to utf-16 or similar). -- Duy ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 00/22] Refactor to accept NUL in commit messages 2011-10-23 9:46 ` Junio C Hamano 2011-10-23 10:17 ` Nguyen Thai Ngoc Duy @ 2011-10-23 16:07 ` Jeff King 2011-10-23 20:16 ` Junio C Hamano 1 sibling, 1 reply; 26+ messages in thread From: Jeff King @ 2011-10-23 16:07 UTC (permalink / raw) To: Junio C Hamano; +Cc: Nguyen Thai Ngoc Duy, git, Ævar Arnfjörð On Sun, Oct 23, 2011 at 02:46:59AM -0700, Junio C Hamano wrote: > > But when it comes to "Git" Porcelains (e.g. the log family of commands), > > we do assume people do not store random binary byte sequences in commits, > > and we do take advantage of that assumption by splitting each "line" at > > LF, indenting them with 4 spaces, etc. In other words, a commit log in the > > Git context _is_ pretty much text and not arbitrary byte sequence. > > Think what would cutting at a byte whose value is 012 and adding four > bytes whose values are 040 to each of "lines" that formed with such > cutting do to UTF-16 goo, even if it does not contain any NUL byte. As far > as Git Porcelains are concerned, it is no different from random binary > byte sequences. But as Duy mentions, we have an encoding header. Shouldn't we treat it like binary goo until we do reencode_log_message, and _then_ we can break it into lines? -Peff ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 00/22] Refactor to accept NUL in commit messages 2011-10-23 16:07 ` Jeff King @ 2011-10-23 20:16 ` Junio C Hamano 2011-10-24 4:40 ` Junio C Hamano 0 siblings, 1 reply; 26+ messages in thread From: Junio C Hamano @ 2011-10-23 20:16 UTC (permalink / raw) To: Jeff King; +Cc: Nguyen Thai Ngoc Duy, git, Ævar Arnfjörð Jeff King <peff@peff.net> writes: > But as Duy mentions, we have an encoding header. Shouldn't we treat it > like binary goo until we do reencode_log_message, and _then_ we can > break it into lines? That's sensible. If we go that route, I think the "one allocation of separate struct commit_buffer pointed from a pointer field in struct commit to replace the current member 'buffer'" is a reasonable thing to do. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 00/22] Refactor to accept NUL in commit messages 2011-10-23 20:16 ` Junio C Hamano @ 2011-10-24 4:40 ` Junio C Hamano 2011-10-24 5:10 ` Nguyen Thai Ngoc Duy 2011-10-24 22:45 ` Jeff King 0 siblings, 2 replies; 26+ messages in thread From: Junio C Hamano @ 2011-10-24 4:40 UTC (permalink / raw) To: git; +Cc: Jeff King, Nguyen Thai Ngoc Duy, Ævar Arnfjörð Junio C Hamano <gitster@pobox.com> writes: > Jeff King <peff@peff.net> writes: > >> But as Duy mentions, we have an encoding header. Shouldn't we treat it >> like binary goo until we do reencode_log_message, and _then_ we can >> break it into lines? > > That's sensible. If we go that route, I think the "one allocation of > separate struct commit_buffer pointed from a pointer field in struct > commit to replace the current member 'buffer'" is a reasonable thing > to do. Having given that "sensible" comment, I am not convinced if this is worth it. We are talking about what is left in the ephemeral COMMIT_EDITMSG by the chosen editor, but are there really editors that can _only_ write in UTF-16 and not in UTF-8, and is it worth bending backwards to add support such an editor? ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 00/22] Refactor to accept NUL in commit messages 2011-10-24 4:40 ` Junio C Hamano @ 2011-10-24 5:10 ` Nguyen Thai Ngoc Duy 2011-10-24 11:09 ` Štěpán Němec 2011-10-24 22:45 ` Jeff King 1 sibling, 1 reply; 26+ messages in thread From: Nguyen Thai Ngoc Duy @ 2011-10-24 5:10 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Jeff King, Ævar Arnfjörð On Mon, Oct 24, 2011 at 3:40 PM, Junio C Hamano <gitster@pobox.com> wrote: > Junio C Hamano <gitster@pobox.com> writes: > >> Jeff King <peff@peff.net> writes: >> >>> But as Duy mentions, we have an encoding header. Shouldn't we treat it >>> like binary goo until we do reencode_log_message, and _then_ we can >>> break it into lines? >> >> That's sensible. If we go that route, I think the "one allocation of >> separate struct commit_buffer pointed from a pointer field in struct >> commit to replace the current member 'buffer'" is a reasonable thing >> to do. > > Having given that "sensible" comment, I am not convinced if this is worth > it. We are talking about what is left in the ephemeral COMMIT_EDITMSG by > the chosen editor, but are there really editors that can _only_ write in > UTF-16 and not in UTF-8, and is it worth bending backwards to add support > such an editor? This is argument for the sake of argument because I don't use utf-16 and do not care much. UTF-16 can have more code points and some may prefer utf-16 to utf-8. I'd be happy with git's refusing to create broken commits because people accidentally set editor encoding to utf-16. If people do use utf-16, they'll get caught and should yell up if they want utf-16 supported. -- Duy ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 00/22] Refactor to accept NUL in commit messages 2011-10-24 5:10 ` Nguyen Thai Ngoc Duy @ 2011-10-24 11:09 ` Štěpán Němec 0 siblings, 0 replies; 26+ messages in thread From: Štěpán Němec @ 2011-10-24 11:09 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy Cc: Junio C Hamano, git, Jeff King, Ævar Arnfjörð On Mon, 24 Oct 2011 07:10:08 +0200 Nguyen Thai Ngoc Duy wrote: > This is argument for the sake of argument because I don't use utf-16 > and do not care much. UTF-16 can have more code points and some may > prefer utf-16 to utf-8. I suspect this is really tangential to this thread, but I can't make much sense of that last sentence -- if you meant that UTF-16 is somehow more apt at encoding Unicode code points than UTF-8, then that's not the case. Both can represent all Unicode characters. If anything, things are _more_, not less complicated in UTF-16, which apart from the NUL and endianness complications has to jump through the "surrogate pairs" hoop for code points bigger than U+FFFF (so you'll actually find many apps with buggy UTF-16 implementation which break for those code points, unlike when using UTF-8). -- Štěpán ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 00/22] Refactor to accept NUL in commit messages 2011-10-24 4:40 ` Junio C Hamano 2011-10-24 5:10 ` Nguyen Thai Ngoc Duy @ 2011-10-24 22:45 ` Jeff King 2011-10-25 10:16 ` Štěpán Němec 2011-10-25 14:07 ` Junio C Hamano 1 sibling, 2 replies; 26+ messages in thread From: Jeff King @ 2011-10-24 22:45 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Nguyen Thai Ngoc Duy, Ævar Arnfjörð On Sun, Oct 23, 2011 at 09:40:51PM -0700, Junio C Hamano wrote: > >> But as Duy mentions, we have an encoding header. Shouldn't we treat it > >> like binary goo until we do reencode_log_message, and _then_ we can > >> break it into lines? > > > > That's sensible. If we go that route, I think the "one allocation of > > separate struct commit_buffer pointed from a pointer field in struct > > commit to replace the current member 'buffer'" is a reasonable thing > > to do. > > Having given that "sensible" comment, I am not convinced if this is worth > it. We are talking about what is left in the ephemeral COMMIT_EDITMSG by > the chosen editor, but are there really editors that can _only_ write in > UTF-16 and not in UTF-8, and is it worth bending backwards to add support > such an editor? Couldn't you make the same argument about iso8859-1, or any other encoding? The user has some encoding that they want to use, for whatever reason[1]. We have a slot for an encoding header; is there a reason that git would allow some encodings and not others? I mean, besides the obvious that UTF-16 is annoying and contains embedded NULs and newlines. -Peff [1] English is my first language, so it's rare for me to even step outside of ASCII, let alone latin1. But aren't there some languages in which utf-16 is more efficient than utf-8? ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 00/22] Refactor to accept NUL in commit messages 2011-10-24 22:45 ` Jeff King @ 2011-10-25 10:16 ` Štěpán Němec 2011-10-25 14:07 ` Junio C Hamano 1 sibling, 0 replies; 26+ messages in thread From: Štěpán Němec @ 2011-10-25 10:16 UTC (permalink / raw) To: Jeff King Cc: Junio C Hamano, git, Nguyen Thai Ngoc Duy, Ævar Arnfjörð On Tue, 25 Oct 2011 00:45:58 +0200 Jeff King wrote: > [1] English is my first language, so it's rare for me to even step > outside of ASCII, let alone latin1. But aren't there some languages in > which utf-16 is more efficient than utf-8? You sometimes hear something along the lines of the second "disadvantage" listed in the article below, i.e. "Characters U+0800 through U+FFFF use three bytes in UTF-8, but only two in UTF-16.": https://en.wikipedia.org/wiki/UTF-8#Disadvantages_4 -- Štěpán ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 00/22] Refactor to accept NUL in commit messages 2011-10-24 22:45 ` Jeff King 2011-10-25 10:16 ` Štěpán Němec @ 2011-10-25 14:07 ` Junio C Hamano 2011-10-27 18:13 ` Jeff King 1 sibling, 1 reply; 26+ messages in thread From: Junio C Hamano @ 2011-10-25 14:07 UTC (permalink / raw) To: Jeff King; +Cc: git, Nguyen Thai Ngoc Duy, Ævar Arnfjörð Jeff King <peff@peff.net> writes: > I mean, besides the obvious that UTF-16 is ... Yes, you could, besides the obvious. But that obvious reason makes it sufficiently different that it may not be so outrageous to draw the line between it and all the others. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 00/22] Refactor to accept NUL in commit messages 2011-10-25 14:07 ` Junio C Hamano @ 2011-10-27 18:13 ` Jeff King 2011-10-27 18:47 ` Junio C Hamano 0 siblings, 1 reply; 26+ messages in thread From: Jeff King @ 2011-10-27 18:13 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Nguyen Thai Ngoc Duy, Ævar Arnfjörð On Tue, Oct 25, 2011 at 07:07:38AM -0700, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > I mean, besides the obvious that UTF-16 is ... > > Yes, you could, besides the obvious. But that obvious reason makes it > sufficiently different that it may not be so outrageous to draw the line > between it and all the others. Yeah, and I'm OK with that. It's just not a satisfying answer to give Windows people who think UTF-16 is a good idea. But at the very least, it's still unicode. It should be lossless for them to convert to utf8 and back if they want. Speaking of which, I've been looking at handling diffing of utf-16 files. Right now we generally just consider them binary, which sucks. It's easy to identify them by BOM in the is_buffer_binary() code, but that's only part of it. We do an OK job of diffing them, except that: 1. The BOM makes some diffs a little noisier. 2. We split lines on 0x0a. But this byte can appear in other code points, like 0x010a (Ċ), or the entire entire 0x0a* code point (the entire Gurmukhi charset). I'm tempted to detect the UTF-{16,32}{LE,BE} by their BOM, reencode them to utf8, and then display them in utf8. Is that too gross for us to consider? You can kind-of implement this outside of git using textconv. But you have to manually mark each file as utf-16, as there's no way to trigger an alternative diff driver on something like a BOM. I'm really not clear on how people with utf-16 files work. Even if we did treat utf-16 like text, the _rest_ of git is outputting ascii, so it's not like their terminals are utf-16. But we do have projects on github with utf-16 and utf-32 encodings. -Peff ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 00/22] Refactor to accept NUL in commit messages 2011-10-27 18:13 ` Jeff King @ 2011-10-27 18:47 ` Junio C Hamano 2011-10-27 18:52 ` Jeff King 0 siblings, 1 reply; 26+ messages in thread From: Junio C Hamano @ 2011-10-27 18:47 UTC (permalink / raw) To: Jeff King; +Cc: git, Nguyen Thai Ngoc Duy, Ævar Arnfjörð Jeff King <peff@peff.net> writes: > I'm tempted to detect the UTF-{16,32}{LE,BE} by their BOM, reencode them > to utf8, and then display them in utf8. Is that too gross for us to > consider? I tend to think so; it is entirely a different matter if the user instructed us to clean/smudge UTF-16 payload into/outof UTF-8. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 00/22] Refactor to accept NUL in commit messages 2011-10-27 18:47 ` Junio C Hamano @ 2011-10-27 18:52 ` Jeff King 2011-10-27 19:14 ` Junio C Hamano 0 siblings, 1 reply; 26+ messages in thread From: Jeff King @ 2011-10-27 18:52 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Nguyen Thai Ngoc Duy, Ævar Arnfjörð On Thu, Oct 27, 2011 at 11:47:27AM -0700, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > I'm tempted to detect the UTF-{16,32}{LE,BE} by their BOM, reencode them > > to utf8, and then display them in utf8. Is that too gross for us to > > consider? > > I tend to think so; it is entirely a different matter if the user > instructed us to clean/smudge UTF-16 payload into/outof UTF-8. Minor nit, but this is just for diff, so it is not about clean/smudge but rather about doing something like textconv. The other option I mentioned would be something like detecting the BOM and pretending as if the attribute "diff=utf-16" was set (which would do nothing by default). Then people could set themselves up to handle utf-16 if they wanted, but wouldn't have to go around marking each file with .gitattributes. But maybe that is too gross, too, and they should just use .gitattributes. -Peff ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 00/22] Refactor to accept NUL in commit messages 2011-10-27 18:52 ` Jeff King @ 2011-10-27 19:14 ` Junio C Hamano 2011-10-27 23:44 ` Jeff King 0 siblings, 1 reply; 26+ messages in thread From: Junio C Hamano @ 2011-10-27 19:14 UTC (permalink / raw) To: Jeff King; +Cc: git, Nguyen Thai Ngoc Duy, Ævar Arnfjörð Jeff King <peff@peff.net> writes: > On Thu, Oct 27, 2011 at 11:47:27AM -0700, Junio C Hamano wrote: > >> Jeff King <peff@peff.net> writes: >> >> > I'm tempted to detect the UTF-{16,32}{LE,BE} by their BOM, reencode them >> > to utf8, and then display them in utf8. Is that too gross for us to >> > consider? >> >> I tend to think so; it is entirely a different matter if the user >> instructed us to clean/smudge UTF-16 payload into/outof UTF-8. > > Minor nit, but this is just for diff, so it is not about clean/smudge > but rather about doing something like textconv. I can understand if some tools in the Windows land prefer to work with these encodings, so clean/smudge to have the checkout in these encodings would be a reasonable thing not just diff but things like grep. On the other hand, I do doubt the sanity of these people if they want to have in-repository representation also in these encodings. So I do not think "it is just for diff" is any improvement. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 00/22] Refactor to accept NUL in commit messages 2011-10-27 19:14 ` Junio C Hamano @ 2011-10-27 23:44 ` Jeff King 2011-10-28 0:03 ` Junio C Hamano 2011-10-28 1:40 ` Miles Bader 0 siblings, 2 replies; 26+ messages in thread From: Jeff King @ 2011-10-27 23:44 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Nguyen Thai Ngoc Duy, Ævar Arnfjörð On Thu, Oct 27, 2011 at 12:14:34PM -0700, Junio C Hamano wrote: > > Minor nit, but this is just for diff, so it is not about clean/smudge > > but rather about doing something like textconv. > > I can understand if some tools in the Windows land prefer to work with > these encodings, so clean/smudge to have the checkout in these encodings > would be a reasonable thing not just diff but things like grep. On the > other hand, I do doubt the sanity of these people if they want to have > in-repository representation also in these encodings. I'm pretty much of the same mind. We do have people with utf-16 in their repositories on github. I have no idea why they do such a thing, or what kinds of tricks they do to make it usable (because without it, they just get "binary files differ"). My interest is to make things like bare-repository diff (and everything built on it; i.e., things like github, gitweb, or whatever) do the sane thing for these people, even if I think what they're doing is wrong. And as always, I try to structure the git portions of that as much as possible to be general and help everybody, so they can be pushed upstream (also, then I don't have to worry about managing local changes :) ). But it sounds like this is probably just too ugly and should end up as a github-specific thing. -Peff ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 00/22] Refactor to accept NUL in commit messages 2011-10-27 23:44 ` Jeff King @ 2011-10-28 0:03 ` Junio C Hamano 2011-10-28 0:19 ` Jeff King 2011-10-28 1:40 ` Miles Bader 1 sibling, 1 reply; 26+ messages in thread From: Junio C Hamano @ 2011-10-28 0:03 UTC (permalink / raw) To: Jeff King; +Cc: git, Nguyen Thai Ngoc Duy, Ævar Arnfjörð Jeff King <peff@peff.net> writes: > My interest is to make things like bare-repository diff (and everything > built on it; i.e., things like github, gitweb, or whatever) do the sane > thing for these people, even if I think what they're doing is wrong. I do not think we are talking about right or wrong. I was primarily saying that textconv may not be the right thing (think github/gitweb showing blob contents, nicely formatted inside the chrome the site provides). The solution you suggested feels like a gross layering violation, unless we do it everywhere, in which case I wouldn't mind too much. We have in-repository representation that diff and grep and friends work on, and output conversion layer that externalizes the result of them in the form of "smudge". Another layer above the in-repository representation and below operations could convert UTF-16 to UTF-8 when going outward and in the opposite when going inward. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 00/22] Refactor to accept NUL in commit messages 2011-10-28 0:03 ` Junio C Hamano @ 2011-10-28 0:19 ` Jeff King 0 siblings, 0 replies; 26+ messages in thread From: Jeff King @ 2011-10-28 0:19 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Nguyen Thai Ngoc Duy, Ævar Arnfjörð On Thu, Oct 27, 2011 at 05:03:29PM -0700, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > My interest is to make things like bare-repository diff (and everything > > built on it; i.e., things like github, gitweb, or whatever) do the sane > > thing for these people, even if I think what they're doing is wrong. > > I do not think we are talking about right or wrong. I was primarily saying > that textconv may not be the right thing (think github/gitweb showing blob > contents, nicely formatted inside the chrome the site provides). But I think it is probably a wrong thing to store utf-16 as the canonical format inside the git repository. Git simply can't handle it for diffing. And the right thing, as you suggested, is clean/smudge. But I'm dealing with repositories on the server side, where it is too late to do clean/smudge; I just get whatever junk people commited. > We have in-repository representation that diff and grep and friends work > on, and output conversion layer that externalizes the result of them in > the form of "smudge". Another layer above the in-repository representation > and below operations could convert UTF-16 to UTF-8 when going outward and > in the opposite when going inward. I'm not sure that could sanely be done in a backwards compatible way. Doing it with just textual diffs is a hack, of course, but at least we know that the damage is limited, and the diff we generate on top doesn't care that much about the original sha1s[1]. But should read_object_sha1 learn to convert utf-16 into utf-8? I think madness lies that way, as we are breaking assumptions about sha1 validity. -Peff [1] Actually, the text diff does mention the original and resulting sha1s, which would now either bear no relation to the diff text, or bear no relation to what's in the repo. Either way, I think we are creating something that can't necessarily be applied, which is bad. And is why I thought of textconv, which is basically the same concept (and has the same problems). ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 00/22] Refactor to accept NUL in commit messages 2011-10-27 23:44 ` Jeff King 2011-10-28 0:03 ` Junio C Hamano @ 2011-10-28 1:40 ` Miles Bader 2011-10-28 4:07 ` Junio C Hamano 1 sibling, 1 reply; 26+ messages in thread From: Miles Bader @ 2011-10-28 1:40 UTC (permalink / raw) To: Jeff King Cc: Junio C Hamano, git, Nguyen Thai Ngoc Duy, Ævar Arnfjörð Jeff King <peff@peff.net> writes: > We do have people with utf-16 in their repositories on github. I > have no idea why they do such a thing, or what kinds of tricks they > do to make it usable (because without it, they just get "binary > files differ"). Hmm, you could ask them ... [or, I suppose more diplomatically, post a blog entry asking "Hey all you people who use github for utf-16 encoded files, ..."] -Miles -- Year, n. A period of three hundred and sixty-five disappointments. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 00/22] Refactor to accept NUL in commit messages 2011-10-28 1:40 ` Miles Bader @ 2011-10-28 4:07 ` Junio C Hamano 0 siblings, 0 replies; 26+ messages in thread From: Junio C Hamano @ 2011-10-28 4:07 UTC (permalink / raw) To: Miles Bader Cc: Jeff King, git, Nguyen Thai Ngoc Duy, Ævar Arnfjörð Miles Bader <miles@gnu.org> writes: > Jeff King <peff@peff.net> writes: >> We do have people with utf-16 in their repositories on github. I >> have no idea why they do such a thing, or what kinds of tricks they >> do to make it usable (because without it, they just get "binary >> files differ"). > > Hmm, you could ask them ... [or, I suppose more diplomatically, post > a blog entry asking "Hey all you people who use github for utf-16 > encoded files, ..."] People will hate such a flag day event initially and later thank you for it ;-) I am afraid that that is a wishful thinking, though. ^ permalink raw reply [flat|nested] 26+ messages in thread
end of thread, other threads:[~2011-10-28 4:07 UTC | newest] Thread overview: 26+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- [not found] <1319277881-4128-1-git-send-email-pclouds@gmail.com> 2011-10-22 19:09 ` [PATCH 00/22] Refactor to accept NUL in commit messages Jeff King 2011-10-23 10:44 ` Robin Rosenberg 2011-10-23 16:09 ` Jeff King 2011-10-22 22:47 ` Junio C Hamano 2011-10-23 1:24 ` Nguyen Thai Ngoc Duy 2011-10-23 5:51 ` Junio C Hamano 2011-10-23 6:37 ` Nguyen Thai Ngoc Duy 2011-10-23 9:46 ` Junio C Hamano 2011-10-23 10:17 ` Nguyen Thai Ngoc Duy 2011-10-23 16:07 ` Jeff King 2011-10-23 20:16 ` Junio C Hamano 2011-10-24 4:40 ` Junio C Hamano 2011-10-24 5:10 ` Nguyen Thai Ngoc Duy 2011-10-24 11:09 ` Štěpán Němec 2011-10-24 22:45 ` Jeff King 2011-10-25 10:16 ` Štěpán Němec 2011-10-25 14:07 ` Junio C Hamano 2011-10-27 18:13 ` Jeff King 2011-10-27 18:47 ` Junio C Hamano 2011-10-27 18:52 ` Jeff King 2011-10-27 19:14 ` Junio C Hamano 2011-10-27 23:44 ` Jeff King 2011-10-28 0:03 ` Junio C Hamano 2011-10-28 0:19 ` Jeff King 2011-10-28 1:40 ` Miles Bader 2011-10-28 4:07 ` Junio C Hamano
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).