Re: [PATCH 00/22] Refactor to accept NUL in commit messages

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [PATCH 00/22] Refactor to accept NUL in commit messages
       [not found] <1319277881-4128-1-git-send-email-pclouds@gmail.com>
@ 2011-10-22 19:09 ` Jeff King
  2011-10-23 10:44   ` Robin Rosenberg
  2011-10-22 22:47 ` Junio C Hamano
  1 sibling, 1 reply; 26+ messages in thread
From: Jeff King @ 2011-10-22 19:09 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy
  Cc: git, Junio C Hamano, Ævar Arnfjörð

On Sat, Oct 22, 2011 at 09:04:19PM +1100, Nguyen Thai Ngoc Duy wrote:

> This series helps pass commit message size up to output functions,
> though it does not change any output functions to print ^@.

Can we take a step back for a second and discuss what git _should_ do
with commits that contain NUL?

If all of the pretty-print functions are just going consider "foo\0bar"
to be "foo^@bar", then maybe it would be much simpler to just
"normalize" the commit message into a C string at a lower level, and
pass it around as a string as we currently do.

On the other hand, if we are eventually looking to add an option like
"--include-NUL-in-commit-message", then it would make sense for the
real contents and size to get passed around.

> All functions up to the last patch learn to accept a string as a pair
> <const char *start, const char *end> as a preparation step. These
> changes are relatively simple. Or it could have been so if I did not
> attempt to reduce some code duplication found while working on this
> series.

Great. Reducing code duplication is always a plus.

> The last patch turns commit_buffer field in struct commit to "struct
> strbuf *". This approach costs us 12 bytes more each commit. We can
> choose not to use strbuf to save memory.

I think 12 bytes in the commit struct might be noticeable. But it looks
like you've done the sane thing, and replaced the pointer-to-char with a
pointer-to-strbuf. And that I don't think should be a big deal. The
buffer itself is way bigger than 12 bytes, so we don't care so much
about the "we have a buffer" case, but more about the 100,000 other
commits that we're not currently printing right now.

Of course, some timings on things like "rev-list" and "pack-objects"
would be nice to double-check.

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/22] Refactor to accept NUL in commit messages
  2011-10-22 19:09 ` [PATCH 00/22] Refactor to accept NUL in commit messages Jeff King
@ 2011-10-23 10:44   ` Robin Rosenberg
  2011-10-23 16:09     ` Jeff King
  0 siblings, 1 reply; 26+ messages in thread
From: Robin Rosenberg @ 2011-10-23 10:44 UTC (permalink / raw)
  To: Jeff King
  Cc: Nguyễn Thái Ngọc Duy, git, Junio C Hamano,
	Ævar Arnfjörð

Jeff King skrev 2011-10-22 21.09:
> On Sat, Oct 22, 2011 at 09:04:19PM +1100, Nguyen Thai Ngoc Duy wrote:
>
>> This series helps pass commit message size up to output functions,
>> though it does not change any output functions to print ^@.
> Can we take a step back for a second and discuss what git _should_ do
> with commits that contain NUL?
Yes please. I don't think allowing NUL makes sense, but it makes sense
to state how NUL should be handled when anyone attempt it, so there
might be things to fix even if NUL is banned.

Are there any such commits in the wild?

-- robin

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/22] Refactor to accept NUL in commit messages
  2011-10-23 10:44   ` Robin Rosenberg
@ 2011-10-23 16:09     ` Jeff King
  0 siblings, 0 replies; 26+ messages in thread
From: Jeff King @ 2011-10-23 16:09 UTC (permalink / raw)
  To: Robin Rosenberg
  Cc: Nguyễn Thái Ngọc Duy, git, Junio C Hamano,
	Ævar Arnfjörð

On Sun, Oct 23, 2011 at 12:44:30PM +0200, Robin Rosenberg wrote:

> Jeff King skrev 2011-10-22 21.09:
> >On Sat, Oct 22, 2011 at 09:04:19PM +1100, Nguyen Thai Ngoc Duy wrote:
> >
> >>This series helps pass commit message size up to output functions,
> >>though it does not change any output functions to print ^@.
> >Can we take a step back for a second and discuss what git _should_ do
> >with commits that contain NUL?
> Yes please. I don't think allowing NUL makes sense, but it makes sense
> to state how NUL should be handled when anyone attempt it, so there
> might be things to fix even if NUL is banned.
> 
> Are there any such commits in the wild?

Adding an arbitrary NUL, no, I don't think I've ever seen it outside of
people (myself included) trying to break git in interesting ways.

But utf16 may contains NUL bytes, so I expect configuring your editor to
output utf16 and running "git commit" would give you the most likely
example.

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/22] Refactor to accept NUL in commit messages
       [not found] <1319277881-4128-1-git-send-email-pclouds@gmail.com>
  2011-10-22 19:09 ` [PATCH 00/22] Refactor to accept NUL in commit messages Jeff King
@ 2011-10-22 22:47 ` Junio C Hamano
  2011-10-23  1:24   ` Nguyen Thai Ngoc Duy
  1 sibling, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2011-10-22 22:47 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy
  Cc: git, peff, Ævar Arnfjörð

I do not think we want to go this route.

There are two possible approaches to attack this.

 - If we want to show everything after a potential and rare NUL in the log
   message most of the time, then "struct commit" should just store
   <ptr,len> pair. This grows "struct commit" with one extra ulong.

 - If we want to give us a way to notice and show these "funnily, this
   commit log message has a NUL in it" case as an exception in only
   selected codepaths, then "struct commit" should just gain "flags"
   4-byte int field between "indegree" and "date", and
   parse_commit_buffer() should set one bit in the flags when the log
   message has NUL in it. And teach only these selected codepaths to find
   the length from the object name with sha1_object_info() as needed. This
   grows "struct commit" with one 4-byte int, with runtime overhead only
   where it matters.

The approach taken by the patch wastes two malloc() blocks with their own
allocation overhead, and unused "alloc" field in the strbuf that does not
have to be there.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/22] Refactor to accept NUL in commit messages
  2011-10-22 22:47 ` Junio C Hamano
@ 2011-10-23  1:24   ` Nguyen Thai Ngoc Duy
  2011-10-23  5:51     ` Junio C Hamano
  0 siblings, 1 reply; 26+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-10-23  1:24 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff, Ævar Arnfjörð

2011/10/23 Junio C Hamano <gitster@pobox.com>:
> I do not think we want to go this route.
>
> There are two possible approaches to attack this.
>
>  - If we want to show everything after a potential and rare NUL in the log
>   message most of the time, then "struct commit" should just store
>   <ptr,len> pair. This grows "struct commit" with one extra ulong.
>
>  - If we want to give us a way to notice and show these "funnily, this
>   commit log message has a NUL in it" case as an exception in only
>   selected codepaths, then "struct commit" should just gain "flags"
>   4-byte int field between "indegree" and "date", and
>   parse_commit_buffer() should set one bit in the flags when the log
>   message has NUL in it. And teach only these selected codepaths to find
>   the length from the object name with sha1_object_info() as needed. This
>   grows "struct commit" with one 4-byte int, with runtime overhead only
>   where it matters.
>
> The approach taken by the patch wastes two malloc() blocks with their own
> allocation overhead, and unused "alloc" field in the strbuf that does not
> have to be there.

We could allocate just one block with length as the first field:

struct commit_buffer {
        unsigned long len;
        char buf[FLEX_ARRAY];
};

The downside is commit_buffer field type in struct commit changes,
which impacts many codepaths. If we agree to allow NUL in commit
objects, then all codepaths should be aware of that fact, otherwise
funny things may happen because string processing in this function
stops early due to NUL, but others run fine..

Jeff's low-level normalization approach sounds much simpler, but
probably trickier because we need to identify where to normalize and
denormalize. At least with type change, the compiler spots all the
places for me.

I would not worry about runtime processing overhead. The string end
check is basically converted from "if (*msg)" to "if (msg < msg_end)".
There will one more pointer (msg_end) in stack for each call. Unless
we do deep recursion, we should be fine. Memory overhead is still
something to profile.
-- 
Duy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/22] Refactor to accept NUL in commit messages
  2011-10-23  1:24   ` Nguyen Thai Ngoc Duy
@ 2011-10-23  5:51     ` Junio C Hamano
  2011-10-23  6:37       ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2011-10-23  5:51 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: git, peff, Ævar Arnfjörð

Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:

> We could allocate just one block with length as the first field:
>
> struct commit_buffer {
>         unsigned long len;
>         char buf[FLEX_ARRAY];
> };
>
> The downside is commit_buffer field type in struct commit changes,
> which impacts many codepaths.

I think that is a good thing overall to _force_ us to audit all the code,
*if* our goal were to avoid losing bytes. And the solution above is better
than adding a length field to "struct commit". It certainly is better than
quoting NUL byte to ^@, keep using the "char *" field and risking some
codepaths forget to convert it back to NUL. For types of payloads for
which losing everything after the first NUL matters, converting NUL to ^@
and then forgetting to convert it back to NUL is equally bad breakage to
the payload anyway, so such a conversion would not be a particularly good
approach to avoid losing bytes.

But as Jeff suggested, we should step back a bit and think what our goal
is.

The low level object format of our commit is textual header fields, each
of which is terminated with a LF, followed by a LF to mark the end of
header fields, and then opaque payload that can contain any bytes. It does
not forbid a non-Git application to reuse the object store infrastructure
to store ASN.1 binary goo there, and the low level interface we give such
as cat-file is a perfectly valid way to inspect such a "commit" object.

But when it comes to "Git" Porcelains (e.g. the log family of commands),
we do assume people do not store random binary byte sequences in commits,
and we do take advantage of that assumption by splitting each "line" at
LF, indenting them with 4 spaces, etc. In other words, a commit log in the
Git context _is_ pretty much text and not arbitrary byte sequence. Even
the "--pretty=raw" option for "log" family is not about the "raw" body;
the "raw"-ness applies only to the header fields. So even if we _were_ to
update the codepaths involved to avoid losing bytes, the end result will
not be useful for users to whom ability to include NUL matters.

So in that sense, I do not think it is unreasonable to chop it off at the
first NUL, which is the current behaviour. IOW, it is entirely sane to
argue that there is nothing to fix.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/22] Refactor to accept NUL in commit messages
  2011-10-23  5:51     ` Junio C Hamano
@ 2011-10-23  6:37       ` Nguyen Thai Ngoc Duy
  2011-10-23  9:46         ` Junio C Hamano
  0 siblings, 1 reply; 26+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-10-23  6:37 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff, Ævar Arnfjörð

On Sun, Oct 23, 2011 at 4:51 PM, Junio C Hamano <gitster@pobox.com> wrote:
> So in that sense, I do not think it is unreasonable to chop it off at the
> first NUL, which is the current behaviour. IOW, it is entirely sane to
> argue that there is nothing to fix.

Good for me too if we go this way. I was just curious what if we
changed commit_buffer field and got hooked in.

> But as Jeff suggested, we should step back a bit and think what our goal
> is.
>
> The low level object format of our commit is textual header fields, each
> of which is terminated with a LF, followed by a LF to mark the end of
> header fields, and then opaque payload that can contain any bytes. It does
> not forbid a non-Git application to reuse the object store infrastructure
> to store ASN.1 binary goo there, and the low level interface we give such
> as cat-file is a perfectly valid way to inspect such a "commit" object.

cat-file is fine, commit-tree (or any commands that call
commit_tree()) cuts at NUL though.

I wonder how git processes commit messages in utf-16. Did a quick
test, did not look good. But that's because git-commit cuts at NUL.
But even if git-commit makes a good object, I doubt if git-log shows
it right.
-- 
Duy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/22] Refactor to accept NUL in commit messages
  2011-10-23  6:37       ` Nguyen Thai Ngoc Duy
@ 2011-10-23  9:46         ` Junio C Hamano
  2011-10-23 10:17           ` Nguyen Thai Ngoc Duy
  2011-10-23 16:07           ` Jeff King
  0 siblings, 2 replies; 26+ messages in thread
From: Junio C Hamano @ 2011-10-23  9:46 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: git, peff, Ævar Arnfjörð

Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:

> On Sun, Oct 23, 2011 at 4:51 PM, Junio C Hamano <gitster@pobox.com> wrote:
> ...
>> The low level object format of our commit is textual header fields, each
>> of which is terminated with a LF, followed by a LF to mark the end of
>> header fields, and then opaque payload that can contain any bytes. It does
>> not forbid a non-Git application to reuse the object store infrastructure
>> to store ASN.1 binary goo there, and the low level interface we give such
>> as cat-file is a perfectly valid way to inspect such a "commit" object.
>
> cat-file is fine, commit-tree (or any commands that call
> commit_tree()) cuts at NUL though.
> I wonder how git processes commit messages in utf-16.

That is exactly what I am saying.

Perhaps you didn't either read or understand what you omitted from your
quoting; otherwise you even wouldn't have brought up utf-16.

Let me requote that part for you.

> But when it comes to "Git" Porcelains (e.g. the log family of commands),
> we do assume people do not store random binary byte sequences in commits,
> and we do take advantage of that assumption by splitting each "line" at
> LF, indenting them with 4 spaces, etc. In other words, a commit log in the
> Git context _is_ pretty much text and not arbitrary byte sequence.

Think what would cutting at a byte whose value is 012 and adding four
bytes whose values are 040 to each of "lines" that formed with such
cutting do to UTF-16 goo, even if it does not contain any NUL byte. As far
as Git Porcelains are concerned, it is no different from random binary
byte sequences.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/22] Refactor to accept NUL in commit messages
  2011-10-23  9:46         ` Junio C Hamano
@ 2011-10-23 10:17           ` Nguyen Thai Ngoc Duy
  2011-10-23 16:07           ` Jeff King
  1 sibling, 0 replies; 26+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-10-23 10:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff, Ævar Arnfjörð

On Sun, Oct 23, 2011 at 8:46 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:
>
>> On Sun, Oct 23, 2011 at 4:51 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> ...
>>> The low level object format of our commit is textual header fields, each
>>> of which is terminated with a LF, followed by a LF to mark the end of
>>> header fields, and then opaque payload that can contain any bytes. It does
>>> not forbid a non-Git application to reuse the object store infrastructure
>>> to store ASN.1 binary goo there, and the low level interface we give such
>>> as cat-file is a perfectly valid way to inspect such a "commit" object.
>>
>> cat-file is fine, commit-tree (or any commands that call
>> commit_tree()) cuts at NUL though.
>> I wonder how git processes commit messages in utf-16.
>
> That is exactly what I am saying.
>
> Perhaps you didn't either read or understand what you omitted from your
> quoting; otherwise you even wouldn't have brought up utf-16.
>
> Let me requote that part for you.
>
>> But when it comes to "Git" Porcelains (e.g. the log family of commands),
>> we do assume people do not store random binary byte sequences in commits,
>> and we do take advantage of that assumption by splitting each "line" at
>> LF, indenting them with 4 spaces, etc. In other words, a commit log in the
>> Git context _is_ pretty much text and not arbitrary byte sequence.
>
> Think what would cutting at a byte whose value is 012 and adding four
> bytes whose values are 040 to each of "lines" that formed with such
> cutting do to UTF-16 goo, even if it does not contain any NUL byte. As far
> as Git Porcelains are concerned, it is no different from random binary
> byte sequences.
>

I'm sorry. The utf-16 was an afterthought when I was nearly finished
with the reply and already cut that quote.

The assumption that people do not store random binary byte sequences
in commits sort of conflicts with "encoding" field in the commit
header though. The assumption is documented in i18n.txt. I guess it's
just me who did not read document carefully. But maybe it's good to
stop people from shooting themselves in this case (i.e. setting
encoding to utf-16 or similar).
-- 
Duy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/22] Refactor to accept NUL in commit messages
  2011-10-23  9:46         ` Junio C Hamano
  2011-10-23 10:17           ` Nguyen Thai Ngoc Duy
@ 2011-10-23 16:07           ` Jeff King
  2011-10-23 20:16             ` Junio C Hamano
  1 sibling, 1 reply; 26+ messages in thread
From: Jeff King @ 2011-10-23 16:07 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Nguyen Thai Ngoc Duy, git, Ævar Arnfjörð

On Sun, Oct 23, 2011 at 02:46:59AM -0700, Junio C Hamano wrote:

> > But when it comes to "Git" Porcelains (e.g. the log family of commands),
> > we do assume people do not store random binary byte sequences in commits,
> > and we do take advantage of that assumption by splitting each "line" at
> > LF, indenting them with 4 spaces, etc. In other words, a commit log in the
> > Git context _is_ pretty much text and not arbitrary byte sequence.
> 
> Think what would cutting at a byte whose value is 012 and adding four
> bytes whose values are 040 to each of "lines" that formed with such
> cutting do to UTF-16 goo, even if it does not contain any NUL byte. As far
> as Git Porcelains are concerned, it is no different from random binary
> byte sequences.

But as Duy mentions, we have an encoding header. Shouldn't we treat it
like binary goo until we do reencode_log_message, and _then_ we can
break it into lines?

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/22] Refactor to accept NUL in commit messages
  2011-10-23 16:07           ` Jeff King
@ 2011-10-23 20:16             ` Junio C Hamano
  2011-10-24  4:40               ` Junio C Hamano
  0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2011-10-23 20:16 UTC (permalink / raw)
  To: Jeff King; +Cc: Nguyen Thai Ngoc Duy, git, Ævar Arnfjörð

Jeff King <peff@peff.net> writes:

> But as Duy mentions, we have an encoding header. Shouldn't we treat it
> like binary goo until we do reencode_log_message, and _then_ we can
> break it into lines?

That's sensible. If we go that route, I think the "one allocation of
separate struct commit_buffer pointed from a pointer field in struct
commit to replace the current member 'buffer'" is a reasonable thing
to do.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/22] Refactor to accept NUL in commit messages
  2011-10-23 20:16             ` Junio C Hamano
@ 2011-10-24  4:40               ` Junio C Hamano
  2011-10-24  5:10                 ` Nguyen Thai Ngoc Duy
  2011-10-24 22:45                 ` Jeff King
  0 siblings, 2 replies; 26+ messages in thread
From: Junio C Hamano @ 2011-10-24  4:40 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Nguyen Thai Ngoc Duy, Ævar Arnfjörð

Junio C Hamano <gitster@pobox.com> writes:

> Jeff King <peff@peff.net> writes:
>
>> But as Duy mentions, we have an encoding header. Shouldn't we treat it
>> like binary goo until we do reencode_log_message, and _then_ we can
>> break it into lines?
>
> That's sensible. If we go that route, I think the "one allocation of
> separate struct commit_buffer pointed from a pointer field in struct
> commit to replace the current member 'buffer'" is a reasonable thing
> to do.

Having given that "sensible" comment, I am not convinced if this is worth
it. We are talking about what is left in the ephemeral COMMIT_EDITMSG by
the chosen editor, but are there really editors that can _only_ write in
UTF-16 and not in UTF-8, and is it worth bending backwards to add support
such an editor?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/22] Refactor to accept NUL in commit messages
  2011-10-24  4:40               ` Junio C Hamano
@ 2011-10-24  5:10                 ` Nguyen Thai Ngoc Duy
  2011-10-24 11:09                   ` Štěpán Němec
  2011-10-24 22:45                 ` Jeff King
  1 sibling, 1 reply; 26+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-10-24  5:10 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King, Ævar Arnfjörð

On Mon, Oct 24, 2011 at 3:40 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>
>> Jeff King <peff@peff.net> writes:
>>
>>> But as Duy mentions, we have an encoding header. Shouldn't we treat it
>>> like binary goo until we do reencode_log_message, and _then_ we can
>>> break it into lines?
>>
>> That's sensible. If we go that route, I think the "one allocation of
>> separate struct commit_buffer pointed from a pointer field in struct
>> commit to replace the current member 'buffer'" is a reasonable thing
>> to do.
>
> Having given that "sensible" comment, I am not convinced if this is worth
> it. We are talking about what is left in the ephemeral COMMIT_EDITMSG by
> the chosen editor, but are there really editors that can _only_ write in
> UTF-16 and not in UTF-8, and is it worth bending backwards to add support
> such an editor?

This is argument for the sake of argument because I don't use utf-16
and do not care much. UTF-16 can have more code points and some may
prefer utf-16 to utf-8.

I'd be happy with git's refusing to create broken commits because
people accidentally set editor encoding to utf-16. If people do use
utf-16, they'll get caught and should yell up if they want utf-16
supported.
-- 
Duy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/22] Refactor to accept NUL in commit messages
  2011-10-24  5:10                 ` Nguyen Thai Ngoc Duy
@ 2011-10-24 11:09                   ` Štěpán Němec
  0 siblings, 0 replies; 26+ messages in thread
From: Štěpán Němec @ 2011-10-24 11:09 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy
  Cc: Junio C Hamano, git, Jeff King, Ævar Arnfjörð

On Mon, 24 Oct 2011 07:10:08 +0200
Nguyen Thai Ngoc Duy wrote:

> This is argument for the sake of argument because I don't use utf-16
> and do not care much. UTF-16 can have more code points and some may
> prefer utf-16 to utf-8.

I suspect this is really tangential to this thread, but I can't make
much sense of that last sentence -- if you meant that UTF-16 is somehow
more apt at encoding Unicode code points than UTF-8, then that's not the
case. Both can represent all Unicode characters. If anything, things are
_more_, not less complicated in UTF-16, which apart from the NUL and
endianness complications has to jump through the "surrogate pairs" hoop
for code points bigger than U+FFFF (so you'll actually find many apps
with buggy UTF-16 implementation which break for those code points,
unlike when using UTF-8).

-- 
Štěpán

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/22] Refactor to accept NUL in commit messages
  2011-10-24  4:40               ` Junio C Hamano
  2011-10-24  5:10                 ` Nguyen Thai Ngoc Duy
@ 2011-10-24 22:45                 ` Jeff King
  2011-10-25 10:16                   ` Štěpán Němec
  2011-10-25 14:07                   ` Junio C Hamano
  1 sibling, 2 replies; 26+ messages in thread
From: Jeff King @ 2011-10-24 22:45 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Nguyen Thai Ngoc Duy, Ævar Arnfjörð

On Sun, Oct 23, 2011 at 09:40:51PM -0700, Junio C Hamano wrote:

> >> But as Duy mentions, we have an encoding header. Shouldn't we treat it
> >> like binary goo until we do reencode_log_message, and _then_ we can
> >> break it into lines?
> >
> > That's sensible. If we go that route, I think the "one allocation of
> > separate struct commit_buffer pointed from a pointer field in struct
> > commit to replace the current member 'buffer'" is a reasonable thing
> > to do.
> 
> Having given that "sensible" comment, I am not convinced if this is worth
> it. We are talking about what is left in the ephemeral COMMIT_EDITMSG by
> the chosen editor, but are there really editors that can _only_ write in
> UTF-16 and not in UTF-8, and is it worth bending backwards to add support
> such an editor?

Couldn't you make the same argument about iso8859-1, or any other
encoding? The user has some encoding that they want to use, for whatever
reason[1]. We have a slot for an encoding header; is there a reason that
git would allow some encodings and not others?

I mean, besides the obvious that UTF-16 is annoying and contains
embedded NULs and newlines.

-Peff

[1] English is my first language, so it's rare for me to even step
outside of ASCII, let alone latin1. But aren't there some languages in
which utf-16 is more efficient than utf-8?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/22] Refactor to accept NUL in commit messages
  2011-10-24 22:45                 ` Jeff King
@ 2011-10-25 10:16                   ` Štěpán Němec
  2011-10-25 14:07                   ` Junio C Hamano
  1 sibling, 0 replies; 26+ messages in thread
From: Štěpán Němec @ 2011-10-25 10:16 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, git, Nguyen Thai Ngoc Duy,
	Ævar Arnfjörð

On Tue, 25 Oct 2011 00:45:58 +0200
Jeff King wrote:

> [1] English is my first language, so it's rare for me to even step
> outside of ASCII, let alone latin1. But aren't there some languages in
> which utf-16 is more efficient than utf-8?

You sometimes hear something along the lines of the second
"disadvantage" listed in the article below, i.e. "Characters U+0800
through U+FFFF use three bytes in UTF-8, but only two in UTF-16.":

https://en.wikipedia.org/wiki/UTF-8#Disadvantages_4

-- 
Štěpán

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/22] Refactor to accept NUL in commit messages
  2011-10-24 22:45                 ` Jeff King
  2011-10-25 10:16                   ` Štěpán Němec
@ 2011-10-25 14:07                   ` Junio C Hamano
  2011-10-27 18:13                     ` Jeff King
  1 sibling, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2011-10-25 14:07 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Nguyen Thai Ngoc Duy, Ævar Arnfjörð

Jeff King <peff@peff.net> writes:

> I mean, besides the obvious that UTF-16 is ...

Yes, you could, besides the obvious. But that obvious reason makes it
sufficiently different that it may not be so outrageous to draw the line
between it and all the others.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/22] Refactor to accept NUL in commit messages
  2011-10-25 14:07                   ` Junio C Hamano
@ 2011-10-27 18:13                     ` Jeff King
  2011-10-27 18:47                       ` Junio C Hamano
  0 siblings, 1 reply; 26+ messages in thread
From: Jeff King @ 2011-10-27 18:13 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Nguyen Thai Ngoc Duy, Ævar Arnfjörð

On Tue, Oct 25, 2011 at 07:07:38AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > I mean, besides the obvious that UTF-16 is ...
> 
> Yes, you could, besides the obvious. But that obvious reason makes it
> sufficiently different that it may not be so outrageous to draw the line
> between it and all the others.

Yeah, and I'm OK with that. It's just not a satisfying answer to give
Windows people who think UTF-16 is a good idea. But at the very least,
it's still unicode. It should be lossless for them to convert to utf8
and back if they want.

Speaking of which, I've been looking at handling diffing of utf-16
files. Right now we generally just consider them binary, which sucks.
It's easy to identify them by BOM in the is_buffer_binary() code, but
that's only part of it. We do an OK job of diffing them, except that:

  1. The BOM makes some diffs a little noisier.

  2. We split lines on 0x0a. But this byte can appear in other code
     points, like 0x010a (Ċ), or the entire entire 0x0a* code point (the
     entire Gurmukhi charset).

I'm tempted to detect the UTF-{16,32}{LE,BE} by their BOM, reencode them
to utf8, and then display them in utf8. Is that too gross for us to
consider?

You can kind-of implement this outside of git using textconv. But you
have to manually mark each file as utf-16, as there's no way to trigger
an alternative diff driver on something like a BOM.

I'm really not clear on how people with utf-16 files work. Even if we
did treat utf-16 like text, the _rest_ of git is outputting ascii, so
it's not like their terminals are utf-16. But we do have projects on
github with utf-16 and utf-32 encodings.

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/22] Refactor to accept NUL in commit messages
  2011-10-27 18:13                     ` Jeff King
@ 2011-10-27 18:47                       ` Junio C Hamano
  2011-10-27 18:52                         ` Jeff King
  0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2011-10-27 18:47 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Nguyen Thai Ngoc Duy, Ævar Arnfjörð

Jeff King <peff@peff.net> writes:

> I'm tempted to detect the UTF-{16,32}{LE,BE} by their BOM, reencode them
> to utf8, and then display them in utf8. Is that too gross for us to
> consider?

I tend to think so; it is entirely a different matter if the user
instructed us to clean/smudge UTF-16 payload into/outof UTF-8.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/22] Refactor to accept NUL in commit messages
  2011-10-27 18:47                       ` Junio C Hamano
@ 2011-10-27 18:52                         ` Jeff King
  2011-10-27 19:14                           ` Junio C Hamano
  0 siblings, 1 reply; 26+ messages in thread
From: Jeff King @ 2011-10-27 18:52 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Nguyen Thai Ngoc Duy, Ævar Arnfjörð

On Thu, Oct 27, 2011 at 11:47:27AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > I'm tempted to detect the UTF-{16,32}{LE,BE} by their BOM, reencode them
> > to utf8, and then display them in utf8. Is that too gross for us to
> > consider?
> 
> I tend to think so; it is entirely a different matter if the user
> instructed us to clean/smudge UTF-16 payload into/outof UTF-8.

Minor nit, but this is just for diff, so it is not about clean/smudge
but rather about doing something like textconv.

The other option I mentioned would be something like detecting the BOM
and pretending as if the attribute "diff=utf-16" was set (which would do
nothing by default). Then people could set themselves up to handle
utf-16 if they wanted, but wouldn't have to go around marking each file
with .gitattributes.

But maybe that is too gross, too, and they should just use
.gitattributes.

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/22] Refactor to accept NUL in commit messages
  2011-10-27 18:52                         ` Jeff King
@ 2011-10-27 19:14                           ` Junio C Hamano
  2011-10-27 23:44                             ` Jeff King
  0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2011-10-27 19:14 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Nguyen Thai Ngoc Duy, Ævar Arnfjörð

Jeff King <peff@peff.net> writes:

> On Thu, Oct 27, 2011 at 11:47:27AM -0700, Junio C Hamano wrote:
>
>> Jeff King <peff@peff.net> writes:
>> 
>> > I'm tempted to detect the UTF-{16,32}{LE,BE} by their BOM, reencode them
>> > to utf8, and then display them in utf8. Is that too gross for us to
>> > consider?
>> 
>> I tend to think so; it is entirely a different matter if the user
>> instructed us to clean/smudge UTF-16 payload into/outof UTF-8.
>
> Minor nit, but this is just for diff, so it is not about clean/smudge
> but rather about doing something like textconv.

I can understand if some tools in the Windows land prefer to work with
these encodings, so clean/smudge to have the checkout in these encodings
would be a reasonable thing not just diff but things like grep. On the
other hand, I do doubt the sanity of these people if they want to have
in-repository representation also in these encodings.

So I do not think "it is just for diff" is any improvement.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/22] Refactor to accept NUL in commit messages
  2011-10-27 19:14                           ` Junio C Hamano
@ 2011-10-27 23:44                             ` Jeff King
  2011-10-28  0:03                               ` Junio C Hamano
  2011-10-28  1:40                               ` Miles Bader
  0 siblings, 2 replies; 26+ messages in thread
From: Jeff King @ 2011-10-27 23:44 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Nguyen Thai Ngoc Duy, Ævar Arnfjörð

On Thu, Oct 27, 2011 at 12:14:34PM -0700, Junio C Hamano wrote:

> > Minor nit, but this is just for diff, so it is not about clean/smudge
> > but rather about doing something like textconv.
> 
> I can understand if some tools in the Windows land prefer to work with
> these encodings, so clean/smudge to have the checkout in these encodings
> would be a reasonable thing not just diff but things like grep. On the
> other hand, I do doubt the sanity of these people if they want to have
> in-repository representation also in these encodings.

I'm pretty much of the same mind. We do have people with utf-16 in their
repositories on github. I have no idea why they do such a thing, or what
kinds of tricks they do to make it usable (because without it, they just
get "binary files differ").

My interest is to make things like bare-repository diff (and everything
built on it; i.e., things like github, gitweb, or whatever) do the sane
thing for these people, even if I think what they're doing is wrong. And
as always, I try to structure the git portions of that as much as
possible to be general and help everybody, so they can be pushed
upstream (also, then I don't have to worry about managing local changes
:) ).

But it sounds like this is probably just too ugly and should end up as a
github-specific thing.

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/22] Refactor to accept NUL in commit messages
  2011-10-27 23:44                             ` Jeff King
@ 2011-10-28  0:03                               ` Junio C Hamano
  2011-10-28  0:19                                 ` Jeff King
  2011-10-28  1:40                               ` Miles Bader
  1 sibling, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2011-10-28  0:03 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Nguyen Thai Ngoc Duy, Ævar Arnfjörð

Jeff King <peff@peff.net> writes:

> My interest is to make things like bare-repository diff (and everything
> built on it; i.e., things like github, gitweb, or whatever) do the sane
> thing for these people, even if I think what they're doing is wrong.

I do not think we are talking about right or wrong. I was primarily saying
that textconv may not be the right thing (think github/gitweb showing blob
contents, nicely formatted inside the chrome the site provides).

The solution you suggested feels like a gross layering violation, unless
we do it everywhere, in which case I wouldn't mind too much.

We have in-repository representation that diff and grep and friends work
on, and output conversion layer that externalizes the result of them in
the form of "smudge". Another layer above the in-repository representation
and below operations could convert UTF-16 to UTF-8 when going outward and
in the opposite when going inward.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/22] Refactor to accept NUL in commit messages
  2011-10-28  0:03                               ` Junio C Hamano
@ 2011-10-28  0:19                                 ` Jeff King
  0 siblings, 0 replies; 26+ messages in thread
From: Jeff King @ 2011-10-28  0:19 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Nguyen Thai Ngoc Duy, Ævar Arnfjörð

On Thu, Oct 27, 2011 at 05:03:29PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > My interest is to make things like bare-repository diff (and everything
> > built on it; i.e., things like github, gitweb, or whatever) do the sane
> > thing for these people, even if I think what they're doing is wrong.
> 
> I do not think we are talking about right or wrong. I was primarily saying
> that textconv may not be the right thing (think github/gitweb showing blob
> contents, nicely formatted inside the chrome the site provides).

But I think it is probably a wrong thing to store utf-16 as the
canonical format inside the git repository. Git simply can't handle it
for diffing. And the right thing, as you suggested, is clean/smudge.

But I'm dealing with repositories on the server side, where it is too
late to do clean/smudge; I just get whatever junk people commited.

> We have in-repository representation that diff and grep and friends work
> on, and output conversion layer that externalizes the result of them in
> the form of "smudge". Another layer above the in-repository representation
> and below operations could convert UTF-16 to UTF-8 when going outward and
> in the opposite when going inward.

I'm not sure that could sanely be done in a backwards compatible way.
Doing it with just textual diffs is a hack, of course, but at least we
know that the damage is limited, and the diff we generate on top doesn't
care that much about the original sha1s[1]. But should read_object_sha1
learn to convert utf-16 into utf-8? I think madness lies that way, as
we are breaking assumptions about sha1 validity.

-Peff

[1] Actually, the text diff does mention the original and resulting
sha1s, which would now either bear no relation to the diff text, or bear
no relation to what's in the repo. Either way, I think we are creating
something that can't necessarily be applied, which is bad. And is why I
thought of textconv, which is basically the same concept (and has the
same problems).

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/22] Refactor to accept NUL in commit messages
  2011-10-27 23:44                             ` Jeff King
  2011-10-28  0:03                               ` Junio C Hamano
@ 2011-10-28  1:40                               ` Miles Bader
  2011-10-28  4:07                                 ` Junio C Hamano
  1 sibling, 1 reply; 26+ messages in thread
From: Miles Bader @ 2011-10-28  1:40 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, git, Nguyen Thai Ngoc Duy,
	Ævar Arnfjörð

Jeff King <peff@peff.net> writes:
> We do have people with utf-16 in their repositories on github. I
> have no idea why they do such a thing, or what kinds of tricks they
> do to make it usable (because without it, they just get "binary
> files differ").

Hmm, you could ask them ...  [or, I suppose more diplomatically, post
a blog entry asking "Hey all you people who use github for utf-16
encoded files, ..."]

-Miles

-- 
Year, n. A period of three hundred and sixty-five disappointments.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/22] Refactor to accept NUL in commit messages
  2011-10-28  1:40                               ` Miles Bader
@ 2011-10-28  4:07                                 ` Junio C Hamano
  0 siblings, 0 replies; 26+ messages in thread
From: Junio C Hamano @ 2011-10-28  4:07 UTC (permalink / raw)
  To: Miles Bader
  Cc: Jeff King, git, Nguyen Thai Ngoc Duy,
	Ævar Arnfjörð

Miles Bader <miles@gnu.org> writes:

> Jeff King <peff@peff.net> writes:
>> We do have people with utf-16 in their repositories on github. I
>> have no idea why they do such a thing, or what kinds of tricks they
>> do to make it usable (because without it, they just get "binary
>> files differ").
>
> Hmm, you could ask them ...  [or, I suppose more diplomatically, post
> a blog entry asking "Hey all you people who use github for utf-16
> encoded files, ..."]

People will hate such a flag day event initially and later thank you for
it ;-) 

I am afraid that that is a wishful thinking, though.

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2011-10-28  4:07 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <1319277881-4128-1-git-send-email-pclouds@gmail.com>
2011-10-22 19:09 ` [PATCH 00/22] Refactor to accept NUL in commit messages Jeff King
2011-10-23 10:44   ` Robin Rosenberg
2011-10-23 16:09     ` Jeff King
2011-10-22 22:47 ` Junio C Hamano
2011-10-23  1:24   ` Nguyen Thai Ngoc Duy
2011-10-23  5:51     ` Junio C Hamano
2011-10-23  6:37       ` Nguyen Thai Ngoc Duy
2011-10-23  9:46         ` Junio C Hamano
2011-10-23 10:17           ` Nguyen Thai Ngoc Duy
2011-10-23 16:07           ` Jeff King
2011-10-23 20:16             ` Junio C Hamano
2011-10-24  4:40               ` Junio C Hamano
2011-10-24  5:10                 ` Nguyen Thai Ngoc Duy
2011-10-24 11:09                   ` Štěpán Němec
2011-10-24 22:45                 ` Jeff King
2011-10-25 10:16                   ` Štěpán Němec
2011-10-25 14:07                   ` Junio C Hamano
2011-10-27 18:13                     ` Jeff King
2011-10-27 18:47                       ` Junio C Hamano
2011-10-27 18:52                         ` Jeff King
2011-10-27 19:14                           ` Junio C Hamano
2011-10-27 23:44                             ` Jeff King
2011-10-28  0:03                               ` Junio C Hamano
2011-10-28  0:19                                 ` Jeff King
2011-10-28  1:40                               ` Miles Bader
2011-10-28  4:07                                 ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).