* git rev-list %an, %ae, %at bug in v1.7.10.1 and beyond @ 2012-05-21 8:01 Bryan Turner 2012-05-22 4:32 ` Jeff King 0 siblings, 1 reply; 6+ messages in thread From: Bryan Turner @ 2012-05-21 8:01 UTC (permalink / raw) To: git Hello all, I believe I've found an issue in git v1.7.10.1 and v1.7.10.2 (and master) where the output of git rev-list has changed for some commits. Most commits do not appear to trigger the issue; it may be related to Unicode characters being used in the author name. I've run a git bisect and it appears the bug was introduced in 4b340cfab9c7a18e39bc531d6a6ffaffdf95f62d. Built from master (as well as using the 1.7.10.1 and 1.7.10.2 release tags), I get output like this: ================================ Output ===================================== aphrael:qa-resources.git bturner$ git rev-list --format="%H|%h|%P|%p|%an|%ae|%at%n%B%n@@object_end@@" -1 5c1ccdec5f84aa149a4978f729fdda70769f942f commit 5c1ccdec5f84aa149a4978f729fdda70769f942f 5c1ccdec5f84aa149a4978f729fdda70769f942f|5c1ccde|02c78bc39ac6192623bf080e3e2ac892a4f5764c|02c78bc||| commit with unicode name @@object_end@@ ================================ End ======================================== Note that the author name, e-mail and timestamp values are all missing (the three |'s in a row at the end). Built from 0dbe6592ccbd1a394a69a52074e3729d546fe952, the parent of 4b340cf, and in previous versions of git (1.7.10 and earlier), I got output like this: ================================ Output ===================================== aphrael:qa-resources.git bturner$ git rev-list --format="%H|%h|%P|%p|%an|%ae|%at%n%B%n@@object_end@@" -1 5c1ccdec5f84aa149a4978f729fdda70769f942f commit 5c1ccdec5f84aa149a4978f729fdda70769f942f 5c1ccdec5f84aa149a4978f729fdda70769f942f|5c1ccde|02c78bc39ac6192623bf080e3e2ac892a4f5764c|02c78bc|a|farmas@atlassian.com|1327876222 commit with unicode name @@object_end@@ ================================ End ======================================== Note that the author name, e-mail and timestamp are all present. The "a" appears as ASCII here, but it's actually a UTF-16LE character (the terminal on the Mac is being "helpful"). To try and verify whether the difference is a bug or a bugfix (because I wasn't certain whether perhaps the output from earlier versions was the result of a bug which was fixed in 1.7.10.1 and on), I compared the git rev-list output with git cat-file -p (again, built from master): ================================ Output ===================================== aphrael:qa-resources.git bturner$ git cat-file -p 5c1ccdec5f84aa149a4978f729fdda70769f942f tree dd173cb70baaac07bdf405f4e3db110e7fafd180 parent 02c78bc39ac6192623bf080e3e2ac892a4f5764c author a <farmas@atlassian.com> 1327876222 +1100 committer a <farmas@atlassian.com> 1327876222 +1100 commit with unicode name ================================ End ======================================== The git cat-file output is consistent across versions of git where I'm seeing the rev-list issue and versions where I'm not. I would be happy to provide a zip file containing the repository with the commit shown in all the output above, if that will facilitate testing/fixing the issue. Just let me know where to put it. I lack the C chops to provide a patch myself; sorry about that. Best regards, Bryan Turner ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: git rev-list %an, %ae, %at bug in v1.7.10.1 and beyond 2012-05-21 8:01 git rev-list %an, %ae, %at bug in v1.7.10.1 and beyond Bryan Turner @ 2012-05-22 4:32 ` Jeff King 2012-05-22 4:35 ` Jeff King 0 siblings, 1 reply; 6+ messages in thread From: Jeff King @ 2012-05-22 4:32 UTC (permalink / raw) To: Bryan Turner; +Cc: git On Mon, May 21, 2012 at 06:01:50PM +1000, Bryan Turner wrote: > Note that the author name, e-mail and timestamp values are all missing > (the three |'s in a row at the end). > [...] > Built from 0dbe6592ccbd1a394a69a52074e3729d546fe952, the parent of > 4b340cf, and in previous versions of git (1.7.10 and earlier), I got > output like this: > [...] > Note that the author name, e-mail and timestamp are all present. The > "a" appears as ASCII here, but it's actually a UTF-16LE character (the > terminal on the Mac is being "helpful"). I'm not too surprised this is broken (in fact, I'm surprised it ever really worked). UTF-16, especially representing pure ascii characters, will have embedded NULs. Most of the code assumes that things like names and emails are NUL-terminated and ascii-compatible (so ascii, or some ascii-superset encoding like utf8, iso8859-1, etc). You can store a commit message (and name) in utf-16 if you tell git that you are doing so, but we should be re-encoding it before handling it. > ================================ Output ===================================== > aphrael:qa-resources.git bturner$ git cat-file -p > 5c1ccdec5f84aa149a4978f729fdda70769f942f > tree dd173cb70baaac07bdf405f4e3db110e7fafd180 > parent 02c78bc39ac6192623bf080e3e2ac892a4f5764c > author a <farmas@atlassian.com> 1327876222 +1100 > committer a <farmas@atlassian.com> 1327876222 +1100 > > commit with unicode name > ================================ End ======================================== There's no encoding header, so having a utf-16 character there is wrong. How did you make such a commit in the first place, though? I believe that git-commit treats the name as a string and would terminate on a NUL (or am I wrong in thinking that this "a" is really U+0061, and is actually some other unicode character that _looks_ like "a", but doesn't contain a NUL?). Did you create it by piping straight to git-hash-object? What does piping the above through "xxd" or "cat -A" show? -Peff ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: git rev-list %an, %ae, %at bug in v1.7.10.1 and beyond 2012-05-22 4:32 ` Jeff King @ 2012-05-22 4:35 ` Jeff King 2012-05-22 5:13 ` Bryan Turner 0 siblings, 1 reply; 6+ messages in thread From: Jeff King @ 2012-05-22 4:35 UTC (permalink / raw) To: Bryan Turner; +Cc: git On Tue, May 22, 2012 at 12:32:21AM -0400, Jeff King wrote: > I'm not too surprised this is broken (in fact, I'm surprised it ever > really worked). UTF-16, especially representing pure ascii characters, > will have embedded NULs. Most of the code assumes that things like names > and emails are NUL-terminated and ascii-compatible (so ascii, or some > ascii-superset encoding like utf8, iso8859-1, etc). You can store a > commit message (and name) in utf-16 if you tell git that you are doing > so, but we should be re-encoding it before handling it. Actually, I take that back. I think storing in utf-16 would probably fail. We need to use ascii to even read the headers to get to the encoding header to realize that it is in utf-16. So I believe we really do only support ascii-superset encodings. -Peff ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: git rev-list %an, %ae, %at bug in v1.7.10.1 and beyond 2012-05-22 4:35 ` Jeff King @ 2012-05-22 5:13 ` Bryan Turner 2012-05-22 5:58 ` Jeff King 0 siblings, 1 reply; 6+ messages in thread From: Bryan Turner @ 2012-05-22 5:13 UTC (permalink / raw) To: Jeff King; +Cc: git Peff, Thanks for taking the time to respond! I appreciate it. I went and spoke to the tester who made the original commit, and it sounds like what he did was open .git/config with a text editor and update the user.name directly. He then saved the file. His operating system's code page was UTF-16LE, so I believe the file was saved in that format. He then did the commit. That sort of behavior is pretty far outside the scope of anything git can realistically be expected to handle (and I told the tester that as well). That said, I'm interested that it "suddenly" broke, and also that cat-file still appears to not think there's any problem. I piped the output through xxd, for cat-file and rev-list. The cat-file output is the same in both versions of git, and looks like this: auri:qa-resources.git bturner$ git cat-file -p 5c1ccdec5f84aa149a4978f729fdda70769f942f | xxd 0000000: 7472 6565 2064 6431 3733 6362 3730 6261 tree dd173cb70ba 0000010: 6161 6330 3762 6466 3430 3566 3465 3364 aac07bdf405f4e3d 0000020: 6231 3130 6537 6661 6664 3138 300a 7061 b110e7fafd180.pa 0000030: 7265 6e74 2030 3263 3738 6263 3339 6163 rent 02c78bc39ac 0000040: 3631 3932 3632 3362 6630 3830 6533 6532 6192623bf080e3e2 0000050: 6163 3839 3261 3466 3537 3634 630a 6175 ac892a4f5764c.au 0000060: 7468 6f72 2061 203c 6661 726d 6173 4061 thor a <farmas@a 0000070: 746c 6173 7369 616e 2e63 6f6d 3e20 3133 tlassian.com> 13 0000080: 3237 3837 3632 3232 202b 3131 3030 0a63 27876222 +1100.c 0000090: 6f6d 6d69 7474 6572 2061 203c 6661 726d ommitter a <farm 00000a0: 6173 4061 746c 6173 7369 616e 2e63 6f6d as@atlassian.com 00000b0: 3e20 3133 3237 3837 3632 3232 202b 3131 > 1327876222 +11 00000c0: 3030 0a0a 636f 6d6d 6974 2077 6974 6820 00..commit with 00000d0: 756e 6963 6f64 6520 6e61 6d65 0a unicode name. In 1.7.9.5, rev-list looks like this: auri:qa-resources.git bturner$ git rev-list --format="%H|%h|%P|%p|%an|%ae|%at%n%B%n@@object_end@@" -1 5c1ccdec5f84aa149a4978f729fdda70769f942f | xxd 0000000: 636f 6d6d 6974 2035 6331 6363 6465 6335 commit 5c1ccdec5 0000010: 6638 3461 6131 3439 6134 3937 3866 3732 f84aa149a4978f72 0000020: 3966 6464 6137 3037 3639 6639 3432 660a 9fdda70769f942f. 0000030: 3563 3163 6364 6563 3566 3834 6161 3134 5c1ccdec5f84aa14 0000040: 3961 3439 3738 6637 3239 6664 6461 3730 9a4978f729fdda70 0000050: 3736 3966 3934 3266 7c35 6331 6363 6465 769f942f|5c1ccde 0000060: 7c30 3263 3738 6263 3339 6163 3631 3932 |02c78bc39ac6192 0000070: 3632 3362 6630 3830 6533 6532 6163 3839 623bf080e3e2ac89 0000080: 3261 3466 3537 3634 637c 3032 6337 3862 2a4f5764c|02c78b 0000090: 637c 617c 6661 726d 6173 4061 746c 6173 c|a|farmas@atlas 00000a0: 7369 616e 2e63 6f6d 7c31 3332 3738 3736 sian.com|1327876 00000b0: 3232 320a 636f 6d6d 6974 2077 6974 6820 222.commit with 00000c0: 756e 6963 6f64 6520 6e61 6d65 0a0a 4040 unicode name..@@ 00000d0: 6f62 6a65 6374 5f65 6e64 4040 0a object_end@@. In 1.7.10.2, it looks like this: auri:qa-resources.git bturner$ git rev-list --format="%H|%h|%P|%p|%an|%ae|%at%n%B%n@@object_end@@" -1 5c1ccdec5f84aa149a4978f729fdda70769f942f | xxd 0000000: 636f 6d6d 6974 2035 6331 6363 6465 6335 commit 5c1ccdec5 0000010: 6638 3461 6131 3439 6134 3937 3866 3732 f84aa149a4978f72 0000020: 3966 6464 6137 3037 3639 6639 3432 660a 9fdda70769f942f. 0000030: 3563 3163 6364 6563 3566 3834 6161 3134 5c1ccdec5f84aa14 0000040: 3961 3439 3738 6637 3239 6664 6461 3730 9a4978f729fdda70 0000050: 3736 3966 3934 3266 7c35 6331 6363 6465 769f942f|5c1ccde 0000060: 7c30 3263 3738 6263 3339 6163 3631 3932 |02c78bc39ac6192 0000070: 3632 3362 6630 3830 6533 6532 6163 3839 623bf080e3e2ac89 0000080: 3261 3466 3537 3634 637c 3032 6337 3862 2a4f5764c|02c78b 0000090: 637c 7c7c 0a63 6f6d 6d69 7420 7769 7468 c|||.commit with 00000a0: 2075 6e69 636f 6465 206e 616d 650a 0a40 unicode name..@ 00000b0: 406f 626a 6563 745f 656e 6440 400a @object_end@@. None of the output indicates a NUL is present. I'm wondering if the old code that rev-list was running (and maybe the code cat-file appears to still be running) might be ignoring leading NULs until it finds the first character, or something similar. If that were the case, had the tester used a name like "ab" (or anything more than one character), perhaps the commit would have caused unexpected behavior all along. I don't know how to dump the binary for the commit object itself; that may give us better information on what the author segment actually contains. Unfortunately, the repository has been packed, so there is no loose object for the commit anymore. If there's any other information I can provide, please let me know. Best regards, Bryan Turner On 22 May 2012 14:35, Jeff King <peff@peff.net> wrote: > On Tue, May 22, 2012 at 12:32:21AM -0400, Jeff King wrote: > >> I'm not too surprised this is broken (in fact, I'm surprised it ever >> really worked). UTF-16, especially representing pure ascii characters, >> will have embedded NULs. Most of the code assumes that things like names >> and emails are NUL-terminated and ascii-compatible (so ascii, or some >> ascii-superset encoding like utf8, iso8859-1, etc). You can store a >> commit message (and name) in utf-16 if you tell git that you are doing >> so, but we should be re-encoding it before handling it. > > Actually, I take that back. I think storing in utf-16 would probably > fail. We need to use ascii to even read the headers to get to the > encoding header to realize that it is in utf-16. So I believe we really > do only support ascii-superset encodings. > > -Peff ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: git rev-list %an, %ae, %at bug in v1.7.10.1 and beyond 2012-05-22 5:13 ` Bryan Turner @ 2012-05-22 5:58 ` Jeff King 2012-05-22 6:13 ` Jeff King 0 siblings, 1 reply; 6+ messages in thread From: Jeff King @ 2012-05-22 5:58 UTC (permalink / raw) To: Bryan Turner; +Cc: git On Tue, May 22, 2012 at 03:13:09PM +1000, Bryan Turner wrote: > I went and spoke to the tester who made the original commit, and it > sounds like what he did was open .git/config with a text editor and > update the user.name directly. He then saved the file. His operating > system's code page was UTF-16LE, so I believe the file was saved in > that format. He then did the commit. That doesn't make sense. Git will not read a config file in utf16 (again, because of the NULs). Nor can you store NULs with "git config user.name ...". So I'm not sure how utf16 might have made it there. > That sort of behavior is pretty far outside the scope of anything git > can realistically be expected to handle (and I told the tester that as > well). That said, I'm interested that it "suddenly" broke, and also > that cat-file still appears to not think there's any problem. Yeah, I'm worried that it's indicative of a bug that might affect other cases. > I piped the output through xxd, for cat-file and rev-list. The > cat-file output is the same in both versions of git, and looks like > this: > auri:qa-resources.git bturner$ git cat-file -p > 5c1ccdec5f84aa149a4978f729fdda70769f942f | xxd > 0000000: 7472 6565 2064 6431 3733 6362 3730 6261 tree dd173cb70ba > 0000010: 6161 6330 3762 6466 3430 3566 3465 3364 aac07bdf405f4e3d > 0000020: 6231 3130 6537 6661 6664 3138 300a 7061 b110e7fafd180.pa > 0000030: 7265 6e74 2030 3263 3738 6263 3339 6163 rent 02c78bc39ac > 0000040: 3631 3932 3632 3362 6630 3830 6533 6532 6192623bf080e3e2 > 0000050: 6163 3839 3261 3466 3537 3634 630a 6175 ac892a4f5764c.au > 0000060: 7468 6f72 2061 203c 6661 726d 6173 4061 thor a <farmas@a > 0000070: 746c 6173 7369 616e 2e63 6f6d 3e20 3133 tlassian.com> 13 > 0000080: 3237 3837 3632 3232 202b 3131 3030 0a63 27876222 +1100.c > 0000090: 6f6d 6d69 7474 6572 2061 203c 6661 726d ommitter a <farm > 00000a0: 6173 4061 746c 6173 7369 616e 2e63 6f6d as@atlassian.com > 00000b0: 3e20 3133 3237 3837 3632 3232 202b 3131 > 1327876222 +11 > 00000c0: 3030 0a0a 636f 6d6d 6974 2077 6974 6820 00..commit with > 00000d0: 756e 6963 6f64 6520 6e61 6d65 0a unicode name. But I don't see any unicode at all here. The author and committer names are just ascii 'a'. However, that seems to be the real problem. If I create a commit with a single-letter name, I see some weirdness: $ git init $ echo content >foo && git add foo $ GIT_AUTHOR_NAME=a git commit -m msg Author: <> 1 file changed, 1 insertion(+) create mode 100644 foo Uh oh, that's odd. And worse: $ git log -1 --format='|%an <%ae>|' | <>| But in v1.7.10: $ git log -1 --format='|%an <%ae>|' |a <peff@peff.net>| So there is definitely a bug. The unicode thing is a red herring, and if there was any unicode at some point, git stripped it out when making the commit. The real regression seems to be in single-character names. I'll see if I can find the bug. -Peff ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: git rev-list %an, %ae, %at bug in v1.7.10.1 and beyond 2012-05-22 5:58 ` Jeff King @ 2012-05-22 6:13 ` Jeff King 0 siblings, 0 replies; 6+ messages in thread From: Jeff King @ 2012-05-22 6:13 UTC (permalink / raw) To: Bryan Turner; +Cc: git On Tue, May 22, 2012 at 01:58:11AM -0400, Jeff King wrote: > $ git init > $ echo content >foo && git add foo > $ GIT_AUTHOR_NAME=a git commit -m msg > Author: <> > 1 file changed, 1 insertion(+) > create mode 100644 foo > > Uh oh, that's odd. And worse: > > $ git log -1 --format='|%an <%ae>|' > | <>| > > So there is definitely a bug. The unicode thing is a red herring, and if > there was any unicode at some point, git stripped it out when making the > commit. The real regression seems to be in single-character names. > > I'll see if I can find the bug. Both are caused by an off-by-one error in split_ident_line. I just posted a patch. -Peff ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2012-05-22 6:13 UTC | newest] Thread overview: 6+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2012-05-21 8:01 git rev-list %an, %ae, %at bug in v1.7.10.1 and beyond Bryan Turner 2012-05-22 4:32 ` Jeff King 2012-05-22 4:35 ` Jeff King 2012-05-22 5:13 ` Bryan Turner 2012-05-22 5:58 ` Jeff King 2012-05-22 6:13 ` Jeff King
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).