* Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980 @ 2023-04-14 11:37 Thomas Bock 2023-04-15 8:52 ` Jeff King 0 siblings, 1 reply; 46+ messages in thread From: Thomas Bock @ 2023-04-14 11:37 UTC (permalink / raw) To: git Dear git development team, I've spotted a weird behavior of git in a specific repository, which I cannot explain, maybe there is a bug in git. Repository in which the problem occurs: git@github.com:LibreOffice/core.git Problem: When ordering the commits in this repository by time via "git log", there are several tens of commits which appear to be before 1980 although their author date and commit date are in 2009 or 2010 or 2011. Example to reproduce: git clone git@github.com:LibreOffice/core.git libreoffice cd libreoffice git log --no-merges --before="1980-01-01T00:00:00+0000" --format=%H,%ct,%ci,%ad All the resulting commits have an author date and commit date in 2009 or 2010 or 2011, though. This also appears when you order the commits by date: git log --no-merges --date-order --format=%H,%ct,%ci,%ad If you search for commit d3b03514dde317473db0d247f21405b5db6a727e in the resulting output, you can see that this commit is placed earlier in the list than the commits from 2008 although its author date and commit date are from 2011. And there are many more commits of this sort between 2009 and 2011 which are placed out of order, interpreted to be earlier than every other commit. I already searched for broken timestamps in these commits and printed the timestamp in various different formats, but I could not spot anything dubious there. So, why does git log think that these commits have an earlier timestamp than they actually have? Thanks for your efforts! [System Info] git version: git version 2.30.2 (but I have also tried it with git version 2.40.0 with no difference) cpu: x86_64 sizeof-long: 8 sizeof-size_t: 8 shell-path: /bin/sh uname: Linux 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 compiler info: gnuc: 10.2 libc info: glibc: 2.31 $SHELL (typically, interactive shell): /bin/bash Best, Thomas ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980 2023-04-14 11:37 Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980 Thomas Bock @ 2023-04-15 8:52 ` Jeff King 2023-04-15 8:59 ` Jeff King ` (2 more replies) 0 siblings, 3 replies; 46+ messages in thread From: Jeff King @ 2023-04-15 8:52 UTC (permalink / raw) To: Thomas Bock; +Cc: git On Fri, Apr 14, 2023 at 01:37:57PM +0200, Thomas Bock wrote: > Example to reproduce: > > git clone git@github.com:LibreOffice/core.git libreoffice > cd libreoffice > git log --no-merges --before="1980-01-01T00:00:00+0000" > --format=%H,%ct,%ci,%ad Thanks for providing a clear example and reproduction recipe. It's an interesting case. The commits _are_ malformed, but only slightly. For example: $ git cat-file commit d3b03514dde317473db0d247f21405b5db6a727e tree c7526375c71327a195714e9e325b66a9ad013d74 parent 61099481271709723469421181f65e6219cbc271 author Andre Fischer<andre.f.fischer <Andre Fischer<andre.f.fischer@oracle.com>> 1297247749 +0100 committer Andre Fischer<andre.f.fischer <Andre Fischer<andre.f.fischer@oracle.com>> 1297247749 +0100 There's an extra weird set of angle brackets in both the author and committer lines. But here's the really quirky part: there actually two parsers being used in Git here. One is in parse_commit_buffer(), which does the minimum to fill in the fields of a "struct commit": parents, tree, and committer timestamp. That parser calls parse_commit_date(), which finds the first ">" and then tries to parse "> 1297247749 +0100" as a timestamp, which fails. So it uses the sentinel value "0" (aka 1970-01-01). And that's what gets used for "--before" (and "--since", and things like queue order for showing commits). But then when we actually _display_ the commit, we have a whole pretty-printing system. There we usually end up in split_ident_line(), which is a bit more forgiving, due to 03818a4a94 (split_ident: parse timestamp from end of line, 2013-10-14). It finds the trailing ">" from the right-hand side, which in this case yields the correct timestamp. We could probably use the same trick in parse_commit_date(), something like: diff --git a/commit.c b/commit.c index 6d844da9a6..72d21a2bb8 100644 --- a/commit.c +++ b/commit.c @@ -95,6 +95,7 @@ struct commit *lookup_commit_reference_by_name(const char *name) static timestamp_t parse_commit_date(const char *buf, const char *tail) { const char *dateptr; + const char *eol; if (buf + 6 >= tail) return 0; @@ -106,16 +107,22 @@ static timestamp_t parse_commit_date(const char *buf, const char *tail) return 0; if (memcmp(buf, "committer", 9)) return 0; - while (buf < tail && *buf++ != '>') + + /* + * parse to end-of-line and then walk backwards, which + * handles some malformed cases. Alternatively, once + * we have eol, we could just call split_ident_line() + */ + eol = buf; + while (eol < tail && *eol++ != '\n') /* nada */; - if (buf >= tail) + if (eol >= tail) return 0; - dateptr = buf; - while (buf < tail && *buf++ != '\n') + for (dateptr = eol; dateptr > buf && dateptr[-1] != '>'; dateptr--) /* nada */; - if (buf >= tail) + if (dateptr == buf || dateptr == eol) return 0; - /* dateptr < buf && buf[-1] == '\n', so parsing will stop at buf-1 */ + /* dateptr < eol && *eol == '\n', so parsing will stop at eol */ return parse_timestamp(dateptr, NULL, 10); } which makes your case behave as expected. It may make sense to use split_ident_line(), which I think has a few other rules (it actually checks for something that looks like numbers, for instance). There may be also cases where the two diverge. Obviously having two parsers isn't ideal. I think sharing the code may involve a lot of work, though. The pretty-print parser is interested in pulling out more information, and is less focused on performance. Parsing commits is traditionally a hot path, as we historically had to parse every commit, even if we weren't showing it (including non-log operations like computing merge bases, reachability, and so forth). But that may not matter so much. One, we already inflate the whole commit object, not just the header. So even if we spend a few extra instructions on parsing, it may not be noticeable. And two, these days we often cache commit metadata in the commit-graph files, which avoids loading the commit message entirely (and thus this parsing) for most operations. Of course it may also be reasonable to consider this mystery solved and leave the code as-is. These objects _are_ malformed. I note they're all at least a decade old, and most of them are from a handful of committers (who presumably had misconfigured idents for a while). These days I think Git wouldn't allow them (I don't think it rejects the commit, but it does screen out the syntactically bogus characters). -Peff ^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980 2023-04-15 8:52 ` Jeff King @ 2023-04-15 8:59 ` Jeff King 2023-04-15 14:10 ` Kristoffer Haugsbakk 2023-04-17 9:51 ` Junio C Hamano 2 siblings, 0 replies; 46+ messages in thread From: Jeff King @ 2023-04-15 8:59 UTC (permalink / raw) To: Thomas Bock; +Cc: git On Sat, Apr 15, 2023 at 04:52:07AM -0400, Jeff King wrote: > There may be also cases where the two diverge. Obviously having two Sorry, my ability to type is going rapidly downhill this evening. I meant to say "there may be other cases where the two diverge". I.e., the best way to be sure they behave the same is to make them share code. -Peff ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980 2023-04-15 8:52 ` Jeff King 2023-04-15 8:59 ` Jeff King @ 2023-04-15 14:10 ` Kristoffer Haugsbakk 2023-04-17 5:40 ` Jeff King 2023-04-17 9:51 ` Junio C Hamano 2 siblings, 1 reply; 46+ messages in thread From: Kristoffer Haugsbakk @ 2023-04-15 14:10 UTC (permalink / raw) To: Jeff King; +Cc: git On Sat, Apr 15, 2023, at 10:52, Jeff King wrote: > Of course it may also be reasonable to consider this mystery solved and > leave the code as-is. These objects _are_ malformed. I started to run `git-repair`[1] on the repository yesterday. I’m not sure yet if it will finish.[2] [1]: https://git-repair.branchable.com/ [2]: https://git-repair.branchable.com/index/discussion/ -- Kristoffer Haugsbakk ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980 2023-04-15 14:10 ` Kristoffer Haugsbakk @ 2023-04-17 5:40 ` Jeff King 2023-04-17 6:20 ` Kristoffer Haugsbakk 0 siblings, 1 reply; 46+ messages in thread From: Jeff King @ 2023-04-17 5:40 UTC (permalink / raw) To: Kristoffer Haugsbakk; +Cc: git On Sat, Apr 15, 2023 at 04:10:32PM +0200, Kristoffer Haugsbakk wrote: > On Sat, Apr 15, 2023, at 10:52, Jeff King wrote: > > Of course it may also be reasonable to consider this mystery solved and > > leave the code as-is. These objects _are_ malformed. > > I started to run `git-repair`[1] on the repository yesterday. I’m not > sure yet if it will finish.[2] I don't know if git-repair will fix syntactic problems in commits like this, or if it's just looking for broken links between objects. But regardless, it would be possible to fix these using filter-repo or other tools. The problem with repairing them is that it rewrites history, changing the object id of every commit which comes after. So you'd want to do it once centrally for the project, and declare a flag day that everybody is moving over to the new, fixed history. It's probably not worth it for a minor problem like this. -Peff ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980 2023-04-17 5:40 ` Jeff King @ 2023-04-17 6:20 ` Kristoffer Haugsbakk 2023-04-17 7:41 ` Jeff King 0 siblings, 1 reply; 46+ messages in thread From: Kristoffer Haugsbakk @ 2023-04-17 6:20 UTC (permalink / raw) To: Jeff King; +Cc: git On Mon, Apr 17, 2023, at 07:40, Jeff King wrote: > I don't know if git-repair will fix syntactic problems in commits like > this, or if it's just looking for broken links between objects. But > regardless, it would be possible to fix these using filter-repo or other > tools. Makes sense, thanks. I gave up since the program never seemed to finish. > The problem with repairing them is that it rewrites history, changing > the object id of every commit which comes after. So you'd want to do it > once centrally for the project, and declare a flag day that everybody is > moving over to the new, fixed history. I wasn’t interested in fixing the repo for future development. Just to see if that weird git-log(1) behavior went away. It’s like the inverse of a minimal example to reproduce a bug; repair the corrupt object so that future askers can be convinced that it’s not a bug. ;) I’m not affiliated with the project. -- Kristoffer Haugsbakk ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980 2023-04-17 6:20 ` Kristoffer Haugsbakk @ 2023-04-17 7:41 ` Jeff King 2023-04-27 22:32 ` Kristoffer Haugsbakk 0 siblings, 1 reply; 46+ messages in thread From: Jeff King @ 2023-04-17 7:41 UTC (permalink / raw) To: Kristoffer Haugsbakk; +Cc: git On Mon, Apr 17, 2023 at 08:20:30AM +0200, Kristoffer Haugsbakk wrote: > > The problem with repairing them is that it rewrites history, changing > > the object id of every commit which comes after. So you'd want to do it > > once centrally for the project, and declare a flag day that everybody is > > moving over to the new, fixed history. > > I wasn’t interested in fixing the repo for future development. Just to > see if that weird git-log(1) behavior went away. It’s like the inverse > of a minimal example to reproduce a bug; repair the corrupt object so > that future askers can be convinced that it’s not a bug. ;) Ah, OK. Carry on, then. ;) I'm 99% sure it would indeed go away, since the patch I showed earlier (to make the parsing more lenient) worked. But if you want to try, I suspect you could do it with git-filter-repo's --commit-callback option. Or if you don't mind being a little loose with the parsing, probably just piping fast-export through sed, and then back to fast-import. -Peff ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980 2023-04-17 7:41 ` Jeff King @ 2023-04-27 22:32 ` Kristoffer Haugsbakk 0 siblings, 0 replies; 46+ messages in thread From: Kristoffer Haugsbakk @ 2023-04-27 22:32 UTC (permalink / raw) To: Jeff King; +Cc: git, Thomas Bock On Mon, Apr 17, 2023, at 09:41, Jeff King wrote: > Ah, OK. Carry on, then. ;) The result (see readme): https://github.com/LemmingAvalanche/libreoffice-repaired The problem went away. > But if you want to try, I suspect you could do it with > git-filter-repo's --commit-callback option. I couldn’t use that. The metadata was too mangled and `git-filter-repo` just errored out. :P But `git commit-tree` + `git replace` + `git filter-repo --force` worked. -- Kristoffer Haugsbakk ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980 2023-04-15 8:52 ` Jeff King 2023-04-15 8:59 ` Jeff King 2023-04-15 14:10 ` Kristoffer Haugsbakk @ 2023-04-17 9:51 ` Junio C Hamano 2023-04-18 4:12 ` Jeff King 2 siblings, 1 reply; 46+ messages in thread From: Junio C Hamano @ 2023-04-17 9:51 UTC (permalink / raw) To: Jeff King; +Cc: Thomas Bock, git Jeff King <peff@peff.net> writes: > There may be also cases where the two diverge. Obviously having two > parsers isn't ideal. I think sharing the code may involve a lot of work, > though. The pretty-print parser is interested in pulling out more > information, and is less focused on performance. Parsing commits is > traditionally a hot path, as we historically had to parse every commit, > even if we weren't showing it (including non-log operations like > computing merge bases, reachability, and so forth). > > But that may not matter so much. One, we already inflate the whole > commit object, not just the header. So even if we spend a few extra > instructions on parsing, it may not be noticeable. And two, these days > we often cache commit metadata in the commit-graph files, which avoids > loading the commit message entirely (and thus this parsing) for most > operations. Makes readers wonder which parser is used to parse commit objects in order to populate the commit-graph files. If that step screws up, we'd record a broken timestamp there X-<. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980 2023-04-17 9:51 ` Junio C Hamano @ 2023-04-18 4:12 ` Jeff King 2023-04-18 14:02 ` Derrick Stolee 0 siblings, 1 reply; 46+ messages in thread From: Jeff King @ 2023-04-18 4:12 UTC (permalink / raw) To: Junio C Hamano; +Cc: Thomas Bock, git On Mon, Apr 17, 2023 at 02:51:42AM -0700, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > There may be also cases where the two diverge. Obviously having two > > parsers isn't ideal. I think sharing the code may involve a lot of work, > > though. The pretty-print parser is interested in pulling out more > > information, and is less focused on performance. Parsing commits is > > traditionally a hot path, as we historically had to parse every commit, > > even if we weren't showing it (including non-log operations like > > computing merge bases, reachability, and so forth). > > > > But that may not matter so much. One, we already inflate the whole > > commit object, not just the header. So even if we spend a few extra > > instructions on parsing, it may not be noticeable. And two, these days > > we often cache commit metadata in the commit-graph files, which avoids > > loading the commit message entirely (and thus this parsing) for most > > operations. > > Makes readers wonder which parser is used to parse commit objects in > order to populate the commit-graph files. If that step screws up, > we'd record a broken timestamp there X-<. It should be be the parse_commit() one, since the commit-graph writing code works off of parsed "struct commit" objects to find its data. Which is good, since we may silently optimize out a parse in favor of the commit-graph; we'd want the two to always match. So recording a broken timestamp there is OK (but see below). Where it gets trickier is if we then fix the parser to handle this case. Now the commit-graph stores a broken value, and further writes are clever enough to say "ah, I already have that commit in the graph; no need to recompute its values". So once you start using a version of Git with the more lenient parser, you'd have to manually blow away any graph files and rebuild to see the benefit. One thing the commit graph perhaps _could_ do is omit the commit, or mark it as "this one is broken in some way". And then fall back to parsing those few instead (which is slower, but if it's a small minority of commits, that's OK). But I don't think there's any code for that. And it's kind of tricky anyway, because the parser just sets the date to "0" with no indication that there was any potential error. So it's probably solvable, but perhaps not worth the effort. The current rule is just "whatever we got from parsing is what the commit-graph will return", which works well in practice unless the parser changes. -Peff ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980 2023-04-18 4:12 ` Jeff King @ 2023-04-18 14:02 ` Derrick Stolee 2023-04-21 14:51 ` Thomas Bock 2023-04-22 13:52 ` Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980 Jeff King 0 siblings, 2 replies; 46+ messages in thread From: Derrick Stolee @ 2023-04-18 14:02 UTC (permalink / raw) To: Jeff King, Junio C Hamano; +Cc: Thomas Bock, git On 4/18/2023 12:12 AM, Jeff King wrote: > One thing the commit graph perhaps _could_ do is omit the commit, or > mark it as "this one is broken in some way". And then fall back to > parsing those few instead (which is slower, but if it's a small minority > of commits, that's OK). But I don't think there's any code for that. The "broken" commit would need to be included in the commit-graph file so its children can point to it using a graph position, but then it would revert to parsing from the commit object (due to some new concept storing "this is a bad commit"). If we decided to treat a timestamp of 0 as "probably broken, artificial at best" then we wouldn't need the new indicator in the commit-graph file, but this seems like quite a big hammer for a small case. Thanks, -Stolee ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980 2023-04-18 14:02 ` Derrick Stolee @ 2023-04-21 14:51 ` Thomas Bock 2023-04-22 13:41 ` [PATCH 0/3] fixing some parse_commit() timestamp corner cases Jeff King 2023-04-22 13:52 ` Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980 Jeff King 1 sibling, 1 reply; 46+ messages in thread From: Thomas Bock @ 2023-04-21 14:51 UTC (permalink / raw) To: Derrick Stolee, Jeff King, Junio C Hamano; +Cc: git [-- Attachment #1.1: Type: text/plain, Size: 2111 bytes --] Thanks @ all for clarifying this mystery, this already helped a lot to figure out what's going on there. Even though the affected commit objects are malformed, it would be very helpful if this problem could be solved somehow, from a user perspective. Such malformed objects can potentially occur also in other comparably old projects, where searching for commits that have been made in a specific time window in the past could be useful or even necessary in some cases. On the other hand, if it would be a huge effort to solve the problem (which I cannot assess), I could fully understand that it might not be worth to completely fix this specific problem that occurs only in few cases in the past (if you are sure that such malformed objects cannot occur any more today). If you decide to not spend the effort on fixing this problem, however, I would appreciate if you could, at least, partially fix this problem by not treating the broken commit objects to be before any other commit object (timestamp 0), but introduce some kind of error handling for such commits that omits listing them in the wrong time period. Thanks again for all the explanations and thoughts! Best, Thomas Am 18.04.23 um 16:02 schrieb Derrick Stolee: > On 4/18/2023 12:12 AM, Jeff King wrote: > >> One thing the commit graph perhaps _could_ do is omit the commit, or >> mark it as "this one is broken in some way". And then fall back to >> parsing those few instead (which is slower, but if it's a small minority >> of commits, that's OK). But I don't think there's any code for that. > The "broken" commit would need to be included in the commit-graph file > so its children can point to it using a graph position, but then it > would revert to parsing from the commit object (due to some new concept > storing "this is a bad commit"). > > If we decided to treat a timestamp of 0 as "probably broken, artificial > at best" then we wouldn't need the new indicator in the commit-graph > file, but this seems like quite a big hammer for a small case. > > Thanks, > -Stolee [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 495 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
* [PATCH 0/3] fixing some parse_commit() timestamp corner cases 2023-04-21 14:51 ` Thomas Bock @ 2023-04-22 13:41 ` Jeff King 2023-04-22 13:42 ` [PATCH 1/3] t4212: avoid putting git on left-hand side of pipe Jeff King ` (3 more replies) 0 siblings, 4 replies; 46+ messages in thread From: Jeff King @ 2023-04-22 13:41 UTC (permalink / raw) To: Thomas Bock; +Cc: Derrick Stolee, Junio C Hamano, git On Fri, Apr 21, 2023 at 04:51:03PM +0200, Thomas Bock wrote: > Even though the affected commit objects are malformed, it would be very > helpful if this problem could be solved somehow, from a user perspective. > Such malformed objects can potentially occur also in other comparably old > projects, where searching for commits that have been made in a specific time > window in the past could be useful or even necessary in some cases. Yeah, after sleeping on it for a bit, I think it is worth fixing. I also found another parsing bug in the same function. ;) So here's the result. [1/3]: t4212: avoid putting git on left-hand side of pipe [2/3]: parse_commit(): parse timestamp from end of line [3/3]: parse_commit(): handle broken whitespace-only timestamp commit.c | 29 +++++++++++++++++++++------- t/t4212-log-corrupt.sh | 44 ++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 64 insertions(+), 9 deletions(-) -Peff ^ permalink raw reply [flat|nested] 46+ messages in thread
* [PATCH 1/3] t4212: avoid putting git on left-hand side of pipe 2023-04-22 13:41 ` [PATCH 0/3] fixing some parse_commit() timestamp corner cases Jeff King @ 2023-04-22 13:42 ` Jeff King 2023-04-22 13:47 ` [PATCH 2/3] parse_commit(): parse timestamp from end of line Jeff King ` (2 subsequent siblings) 3 siblings, 0 replies; 46+ messages in thread From: Jeff King @ 2023-04-22 13:42 UTC (permalink / raw) To: Thomas Bock; +Cc: Derrick Stolee, Junio C Hamano, git We wouldn't expect cat-file to fail here, but it's good practice to avoid putting git on the upstream of a pipe, as we otherwise ignore its exit code. Signed-off-by: Jeff King <peff@peff.net> --- t/t4212-log-corrupt.sh | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/t/t4212-log-corrupt.sh b/t/t4212-log-corrupt.sh index e89e1f54b6..8b5433ea74 100755 --- a/t/t4212-log-corrupt.sh +++ b/t/t4212-log-corrupt.sh @@ -8,8 +8,9 @@ TEST_PASSES_SANITIZE_LEAK=true test_expect_success 'setup' ' test_commit foo && - git cat-file commit HEAD | - sed "/^author /s/>/>-<>/" >broken_email.commit && + git cat-file commit HEAD >ok.commit && + sed "/^author /s/>/>-<>/" <ok.commit >broken_email.commit && + git hash-object --literally -w -t commit broken_email.commit >broken_email.hash && git update-ref refs/heads/broken_email $(cat broken_email.hash) ' -- 2.40.0.653.g15ca972062 ^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH 2/3] parse_commit(): parse timestamp from end of line 2023-04-22 13:41 ` [PATCH 0/3] fixing some parse_commit() timestamp corner cases Jeff King 2023-04-22 13:42 ` [PATCH 1/3] t4212: avoid putting git on left-hand side of pipe Jeff King @ 2023-04-22 13:47 ` Jeff King 2023-04-24 17:05 ` Junio C Hamano 2023-04-24 16:39 ` [PATCH 0/3] fixing some parse_commit() timestamp corner cases Junio C Hamano 2023-04-25 5:52 ` [PATCH v2 " Jeff King 3 siblings, 1 reply; 46+ messages in thread From: Jeff King @ 2023-04-22 13:47 UTC (permalink / raw) To: Thomas Bock; +Cc: Derrick Stolee, Junio C Hamano, git To find the committer timestamp, we parse left-to-right looking for the closing ">" of the email, and then expect the timestamp right after that. But we've seen some broken cases in the wild where this fails, but we _could_ find the timestamp with a little extra work. E.g.: Name <Name<email>> 123456789 -0500 This means that features that rely on the committer timestamp, like --since or --until, will treat the commit as happening at time 0 (i.e., 1970). This is doubly confusing because the pretty-print parser learned to handle these in 03818a4a94 (split_ident: parse timestamp from end of line, 2013-10-14). So printing them via "git show", etc, makes everything look normal, but --until, etc are still broken (despite the fact that that commit explicitly mentioned --until!). So let's use the same trick as 03818a4a94: find the end of the line, and parse back to the final ">". In theory we could use split_ident_line() here, but it's actually a bit more strict. In particular, it requires a valid time-zone token, too. That should be present, of course, but we wouldn't want to break --until for malformed cases that are working currently. We might want to teach split_ident_line() to become more lenient there, but it would require checking its many callers (since right now they can assume that if date_start is non-NULL, so is tz_start). So for now we'll just reimplement the same trick in the commit parser. The test is in t4212, which already covers similar cases, courtesy of 03818a4a94. We'll just adjust the broken commit to munge both the author and committer timestamps. Note that we could match (author|committer) here, but alternation can't be used portably in sed. Since we wouldn't expect to see ">" except as part of an ident line, we can just match that character on any line. Signed-off-by: Jeff King <peff@peff.net> --- This is more or less the same as what I posted earlier, but using memchr() where appropriate (we could use memrchr(), too, but I don't think it's portable enough). Note that both before and after my series, there are cases where parse_commit() is more lenient than split_ident_line(), because of the time-zone thing. For example: committer name <email> 1234567890\n will show as 1970, but --until, etc, will work as expected (so the opposite of the case that started this thread). So I do think making split_ident_line() more lenient may be worth doing, but I punted on that for now. This series takes us in a strictly better direction with respect to visible behavior, even if we might be able to clean up the internals later by reusing code. commit.c | 19 ++++++++++++------- t/t4212-log-corrupt.sh | 7 ++++++- 2 files changed, 18 insertions(+), 8 deletions(-) diff --git a/commit.c b/commit.c index 6d844da9a6..ede810ac1c 100644 --- a/commit.c +++ b/commit.c @@ -95,6 +95,7 @@ struct commit *lookup_commit_reference_by_name(const char *name) static timestamp_t parse_commit_date(const char *buf, const char *tail) { const char *dateptr; + const char *eol; if (buf + 6 >= tail) return 0; @@ -106,16 +107,20 @@ static timestamp_t parse_commit_date(const char *buf, const char *tail) return 0; if (memcmp(buf, "committer", 9)) return 0; - while (buf < tail && *buf++ != '>') - /* nada */; - if (buf >= tail) + + /* + * parse to end-of-line and then walk backwards, which + * handles some malformed cases. + */ + eol = memchr(buf, '\n', tail - buf); + if (!eol) return 0; - dateptr = buf; - while (buf < tail && *buf++ != '\n') + for (dateptr = eol; dateptr > buf && dateptr[-1] != '>'; dateptr--) /* nada */; - if (buf >= tail) + if (dateptr == buf || dateptr == eol) return 0; - /* dateptr < buf && buf[-1] == '\n', so parsing will stop at buf-1 */ + + /* dateptr < eol && *eol == '\n', so parsing will stop at eol */ return parse_timestamp(dateptr, NULL, 10); } diff --git a/t/t4212-log-corrupt.sh b/t/t4212-log-corrupt.sh index 8b5433ea74..af4b35ff56 100755 --- a/t/t4212-log-corrupt.sh +++ b/t/t4212-log-corrupt.sh @@ -9,7 +9,7 @@ test_expect_success 'setup' ' test_commit foo && git cat-file commit HEAD >ok.commit && - sed "/^author /s/>/>-<>/" <ok.commit >broken_email.commit && + sed "s/>/>-<>/" <ok.commit >broken_email.commit && git hash-object --literally -w -t commit broken_email.commit >broken_email.hash && git update-ref refs/heads/broken_email $(cat broken_email.hash) @@ -44,6 +44,11 @@ test_expect_success 'git log --format with broken author email' ' test_must_be_empty actual.err ' +test_expect_success '--until handles broken email' ' + git rev-list --until=1980-01-01 broken_email >actual && + test_must_be_empty actual +' + munge_author_date () { git cat-file commit "$1" >commit.orig && sed "s/^\(author .*>\) [0-9]*/\1 $2/" <commit.orig >commit.munge && -- 2.40.0.653.g15ca972062 ^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: [PATCH 2/3] parse_commit(): parse timestamp from end of line 2023-04-22 13:47 ` [PATCH 2/3] parse_commit(): parse timestamp from end of line Jeff King @ 2023-04-24 17:05 ` Junio C Hamano 2023-04-25 5:23 ` Jeff King 0 siblings, 1 reply; 46+ messages in thread From: Junio C Hamano @ 2023-04-24 17:05 UTC (permalink / raw) To: Jeff King; +Cc: Thomas Bock, Derrick Stolee, git Jeff King <peff@peff.net> writes: > To find the committer timestamp, we parse left-to-right looking for the > closing ">" of the email, and then expect the timestamp right after > that. But we've seen some broken cases in the wild where this fails, but > we _could_ find the timestamp with a little extra work. E.g.: > > Name <Name<email>> 123456789 -0500 > ... > So let's use the same trick as 03818a4a94: find the end of the line, and > parse back to the final ">". This obviously assumes that even in a broken ident, it is very likely that the second component (where <e-mail> usually sits) ends with a '>'. Given that we enclose whatever garbage the end user gave us as their e-mail inside a pair of <> ourselves, it is a very sensible assumption, I would think. The original parser assumed that the end user would not have '>' inside their e-mail part of the ident, which turns out to be more problematic than the alternative being proposed. It is doubly good that we already parse from the end elsewhere. Nice. > + /* > + * parse to end-of-line and then walk backwards, which > + * handles some malformed cases. > + */ I would say "parse to" -> "jump to", but technically moving forward looking for a LF byte is still "parsing". "some" malformed cases being "most plausible" ones (due to how ident.c::fmt_ident() is what writes '>' after the string end-user gave as e-mail) may be worth mentioning. > + eol = memchr(buf, '\n', tail - buf); > + if (!eol) > return 0; OK. > - dateptr = buf; > - while (buf < tail && *buf++ != '\n') > + for (dateptr = eol; dateptr > buf && dateptr[-1] != '>'; dateptr--) > /* nada */; OK. Just a style thing, but I found that "; /* nada */" is easier to spot that there is an empty statement there. > - if (buf >= tail) > + if (dateptr == buf || dateptr == eol) > return 0; Curious when dateptr that wanted to scan back from eol is still at eol after the loop. It is when the ident line ends with ">" without any timestamp/tz info. And the reason why we need to check that here is ... > - /* dateptr < buf && buf[-1] == '\n', so parsing will stop at buf-1 */ > + > + /* dateptr < eol && *eol == '\n', so parsing will stop at eol */ > return parse_timestamp(dateptr, NULL, 10); ... because parse_timestamp() is merely strtoumax() and would happily skip over arbitrary number of leading "whitespace" without stopping if (dateptr == eol && *eol == '\n'). OK, sad but correct. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 2/3] parse_commit(): parse timestamp from end of line 2023-04-24 17:05 ` Junio C Hamano @ 2023-04-25 5:23 ` Jeff King 0 siblings, 0 replies; 46+ messages in thread From: Jeff King @ 2023-04-25 5:23 UTC (permalink / raw) To: Junio C Hamano; +Cc: Thomas Bock, Derrick Stolee, git On Mon, Apr 24, 2023 at 10:05:42AM -0700, Junio C Hamano wrote: > > + /* > > + * parse to end-of-line and then walk backwards, which > > + * handles some malformed cases. > > + */ > > I would say "parse to" -> "jump to", but technically moving forward > looking for a LF byte is still "parsing". "some" malformed cases > being "most plausible" ones (due to how ident.c::fmt_ident() is what > writes '>' after the string end-user gave as e-mail) may be worth > mentioning. I'll expand this to: /* * Jump to end-of-line so that we can walk backwards to find the * end-of-email ">". This is more forgiving of malformed cases * because unexpected characters tend to be in the name and email * fields. */ > > - dateptr = buf; > > - while (buf < tail && *buf++ != '\n') > > + for (dateptr = eol; dateptr > buf && dateptr[-1] != '>'; dateptr--) > > /* nada */; > > OK. Just a style thing, but I found that "; /* nada */" is easier > to spot that there is an empty statement there. I found it ugly, too, but it's from earlier. Since the point is to advance "dateptr", it may be better to turn it into a while loop anyway, which side-steps the empty statement altogether: dateptr = eol; while (dateptr > buf && dateptr[-1] != '>') dateptr--; > > - if (buf >= tail) > > + if (dateptr == buf || dateptr == eol) > > return 0; > > Curious when dateptr that wanted to scan back from eol is still at > eol after the loop. It is when the ident line ends with ">" without > any timestamp/tz info. And the reason why we need to check that > here is ... Yeah, though as you saw in the next patch, it is not really sufficient. :) I think it may be redundant, though. In the original we already bailed earlier if we didn't find a ">" (because "buf >= tail" after we advanced it looking for ">"). Here "dateptr == buf" is checking the same thing (because we walked backwards). In the original we'd bail if we failed to find a newline as part of this loop (because "buf >= tail" after looking for a newline). But that can't happen here; we've already bailed after memchr() failed to find a newline). So "dateptr == eol" only triggers if there were no characters in "buf" to look at. So I added it only to keep this comment trivially true: > > - /* dateptr < buf && buf[-1] == '\n', so parsing will stop at buf-1 */ > > + > > + /* dateptr < eol && *eol == '\n', so parsing will stop at eol */ > > return parse_timestamp(dateptr, NULL, 10); > > ... because parse_timestamp() is merely strtoumax() and would > happily skip over arbitrary number of leading "whitespace" without > stopping if (dateptr == eol && *eol == '\n'). OK, sad but correct. But it could also read "dateptr <= eol" and still be true (which is to say it is mostly accurate, but not quite because of the "soaking up whitespace" problem fixed by the next patch. I'll leave the extra condition in this patch, since it's orthogonal to what this patch is fixing. But in the next one I'll remove it and expand the comment to explain a bit more. -Peff ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 0/3] fixing some parse_commit() timestamp corner cases 2023-04-22 13:41 ` [PATCH 0/3] fixing some parse_commit() timestamp corner cases Jeff King 2023-04-22 13:42 ` [PATCH 1/3] t4212: avoid putting git on left-hand side of pipe Jeff King 2023-04-22 13:47 ` [PATCH 2/3] parse_commit(): parse timestamp from end of line Jeff King @ 2023-04-24 16:39 ` Junio C Hamano 2023-04-25 5:52 ` [PATCH v2 " Jeff King 3 siblings, 0 replies; 46+ messages in thread From: Junio C Hamano @ 2023-04-24 16:39 UTC (permalink / raw) To: Jeff King; +Cc: Thomas Bock, Derrick Stolee, git Jeff King <peff@peff.net> writes: > On Fri, Apr 21, 2023 at 04:51:03PM +0200, Thomas Bock wrote: > >> Even though the affected commit objects are malformed, it would be very >> helpful if this problem could be solved somehow, from a user perspective. >> Such malformed objects can potentially occur also in other comparably old >> projects, where searching for commits that have been made in a specific time >> window in the past could be useful or even necessary in some cases. > > Yeah, after sleeping on it for a bit, I think it is worth fixing. I also > found another parsing bug in the same function. ;) Nice. Thanks. > > So here's the result. > > [1/3]: t4212: avoid putting git on left-hand side of pipe > [2/3]: parse_commit(): parse timestamp from end of line > [3/3]: parse_commit(): handle broken whitespace-only timestamp > > commit.c | 29 +++++++++++++++++++++------- > t/t4212-log-corrupt.sh | 44 ++++++++++++++++++++++++++++++++++++++++-- > 2 files changed, 64 insertions(+), 9 deletions(-) > > -Peff ^ permalink raw reply [flat|nested] 46+ messages in thread
* [PATCH v2 0/3] fixing some parse_commit() timestamp corner cases 2023-04-22 13:41 ` [PATCH 0/3] fixing some parse_commit() timestamp corner cases Jeff King ` (2 preceding siblings ...) 2023-04-24 16:39 ` [PATCH 0/3] fixing some parse_commit() timestamp corner cases Junio C Hamano @ 2023-04-25 5:52 ` Jeff King 2023-04-25 5:54 ` Jeff King ` (4 more replies) 3 siblings, 5 replies; 46+ messages in thread From: Jeff King @ 2023-04-25 5:52 UTC (permalink / raw) To: Thomas Bock; +Cc: Derrick Stolee, Junio C Hamano, git Here's a v2 of my series. The behavior should be identical, but I've incorporated some comment and small code tweaks based on feedback from the first round. I also added a fourth patch which adds a new comment explaining some of the cases that were alluded to in the earlier round's patch 3. [1/4]: t4212: avoid putting git on left-hand side of pipe [2/4]: parse_commit(): parse timestamp from end of line [3/4]: parse_commit(): handle broken whitespace-only timestamp [4/4]: parse_commit(): describe more date-parsing failure modes commit.c | 47 +++++++++++++++++++++++++++++++++++------- t/t4212-log-corrupt.sh | 39 +++++++++++++++++++++++++++++++++-- 2 files changed, 76 insertions(+), 10 deletions(-) -Peff ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v2 0/3] fixing some parse_commit() timestamp corner cases 2023-04-25 5:52 ` [PATCH v2 " Jeff King @ 2023-04-25 5:54 ` Jeff King 2023-04-25 5:54 ` [PATCH v2 1/4] t4212: avoid putting git on left-hand side of pipe Jeff King ` (3 subsequent siblings) 4 siblings, 0 replies; 46+ messages in thread From: Jeff King @ 2023-04-25 5:54 UTC (permalink / raw) To: Thomas Bock; +Cc: Derrick Stolee, Junio C Hamano, git On Tue, Apr 25, 2023 at 01:52:45AM -0400, Jeff King wrote: > Here's a v2 of my series. The behavior should be identical, but I've > incorporated some comment and small code tweaks based on feedback from > the first round. > > I also added a fourth patch which adds a new comment explaining some of > the cases that were alluded to in the earlier round's patch 3. > > [1/4]: t4212: avoid putting git on left-hand side of pipe > [2/4]: parse_commit(): parse timestamp from end of line > [3/4]: parse_commit(): handle broken whitespace-only timestamp > [4/4]: parse_commit(): describe more date-parsing failure modes > > commit.c | 47 +++++++++++++++++++++++++++++++++++------- > t/t4212-log-corrupt.sh | 39 +++++++++++++++++++++++++++++++++-- > 2 files changed, 76 insertions(+), 10 deletions(-) Whoops, forgot my range-diff (though nothing should be too surprising based on the round 1 discussion): 1: 07932cf666 = 1: ac38ce133d t4212: avoid putting git on left-hand side of pipe 2: 7ee34c7d5f ! 2: f59e61262d parse_commit(): parse timestamp from end of line @@ Commit message parse back to the final ">". In theory we could use split_ident_line() here, but it's actually a bit more strict. In particular, it requires a valid time-zone token, too. That should be present, of course, but we - wouldn't want to break --until for malformed cases that are working - currently. + wouldn't want to break --until for cases that are working currently. We might want to teach split_ident_line() to become more lenient there, but it would require checking its many callers (since right now they can @@ commit.c: static timestamp_t parse_commit_date(const char *buf, const char *tail - if (buf >= tail) + + /* -+ * parse to end-of-line and then walk backwards, which -+ * handles some malformed cases. ++ * Jump to end-of-line so that we can walk backwards to find the ++ * end-of-email ">". This is more forgiving of malformed cases ++ * because unexpected characters tend to be in the name and email ++ * fields. + */ + eol = memchr(buf, '\n', tail - buf); + if (!eol) return 0; - dateptr = buf; - while (buf < tail && *buf++ != '\n') -+ for (dateptr = eol; dateptr > buf && dateptr[-1] != '>'; dateptr--) - /* nada */; +- /* nada */; - if (buf >= tail) ++ dateptr = eol; ++ while (dateptr > buf && dateptr[-1] != '>') ++ dateptr--; + if (dateptr == buf || dateptr == eol) return 0; - /* dateptr < buf && buf[-1] == '\n', so parsing will stop at buf-1 */ 3: e8e94083f5 ! 3: c62fc59bf1 parse_commit(): handle broken whitespace-only timestamp @@ Commit message It's not subject to the same bug, because it insists that there be one or more digits in the timestamp. - We can use the same logic here. If there's a non-whitespace but - non-digit value (say "committer name <email> foo"), then - parse_timestamp() would already have returned 0 anyway. So the only - change should be for this "whitespace only" case. - Signed-off-by: Jeff King <peff@peff.net> ## commit.c ## @@ commit.c: static timestamp_t parse_commit_date(const char *buf, const char *tail) - if (dateptr == buf || dateptr == eol) + dateptr = eol; + while (dateptr > buf && dateptr[-1] != '>') + dateptr--; +- if (dateptr == buf || dateptr == eol) ++ if (dateptr == buf) return 0; +- /* dateptr < eol && *eol == '\n', so parsing will stop at eol */ + /* -+ * trim leading whitespace; parse_timestamp() will do this itself, but -+ * it will walk past the newline at eol while doing so. So we insist -+ * that there is at least one digit here. ++ * Trim leading whitespace; parse_timestamp() will do this itself, but ++ * if we have _only_ whitespace, it will walk right past the newline ++ * while doing so. + */ + while (dateptr < eol && isspace(*dateptr)) + dateptr++; -+ if (!strchr("0123456789", *dateptr)) ++ if (dateptr == eol) + return 0; + - /* dateptr < eol && *eol == '\n', so parsing will stop at eol */ ++ /* ++ * We know there is at least one non-whitespace character, so we'll ++ * begin parsing there and stop at worst case at eol. ++ */ return parse_timestamp(dateptr, NULL, 10); } + ## t/t4212-log-corrupt.sh ## @@ t/t4212-log-corrupt.sh: test_expect_success 'absurdly far-in-future date' ' -: ---------- > 4: 28ed51a2ca parse_commit(): describe more date-parsing failure modes ^ permalink raw reply [flat|nested] 46+ messages in thread
* [PATCH v2 1/4] t4212: avoid putting git on left-hand side of pipe 2023-04-25 5:52 ` [PATCH v2 " Jeff King 2023-04-25 5:54 ` Jeff King @ 2023-04-25 5:54 ` Jeff King 2023-04-25 5:54 ` [PATCH v2 2/4] parse_commit(): parse timestamp from end of line Jeff King ` (2 subsequent siblings) 4 siblings, 0 replies; 46+ messages in thread From: Jeff King @ 2023-04-25 5:54 UTC (permalink / raw) To: Thomas Bock; +Cc: Derrick Stolee, Junio C Hamano, git We wouldn't expect cat-file to fail here, but it's good practice to avoid putting git on the upstream of a pipe, as we otherwise ignore its exit code. Signed-off-by: Jeff King <peff@peff.net> --- t/t4212-log-corrupt.sh | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/t/t4212-log-corrupt.sh b/t/t4212-log-corrupt.sh index e89e1f54b6..8b5433ea74 100755 --- a/t/t4212-log-corrupt.sh +++ b/t/t4212-log-corrupt.sh @@ -8,8 +8,9 @@ TEST_PASSES_SANITIZE_LEAK=true test_expect_success 'setup' ' test_commit foo && - git cat-file commit HEAD | - sed "/^author /s/>/>-<>/" >broken_email.commit && + git cat-file commit HEAD >ok.commit && + sed "/^author /s/>/>-<>/" <ok.commit >broken_email.commit && + git hash-object --literally -w -t commit broken_email.commit >broken_email.hash && git update-ref refs/heads/broken_email $(cat broken_email.hash) ' -- 2.40.0.653.g15ca972062 ^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v2 2/4] parse_commit(): parse timestamp from end of line 2023-04-25 5:52 ` [PATCH v2 " Jeff King 2023-04-25 5:54 ` Jeff King 2023-04-25 5:54 ` [PATCH v2 1/4] t4212: avoid putting git on left-hand side of pipe Jeff King @ 2023-04-25 5:54 ` Jeff King 2023-04-25 5:54 ` [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp Jeff King 2023-04-25 5:55 ` [PATCH v2 4/4] parse_commit(): describe more date-parsing failure modes Jeff King 4 siblings, 0 replies; 46+ messages in thread From: Jeff King @ 2023-04-25 5:54 UTC (permalink / raw) To: Thomas Bock; +Cc: Derrick Stolee, Junio C Hamano, git To find the committer timestamp, we parse left-to-right looking for the closing ">" of the email, and then expect the timestamp right after that. But we've seen some broken cases in the wild where this fails, but we _could_ find the timestamp with a little extra work. E.g.: Name <Name<email>> 123456789 -0500 This means that features that rely on the committer timestamp, like --since or --until, will treat the commit as happening at time 0 (i.e., 1970). This is doubly confusing because the pretty-print parser learned to handle these in 03818a4a94 (split_ident: parse timestamp from end of line, 2013-10-14). So printing them via "git show", etc, makes everything look normal, but --until, etc are still broken (despite the fact that that commit explicitly mentioned --until!). So let's use the same trick as 03818a4a94: find the end of the line, and parse back to the final ">". In theory we could use split_ident_line() here, but it's actually a bit more strict. In particular, it requires a valid time-zone token, too. That should be present, of course, but we wouldn't want to break --until for cases that are working currently. We might want to teach split_ident_line() to become more lenient there, but it would require checking its many callers (since right now they can assume that if date_start is non-NULL, so is tz_start). So for now we'll just reimplement the same trick in the commit parser. The test is in t4212, which already covers similar cases, courtesy of 03818a4a94. We'll just adjust the broken commit to munge both the author and committer timestamps. Note that we could match (author|committer) here, but alternation can't be used portably in sed. Since we wouldn't expect to see ">" except as part of an ident line, we can just match that character on any line. Signed-off-by: Jeff King <peff@peff.net> --- commit.c | 24 ++++++++++++++++-------- t/t4212-log-corrupt.sh | 7 ++++++- 2 files changed, 22 insertions(+), 9 deletions(-) diff --git a/commit.c b/commit.c index 6d844da9a6..bb340f66fa 100644 --- a/commit.c +++ b/commit.c @@ -95,6 +95,7 @@ struct commit *lookup_commit_reference_by_name(const char *name) static timestamp_t parse_commit_date(const char *buf, const char *tail) { const char *dateptr; + const char *eol; if (buf + 6 >= tail) return 0; @@ -106,16 +107,23 @@ static timestamp_t parse_commit_date(const char *buf, const char *tail) return 0; if (memcmp(buf, "committer", 9)) return 0; - while (buf < tail && *buf++ != '>') - /* nada */; - if (buf >= tail) + + /* + * Jump to end-of-line so that we can walk backwards to find the + * end-of-email ">". This is more forgiving of malformed cases + * because unexpected characters tend to be in the name and email + * fields. + */ + eol = memchr(buf, '\n', tail - buf); + if (!eol) return 0; - dateptr = buf; - while (buf < tail && *buf++ != '\n') - /* nada */; - if (buf >= tail) + dateptr = eol; + while (dateptr > buf && dateptr[-1] != '>') + dateptr--; + if (dateptr == buf || dateptr == eol) return 0; - /* dateptr < buf && buf[-1] == '\n', so parsing will stop at buf-1 */ + + /* dateptr < eol && *eol == '\n', so parsing will stop at eol */ return parse_timestamp(dateptr, NULL, 10); } diff --git a/t/t4212-log-corrupt.sh b/t/t4212-log-corrupt.sh index 8b5433ea74..af4b35ff56 100755 --- a/t/t4212-log-corrupt.sh +++ b/t/t4212-log-corrupt.sh @@ -9,7 +9,7 @@ test_expect_success 'setup' ' test_commit foo && git cat-file commit HEAD >ok.commit && - sed "/^author /s/>/>-<>/" <ok.commit >broken_email.commit && + sed "s/>/>-<>/" <ok.commit >broken_email.commit && git hash-object --literally -w -t commit broken_email.commit >broken_email.hash && git update-ref refs/heads/broken_email $(cat broken_email.hash) @@ -44,6 +44,11 @@ test_expect_success 'git log --format with broken author email' ' test_must_be_empty actual.err ' +test_expect_success '--until handles broken email' ' + git rev-list --until=1980-01-01 broken_email >actual && + test_must_be_empty actual +' + munge_author_date () { git cat-file commit "$1" >commit.orig && sed "s/^\(author .*>\) [0-9]*/\1 $2/" <commit.orig >commit.munge && -- 2.40.0.653.g15ca972062 ^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp 2023-04-25 5:52 ` [PATCH v2 " Jeff King ` (2 preceding siblings ...) 2023-04-25 5:54 ` [PATCH v2 2/4] parse_commit(): parse timestamp from end of line Jeff King @ 2023-04-25 5:54 ` Jeff King 2023-04-25 10:11 ` Phillip Wood 2023-04-25 5:55 ` [PATCH v2 4/4] parse_commit(): describe more date-parsing failure modes Jeff King 4 siblings, 1 reply; 46+ messages in thread From: Jeff King @ 2023-04-25 5:54 UTC (permalink / raw) To: Thomas Bock; +Cc: Derrick Stolee, Junio C Hamano, git The comment in parse_commit_date() claims that parse_timestamp() will not walk past the end of the buffer we've been given, since it will hit the newline at "eol" and stop. This is usually true, when dateptr contains actual numbers to parse. But with a line like: committer name <email> \n with just whitespace, and no numbers, parse_timestamp() will consume that newline as part of the leading whitespace, and we may walk past our "tail" pointer (which itself is set from the "size" parameter passed in to parse_commit_buffer()). In practice this can't cause us to walk off the end of an array, because we always add an extra NUL byte to the end of objects we load from disk (as a defense against exactly this kind of bug). However, you can see the behavior in action when "committer" is the final header (which it usually is, unless there's an encoding) and the subject line can be parsed as an integer. We walk right past the newline on the committer line, as well as the "\n\n" separator, and mistake the subject for the timestamp. The new test demonstrates such a case. I also added a test to check this case against the pretty-print formatter, which uses split_ident_line(). It's not subject to the same bug, because it insists that there be one or more digits in the timestamp. Signed-off-by: Jeff King <peff@peff.net> --- commit.c | 17 +++++++++++++++-- t/t4212-log-corrupt.sh | 29 +++++++++++++++++++++++++++++ 2 files changed, 44 insertions(+), 2 deletions(-) diff --git a/commit.c b/commit.c index bb340f66fa..2f1b5d505b 100644 --- a/commit.c +++ b/commit.c @@ -120,10 +120,23 @@ static timestamp_t parse_commit_date(const char *buf, const char *tail) dateptr = eol; while (dateptr > buf && dateptr[-1] != '>') dateptr--; - if (dateptr == buf || dateptr == eol) + if (dateptr == buf) return 0; - /* dateptr < eol && *eol == '\n', so parsing will stop at eol */ + /* + * Trim leading whitespace; parse_timestamp() will do this itself, but + * if we have _only_ whitespace, it will walk right past the newline + * while doing so. + */ + while (dateptr < eol && isspace(*dateptr)) + dateptr++; + if (dateptr == eol) + return 0; + + /* + * We know there is at least one non-whitespace character, so we'll + * begin parsing there and stop at worst case at eol. + */ return parse_timestamp(dateptr, NULL, 10); } diff --git a/t/t4212-log-corrupt.sh b/t/t4212-log-corrupt.sh index af4b35ff56..d4ef48d646 100755 --- a/t/t4212-log-corrupt.sh +++ b/t/t4212-log-corrupt.sh @@ -92,4 +92,33 @@ test_expect_success 'absurdly far-in-future date' ' git log -1 --format=%ad $commit ' +test_expect_success 'create commit with whitespace committer date' ' + # It is important that this subject line is numeric, since we want to + # be sure we are not confused by skipping whitespace and accidentally + # parsing the subject as a timestamp. + # + # Do not use munge_author_date here. Besides not hitting the committer + # line, it leaves the timezone intact, and we want nothing but + # whitespace. + test_commit 1234567890 && + git cat-file commit HEAD >commit.orig && + sed "s/>.*/> /" <commit.orig >commit.munge && + ws_commit=$(git hash-object --literally -w -t commit commit.munge) +' + +test_expect_success '--until treats whitespace date as sentinel' ' + echo $ws_commit >expect && + git rev-list --until=1980-01-01 $ws_commit >actual && + test_cmp expect actual +' + +test_expect_success 'pretty-printer handles whitespace date' ' + # as with the %ad test above, we will show these as the empty string, + # not the 1970 epoch date. This is intentional; see 7d9a281941 (t4212: + # test bogus timestamps with git-log, 2014-02-24) for more discussion. + echo : >expect && + git log -1 --format="%at:%ct" $ws_commit >actual && + test_cmp expect actual +' + test_done -- 2.40.0.653.g15ca972062 ^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp 2023-04-25 5:54 ` [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp Jeff King @ 2023-04-25 10:11 ` Phillip Wood 2023-04-25 16:06 ` Junio C Hamano 0 siblings, 1 reply; 46+ messages in thread From: Phillip Wood @ 2023-04-25 10:11 UTC (permalink / raw) To: Jeff King, Thomas Bock Cc: Derrick Stolee, Junio C Hamano, git, René Scharfe Hi Peff On 25/04/2023 06:54, Jeff King wrote: > The comment in parse_commit_date() claims that parse_timestamp() will > not walk past the end of the buffer we've been given, since it will hit > the newline at "eol" and stop. This is usually true, when dateptr > contains actual numbers to parse. But with a line like: > > committer name <email> \n > > with just whitespace, and no numbers, parse_timestamp() will consume > that newline as part of the leading whitespace, and we may walk past our > "tail" pointer (which itself is set from the "size" parameter passed in > to parse_commit_buffer()). > > In practice this can't cause us to walk off the end of an array, because > we always add an extra NUL byte to the end of objects we load from disk > (as a defense against exactly this kind of bug). However, you can see > the behavior in action when "committer" is the final header (which it > usually is, unless there's an encoding) and the subject line can be > parsed as an integer. We walk right past the newline on the committer > line, as well as the "\n\n" separator, and mistake the subject for the > timestamp. > > The new test demonstrates such a case. I also added a test to check this > case against the pretty-print formatter, which uses split_ident_line(). > It's not subject to the same bug, because it insists that there be one > or more digits in the timestamp. > > Signed-off-by: Jeff King <peff@peff.net> > --- > commit.c | 17 +++++++++++++++-- > t/t4212-log-corrupt.sh | 29 +++++++++++++++++++++++++++++ > 2 files changed, 44 insertions(+), 2 deletions(-) > > diff --git a/commit.c b/commit.c > index bb340f66fa..2f1b5d505b 100644 > --- a/commit.c > +++ b/commit.c > @@ -120,10 +120,23 @@ static timestamp_t parse_commit_date(const char *buf, const char *tail) > dateptr = eol; > while (dateptr > buf && dateptr[-1] != '>') > dateptr--; > - if (dateptr == buf || dateptr == eol) > + if (dateptr == buf) > return 0; > > - /* dateptr < eol && *eol == '\n', so parsing will stop at eol */ > + /* > + * Trim leading whitespace; parse_timestamp() will do this itself, but > + * if we have _only_ whitespace, it will walk right past the newline > + * while doing so. > + */ > + while (dateptr < eol && isspace(*dateptr)) > + dateptr++; > + if (dateptr == eol) > + return 0; > + > + /* > + * We know there is at least one non-whitespace character, so we'll > + * begin parsing there and stop at worst case at eol. > + */ This probably doesn't matter in practice but we define our own isspace() that does not treat '\v' and '\f' as whitespace. However parse_timestamp() (which is just strtoumax()) uses the standard library's isspace() which does treat those characters as whitespace and is locale dependent. This means we can potentially stop at a character that parse_timestamp() treats as whitespace and if there are no digits after it we'll still walk past the end of the line. Using Rene's suggestion of testing the character with isdigit() would fix that. It would also avoid parsing negative timestamps as positive numbers and reject any timestamps that begin with a locale dependent digit. I'm not familiar with this code, but would it be worth changing parse_timestamp() to stop parsing if it sees a newline? Best Wishes Phillip > return parse_timestamp(dateptr, NULL, 10); > } > > diff --git a/t/t4212-log-corrupt.sh b/t/t4212-log-corrupt.sh > index af4b35ff56..d4ef48d646 100755 > --- a/t/t4212-log-corrupt.sh > +++ b/t/t4212-log-corrupt.sh > @@ -92,4 +92,33 @@ test_expect_success 'absurdly far-in-future date' ' > git log -1 --format=%ad $commit > ' > > +test_expect_success 'create commit with whitespace committer date' ' > + # It is important that this subject line is numeric, since we want to > + # be sure we are not confused by skipping whitespace and accidentally > + # parsing the subject as a timestamp. > + # > + # Do not use munge_author_date here. Besides not hitting the committer > + # line, it leaves the timezone intact, and we want nothing but > + # whitespace. > + test_commit 1234567890 && > + git cat-file commit HEAD >commit.orig && > + sed "s/>.*/> /" <commit.orig >commit.munge && > + ws_commit=$(git hash-object --literally -w -t commit commit.munge) > +' > + > +test_expect_success '--until treats whitespace date as sentinel' ' > + echo $ws_commit >expect && > + git rev-list --until=1980-01-01 $ws_commit >actual && > + test_cmp expect actual > +' > + > +test_expect_success 'pretty-printer handles whitespace date' ' > + # as with the %ad test above, we will show these as the empty string, > + # not the 1970 epoch date. This is intentional; see 7d9a281941 (t4212: > + # test bogus timestamps with git-log, 2014-02-24) for more discussion. > + echo : >expect && > + git log -1 --format="%at:%ct" $ws_commit >actual && > + test_cmp expect actual > +' > + > test_done ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp 2023-04-25 10:11 ` Phillip Wood @ 2023-04-25 16:06 ` Junio C Hamano 2023-04-26 11:36 ` Jeff King 2023-04-26 14:06 ` [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp Phillip Wood 0 siblings, 2 replies; 46+ messages in thread From: Junio C Hamano @ 2023-04-25 16:06 UTC (permalink / raw) To: Phillip Wood Cc: Jeff King, Thomas Bock, Derrick Stolee, git, René Scharfe Phillip Wood <phillip.wood123@gmail.com> writes: > This probably doesn't matter in practice but we define our own > isspace() that does not treat '\v' and '\f' as whitespace. However > parse_timestamp() (which is just strtoumax()) uses the standard > library's isspace() which does treat those characters as whitespace > and is locale dependent. This means we can potentially stop at a > character that parse_timestamp() treats as whitespace and if there are > no digits after it we'll still walk past the end of the line. Using > Rene's suggestion of testing the character with isdigit() would fix > that. It would also avoid parsing negative timestamps as positive > numbers and reject any timestamps that begin with a locale dependent > digit. A very interesting observation. I wonder if a curious person can craft a malformed timestamp with "hash-object --literally" to do more than DoS themselves? We are not going to put anything other than [ 0-9+-] after the '>' we scan for, and making sure '>' is followed by SP and then [0-9] would be sufficient to ensure strtoumax() to stop before the '\n' but does not ensure that the "signal a bad timestamp with 0" happens. Perhaps that would be sufficient. I dunno. > I'm not familiar with this code, but would it be worth changing > parse_timestamp() to stop parsing if it sees a newline? Meaning replace or write our own strtoumax() equivalent? ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp 2023-04-25 16:06 ` Junio C Hamano @ 2023-04-26 11:36 ` Jeff King 2023-04-26 15:32 ` Junio C Hamano 2023-04-26 14:06 ` [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp Phillip Wood 1 sibling, 1 reply; 46+ messages in thread From: Jeff King @ 2023-04-26 11:36 UTC (permalink / raw) To: Junio C Hamano Cc: Phillip Wood, Thomas Bock, Derrick Stolee, git, René Scharfe On Tue, Apr 25, 2023 at 09:06:47AM -0700, Junio C Hamano wrote: > Phillip Wood <phillip.wood123@gmail.com> writes: > > > This probably doesn't matter in practice but we define our own > > isspace() that does not treat '\v' and '\f' as whitespace. However > > parse_timestamp() (which is just strtoumax()) uses the standard > > library's isspace() which does treat those characters as whitespace > > and is locale dependent. This means we can potentially stop at a > > character that parse_timestamp() treats as whitespace and if there are > > no digits after it we'll still walk past the end of the line. Using > > Rene's suggestion of testing the character with isdigit() would fix > > that. It would also avoid parsing negative timestamps as positive > > numbers and reject any timestamps that begin with a locale dependent > > digit. > > A very interesting observation. I wonder if a curious person can > craft a malformed timestamp with "hash-object --literally" to do > more than DoS themselves? I think the answer is no, because the worst case is that they read to the trailing NUL that we stick after any object content we read into memory. So we'd mis-parse: committer name <email> \v\n 123456 in the subject line to read "123456" as the commit timestamp (so basically the same bug my patch was trying to fix). But we'd never read out-of-bounds memory. Still, it does not give me warm fuzzies, and I think is worth fixing. > We are not going to put anything other than [ 0-9+-] after the '>' > we scan for, and making sure '>' is followed by SP and then [0-9] > would be sufficient to ensure strtoumax() to stop before the '\n' > but does not ensure that the "signal a bad timestamp with 0" > happens. Perhaps that would be sufficient. I dunno. Any single non-whitespace character at all would be sufficient to avoid the problem. And that's what the current iteration of the patch is trying to do. It's just that our definition of "whitespace" has to agree with strtoumax()'s for it to work. And as Phillip notes, that may even include locale dependent characters. So I don't think we want to get into trying to match them all (i.e., a "allow known" strategy). Instead, we should go back to what the original iteration of the series was doing, and make sure there is at least one digit (i.e., a "forbid unknown" strategy). Assuming that there is no locale where ascii "1" is considered whitespace. ;) Note that will exclude a few cases that we do allow now, like: committer name <email> \v123456 +0000\n Right now that parses as "123456", but we'd reject it as "0" after such a patch. The alternative is to check _all_ of the characters between ">" and the newline and make sure there is some digit somewhere, which would be sufficient to prevent strtoumax() from walking past the newline. I guess it's not even any more expensive in the normal case (since the very first non-whitespace entry should be a digit!). I'm not sure it's worth caring about too much either way. Garbage making it into name/email is an easy mistake to make (for users and implementations). Putting whitespace control codes into your timestamp is not, and marking them as "0" is an OK outcome. -Peff ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp 2023-04-26 11:36 ` Jeff King @ 2023-04-26 15:32 ` Junio C Hamano 2023-04-27 8:13 ` [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases Jeff King 0 siblings, 1 reply; 46+ messages in thread From: Junio C Hamano @ 2023-04-26 15:32 UTC (permalink / raw) To: Jeff King Cc: Phillip Wood, Thomas Bock, Derrick Stolee, git, René Scharfe Jeff King <peff@peff.net> writes: > Instead, we should go back to what the original iteration of the series > was doing, and make sure there is at least one digit (i.e., a "forbid > unknown" strategy). Assuming that there is no locale where ascii "1" is > considered whitespace. ;) > > Note that will exclude a few cases that we do allow now, like: > > committer name <email> \v123456 +0000\n > > Right now that parses as "123456", but we'd reject it as "0" after such > a patch. I would say that it is a good thing. The only (somewhat) end-user controlled things on the line are the name and email, and even there name is sanitized to remove "crud". The user-supplied timestamp goes through date.c::parse_date(), ending up with what date.c::date_string() formats, so there will not be syntactically incorrect timestamp there. So we can be strict format-wise on the timestamp field, once we identify where it begins, which is the point of scanning backwards for '>'. Unless the user does "hash-object" and deliberately creates a malformed commit object---they can keep both halves just fine in such a case as long as we do reject such a timestamp correctly. > The alternative is to check _all_ of the characters between ">" and the > newline and make sure there is some digit somewhere, which would be > sufficient to prevent strtoumax() from walking past the newline. > > I guess it's not even any more expensive in the normal case (since the > very first non-whitespace entry should be a digit!). I'm not sure it's > worth caring about too much either way. Garbage making it into > name/email is an easy mistake to make (for users and implementations). > Putting whitespace control codes into your timestamp is not, and marking > them as "0" is an OK outcome. Yeah, I think it is fine either way. Thanks ^ permalink raw reply [flat|nested] 46+ messages in thread
* [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases 2023-04-26 15:32 ` Junio C Hamano @ 2023-04-27 8:13 ` Jeff King 2023-04-27 8:14 ` [PATCH v3 1/4] t4212: avoid putting git on left-hand side of pipe Jeff King ` (5 more replies) 0 siblings, 6 replies; 46+ messages in thread From: Jeff King @ 2023-04-27 8:13 UTC (permalink / raw) To: Junio C Hamano Cc: Phillip Wood, Thomas Bock, Derrick Stolee, git, René Scharfe On Wed, Apr 26, 2023 at 08:32:49AM -0700, Junio C Hamano wrote: > > Note that will exclude a few cases that we do allow now, like: > > > > committer name <email> \v123456 +0000\n > > > > Right now that parses as "123456", but we'd reject it as "0" after such > > a patch. > > I would say that it is a good thing. > > The only (somewhat) end-user controlled things on the line are the > name and email, and even there name is sanitized to remove "crud". > The user-supplied timestamp goes through date.c::parse_date(), > ending up with what date.c::date_string() formats, so there will not > be syntactically incorrect timestamp there. So we can be strict > format-wise on the timestamp field, once we identify where it begins, > which is the point of scanning backwards for '>'. > > Unless the user does "hash-object" and deliberately creates a > malformed commit object---they can keep both halves just fine in > such a case as long as we do reject such a timestamp correctly. I think we'd ideally consider the behavior against hypothetical bugs in other implementations (including us in the future!). So yeah, I don't think we ever generated a syntactically incorrect timestamp, and it would be hard for a user to create one. But all things being equal, I'd prefer to keep parsing something like: committer name <email> 123456\n which is missing its timezone (and seems like a plausible sort of bug). But I'm OK to draw the line at "if your implementation is sticking control characters in the header, then tough luck". So here's a v3. I was tempted to add the fix on top of the existing patch, since it's somewhat its own case, and could be explained separately. But they really are two versions of the same problem, so I just rolled it all into patch 3. Patch 4 needed small updates to its comment to match. The first two patches are the same. [1/4]: t4212: avoid putting git on left-hand side of pipe [2/4]: parse_commit(): parse timestamp from end of line [3/4]: parse_commit(): handle broken whitespace-only timestamp [4/4]: parse_commit(): describe more date-parsing failure modes commit.c | 57 ++++++++++++++++++++++++++++++++++++------ t/t4212-log-corrupt.sh | 51 +++++++++++++++++++++++++++++++++++-- 2 files changed, 98 insertions(+), 10 deletions(-) 1: 7a2fa8daac = 1: 57401571b6 t4212: avoid putting git on left-hand side of pipe 2: d90c720075 = 2: 54fa983d66 parse_commit(): parse timestamp from end of line 3: 1a47c87c07 ! 3: 894815d82d parse_commit(): handle broken whitespace-only timestamp @@ Commit message line, as well as the "\n\n" separator, and mistake the subject for the timestamp. - The new test demonstrates such a case. I also added a test to check this - case against the pretty-print formatter, which uses split_ident_line(). - It's not subject to the same bug, because it insists that there be one - or more digits in the timestamp. + We can solve this by trimming the whitespace ourselves, making sure that + it has some non-whitespace to parse. Note that we need to be a bit + careful about the definition of "whitespace" here, as our isspace() + doesn't match exotic characters like vertical tab or formfeed. We can + work around that by checking for an actual number (see the in-code + comment). This is slightly more restrictive than the current code, but + in practice the results are either the same (we reject "foo" as "0", but + so would parse_timestamp()) or extremely unlikely even for broken + commits (parse_timestamp() would allow "\v123" as "123", but we'll know + make it "0"). + I did also allow "-" here, which may be controversial, as we don't + currently support negative timestamps. My reasoning was two-fold. One, + the design of parse_timestamp() is such that we should be able to easily + switch it to handling signed values, and this otherwise creates a + hard-to-find gotcha that anybody doing that work would get tripped up + on. And two, the status quo is that we currently parse them, though the + result of course ends up as a very large unsigned value (which is likely + to just get clamped to "0" for display anyway, since our date routines + can't handle it). + + The new test checks the commit parser (via "--until") for both vanilla + spaces and the vertical-tab case. I also added a test to check these + against the pretty-print formatter, which uses split_ident_line(). It's + not subject to the same bug, because it already insists that there be + one or more digits in the timestamp. + + Helped-by: Phillip Wood <phillip.wood123@gmail.com> Signed-off-by: Jeff King <peff@peff.net> ## commit.c ## @@ commit.c: static timestamp_t parse_commit_date(const char *buf, const char *tail - /* dateptr < eol && *eol == '\n', so parsing will stop at eol */ + /* -+ * Trim leading whitespace; parse_timestamp() will do this itself, but -+ * if we have _only_ whitespace, it will walk right past the newline -+ * while doing so. ++ * Trim leading whitespace, but make sure we have at least one ++ * non-whitespace character, as parse_timestamp() will otherwise walk ++ * right past the newline we found in "eol" when skipping whitespace ++ * itself. ++ * ++ * In theory it would be sufficient to allow any character not matched ++ * by isspace(), but there's a catch: our isspace() does not ++ * necessarily match the behavior of parse_timestamp(), as the latter ++ * is implemented by system routines which match more exotic control ++ * codes, or even locale-dependent sequences. ++ * ++ * Since we expect the timestamp to be a number, we can check for that. ++ * Anything else (e.g., a non-numeric token like "foo") would just ++ * cause parse_timestamp() to return 0 anyway. + */ + while (dateptr < eol && isspace(*dateptr)) + dateptr++; -+ if (dateptr == eol) ++ if (!isdigit(*dateptr) && *dateptr != '-') + return 0; + + /* -+ * We know there is at least one non-whitespace character, so we'll -+ * begin parsing there and stop at worst case at eol. ++ * We know there is at least one digit (or dash), so we'll begin ++ * parsing there and stop at worst case at eol. + */ return parse_timestamp(dateptr, NULL, 10); } @@ t/t4212-log-corrupt.sh: test_expect_success 'absurdly far-in-future date' ' git log -1 --format=%ad $commit ' -+test_expect_success 'create commit with whitespace committer date' ' ++test_expect_success 'create commits with whitespace committer dates' ' + # It is important that this subject line is numeric, since we want to + # be sure we are not confused by skipping whitespace and accidentally + # parsing the subject as a timestamp. + # + # Do not use munge_author_date here. Besides not hitting the committer + # line, it leaves the timezone intact, and we want nothing but + # whitespace. ++ # ++ # We will make two munged commits here. The first, ws_commit, will ++ # be purely spaces. The second contains a vertical tab, which is ++ # considered a space by strtoumax(), but not by our isspace(). + test_commit 1234567890 && + git cat-file commit HEAD >commit.orig && + sed "s/>.*/> /" <commit.orig >commit.munge && -+ ws_commit=$(git hash-object --literally -w -t commit commit.munge) ++ ws_commit=$(git hash-object --literally -w -t commit commit.munge) && ++ sed "s/>.*/> $(printf "\013")/" <commit.orig >commit.munge && ++ vt_commit=$(git hash-object --literally -w -t commit commit.munge) +' + +test_expect_success '--until treats whitespace date as sentinel' ' + echo $ws_commit >expect && + git rev-list --until=1980-01-01 $ws_commit >actual && ++ test_cmp expect actual && ++ ++ echo $vt_commit >expect && ++ git rev-list --until=1980-01-01 $vt_commit >actual && + test_cmp expect actual +' + @@ t/t4212-log-corrupt.sh: test_expect_success 'absurdly far-in-future date' ' + # test bogus timestamps with git-log, 2014-02-24) for more discussion. + echo : >expect && + git log -1 --format="%at:%ct" $ws_commit >actual && ++ test_cmp expect actual && ++ git log -1 --format="%at:%ct" $vt_commit && + test_cmp expect actual +' + 4: 193a01a32a < -: ---------- parse_commit(): describe more date-parsing failure modes -: ---------- > 4: ff7e9ddc7c parse_commit(): describe more date-parsing failure modes ^ permalink raw reply [flat|nested] 46+ messages in thread
* [PATCH v3 1/4] t4212: avoid putting git on left-hand side of pipe 2023-04-27 8:13 ` [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases Jeff King @ 2023-04-27 8:14 ` Jeff King 2023-04-27 8:14 ` [PATCH v3 2/4] parse_commit(): parse timestamp from end of line Jeff King ` (4 subsequent siblings) 5 siblings, 0 replies; 46+ messages in thread From: Jeff King @ 2023-04-27 8:14 UTC (permalink / raw) To: Junio C Hamano Cc: Phillip Wood, Thomas Bock, Derrick Stolee, git, René Scharfe We wouldn't expect cat-file to fail here, but it's good practice to avoid putting git on the upstream of a pipe, as we otherwise ignore its exit code. Signed-off-by: Jeff King <peff@peff.net> --- t/t4212-log-corrupt.sh | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/t/t4212-log-corrupt.sh b/t/t4212-log-corrupt.sh index e89e1f54b6..8b5433ea74 100755 --- a/t/t4212-log-corrupt.sh +++ b/t/t4212-log-corrupt.sh @@ -8,8 +8,9 @@ TEST_PASSES_SANITIZE_LEAK=true test_expect_success 'setup' ' test_commit foo && - git cat-file commit HEAD | - sed "/^author /s/>/>-<>/" >broken_email.commit && + git cat-file commit HEAD >ok.commit && + sed "/^author /s/>/>-<>/" <ok.commit >broken_email.commit && + git hash-object --literally -w -t commit broken_email.commit >broken_email.hash && git update-ref refs/heads/broken_email $(cat broken_email.hash) ' -- 2.40.1.663.g410c33770c.dirty ^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v3 2/4] parse_commit(): parse timestamp from end of line 2023-04-27 8:13 ` [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases Jeff King 2023-04-27 8:14 ` [PATCH v3 1/4] t4212: avoid putting git on left-hand side of pipe Jeff King @ 2023-04-27 8:14 ` Jeff King 2023-04-27 8:17 ` [PATCH v3 3/4] parse_commit(): handle broken whitespace-only timestamp Jeff King ` (3 subsequent siblings) 5 siblings, 0 replies; 46+ messages in thread From: Jeff King @ 2023-04-27 8:14 UTC (permalink / raw) To: Junio C Hamano Cc: Phillip Wood, Thomas Bock, Derrick Stolee, git, René Scharfe To find the committer timestamp, we parse left-to-right looking for the closing ">" of the email, and then expect the timestamp right after that. But we've seen some broken cases in the wild where this fails, but we _could_ find the timestamp with a little extra work. E.g.: Name <Name<email>> 123456789 -0500 This means that features that rely on the committer timestamp, like --since or --until, will treat the commit as happening at time 0 (i.e., 1970). This is doubly confusing because the pretty-print parser learned to handle these in 03818a4a94 (split_ident: parse timestamp from end of line, 2013-10-14). So printing them via "git show", etc, makes everything look normal, but --until, etc are still broken (despite the fact that that commit explicitly mentioned --until!). So let's use the same trick as 03818a4a94: find the end of the line, and parse back to the final ">". In theory we could use split_ident_line() here, but it's actually a bit more strict. In particular, it requires a valid time-zone token, too. That should be present, of course, but we wouldn't want to break --until for cases that are working currently. We might want to teach split_ident_line() to become more lenient there, but it would require checking its many callers (since right now they can assume that if date_start is non-NULL, so is tz_start). So for now we'll just reimplement the same trick in the commit parser. The test is in t4212, which already covers similar cases, courtesy of 03818a4a94. We'll just adjust the broken commit to munge both the author and committer timestamps. Note that we could match (author|committer) here, but alternation can't be used portably in sed. Since we wouldn't expect to see ">" except as part of an ident line, we can just match that character on any line. Signed-off-by: Jeff King <peff@peff.net> --- commit.c | 24 ++++++++++++++++-------- t/t4212-log-corrupt.sh | 7 ++++++- 2 files changed, 22 insertions(+), 9 deletions(-) diff --git a/commit.c b/commit.c index 878b4473e4..04c20d9cc6 100644 --- a/commit.c +++ b/commit.c @@ -96,6 +96,7 @@ struct commit *lookup_commit_reference_by_name(const char *name) static timestamp_t parse_commit_date(const char *buf, const char *tail) { const char *dateptr; + const char *eol; if (buf + 6 >= tail) return 0; @@ -107,16 +108,23 @@ static timestamp_t parse_commit_date(const char *buf, const char *tail) return 0; if (memcmp(buf, "committer", 9)) return 0; - while (buf < tail && *buf++ != '>') - /* nada */; - if (buf >= tail) + + /* + * Jump to end-of-line so that we can walk backwards to find the + * end-of-email ">". This is more forgiving of malformed cases + * because unexpected characters tend to be in the name and email + * fields. + */ + eol = memchr(buf, '\n', tail - buf); + if (!eol) return 0; - dateptr = buf; - while (buf < tail && *buf++ != '\n') - /* nada */; - if (buf >= tail) + dateptr = eol; + while (dateptr > buf && dateptr[-1] != '>') + dateptr--; + if (dateptr == buf || dateptr == eol) return 0; - /* dateptr < buf && buf[-1] == '\n', so parsing will stop at buf-1 */ + + /* dateptr < eol && *eol == '\n', so parsing will stop at eol */ return parse_timestamp(dateptr, NULL, 10); } diff --git a/t/t4212-log-corrupt.sh b/t/t4212-log-corrupt.sh index 8b5433ea74..af4b35ff56 100755 --- a/t/t4212-log-corrupt.sh +++ b/t/t4212-log-corrupt.sh @@ -9,7 +9,7 @@ test_expect_success 'setup' ' test_commit foo && git cat-file commit HEAD >ok.commit && - sed "/^author /s/>/>-<>/" <ok.commit >broken_email.commit && + sed "s/>/>-<>/" <ok.commit >broken_email.commit && git hash-object --literally -w -t commit broken_email.commit >broken_email.hash && git update-ref refs/heads/broken_email $(cat broken_email.hash) @@ -44,6 +44,11 @@ test_expect_success 'git log --format with broken author email' ' test_must_be_empty actual.err ' +test_expect_success '--until handles broken email' ' + git rev-list --until=1980-01-01 broken_email >actual && + test_must_be_empty actual +' + munge_author_date () { git cat-file commit "$1" >commit.orig && sed "s/^\(author .*>\) [0-9]*/\1 $2/" <commit.orig >commit.munge && -- 2.40.1.663.g410c33770c.dirty ^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v3 3/4] parse_commit(): handle broken whitespace-only timestamp 2023-04-27 8:13 ` [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases Jeff King 2023-04-27 8:14 ` [PATCH v3 1/4] t4212: avoid putting git on left-hand side of pipe Jeff King 2023-04-27 8:14 ` [PATCH v3 2/4] parse_commit(): parse timestamp from end of line Jeff King @ 2023-04-27 8:17 ` Jeff King 2023-04-27 10:11 ` Phillip Wood 2023-04-27 16:25 ` Junio C Hamano 2023-04-27 8:17 ` [PATCH v3 4/4] parse_commit(): describe more date-parsing failure modes Jeff King ` (2 subsequent siblings) 5 siblings, 2 replies; 46+ messages in thread From: Jeff King @ 2023-04-27 8:17 UTC (permalink / raw) To: Junio C Hamano Cc: Phillip Wood, Thomas Bock, Derrick Stolee, git, René Scharfe The comment in parse_commit_date() claims that parse_timestamp() will not walk past the end of the buffer we've been given, since it will hit the newline at "eol" and stop. This is usually true, when dateptr contains actual numbers to parse. But with a line like: committer name <email> \n with just whitespace, and no numbers, parse_timestamp() will consume that newline as part of the leading whitespace, and we may walk past our "tail" pointer (which itself is set from the "size" parameter passed in to parse_commit_buffer()). In practice this can't cause us to walk off the end of an array, because we always add an extra NUL byte to the end of objects we load from disk (as a defense against exactly this kind of bug). However, you can see the behavior in action when "committer" is the final header (which it usually is, unless there's an encoding) and the subject line can be parsed as an integer. We walk right past the newline on the committer line, as well as the "\n\n" separator, and mistake the subject for the timestamp. We can solve this by trimming the whitespace ourselves, making sure that it has some non-whitespace to parse. Note that we need to be a bit careful about the definition of "whitespace" here, as our isspace() doesn't match exotic characters like vertical tab or formfeed. We can work around that by checking for an actual number (see the in-code comment). This is slightly more restrictive than the current code, but in practice the results are either the same (we reject "foo" as "0", but so would parse_timestamp()) or extremely unlikely even for broken commits (parse_timestamp() would allow "\v123" as "123", but we'll now make it "0"). I did also allow "-" here, which may be controversial, as we don't currently support negative timestamps. My reasoning was two-fold. One, the design of parse_timestamp() is such that we should be able to easily switch it to handling signed values, and this otherwise creates a hard-to-find gotcha that anybody doing that work would get tripped up on. And two, the status quo is that we currently parse them, though the result of course ends up as a very large unsigned value (which is likely to just get clamped to "0" for display anyway, since our date routines can't handle it). The new test checks the commit parser (via "--until") for both vanilla spaces and the vertical-tab case. I also added a test to check these against the pretty-print formatter, which uses split_ident_line(). It's not subject to the same bug, because it already insists that there be one or more digits in the timestamp. Helped-by: Phillip Wood <phillip.wood123@gmail.com> Signed-off-by: Jeff King <peff@peff.net> --- commit.c | 28 ++++++++++++++++++++++++++-- t/t4212-log-corrupt.sh | 41 +++++++++++++++++++++++++++++++++++++++++ 2 files changed, 67 insertions(+), 2 deletions(-) diff --git a/commit.c b/commit.c index 04c20d9cc6..8dfe92cf37 100644 --- a/commit.c +++ b/commit.c @@ -121,10 +121,34 @@ static timestamp_t parse_commit_date(const char *buf, const char *tail) dateptr = eol; while (dateptr > buf && dateptr[-1] != '>') dateptr--; - if (dateptr == buf || dateptr == eol) + if (dateptr == buf) return 0; - /* dateptr < eol && *eol == '\n', so parsing will stop at eol */ + /* + * Trim leading whitespace, but make sure we have at least one + * non-whitespace character, as parse_timestamp() will otherwise walk + * right past the newline we found in "eol" when skipping whitespace + * itself. + * + * In theory it would be sufficient to allow any character not matched + * by isspace(), but there's a catch: our isspace() does not + * necessarily match the behavior of parse_timestamp(), as the latter + * is implemented by system routines which match more exotic control + * codes, or even locale-dependent sequences. + * + * Since we expect the timestamp to be a number, we can check for that. + * Anything else (e.g., a non-numeric token like "foo") would just + * cause parse_timestamp() to return 0 anyway. + */ + while (dateptr < eol && isspace(*dateptr)) + dateptr++; + if (!isdigit(*dateptr) && *dateptr != '-') + return 0; + + /* + * We know there is at least one digit (or dash), so we'll begin + * parsing there and stop at worst case at eol. + */ return parse_timestamp(dateptr, NULL, 10); } diff --git a/t/t4212-log-corrupt.sh b/t/t4212-log-corrupt.sh index af4b35ff56..85e90acb09 100755 --- a/t/t4212-log-corrupt.sh +++ b/t/t4212-log-corrupt.sh @@ -92,4 +92,45 @@ test_expect_success 'absurdly far-in-future date' ' git log -1 --format=%ad $commit ' +test_expect_success 'create commits with whitespace committer dates' ' + # It is important that this subject line is numeric, since we want to + # be sure we are not confused by skipping whitespace and accidentally + # parsing the subject as a timestamp. + # + # Do not use munge_author_date here. Besides not hitting the committer + # line, it leaves the timezone intact, and we want nothing but + # whitespace. + # + # We will make two munged commits here. The first, ws_commit, will + # be purely spaces. The second contains a vertical tab, which is + # considered a space by strtoumax(), but not by our isspace(). + test_commit 1234567890 && + git cat-file commit HEAD >commit.orig && + sed "s/>.*/> /" <commit.orig >commit.munge && + ws_commit=$(git hash-object --literally -w -t commit commit.munge) && + sed "s/>.*/> $(printf "\013")/" <commit.orig >commit.munge && + vt_commit=$(git hash-object --literally -w -t commit commit.munge) +' + +test_expect_success '--until treats whitespace date as sentinel' ' + echo $ws_commit >expect && + git rev-list --until=1980-01-01 $ws_commit >actual && + test_cmp expect actual && + + echo $vt_commit >expect && + git rev-list --until=1980-01-01 $vt_commit >actual && + test_cmp expect actual +' + +test_expect_success 'pretty-printer handles whitespace date' ' + # as with the %ad test above, we will show these as the empty string, + # not the 1970 epoch date. This is intentional; see 7d9a281941 (t4212: + # test bogus timestamps with git-log, 2014-02-24) for more discussion. + echo : >expect && + git log -1 --format="%at:%ct" $ws_commit >actual && + test_cmp expect actual && + git log -1 --format="%at:%ct" $vt_commit >actual && + test_cmp expect actual +' + test_done -- 2.40.1.663.g410c33770c.dirty ^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: [PATCH v3 3/4] parse_commit(): handle broken whitespace-only timestamp 2023-04-27 8:17 ` [PATCH v3 3/4] parse_commit(): handle broken whitespace-only timestamp Jeff King @ 2023-04-27 10:11 ` Phillip Wood 2023-04-27 11:55 ` Phillip Wood 2023-04-27 16:20 ` Junio C Hamano 2023-04-27 16:25 ` Junio C Hamano 1 sibling, 2 replies; 46+ messages in thread From: Phillip Wood @ 2023-04-27 10:11 UTC (permalink / raw) To: Jeff King, Junio C Hamano Cc: Thomas Bock, Derrick Stolee, git, René Scharfe On 27/04/2023 09:17, Jeff King wrote: > The comment in parse_commit_date() claims that parse_timestamp() will > not walk past the end of the buffer we've been given, since it will hit > the newline at "eol" and stop. This is usually true, when dateptr > contains actual numbers to parse. But with a line like: > > committer name <email> \n > > with just whitespace, and no numbers, parse_timestamp() will consume > that newline as part of the leading whitespace, and we may walk past our > "tail" pointer (which itself is set from the "size" parameter passed in > to parse_commit_buffer()). > > In practice this can't cause us to walk off the end of an array, because > we always add an extra NUL byte to the end of objects we load from disk > (as a defense against exactly this kind of bug). However, you can see > the behavior in action when "committer" is the final header (which it > usually is, unless there's an encoding) and the subject line can be > parsed as an integer. We walk right past the newline on the committer > line, as well as the "\n\n" separator, and mistake the subject for the > timestamp. > > We can solve this by trimming the whitespace ourselves, making sure that > it has some non-whitespace to parse. Note that we need to be a bit > careful about the definition of "whitespace" here, as our isspace() > doesn't match exotic characters like vertical tab or formfeed. We can > work around that by checking for an actual number (see the in-code > comment). This is slightly more restrictive than the current code, but > in practice the results are either the same (we reject "foo" as "0", but > so would parse_timestamp()) or extremely unlikely even for broken > commits (parse_timestamp() would allow "\v123" as "123", but we'll now > make it "0"). > > I did also allow "-" here, which may be controversial, as we don't > currently support negative timestamps. My reasoning was two-fold. One, > the design of parse_timestamp() is such that we should be able to easily > switch it to handling signed values, and this otherwise creates a > hard-to-find gotcha that anybody doing that work would get tripped up > on. And two, the status quo is that we currently parse them, though the > result of course ends up as a very large unsigned value (which is likely > to just get clamped to "0" for display anyway, since our date routines > can't handle it). I think this makes a good case for accepting '-'. The commit message is well explained as always :-) This all looks good to me apart from a query about one of the tests. > The new test checks the commit parser (via "--until") for both vanilla > spaces and the vertical-tab case. I also added a test to check these > against the pretty-print formatter, which uses split_ident_line(). It's > not subject to the same bug, because it already insists that there be > one or more digits in the timestamp. > > Helped-by: Phillip Wood <phillip.wood123@gmail.com> > Signed-off-by: Jeff King <peff@peff.net> > --- > commit.c | 28 ++++++++++++++++++++++++++-- > t/t4212-log-corrupt.sh | 41 +++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 67 insertions(+), 2 deletions(-) > > +test_expect_success 'create commits with whitespace committer dates' ' > + # It is important that this subject line is numeric, since we want to > + # be sure we are not confused by skipping whitespace and accidentally > + # parsing the subject as a timestamp. > + # > + # Do not use munge_author_date here. Besides not hitting the committer > + # line, it leaves the timezone intact, and we want nothing but > + # whitespace. > + # > + # We will make two munged commits here. The first, ws_commit, will > + # be purely spaces. The second contains a vertical tab, which is > + # considered a space by strtoumax(), but not by our isspace(). This comment is really helpful to explain what's going on and testing '\v' as well as ' ' is a good idea. > + test_commit 1234567890 && > + git cat-file commit HEAD >commit.orig && > + sed "s/>.*/> /" <commit.orig >commit.munge && > + ws_commit=$(git hash-object --literally -w -t commit commit.munge) && > + sed "s/>.*/> $(printf "\013")/" <commit.orig >commit.munge && Does the shell eat the '\v' when it trims trailing whitespace from the command substitution (I can't remember the rules off the top of my head)? Best Wishes Phillip > + vt_commit=$(git hash-object --literally -w -t commit commit.munge) > +' > + > +test_expect_success '--until treats whitespace date as sentinel' ' > + echo $ws_commit >expect && > + git rev-list --until=1980-01-01 $ws_commit >actual && > + test_cmp expect actual && > + > + echo $vt_commit >expect && > + git rev-list --until=1980-01-01 $vt_commit >actual && > + test_cmp expect actual > +' > + > +test_expect_success 'pretty-printer handles whitespace date' ' > + # as with the %ad test above, we will show these as the empty string, > + # not the 1970 epoch date. This is intentional; see 7d9a281941 (t4212: > + # test bogus timestamps with git-log, 2014-02-24) for more discussion. > + echo : >expect && > + git log -1 --format="%at:%ct" $ws_commit >actual && > + test_cmp expect actual && > + git log -1 --format="%at:%ct" $vt_commit >actual && > + test_cmp expect actual > +' > + > test_done ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 3/4] parse_commit(): handle broken whitespace-only timestamp 2023-04-27 10:11 ` Phillip Wood @ 2023-04-27 11:55 ` Phillip Wood 2023-04-27 16:46 ` Jeff King 2023-04-27 16:20 ` Junio C Hamano 1 sibling, 1 reply; 46+ messages in thread From: Phillip Wood @ 2023-04-27 11:55 UTC (permalink / raw) To: Jeff King, Junio C Hamano Cc: Thomas Bock, Derrick Stolee, git, René Scharfe On 27/04/2023 11:11, Phillip Wood wrote: >> + test_commit 1234567890 && >> + git cat-file commit HEAD >commit.orig && >> + sed "s/>.*/> /" <commit.orig >commit.munge && >> + ws_commit=$(git hash-object --literally -w -t commit >> commit.munge) && >> + sed "s/>.*/> $(printf "\013")/" <commit.orig >commit.munge && > > Does the shell eat the '\v' when it trims trailing whitespace from the > command substitution (I can't remember the rules off the top of my head)? Having looked it up, the shell trims newlines but not other whitespace so this should be fine. Best Wishes Phillip > > Best Wishes > > Phillip > >> + vt_commit=$(git hash-object --literally -w -t commit commit.munge) >> +' >> + >> +test_expect_success '--until treats whitespace date as sentinel' ' >> + echo $ws_commit >expect && >> + git rev-list --until=1980-01-01 $ws_commit >actual && >> + test_cmp expect actual && >> + >> + echo $vt_commit >expect && >> + git rev-list --until=1980-01-01 $vt_commit >actual && >> + test_cmp expect actual >> +' >> + >> +test_expect_success 'pretty-printer handles whitespace date' ' >> + # as with the %ad test above, we will show these as the empty >> string, >> + # not the 1970 epoch date. This is intentional; see 7d9a281941 >> (t4212: >> + # test bogus timestamps with git-log, 2014-02-24) for more >> discussion. >> + echo : >expect && >> + git log -1 --format="%at:%ct" $ws_commit >actual && >> + test_cmp expect actual && >> + git log -1 --format="%at:%ct" $vt_commit >actual && >> + test_cmp expect actual >> +' >> + >> test_done > ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 3/4] parse_commit(): handle broken whitespace-only timestamp 2023-04-27 11:55 ` Phillip Wood @ 2023-04-27 16:46 ` Jeff King 0 siblings, 0 replies; 46+ messages in thread From: Jeff King @ 2023-04-27 16:46 UTC (permalink / raw) To: phillip.wood Cc: Junio C Hamano, Thomas Bock, Derrick Stolee, git, René Scharfe On Thu, Apr 27, 2023 at 12:55:26PM +0100, Phillip Wood wrote: > On 27/04/2023 11:11, Phillip Wood wrote: > > > > + test_commit 1234567890 && > > > + git cat-file commit HEAD >commit.orig && > > > + sed "s/>.*/> /" <commit.orig >commit.munge && > > > + ws_commit=$(git hash-object --literally -w -t commit > > > commit.munge) && > > > + sed "s/>.*/> $(printf "\013")/" <commit.orig >commit.munge && > > > > Does the shell eat the '\v' when it trims trailing whitespace from the > > command substitution (I can't remember the rules off the top of my > > head)? > > Having looked it up, the shell trims newlines but not other whitespace so > this should be fine. Yep. I also wondered if some sed versions might complain. But it's really not that much more exotic than a tab, so it's probably OK. -Peff ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 3/4] parse_commit(): handle broken whitespace-only timestamp 2023-04-27 10:11 ` Phillip Wood 2023-04-27 11:55 ` Phillip Wood @ 2023-04-27 16:20 ` Junio C Hamano 2023-04-27 16:55 ` Jeff King 1 sibling, 1 reply; 46+ messages in thread From: Junio C Hamano @ 2023-04-27 16:20 UTC (permalink / raw) To: Phillip Wood Cc: Jeff King, Thomas Bock, Derrick Stolee, git, René Scharfe Phillip Wood <phillip.wood123@gmail.com> writes: >> I did also allow "-" here, which may be controversial, as we don't >> currently support negative timestamps. My reasoning was two-fold. One, >> the design of parse_timestamp() is such that we should be able to easily >> switch it to handling signed values, and this otherwise creates a >> hard-to-find gotcha that anybody doing that work would get tripped up >> on. And two, the status quo is that we currently parse them, though the >> result of course ends up as a very large unsigned value (which is likely >> to just get clamped to "0" for display anyway, since our date routines >> can't handle it). > > I think this makes a good case for accepting '-'. The commit message > is well explained as always :-) This all looks good to me apart from a > query about one of the tests. I agree. I was somewhat surprised that the big comment before that code did not mention it, but hopefully those who would be tempted to remove the check for '-' would either be careful enough themselves or be stopped by reviewers who are careful enough to go back to the log message of the commit that added the check in the first place, so it is OK. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 3/4] parse_commit(): handle broken whitespace-only timestamp 2023-04-27 16:20 ` Junio C Hamano @ 2023-04-27 16:55 ` Jeff King 0 siblings, 0 replies; 46+ messages in thread From: Jeff King @ 2023-04-27 16:55 UTC (permalink / raw) To: Junio C Hamano Cc: Phillip Wood, Thomas Bock, Derrick Stolee, git, René Scharfe On Thu, Apr 27, 2023 at 09:20:53AM -0700, Junio C Hamano wrote: > Phillip Wood <phillip.wood123@gmail.com> writes: > > >> I did also allow "-" here, which may be controversial, as we don't > >> currently support negative timestamps. My reasoning was two-fold. One, > >> the design of parse_timestamp() is such that we should be able to easily > >> switch it to handling signed values, and this otherwise creates a > >> hard-to-find gotcha that anybody doing that work would get tripped up > >> on. And two, the status quo is that we currently parse them, though the > >> result of course ends up as a very large unsigned value (which is likely > >> to just get clamped to "0" for display anyway, since our date routines > >> can't handle it). > > > > I think this makes a good case for accepting '-'. The commit message > > is well explained as always :-) This all looks good to me apart from a > > query about one of the tests. > > I agree. I was somewhat surprised that the big comment before that > code did not mention it, but hopefully those who would be tempted to > remove the check for '-' would either be careful enough themselves > or be stopped by reviewers who are careful enough to go back to the > log message of the commit that added the check in the first place, > so it is OK. Hmph, I thought I did write something in that comment. But I think in the end I just migrated it to the commit message (and skimming my reflog I don't think it even made it as far as a commit). I think it's OK. I was mostly trying to help out Future Peff, who has a series almost completed to handle negative timestamps. But I think the worst case is that the tests in that new series would fail, and I'd figure it out. :) -Peff ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 3/4] parse_commit(): handle broken whitespace-only timestamp 2023-04-27 8:17 ` [PATCH v3 3/4] parse_commit(): handle broken whitespace-only timestamp Jeff King 2023-04-27 10:11 ` Phillip Wood @ 2023-04-27 16:25 ` Junio C Hamano 2023-04-27 16:57 ` Jeff King 1 sibling, 1 reply; 46+ messages in thread From: Junio C Hamano @ 2023-04-27 16:25 UTC (permalink / raw) To: Jeff King Cc: Phillip Wood, Thomas Bock, Derrick Stolee, git, René Scharfe Jeff King <peff@peff.net> writes: > In practice this can't cause us to walk off the end of an array, because > we always add an extra NUL byte to the end of objects we load from disk > (as a defense against exactly this kind of bug). However, you can see > the behavior in action when "committer" is the final header (which it > usually is, unless there's an encoding ... ... or it is a signed commit or a commit that merges a signed tag. There is no need for us to be exhaustive here, but I just wondered which one of these three commit object headers is more common. I guess the reason "encoding" came to your mind first is because it is the oldest among the three. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 3/4] parse_commit(): handle broken whitespace-only timestamp 2023-04-27 16:25 ` Junio C Hamano @ 2023-04-27 16:57 ` Jeff King 0 siblings, 0 replies; 46+ messages in thread From: Jeff King @ 2023-04-27 16:57 UTC (permalink / raw) To: Junio C Hamano Cc: Phillip Wood, Thomas Bock, Derrick Stolee, git, René Scharfe On Thu, Apr 27, 2023 at 09:25:02AM -0700, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > In practice this can't cause us to walk off the end of an array, because > > we always add an extra NUL byte to the end of objects we load from disk > > (as a defense against exactly this kind of bug). However, you can see > > the behavior in action when "committer" is the final header (which it > > usually is, unless there's an encoding ... > > ... or it is a signed commit or a commit that merges a signed tag. > > There is no need for us to be exhaustive here, but I just wondered > which one of these three commit object headers is more common. I > guess the reason "encoding" came to your mind first is because it is > the oldest among the three. Mostly the others did not occur to me at all. :) I expect that "gpgsig" lines are probably the most common these days, but that may be my biased view (I guess in the kernel workflow it is probably signed tags). -Peff ^ permalink raw reply [flat|nested] 46+ messages in thread
* [PATCH v3 4/4] parse_commit(): describe more date-parsing failure modes 2023-04-27 8:13 ` [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases Jeff King ` (2 preceding siblings ...) 2023-04-27 8:17 ` [PATCH v3 3/4] parse_commit(): handle broken whitespace-only timestamp Jeff King @ 2023-04-27 8:17 ` Jeff King 2023-04-27 8:18 ` [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases Jeff King 2023-04-27 16:32 ` Junio C Hamano 5 siblings, 0 replies; 46+ messages in thread From: Jeff King @ 2023-04-27 8:17 UTC (permalink / raw) To: Junio C Hamano Cc: Phillip Wood, Thomas Bock, Derrick Stolee, git, René Scharfe The previous few commits improved the parsing of dates in malformed commit objects. But there's one big case left implicit: we may still feed garbage to parse_timestamp(). This is preferable to trying to be more strict, but let's document the thinking in a comment. Signed-off-by: Jeff King <peff@peff.net> --- commit.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/commit.c b/commit.c index 8dfe92cf37..e2e4fd2db9 100644 --- a/commit.c +++ b/commit.c @@ -148,6 +148,15 @@ static timestamp_t parse_commit_date(const char *buf, const char *tail) /* * We know there is at least one digit (or dash), so we'll begin * parsing there and stop at worst case at eol. + * + * Note that we may feed parse_timestamp() extra characters here if the + * commit is malformed, and it will parse as far as it can. For + * example, "123foo456" would return "123". That might be questionable + * (versus returning "0"), but it would help in a hypothetical case + * like "123456+0100", where the whitespace from the timezone is + * missing. Since such syntactic errors may be baked into history and + * hard to correct now, let's err on trying to make our best guess + * here, rather than insist on perfect syntax. */ return parse_timestamp(dateptr, NULL, 10); } -- 2.40.1.663.g410c33770c.dirty ^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases 2023-04-27 8:13 ` [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases Jeff King ` (3 preceding siblings ...) 2023-04-27 8:17 ` [PATCH v3 4/4] parse_commit(): describe more date-parsing failure modes Jeff King @ 2023-04-27 8:18 ` Jeff King 2023-04-27 16:32 ` Junio C Hamano 5 siblings, 0 replies; 46+ messages in thread From: Jeff King @ 2023-04-27 8:18 UTC (permalink / raw) To: Junio C Hamano Cc: Phillip Wood, Thomas Bock, Derrick Stolee, git, René Scharfe On Thu, Apr 27, 2023 at 04:13:31AM -0400, Jeff King wrote: > @@ t/t4212-log-corrupt.sh: test_expect_success 'absurdly far-in-future date' ' > + # test bogus timestamps with git-log, 2014-02-24) for more discussion. > + echo : >expect && > + git log -1 --format="%at:%ct" $ws_commit >actual && > ++ test_cmp expect actual && > ++ git log -1 --format="%at:%ct" $vt_commit && > + test_cmp expect actual In case anyone is just reading the range-diff, you may notice that the added git-log invocation here is missing its ">actual" redirect. I noticed this while sending out, and it's fixed in the actual v3 patch that was sent. -Peff ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases 2023-04-27 8:13 ` [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases Jeff King ` (4 preceding siblings ...) 2023-04-27 8:18 ` [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases Jeff King @ 2023-04-27 16:32 ` Junio C Hamano 5 siblings, 0 replies; 46+ messages in thread From: Junio C Hamano @ 2023-04-27 16:32 UTC (permalink / raw) To: Jeff King Cc: Phillip Wood, Thomas Bock, Derrick Stolee, git, René Scharfe Jeff King <peff@peff.net> writes: > So here's a v3. I was tempted to add the fix on top of the existing > patch, since it's somewhat its own case, and could be explained > separately. But they really are two versions of the same problem, so I > just rolled it all into patch 3. > > Patch 4 needed small updates to its comment to match. The first two > patches are the same. > > [1/4]: t4212: avoid putting git on left-hand side of pipe > [2/4]: parse_commit(): parse timestamp from end of line > [3/4]: parse_commit(): handle broken whitespace-only timestamp > [4/4]: parse_commit(): describe more date-parsing failure modes Thanks for a pleasant read. Looking very good. Will queue. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp 2023-04-25 16:06 ` Junio C Hamano 2023-04-26 11:36 ` Jeff King @ 2023-04-26 14:06 ` Phillip Wood 2023-04-26 14:31 ` Andreas Schwab 1 sibling, 1 reply; 46+ messages in thread From: Phillip Wood @ 2023-04-26 14:06 UTC (permalink / raw) To: Junio C Hamano Cc: Jeff King, Thomas Bock, Derrick Stolee, git, René Scharfe On 25/04/2023 17:06, Junio C Hamano wrote: > Phillip Wood <phillip.wood123@gmail.com> writes: > >> This probably doesn't matter in practice but we define our own >> isspace() that does not treat '\v' and '\f' as whitespace. However >> parse_timestamp() (which is just strtoumax()) uses the standard >> library's isspace() which does treat those characters as whitespace >> and is locale dependent. This means we can potentially stop at a >> character that parse_timestamp() treats as whitespace and if there are >> no digits after it we'll still walk past the end of the line. Using >> Rene's suggestion of testing the character with isdigit() would fix >> that. It would also avoid parsing negative timestamps as positive >> numbers >> and reject any timestamps that begin with a locale dependent >> digit. Sorry, that bit is not correct, I've since checked the C standard and I think strtoul() and friends expect ascii digits (isdigit() and isxdigit() are also locale independent unlike isspace(), isalpha() etc.) > A very interesting observation. I wonder if a curious person can > craft a malformed timestamp with "hash-object --literally" to do > more than DoS themselves? > > We are not going to put anything other than [ 0-9+-] after the '>' > we scan for, and making sure '>' is followed by SP and then [0-9] > would be sufficient to ensure strtoumax() to stop before the '\n' > but does not ensure that the "signal a bad timestamp with 0" > happens. Perhaps that would be sufficient. I dunno. > >> I'm not familiar with this code, but would it be worth changing >> parse_timestamp() to stop parsing if it sees a newline? > > Meaning replace or write our own strtoumax() equivalent? I was thinking of a wrapper around strtoumax() that skipped the leading whitespace itself and returned 0 if it saw '\n' or the first non-whitespace character was not a digit. It would help other callers avoid the problem with missing timestamps that is being fixed in this series. I was surprised to see that callers are expected to pass a base to parse_timestamp(). All of them seem to pass "10" apart from a caller in upload-pack.c that passes "0" when parsing the argument to "deepen-since" - do we really want to support octal and hexadecimal timestamps there?. Best Wishes Phillip ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp 2023-04-26 14:06 ` [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp Phillip Wood @ 2023-04-26 14:31 ` Andreas Schwab 2023-04-26 14:44 ` Phillip Wood 0 siblings, 1 reply; 46+ messages in thread From: Andreas Schwab @ 2023-04-26 14:31 UTC (permalink / raw) To: Phillip Wood Cc: Junio C Hamano, phillip.wood, Jeff King, Thomas Bock, Derrick Stolee, git, René Scharfe On Apr 26 2023, Phillip Wood wrote: > On 25/04/2023 17:06, Junio C Hamano wrote: >> Phillip Wood <phillip.wood123@gmail.com> writes: >> >>> This probably doesn't matter in practice but we define our own >>> isspace() that does not treat '\v' and '\f' as whitespace. However >>> parse_timestamp() (which is just strtoumax()) uses the standard >>> library's isspace() which does treat those characters as whitespace >>> and is locale dependent. This means we can potentially stop at a >>> character that parse_timestamp() treats as whitespace and if there are >>> no digits after it we'll still walk past the end of the line. Using >>> Rene's suggestion of testing the character with isdigit() would fix >>> that. It would also avoid parsing negative timestamps as positive >>> numbers > >>> and reject any timestamps that begin with a locale dependent >>> digit. > > Sorry, that bit is not correct, I've since checked the C standard and I > think strtoul() and friends expect ascii digits (isdigit() and isxdigit() > are also locale independent unlike isspace(), isalpha() etc.) The standard says: In other than the "C" locale, additional locale-specific subject sequence forms may be accepted. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1 "And now for something completely different." ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp 2023-04-26 14:31 ` Andreas Schwab @ 2023-04-26 14:44 ` Phillip Wood 0 siblings, 0 replies; 46+ messages in thread From: Phillip Wood @ 2023-04-26 14:44 UTC (permalink / raw) To: Andreas Schwab Cc: Junio C Hamano, phillip.wood, Jeff King, Thomas Bock, Derrick Stolee, git, René Scharfe Hi Andreas On 26/04/2023 15:31, Andreas Schwab wrote: > On Apr 26 2023, Phillip Wood wrote: > >> On 25/04/2023 17:06, Junio C Hamano wrote: >>> Phillip Wood <phillip.wood123@gmail.com> writes: >>> >>>> This probably doesn't matter in practice but we define our own >>>> isspace() that does not treat '\v' and '\f' as whitespace. However >>>> parse_timestamp() (which is just strtoumax()) uses the standard >>>> library's isspace() which does treat those characters as whitespace >>>> and is locale dependent. This means we can potentially stop at a >>>> character that parse_timestamp() treats as whitespace and if there are >>>> no digits after it we'll still walk past the end of the line. Using >>>> Rene's suggestion of testing the character with isdigit() would fix >>>> that. It would also avoid parsing negative timestamps as positive >>>> numbers >> >>>> and reject any timestamps that begin with a locale dependent >>>> digit. >> >> Sorry, that bit is not correct, I've since checked the C standard and I >> think strtoul() and friends expect ascii digits (isdigit() and isxdigit() >> are also locale independent unlike isspace(), isalpha() etc.) > > The standard says: > > In other than the "C" locale, additional locale-specific subject > sequence forms may be accepted. Thanks, looking at the standard again I don't know how I managed to miss that, my initial recollection was correct after all. Best Wishes Phillip ^ permalink raw reply [flat|nested] 46+ messages in thread
* [PATCH v2 4/4] parse_commit(): describe more date-parsing failure modes 2023-04-25 5:52 ` [PATCH v2 " Jeff King ` (3 preceding siblings ...) 2023-04-25 5:54 ` [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp Jeff King @ 2023-04-25 5:55 ` Jeff King 4 siblings, 0 replies; 46+ messages in thread From: Jeff King @ 2023-04-25 5:55 UTC (permalink / raw) To: Thomas Bock; +Cc: Derrick Stolee, Junio C Hamano, git The previous few commits improved the parsing of dates in malformed commit objects. But there's one big case left implicit: we may still feed garbage to parse_timestamp(). This is preferable to trying to be more strict, but let's document the thinking in a comment. Signed-off-by: Jeff King <peff@peff.net> --- Obviously not strictly necessary, but it felt like this was worth pointing out, and it felt weird to shove it into patch 3. commit.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/commit.c b/commit.c index 2f1b5d505b..2ac141e198 100644 --- a/commit.c +++ b/commit.c @@ -136,6 +136,16 @@ static timestamp_t parse_commit_date(const char *buf, const char *tail) /* * We know there is at least one non-whitespace character, so we'll * begin parsing there and stop at worst case at eol. + * + * Note that we may feed parse_timestamp() non-digit garbage here if + * the commit is malformed. If we see a totally unexpected token like + * "foo" here, it will return 0, which is our error/sentinel value + * anyway. For something like "123foo456", it will parse as far as it + * can. That might be questionable (versus returning "0"), but it would + * help in a hypothetical case like "123456+0100", where the whitespace + * from the timezone is missing. Since such syntactic errors may be + * baked into history and hard to correct now, let's err on trying to + * make our best guess here, rather than insist on perfect syntax. */ return parse_timestamp(dateptr, NULL, 10); } -- 2.40.0.653.g15ca972062 ^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980 2023-04-18 14:02 ` Derrick Stolee 2023-04-21 14:51 ` Thomas Bock @ 2023-04-22 13:52 ` Jeff King 1 sibling, 0 replies; 46+ messages in thread From: Jeff King @ 2023-04-22 13:52 UTC (permalink / raw) To: Derrick Stolee; +Cc: Junio C Hamano, Thomas Bock, git On Tue, Apr 18, 2023 at 10:02:44AM -0400, Derrick Stolee wrote: > On 4/18/2023 12:12 AM, Jeff King wrote: > > > One thing the commit graph perhaps _could_ do is omit the commit, or > > mark it as "this one is broken in some way". And then fall back to > > parsing those few instead (which is slower, but if it's a small minority > > of commits, that's OK). But I don't think there's any code for that. > > The "broken" commit would need to be included in the commit-graph file > so its children can point to it using a graph position, but then it > would revert to parsing from the commit object (due to some new concept > storing "this is a bad commit"). > > If we decided to treat a timestamp of 0 as "probably broken, artificial > at best" then we wouldn't need the new indicator in the commit-graph > file, but this seems like quite a big hammer for a small case. Makes sense. I punted on doing anything here for the series I just posted, since the workaround of "blow away your commit graph and rebuild" is probably sufficient for such obscure cases. And going forward, new commit graphs would be built with the fixed parser anyway, so it becomes less of a problem over time. -Peff ^ permalink raw reply [flat|nested] 46+ messages in thread
end of thread, other threads:[~2023-04-27 22:33 UTC | newest] Thread overview: 46+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-04-14 11:37 Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980 Thomas Bock 2023-04-15 8:52 ` Jeff King 2023-04-15 8:59 ` Jeff King 2023-04-15 14:10 ` Kristoffer Haugsbakk 2023-04-17 5:40 ` Jeff King 2023-04-17 6:20 ` Kristoffer Haugsbakk 2023-04-17 7:41 ` Jeff King 2023-04-27 22:32 ` Kristoffer Haugsbakk 2023-04-17 9:51 ` Junio C Hamano 2023-04-18 4:12 ` Jeff King 2023-04-18 14:02 ` Derrick Stolee 2023-04-21 14:51 ` Thomas Bock 2023-04-22 13:41 ` [PATCH 0/3] fixing some parse_commit() timestamp corner cases Jeff King 2023-04-22 13:42 ` [PATCH 1/3] t4212: avoid putting git on left-hand side of pipe Jeff King 2023-04-22 13:47 ` [PATCH 2/3] parse_commit(): parse timestamp from end of line Jeff King 2023-04-24 17:05 ` Junio C Hamano 2023-04-25 5:23 ` Jeff King 2023-04-24 16:39 ` [PATCH 0/3] fixing some parse_commit() timestamp corner cases Junio C Hamano 2023-04-25 5:52 ` [PATCH v2 " Jeff King 2023-04-25 5:54 ` Jeff King 2023-04-25 5:54 ` [PATCH v2 1/4] t4212: avoid putting git on left-hand side of pipe Jeff King 2023-04-25 5:54 ` [PATCH v2 2/4] parse_commit(): parse timestamp from end of line Jeff King 2023-04-25 5:54 ` [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp Jeff King 2023-04-25 10:11 ` Phillip Wood 2023-04-25 16:06 ` Junio C Hamano 2023-04-26 11:36 ` Jeff King 2023-04-26 15:32 ` Junio C Hamano 2023-04-27 8:13 ` [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases Jeff King 2023-04-27 8:14 ` [PATCH v3 1/4] t4212: avoid putting git on left-hand side of pipe Jeff King 2023-04-27 8:14 ` [PATCH v3 2/4] parse_commit(): parse timestamp from end of line Jeff King 2023-04-27 8:17 ` [PATCH v3 3/4] parse_commit(): handle broken whitespace-only timestamp Jeff King 2023-04-27 10:11 ` Phillip Wood 2023-04-27 11:55 ` Phillip Wood 2023-04-27 16:46 ` Jeff King 2023-04-27 16:20 ` Junio C Hamano 2023-04-27 16:55 ` Jeff King 2023-04-27 16:25 ` Junio C Hamano 2023-04-27 16:57 ` Jeff King 2023-04-27 8:17 ` [PATCH v3 4/4] parse_commit(): describe more date-parsing failure modes Jeff King 2023-04-27 8:18 ` [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases Jeff King 2023-04-27 16:32 ` Junio C Hamano 2023-04-26 14:06 ` [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp Phillip Wood 2023-04-26 14:31 ` Andreas Schwab 2023-04-26 14:44 ` Phillip Wood 2023-04-25 5:55 ` [PATCH v2 4/4] parse_commit(): describe more date-parsing failure modes Jeff King 2023-04-22 13:52 ` Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980 Jeff King
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).