Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980
@ 2023-04-14 11:37 Thomas Bock
  2023-04-15  8:52 ` Jeff King
  0 siblings, 1 reply; 46+ messages in thread
From: Thomas Bock @ 2023-04-14 11:37 UTC (permalink / raw)
  To: git

Dear git development team,

I've spotted a weird behavior of git in a specific repository, which I 
cannot explain, maybe there is a bug in git.

Repository in which the problem occurs:

git@github.com:LibreOffice/core.git

Problem:

When ordering the commits in this repository by time via "git log", 
there are several tens of commits which appear to be before 1980 
although their author date and commit date are in 2009 or 2010 or 2011.

Example to reproduce:

git clone git@github.com:LibreOffice/core.git libreoffice
cd libreoffice
git log --no-merges --before="1980-01-01T00:00:00+0000" 
--format=%H,%ct,%ci,%ad

All the resulting commits have an author date and commit date in 2009 or 
2010 or 2011, though.

This also appears when you order the commits by date:

git log --no-merges --date-order --format=%H,%ct,%ci,%ad

If you search for commit d3b03514dde317473db0d247f21405b5db6a727e in the 
resulting output, you can see that this commit is placed earlier in the 
list than the commits from 2008 although its author date and commit date 
are from 2011. And there are many more commits of this sort between 2009 
and 2011 which are placed out of order, interpreted to be earlier than 
every other commit.

I already searched for broken timestamps in these commits and printed 
the timestamp in various different formats, but I could not spot 
anything dubious there. So, why does git log think that these commits 
have an earlier timestamp than they actually have?

Thanks for your efforts!

[System Info]
git version:
git version 2.30.2 (but I have also tried it with git version 2.40.0 
with no difference)
cpu: x86_64
sizeof-long: 8
sizeof-size_t: 8
shell-path: /bin/sh
uname: Linux 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64
compiler info: gnuc: 10.2
libc info: glibc: 2.31
$SHELL (typically, interactive shell): /bin/bash

Best,
Thomas

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980
  2023-04-14 11:37 Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980 Thomas Bock
@ 2023-04-15  8:52 ` Jeff King
  2023-04-15  8:59   ` Jeff King
                     ` (2 more replies)
  0 siblings, 3 replies; 46+ messages in thread
From: Jeff King @ 2023-04-15  8:52 UTC (permalink / raw)
  To: Thomas Bock; +Cc: git

On Fri, Apr 14, 2023 at 01:37:57PM +0200, Thomas Bock wrote:

> Example to reproduce:
> 
> git clone git@github.com:LibreOffice/core.git libreoffice
> cd libreoffice
> git log --no-merges --before="1980-01-01T00:00:00+0000"
> --format=%H,%ct,%ci,%ad

Thanks for providing a clear example and reproduction recipe. It's an
interesting case. The commits _are_ malformed, but only slightly. For
example:

  $ git cat-file commit d3b03514dde317473db0d247f21405b5db6a727e
  tree c7526375c71327a195714e9e325b66a9ad013d74
  parent 61099481271709723469421181f65e6219cbc271
  author Andre Fischer<andre.f.fischer <Andre Fischer<andre.f.fischer@oracle.com>> 1297247749 +0100
  committer Andre Fischer<andre.f.fischer <Andre Fischer<andre.f.fischer@oracle.com>> 1297247749 +0100

There's an extra weird set of angle brackets in both the author and
committer lines.

But here's the really quirky part: there actually two parsers being used
in Git here. One is in parse_commit_buffer(), which does the minimum to
fill in the fields of a "struct commit": parents, tree, and committer
timestamp.  That parser calls parse_commit_date(), which finds the first
">" and then tries to parse "> 1297247749 +0100" as a timestamp, which
fails. So it uses the sentinel value "0" (aka 1970-01-01). And that's
what gets used for "--before" (and "--since", and things like queue
order for showing commits).

But then when we actually _display_ the commit, we have a whole
pretty-printing system. There we usually end up in split_ident_line(),
which is a bit more forgiving, due to 03818a4a94 (split_ident: parse
timestamp from end of line, 2013-10-14). It finds the trailing ">" from
the right-hand side, which in this case yields the correct timestamp.

We could probably use the same trick in parse_commit_date(), something
like:

diff --git a/commit.c b/commit.c
index 6d844da9a6..72d21a2bb8 100644
--- a/commit.c
+++ b/commit.c
@@ -95,6 +95,7 @@ struct commit *lookup_commit_reference_by_name(const char *name)
 static timestamp_t parse_commit_date(const char *buf, const char *tail)
 {
 	const char *dateptr;
+	const char *eol;

 	if (buf + 6 >= tail)
 		return 0;
@@ -106,16 +107,22 @@ static timestamp_t parse_commit_date(const char *buf, const char *tail)
 		return 0;
 	if (memcmp(buf, "committer", 9))
 		return 0;
-	while (buf < tail && *buf++ != '>')
+
+	/*
+	 * parse to end-of-line and then walk backwards, which
+	 * handles some malformed cases. Alternatively, once
+	 * we have eol, we could just call split_ident_line()
+	 */
+	eol = buf;
+	while (eol < tail && *eol++ != '\n')
 		/* nada */;
-	if (buf >= tail)
+	if (eol >= tail)
 		return 0;
-	dateptr = buf;
-	while (buf < tail && *buf++ != '\n')
+	for (dateptr = eol; dateptr > buf && dateptr[-1] != '>'; dateptr--)
 		/* nada */;
-	if (buf >= tail)
+	if (dateptr == buf || dateptr == eol)
 		return 0;
-	/* dateptr < buf && buf[-1] == '\n', so parsing will stop at buf-1 */
+	/* dateptr < eol && *eol == '\n', so parsing will stop at eol */
 	return parse_timestamp(dateptr, NULL, 10);
 }

which makes your case behave as expected. It may make sense to use
split_ident_line(), which I think has a few other rules (it actually
checks for something that looks like numbers, for instance).

There may be also cases where the two diverge. Obviously having two
parsers isn't ideal. I think sharing the code may involve a lot of work,
though. The pretty-print parser is interested in pulling out more
information, and is less focused on performance. Parsing commits is
traditionally a hot path, as we historically had to parse every commit,
even if we weren't showing it (including non-log operations like
computing merge bases, reachability, and so forth).

But that may not matter so much. One, we already inflate the whole
commit object, not just the header. So even if we spend a few extra
instructions on parsing, it may not be noticeable. And two, these days
we often cache commit metadata in the commit-graph files, which avoids
loading the commit message entirely (and thus this parsing) for most
operations.

Of course it may also be reasonable to consider this mystery solved and
leave the code as-is. These objects _are_ malformed. I note they're all
at least a decade old, and most of them are from a handful of committers
(who presumably had misconfigured idents for a while). These days I
think Git wouldn't allow them (I don't think it rejects the commit, but
it does screen out the syntactically bogus characters).

-Peff

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980
  2023-04-15  8:52 ` Jeff King
@ 2023-04-15  8:59   ` Jeff King
  2023-04-15 14:10   ` Kristoffer Haugsbakk
  2023-04-17  9:51   ` Junio C Hamano
  2 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2023-04-15  8:59 UTC (permalink / raw)
  To: Thomas Bock; +Cc: git

On Sat, Apr 15, 2023 at 04:52:07AM -0400, Jeff King wrote:

> There may be also cases where the two diverge. Obviously having two

Sorry, my ability to type is going rapidly downhill this evening.

I meant to say "there may be other cases where the two diverge". I.e.,
the best way to be sure they behave the same is to make them share code.

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980
  2023-04-15  8:52 ` Jeff King
  2023-04-15  8:59   ` Jeff King
@ 2023-04-15 14:10   ` Kristoffer Haugsbakk
  2023-04-17  5:40     ` Jeff King
  2023-04-17  9:51   ` Junio C Hamano
  2 siblings, 1 reply; 46+ messages in thread
From: Kristoffer Haugsbakk @ 2023-04-15 14:10 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Sat, Apr 15, 2023, at 10:52, Jeff King wrote:
> Of course it may also be reasonable to consider this mystery solved and
> leave the code as-is. These objects _are_ malformed.

I started to run `git-repair`[1] on the repository yesterday. I’m not
sure yet if it will finish.[2]

[1]: https://git-repair.branchable.com/
[2]: https://git-repair.branchable.com/index/discussion/

-- 
Kristoffer Haugsbakk

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980
  2023-04-15 14:10   ` Kristoffer Haugsbakk
@ 2023-04-17  5:40     ` Jeff King
  2023-04-17  6:20       ` Kristoffer Haugsbakk
  0 siblings, 1 reply; 46+ messages in thread
From: Jeff King @ 2023-04-17  5:40 UTC (permalink / raw)
  To: Kristoffer Haugsbakk; +Cc: git

On Sat, Apr 15, 2023 at 04:10:32PM +0200, Kristoffer Haugsbakk wrote:

> On Sat, Apr 15, 2023, at 10:52, Jeff King wrote:
> > Of course it may also be reasonable to consider this mystery solved and
> > leave the code as-is. These objects _are_ malformed.
> 
> I started to run `git-repair`[1] on the repository yesterday. I’m not
> sure yet if it will finish.[2]

I don't know if git-repair will fix syntactic problems in commits like
this, or if it's just looking for broken links between objects. But
regardless, it would be possible to fix these using filter-repo or other
tools.

The problem with repairing them is that it rewrites history, changing
the object id of every commit which comes after.  So you'd want to do it
once centrally for the project, and declare a flag day that everybody is
moving over to the new, fixed history.

It's probably not worth it for a minor problem like this.

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980
  2023-04-17  5:40     ` Jeff King
@ 2023-04-17  6:20       ` Kristoffer Haugsbakk
  2023-04-17  7:41         ` Jeff King
  0 siblings, 1 reply; 46+ messages in thread
From: Kristoffer Haugsbakk @ 2023-04-17  6:20 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Mon, Apr 17, 2023, at 07:40, Jeff King wrote:
> I don't know if git-repair will fix syntactic problems in commits like
> this, or if it's just looking for broken links between objects. But
> regardless, it would be possible to fix these using filter-repo or other
> tools.

Makes sense, thanks.

I gave up since the program never seemed to finish.

> The problem with repairing them is that it rewrites history, changing
> the object id of every commit which comes after.  So you'd want to do it
> once centrally for the project, and declare a flag day that everybody is
> moving over to the new, fixed history.

I wasn’t interested in fixing the repo for future development. Just to
see if that weird git-log(1) behavior went away. It’s like the inverse
of a minimal example to reproduce a bug; repair the corrupt object so
that future askers can be convinced that it’s not a bug. ;)

I’m not affiliated with the project.

-- 
Kristoffer Haugsbakk

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980
  2023-04-17  6:20       ` Kristoffer Haugsbakk
@ 2023-04-17  7:41         ` Jeff King
  2023-04-27 22:32           ` Kristoffer Haugsbakk
  0 siblings, 1 reply; 46+ messages in thread
From: Jeff King @ 2023-04-17  7:41 UTC (permalink / raw)
  To: Kristoffer Haugsbakk; +Cc: git

On Mon, Apr 17, 2023 at 08:20:30AM +0200, Kristoffer Haugsbakk wrote:

> > The problem with repairing them is that it rewrites history, changing
> > the object id of every commit which comes after.  So you'd want to do it
> > once centrally for the project, and declare a flag day that everybody is
> > moving over to the new, fixed history.
> 
> I wasn’t interested in fixing the repo for future development. Just to
> see if that weird git-log(1) behavior went away. It’s like the inverse
> of a minimal example to reproduce a bug; repair the corrupt object so
> that future askers can be convinced that it’s not a bug. ;)

Ah, OK. Carry on, then. ;)

I'm 99% sure it would indeed go away, since the patch I showed earlier
(to make the parsing more lenient) worked. But if you want to try, I
suspect you could do it with git-filter-repo's --commit-callback option.
Or if you don't mind being a little loose with the parsing, probably
just piping fast-export through sed, and then back to fast-import.

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980
  2023-04-17  7:41         ` Jeff King
@ 2023-04-27 22:32           ` Kristoffer Haugsbakk
  0 siblings, 0 replies; 46+ messages in thread
From: Kristoffer Haugsbakk @ 2023-04-27 22:32 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Thomas Bock

On Mon, Apr 17, 2023, at 09:41, Jeff King wrote:
> Ah, OK. Carry on, then. ;)

The result (see readme): https://github.com/LemmingAvalanche/libreoffice-repaired

The problem went away.

> But if you want to try, I suspect you could do it with
> git-filter-repo's --commit-callback option.

I couldn’t use that. The metadata was too mangled and `git-filter-repo`
just errored out. :P

But `git commit-tree` + `git replace` + `git filter-repo --force` worked.

-- 
Kristoffer Haugsbakk

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980
  2023-04-15  8:52 ` Jeff King
  2023-04-15  8:59   ` Jeff King
  2023-04-15 14:10   ` Kristoffer Haugsbakk
@ 2023-04-17  9:51   ` Junio C Hamano
  2023-04-18  4:12     ` Jeff King
  2 siblings, 1 reply; 46+ messages in thread
From: Junio C Hamano @ 2023-04-17  9:51 UTC (permalink / raw)
  To: Jeff King; +Cc: Thomas Bock, git

Jeff King <peff@peff.net> writes:

> There may be also cases where the two diverge. Obviously having two
> parsers isn't ideal. I think sharing the code may involve a lot of work,
> though. The pretty-print parser is interested in pulling out more
> information, and is less focused on performance. Parsing commits is
> traditionally a hot path, as we historically had to parse every commit,
> even if we weren't showing it (including non-log operations like
> computing merge bases, reachability, and so forth).
>
> But that may not matter so much. One, we already inflate the whole
> commit object, not just the header. So even if we spend a few extra
> instructions on parsing, it may not be noticeable. And two, these days
> we often cache commit metadata in the commit-graph files, which avoids
> loading the commit message entirely (and thus this parsing) for most
> operations.

Makes readers wonder which parser is used to parse commit objects in
order to populate the commit-graph files.  If that step screws up,
we'd record a broken timestamp there X-<.




^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980
  2023-04-17  9:51   ` Junio C Hamano
@ 2023-04-18  4:12     ` Jeff King
  2023-04-18 14:02       ` Derrick Stolee
  0 siblings, 1 reply; 46+ messages in thread
From: Jeff King @ 2023-04-18  4:12 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Thomas Bock, git

On Mon, Apr 17, 2023 at 02:51:42AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > There may be also cases where the two diverge. Obviously having two
> > parsers isn't ideal. I think sharing the code may involve a lot of work,
> > though. The pretty-print parser is interested in pulling out more
> > information, and is less focused on performance. Parsing commits is
> > traditionally a hot path, as we historically had to parse every commit,
> > even if we weren't showing it (including non-log operations like
> > computing merge bases, reachability, and so forth).
> >
> > But that may not matter so much. One, we already inflate the whole
> > commit object, not just the header. So even if we spend a few extra
> > instructions on parsing, it may not be noticeable. And two, these days
> > we often cache commit metadata in the commit-graph files, which avoids
> > loading the commit message entirely (and thus this parsing) for most
> > operations.
> 
> Makes readers wonder which parser is used to parse commit objects in
> order to populate the commit-graph files.  If that step screws up,
> we'd record a broken timestamp there X-<.

It should be be the parse_commit() one, since the commit-graph writing
code works off of parsed "struct commit" objects to find its data. Which
is good, since we may silently optimize out a parse in favor of the
commit-graph; we'd want the two to always match. So recording a broken
timestamp there is OK (but see below).

Where it gets trickier is if we then fix the parser to handle this case.
Now the commit-graph stores a broken value, and further writes are
clever enough to say "ah, I already have that commit in the graph; no
need to recompute its values". So once you start using a version of Git
with the more lenient parser, you'd have to manually blow away any graph
files and rebuild to see the benefit.

One thing the commit graph perhaps _could_ do is omit the commit, or
mark it as "this one is broken in some way". And then fall back to
parsing those few instead (which is slower, but if it's a small minority
of commits, that's OK). But I don't think there's any code for that. And
it's kind of tricky anyway, because the parser just sets the date to "0"
with no indication that there was any potential error. So it's probably
solvable, but perhaps not worth the effort. The current rule is just
"whatever we got from parsing is what the commit-graph will return",
which works well in practice unless the parser changes.

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980
  2023-04-18  4:12     ` Jeff King
@ 2023-04-18 14:02       ` Derrick Stolee
  2023-04-21 14:51         ` Thomas Bock
  2023-04-22 13:52         ` Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980 Jeff King
  0 siblings, 2 replies; 46+ messages in thread
From: Derrick Stolee @ 2023-04-18 14:02 UTC (permalink / raw)
  To: Jeff King, Junio C Hamano; +Cc: Thomas Bock, git

On 4/18/2023 12:12 AM, Jeff King wrote:

> One thing the commit graph perhaps _could_ do is omit the commit, or
> mark it as "this one is broken in some way". And then fall back to
> parsing those few instead (which is slower, but if it's a small minority
> of commits, that's OK). But I don't think there's any code for that.

The "broken" commit would need to be included in the commit-graph file
so its children can point to it using a graph position, but then it
would revert to parsing from the commit object (due to some new concept
storing "this is a bad commit").

If we decided to treat a timestamp of 0 as "probably broken, artificial
at best" then we wouldn't need the new indicator in the commit-graph
file, but this seems like quite a big hammer for a small case.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980
  2023-04-18 14:02       ` Derrick Stolee
@ 2023-04-21 14:51         ` Thomas Bock
  2023-04-22 13:41           ` [PATCH 0/3] fixing some parse_commit() timestamp corner cases Jeff King
  2023-04-22 13:52         ` Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980 Jeff King
  1 sibling, 1 reply; 46+ messages in thread
From: Thomas Bock @ 2023-04-21 14:51 UTC (permalink / raw)
  To: Derrick Stolee, Jeff King, Junio C Hamano; +Cc: git

[-- Attachment #1.1: Type: text/plain, Size: 2111 bytes --]

Thanks @ all for clarifying this mystery, this already helped a lot to 
figure out what's going on there.

Even though the affected commit objects are malformed, it would be very 
helpful if this problem could be solved somehow, from a user 
perspective. Such malformed objects can potentially occur also in other 
comparably old projects, where searching for commits that have been made 
in a specific time window in the past could be useful or even necessary 
in some cases.

On the other hand, if it would be a huge effort to solve the problem 
(which I cannot assess), I could fully understand that it might not be 
worth to completely fix this specific problem that occurs only in few 
cases in the past (if you are sure that such malformed objects cannot 
occur any more today). If you decide to not spend the effort on fixing 
this problem, however, I would appreciate if you could, at least, 
partially fix this problem by not treating the broken commit objects to 
be before any other commit object (timestamp 0), but introduce some kind 
of error handling for such commits that omits listing them in the wrong 
time period.

Thanks again for all the explanations and thoughts!

Best,
Thomas

Am 18.04.23 um 16:02 schrieb Derrick Stolee:
> On 4/18/2023 12:12 AM, Jeff King wrote:
>
>> One thing the commit graph perhaps _could_ do is omit the commit, or
>> mark it as "this one is broken in some way". And then fall back to
>> parsing those few instead (which is slower, but if it's a small minority
>> of commits, that's OK). But I don't think there's any code for that.
> The "broken" commit would need to be included in the commit-graph file
> so its children can point to it using a graph position, but then it
> would revert to parsing from the commit object (due to some new concept
> storing "this is a bad commit").
>
> If we decided to treat a timestamp of 0 as "probably broken, artificial
> at best" then we wouldn't need the new indicator in the commit-graph
> file, but this seems like quite a big hammer for a small case.
>
> Thanks,
> -Stolee

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 0/3] fixing some parse_commit() timestamp corner cases
  2023-04-21 14:51         ` Thomas Bock
@ 2023-04-22 13:41           ` Jeff King
  2023-04-22 13:42             ` [PATCH 1/3] t4212: avoid putting git on left-hand side of pipe Jeff King
                               ` (3 more replies)
  0 siblings, 4 replies; 46+ messages in thread
From: Jeff King @ 2023-04-22 13:41 UTC (permalink / raw)
  To: Thomas Bock; +Cc: Derrick Stolee, Junio C Hamano, git

On Fri, Apr 21, 2023 at 04:51:03PM +0200, Thomas Bock wrote:

> Even though the affected commit objects are malformed, it would be very
> helpful if this problem could be solved somehow, from a user perspective.
> Such malformed objects can potentially occur also in other comparably old
> projects, where searching for commits that have been made in a specific time
> window in the past could be useful or even necessary in some cases.

Yeah, after sleeping on it for a bit, I think it is worth fixing. I also
found another parsing bug in the same function. ;)

So here's the result.

  [1/3]: t4212: avoid putting git on left-hand side of pipe
  [2/3]: parse_commit(): parse timestamp from end of line
  [3/3]: parse_commit(): handle broken whitespace-only timestamp

 commit.c               | 29 +++++++++++++++++++++-------
 t/t4212-log-corrupt.sh | 44 ++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 64 insertions(+), 9 deletions(-)

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 1/3] t4212: avoid putting git on left-hand side of pipe
  2023-04-22 13:41           ` [PATCH 0/3] fixing some parse_commit() timestamp corner cases Jeff King
@ 2023-04-22 13:42             ` Jeff King
  2023-04-22 13:47             ` [PATCH 2/3] parse_commit(): parse timestamp from end of line Jeff King
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2023-04-22 13:42 UTC (permalink / raw)
  To: Thomas Bock; +Cc: Derrick Stolee, Junio C Hamano, git

We wouldn't expect cat-file to fail here, but it's good practice to
avoid putting git on the upstream of a pipe, as we otherwise ignore its
exit code.

Signed-off-by: Jeff King <peff@peff.net>
---
 t/t4212-log-corrupt.sh | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/t/t4212-log-corrupt.sh b/t/t4212-log-corrupt.sh
index e89e1f54b6..8b5433ea74 100755
--- a/t/t4212-log-corrupt.sh
+++ b/t/t4212-log-corrupt.sh
@@ -8,8 +8,9 @@ TEST_PASSES_SANITIZE_LEAK=true
 test_expect_success 'setup' '
 	test_commit foo &&
 
-	git cat-file commit HEAD |
-	sed "/^author /s/>/>-<>/" >broken_email.commit &&
+	git cat-file commit HEAD >ok.commit &&
+	sed "/^author /s/>/>-<>/" <ok.commit >broken_email.commit &&
+
 	git hash-object --literally -w -t commit broken_email.commit >broken_email.hash &&
 	git update-ref refs/heads/broken_email $(cat broken_email.hash)
 '
-- 
2.40.0.653.g15ca972062


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 2/3] parse_commit(): parse timestamp from end of line
  2023-04-22 13:41           ` [PATCH 0/3] fixing some parse_commit() timestamp corner cases Jeff King
  2023-04-22 13:42             ` [PATCH 1/3] t4212: avoid putting git on left-hand side of pipe Jeff King
@ 2023-04-22 13:47             ` Jeff King
  2023-04-24 17:05               ` Junio C Hamano
  2023-04-24 16:39             ` [PATCH 0/3] fixing some parse_commit() timestamp corner cases Junio C Hamano
  2023-04-25  5:52             ` [PATCH v2 " Jeff King
  3 siblings, 1 reply; 46+ messages in thread
From: Jeff King @ 2023-04-22 13:47 UTC (permalink / raw)
  To: Thomas Bock; +Cc: Derrick Stolee, Junio C Hamano, git

To find the committer timestamp, we parse left-to-right looking for the
closing ">" of the email, and then expect the timestamp right after
that. But we've seen some broken cases in the wild where this fails, but
we _could_ find the timestamp with a little extra work. E.g.:

  Name <Name<email>> 123456789 -0500

This means that features that rely on the committer timestamp, like
--since or --until, will treat the commit as happening at time 0 (i.e.,
1970).

This is doubly confusing because the pretty-print parser learned to
handle these in 03818a4a94 (split_ident: parse timestamp from end of
line, 2013-10-14). So printing them via "git show", etc, makes
everything look normal, but --until, etc are still broken (despite the
fact that that commit explicitly mentioned --until!).

So let's use the same trick as 03818a4a94: find the end of the line, and
parse back to the final ">". In theory we could use split_ident_line()
here, but it's actually a bit more strict. In particular, it requires a
valid time-zone token, too. That should be present, of course, but we
wouldn't want to break --until for malformed cases that are working
currently.

We might want to teach split_ident_line() to become more lenient there,
but it would require checking its many callers (since right now they can
assume that if date_start is non-NULL, so is tz_start).

So for now we'll just reimplement the same trick in the commit parser.

The test is in t4212, which already covers similar cases, courtesy of
03818a4a94. We'll just adjust the broken commit to munge both the author
and committer timestamps. Note that we could match (author|committer)
here, but alternation can't be used portably in sed. Since we wouldn't
expect to see ">" except as part of an ident line, we can just match
that character on any line.

Signed-off-by: Jeff King <peff@peff.net>
---
This is more or less the same as what I posted earlier, but using
memchr() where appropriate (we could use memrchr(), too, but I don't
think it's portable enough).

Note that both before and after my series, there are cases where
parse_commit() is more lenient than split_ident_line(), because of the
time-zone thing. For example:

  committer name <email> 1234567890\n

will show as 1970, but --until, etc, will work as expected (so the
opposite of the case that started this thread). So I do think making
split_ident_line() more lenient may be worth doing, but I punted on that
for now. This series takes us in a strictly better direction with
respect to visible behavior, even if we might be able to clean up the
internals later by reusing code.

 commit.c               | 19 ++++++++++++-------
 t/t4212-log-corrupt.sh |  7 ++++++-
 2 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/commit.c b/commit.c
index 6d844da9a6..ede810ac1c 100644
--- a/commit.c
+++ b/commit.c
@@ -95,6 +95,7 @@ struct commit *lookup_commit_reference_by_name(const char *name)
 static timestamp_t parse_commit_date(const char *buf, const char *tail)
 {
 	const char *dateptr;
+	const char *eol;

 	if (buf + 6 >= tail)
 		return 0;
@@ -106,16 +107,20 @@ static timestamp_t parse_commit_date(const char *buf, const char *tail)
 		return 0;
 	if (memcmp(buf, "committer", 9))
 		return 0;
-	while (buf < tail && *buf++ != '>')
-		/* nada */;
-	if (buf >= tail)
+
+	/*
+	 * parse to end-of-line and then walk backwards, which
+	 * handles some malformed cases.
+	 */
+	eol = memchr(buf, '\n', tail - buf);
+	if (!eol)
 		return 0;
-	dateptr = buf;
-	while (buf < tail && *buf++ != '\n')
+	for (dateptr = eol; dateptr > buf && dateptr[-1] != '>'; dateptr--)
 		/* nada */;
-	if (buf >= tail)
+	if (dateptr == buf || dateptr == eol)
 		return 0;
-	/* dateptr < buf && buf[-1] == '\n', so parsing will stop at buf-1 */
+
+	/* dateptr < eol && *eol == '\n', so parsing will stop at eol */
 	return parse_timestamp(dateptr, NULL, 10);
 }

diff --git a/t/t4212-log-corrupt.sh b/t/t4212-log-corrupt.sh
index 8b5433ea74..af4b35ff56 100755
--- a/t/t4212-log-corrupt.sh
+++ b/t/t4212-log-corrupt.sh
@@ -9,7 +9,7 @@ test_expect_success 'setup' '
 	test_commit foo &&

 	git cat-file commit HEAD >ok.commit &&
-	sed "/^author /s/>/>-<>/" <ok.commit >broken_email.commit &&
+	sed "s/>/>-<>/" <ok.commit >broken_email.commit &&

 	git hash-object --literally -w -t commit broken_email.commit >broken_email.hash &&
 	git update-ref refs/heads/broken_email $(cat broken_email.hash)
@@ -44,6 +44,11 @@ test_expect_success 'git log --format with broken author email' '
 	test_must_be_empty actual.err
 '

+test_expect_success '--until handles broken email' '
+	git rev-list --until=1980-01-01 broken_email >actual &&
+	test_must_be_empty actual
+'
+
 munge_author_date () {
 	git cat-file commit "$1" >commit.orig &&
 	sed "s/^\(author .*>\) [0-9]*/\1 $2/" <commit.orig >commit.munge &&
-- 
2.40.0.653.g15ca972062

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/3] parse_commit(): parse timestamp from end of line
  2023-04-22 13:47             ` [PATCH 2/3] parse_commit(): parse timestamp from end of line Jeff King
@ 2023-04-24 17:05               ` Junio C Hamano
  2023-04-25  5:23                 ` Jeff King
  0 siblings, 1 reply; 46+ messages in thread
From: Junio C Hamano @ 2023-04-24 17:05 UTC (permalink / raw)
  To: Jeff King; +Cc: Thomas Bock, Derrick Stolee, git

Jeff King <peff@peff.net> writes:

> To find the committer timestamp, we parse left-to-right looking for the
> closing ">" of the email, and then expect the timestamp right after
> that. But we've seen some broken cases in the wild where this fails, but
> we _could_ find the timestamp with a little extra work. E.g.:
>
>   Name <Name<email>> 123456789 -0500
> ...
> So let's use the same trick as 03818a4a94: find the end of the line, and
> parse back to the final ">".

This obviously assumes that even in a broken ident, it is very
likely that the second component (where <e-mail> usually sits) ends
with a '>'.  Given that we enclose whatever garbage the end user
gave us as their e-mail inside a pair of <> ourselves, it is a very
sensible assumption, I would think.  The original parser assumed
that the end user would not have '>' inside their e-mail part of the
ident, which turns out to be more problematic than the alternative
being proposed.  It is doubly good that we already parse from the
end elsewhere.

Nice.

> +	/*
> +	 * parse to end-of-line and then walk backwards, which
> +	 * handles some malformed cases.
> +	 */

I would say "parse to" -> "jump to", but technically moving forward
looking for a LF byte is still "parsing".  "some" malformed cases
being "most plausible" ones (due to how ident.c::fmt_ident() is what
writes '>' after the string end-user gave as e-mail) may be worth
mentioning.

> +	eol = memchr(buf, '\n', tail - buf);
> +	if (!eol)
>  		return 0;

OK.

> -	dateptr = buf;
> -	while (buf < tail && *buf++ != '\n')
> +	for (dateptr = eol; dateptr > buf && dateptr[-1] != '>'; dateptr--)
>  		/* nada */;

OK.  Just a style thing, but I found that "; /* nada */" is easier
to spot that there is an empty statement there.

> -	if (buf >= tail)
> +	if (dateptr == buf || dateptr == eol)
>  		return 0;

Curious when dateptr that wanted to scan back from eol is still at
eol after the loop.  It is when the ident line ends with ">" without
any timestamp/tz info.   And the reason why we need to check that
here is ...

> -	/* dateptr < buf && buf[-1] == '\n', so parsing will stop at buf-1 */
> +
> +	/* dateptr < eol && *eol == '\n', so parsing will stop at eol */
>  	return parse_timestamp(dateptr, NULL, 10);

... because parse_timestamp() is merely strtoumax() and would
happily skip over arbitrary number of leading "whitespace" without
stopping if (dateptr == eol && *eol == '\n').  OK, sad but correct.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/3] parse_commit(): parse timestamp from end of line
  2023-04-24 17:05               ` Junio C Hamano
@ 2023-04-25  5:23                 ` Jeff King
  0 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2023-04-25  5:23 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Thomas Bock, Derrick Stolee, git

On Mon, Apr 24, 2023 at 10:05:42AM -0700, Junio C Hamano wrote:

> > +	/*
> > +	 * parse to end-of-line and then walk backwards, which
> > +	 * handles some malformed cases.
> > +	 */
> 
> I would say "parse to" -> "jump to", but technically moving forward
> looking for a LF byte is still "parsing".  "some" malformed cases
> being "most plausible" ones (due to how ident.c::fmt_ident() is what
> writes '>' after the string end-user gave as e-mail) may be worth
> mentioning.

I'll expand this to:

  /*
   * Jump to end-of-line so that we can walk backwards to find the
   * end-of-email ">". This is more forgiving of malformed cases
   * because unexpected characters tend to be in the name and email
   * fields.
   */

> > -	dateptr = buf;
> > -	while (buf < tail && *buf++ != '\n')
> > +	for (dateptr = eol; dateptr > buf && dateptr[-1] != '>'; dateptr--)
> >  		/* nada */;
> 
> OK.  Just a style thing, but I found that "; /* nada */" is easier
> to spot that there is an empty statement there.

I found it ugly, too, but it's from earlier. Since the point is to
advance "dateptr", it may be better to turn it into a while loop anyway,
which side-steps the empty statement altogether:

  dateptr = eol;
  while (dateptr > buf && dateptr[-1] != '>')
	dateptr--;

> > -	if (buf >= tail)
> > +	if (dateptr == buf || dateptr == eol)
> >  		return 0;
> 
> Curious when dateptr that wanted to scan back from eol is still at
> eol after the loop.  It is when the ident line ends with ">" without
> any timestamp/tz info.   And the reason why we need to check that
> here is ...

Yeah, though as you saw in the next patch, it is not really sufficient. :)

I think it may be redundant, though.

In the original we already bailed earlier if we didn't find a ">"
(because "buf >= tail" after we advanced it looking for ">"). Here
"dateptr == buf" is checking the same thing (because we walked
backwards).

In the original we'd bail if we failed to find a newline as part of this
loop (because "buf >= tail" after looking for a newline). But that can't
happen here; we've already bailed after memchr() failed to find a
newline). So "dateptr == eol" only triggers if there were no characters
in "buf" to look at.

So I added it only to keep this comment trivially true:

> > -	/* dateptr < buf && buf[-1] == '\n', so parsing will stop at buf-1 */
> > +
> > +	/* dateptr < eol && *eol == '\n', so parsing will stop at eol */
> >  	return parse_timestamp(dateptr, NULL, 10);
> 
> ... because parse_timestamp() is merely strtoumax() and would
> happily skip over arbitrary number of leading "whitespace" without
> stopping if (dateptr == eol && *eol == '\n').  OK, sad but correct.

But it could also read "dateptr <= eol" and still be true (which is to
say it is mostly accurate, but not quite because of the "soaking up
whitespace" problem fixed by the next patch.

I'll leave the extra condition in this patch, since it's orthogonal to
what this patch is fixing. But in the next one I'll remove it and expand
the comment to explain a bit more.

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/3] fixing some parse_commit() timestamp corner cases
  2023-04-22 13:41           ` [PATCH 0/3] fixing some parse_commit() timestamp corner cases Jeff King
  2023-04-22 13:42             ` [PATCH 1/3] t4212: avoid putting git on left-hand side of pipe Jeff King
  2023-04-22 13:47             ` [PATCH 2/3] parse_commit(): parse timestamp from end of line Jeff King
@ 2023-04-24 16:39             ` Junio C Hamano
  2023-04-25  5:52             ` [PATCH v2 " Jeff King
  3 siblings, 0 replies; 46+ messages in thread
From: Junio C Hamano @ 2023-04-24 16:39 UTC (permalink / raw)
  To: Jeff King; +Cc: Thomas Bock, Derrick Stolee, git

Jeff King <peff@peff.net> writes:

> On Fri, Apr 21, 2023 at 04:51:03PM +0200, Thomas Bock wrote:
>
>> Even though the affected commit objects are malformed, it would be very
>> helpful if this problem could be solved somehow, from a user perspective.
>> Such malformed objects can potentially occur also in other comparably old
>> projects, where searching for commits that have been made in a specific time
>> window in the past could be useful or even necessary in some cases.
>
> Yeah, after sleeping on it for a bit, I think it is worth fixing. I also
> found another parsing bug in the same function. ;)

Nice.  Thanks.

>
> So here's the result.
>
>   [1/3]: t4212: avoid putting git on left-hand side of pipe
>   [2/3]: parse_commit(): parse timestamp from end of line
>   [3/3]: parse_commit(): handle broken whitespace-only timestamp
>
>  commit.c               | 29 +++++++++++++++++++++-------
>  t/t4212-log-corrupt.sh | 44 ++++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 64 insertions(+), 9 deletions(-)
>
> -Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH v2 0/3] fixing some parse_commit() timestamp corner cases
  2023-04-22 13:41           ` [PATCH 0/3] fixing some parse_commit() timestamp corner cases Jeff King
                               ` (2 preceding siblings ...)
  2023-04-24 16:39             ` [PATCH 0/3] fixing some parse_commit() timestamp corner cases Junio C Hamano
@ 2023-04-25  5:52             ` Jeff King
  2023-04-25  5:54               ` Jeff King
                                 ` (4 more replies)
  3 siblings, 5 replies; 46+ messages in thread
From: Jeff King @ 2023-04-25  5:52 UTC (permalink / raw)
  To: Thomas Bock; +Cc: Derrick Stolee, Junio C Hamano, git

Here's a v2 of my series. The behavior should be identical, but I've
incorporated some comment and small code tweaks based on feedback from
the first round.

I also added a fourth patch which adds a new comment explaining some of
the cases that were alluded to in the earlier round's patch 3.

  [1/4]: t4212: avoid putting git on left-hand side of pipe
  [2/4]: parse_commit(): parse timestamp from end of line
  [3/4]: parse_commit(): handle broken whitespace-only timestamp
  [4/4]: parse_commit(): describe more date-parsing failure modes

 commit.c               | 47 +++++++++++++++++++++++++++++++++++-------
 t/t4212-log-corrupt.sh | 39 +++++++++++++++++++++++++++++++++--
 2 files changed, 76 insertions(+), 10 deletions(-)

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 0/3] fixing some parse_commit() timestamp corner cases
  2023-04-25  5:52             ` [PATCH v2 " Jeff King
@ 2023-04-25  5:54               ` Jeff King
  2023-04-25  5:54               ` [PATCH v2 1/4] t4212: avoid putting git on left-hand side of pipe Jeff King
                                 ` (3 subsequent siblings)
  4 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2023-04-25  5:54 UTC (permalink / raw)
  To: Thomas Bock; +Cc: Derrick Stolee, Junio C Hamano, git

On Tue, Apr 25, 2023 at 01:52:45AM -0400, Jeff King wrote:

> Here's a v2 of my series. The behavior should be identical, but I've
> incorporated some comment and small code tweaks based on feedback from
> the first round.
> 
> I also added a fourth patch which adds a new comment explaining some of
> the cases that were alluded to in the earlier round's patch 3.
> 
>   [1/4]: t4212: avoid putting git on left-hand side of pipe
>   [2/4]: parse_commit(): parse timestamp from end of line
>   [3/4]: parse_commit(): handle broken whitespace-only timestamp
>   [4/4]: parse_commit(): describe more date-parsing failure modes
> 
>  commit.c               | 47 +++++++++++++++++++++++++++++++++++-------
>  t/t4212-log-corrupt.sh | 39 +++++++++++++++++++++++++++++++++--
>  2 files changed, 76 insertions(+), 10 deletions(-)

Whoops, forgot my range-diff (though nothing should be too surprising
based on the round 1 discussion):

1:  07932cf666 = 1:  ac38ce133d t4212: avoid putting git on left-hand side of pipe
2:  7ee34c7d5f ! 2:  f59e61262d parse_commit(): parse timestamp from end of line
    @@ Commit message
         parse back to the final ">". In theory we could use split_ident_line()
         here, but it's actually a bit more strict. In particular, it requires a
         valid time-zone token, too. That should be present, of course, but we
    -    wouldn't want to break --until for malformed cases that are working
    -    currently.
    +    wouldn't want to break --until for cases that are working currently.
     
         We might want to teach split_ident_line() to become more lenient there,
         but it would require checking its many callers (since right now they can
    @@ commit.c: static timestamp_t parse_commit_date(const char *buf, const char *tail
     -	if (buf >= tail)
     +
     +	/*
    -+	 * parse to end-of-line and then walk backwards, which
    -+	 * handles some malformed cases.
    ++	 * Jump to end-of-line so that we can walk backwards to find the
    ++	 * end-of-email ">". This is more forgiving of malformed cases
    ++	 * because unexpected characters tend to be in the name and email
    ++	 * fields.
     +	 */
     +	eol = memchr(buf, '\n', tail - buf);
     +	if (!eol)
      		return 0;
     -	dateptr = buf;
     -	while (buf < tail && *buf++ != '\n')
    -+	for (dateptr = eol; dateptr > buf && dateptr[-1] != '>'; dateptr--)
    - 		/* nada */;
    +-		/* nada */;
     -	if (buf >= tail)
    ++	dateptr = eol;
    ++	while (dateptr > buf && dateptr[-1] != '>')
    ++		dateptr--;
     +	if (dateptr == buf || dateptr == eol)
      		return 0;
     -	/* dateptr < buf && buf[-1] == '\n', so parsing will stop at buf-1 */
3:  e8e94083f5 ! 3:  c62fc59bf1 parse_commit(): handle broken whitespace-only timestamp
    @@ Commit message
         It's not subject to the same bug, because it insists that there be one
         or more digits in the timestamp.
     
    -    We can use the same logic here. If there's a non-whitespace but
    -    non-digit value (say "committer name <email> foo"), then
    -    parse_timestamp() would already have returned 0 anyway. So the only
    -    change should be for this "whitespace only" case.
    -
         Signed-off-by: Jeff King <peff@peff.net>
     
      ## commit.c ##
     @@ commit.c: static timestamp_t parse_commit_date(const char *buf, const char *tail)
    - 	if (dateptr == buf || dateptr == eol)
    + 	dateptr = eol;
    + 	while (dateptr > buf && dateptr[-1] != '>')
    + 		dateptr--;
    +-	if (dateptr == buf || dateptr == eol)
    ++	if (dateptr == buf)
      		return 0;
      
    +-	/* dateptr < eol && *eol == '\n', so parsing will stop at eol */
     +	/*
    -+	 * trim leading whitespace; parse_timestamp() will do this itself, but
    -+	 * it will walk past the newline at eol while doing so. So we insist
    -+	 * that there is at least one digit here.
    ++	 * Trim leading whitespace; parse_timestamp() will do this itself, but
    ++	 * if we have _only_ whitespace, it will walk right past the newline
    ++	 * while doing so.
     +	 */
     +	while (dateptr < eol && isspace(*dateptr))
     +		dateptr++;
    -+	if (!strchr("0123456789", *dateptr))
    ++	if (dateptr == eol)
     +		return 0;
     +
    - 	/* dateptr < eol && *eol == '\n', so parsing will stop at eol */
    ++	/*
    ++	 * We know there is at least one non-whitespace character, so we'll
    ++	 * begin parsing there and stop at worst case at eol.
    ++	 */
      	return parse_timestamp(dateptr, NULL, 10);
      }
    + 
     
      ## t/t4212-log-corrupt.sh ##
     @@ t/t4212-log-corrupt.sh: test_expect_success 'absurdly far-in-future date' '
-:  ---------- > 4:  28ed51a2ca parse_commit(): describe more date-parsing failure modes

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH v2 1/4] t4212: avoid putting git on left-hand side of pipe
  2023-04-25  5:52             ` [PATCH v2 " Jeff King
  2023-04-25  5:54               ` Jeff King
@ 2023-04-25  5:54               ` Jeff King
  2023-04-25  5:54               ` [PATCH v2 2/4] parse_commit(): parse timestamp from end of line Jeff King
                                 ` (2 subsequent siblings)
  4 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2023-04-25  5:54 UTC (permalink / raw)
  To: Thomas Bock; +Cc: Derrick Stolee, Junio C Hamano, git

We wouldn't expect cat-file to fail here, but it's good practice to
avoid putting git on the upstream of a pipe, as we otherwise ignore its
exit code.

Signed-off-by: Jeff King <peff@peff.net>
---
 t/t4212-log-corrupt.sh | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/t/t4212-log-corrupt.sh b/t/t4212-log-corrupt.sh
index e89e1f54b6..8b5433ea74 100755
--- a/t/t4212-log-corrupt.sh
+++ b/t/t4212-log-corrupt.sh
@@ -8,8 +8,9 @@ TEST_PASSES_SANITIZE_LEAK=true
 test_expect_success 'setup' '
 	test_commit foo &&
 
-	git cat-file commit HEAD |
-	sed "/^author /s/>/>-<>/" >broken_email.commit &&
+	git cat-file commit HEAD >ok.commit &&
+	sed "/^author /s/>/>-<>/" <ok.commit >broken_email.commit &&
+
 	git hash-object --literally -w -t commit broken_email.commit >broken_email.hash &&
 	git update-ref refs/heads/broken_email $(cat broken_email.hash)
 '
-- 
2.40.0.653.g15ca972062


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v2 2/4] parse_commit(): parse timestamp from end of line
  2023-04-25  5:52             ` [PATCH v2 " Jeff King
  2023-04-25  5:54               ` Jeff King
  2023-04-25  5:54               ` [PATCH v2 1/4] t4212: avoid putting git on left-hand side of pipe Jeff King
@ 2023-04-25  5:54               ` Jeff King
  2023-04-25  5:54               ` [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp Jeff King
  2023-04-25  5:55               ` [PATCH v2 4/4] parse_commit(): describe more date-parsing failure modes Jeff King
  4 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2023-04-25  5:54 UTC (permalink / raw)
  To: Thomas Bock; +Cc: Derrick Stolee, Junio C Hamano, git

To find the committer timestamp, we parse left-to-right looking for the
closing ">" of the email, and then expect the timestamp right after
that. But we've seen some broken cases in the wild where this fails, but
we _could_ find the timestamp with a little extra work. E.g.:

  Name <Name<email>> 123456789 -0500

This means that features that rely on the committer timestamp, like
--since or --until, will treat the commit as happening at time 0 (i.e.,
1970).

This is doubly confusing because the pretty-print parser learned to
handle these in 03818a4a94 (split_ident: parse timestamp from end of
line, 2013-10-14). So printing them via "git show", etc, makes
everything look normal, but --until, etc are still broken (despite the
fact that that commit explicitly mentioned --until!).

So let's use the same trick as 03818a4a94: find the end of the line, and
parse back to the final ">". In theory we could use split_ident_line()
here, but it's actually a bit more strict. In particular, it requires a
valid time-zone token, too. That should be present, of course, but we
wouldn't want to break --until for cases that are working currently.

We might want to teach split_ident_line() to become more lenient there,
but it would require checking its many callers (since right now they can
assume that if date_start is non-NULL, so is tz_start).

So for now we'll just reimplement the same trick in the commit parser.

The test is in t4212, which already covers similar cases, courtesy of
03818a4a94. We'll just adjust the broken commit to munge both the author
and committer timestamps. Note that we could match (author|committer)
here, but alternation can't be used portably in sed. Since we wouldn't
expect to see ">" except as part of an ident line, we can just match
that character on any line.

Signed-off-by: Jeff King <peff@peff.net>
---
 commit.c               | 24 ++++++++++++++++--------
 t/t4212-log-corrupt.sh |  7 ++++++-
 2 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/commit.c b/commit.c
index 6d844da9a6..bb340f66fa 100644
--- a/commit.c
+++ b/commit.c
@@ -95,6 +95,7 @@ struct commit *lookup_commit_reference_by_name(const char *name)
 static timestamp_t parse_commit_date(const char *buf, const char *tail)
 {
 	const char *dateptr;
+	const char *eol;

 	if (buf + 6 >= tail)
 		return 0;
@@ -106,16 +107,23 @@ static timestamp_t parse_commit_date(const char *buf, const char *tail)
 		return 0;
 	if (memcmp(buf, "committer", 9))
 		return 0;
-	while (buf < tail && *buf++ != '>')
-		/* nada */;
-	if (buf >= tail)
+
+	/*
+	 * Jump to end-of-line so that we can walk backwards to find the
+	 * end-of-email ">". This is more forgiving of malformed cases
+	 * because unexpected characters tend to be in the name and email
+	 * fields.
+	 */
+	eol = memchr(buf, '\n', tail - buf);
+	if (!eol)
 		return 0;
-	dateptr = buf;
-	while (buf < tail && *buf++ != '\n')
-		/* nada */;
-	if (buf >= tail)
+	dateptr = eol;
+	while (dateptr > buf && dateptr[-1] != '>')
+		dateptr--;
+	if (dateptr == buf || dateptr == eol)
 		return 0;
-	/* dateptr < buf && buf[-1] == '\n', so parsing will stop at buf-1 */
+
+	/* dateptr < eol && *eol == '\n', so parsing will stop at eol */
 	return parse_timestamp(dateptr, NULL, 10);
 }

diff --git a/t/t4212-log-corrupt.sh b/t/t4212-log-corrupt.sh
index 8b5433ea74..af4b35ff56 100755
--- a/t/t4212-log-corrupt.sh
+++ b/t/t4212-log-corrupt.sh
@@ -9,7 +9,7 @@ test_expect_success 'setup' '
 	test_commit foo &&

 	git cat-file commit HEAD >ok.commit &&
-	sed "/^author /s/>/>-<>/" <ok.commit >broken_email.commit &&
+	sed "s/>/>-<>/" <ok.commit >broken_email.commit &&

 	git hash-object --literally -w -t commit broken_email.commit >broken_email.hash &&
 	git update-ref refs/heads/broken_email $(cat broken_email.hash)
@@ -44,6 +44,11 @@ test_expect_success 'git log --format with broken author email' '
 	test_must_be_empty actual.err
 '

+test_expect_success '--until handles broken email' '
+	git rev-list --until=1980-01-01 broken_email >actual &&
+	test_must_be_empty actual
+'
+
 munge_author_date () {
 	git cat-file commit "$1" >commit.orig &&
 	sed "s/^\(author .*>\) [0-9]*/\1 $2/" <commit.orig >commit.munge &&
-- 
2.40.0.653.g15ca972062

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp
  2023-04-25  5:52             ` [PATCH v2 " Jeff King
                                 ` (2 preceding siblings ...)
  2023-04-25  5:54               ` [PATCH v2 2/4] parse_commit(): parse timestamp from end of line Jeff King
@ 2023-04-25  5:54               ` Jeff King
  2023-04-25 10:11                 ` Phillip Wood
  2023-04-25  5:55               ` [PATCH v2 4/4] parse_commit(): describe more date-parsing failure modes Jeff King
  4 siblings, 1 reply; 46+ messages in thread
From: Jeff King @ 2023-04-25  5:54 UTC (permalink / raw)
  To: Thomas Bock; +Cc: Derrick Stolee, Junio C Hamano, git

The comment in parse_commit_date() claims that parse_timestamp() will
not walk past the end of the buffer we've been given, since it will hit
the newline at "eol" and stop. This is usually true, when dateptr
contains actual numbers to parse. But with a line like:

   committer name <email>   \n

with just whitespace, and no numbers, parse_timestamp() will consume
that newline as part of the leading whitespace, and we may walk past our
"tail" pointer (which itself is set from the "size" parameter passed in
to parse_commit_buffer()).

In practice this can't cause us to walk off the end of an array, because
we always add an extra NUL byte to the end of objects we load from disk
(as a defense against exactly this kind of bug). However, you can see
the behavior in action when "committer" is the final header (which it
usually is, unless there's an encoding) and the subject line can be
parsed as an integer. We walk right past the newline on the committer
line, as well as the "\n\n" separator, and mistake the subject for the
timestamp.

The new test demonstrates such a case. I also added a test to check this
case against the pretty-print formatter, which uses split_ident_line().
It's not subject to the same bug, because it insists that there be one
or more digits in the timestamp.

Signed-off-by: Jeff King <peff@peff.net>
---
 commit.c               | 17 +++++++++++++++--
 t/t4212-log-corrupt.sh | 29 +++++++++++++++++++++++++++++
 2 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/commit.c b/commit.c
index bb340f66fa..2f1b5d505b 100644
--- a/commit.c
+++ b/commit.c
@@ -120,10 +120,23 @@ static timestamp_t parse_commit_date(const char *buf, const char *tail)
 	dateptr = eol;
 	while (dateptr > buf && dateptr[-1] != '>')
 		dateptr--;
-	if (dateptr == buf || dateptr == eol)
+	if (dateptr == buf)
 		return 0;
 
-	/* dateptr < eol && *eol == '\n', so parsing will stop at eol */
+	/*
+	 * Trim leading whitespace; parse_timestamp() will do this itself, but
+	 * if we have _only_ whitespace, it will walk right past the newline
+	 * while doing so.
+	 */
+	while (dateptr < eol && isspace(*dateptr))
+		dateptr++;
+	if (dateptr == eol)
+		return 0;
+
+	/*
+	 * We know there is at least one non-whitespace character, so we'll
+	 * begin parsing there and stop at worst case at eol.
+	 */
 	return parse_timestamp(dateptr, NULL, 10);
 }
 
diff --git a/t/t4212-log-corrupt.sh b/t/t4212-log-corrupt.sh
index af4b35ff56..d4ef48d646 100755
--- a/t/t4212-log-corrupt.sh
+++ b/t/t4212-log-corrupt.sh
@@ -92,4 +92,33 @@ test_expect_success 'absurdly far-in-future date' '
 	git log -1 --format=%ad $commit
 '
 
+test_expect_success 'create commit with whitespace committer date' '
+	# It is important that this subject line is numeric, since we want to
+	# be sure we are not confused by skipping whitespace and accidentally
+	# parsing the subject as a timestamp.
+	#
+	# Do not use munge_author_date here. Besides not hitting the committer
+	# line, it leaves the timezone intact, and we want nothing but
+	# whitespace.
+	test_commit 1234567890 &&
+	git cat-file commit HEAD >commit.orig &&
+	sed "s/>.*/>    /" <commit.orig >commit.munge &&
+	ws_commit=$(git hash-object --literally -w -t commit commit.munge)
+'
+
+test_expect_success '--until treats whitespace date as sentinel' '
+	echo $ws_commit >expect &&
+	git rev-list --until=1980-01-01 $ws_commit >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'pretty-printer handles whitespace date' '
+	# as with the %ad test above, we will show these as the empty string,
+	# not the 1970 epoch date. This is intentional; see 7d9a281941 (t4212:
+	# test bogus timestamps with git-log, 2014-02-24) for more discussion.
+	echo : >expect &&
+	git log -1 --format="%at:%ct" $ws_commit >actual &&
+	test_cmp expect actual
+'
+
 test_done
-- 
2.40.0.653.g15ca972062


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp
  2023-04-25  5:54               ` [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp Jeff King
@ 2023-04-25 10:11                 ` Phillip Wood
  2023-04-25 16:06                   ` Junio C Hamano
  0 siblings, 1 reply; 46+ messages in thread
From: Phillip Wood @ 2023-04-25 10:11 UTC (permalink / raw)
  To: Jeff King, Thomas Bock
  Cc: Derrick Stolee, Junio C Hamano, git, René Scharfe

Hi Peff

On 25/04/2023 06:54, Jeff King wrote:
> The comment in parse_commit_date() claims that parse_timestamp() will
> not walk past the end of the buffer we've been given, since it will hit
> the newline at "eol" and stop. This is usually true, when dateptr
> contains actual numbers to parse. But with a line like:
> 
>     committer name <email>   \n
> 
> with just whitespace, and no numbers, parse_timestamp() will consume
> that newline as part of the leading whitespace, and we may walk past our
> "tail" pointer (which itself is set from the "size" parameter passed in
> to parse_commit_buffer()).
> 
> In practice this can't cause us to walk off the end of an array, because
> we always add an extra NUL byte to the end of objects we load from disk
> (as a defense against exactly this kind of bug). However, you can see
> the behavior in action when "committer" is the final header (which it
> usually is, unless there's an encoding) and the subject line can be
> parsed as an integer. We walk right past the newline on the committer
> line, as well as the "\n\n" separator, and mistake the subject for the
> timestamp.
> 
> The new test demonstrates such a case. I also added a test to check this
> case against the pretty-print formatter, which uses split_ident_line().
> It's not subject to the same bug, because it insists that there be one
> or more digits in the timestamp.
> 
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>   commit.c               | 17 +++++++++++++++--
>   t/t4212-log-corrupt.sh | 29 +++++++++++++++++++++++++++++
>   2 files changed, 44 insertions(+), 2 deletions(-)
> 
> diff --git a/commit.c b/commit.c
> index bb340f66fa..2f1b5d505b 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -120,10 +120,23 @@ static timestamp_t parse_commit_date(const char *buf, const char *tail)
>   	dateptr = eol;
>   	while (dateptr > buf && dateptr[-1] != '>')
>   		dateptr--;
> -	if (dateptr == buf || dateptr == eol)
> +	if (dateptr == buf)
>   		return 0;
>   
> -	/* dateptr < eol && *eol == '\n', so parsing will stop at eol */
> +	/*
> +	 * Trim leading whitespace; parse_timestamp() will do this itself, but
> +	 * if we have _only_ whitespace, it will walk right past the newline
> +	 * while doing so.
> +	 */
> +	while (dateptr < eol && isspace(*dateptr))
> +		dateptr++;
> +	if (dateptr == eol)
> +		return 0;
> +
> +	/*
> +	 * We know there is at least one non-whitespace character, so we'll
> +	 * begin parsing there and stop at worst case at eol.
> +	 */

This probably doesn't matter in practice but we define our own isspace() 
that does not treat '\v' and '\f' as whitespace. However 
parse_timestamp() (which is just strtoumax()) uses the standard 
library's isspace() which does treat those characters as whitespace and 
is locale dependent. This means we can potentially stop at a character 
that parse_timestamp() treats as whitespace and if there are no digits 
after it we'll still walk past the end of the line. Using Rene's 
suggestion of testing the character with isdigit() would fix that. It 
would also avoid parsing negative timestamps as positive numbers and 
reject any timestamps that begin with a locale dependent digit.

I'm not familiar with this code, but would it be worth changing 
parse_timestamp() to stop parsing if it sees a newline?

Best Wishes

Phillip

>   	return parse_timestamp(dateptr, NULL, 10);
>   }
>   
> diff --git a/t/t4212-log-corrupt.sh b/t/t4212-log-corrupt.sh
> index af4b35ff56..d4ef48d646 100755
> --- a/t/t4212-log-corrupt.sh
> +++ b/t/t4212-log-corrupt.sh
> @@ -92,4 +92,33 @@ test_expect_success 'absurdly far-in-future date' '
>   	git log -1 --format=%ad $commit
>   '
>   
> +test_expect_success 'create commit with whitespace committer date' '
> +	# It is important that this subject line is numeric, since we want to
> +	# be sure we are not confused by skipping whitespace and accidentally
> +	# parsing the subject as a timestamp.
> +	#
> +	# Do not use munge_author_date here. Besides not hitting the committer
> +	# line, it leaves the timezone intact, and we want nothing but
> +	# whitespace.
> +	test_commit 1234567890 &&
> +	git cat-file commit HEAD >commit.orig &&
> +	sed "s/>.*/>    /" <commit.orig >commit.munge &&
> +	ws_commit=$(git hash-object --literally -w -t commit commit.munge)
> +'
> +
> +test_expect_success '--until treats whitespace date as sentinel' '
> +	echo $ws_commit >expect &&
> +	git rev-list --until=1980-01-01 $ws_commit >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'pretty-printer handles whitespace date' '
> +	# as with the %ad test above, we will show these as the empty string,
> +	# not the 1970 epoch date. This is intentional; see 7d9a281941 (t4212:
> +	# test bogus timestamps with git-log, 2014-02-24) for more discussion.
> +	echo : >expect &&
> +	git log -1 --format="%at:%ct" $ws_commit >actual &&
> +	test_cmp expect actual
> +'
> +
>   test_done


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp
  2023-04-25 10:11                 ` Phillip Wood
@ 2023-04-25 16:06                   ` Junio C Hamano
  2023-04-26 11:36                     ` Jeff King
  2023-04-26 14:06                     ` [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp Phillip Wood
  0 siblings, 2 replies; 46+ messages in thread
From: Junio C Hamano @ 2023-04-25 16:06 UTC (permalink / raw)
  To: Phillip Wood
  Cc: Jeff King, Thomas Bock, Derrick Stolee, git, René Scharfe

Phillip Wood <phillip.wood123@gmail.com> writes:

> This probably doesn't matter in practice but we define our own
> isspace() that does not treat '\v' and '\f' as whitespace. However
> parse_timestamp() (which is just strtoumax()) uses the standard
> library's isspace() which does treat those characters as whitespace
> and is locale dependent. This means we can potentially stop at a
> character that parse_timestamp() treats as whitespace and if there are
> no digits after it we'll still walk past the end of the line. Using
> Rene's suggestion of testing the character with isdigit() would fix
> that. It would also avoid parsing negative timestamps as positive
> numbers and reject any timestamps that begin with a locale dependent
> digit.

A very interesting observation.  I wonder if a curious person can
craft a malformed timestamp with "hash-object --literally" to do
more than DoS themselves?

We are not going to put anything other than [ 0-9+-] after the '>'
we scan for, and making sure '>' is followed by SP and then [0-9]
would be sufficient to ensure strtoumax() to stop before the '\n'
but does not ensure that the "signal a bad timestamp with 0"
happens.  Perhaps that would be sufficient.  I dunno.

> I'm not familiar with this code, but would it be worth changing
> parse_timestamp() to stop parsing if it sees a newline?

Meaning replace or write our own strtoumax() equivalent?


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp
  2023-04-25 16:06                   ` Junio C Hamano
@ 2023-04-26 11:36                     ` Jeff King
  2023-04-26 15:32                       ` Junio C Hamano
  2023-04-26 14:06                     ` [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp Phillip Wood
  1 sibling, 1 reply; 46+ messages in thread
From: Jeff King @ 2023-04-26 11:36 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Phillip Wood, Thomas Bock, Derrick Stolee, git, René Scharfe

On Tue, Apr 25, 2023 at 09:06:47AM -0700, Junio C Hamano wrote:

> Phillip Wood <phillip.wood123@gmail.com> writes:
> 
> > This probably doesn't matter in practice but we define our own
> > isspace() that does not treat '\v' and '\f' as whitespace. However
> > parse_timestamp() (which is just strtoumax()) uses the standard
> > library's isspace() which does treat those characters as whitespace
> > and is locale dependent. This means we can potentially stop at a
> > character that parse_timestamp() treats as whitespace and if there are
> > no digits after it we'll still walk past the end of the line. Using
> > Rene's suggestion of testing the character with isdigit() would fix
> > that. It would also avoid parsing negative timestamps as positive
> > numbers and reject any timestamps that begin with a locale dependent
> > digit.
> 
> A very interesting observation.  I wonder if a curious person can
> craft a malformed timestamp with "hash-object --literally" to do
> more than DoS themselves?

I think the answer is no, because the worst case is that they read to
the trailing NUL that we stick after any object content we read into
memory. So we'd mis-parse:

  committer name <email> \v\n

  123456 in the subject line

to read "123456" as the commit timestamp (so basically the same bug my
patch was trying to fix). But we'd never read out-of-bounds memory.
Still, it does not give me warm fuzzies, and I think is worth fixing.

> We are not going to put anything other than [ 0-9+-] after the '>'
> we scan for, and making sure '>' is followed by SP and then [0-9]
> would be sufficient to ensure strtoumax() to stop before the '\n'
> but does not ensure that the "signal a bad timestamp with 0"
> happens.  Perhaps that would be sufficient.  I dunno.

Any single non-whitespace character at all would be sufficient to avoid
the problem. And that's what the current iteration of the patch is
trying to do. It's just that our definition of "whitespace" has to agree
with strtoumax()'s for it to work. And as Phillip notes, that may even
include locale dependent characters. So I don't think we want to get
into trying to match them all (i.e., a "allow known" strategy).

Instead, we should go back to what the original iteration of the series
was doing, and make sure there is at least one digit (i.e., a "forbid
unknown" strategy). Assuming that there is no locale where ascii "1" is
considered whitespace. ;)

Note that will exclude a few cases that we do allow now, like:

  committer name <email> \v123456 +0000\n

Right now that parses as "123456", but we'd reject it as "0" after such
a patch.

The alternative is to check _all_ of the characters between ">" and the
newline and make sure there is some digit somewhere, which would be
sufficient to prevent strtoumax() from walking past the newline.

I guess it's not even any more expensive in the normal case (since the
very first non-whitespace entry should be a digit!). I'm not sure it's
worth caring about too much either way. Garbage making it into
name/email is an easy mistake to make (for users and implementations).
Putting whitespace control codes into your timestamp is not, and marking
them as "0" is an OK outcome.

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp
  2023-04-26 11:36                     ` Jeff King
@ 2023-04-26 15:32                       ` Junio C Hamano
  2023-04-27  8:13                         ` [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases Jeff King
  0 siblings, 1 reply; 46+ messages in thread
From: Junio C Hamano @ 2023-04-26 15:32 UTC (permalink / raw)
  To: Jeff King
  Cc: Phillip Wood, Thomas Bock, Derrick Stolee, git, René Scharfe

Jeff King <peff@peff.net> writes:

> Instead, we should go back to what the original iteration of the series
> was doing, and make sure there is at least one digit (i.e., a "forbid
> unknown" strategy). Assuming that there is no locale where ascii "1" is
> considered whitespace. ;)
>
> Note that will exclude a few cases that we do allow now, like:
>
>   committer name <email> \v123456 +0000\n
>
> Right now that parses as "123456", but we'd reject it as "0" after such
> a patch.

I would say that it is a good thing.

The only (somewhat) end-user controlled things on the line are the
name and email, and even there name is sanitized to remove "crud".
The user-supplied timestamp goes through date.c::parse_date(),
ending up with what date.c::date_string() formats, so there will not
be syntactically incorrect timestamp there.  So we can be strict
format-wise on the timestamp field, once we identify where it begins,
which is the point of scanning backwards for '>'.

Unless the user does "hash-object" and deliberately creates a
malformed commit object---they can keep both halves just fine in
such a case as long as we do reject such a timestamp correctly.

> The alternative is to check _all_ of the characters between ">" and the
> newline and make sure there is some digit somewhere, which would be
> sufficient to prevent strtoumax() from walking past the newline.
>
> I guess it's not even any more expensive in the normal case (since the
> very first non-whitespace entry should be a digit!). I'm not sure it's
> worth caring about too much either way. Garbage making it into
> name/email is an easy mistake to make (for users and implementations).
> Putting whitespace control codes into your timestamp is not, and marking
> them as "0" is an OK outcome.

Yeah, I think it is fine either way.

Thanks

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases
  2023-04-26 15:32                       ` Junio C Hamano
@ 2023-04-27  8:13                         ` Jeff King
  2023-04-27  8:14                           ` [PATCH v3 1/4] t4212: avoid putting git on left-hand side of pipe Jeff King
                                             ` (5 more replies)
  0 siblings, 6 replies; 46+ messages in thread
From: Jeff King @ 2023-04-27  8:13 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Phillip Wood, Thomas Bock, Derrick Stolee, git, René Scharfe

On Wed, Apr 26, 2023 at 08:32:49AM -0700, Junio C Hamano wrote:

> > Note that will exclude a few cases that we do allow now, like:
> >
> >   committer name <email> \v123456 +0000\n
> >
> > Right now that parses as "123456", but we'd reject it as "0" after such
> > a patch.
> 
> I would say that it is a good thing.
> 
> The only (somewhat) end-user controlled things on the line are the
> name and email, and even there name is sanitized to remove "crud".
> The user-supplied timestamp goes through date.c::parse_date(),
> ending up with what date.c::date_string() formats, so there will not
> be syntactically incorrect timestamp there.  So we can be strict
> format-wise on the timestamp field, once we identify where it begins,
> which is the point of scanning backwards for '>'.
> 
> Unless the user does "hash-object" and deliberately creates a
> malformed commit object---they can keep both halves just fine in
> such a case as long as we do reject such a timestamp correctly.

I think we'd ideally consider the behavior against hypothetical bugs in
other implementations (including us in the future!). So yeah, I don't
think we ever generated a syntactically incorrect timestamp, and it
would be hard for a user to create one. But all things being equal, I'd
prefer to keep parsing something like:

  committer name <email> 123456\n

which is missing its timezone (and seems like a plausible sort of bug).
But I'm OK to draw the line at "if your implementation is sticking
control characters in the header, then tough luck".

So here's a v3. I was tempted to add the fix on top of the existing
patch, since it's somewhat its own case, and could be explained
separately. But they really are two versions of the same problem, so I
just rolled it all into patch 3.

Patch 4 needed small updates to its comment to match. The first two
patches are the same.

  [1/4]: t4212: avoid putting git on left-hand side of pipe
  [2/4]: parse_commit(): parse timestamp from end of line
  [3/4]: parse_commit(): handle broken whitespace-only timestamp
  [4/4]: parse_commit(): describe more date-parsing failure modes

 commit.c               | 57 ++++++++++++++++++++++++++++++++++++------
 t/t4212-log-corrupt.sh | 51 +++++++++++++++++++++++++++++++++++--
 2 files changed, 98 insertions(+), 10 deletions(-)

1:  7a2fa8daac = 1:  57401571b6 t4212: avoid putting git on left-hand side of pipe
2:  d90c720075 = 2:  54fa983d66 parse_commit(): parse timestamp from end of line
3:  1a47c87c07 ! 3:  894815d82d parse_commit(): handle broken whitespace-only timestamp
    @@ Commit message
         line, as well as the "\n\n" separator, and mistake the subject for the
         timestamp.
     
    -    The new test demonstrates such a case. I also added a test to check this
    -    case against the pretty-print formatter, which uses split_ident_line().
    -    It's not subject to the same bug, because it insists that there be one
    -    or more digits in the timestamp.
    +    We can solve this by trimming the whitespace ourselves, making sure that
    +    it has some non-whitespace to parse. Note that we need to be a bit
    +    careful about the definition of "whitespace" here, as our isspace()
    +    doesn't match exotic characters like vertical tab or formfeed. We can
    +    work around that by checking for an actual number (see the in-code
    +    comment). This is slightly more restrictive than the current code, but
    +    in practice the results are either the same (we reject "foo" as "0", but
    +    so would parse_timestamp()) or extremely unlikely even for broken
    +    commits (parse_timestamp() would allow "\v123" as "123", but we'll know
    +    make it "0").
     
    +    I did also allow "-" here, which may be controversial, as we don't
    +    currently support negative timestamps. My reasoning was two-fold. One,
    +    the design of parse_timestamp() is such that we should be able to easily
    +    switch it to handling signed values, and this otherwise creates a
    +    hard-to-find gotcha that anybody doing that work would get tripped up
    +    on. And two, the status quo is that we currently parse them, though the
    +    result of course ends up as a very large unsigned value (which is likely
    +    to just get clamped to "0" for display anyway, since our date routines
    +    can't handle it).
    +
    +    The new test checks the commit parser (via "--until") for both vanilla
    +    spaces and the vertical-tab case. I also added a test to check these
    +    against the pretty-print formatter, which uses split_ident_line().  It's
    +    not subject to the same bug, because it already insists that there be
    +    one or more digits in the timestamp.
    +
    +    Helped-by: Phillip Wood <phillip.wood123@gmail.com>
         Signed-off-by: Jeff King <peff@peff.net>
     
      ## commit.c ##
    @@ commit.c: static timestamp_t parse_commit_date(const char *buf, const char *tail
      
     -	/* dateptr < eol && *eol == '\n', so parsing will stop at eol */
     +	/*
    -+	 * Trim leading whitespace; parse_timestamp() will do this itself, but
    -+	 * if we have _only_ whitespace, it will walk right past the newline
    -+	 * while doing so.
    ++	 * Trim leading whitespace, but make sure we have at least one
    ++	 * non-whitespace character, as parse_timestamp() will otherwise walk
    ++	 * right past the newline we found in "eol" when skipping whitespace
    ++	 * itself.
    ++	 *
    ++	 * In theory it would be sufficient to allow any character not matched
    ++	 * by isspace(), but there's a catch: our isspace() does not
    ++	 * necessarily match the behavior of parse_timestamp(), as the latter
    ++	 * is implemented by system routines which match more exotic control
    ++	 * codes, or even locale-dependent sequences.
    ++	 *
    ++	 * Since we expect the timestamp to be a number, we can check for that.
    ++	 * Anything else (e.g., a non-numeric token like "foo") would just
    ++	 * cause parse_timestamp() to return 0 anyway.
     +	 */
     +	while (dateptr < eol && isspace(*dateptr))
     +		dateptr++;
    -+	if (dateptr == eol)
    ++	if (!isdigit(*dateptr) && *dateptr != '-')
     +		return 0;
     +
     +	/*
    -+	 * We know there is at least one non-whitespace character, so we'll
    -+	 * begin parsing there and stop at worst case at eol.
    ++	 * We know there is at least one digit (or dash), so we'll begin
    ++	 * parsing there and stop at worst case at eol.
     +	 */
      	return parse_timestamp(dateptr, NULL, 10);
      }
    @@ t/t4212-log-corrupt.sh: test_expect_success 'absurdly far-in-future date' '
      	git log -1 --format=%ad $commit
      '
      
    -+test_expect_success 'create commit with whitespace committer date' '
    ++test_expect_success 'create commits with whitespace committer dates' '
     +	# It is important that this subject line is numeric, since we want to
     +	# be sure we are not confused by skipping whitespace and accidentally
     +	# parsing the subject as a timestamp.
     +	#
     +	# Do not use munge_author_date here. Besides not hitting the committer
     +	# line, it leaves the timezone intact, and we want nothing but
     +	# whitespace.
    ++	#
    ++	# We will make two munged commits here. The first, ws_commit, will
    ++	# be purely spaces. The second contains a vertical tab, which is
    ++	# considered a space by strtoumax(), but not by our isspace().
     +	test_commit 1234567890 &&
     +	git cat-file commit HEAD >commit.orig &&
     +	sed "s/>.*/>    /" <commit.orig >commit.munge &&
    -+	ws_commit=$(git hash-object --literally -w -t commit commit.munge)
    ++	ws_commit=$(git hash-object --literally -w -t commit commit.munge) &&
    ++	sed "s/>.*/>   $(printf "\013")/" <commit.orig >commit.munge &&
    ++	vt_commit=$(git hash-object --literally -w -t commit commit.munge)
     +'
     +
     +test_expect_success '--until treats whitespace date as sentinel' '
     +	echo $ws_commit >expect &&
     +	git rev-list --until=1980-01-01 $ws_commit >actual &&
    ++	test_cmp expect actual &&
    ++
    ++	echo $vt_commit >expect &&
    ++	git rev-list --until=1980-01-01 $vt_commit >actual &&
     +	test_cmp expect actual
     +'
     +
    @@ t/t4212-log-corrupt.sh: test_expect_success 'absurdly far-in-future date' '
     +	# test bogus timestamps with git-log, 2014-02-24) for more discussion.
     +	echo : >expect &&
     +	git log -1 --format="%at:%ct" $ws_commit >actual &&
    ++	test_cmp expect actual &&
    ++	git log -1 --format="%at:%ct" $vt_commit &&
     +	test_cmp expect actual
     +'
     +
4:  193a01a32a < -:  ---------- parse_commit(): describe more date-parsing failure modes
-:  ---------- > 4:  ff7e9ddc7c parse_commit(): describe more date-parsing failure modes

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH v3 1/4] t4212: avoid putting git on left-hand side of pipe
  2023-04-27  8:13                         ` [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases Jeff King
@ 2023-04-27  8:14                           ` Jeff King
  2023-04-27  8:14                           ` [PATCH v3 2/4] parse_commit(): parse timestamp from end of line Jeff King
                                             ` (4 subsequent siblings)
  5 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2023-04-27  8:14 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Phillip Wood, Thomas Bock, Derrick Stolee, git, René Scharfe

We wouldn't expect cat-file to fail here, but it's good practice to
avoid putting git on the upstream of a pipe, as we otherwise ignore its
exit code.

Signed-off-by: Jeff King <peff@peff.net>
---
 t/t4212-log-corrupt.sh | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/t/t4212-log-corrupt.sh b/t/t4212-log-corrupt.sh
index e89e1f54b6..8b5433ea74 100755
--- a/t/t4212-log-corrupt.sh
+++ b/t/t4212-log-corrupt.sh
@@ -8,8 +8,9 @@ TEST_PASSES_SANITIZE_LEAK=true
 test_expect_success 'setup' '
 	test_commit foo &&
 
-	git cat-file commit HEAD |
-	sed "/^author /s/>/>-<>/" >broken_email.commit &&
+	git cat-file commit HEAD >ok.commit &&
+	sed "/^author /s/>/>-<>/" <ok.commit >broken_email.commit &&
+
 	git hash-object --literally -w -t commit broken_email.commit >broken_email.hash &&
 	git update-ref refs/heads/broken_email $(cat broken_email.hash)
 '
-- 
2.40.1.663.g410c33770c.dirty


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v3 2/4] parse_commit(): parse timestamp from end of line
  2023-04-27  8:13                         ` [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases Jeff King
  2023-04-27  8:14                           ` [PATCH v3 1/4] t4212: avoid putting git on left-hand side of pipe Jeff King
@ 2023-04-27  8:14                           ` Jeff King
  2023-04-27  8:17                           ` [PATCH v3 3/4] parse_commit(): handle broken whitespace-only timestamp Jeff King
                                             ` (3 subsequent siblings)
  5 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2023-04-27  8:14 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Phillip Wood, Thomas Bock, Derrick Stolee, git, René Scharfe

To find the committer timestamp, we parse left-to-right looking for the
closing ">" of the email, and then expect the timestamp right after
that. But we've seen some broken cases in the wild where this fails, but
we _could_ find the timestamp with a little extra work. E.g.:

  Name <Name<email>> 123456789 -0500

This means that features that rely on the committer timestamp, like
--since or --until, will treat the commit as happening at time 0 (i.e.,
1970).

This is doubly confusing because the pretty-print parser learned to
handle these in 03818a4a94 (split_ident: parse timestamp from end of
line, 2013-10-14). So printing them via "git show", etc, makes
everything look normal, but --until, etc are still broken (despite the
fact that that commit explicitly mentioned --until!).

So let's use the same trick as 03818a4a94: find the end of the line, and
parse back to the final ">". In theory we could use split_ident_line()
here, but it's actually a bit more strict. In particular, it requires a
valid time-zone token, too. That should be present, of course, but we
wouldn't want to break --until for cases that are working currently.

We might want to teach split_ident_line() to become more lenient there,
but it would require checking its many callers (since right now they can
assume that if date_start is non-NULL, so is tz_start).

So for now we'll just reimplement the same trick in the commit parser.

The test is in t4212, which already covers similar cases, courtesy of
03818a4a94. We'll just adjust the broken commit to munge both the author
and committer timestamps. Note that we could match (author|committer)
here, but alternation can't be used portably in sed. Since we wouldn't
expect to see ">" except as part of an ident line, we can just match
that character on any line.

Signed-off-by: Jeff King <peff@peff.net>
---
 commit.c               | 24 ++++++++++++++++--------
 t/t4212-log-corrupt.sh |  7 ++++++-
 2 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/commit.c b/commit.c
index 878b4473e4..04c20d9cc6 100644
--- a/commit.c
+++ b/commit.c
@@ -96,6 +96,7 @@ struct commit *lookup_commit_reference_by_name(const char *name)
 static timestamp_t parse_commit_date(const char *buf, const char *tail)
 {
 	const char *dateptr;
+	const char *eol;

 	if (buf + 6 >= tail)
 		return 0;
@@ -107,16 +108,23 @@ static timestamp_t parse_commit_date(const char *buf, const char *tail)
 		return 0;
 	if (memcmp(buf, "committer", 9))
 		return 0;
-	while (buf < tail && *buf++ != '>')
-		/* nada */;
-	if (buf >= tail)
+
+	/*
+	 * Jump to end-of-line so that we can walk backwards to find the
+	 * end-of-email ">". This is more forgiving of malformed cases
+	 * because unexpected characters tend to be in the name and email
+	 * fields.
+	 */
+	eol = memchr(buf, '\n', tail - buf);
+	if (!eol)
 		return 0;
-	dateptr = buf;
-	while (buf < tail && *buf++ != '\n')
-		/* nada */;
-	if (buf >= tail)
+	dateptr = eol;
+	while (dateptr > buf && dateptr[-1] != '>')
+		dateptr--;
+	if (dateptr == buf || dateptr == eol)
 		return 0;
-	/* dateptr < buf && buf[-1] == '\n', so parsing will stop at buf-1 */
+
+	/* dateptr < eol && *eol == '\n', so parsing will stop at eol */
 	return parse_timestamp(dateptr, NULL, 10);
 }

diff --git a/t/t4212-log-corrupt.sh b/t/t4212-log-corrupt.sh
index 8b5433ea74..af4b35ff56 100755
--- a/t/t4212-log-corrupt.sh
+++ b/t/t4212-log-corrupt.sh
@@ -9,7 +9,7 @@ test_expect_success 'setup' '
 	test_commit foo &&

 	git cat-file commit HEAD >ok.commit &&
-	sed "/^author /s/>/>-<>/" <ok.commit >broken_email.commit &&
+	sed "s/>/>-<>/" <ok.commit >broken_email.commit &&

 	git hash-object --literally -w -t commit broken_email.commit >broken_email.hash &&
 	git update-ref refs/heads/broken_email $(cat broken_email.hash)
@@ -44,6 +44,11 @@ test_expect_success 'git log --format with broken author email' '
 	test_must_be_empty actual.err
 '

+test_expect_success '--until handles broken email' '
+	git rev-list --until=1980-01-01 broken_email >actual &&
+	test_must_be_empty actual
+'
+
 munge_author_date () {
 	git cat-file commit "$1" >commit.orig &&
 	sed "s/^\(author .*>\) [0-9]*/\1 $2/" <commit.orig >commit.munge &&
-- 
2.40.1.663.g410c33770c.dirty

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v3 3/4] parse_commit(): handle broken whitespace-only timestamp
  2023-04-27  8:13                         ` [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases Jeff King
  2023-04-27  8:14                           ` [PATCH v3 1/4] t4212: avoid putting git on left-hand side of pipe Jeff King
  2023-04-27  8:14                           ` [PATCH v3 2/4] parse_commit(): parse timestamp from end of line Jeff King
@ 2023-04-27  8:17                           ` Jeff King
  2023-04-27 10:11                             ` Phillip Wood
  2023-04-27 16:25                             ` Junio C Hamano
  2023-04-27  8:17                           ` [PATCH v3 4/4] parse_commit(): describe more date-parsing failure modes Jeff King
                                             ` (2 subsequent siblings)
  5 siblings, 2 replies; 46+ messages in thread
From: Jeff King @ 2023-04-27  8:17 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Phillip Wood, Thomas Bock, Derrick Stolee, git, René Scharfe

The comment in parse_commit_date() claims that parse_timestamp() will
not walk past the end of the buffer we've been given, since it will hit
the newline at "eol" and stop. This is usually true, when dateptr
contains actual numbers to parse. But with a line like:

   committer name <email>   \n

with just whitespace, and no numbers, parse_timestamp() will consume
that newline as part of the leading whitespace, and we may walk past our
"tail" pointer (which itself is set from the "size" parameter passed in
to parse_commit_buffer()).

In practice this can't cause us to walk off the end of an array, because
we always add an extra NUL byte to the end of objects we load from disk
(as a defense against exactly this kind of bug). However, you can see
the behavior in action when "committer" is the final header (which it
usually is, unless there's an encoding) and the subject line can be
parsed as an integer. We walk right past the newline on the committer
line, as well as the "\n\n" separator, and mistake the subject for the
timestamp.

We can solve this by trimming the whitespace ourselves, making sure that
it has some non-whitespace to parse. Note that we need to be a bit
careful about the definition of "whitespace" here, as our isspace()
doesn't match exotic characters like vertical tab or formfeed. We can
work around that by checking for an actual number (see the in-code
comment). This is slightly more restrictive than the current code, but
in practice the results are either the same (we reject "foo" as "0", but
so would parse_timestamp()) or extremely unlikely even for broken
commits (parse_timestamp() would allow "\v123" as "123", but we'll now
make it "0").

I did also allow "-" here, which may be controversial, as we don't
currently support negative timestamps. My reasoning was two-fold. One,
the design of parse_timestamp() is such that we should be able to easily
switch it to handling signed values, and this otherwise creates a
hard-to-find gotcha that anybody doing that work would get tripped up
on. And two, the status quo is that we currently parse them, though the
result of course ends up as a very large unsigned value (which is likely
to just get clamped to "0" for display anyway, since our date routines
can't handle it).

The new test checks the commit parser (via "--until") for both vanilla
spaces and the vertical-tab case. I also added a test to check these
against the pretty-print formatter, which uses split_ident_line().  It's
not subject to the same bug, because it already insists that there be
one or more digits in the timestamp.

Helped-by: Phillip Wood <phillip.wood123@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
---
 commit.c               | 28 ++++++++++++++++++++++++++--
 t/t4212-log-corrupt.sh | 41 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 67 insertions(+), 2 deletions(-)

diff --git a/commit.c b/commit.c
index 04c20d9cc6..8dfe92cf37 100644
--- a/commit.c
+++ b/commit.c
@@ -121,10 +121,34 @@ static timestamp_t parse_commit_date(const char *buf, const char *tail)
 	dateptr = eol;
 	while (dateptr > buf && dateptr[-1] != '>')
 		dateptr--;
-	if (dateptr == buf || dateptr == eol)
+	if (dateptr == buf)
 		return 0;
 
-	/* dateptr < eol && *eol == '\n', so parsing will stop at eol */
+	/*
+	 * Trim leading whitespace, but make sure we have at least one
+	 * non-whitespace character, as parse_timestamp() will otherwise walk
+	 * right past the newline we found in "eol" when skipping whitespace
+	 * itself.
+	 *
+	 * In theory it would be sufficient to allow any character not matched
+	 * by isspace(), but there's a catch: our isspace() does not
+	 * necessarily match the behavior of parse_timestamp(), as the latter
+	 * is implemented by system routines which match more exotic control
+	 * codes, or even locale-dependent sequences.
+	 *
+	 * Since we expect the timestamp to be a number, we can check for that.
+	 * Anything else (e.g., a non-numeric token like "foo") would just
+	 * cause parse_timestamp() to return 0 anyway.
+	 */
+	while (dateptr < eol && isspace(*dateptr))
+		dateptr++;
+	if (!isdigit(*dateptr) && *dateptr != '-')
+		return 0;
+
+	/*
+	 * We know there is at least one digit (or dash), so we'll begin
+	 * parsing there and stop at worst case at eol.
+	 */
 	return parse_timestamp(dateptr, NULL, 10);
 }
 
diff --git a/t/t4212-log-corrupt.sh b/t/t4212-log-corrupt.sh
index af4b35ff56..85e90acb09 100755
--- a/t/t4212-log-corrupt.sh
+++ b/t/t4212-log-corrupt.sh
@@ -92,4 +92,45 @@ test_expect_success 'absurdly far-in-future date' '
 	git log -1 --format=%ad $commit
 '
 
+test_expect_success 'create commits with whitespace committer dates' '
+	# It is important that this subject line is numeric, since we want to
+	# be sure we are not confused by skipping whitespace and accidentally
+	# parsing the subject as a timestamp.
+	#
+	# Do not use munge_author_date here. Besides not hitting the committer
+	# line, it leaves the timezone intact, and we want nothing but
+	# whitespace.
+	#
+	# We will make two munged commits here. The first, ws_commit, will
+	# be purely spaces. The second contains a vertical tab, which is
+	# considered a space by strtoumax(), but not by our isspace().
+	test_commit 1234567890 &&
+	git cat-file commit HEAD >commit.orig &&
+	sed "s/>.*/>    /" <commit.orig >commit.munge &&
+	ws_commit=$(git hash-object --literally -w -t commit commit.munge) &&
+	sed "s/>.*/>   $(printf "\013")/" <commit.orig >commit.munge &&
+	vt_commit=$(git hash-object --literally -w -t commit commit.munge)
+'
+
+test_expect_success '--until treats whitespace date as sentinel' '
+	echo $ws_commit >expect &&
+	git rev-list --until=1980-01-01 $ws_commit >actual &&
+	test_cmp expect actual &&
+
+	echo $vt_commit >expect &&
+	git rev-list --until=1980-01-01 $vt_commit >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'pretty-printer handles whitespace date' '
+	# as with the %ad test above, we will show these as the empty string,
+	# not the 1970 epoch date. This is intentional; see 7d9a281941 (t4212:
+	# test bogus timestamps with git-log, 2014-02-24) for more discussion.
+	echo : >expect &&
+	git log -1 --format="%at:%ct" $ws_commit >actual &&
+	test_cmp expect actual &&
+	git log -1 --format="%at:%ct" $vt_commit >actual &&
+	test_cmp expect actual
+'
+
 test_done
-- 
2.40.1.663.g410c33770c.dirty


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 3/4] parse_commit(): handle broken whitespace-only timestamp
  2023-04-27  8:17                           ` [PATCH v3 3/4] parse_commit(): handle broken whitespace-only timestamp Jeff King
@ 2023-04-27 10:11                             ` Phillip Wood
  2023-04-27 11:55                               ` Phillip Wood
  2023-04-27 16:20                               ` Junio C Hamano
  2023-04-27 16:25                             ` Junio C Hamano
  1 sibling, 2 replies; 46+ messages in thread
From: Phillip Wood @ 2023-04-27 10:11 UTC (permalink / raw)
  To: Jeff King, Junio C Hamano
  Cc: Thomas Bock, Derrick Stolee, git, René Scharfe

On 27/04/2023 09:17, Jeff King wrote:
> The comment in parse_commit_date() claims that parse_timestamp() will
> not walk past the end of the buffer we've been given, since it will hit
> the newline at "eol" and stop. This is usually true, when dateptr
> contains actual numbers to parse. But with a line like:
> 
>     committer name <email>   \n
> 
> with just whitespace, and no numbers, parse_timestamp() will consume
> that newline as part of the leading whitespace, and we may walk past our
> "tail" pointer (which itself is set from the "size" parameter passed in
> to parse_commit_buffer()).
> 
> In practice this can't cause us to walk off the end of an array, because
> we always add an extra NUL byte to the end of objects we load from disk
> (as a defense against exactly this kind of bug). However, you can see
> the behavior in action when "committer" is the final header (which it
> usually is, unless there's an encoding) and the subject line can be
> parsed as an integer. We walk right past the newline on the committer
> line, as well as the "\n\n" separator, and mistake the subject for the
> timestamp.
> 
> We can solve this by trimming the whitespace ourselves, making sure that
> it has some non-whitespace to parse. Note that we need to be a bit
> careful about the definition of "whitespace" here, as our isspace()
> doesn't match exotic characters like vertical tab or formfeed. We can
> work around that by checking for an actual number (see the in-code
> comment). This is slightly more restrictive than the current code, but
> in practice the results are either the same (we reject "foo" as "0", but
> so would parse_timestamp()) or extremely unlikely even for broken
> commits (parse_timestamp() would allow "\v123" as "123", but we'll now
> make it "0").
> 
> I did also allow "-" here, which may be controversial, as we don't
> currently support negative timestamps. My reasoning was two-fold. One,
> the design of parse_timestamp() is such that we should be able to easily
> switch it to handling signed values, and this otherwise creates a
> hard-to-find gotcha that anybody doing that work would get tripped up
> on. And two, the status quo is that we currently parse them, though the
> result of course ends up as a very large unsigned value (which is likely
> to just get clamped to "0" for display anyway, since our date routines
> can't handle it).

I think this makes a good case for accepting '-'. The commit message is 
well explained as always :-) This all looks good to me apart from a 
query about one of the tests.

> The new test checks the commit parser (via "--until") for both vanilla
> spaces and the vertical-tab case. I also added a test to check these
> against the pretty-print formatter, which uses split_ident_line().  It's
> not subject to the same bug, because it already insists that there be
> one or more digits in the timestamp.
> 
> Helped-by: Phillip Wood <phillip.wood123@gmail.com>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>   commit.c               | 28 ++++++++++++++++++++++++++--
>   t/t4212-log-corrupt.sh | 41 +++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 67 insertions(+), 2 deletions(-)
> 

> +test_expect_success 'create commits with whitespace committer dates' '
> +	# It is important that this subject line is numeric, since we want to
> +	# be sure we are not confused by skipping whitespace and accidentally
> +	# parsing the subject as a timestamp.
> +	#
> +	# Do not use munge_author_date here. Besides not hitting the committer
> +	# line, it leaves the timezone intact, and we want nothing but
> +	# whitespace.
> +	#
> +	# We will make two munged commits here. The first, ws_commit, will
> +	# be purely spaces. The second contains a vertical tab, which is
> +	# considered a space by strtoumax(), but not by our isspace().

This comment is really helpful to explain what's going on and testing 
'\v' as well as ' ' is a good idea.

> +	test_commit 1234567890 &&
> +	git cat-file commit HEAD >commit.orig &&
> +	sed "s/>.*/>    /" <commit.orig >commit.munge &&
> +	ws_commit=$(git hash-object --literally -w -t commit commit.munge) &&
> +	sed "s/>.*/>   $(printf "\013")/" <commit.orig >commit.munge &&

Does the shell eat the '\v' when it trims trailing whitespace from the 
command substitution (I can't remember the rules off the top of my head)?

Best Wishes

Phillip

> +	vt_commit=$(git hash-object --literally -w -t commit commit.munge)
> +'
> +
> +test_expect_success '--until treats whitespace date as sentinel' '
> +	echo $ws_commit >expect &&
> +	git rev-list --until=1980-01-01 $ws_commit >actual &&
> +	test_cmp expect actual &&
> +
> +	echo $vt_commit >expect &&
> +	git rev-list --until=1980-01-01 $vt_commit >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'pretty-printer handles whitespace date' '
> +	# as with the %ad test above, we will show these as the empty string,
> +	# not the 1970 epoch date. This is intentional; see 7d9a281941 (t4212:
> +	# test bogus timestamps with git-log, 2014-02-24) for more discussion.
> +	echo : >expect &&
> +	git log -1 --format="%at:%ct" $ws_commit >actual &&
> +	test_cmp expect actual &&
> +	git log -1 --format="%at:%ct" $vt_commit >actual &&
> +	test_cmp expect actual
> +'
> +
>   test_done


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 3/4] parse_commit(): handle broken whitespace-only timestamp
  2023-04-27 10:11                             ` Phillip Wood
@ 2023-04-27 11:55                               ` Phillip Wood
  2023-04-27 16:46                                 ` Jeff King
  2023-04-27 16:20                               ` Junio C Hamano
  1 sibling, 1 reply; 46+ messages in thread
From: Phillip Wood @ 2023-04-27 11:55 UTC (permalink / raw)
  To: Jeff King, Junio C Hamano
  Cc: Thomas Bock, Derrick Stolee, git, René Scharfe

On 27/04/2023 11:11, Phillip Wood wrote:

>> +    test_commit 1234567890 &&
>> +    git cat-file commit HEAD >commit.orig &&
>> +    sed "s/>.*/>    /" <commit.orig >commit.munge &&
>> +    ws_commit=$(git hash-object --literally -w -t commit 
>> commit.munge) &&
>> +    sed "s/>.*/>   $(printf "\013")/" <commit.orig >commit.munge &&
> 
> Does the shell eat the '\v' when it trims trailing whitespace from the 
> command substitution (I can't remember the rules off the top of my head)?

Having looked it up, the shell trims newlines but not other whitespace 
so this should be fine.

Best Wishes

Phillip

> 
> Best Wishes
> 
> Phillip
> 
>> +    vt_commit=$(git hash-object --literally -w -t commit commit.munge)
>> +'
>> +
>> +test_expect_success '--until treats whitespace date as sentinel' '
>> +    echo $ws_commit >expect &&
>> +    git rev-list --until=1980-01-01 $ws_commit >actual &&
>> +    test_cmp expect actual &&
>> +
>> +    echo $vt_commit >expect &&
>> +    git rev-list --until=1980-01-01 $vt_commit >actual &&
>> +    test_cmp expect actual
>> +'
>> +
>> +test_expect_success 'pretty-printer handles whitespace date' '
>> +    # as with the %ad test above, we will show these as the empty 
>> string,
>> +    # not the 1970 epoch date. This is intentional; see 7d9a281941 
>> (t4212:
>> +    # test bogus timestamps with git-log, 2014-02-24) for more 
>> discussion.
>> +    echo : >expect &&
>> +    git log -1 --format="%at:%ct" $ws_commit >actual &&
>> +    test_cmp expect actual &&
>> +    git log -1 --format="%at:%ct" $vt_commit >actual &&
>> +    test_cmp expect actual
>> +'
>> +
>>   test_done
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 3/4] parse_commit(): handle broken whitespace-only timestamp
  2023-04-27 11:55                               ` Phillip Wood
@ 2023-04-27 16:46                                 ` Jeff King
  0 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2023-04-27 16:46 UTC (permalink / raw)
  To: phillip.wood
  Cc: Junio C Hamano, Thomas Bock, Derrick Stolee, git,
	René Scharfe

On Thu, Apr 27, 2023 at 12:55:26PM +0100, Phillip Wood wrote:

> On 27/04/2023 11:11, Phillip Wood wrote:
> 
> > > +    test_commit 1234567890 &&
> > > +    git cat-file commit HEAD >commit.orig &&
> > > +    sed "s/>.*/>    /" <commit.orig >commit.munge &&
> > > +    ws_commit=$(git hash-object --literally -w -t commit
> > > commit.munge) &&
> > > +    sed "s/>.*/>   $(printf "\013")/" <commit.orig >commit.munge &&
> > 
> > Does the shell eat the '\v' when it trims trailing whitespace from the
> > command substitution (I can't remember the rules off the top of my
> > head)?
> 
> Having looked it up, the shell trims newlines but not other whitespace so
> this should be fine.

Yep. I also wondered if some sed versions might complain. But it's
really not that much more exotic than a tab, so it's probably OK.

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 3/4] parse_commit(): handle broken whitespace-only timestamp
  2023-04-27 10:11                             ` Phillip Wood
  2023-04-27 11:55                               ` Phillip Wood
@ 2023-04-27 16:20                               ` Junio C Hamano
  2023-04-27 16:55                                 ` Jeff King
  1 sibling, 1 reply; 46+ messages in thread
From: Junio C Hamano @ 2023-04-27 16:20 UTC (permalink / raw)
  To: Phillip Wood
  Cc: Jeff King, Thomas Bock, Derrick Stolee, git, René Scharfe

Phillip Wood <phillip.wood123@gmail.com> writes:

>> I did also allow "-" here, which may be controversial, as we don't
>> currently support negative timestamps. My reasoning was two-fold. One,
>> the design of parse_timestamp() is such that we should be able to easily
>> switch it to handling signed values, and this otherwise creates a
>> hard-to-find gotcha that anybody doing that work would get tripped up
>> on. And two, the status quo is that we currently parse them, though the
>> result of course ends up as a very large unsigned value (which is likely
>> to just get clamped to "0" for display anyway, since our date routines
>> can't handle it).
>
> I think this makes a good case for accepting '-'. The commit message
> is well explained as always :-) This all looks good to me apart from a
> query about one of the tests.

I agree.  I was somewhat surprised that the big comment before that
code did not mention it, but hopefully those who would be tempted to
remove the check for '-' would either be careful enough themselves
or be stopped by reviewers who are careful enough to go back to the
log message of the commit that added the check in the first place,
so it is OK.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 3/4] parse_commit(): handle broken whitespace-only timestamp
  2023-04-27 16:20                               ` Junio C Hamano
@ 2023-04-27 16:55                                 ` Jeff King
  0 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2023-04-27 16:55 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Phillip Wood, Thomas Bock, Derrick Stolee, git, René Scharfe

On Thu, Apr 27, 2023 at 09:20:53AM -0700, Junio C Hamano wrote:

> Phillip Wood <phillip.wood123@gmail.com> writes:
> 
> >> I did also allow "-" here, which may be controversial, as we don't
> >> currently support negative timestamps. My reasoning was two-fold. One,
> >> the design of parse_timestamp() is such that we should be able to easily
> >> switch it to handling signed values, and this otherwise creates a
> >> hard-to-find gotcha that anybody doing that work would get tripped up
> >> on. And two, the status quo is that we currently parse them, though the
> >> result of course ends up as a very large unsigned value (which is likely
> >> to just get clamped to "0" for display anyway, since our date routines
> >> can't handle it).
> >
> > I think this makes a good case for accepting '-'. The commit message
> > is well explained as always :-) This all looks good to me apart from a
> > query about one of the tests.
> 
> I agree.  I was somewhat surprised that the big comment before that
> code did not mention it, but hopefully those who would be tempted to
> remove the check for '-' would either be careful enough themselves
> or be stopped by reviewers who are careful enough to go back to the
> log message of the commit that added the check in the first place,
> so it is OK.

Hmph, I thought I did write something in that comment. But I think in
the end I just migrated it to the commit message (and skimming my reflog
I don't think it even made it as far as a commit).

I think it's OK. I was mostly trying to help out Future Peff, who has a
series almost completed to handle negative timestamps. But I think the
worst case is that the tests in that new series would fail, and I'd
figure it out. :)

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 3/4] parse_commit(): handle broken whitespace-only timestamp
  2023-04-27  8:17                           ` [PATCH v3 3/4] parse_commit(): handle broken whitespace-only timestamp Jeff King
  2023-04-27 10:11                             ` Phillip Wood
@ 2023-04-27 16:25                             ` Junio C Hamano
  2023-04-27 16:57                               ` Jeff King
  1 sibling, 1 reply; 46+ messages in thread
From: Junio C Hamano @ 2023-04-27 16:25 UTC (permalink / raw)
  To: Jeff King
  Cc: Phillip Wood, Thomas Bock, Derrick Stolee, git, René Scharfe

Jeff King <peff@peff.net> writes:

> In practice this can't cause us to walk off the end of an array, because
> we always add an extra NUL byte to the end of objects we load from disk
> (as a defense against exactly this kind of bug). However, you can see
> the behavior in action when "committer" is the final header (which it
> usually is, unless there's an encoding ...

... or it is a signed commit or a commit that merges a signed tag.

There is no need for us to be exhaustive here, but I just wondered
which one of these three commit object headers is more common.  I
guess the reason "encoding" came to your mind first is because it is
the oldest among the three.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 3/4] parse_commit(): handle broken whitespace-only timestamp
  2023-04-27 16:25                             ` Junio C Hamano
@ 2023-04-27 16:57                               ` Jeff King
  0 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2023-04-27 16:57 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Phillip Wood, Thomas Bock, Derrick Stolee, git, René Scharfe

On Thu, Apr 27, 2023 at 09:25:02AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > In practice this can't cause us to walk off the end of an array, because
> > we always add an extra NUL byte to the end of objects we load from disk
> > (as a defense against exactly this kind of bug). However, you can see
> > the behavior in action when "committer" is the final header (which it
> > usually is, unless there's an encoding ...
> 
> ... or it is a signed commit or a commit that merges a signed tag.
> 
> There is no need for us to be exhaustive here, but I just wondered
> which one of these three commit object headers is more common.  I
> guess the reason "encoding" came to your mind first is because it is
> the oldest among the three.

Mostly the others did not occur to me at all. :)

I expect that "gpgsig" lines are probably the most common these days,
but that may be my biased view (I guess in the kernel workflow it is
probably signed tags).

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH v3 4/4] parse_commit(): describe more date-parsing failure modes
  2023-04-27  8:13                         ` [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases Jeff King
                                             ` (2 preceding siblings ...)
  2023-04-27  8:17                           ` [PATCH v3 3/4] parse_commit(): handle broken whitespace-only timestamp Jeff King
@ 2023-04-27  8:17                           ` Jeff King
  2023-04-27  8:18                           ` [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases Jeff King
  2023-04-27 16:32                           ` Junio C Hamano
  5 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2023-04-27  8:17 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Phillip Wood, Thomas Bock, Derrick Stolee, git, René Scharfe

The previous few commits improved the parsing of dates in malformed
commit objects. But there's one big case left implicit: we may still
feed garbage to parse_timestamp(). This is preferable to trying to be
more strict, but let's document the thinking in a comment.

Signed-off-by: Jeff King <peff@peff.net>
---
 commit.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/commit.c b/commit.c
index 8dfe92cf37..e2e4fd2db9 100644
--- a/commit.c
+++ b/commit.c
@@ -148,6 +148,15 @@ static timestamp_t parse_commit_date(const char *buf, const char *tail)
 	/*
 	 * We know there is at least one digit (or dash), so we'll begin
 	 * parsing there and stop at worst case at eol.
+	 *
+	 * Note that we may feed parse_timestamp() extra characters here if the
+	 * commit is malformed, and it will parse as far as it can. For
+	 * example, "123foo456" would return "123". That might be questionable
+	 * (versus returning "0"), but it would help in a hypothetical case
+	 * like "123456+0100", where the whitespace from the timezone is
+	 * missing. Since such syntactic errors may be baked into history and
+	 * hard to correct now, let's err on trying to make our best guess
+	 * here, rather than insist on perfect syntax.
 	 */
 	return parse_timestamp(dateptr, NULL, 10);
 }
-- 
2.40.1.663.g410c33770c.dirty

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases
  2023-04-27  8:13                         ` [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases Jeff King
                                             ` (3 preceding siblings ...)
  2023-04-27  8:17                           ` [PATCH v3 4/4] parse_commit(): describe more date-parsing failure modes Jeff King
@ 2023-04-27  8:18                           ` Jeff King
  2023-04-27 16:32                           ` Junio C Hamano
  5 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2023-04-27  8:18 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Phillip Wood, Thomas Bock, Derrick Stolee, git, René Scharfe

On Thu, Apr 27, 2023 at 04:13:31AM -0400, Jeff King wrote:

>     @@ t/t4212-log-corrupt.sh: test_expect_success 'absurdly far-in-future date' '
>      +	# test bogus timestamps with git-log, 2014-02-24) for more discussion.
>      +	echo : >expect &&
>      +	git log -1 --format="%at:%ct" $ws_commit >actual &&
>     ++	test_cmp expect actual &&
>     ++	git log -1 --format="%at:%ct" $vt_commit &&
>      +	test_cmp expect actual

In case anyone is just reading the range-diff, you may notice that the
added git-log invocation here is missing its ">actual" redirect. I
noticed this while sending out, and it's fixed in the actual v3 patch
that was sent.

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases
  2023-04-27  8:13                         ` [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases Jeff King
                                             ` (4 preceding siblings ...)
  2023-04-27  8:18                           ` [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases Jeff King
@ 2023-04-27 16:32                           ` Junio C Hamano
  5 siblings, 0 replies; 46+ messages in thread
From: Junio C Hamano @ 2023-04-27 16:32 UTC (permalink / raw)
  To: Jeff King
  Cc: Phillip Wood, Thomas Bock, Derrick Stolee, git, René Scharfe

Jeff King <peff@peff.net> writes:

> So here's a v3. I was tempted to add the fix on top of the existing
> patch, since it's somewhat its own case, and could be explained
> separately. But they really are two versions of the same problem, so I
> just rolled it all into patch 3.
>
> Patch 4 needed small updates to its comment to match. The first two
> patches are the same.
>
>   [1/4]: t4212: avoid putting git on left-hand side of pipe
>   [2/4]: parse_commit(): parse timestamp from end of line
>   [3/4]: parse_commit(): handle broken whitespace-only timestamp
>   [4/4]: parse_commit(): describe more date-parsing failure modes

Thanks for a pleasant read.  Looking very good.

Will queue.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp
  2023-04-25 16:06                   ` Junio C Hamano
  2023-04-26 11:36                     ` Jeff King
@ 2023-04-26 14:06                     ` Phillip Wood
  2023-04-26 14:31                       ` Andreas Schwab
  1 sibling, 1 reply; 46+ messages in thread
From: Phillip Wood @ 2023-04-26 14:06 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Thomas Bock, Derrick Stolee, git, René Scharfe

On 25/04/2023 17:06, Junio C Hamano wrote:
> Phillip Wood <phillip.wood123@gmail.com> writes:
> 
>> This probably doesn't matter in practice but we define our own
>> isspace() that does not treat '\v' and '\f' as whitespace. However
>> parse_timestamp() (which is just strtoumax()) uses the standard
>> library's isspace() which does treat those characters as whitespace
>> and is locale dependent. This means we can potentially stop at a
>> character that parse_timestamp() treats as whitespace and if there are
>> no digits after it we'll still walk past the end of the line. Using
>> Rene's suggestion of testing the character with isdigit() would fix
>> that. It would also avoid parsing negative timestamps as positive
>> numbers 

>> and reject any timestamps that begin with a locale dependent
>> digit.

Sorry, that bit is not correct, I've since checked the C standard and I 
think strtoul() and friends expect ascii digits (isdigit() and 
isxdigit() are also locale independent unlike isspace(), isalpha() etc.)

> A very interesting observation.  I wonder if a curious person can
> craft a malformed timestamp with "hash-object --literally" to do
> more than DoS themselves?
> 
> We are not going to put anything other than [ 0-9+-] after the '>'
> we scan for, and making sure '>' is followed by SP and then [0-9]
> would be sufficient to ensure strtoumax() to stop before the '\n'
> but does not ensure that the "signal a bad timestamp with 0"
> happens.  Perhaps that would be sufficient.  I dunno.
> 
>> I'm not familiar with this code, but would it be worth changing
>> parse_timestamp() to stop parsing if it sees a newline?
> 
> Meaning replace or write our own strtoumax() equivalent?

I was thinking of a wrapper around strtoumax() that skipped the leading 
whitespace itself and returned 0 if it saw '\n' or the first 
non-whitespace character was not a digit. It would help other callers 
avoid the problem with missing timestamps that is being fixed in this 
series. I was surprised to see that callers are expected to pass a base 
to parse_timestamp(). All of them seem to pass "10" apart from a caller 
in upload-pack.c that passes "0" when parsing the argument to 
"deepen-since" - do we really want to support octal and hexadecimal 
timestamps there?.

Best Wishes

Phillip

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp
  2023-04-26 14:06                     ` [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp Phillip Wood
@ 2023-04-26 14:31                       ` Andreas Schwab
  2023-04-26 14:44                         ` Phillip Wood
  0 siblings, 1 reply; 46+ messages in thread
From: Andreas Schwab @ 2023-04-26 14:31 UTC (permalink / raw)
  To: Phillip Wood
  Cc: Junio C Hamano, phillip.wood, Jeff King, Thomas Bock,
	Derrick Stolee, git, René Scharfe

On Apr 26 2023, Phillip Wood wrote:

> On 25/04/2023 17:06, Junio C Hamano wrote:
>> Phillip Wood <phillip.wood123@gmail.com> writes:
>> 
>>> This probably doesn't matter in practice but we define our own
>>> isspace() that does not treat '\v' and '\f' as whitespace. However
>>> parse_timestamp() (which is just strtoumax()) uses the standard
>>> library's isspace() which does treat those characters as whitespace
>>> and is locale dependent. This means we can potentially stop at a
>>> character that parse_timestamp() treats as whitespace and if there are
>>> no digits after it we'll still walk past the end of the line. Using
>>> Rene's suggestion of testing the character with isdigit() would fix
>>> that. It would also avoid parsing negative timestamps as positive
>>> numbers 
>
>>> and reject any timestamps that begin with a locale dependent
>>> digit.
>
> Sorry, that bit is not correct, I've since checked the C standard and I
> think strtoul() and friends expect ascii digits (isdigit() and isxdigit()
> are also locale independent unlike isspace(), isalpha() etc.)

The standard says:

    In other than the "C" locale, additional locale-specific subject
    sequence forms may be accepted.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp
  2023-04-26 14:31                       ` Andreas Schwab
@ 2023-04-26 14:44                         ` Phillip Wood
  0 siblings, 0 replies; 46+ messages in thread
From: Phillip Wood @ 2023-04-26 14:44 UTC (permalink / raw)
  To: Andreas Schwab
  Cc: Junio C Hamano, phillip.wood, Jeff King, Thomas Bock,
	Derrick Stolee, git, René Scharfe

Hi Andreas

On 26/04/2023 15:31, Andreas Schwab wrote:
> On Apr 26 2023, Phillip Wood wrote:
> 
>> On 25/04/2023 17:06, Junio C Hamano wrote:
>>> Phillip Wood <phillip.wood123@gmail.com> writes:
>>>
>>>> This probably doesn't matter in practice but we define our own
>>>> isspace() that does not treat '\v' and '\f' as whitespace. However
>>>> parse_timestamp() (which is just strtoumax()) uses the standard
>>>> library's isspace() which does treat those characters as whitespace
>>>> and is locale dependent. This means we can potentially stop at a
>>>> character that parse_timestamp() treats as whitespace and if there are
>>>> no digits after it we'll still walk past the end of the line. Using
>>>> Rene's suggestion of testing the character with isdigit() would fix
>>>> that. It would also avoid parsing negative timestamps as positive
>>>> numbers
>>
>>>> and reject any timestamps that begin with a locale dependent
>>>> digit.
>>
>> Sorry, that bit is not correct, I've since checked the C standard and I
>> think strtoul() and friends expect ascii digits (isdigit() and isxdigit()
>> are also locale independent unlike isspace(), isalpha() etc.)
> 
> The standard says:
> 
>      In other than the "C" locale, additional locale-specific subject
>      sequence forms may be accepted.

Thanks, looking at the standard again I don't know how I managed to miss 
that, my initial recollection was correct after all.

Best Wishes

Phillip

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH v2 4/4] parse_commit(): describe more date-parsing failure modes
  2023-04-25  5:52             ` [PATCH v2 " Jeff King
                                 ` (3 preceding siblings ...)
  2023-04-25  5:54               ` [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp Jeff King
@ 2023-04-25  5:55               ` Jeff King
  4 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2023-04-25  5:55 UTC (permalink / raw)
  To: Thomas Bock; +Cc: Derrick Stolee, Junio C Hamano, git

The previous few commits improved the parsing of dates in malformed
commit objects. But there's one big case left implicit: we may still
feed garbage to parse_timestamp(). This is preferable to trying to be
more strict, but let's document the thinking in a comment.

Signed-off-by: Jeff King <peff@peff.net>
---
Obviously not strictly necessary, but it felt like this was worth
pointing out, and it felt weird to shove it into patch 3.

 commit.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/commit.c b/commit.c
index 2f1b5d505b..2ac141e198 100644
--- a/commit.c
+++ b/commit.c
@@ -136,6 +136,16 @@ static timestamp_t parse_commit_date(const char *buf, const char *tail)
 	/*
 	 * We know there is at least one non-whitespace character, so we'll
 	 * begin parsing there and stop at worst case at eol.
+	 *
+	 * Note that we may feed parse_timestamp() non-digit garbage here if
+	 * the commit is malformed. If we see a totally unexpected token like
+	 * "foo" here, it will return 0, which is our error/sentinel value
+	 * anyway. For something like "123foo456", it will parse as far as it
+	 * can. That might be questionable (versus returning "0"), but it would
+	 * help in a hypothetical case like "123456+0100", where the whitespace
+	 * from the timezone is missing. Since such syntactic errors may be
+	 * baked into history and hard to correct now, let's err on trying to
+	 * make our best guess here, rather than insist on perfect syntax.
 	 */
 	return parse_timestamp(dateptr, NULL, 10);
 }
-- 
2.40.0.653.g15ca972062

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980
  2023-04-18 14:02       ` Derrick Stolee
  2023-04-21 14:51         ` Thomas Bock
@ 2023-04-22 13:52         ` Jeff King
  1 sibling, 0 replies; 46+ messages in thread
From: Jeff King @ 2023-04-22 13:52 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Junio C Hamano, Thomas Bock, git

On Tue, Apr 18, 2023 at 10:02:44AM -0400, Derrick Stolee wrote:

> On 4/18/2023 12:12 AM, Jeff King wrote:
> 
> > One thing the commit graph perhaps _could_ do is omit the commit, or
> > mark it as "this one is broken in some way". And then fall back to
> > parsing those few instead (which is slower, but if it's a small minority
> > of commits, that's OK). But I don't think there's any code for that.
> 
> The "broken" commit would need to be included in the commit-graph file
> so its children can point to it using a graph position, but then it
> would revert to parsing from the commit object (due to some new concept
> storing "this is a bad commit").
> 
> If we decided to treat a timestamp of 0 as "probably broken, artificial
> at best" then we wouldn't need the new indicator in the commit-graph
> file, but this seems like quite a big hammer for a small case.

Makes sense. I punted on doing anything here for the series I just
posted, since the workaround of "blow away your commit graph and
rebuild" is probably sufficient for such obscure cases. And going
forward, new commit graphs would be built with the fixed parser anyway,
so it becomes less of a problem over time.

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2023-04-27 22:33 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-04-14 11:37 Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980 Thomas Bock
2023-04-15  8:52 ` Jeff King
2023-04-15  8:59   ` Jeff King
2023-04-15 14:10   ` Kristoffer Haugsbakk
2023-04-17  5:40     ` Jeff King
2023-04-17  6:20       ` Kristoffer Haugsbakk
2023-04-17  7:41         ` Jeff King
2023-04-27 22:32           ` Kristoffer Haugsbakk
2023-04-17  9:51   ` Junio C Hamano
2023-04-18  4:12     ` Jeff King
2023-04-18 14:02       ` Derrick Stolee
2023-04-21 14:51         ` Thomas Bock
2023-04-22 13:41           ` [PATCH 0/3] fixing some parse_commit() timestamp corner cases Jeff King
2023-04-22 13:42             ` [PATCH 1/3] t4212: avoid putting git on left-hand side of pipe Jeff King
2023-04-22 13:47             ` [PATCH 2/3] parse_commit(): parse timestamp from end of line Jeff King
2023-04-24 17:05               ` Junio C Hamano
2023-04-25  5:23                 ` Jeff King
2023-04-24 16:39             ` [PATCH 0/3] fixing some parse_commit() timestamp corner cases Junio C Hamano
2023-04-25  5:52             ` [PATCH v2 " Jeff King
2023-04-25  5:54               ` Jeff King
2023-04-25  5:54               ` [PATCH v2 1/4] t4212: avoid putting git on left-hand side of pipe Jeff King
2023-04-25  5:54               ` [PATCH v2 2/4] parse_commit(): parse timestamp from end of line Jeff King
2023-04-25  5:54               ` [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp Jeff King
2023-04-25 10:11                 ` Phillip Wood
2023-04-25 16:06                   ` Junio C Hamano
2023-04-26 11:36                     ` Jeff King
2023-04-26 15:32                       ` Junio C Hamano
2023-04-27  8:13                         ` [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases Jeff King
2023-04-27  8:14                           ` [PATCH v3 1/4] t4212: avoid putting git on left-hand side of pipe Jeff King
2023-04-27  8:14                           ` [PATCH v3 2/4] parse_commit(): parse timestamp from end of line Jeff King
2023-04-27  8:17                           ` [PATCH v3 3/4] parse_commit(): handle broken whitespace-only timestamp Jeff King
2023-04-27 10:11                             ` Phillip Wood
2023-04-27 11:55                               ` Phillip Wood
2023-04-27 16:46                                 ` Jeff King
2023-04-27 16:20                               ` Junio C Hamano
2023-04-27 16:55                                 ` Jeff King
2023-04-27 16:25                             ` Junio C Hamano
2023-04-27 16:57                               ` Jeff King
2023-04-27  8:17                           ` [PATCH v3 4/4] parse_commit(): describe more date-parsing failure modes Jeff King
2023-04-27  8:18                           ` [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases Jeff King
2023-04-27 16:32                           ` Junio C Hamano
2023-04-26 14:06                     ` [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp Phillip Wood
2023-04-26 14:31                       ` Andreas Schwab
2023-04-26 14:44                         ` Phillip Wood
2023-04-25  5:55               ` [PATCH v2 4/4] parse_commit(): describe more date-parsing failure modes Jeff King
2023-04-22 13:52         ` Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980 Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).