git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Junio C Hamano <gitster@pobox.com>
To: Patrick Steinhardt <ps@pks.im>
Cc: git@vger.kernel.org,  Karthik Nayak <karthik.188@gmail.com>,
	 Taylor Blau <me@ttaylorr.com>,  Jeff King <peff@peff.net>
Subject: Re: [PATCH v2 2/2] commit: detect commits that exist in commit-graph but not in the ODB
Date: Tue, 24 Oct 2023 12:10:13 -0700	[thread overview]
Message-ID: <xmqqy1fr3kh6.fsf@gitster.g> (raw)
In-Reply-To: <0476d4855562b677ced106a4cc7788b46434cf21.1698060036.git.ps@pks.im> (Patrick Steinhardt's message of "Mon, 23 Oct 2023 13:27:20 +0200")

Patrick Steinhardt <ps@pks.im> writes:

> Commit graphs can become stale and contain references to commits that do
> not exist in the object database anymore. Theoretically, this can lead
> to a scenario where we are able to successfully look up any such commit
> via the commit graph even though such a lookup would fail if done via
> the object database directly.
>
> As the commit graph is mostly intended as a sort of cache to speed up
> parsing of commits we do not want to have diverging behaviour in a
> repository with and a repository without commit graphs, no matter
> whether they are stale or not. As commits are otherwise immutable, the
> only thing that we really need to care about is thus the presence or
> absence of a commit.
>
> To address potentially stale commit data that may exist in the graph,
> our `lookup_commit_in_graph()` function will check for the commit's
> existence in both the commit graph, but also in the object database. So
> even if we were able to look up the commit's data in the graph, we would
> still pretend as if the commit didn't exist if it is missing in the
> object database.
>
> We don't have the same safety net in `parse_commit_in_graph_one()`
> though. This function is mostly used internally in "commit-graph.c"
> itself to validate the commit graph, and this usage is fine. We do
> expose its functionality via `parse_commit_in_graph()` though, which
> gets called by `repo_parse_commit_internal()`, and that function is in
> turn used in many places in our codebase.
>
> For all I can see this function is never used to directly turn an object
> ID into a commit object without additional safety checks before or after
> this lookup. What it is being used for though is to walk history via the
> parent chain of commits. So when commits in the parent chain of a graph
> walk are missing it is possible that we wouldn't notice if that missing
> commit was part of the commit graph. Thus, a query like `git rev-parse
> HEAD~2` can succeed even if the intermittent commit is missing.
>
> It's unclear whether there are additional ways in which such stale
> commit graphs can lead to problems. In any case, it feels like this is a
> bigger bug waiting to happen when we gain additional direct or indirect
> callers of `repo_parse_commit_internal()`. So let's fix the inconsistent
> behaviour by checking for object existence via the object database, as
> well.
>
> This check of course comes with a performance penalty. The following
> benchmarks have been executed in a clone of linux.git with stable tags
> added:
>
>     Benchmark 1: git -c core.commitGraph=true rev-list --topo-order --all (git = master)
>       Time (mean ± σ):      2.913 s ±  0.018 s    [User: 2.363 s, System: 0.548 s]
>       Range (min … max):    2.894 s …  2.950 s    10 runs
>
>     Benchmark 2: git -c core.commitGraph=true rev-list --topo-order --all (git = pks-commit-graph-inconsistency)
>       Time (mean ± σ):      3.834 s ±  0.052 s    [User: 3.276 s, System: 0.556 s]
>       Range (min … max):    3.780 s …  3.961 s    10 runs
>
>     Benchmark 3: git -c core.commitGraph=false rev-list --topo-order --all (git = master)
>       Time (mean ± σ):     13.841 s ±  0.084 s    [User: 13.152 s, System: 0.687 s]
>       Range (min … max):   13.714 s … 13.995 s    10 runs
>
>     Benchmark 4: git -c core.commitGraph=false rev-list --topo-order --all (git = pks-commit-graph-inconsistency)
>       Time (mean ± σ):     13.762 s ±  0.116 s    [User: 13.094 s, System: 0.667 s]
>       Range (min … max):   13.645 s … 14.038 s    10 runs
>
>     Summary
>       git -c core.commitGraph=true rev-list --topo-order --all (git = master) ran
>         1.32 ± 0.02 times faster than git -c core.commitGraph=true rev-list --topo-order --all (git = pks-commit-graph-inconsistency)
>         4.72 ± 0.05 times faster than git -c core.commitGraph=false rev-list --topo-order --all (git = pks-commit-graph-inconsistency)
>         4.75 ± 0.04 times faster than git -c core.commitGraph=false rev-list --topo-order --all (git = master)
>
> We look at a ~30% regression in general, but in general we're still a
> whole lot faster than without the commit graph. To counteract this, the
> new check can be turned off with the `GIT_COMMIT_GRAPH_PARANOIA` envvar.

Very nicely described.  Will queue.  I'll go offline for the rest of
the week but if there are no significant issues discovered by the
time I come back, let's declare a victory and merge these two
patches down to 'next'.

Thanks.

>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  commit.c                | 16 +++++++++++++++-
>  t/t5318-commit-graph.sh | 27 +++++++++++++++++++++++++++
>  2 files changed, 42 insertions(+), 1 deletion(-)
>
> diff --git a/commit.c b/commit.c
> index b3223478bc..7399e90212 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -28,6 +28,7 @@
>  #include "shallow.h"
>  #include "tree.h"
>  #include "hook.h"
> +#include "parse.h"
>  
>  static struct commit_extra_header *read_commit_extra_header_lines(const char *buf, size_t len, const char **);
>  
> @@ -572,8 +573,21 @@ int repo_parse_commit_internal(struct repository *r,
>  		return -1;
>  	if (item->object.parsed)
>  		return 0;
> -	if (use_commit_graph && parse_commit_in_graph(r, item))
> +	if (use_commit_graph && parse_commit_in_graph(r, item)) {
> +		static int object_paranoia = -1;
> +
> +		if (object_paranoia == -1)
> +			object_paranoia = git_env_bool(GIT_COMMIT_GRAPH_PARANOIA, 1);
> +
> +		if (object_paranoia && !has_object(r, &item->object.oid, 0)) {
> +			unparse_commit(r, &item->object.oid);
> +			return quiet_on_missing ? -1 :
> +				error(_("commit %s exists in commit-graph but not in the object database"),
> +				      oid_to_hex(&item->object.oid));
> +		}
> +
>  		return 0;
> +	}
>  
>  	if (oid_object_info_extended(r, &item->object.oid, &oi, flags) < 0)
>  		return quiet_on_missing ? -1 :
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index c0cc454538..79467d7926 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -842,4 +842,31 @@ test_expect_success 'stale commit cannot be parsed when given directly' '
>  	)
>  '
>  
> +test_expect_success 'stale commit cannot be parsed when traversing graph' '
> +	test_when_finished "rm -rf repo" &&
> +	git init repo &&
> +	(
> +		cd repo &&
> +
> +		test_commit A &&
> +		test_commit B &&
> +		test_commit C &&
> +		git commit-graph write --reachable &&
> +
> +		# Corrupt the repository by deleting the intermittent commit
> +		# object. Commands should notice that this object is absent and
> +		# thus that the repository is corrupt even if the commit graph
> +		# exists.
> +		oid=$(git rev-parse B) &&
> +		rm .git/objects/"$(test_oid_to_path "$oid")" &&
> +
> +		# Again, we should be able to parse the commit when not
> +		# being paranoid about commit graph staleness...
> +		GIT_COMMIT_GRAPH_PARANOIA=false git rev-parse HEAD~2 &&
> +		# ... but fail when we are paranoid.
> +		test_must_fail git rev-parse HEAD~2 2>error &&
> +		grep "error: commit $oid exists in commit-graph but not in the object database" error
> +	)
> +'
> +
>  test_done

  reply	other threads:[~2023-10-24 19:10 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-10-23 11:27 [PATCH v2 0/2] commit-graph: detect commits missing in ODB Patrick Steinhardt
2023-10-23 11:27 ` [PATCH v2 1/2] commit-graph: introduce envvar to disable commit existence checks Patrick Steinhardt
2023-10-30 21:29   ` Taylor Blau
2023-10-31  6:19     ` Patrick Steinhardt
2023-10-23 11:27 ` [PATCH v2 2/2] commit: detect commits that exist in commit-graph but not in the ODB Patrick Steinhardt
2023-10-24 19:10   ` Junio C Hamano [this message]
2023-10-30 21:32     ` Taylor Blau
2023-10-31  0:21       ` Junio C Hamano
2023-10-30 21:31   ` Taylor Blau
2023-10-31  7:16 ` [PATCH v3 0/2] commit-graph: detect commits missing in ODB Patrick Steinhardt
2023-10-31  7:16   ` [PATCH v3 1/2] commit-graph: introduce envvar to disable commit existence checks Patrick Steinhardt
2023-10-31  7:16   ` [PATCH v3 2/2] commit: detect commits that exist in commit-graph but not in the ODB Patrick Steinhardt
2023-10-31 19:16   ` [PATCH v3 0/2] commit-graph: detect commits missing in ODB Taylor Blau
2023-10-31 23:57     ` Junio C Hamano
2023-11-01  2:06       ` Junio C Hamano
2023-11-01  4:55         ` Patrick Steinhardt
2023-11-01  5:01           ` Junio C Hamano
2023-10-31 23:55   ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqqy1fr3kh6.fsf@gitster.g \
    --to=gitster@pobox.com \
    --cc=git@vger.kernel.org \
    --cc=karthik.188@gmail.com \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    --cc=ps@pks.im \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).