git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Taylor Blau <me@ttaylorr.com>
To: Patrick Steinhardt <ps@pks.im>
Cc: git@vger.kernel.org, Karthik Nayak <karthik.188@gmail.com>,
	Junio C Hamano <gitster@pobox.com>, Jeff King <peff@peff.net>
Subject: Re: [PATCH v2 1/2] commit-graph: introduce envvar to disable commit existence checks
Date: Mon, 30 Oct 2023 17:29:33 -0400	[thread overview]
Message-ID: <ZUAgPVR8HxgEZEWo@nand.local> (raw)
In-Reply-To: <a89c4355285bc0bb0ec339818e6fe907f9ffd30e.1698060036.git.ps@pks.im>

On Mon, Oct 23, 2023 at 01:27:16PM +0200, Patrick Steinhardt wrote:
> Our `lookup_commit_in_graph()` helper tries to look up commits from the
> commit graph and, if it doesn't exist there, falls back to parsing it
> from the object database instead. This is intended to speed up the
> lookup of any such commit that exists in the database. There is an edge
> case though where the commit exists in the graph, but not in the object
> database. To avoid returning such stale commits the helper function thus
> double checks that any such commit parsed from the graph also exists in
> the object database. This makes the function safe to use even when
> commit graphs aren't updated regularly.
>
> We're about to introduce the same pattern into other parts of our code
> base though, namely `repo_parse_commit_internal()`. Here the extra
> sanity check is a bit of a tougher sell: `lookup_commit_in_graph()` was
> a newly introduced helper, and as such there was no performance hit by
> adding this sanity check. If we added `repo_parse_commit_internal()`
> with that sanity check right from the beginning as well, this would
> probably never have been an issue to begin with. But by retrofitting it
> with this sanity check now we do add a performance regression to
> preexisting code, and thus there is a desire to avoid this or at least
> give an escape hatch.
>
> In practice, there is no inherent reason why either of those functions
> should have the sanity check whereas the other one does not: either both
> of them are able to detect this issue or none of them should be. This
> also means that the default of whether we do the check should likely be
> the same for both. To err on the side of caution, we thus rather want to
> make `repo_parse_commit_internal()` stricter than to loosen the checks
> that we already have in `lookup_commit_in_graph()`.

All well reasoned. I think the most compelling reason is that we're
already doing this extra check in lookup_commit_in_graph(), and having
that be somewhat inconsistent with repo_parse_commit_internal() feels
error-prone to me.

> The escape hatch is added in the form of a new GIT_COMMIT_GRAPH_PARANOIA
> environment variable that mirrors GIT_REF_PARANOIA. If enabled, which is
> the default, we will double check that commits looked up in the commit
> graph via `lookup_commit_in_graph()` also exist in the object database.
> This same check will also be added in `repo_parse_commit_internal()`.

Sounds good.

> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  Documentation/git.txt   |  9 +++++++++
>  commit-graph.c          |  6 +++++-
>  commit-graph.h          |  6 ++++++
>  t/t5318-commit-graph.sh | 21 +++++++++++++++++++++
>  4 files changed, 41 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/git.txt b/Documentation/git.txt
> index 11228956cd..22c2b537aa 100644
> --- a/Documentation/git.txt
> +++ b/Documentation/git.txt
> @@ -911,6 +911,15 @@ for full details.
>  	should not normally need to set this to `0`, but it may be
>  	useful when trying to salvage data from a corrupted repository.
>
> +`GIT_COMMIT_GRAPH_PARANOIA`::
> +	If this Boolean environment variable is set to false, ignore the
> +	case where commits exist in the commit graph but not in the
> +	object database. Normally, Git will check whether commits loaded
> +	from the commit graph exist in the object database to avoid
> +	issues with stale commit graphs, but this check comes with a
> +	performance penalty. The default is `1` (i.e., be paranoid about
> +	stale commits in the commit graph).
> +

The first two sentences seem to be flipped. Perhaps:

    When loading a commit object from the commit-graph, Git will perform
    an existence check on the object in the ODB before parsing it out of
    the commit-graph. The default is "true", which enables the
    aforementioned behavior. Setting this to "false" disables the
    existential check when parsing commits from a commit-graph.

>  `GIT_ALLOW_PROTOCOL`::
>  	If set to a colon-separated list of protocols, behave as if
>  	`protocol.allow` is set to `never`, and each of the listed
> diff --git a/commit-graph.c b/commit-graph.c
> index fd2f700b2e..12ec31902e 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -939,14 +939,18 @@ int repo_find_commit_pos_in_graph(struct repository *r, struct commit *c,
>
>  struct commit *lookup_commit_in_graph(struct repository *repo, const struct object_id *id)
>  {
> +	static int object_paranoia = -1;
>  	struct commit *commit;
>  	uint32_t pos;
>
> +	if (object_paranoia == -1)
> +		object_paranoia = git_env_bool(GIT_COMMIT_GRAPH_PARANOIA, 1);
> +

I don't think that this is a reroll-able issue, but calling this
variable object_paranoia to store a setting for *graph* paranoia feels
like a good itch to scratch. But obviously not a big deal ;-).

> @@ -821,4 +821,25 @@ test_expect_success 'overflow during generation version upgrade' '
>  	)
>  '
>
> +test_expect_success 'stale commit cannot be parsed when given directly' '
> +	test_when_finished "rm -rf repo" &&
> +	git init repo &&
> +	(
> +		cd repo &&
> +		test_commit A &&
> +		test_commit B &&
> +		git commit-graph write --reachable &&
> +
> +		oid=$(git rev-parse B) &&
> +		rm .git/objects/"$(test_oid_to_path "$oid")" &&
> +
> +		# Verify that it is possible to read the commit from the
> +		# commit graph when not being paranoid, ...
> +		GIT_COMMIT_GRAPH_PARANOIA=false git rev-list B &&
> +		# ... but parsing the commit when double checking that
> +		# it actually exists in the object database should fail.
> +		test_must_fail git rev-list -1 B

Would "cat-file -p" be more direct here than "rev-list -1"?

Thanks,
Taylor

  reply	other threads:[~2023-10-30 21:29 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-10-23 11:27 [PATCH v2 0/2] commit-graph: detect commits missing in ODB Patrick Steinhardt
2023-10-23 11:27 ` [PATCH v2 1/2] commit-graph: introduce envvar to disable commit existence checks Patrick Steinhardt
2023-10-30 21:29   ` Taylor Blau [this message]
2023-10-31  6:19     ` Patrick Steinhardt
2023-10-23 11:27 ` [PATCH v2 2/2] commit: detect commits that exist in commit-graph but not in the ODB Patrick Steinhardt
2023-10-24 19:10   ` Junio C Hamano
2023-10-30 21:32     ` Taylor Blau
2023-10-31  0:21       ` Junio C Hamano
2023-10-30 21:31   ` Taylor Blau
2023-10-31  7:16 ` [PATCH v3 0/2] commit-graph: detect commits missing in ODB Patrick Steinhardt
2023-10-31  7:16   ` [PATCH v3 1/2] commit-graph: introduce envvar to disable commit existence checks Patrick Steinhardt
2023-10-31  7:16   ` [PATCH v3 2/2] commit: detect commits that exist in commit-graph but not in the ODB Patrick Steinhardt
2023-10-31 19:16   ` [PATCH v3 0/2] commit-graph: detect commits missing in ODB Taylor Blau
2023-10-31 23:57     ` Junio C Hamano
2023-11-01  2:06       ` Junio C Hamano
2023-11-01  4:55         ` Patrick Steinhardt
2023-11-01  5:01           ` Junio C Hamano
2023-10-31 23:55   ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZUAgPVR8HxgEZEWo@nand.local \
    --to=me@ttaylorr.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=karthik.188@gmail.com \
    --cc=peff@peff.net \
    --cc=ps@pks.im \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).