git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Patrick Steinhardt <ps@pks.im>
To: Taylor Blau <me@ttaylorr.com>
Cc: git@vger.kernel.org, "Jonathan Tan" <jonathantanmy@google.com>,
	"Junio C Hamano" <gitster@pobox.com>, "Jeff King" <peff@peff.net>,
	"SZEDER Gábor" <szeder.dev@gmail.com>
Subject: Re: [PATCH v3 03/17] commit-graph: ensure Bloom filters are read with consistent settings
Date: Tue, 17 Oct 2023 10:45:07 +0200	[thread overview]
Message-ID: <ZS5JkyBuFzW09LXH@tanuki> (raw)
In-Reply-To: <2ecc0a2d58432b149d73a3e2abfa948eb1f0aa0b.1696969994.git.me@ttaylorr.com>

[-- Attachment #1: Type: text/plain, Size: 8299 bytes --]

On Tue, Oct 10, 2023 at 04:33:26PM -0400, Taylor Blau wrote:
> The changed-path Bloom filter mechanism is parameterized by a couple of
> variables, notably the number of bits per hash (typically "m" in Bloom
> filter literature) and the number of hashes themselves (typically "k").
> 
> It is critically important that filters are read with the Bloom filter
> settings that they were written with. Failing to do so would mean that
> each query is liable to compute different fingerprints, meaning that the
> filter itself could return a false negative. This goes against a basic
> assumption of using Bloom filters (that they may return false positives,
> but never false negatives) and can lead to incorrect results.
> 
> We have some existing logic to carry forward existing Bloom filter
> settings from one layer to the next. In `write_commit_graph()`, we have
> something like:
> 
>     if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
>         struct commit_graph *g = ctx->r->objects->commit_graph;
> 
>         /* We have changed-paths already. Keep them in the next graph */
>         if (g && g->chunk_bloom_data) {
>             ctx->changed_paths = 1;
>             ctx->bloom_settings = g->bloom_filter_settings;
>         }
>     }
> 
> , which drags forward Bloom filter settings across adjacent layers.
> 
> This doesn't quite address all cases, however, since it is possible for
> intermediate layers to contain no Bloom filters at all. For example,
> suppose we have two layers in a commit-graph chain, say, {G1, G2}. If G1
> contains Bloom filters, but G2 doesn't, a new G3 (whose base graph is
> G2) may be written with arbitrary Bloom filter settings, because we only
> check the immediately adjacent layer's settings for compatibility.
> 
> This behavior has existed since the introduction of changed-path Bloom
> filters. But in practice, this is not such a big deal, since the only
> way up until this point to modify the Bloom filter settings at write
> time is with the undocumented environment variables:
> 
>   - GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY
>   - GIT_TEST_BLOOM_SETTINGS_NUM_HASHES
>   - GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS
> 
> (it is still possible to tweak MAX_CHANGED_PATHS between layers, but
> this does not affect reads, so is allowed to differ across multiple
> graph layers).
> 
> But in future commits, we will introduce another parameter to change the
> hash algorithm used to compute Bloom fingerprints itself. This will be
> exposed via a configuration setting, making this foot-gun easier to use.
> 
> To prevent this potential issue, validate that all layers of a split
> commit-graph have compatible settings with the newest layer which
> contains Bloom filters.
> 
> Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
> Original-test-by: SZEDER Gábor <szeder.dev@gmail.com>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  commit-graph.c       | 25 +++++++++++++++++
>  t/t4216-log-bloom.sh | 64 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 89 insertions(+)
> 
> diff --git a/commit-graph.c b/commit-graph.c
> index 1a56efcf69..ae0902f7f4 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -498,6 +498,30 @@ static int validate_mixed_generation_chain(struct commit_graph *g)
>  	return 0;
>  }
>  
> +static void validate_mixed_bloom_settings(struct commit_graph *g)
> +{
> +	struct bloom_filter_settings *settings = NULL;
> +	for (; g; g = g->base_graph) {
> +		if (!g->bloom_filter_settings)
> +			continue;
> +		if (!settings) {
> +			settings = g->bloom_filter_settings;
> +			continue;
> +		}
> +
> +		if (g->bloom_filter_settings->bits_per_entry != settings->bits_per_entry ||
> +		    g->bloom_filter_settings->num_hashes != settings->num_hashes) {
> +			g->chunk_bloom_indexes = NULL;
> +			g->chunk_bloom_data = NULL;
> +			FREE_AND_NULL(g->bloom_filter_settings);
> +
> +			warning(_("disabling Bloom filters for commit-graph "
> +				  "layer '%s' due to incompatible settings"),
> +				oid_to_hex(&g->oid));
> +		}
> +	}
> +}
> +
>  static int add_graph_to_chain(struct commit_graph *g,
>  			      struct commit_graph *chain,
>  			      struct object_id *oids,
> @@ -614,6 +638,7 @@ struct commit_graph *load_commit_graph_chain_fd_st(struct repository *r,
>  	}
>  
>  	validate_mixed_generation_chain(graph_chain);
> +	validate_mixed_bloom_settings(graph_chain);
>  
>  	free(oids);
>  	fclose(fp);
> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> index 322640feeb..f49a8f2fbf 100755
> --- a/t/t4216-log-bloom.sh
> +++ b/t/t4216-log-bloom.sh
> @@ -420,4 +420,68 @@ test_expect_success 'Bloom generation backfills empty commits' '
>  	)
>  '
>  
> +graph=.git/objects/info/commit-graph
> +graphdir=.git/objects/info/commit-graphs
> +chain=$graphdir/commit-graph-chain
> +
> +test_expect_success 'setup for mixed Bloom setting tests' '
> +	repo=mixed-bloom-settings &&
> +
> +	git init $repo &&
> +	for i in one two three
> +	do
> +		test_commit -C $repo $i file || return 1
> +	done
> +'
> +
> +test_expect_success 'split' '
> +	# Compute Bloom filters with "unusual" settings.
> +	git -C $repo rev-parse one >in &&
> +	GIT_TEST_BLOOM_SETTINGS_NUM_HASHES=3 git -C $repo commit-graph write \
> +		--stdin-commits --changed-paths --split <in &&
> +	layer=$(head -n 1 $repo/$chain) &&
> +
> +	# A commit-graph layer without Bloom filters "hides" the layers
> +	# below ...
> +	git -C $repo rev-parse two >in &&
> +	git -C $repo commit-graph write --stdin-commits --no-changed-paths \
> +		--split=no-merge <in &&
> +
> +	# Another commit-graph layer that has Bloom filters, but with
> +	# standard settings, and is thus incompatible with the base
> +	# layer written above.
> +	git -C $repo rev-parse HEAD >in &&
> +	git -C $repo commit-graph write --stdin-commits --changed-paths \
> +		--split=no-merge <in &&
> +
> +	test_line_count = 3 $repo/$chain &&
> +
> +	# Ensure that incompatible Bloom filters are ignored.
> +	git -C $repo -c core.commitGraph=false log --oneline --no-decorate -- file \
> +		>expect 2>err &&
> +	git -C $repo log --oneline --no-decorate -- file >actual 2>err &&
> +	test_cmp expect actual &&
> +	grep "disabling Bloom filters for commit-graph layer .$layer." err
> +'

Up to this point everything looks sensible to me.

> +test_expect_success 'merge graph layers with incompatible Bloom settings' '
> +	# Ensure that incompatible Bloom filters are ignored when
> +	# generating new layers.
> +	git -C $repo commit-graph write --reachable --changed-paths 2>err &&
> +	grep "disabling Bloom filters for commit-graph layer .$layer." err &&
> +
> +	test_path_is_file $repo/$graph &&
> +	test_dir_is_empty $repo/$graphdir &&
> +
> +	# ...and merging existing ones.
> +	git -C $repo -c core.commitGraph=false log --oneline --no-decorate -- file \
> +		>expect 2>err &&
> +	GIT_TRACE2_PERF="$(pwd)/trace.perf" \
> +		git -C $repo log --oneline --no-decorate -- file >actual 2>err &&

But this test is a bit confusing to me, to be honest, also because the
comment for the second block here reads funny. We don't really merge
anything, do we? We only generate logs and compare that the log with and
without the resulting merged commit graph is the same. The actual logic
happened before.

> +	test_cmp expect actual && cat err &&

The `cat err` looks like a leftover from debugging.

> +	grep "statistics:{\"filter_not_present\":0" trace.perf &&

Also, why should the filter not be present here? If we merge the
commit-graphs with `--changed-paths` I'd have expected that we either
carry over bloom filters from preexisting commit graphs if compatible,
or otherwise generate them if they are either incompatible or don't
exist.

I feel like I'm missing something obvious, so this may be me just
missing the bigger picture.

> +	! grep "disabling Bloom filters" err

Can we make this assertion stricter and verify that `err` is empty? I
always think that `! grep` is quite a fragile pattern as it is quite
prone to becoming stale, e.g. when the error message itself would
change.

Patrick

> +'
> +
>  test_done
> -- 
> 2.42.0.342.g8bb3a896ee
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  reply	other threads:[~2023-10-17  8:45 UTC|newest]

Thread overview: 76+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-21 21:43 [PATCH 00/15] bloom: changed-path Bloom filters v2 Taylor Blau
2023-08-21 21:43 ` [PATCH 01/15] gitformat-commit-graph: describe version 2 of BDAT Taylor Blau
2023-08-21 21:44 ` [PATCH 02/15] t/helper/test-read-graph.c: extract `dump_graph_info()` Taylor Blau
2023-08-21 21:44 ` [PATCH 03/15] bloom.h: make `load_bloom_filter_from_graph()` public Taylor Blau
2023-08-21 21:44 ` [PATCH 04/15] t/helper/test-read-graph: implement `bloom-filters` mode Taylor Blau
2023-08-21 21:44 ` [PATCH 05/15] t4216: test changed path filters with high bit paths Taylor Blau
2023-08-21 21:44 ` [PATCH 06/15] repo-settings: introduce commitgraph.changedPathsVersion Taylor Blau
2023-08-21 21:44 ` [PATCH 07/15] commit-graph: new filter ver. that fixes murmur3 Taylor Blau
2023-08-26 15:06   ` SZEDER Gábor
2023-08-29 16:31     ` Jonathan Tan
2023-08-30 20:02       ` SZEDER Gábor
2023-09-01 20:56         ` Jonathan Tan
2023-09-25 23:03           ` Taylor Blau
2023-10-08 14:35             ` SZEDER Gábor
2023-10-09 18:17               ` Taylor Blau
2023-10-09 19:31                 ` Taylor Blau
2023-10-09 19:52                   ` Junio C Hamano
2023-10-10 20:34                     ` Taylor Blau
2023-08-21 21:44 ` [PATCH 08/15] bloom: annotate filters with hash version Taylor Blau
2023-08-21 21:44 ` [PATCH 09/15] bloom: prepare to discard incompatible Bloom filters Taylor Blau
2023-08-21 21:44 ` [PATCH 10/15] t/t4216-log-bloom.sh: harden `test_bloom_filters_not_used()` Taylor Blau
2023-08-21 21:44 ` [PATCH 11/15] commit-graph.c: unconditionally load Bloom filters Taylor Blau
2023-08-21 21:44 ` [PATCH 12/15] commit-graph: drop unnecessary `graph_read_bloom_data_context` Taylor Blau
2023-08-21 21:44 ` [PATCH 13/15] object.h: fix mis-aligned flag bits table Taylor Blau
2023-08-21 21:44 ` [PATCH 14/15] commit-graph: reuse existing Bloom filters where possible Taylor Blau
2023-08-21 21:44 ` [PATCH 15/15] bloom: introduce `deinit_bloom_filters()` Taylor Blau
2023-08-24 22:22 ` [PATCH 00/15] bloom: changed-path Bloom filters v2 Jonathan Tan
2023-08-25 17:06   ` Jonathan Tan
2023-08-29 22:18     ` Jonathan Tan
2023-08-29 23:16       ` Junio C Hamano
2023-08-30 16:43 ` [PATCH v2 " Jonathan Tan
2023-08-30 16:43   ` [PATCH v2 01/15] gitformat-commit-graph: describe version 2 of BDAT Jonathan Tan
2023-08-30 16:43   ` [PATCH v2 02/15] t/helper/test-read-graph.c: extract `dump_graph_info()` Jonathan Tan
2023-08-30 16:43   ` [PATCH v2 03/15] bloom.h: make `load_bloom_filter_from_graph()` public Jonathan Tan
2023-08-30 16:43   ` [PATCH v2 04/15] t/helper/test-read-graph: implement `bloom-filters` mode Jonathan Tan
2023-08-30 16:43   ` [PATCH v2 05/15] t4216: test changed path filters with high bit paths Jonathan Tan
2023-08-30 16:43   ` [PATCH v2 06/15] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
2023-08-30 16:43   ` [PATCH v2 07/15] commit-graph: new filter ver. that fixes murmur3 Jonathan Tan
2023-08-30 16:43   ` [PATCH v2 08/15] bloom: annotate filters with hash version Jonathan Tan
2023-08-30 16:43   ` [PATCH v2 09/15] bloom: prepare to discard incompatible Bloom filters Jonathan Tan
2023-08-30 16:43   ` [PATCH v2 10/15] t/t4216-log-bloom.sh: harden `test_bloom_filters_not_used()` Jonathan Tan
2023-08-30 16:43   ` [PATCH v2 11/15] commit-graph.c: unconditionally load Bloom filters Jonathan Tan
2023-08-30 16:43   ` [PATCH v2 12/15] commit-graph: drop unnecessary `graph_read_bloom_data_context` Jonathan Tan
2023-08-30 16:43   ` [PATCH v2 13/15] object.h: fix mis-aligned flag bits table Jonathan Tan
2023-08-30 16:43   ` [PATCH v2 14/15] commit-graph: reuse existing Bloom filters where possible Jonathan Tan
2023-08-30 16:43   ` [PATCH v2 15/15] bloom: introduce `deinit_bloom_filters()` Jonathan Tan
2023-08-30 19:38   ` [PATCH v2 00/15] bloom: changed-path Bloom filters v2 Junio C Hamano
2023-10-10 20:33 ` [PATCH v3 00/17] bloom: changed-path Bloom filters v2 (& sundries) Taylor Blau
2023-10-10 20:33   ` [PATCH v3 01/17] t/t4216-log-bloom.sh: harden `test_bloom_filters_not_used()` Taylor Blau
2023-10-10 20:33   ` [PATCH v3 02/17] revision.c: consult Bloom filters for root commits Taylor Blau
2023-10-10 20:33   ` [PATCH v3 03/17] commit-graph: ensure Bloom filters are read with consistent settings Taylor Blau
2023-10-17  8:45     ` Patrick Steinhardt [this message]
2023-10-10 20:33   ` [PATCH v3 04/17] gitformat-commit-graph: describe version 2 of BDAT Taylor Blau
2023-10-10 20:33   ` [PATCH v3 05/17] t/helper/test-read-graph.c: extract `dump_graph_info()` Taylor Blau
2023-10-17  8:45     ` Patrick Steinhardt
2023-10-18 17:37       ` Taylor Blau
2023-10-18 23:56         ` Junio C Hamano
2023-10-10 20:33   ` [PATCH v3 06/17] bloom.h: make `load_bloom_filter_from_graph()` public Taylor Blau
2023-10-10 20:33   ` [PATCH v3 07/17] t/helper/test-read-graph: implement `bloom-filters` mode Taylor Blau
2023-10-10 20:33   ` [PATCH v3 08/17] t4216: test changed path filters with high bit paths Taylor Blau
2023-10-17  8:45     ` Patrick Steinhardt
2023-10-18 17:41       ` Taylor Blau
2023-10-10 20:33   ` [PATCH v3 09/17] repo-settings: introduce commitgraph.changedPathsVersion Taylor Blau
2023-10-10 20:33   ` [PATCH v3 10/17] commit-graph: new filter ver. that fixes murmur3 Taylor Blau
2023-10-17  8:45     ` Patrick Steinhardt
2023-10-18 17:46       ` Taylor Blau
2023-10-10 20:33   ` [PATCH v3 11/17] bloom: annotate filters with hash version Taylor Blau
2023-10-10 20:33   ` [PATCH v3 12/17] bloom: prepare to discard incompatible Bloom filters Taylor Blau
2023-10-10 20:33   ` [PATCH v3 13/17] commit-graph.c: unconditionally load " Taylor Blau
2023-10-17  8:45     ` Patrick Steinhardt
2023-10-10 20:34   ` [PATCH v3 14/17] commit-graph: drop unnecessary `graph_read_bloom_data_context` Taylor Blau
2023-10-10 20:34   ` [PATCH v3 15/17] object.h: fix mis-aligned flag bits table Taylor Blau
2023-10-10 20:34   ` [PATCH v3 16/17] commit-graph: reuse existing Bloom filters where possible Taylor Blau
2023-10-10 20:34   ` [PATCH v3 17/17] bloom: introduce `deinit_bloom_filters()` Taylor Blau
2023-10-17  8:45   ` [PATCH v3 00/17] bloom: changed-path Bloom filters v2 (& sundries) Patrick Steinhardt
2023-10-18 17:47     ` Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZS5JkyBuFzW09LXH@tanuki \
    --to=ps@pks.im \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jonathantanmy@google.com \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).