All of lore.kernel.org
 help / color / mirror / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: "SZEDER Gábor" <szeder.dev@gmail.com>,
	"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, me@ttaylorr.com, l.s.r@web.de,
	Derrick Stolee <dstolee@microsoft.com>
Subject: Re: [PATCH v3 03/10] bloom: fix logic in get_bloom_filter()
Date: Mon, 29 Jun 2020 09:02:52 -0400	[thread overview]
Message-ID: <e998499b-6c27-e4c8-fa92-d83167ab41b6@gmail.com> (raw)
In-Reply-To: <20200627163335.GO2898@szeder.dev>

On 6/27/2020 12:33 PM, SZEDER Gábor wrote:
> On Fri, Jun 26, 2020 at 12:30:29PM +0000, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> The get_bloom_filter() method is a bit complicated in some parts where
>> it does not need to be. In particular, it needs to return a NULL filter
>> only when compute_if_not_present is zero AND the filter data cannot be
>> loaded from a commit-graph file. This currently happens by accident
>> because the commit-graph does not load changed-path Bloom filters from
>> an existing commit-graph when writing a new one. This will change in a
>> later patch.
>>
>> Also clean up some style issues while we are here.
>>
>> Helped-by: René Scharfe <l.s.r@web.de>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>  bloom.c | 14 ++++++--------
>>  1 file changed, 6 insertions(+), 8 deletions(-)
>>
>> diff --git a/bloom.c b/bloom.c
>> index c38d1cff0c..2af5389795 100644
>> --- a/bloom.c
>> +++ b/bloom.c
>> @@ -186,7 +186,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>>  	struct diff_options diffopt;
>>  	int max_changes = 512;
>>  
>> -	if (bloom_filters.slab_size == 0)
>> +	if (!bloom_filters.slab_size)
>>  		return NULL;
>>
>>  	filter = bloom_filter_slab_at(&bloom_filters, c);
>> @@ -194,16 +194,14 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>>  	if (!filter->data) {
>>  		load_commit_graph_info(r, c);
>>  		if (c->graph_pos != COMMIT_NOT_FROM_GRAPH &&
>> -			r->objects->commit_graph->chunk_bloom_indexes) {
>> -			if (load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
>> -				return filter;
>> -			else
>> -				return NULL;
>> -		}
>> +		    r->objects->commit_graph->chunk_bloom_indexes)
>> +			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
>>  	}
>>  
>> -	if (filter->data || !compute_if_not_present)
>> +	if (filter->data)
>>  		return filter;
>> +	if (!compute_if_not_present)
>> +		return NULL;
> 
> Some callers of get_bloom_filter() invoke it with
> compute_if_not_present=0, but are not prepared to handle a NULL return
> value and dereference it right away:
> 
>   write_graph_chunk_bloom_indexes():
> 
>                 struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
>                 cur_pos += filter->len;
> 
>   write_graph_chunk_bloom_data():
> 
>                 struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
>                 display_progress(progress, ++i);
>                 hashwrite(f, filter->data, filter->len * sizeof(unsigned char));

In theory, these _should_ be safe, because we already computed
the filters in an earlier step, right? We should have generated
the filter and populated it in the slab.

> I don't know whether this was an issue before, but I didn't really
> tried.  Unfortunately, starting with this patch this causes
> segmentation faults basically in all real repositories I use for
> testing.
> 
>   expecting success of 9999.1 'test': 
>           for i in $(test_seq 1 513)
>           do
>                   >file-$i || return 1
>           done &&
>           git add file-* &&
>           git commit -q -m one &&
>   
>           git commit-graph write --reachable --changed-paths
>   
>   Segmentation fault
>   not ok 1 - test

However, you are demonstrating a failure that doesn't appear
in our test suite. I was able to reproduce it.

I can confirm that this patch causes a SIGSEGV when writing
the commit-graph in the Git repository, too.

So, what is wrong with my earlier assumption? There are
two problems.

The thing I notice is that an empty filter (no changes
with respect to the first parent) will have NULL
filter->data, so we are returning NULL instead of a
correctly-empty filter (with len zero).

But what you are hitting here is the max number of changes
limit. That also returns a NULL filter, because we mark
the filter as "TOO LARGE" to store. We store that as a
zero-length filter.

The following fixup corrects the bug and adds a test
similar to yours, but with extra care around ensuring the
revision walk still works appropriately for that large
commit.

In the next version, I will include more in the commit
message about these side-effect changes, especially around
the stats for zero-length filters. The trace2 message will
no longer differentiate between zero-length filters and
NULL filters.

Thanks,
-Stolee

-- >8 --

From f9867adc5de8a072f41b91fd6cd87edfcc92e05e Mon Sep 17 00:00:00 2001
From: Derrick Stolee <dstolee@microsoft.com>
Date: Mon, 29 Jun 2020 08:52:33 -0400
Subject: [PATCH] fixup! bloom: fix logic in get_bloom_filter()

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c       |  8 ++++++--
 revision.c           |  7 -------
 t/t4216-log-bloom.sh | 24 ++++++++++++++++++++++--
 3 files changed, 28 insertions(+), 11 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index a0766a86f5..6752916c1a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1108,7 +1108,8 @@ static int write_graph_chunk_bloom_indexes(struct hashfile *f,
 
 	while (list < last) {
 		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
-		cur_pos += filter->len;
+		size_t len = filter ? filter->len : 0;
+		cur_pos += len;
 		display_progress(progress, ++i);
 		hashwrite_be32(f, cur_pos);
 		list++;
@@ -1154,8 +1155,11 @@ static int write_graph_chunk_bloom_data(struct hashfile *f,
 
 	while (list < last) {
 		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
+		size_t len = filter ? filter->len : 0;
 		display_progress(progress, ++i);
-		hashwrite(f, filter->data, filter->len * sizeof(unsigned char));
+
+		if (len)
+			hashwrite(f, filter->data, len * sizeof(unsigned char));
 		list++;
 	}
 
diff --git a/revision.c b/revision.c
index b40bc5b51b..b9118001f9 100644
--- a/revision.c
+++ b/revision.c
@@ -633,7 +633,6 @@ static unsigned int count_bloom_filter_maybe;
 static unsigned int count_bloom_filter_definitely_not;
 static unsigned int count_bloom_filter_false_positive;
 static unsigned int count_bloom_filter_not_present;
-static unsigned int count_bloom_filter_length_zero;
 
 static void trace2_bloom_filter_statistics_atexit(void)
 {
@@ -641,7 +640,6 @@ static void trace2_bloom_filter_statistics_atexit(void)
 
 	jw_object_begin(&jw, 0);
 	jw_object_intmax(&jw, "filter_not_present", count_bloom_filter_not_present);
-	jw_object_intmax(&jw, "zero_length_filter", count_bloom_filter_length_zero);
 	jw_object_intmax(&jw, "maybe", count_bloom_filter_maybe);
 	jw_object_intmax(&jw, "definitely_not", count_bloom_filter_definitely_not);
 	jw_object_intmax(&jw, "false_positive", count_bloom_filter_false_positive);
@@ -765,11 +763,6 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 		return -1;
 	}
 
-	if (!filter->len) {
-		count_bloom_filter_length_zero++;
-		return -1;
-	}
-
 	for (j = 0; result && j < revs->bloom_keys_nr; j++) {
 		result = bloom_filter_contains(filter,
 					       &revs->bloom_keys[j],
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index d7dd717347..4892364e74 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -60,7 +60,7 @@ setup () {
 
 test_bloom_filters_used () {
 	log_args=$1
-	bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"zero_length_filter\":0,\"maybe\""
+	bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"maybe\""
 	setup "$log_args" &&
 	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
 	test_cmp log_wo_bloom log_w_bloom &&
@@ -146,7 +146,7 @@ test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
 
 test_bloom_filters_used_when_some_filters_are_missing () {
 	log_args=$1
-	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":8"
+	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"maybe\":6,\"definitely_not\":8"
 	setup "$log_args" &&
 	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
 	test_cmp log_wo_bloom log_w_bloom
@@ -171,4 +171,24 @@ test_expect_success 'persist filter settings' '
 	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2-auto.txt
 '
 
+test_expect_success 'correctly report changes over limit' '
+	git init 513changes &&
+	(
+		cd 513changes &&
+		for i in $(test_seq 1 513)
+		do
+			echo $i >file$i.txt || return 1
+		done &&
+		git add . &&
+		git commit -m "files" &&
+		git commit-graph write --reachable --changed-paths &&
+		for i in $(test_seq 1 513)
+		do
+			git -c core.commitGraph=false log -- file$i.txt >expect &&
+			git log -- file$i.txt >actual &&
+			test_cmp expect actual || return 1
+		done
+	)
+'
+
 test_done
\ No newline at end of file
-- 
2.27.0.203.gf402ea6816


  reply	other threads:[~2020-06-29 19:03 UTC|newest]

Thread overview: 76+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-06-15 20:14 [PATCH 0/8] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
2020-06-15 20:14 ` [PATCH 1/8] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
2020-06-18 20:30   ` René Scharfe
2020-06-19 12:58     ` Derrick Stolee
2020-06-15 20:14 ` [PATCH 2/8] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
2020-06-18 20:30   ` René Scharfe
2020-06-15 20:14 ` [PATCH 3/8] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
2020-06-18 20:30   ` René Scharfe
2020-06-15 20:14 ` [PATCH 4/8] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
2020-06-15 20:14 ` [PATCH 5/8] commit-graph: check all leading directories in changed path Bloom filters SZEDER Gábor via GitGitGadget
2020-06-18 20:31   ` René Scharfe
2020-06-19  9:14     ` René Scharfe
2020-06-19 17:17   ` Taylor Blau
2020-06-19 17:19     ` Taylor Blau
2020-06-23 13:47     ` Derrick Stolee
2020-06-15 20:14 ` [PATCH 6/8] bloom: enforce a minimum size of 8 bytes Derrick Stolee via GitGitGadget
2020-06-15 20:14 ` [PATCH 7/8] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
2020-06-15 20:14 ` [PATCH 8/8] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
2020-06-17 21:21 ` [PATCH 0/8] More commit-graph/Bloom filter improvements Junio C Hamano
2020-06-18  1:46   ` Derrick Stolee
2020-06-23 17:46 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 01/11] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 02/11] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 03/11] bloom: get_bloom_filter() cleanups Derrick Stolee via GitGitGadget
2020-06-25  7:24     ` René Scharfe
2020-06-23 17:47   ` [PATCH v2 04/11] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 05/11] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
2020-06-25  7:25     ` René Scharfe
2020-06-23 17:47   ` [PATCH v2 06/11] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
2020-06-25  7:25     ` René Scharfe
2020-06-25 14:59       ` Derrick Stolee
2020-06-23 17:47   ` [PATCH v2 07/11] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
2020-06-25  7:25     ` René Scharfe
2020-06-25 15:02       ` Derrick Stolee
2020-06-23 17:47   ` [PATCH v2 08/11] revision.c: fix whitespace Derrick Stolee via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 09/11] revision: empty pathspecs should not use Bloom filters Taylor Blau via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 10/11] commit-graph: check all leading directories in changed path " SZEDER Gábor via GitGitGadget
2020-06-25  7:25     ` René Scharfe
2020-06-25 15:05       ` Derrick Stolee
2020-06-26  6:34         ` SZEDER Gábor
2020-06-26 14:42           ` Derrick Stolee
2020-06-23 17:47   ` [PATCH v2 11/11] bloom: enforce a minimum size of 8 bytes Derrick Stolee via GitGitGadget
2020-06-24 23:11   ` [PATCH v2 00/11] More commit-graph/Bloom filter improvements Junio C Hamano
2020-06-24 23:32     ` Derrick Stolee
2020-06-25  0:38       ` Junio C Hamano
2020-06-25 13:38         ` Derrick Stolee
2020-06-25 16:34           ` Junio C Hamano
2020-06-26 12:30   ` [PATCH v3 00/10] " Derrick Stolee via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 01/10] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 02/10] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 03/10] bloom: fix logic in get_bloom_filter() Derrick Stolee via GitGitGadget
2020-06-27 16:33       ` SZEDER Gábor
2020-06-29 13:02         ` Derrick Stolee [this message]
2020-06-26 12:30     ` [PATCH v3 04/10] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 05/10] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 06/10] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 07/10] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 08/10] revision.c: fix whitespace Derrick Stolee via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 09/10] revision: empty pathspecs should not use Bloom filters Taylor Blau via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 10/10] commit-graph: check all leading directories in changed path " SZEDER Gábor via GitGitGadget
2020-07-01 13:27     ` [PATCH v4 00/10] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 01/10] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 02/10] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 03/10] bloom: fix logic in get_bloom_filter() Derrick Stolee via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 04/10] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
2020-10-15 13:21         ` SZEDER Gábor
2020-10-15 21:41           ` Taylor Blau
2020-10-16  2:18             ` Derrick Stolee
2020-10-16  3:18               ` Taylor Blau
2020-10-16 13:52                 ` Derrick Stolee
2020-07-01 13:27       ` [PATCH v4 05/10] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 06/10] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 07/10] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 08/10] revision.c: fix whitespace Derrick Stolee via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 09/10] revision: empty pathspecs should not use Bloom filters Taylor Blau via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 10/10] commit-graph: check all leading directories in changed path " SZEDER Gábor via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e998499b-6c27-e4c8-fa92-d83167ab41b6@gmail.com \
    --to=stolee@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=l.s.r@web.de \
    --cc=me@ttaylorr.com \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.