git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Taylor Blau <me@ttaylorr.com>
To: Will Chandler <wfc@wfchandler.org>
Cc: derrickstolee@github.com, git@vger.kernel.org, peff@peff.net, ps@pks.im
Subject: Re: commit-graph overflow generation chicken and egg
Date: Tue, 5 Jul 2022 18:28:47 -0400	[thread overview]
Message-ID: <YsS7H4i5DqUZVQ5i@nand.local> (raw)
In-Reply-To: <DD88D523-0ECA-4474-9AA5-1D4A431E532A@wfchandler.org>

On Tue, Jul 05, 2022 at 05:03:44PM -0400, Will Chandler wrote:
> Hi all,
>
> I think I've reproduced this problem on v2.36.1 and main.

Great find. I played around with this a little bit locally and your
reproduction works for me on any sufficiently-large repository. Looking
around in the code, I'm pretty confident that this is the issue.

>   $ git -c commitGraph.generationVersion=2 commit-graph write --changed-paths

This step is critical: without it we never end up overwriting the
`->generation` data, and the conversion works fine.

> Because the existing commit graph did not include generation data,
> graph.read_generation_data is 0. When compute_bloom_filters executes, it
> will call fill_commit_graph_info which checks that value and falls back
> to the older generation calculation if false:
>
>   if (g->read_generation_data) {
>   	offset = (timestamp_t)get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
>
>   	if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
>   		if (!g->chunk_generation_data_overflow)
>   			die(_("commit-graph requires overflow generation data but has none"));
>
>   		offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;
>   		graph_data->generation = item->date + get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
>   	} else
>   		graph_data->generation = item->date + offset;
>   } else
>   	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
>
> This re-writes the commit data to:
>
>   {
>     graph_pos = 0,
>     generation = 17631
>   }

Nicely explained. To me, it seems like the problem is that we're reusing
the same slab to:

  - store data that we're going to write as a part of commit-graph
    generation, and
  - store auxiliary data about commits that we have *read* from a
    commit-graph

A complete fix might be to use a separate slab to store data that we
read from data that we are about to write, to avoid polluting the latter
with the former.

In the meantime, a more minimal fix would be to avoid reading and
overwriting the generation data where we can avoid it. AFAICT, this is
the only spot that would need to be changed. The following patch does
the trick for me:

--- >8 ---

diff --git a/bloom.c b/bloom.c
index 5e297038bb..863052fa68 100644
--- a/bloom.c
+++ b/bloom.c
@@ -30,10 +30,9 @@ static inline unsigned char get_bitmask(uint32_t pos)

 static int load_bloom_filter_from_graph(struct commit_graph *g,
 					struct bloom_filter *filter,
-					struct commit *c)
+					uint32_t graph_pos)
 {
 	uint32_t lex_pos, start_index, end_index;
-	uint32_t graph_pos = commit_graph_position(c);

 	while (graph_pos < g->num_commits_in_base)
 		g = g->base_graph;
@@ -203,9 +202,11 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 	filter = bloom_filter_slab_at(&bloom_filters, c);

 	if (!filter->data) {
-		load_commit_graph_info(r, c);
-		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH)
-			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
+		uint32_t pos;
+		if ((repo_find_commit_pos_in_graph(r, c, &pos)) &&
+		    (pos != COMMIT_NOT_FROM_GRAPH))
+			load_bloom_filter_from_graph(r->objects->commit_graph,
+						     filter, pos);
 	}

 	if (filter->data && filter->len)
diff --git a/commit-graph.c b/commit-graph.c
index 92d4503336..2d9fad5910 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -889,6 +889,14 @@ static int find_commit_pos_in_graph(struct commit *item, struct commit_graph *g,
 	}
 }

+int repo_find_commit_pos_in_graph(struct repository *r, struct commit *c,
+				  uint32_t *pos)
+{
+	if (!prepare_commit_graph(r))
+		return 0;
+	return find_commit_pos_in_graph(c, r->objects->commit_graph, pos);
+}
+
 struct commit *lookup_commit_in_graph(struct repository *repo, const struct object_id *id)
 {
 	struct commit *commit;
diff --git a/commit-graph.h b/commit-graph.h
index 2e3ac35237..52ddfbe349 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -40,6 +40,9 @@ int open_commit_graph(const char *graph_file, int *fd, struct stat *st);
  */
 int parse_commit_in_graph(struct repository *r, struct commit *item);

+int repo_find_commit_pos_in_graph(struct repository *r, struct commit *c,
+				  uint32_t *pos);
+
 /*
  * Look up the given commit ID in the commit-graph. This will only return a
  * commit if the ID exists both in the graph and in the object database such

--- 8< ---

Thanks,
Taylor

  reply	other threads:[~2022-07-05 22:28 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-08 19:33 commit-graph overflow generation chicken and egg Jeff King
2022-06-08 20:08 ` Derrick Stolee
2022-06-08 23:17   ` Jeff King
2022-07-01 12:06     ` Patrick Steinhardt
2022-07-04 10:46       ` Patrick Steinhardt
2022-07-04 20:50         ` Derrick Stolee
2022-07-05 21:03           ` Will Chandler
2022-07-05 22:28             ` Taylor Blau [this message]
2022-07-06  8:52               ` Jeff King
2022-07-06  9:11           ` Jeff King
2022-06-09  7:49   ` Ævar Arnfjörð Bjarmason
2022-06-09 15:26     ` Jeff King
2022-06-09 15:39       ` Derrick Stolee

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YsS7H4i5DqUZVQ5i@nand.local \
    --to=me@ttaylorr.com \
    --cc=derrickstolee@github.com \
    --cc=git@vger.kernel.org \
    --cc=peff@peff.net \
    --cc=ps@pks.im \
    --cc=wfc@wfchandler.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).