git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: Taylor Blau <me@ttaylorr.com>
Cc: Derrick Stolee <stolee@gmail.com>, fox <fox.gbr@townlong-yak.com>,
	Eric Sunshine <sunshine@sunshineco.com>,
	git@vger.kernel.org
Subject: [PATCH 01/11] midx: avoid duplicate packed_git entries
Date: Fri, 25 Oct 2024 02:43:40 -0400	[thread overview]
Message-ID: <20241025064340.GA2110355@coredump.intra.peff.net> (raw)
In-Reply-To: <20241025064148.GA2110169@coredump.intra.peff.net>

When we scan a pack directory to load the idx entries we find into the
packed_git list, we skip any of them that are contained in a midx. We
then load them later lazily if we actually need to access the
corresponding pack, referencing them both from the midx struct and the
packed_git list.

The lazy-load in the midx code checks to see if the midx already
mentions the pack, but doesn't otherwise check the packed_git list. This
makes sense, since we should have added any pack to both lists.

But there's a loophole! If we call close_object_store(), that frees the
midx entirely, but _not_ the packed_git structs, which we must keep
around for Reasons[1]. If we then try to look up more objects, we'll
auto-load the midx again, which won't realize that we've already loaded
those packs, and will create duplicate entries in the packed_git list.

This is possibly inefficient, because it means we may open and map the
pack redundantly. But it can also lead to weird user-visible behavior.
The case I found is in "git repack", which closes and reopens the midx
after repacking and then calls update_server_info(). We end up writing
the duplicate entries into objects/info/packs.

We could obviously de-dup them while writing that file, but it seems
like a violation of more core assumptions that we end up with these
duplicate structs at all.

We can avoid the duplicates reasonably efficiently by checking their
names in the pack_map hash. This annoyingly does require a little more
than a straight hash lookup due to the naming conventions, but it should
only happen when we are about to actually open a pack. I don't think one
extra malloc will be noticeable there.

[1] I'm not entirely sure of all the details, except that we generally
    assume the packed_git structs never go away. We noted this
    restriction in the comment added by 6f1e9394e2 (object: fix leaking
    packfiles when closing object store, 2024-08-08), but it's somewhat
    vague. At any rate, if you try freeing the structs in
    close_object_store(), you can observe segfaults all over the test
    suite. So it might be fixable, but it's not trivial.

Signed-off-by: Jeff King <peff@peff.net>
---
+cc Stolee here as the original midx author. I can't think of a good
reason we'd want to avoid this dup-detection here.

 midx.c                        | 20 +++++++++++++++++---
 t/t5200-update-server-info.sh |  8 ++++++++
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/midx.c b/midx.c
index 67e0d64004..479812cb9b 100644
--- a/midx.c
+++ b/midx.c
@@ -445,6 +445,7 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m,
 		      uint32_t pack_int_id)
 {
 	struct strbuf pack_name = STRBUF_INIT;
+	struct strbuf key = STRBUF_INIT;
 	struct packed_git *p;
 
 	pack_int_id = midx_for_pack(&m, pack_int_id);
@@ -455,16 +456,29 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m,
 	strbuf_addf(&pack_name, "%s/pack/%s", m->object_dir,
 		    m->pack_names[pack_int_id]);
 
-	p = add_packed_git(pack_name.buf, pack_name.len, m->local);
+	/* pack_map holds the ".pack" name, but we have the .idx */
+	strbuf_addbuf(&key, &pack_name);
+	strbuf_strip_suffix(&key, ".idx");
+	strbuf_addstr(&key, ".pack");
+	p = hashmap_get_entry_from_hash(&r->objects->pack_map,
+					strhash(key.buf), key.buf,
+					struct packed_git, packmap_ent);
+	if (!p) {
+		p = add_packed_git(pack_name.buf, pack_name.len, m->local);
+		if (p) {
+			install_packed_git(r, p);
+			list_add_tail(&p->mru, &r->objects->packed_git_mru);
+		}
+	}
+
 	strbuf_release(&pack_name);
+	strbuf_release(&key);
 
 	if (!p)
 		return 1;
 
 	p->multi_pack_index = 1;
 	m->packs[pack_int_id] = p;
-	install_packed_git(r, p);
-	list_add_tail(&p->mru, &r->objects->packed_git_mru);
 
 	return 0;
 }
diff --git a/t/t5200-update-server-info.sh b/t/t5200-update-server-info.sh
index ed9dfd624c..cc51c73986 100755
--- a/t/t5200-update-server-info.sh
+++ b/t/t5200-update-server-info.sh
@@ -39,4 +39,12 @@ test_expect_success 'info/refs updates when changes are made' '
 	! test_cmp a b
 '
 
+test_expect_success 'midx does not create duplicate pack entries' '
+	git repack -d --write-midx &&
+	git repack -d &&
+	grep ^P .git/objects/info/packs >packs &&
+	uniq -d <packs >dups &&
+	test_must_be_empty dups
+'
+
 test_done
-- 
2.47.0.363.g6e72b256be


  reply	other threads:[~2024-10-25  6:43 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-19 23:22 Bug report: v2.47.0 cannot fetch version 1 pack indexes fox
2024-10-20  0:37 ` Eric Sunshine
2024-10-21 20:06   ` Taylor Blau
2024-10-20  1:24 ` Jeff King
2024-10-20  2:40   ` Jeff King
2024-10-21 20:33     ` Taylor Blau
2024-10-22  5:14       ` Jeff King
2024-10-22 15:18         ` Taylor Blau
2024-10-25  6:41         ` [PATCH 0/11] dumb-http pack index v1 regression + cleanups Jeff King
2024-10-25  6:43           ` Jeff King [this message]
2024-10-25 21:09             ` [PATCH 01/11] midx: avoid duplicate packed_git entries Taylor Blau
2024-10-25  6:44           ` [PATCH 02/11] t5550: count fetches in "previously-fetched .idx" test Jeff King
2024-10-25  6:58           ` [PATCH 03/11] dumb-http: store downloaded pack idx as tempfile Jeff King
2024-10-25 21:18             ` Taylor Blau
2024-10-26  6:02               ` Jeff King
2024-10-28  0:14                 ` Taylor Blau
2024-10-25  7:00           ` [PATCH 04/11] packfile: drop has_pack_index() Jeff King
2024-10-25 21:27             ` Taylor Blau
2024-10-25  7:00           ` [PATCH 05/11] packfile: drop sha1_pack_name() Jeff King
2024-10-25  7:01           ` [PATCH 06/11] packfile: drop sha1_pack_index_name() Jeff King
2024-10-25  7:02           ` [PATCH 07/11] packfile: warn people away from parse_packed_git() Jeff King
2024-10-25 21:28             ` Taylor Blau
2024-10-25  7:03           ` [PATCH 08/11] http-walker: use object_id instead of bare hash Jeff King
2024-10-25  7:05           ` [PATCH 09/11] packfile: convert find_sha1_pack() to use object_id Jeff King
2024-10-25  7:06           ` [PATCH 10/11] packfile: use object_id in find_pack_entry_one() Jeff King
2024-10-25 21:33             ` Taylor Blau
2024-10-25  7:08           ` [PATCH 11/11] packfile: use oidread() instead of hashcpy() to fill object_id Jeff King
2024-10-25 21:35           ` [PATCH 0/11] dumb-http pack index v1 regression + cleanups Taylor Blau
2024-10-21 20:23   ` Bug report: v2.47.0 cannot fetch version 1 pack indexes Taylor Blau
2024-10-22  5:00     ` Jeff King
2024-10-22 15:50       ` Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20241025064340.GA2110355@coredump.intra.peff.net \
    --to=peff@peff.net \
    --cc=fox.gbr@townlong-yak.com \
    --cc=git@vger.kernel.org \
    --cc=me@ttaylorr.com \
    --cc=stolee@gmail.com \
    --cc=sunshine@sunshineco.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).