From: Jeff King <peff@peff.net>
To: Taylor Blau <me@ttaylorr.com>
Cc: Derrick Stolee <stolee@gmail.com>, fox <fox.gbr@townlong-yak.com>,
Eric Sunshine <sunshine@sunshineco.com>,
git@vger.kernel.org
Subject: [PATCH 01/11] midx: avoid duplicate packed_git entries
Date: Fri, 25 Oct 2024 02:43:40 -0400
Message-ID: <20241025064340.GA2110355@coredump.intra.peff.net>
In-Reply-To: <20241025064148.GA2110169@coredump.intra.peff.net>
When we scan a pack directory to load the idx entries we find into the
packed_git list, we skip any of them that are contained in a midx. We
then load them later lazily if we actually need to access the
corresponding pack, referencing them both from the midx struct and the
packed_git list.
The lazy-load in the midx code checks to see if the midx already
mentions the pack, but doesn't otherwise check the packed_git list. This
makes sense, since we should have added any pack to both lists.
But there's a loophole! If we call close_object_store(), that frees the
midx entirely, but _not_ the packed_git structs, which we must keep
around for Reasons[1]. If we then try to look up more objects, we'll
auto-load the midx again, which won't realize that we've already loaded
those packs, and will create duplicate entries in the packed_git list.
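To make the sequence concrete, here is a toy model of the loophole. It is only a sketch: toy_store, toy_prepare_pack() and toy_close() are made-up names, not git's real structures, and a single flag stands in for the midx's per-pack pointer.

```c
#include <string.h>

#define MAX_PACKS 8

/* Toy stand-ins for illustration; these are not git's real structures. */
struct toy_store {
	const char *packed[MAX_PACKS]; /* pack list, survives a "close" */
	int nr_packed;
	int midx_loaded;               /* midx-side state, reset on "close" */
};

/* Linear search of the pack list by name. */
static int toy_find(const struct toy_store *s, const char *name)
{
	for (int i = 0; i < s->nr_packed; i++)
		if (!strcmp(s->packed[i], name))
			return 1;
	return 0;
}

/* Lazy load through the midx; "dedup" selects the fixed behavior. */
void toy_prepare_pack(struct toy_store *s, const char *name, int dedup)
{
	if (s->midx_loaded)
		return; /* the midx already references this pack */
	if (!(dedup && toy_find(s, name)) && s->nr_packed < MAX_PACKS)
		s->packed[s->nr_packed++] = name;
	s->midx_loaded = 1;
}

/* Forget the midx state but keep the pack list, as in the loophole. */
void toy_close(struct toy_store *s)
{
	s->midx_loaded = 0;
}
```

Calling toy_prepare_pack(), then toy_close(), then toy_prepare_pack() again with dedup set to 0 leaves two entries for the same name in the list. With dedup set to 1, the second call finds the surviving entry and the list stays at one.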
This is possibly inefficient, because it means we may open and map the
pack redundantly. But it can also lead to weird user-visible behavior.
The case I found is in "git repack", which closes and reopens the midx
after repacking and then calls update_server_info(). We end up writing
the duplicate entries into objects/info/packs.
We could obviously de-dup them while writing that file, but it seems
like a violation of more core assumptions that we end up with these
duplicate structs at all.
We can avoid the duplicates reasonably efficiently by checking their
names in the pack_map hash. This annoyingly does require a little more
than a straight hash lookup, since pack_map is keyed by the ".pack" name
while the midx gives us the ".idx" name, but it should
only happen when we are about to actually open a pack. I don't think one
extra malloc will be noticeable there.
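As a rough illustration of the name conversion (and of where the one extra allocation comes from), here is a standalone sketch using only libc. idx_to_pack_key() is a hypothetical helper, not a git function; the patch itself does the equivalent with a strbuf.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of git): given a path ending in ".idx",
 * return a newly allocated copy ending in ".pack" instead. If the
 * suffix is absent, ".pack" is simply appended, which matches what a
 * strip-suffix-then-append sequence would do. The caller frees the
 * result; NULL is returned on allocation failure.
 */
char *idx_to_pack_key(const char *idx_path)
{
	static const char idx_ext[] = ".idx";
	static const char pack_ext[] = ".pack";
	size_t ext_len = strlen(idx_ext);
	size_t len;
	char *key;

	assert(idx_path);
	len = strlen(idx_path);
	if (len >= ext_len && !strcmp(idx_path + len - ext_len, idx_ext))
		len -= ext_len; /* drop the ".idx" suffix */

	key = malloc(len + sizeof(pack_ext)); /* sizeof counts the NUL */
	if (!key)
		return NULL;
	memcpy(key, idx_path, len);
	memcpy(key + len, pack_ext, sizeof(pack_ext));
	return key;
}
```

The resulting string is what would be hashed and compared against the map's keys. It is needed only for the duration of the lookup, so it can be freed right afterwards.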
[1] I'm not entirely sure of all the details, except that we generally
assume the packed_git structs never go away. We noted this
restriction in the comment added by 6f1e9394e2 (object: fix leaking
packfiles when closing object store, 2024-08-08), but it's somewhat
vague. At any rate, if you try freeing the structs in
close_object_store(), you can observe segfaults all over the test
suite. So it might be fixable, but it's not trivial.
Signed-off-by: Jeff King <peff@peff.net>
---
+cc Stolee here as the original midx author. I can't think of a good
reason we'd want to avoid this dup-detection here.
midx.c | 20 +++++++++++++++++---
t/t5200-update-server-info.sh | 8 ++++++++
2 files changed, 25 insertions(+), 3 deletions(-)
diff --git a/midx.c b/midx.c
index 67e0d64004..479812cb9b 100644
--- a/midx.c
+++ b/midx.c
@@ -445,6 +445,7 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m,
uint32_t pack_int_id)
{
struct strbuf pack_name = STRBUF_INIT;
+ struct strbuf key = STRBUF_INIT;
struct packed_git *p;
pack_int_id = midx_for_pack(&m, pack_int_id);
@@ -455,16 +456,29 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m,
strbuf_addf(&pack_name, "%s/pack/%s", m->object_dir,
m->pack_names[pack_int_id]);
- p = add_packed_git(pack_name.buf, pack_name.len, m->local);
+ /* pack_map holds the ".pack" name, but we have the .idx */
+ strbuf_addbuf(&key, &pack_name);
+ strbuf_strip_suffix(&key, ".idx");
+ strbuf_addstr(&key, ".pack");
+ p = hashmap_get_entry_from_hash(&r->objects->pack_map,
+ strhash(key.buf), key.buf,
+ struct packed_git, packmap_ent);
+ if (!p) {
+ p = add_packed_git(pack_name.buf, pack_name.len, m->local);
+ if (p) {
+ install_packed_git(r, p);
+ list_add_tail(&p->mru, &r->objects->packed_git_mru);
+ }
+ }
+
strbuf_release(&pack_name);
+ strbuf_release(&key);
if (!p)
return 1;
p->multi_pack_index = 1;
m->packs[pack_int_id] = p;
- install_packed_git(r, p);
- list_add_tail(&p->mru, &r->objects->packed_git_mru);
return 0;
}
diff --git a/t/t5200-update-server-info.sh b/t/t5200-update-server-info.sh
index ed9dfd624c..cc51c73986 100755
--- a/t/t5200-update-server-info.sh
+++ b/t/t5200-update-server-info.sh
@@ -39,4 +39,12 @@ test_expect_success 'info/refs updates when changes are made' '
! test_cmp a b
'
+test_expect_success 'midx does not create duplicate pack entries' '
+ git repack -d --write-midx &&
+ git repack -d &&
+ grep ^P .git/objects/info/packs >packs &&
+ uniq -d <packs >dups &&
+ test_must_be_empty dups
+'
+
test_done
--
2.47.0.363.g6e72b256be