git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Taylor Blau <me@ttaylorr.com>
To: Jeff King <peff@peff.net>
Cc: Junio C Hamano <gitster@pobox.com>,
	Patrick Steinhardt <ps@pks.im>,
	git@vger.kernel.org
Subject: Re: [PATCH v2 2/2] midx: stop repeatedly looking up nonexistent packfiles
Date: Fri, 23 May 2025 13:46:20 -0400	[thread overview]
Message-ID: <aDC0bK+NOuuVvQtb@nand.local> (raw)
In-Reply-To: <20250523020820.GB559000@coredump.intra.peff.net>

On Thu, May 22, 2025 at 10:08:20PM -0400, Jeff King wrote:
> > (As an aside, I think I recall you suggesting a while ago that it might
> > be interesting to define "global" things with a different type than
> > "local" ones to prevent this sort of confusion. That would allow us to
> > keep both "pack_int_id" and the return value of "midx_for_pack()" in
> > scope at the same time, without the possibility of using one when you
> > meant to use the other.)
>
> Yeah. The trouble is that it becomes awkward in C, since the language
> will happily intermix two integer typedefs. So you have to wrap them in
> a struct and access the struct fields everywhere.

Yup :-<.

> > >   - There's a similar case in midx-write.c:want_included_pack(). That
> > >     one seems to have the same local/global confusion, but I do not
> > >     obviously see anything preventing it from being fed a non-base midx.
> > >     So it might possibly be buggy?
> >
> > Yeah, this spot is definitely broken. At minimum it would need something
> > like:
> >
> > --- 8< ---
> > diff --git a/midx-write.c b/midx-write.c
> > index 0897cbd829..54a04f7b75 100644
> > --- a/midx-write.c
> > +++ b/midx-write.c
> > @@ -1634,9 +1634,10 @@ static int want_included_pack(struct repository *r,
> >  			      uint32_t pack_int_id)
> >  {
> >  	struct packed_git *p;
> > +	midx_for_pack(&m, pack_int_id);
> >  	if (prepare_midx_pack(r, m, pack_int_id))
> >  		return 0;
> > -	p = m->packs[pack_int_id];
> > +	p = m->packs[pack_int_id - m->num_packs_in_base];
> >  	if (!pack_kept_objects && p->pack_keep)
> >  		return 0;
> >  	if (p->is_cruft)
> > --- >8 ---
>
> Yep, that's exactly what I was envisioning. I guess it's probably
> possible to trigger in practice by writing a new midx based on an
> existing incremental state. I'll let you figure that part out. :)

Hmm. I thought that this spot was broken last night, but looking at it
again today I think that it is actually OK. I started writing an
analysis of why in response to your email, and then decided to throw it
away since those details really should live in the patch message. Here's
what I came up with:

--- 8< ---
Subject: [PATCH] midx-write.c: guard against incremental MIDXs in
 want_included_pack()

The function want_included_pack() is used to determine whether or not a
the packfile corresponding to some given pack_int_id should be included
in a 'git multi-pack-index repack' operation.

This spot looks like it would be broken, particularly in:

    struct packed_git *p;
    if (prepare_midx_pack(r, m, pack_int_id))
        return 0;
    p = m->packs[pack_int_id];

, when pack_int_id is greater than m->num_pack_in_base (supposing that
m->num_packs_in_base is non-zero, or equivalently that m->base_midx is
non-NULL).

Suppose we have two MIDXs in an incremental MIDX chain, each having two
packs:

  - M0 = {packs: [P0, P1], base_midx: NULL}
  - M1 = {packs: [P2, P3], base_midx: M0}

noting that each pack is identified by its global pack_int_id within the
chain.

So if you do something like:

    want_included_pack(the_repository, M1, pack_kept_objects, 2);

that function will call prepare_midx_pack(), which is smart enough to
realize that the pack of interest is in the current layer (M1), and
knows how to adjust its global pack_int_id into an index into the
current layer's 'packs' array.

But the following line:

    p = m->packs[pack_int_id]; /* m is M1, pack_int_id is 2 */

looks broken, since each layer of the MIDX only maintains an array of
the packs stored within that layer, and 'm' wasn't adjusted back to
point at M1->base_midx (M0).

The right thing to do would be:

    struct packed_git *p;
    if (prepare_midx_pack(r, m, pack_int_id))
        return 0;

    /* open-code midx.c::midx_for_pack(), which is private */
    while (m && pack_int_id < m->num_packs_in_base)
        m = m->base_midx;

    if (!m)
        BUG("broken midx?");
    if (pack_int_id >= m->num_packs + m->num_packs_in_base)
        BUG("broken pack_int_id?");

    p = m->packs[pack_int_id - m->num_packs_in_base];

But that would be overkill, since this function never deals with
incremental MIDXs having more than one layer! To see why, observe that
want_included_pack() has two callers:

  - midx-write.c::fill_included_packs_all()
  - midx-write.c::fill_included_packs_batch()

and those functions' sole caller is in midx-write.c::midx_repack(),
which dispatches a call to one or the other depending on whether or not
the batch_size is non-zero.

And at the very top of midx_repack(), we have a guard against
non-trivial incremental MIDX chains:

    if (m->base_midx)
        die(_("cannot repack an incremental multi-pack-index"));

So want_included_pack() is OK becuase it will never encounter a
situation where it has to chase backwards through the '->base_midx'
pointer. But that is not immediately clear from reading the code, and is
too fragile for my comfort. Make this more clear by adding an ASSERT()
to the above effect.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx-write.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/midx-write.c b/midx-write.c
index 0897cbd829..991c42d1dc 100644
--- a/midx-write.c
+++ b/midx-write.c
@@ -1634,6 +1634,9 @@ static int want_included_pack(struct repository *r,
 			      uint32_t pack_int_id)
 {
 	struct packed_git *p;
+
+	ASSERT(m && !m->base_midx);
+
 	if (prepare_midx_pack(r, m, pack_int_id))
 		return 0;
 	p = m->packs[pack_int_id];
@@ -1653,6 +1656,8 @@ static void fill_included_packs_all(struct repository *r,
 	uint32_t i;
 	int pack_kept_objects = 0;

+	ASSERT(m && !m->base_midx);
+
 	repo_config_get_bool(r, "repack.packkeptobjects", &pack_kept_objects);

 	for (i = 0; i < m->num_packs; i++) {
@@ -1673,6 +1678,8 @@ static void fill_included_packs_batch(struct repository *r,
 	struct repack_info *pack_info;
 	int pack_kept_objects = 0;

+	ASSERT(m && !m->base_midx);
+
 	CALLOC_ARRAY(pack_info, m->num_packs);

 	repo_config_get_bool(r, "repack.packkeptobjects", &pack_kept_objects);
--
2.49.0.221.g485f5f8636.dirty
--- >8 ---

> > >     Likewise fill_included_packs_batch() in the same file.
> >
> > I think this one is actually OK for the same reason as the
> > expire_midx_packs() case. Its sole caller in midx_repack() has:
> >
> >     if (m->base_midx)
> >         die(_("cannot repack an incremental multi-pack-index"));
> >
> > , so we are OK there. We might want to add an ASSERT() in
> > fill_included_packs_batch() to make it clearer, though.
>
> Ah, good. Yes, I agree that an assertion would be a nice bit of
> documentation/safety. Though for these cases I think they only care
> about the pack (not the containing midx), so changing the return type of
> prepare_midx_pack() might be an even easier way to get that safety.

Ironically, the same argument applies for this function (and the _all()
variant) as well ;-).

Thanks,
Taylor

  reply	other threads:[~2025-05-23 17:46 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-16  8:55 [PATCH] packfile: avoid access(3p) calls for missing packs Patrick Steinhardt
2025-05-16 18:34 ` Junio C Hamano
2025-05-19  6:52   ` Jeff King
2025-05-19 15:46     ` Junio C Hamano
2025-05-20  6:45     ` Patrick Steinhardt
2025-05-22  5:28       ` Jeff King
2025-05-23  1:02     ` Taylor Blau
2025-05-23  2:03       ` Jeff King
2025-05-20  9:53 ` [PATCH v2 0/2] " Patrick Steinhardt
2025-05-20  9:53   ` [PATCH v2 1/2] packfile: explain ordering of how we look up auxiliary pack files Patrick Steinhardt
2025-05-23  1:03     ` Taylor Blau
2025-05-20  9:53   ` [PATCH v2 2/2] midx: stop repeatedly looking up nonexistent packfiles Patrick Steinhardt
2025-05-22  5:32     ` Jeff King
2025-05-22 15:47       ` Junio C Hamano
2025-05-22 16:59         ` Jeff King
2025-05-22 18:44           ` Junio C Hamano
2025-05-23  1:22           ` Taylor Blau
2025-05-23  2:08             ` Jeff King
2025-05-23 17:46               ` Taylor Blau [this message]
2025-05-25 18:41                 ` [PATCH 0/5] midx: improve prepare_midx_pack() ergonomics Taylor Blau
2025-05-25 18:41                   ` [PATCH 1/5] pack-bitmap.c: fix broken warning() when missing MIDX'd pack Taylor Blau
2025-05-26  7:23                     ` Patrick Steinhardt
2025-05-28  2:00                       ` Taylor Blau
2025-05-25 18:41                   ` [PATCH 2/5] midx-write.c: guard against incremental MIDXs in want_included_pack() Taylor Blau
2025-05-26  7:23                     ` Patrick Steinhardt
2025-05-28  2:08                       ` Taylor Blau
2025-05-25 18:41                   ` [PATCH 3/5] midx-write.c: simplify fill_packs_from_midx() Taylor Blau
2025-05-26  7:23                     ` Patrick Steinhardt
2025-05-28  2:15                       ` Taylor Blau
2025-05-25 18:42                   ` [PATCH 4/5] midx-write.c: extract inner loop from fill_packs_from_midx() Taylor Blau
2025-05-25 18:42                   ` [PATCH 5/5] midx: return a `packed_git` pointer from `prepare_midx_pack()` Taylor Blau
2025-05-26  7:24                     ` Patrick Steinhardt
2025-05-28  2:18                       ` Taylor Blau
2025-05-28 11:53                         ` Patrick Steinhardt
2025-05-28 22:58                   ` [PATCH v2 0/4] midx: improve prepare_midx_pack() ergonomics Taylor Blau
2025-05-28 22:59                     ` [PATCH v2 1/4] midx: access pack names through `nth_midxed_pack_name()` Taylor Blau
2025-05-29 20:47                       ` Junio C Hamano
2025-06-03 22:22                         ` Taylor Blau
2025-05-29 20:51                       ` Junio C Hamano
2025-06-03 22:23                         ` Taylor Blau
2025-05-28 22:59                     ` [PATCH v2 2/4] midx-write.c: guard against incremental MIDXs in want_included_pack() Taylor Blau
2025-05-28 22:59                     ` [PATCH v2 3/4] midx-write.c: extract inner loop from fill_packs_from_midx() Taylor Blau
2025-05-28 22:59                     ` [PATCH v2 4/4] midx: return a `packed_git` pointer from `prepare_midx_pack()` Taylor Blau
2025-05-30  6:50                       ` Jeff King
2025-06-03 22:27                         ` Taylor Blau
2025-08-28 23:25                           ` Junio C Hamano
2025-05-23  1:31       ` [PATCH v2 2/2] midx: stop repeatedly looking up nonexistent packfiles Taylor Blau
2025-05-23  2:18         ` Jeff King
2025-05-21 13:24   ` [PATCH v2 0/2] packfile: avoid access(3p) calls for missing packs Junio C Hamano
2025-05-28 12:24 ` [PATCH v3 " Patrick Steinhardt
2025-05-28 12:24   ` [PATCH v3 1/2] packfile: explain ordering of how we look up auxiliary pack files Patrick Steinhardt
2025-05-28 12:24   ` [PATCH v3 2/2] midx: stop repeatedly looking up nonexistent packfiles Patrick Steinhardt
2025-05-30  6:27   ` [PATCH v3 0/2] packfile: avoid access(3p) calls for missing packs Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aDC0bK+NOuuVvQtb@nand.local \
    --to=me@ttaylorr.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    --cc=ps@pks.im \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).