* [PATCH 0/2] pack-objects: more brown-paper-bag multi-pack reuse fixes
@ 2024-11-13 17:32 Taylor Blau
2024-11-13 17:32 ` [PATCH 1/2] t5332-multi-pack-reuse.sh: demonstrate duplicate packing failure Taylor Blau
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Taylor Blau @ 2024-11-13 17:32 UTC (permalink / raw)
To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren
While starting to use multi-pack reuse in more places throughout
GitHub, I noticed a rare but persistent segfault, which is the result
of a broken assumption in write_reused_pack_verbatim().
The first patch demonstrates the problem, and the second patch fixes
it. The second patch explains the broken assumption in detail, but
the gist is that we can't infer that an all-1s word in the reuse
bitmap from a non-preferred pack means that we want all objects
between the ones corresponding to the first and last bit.
I'm fairly disappointed that I didn't catch this obviously-broken
implementation during the original development of this feature, or in
the first round of bugfixes. But these patches should conclusively
resolve at least this issue.
Thanks in advance for your review!
Taylor Blau (2):
t5332-multi-pack-reuse.sh: demonstrate duplicate packing failure
pack-objects: only perform verbatim reuse on the preferred pack
builtin/pack-objects.c | 101 +++++++++++++++---------------------
t/t5332-multi-pack-reuse.sh | 23 ++++++++
2 files changed, 66 insertions(+), 58 deletions(-)
base-commit: 25b0f41288718625b18495de23cc066394c09a92
--
2.46.0.421.g159f2d50e75
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 1/2] t5332-multi-pack-reuse.sh: demonstrate duplicate packing failure
2024-11-13 17:32 [PATCH 0/2] pack-objects: more brown-paper-bag multi-pack reuse fixes Taylor Blau
@ 2024-11-13 17:32 ` Taylor Blau
2024-11-14 1:12 ` Junio C Hamano
2024-11-13 17:32 ` [PATCH 2/2] pack-objects: only perform verbatim reuse on the preferred pack Taylor Blau
2024-11-14 13:42 ` [PATCH v2 0/2] pack-objects: more brown-paper-bag multi-pack reuse fixes Taylor Blau
2 siblings, 1 reply; 14+ messages in thread
From: Taylor Blau @ 2024-11-13 17:32 UTC (permalink / raw)
To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren
In the multi-pack reuse code, there are two paths for reusing the
on-disk representation of an object, handled by:
- builtin/pack-objects.c::write_reused_pack_one()
- builtin/pack-objects.c::write_reused_pack_verbatim()
The former is responsible for copying the bytes for a single object out
of an existing source pack. The latter does the same but for a region of
objects aligned at eword_t boundaries.
Demonstrate a bug whereby write_reused_pack_verbatim() can be tricked
into writing out objects from some source pack, even when those objects
were selected from a different source pack in the MIDX bitmap.
When the caller wants at least one of the objects in that region,
pack-objects will write the same object twice as a result of this bug.
In the other case where the caller doesn't want any of the objects in
the region of interest, we will write out objects that weren't
requested.
Demonstrate this bug by creating two packs, where the preferred one of
those packs contains a single object which also appears in the main
(non-preferred) pack. A separate bug[^1] prevents us from triggering the
main bug when the duplicated object is the last one in the main pack,
but any earlier object will suffice.
We could fix that separate bug, but the following commit will simplify
write_reused_pack_verbatim() and only call it on the preferred pack, so
doing so would have little point.
[^1]: Because write_reused_pack_verbatim() only reuses bits in the range
off_t pack_start_off = pack_pos_to_offset(reuse_packfile->p, 0);
off_t pack_end_off = pack_pos_to_offset(reuse_packfile->p,
pos - reuse_packfile->bitmap_pos);
written += pos - reuse_packfile->bitmap_pos;
/* We're recording one chunk, not one object. */
record_reused_object(pack_start_off,
pack_start_off - (hashfile_total(out) - pack_start));
, or in other words excluding the object beginning at position 'pos -
reuse_packfile->bitmap_pos' in the source pack. But since
reuse_packfile->bitmap_pos is '1' in the non-preferred pack
(accounting for the single-object pack which is preferred), we don't
actually copy the bytes from the last object.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
t/t5332-multi-pack-reuse.sh | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/t/t5332-multi-pack-reuse.sh b/t/t5332-multi-pack-reuse.sh
index 955ea42769b..8f403d9fdaa 100755
--- a/t/t5332-multi-pack-reuse.sh
+++ b/t/t5332-multi-pack-reuse.sh
@@ -259,4 +259,27 @@ test_expect_success 'duplicate objects' '
)
'
+test_expect_failure 'duplicate objects with verbatim reuse' '
+ git init duplicate-objects-verbatim &&
+ (
+ cd duplicate-objects-verbatim &&
+
+ git config pack.allowPackReuse multi &&
+
+ test_commit_bulk 64 &&
+
+ # take the first object from the main pack...
+ git show-index <$(ls $packdir/pack-*.idx) >obj.raw &&
+ sort -nk1 <obj.raw | head -n1 | cut -d" " -f2 >in &&
+
+ # ...and create a separate pack containing just that object
+ p="$(git pack-objects $packdir/pack <in)" &&
+ git show-index <$packdir/pack-$p.idx &&
+
+ git multi-pack-index write --bitmap --preferred-pack=pack-$p.idx &&
+
+ test_pack_objects_reused_all 192 2
+ )
+'
+
test_done
--
2.46.0.421.g159f2d50e75
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 2/2] pack-objects: only perform verbatim reuse on the preferred pack
2024-11-13 17:32 [PATCH 0/2] pack-objects: more brown-paper-bag multi-pack reuse fixes Taylor Blau
2024-11-13 17:32 ` [PATCH 1/2] t5332-multi-pack-reuse.sh: demonstrate duplicate packing failure Taylor Blau
@ 2024-11-13 17:32 ` Taylor Blau
2024-11-14 0:25 ` Jeff King
2024-11-14 13:42 ` [PATCH v2 0/2] pack-objects: more brown-paper-bag multi-pack reuse fixes Taylor Blau
2 siblings, 1 reply; 14+ messages in thread
From: Taylor Blau @ 2024-11-13 17:32 UTC (permalink / raw)
To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren
When reusing objects from source pack(s), write_reused_pack_verbatim()
is responsible for reusing objects whole eword_t's at a time. It works
by taking the longest continuous run of objects from the beginning of
each source pack that the caller wants, and reuses the entirety of that
section from each pack.
This is based on the assumption that we don't have any gaps within the
region. This assumption relieves us from having to patch any
OFS_DELTAs, since we know that there aren't any gaps between any delta
and its base in that region.
To illustrate why this assumption is necessary, suppose we have some
pack P, which has objects X, Y, and Z. If the MIDX's copy of Y was
selected from a pack other than P, then the bit corresponding to object
Y will appear earlier in the bitmap than the bits corresponding to X and
Z.
If pack-objects already has or will use the copy of Y from the pack it
was selected from in the MIDX, then it is an error to reuse all objects
between X and Z in the source pack. Doing so will cause us to reuse Y
from a different pack than the one which represents Y in the MIDX,
causing us to either:
- include the object twice, assuming that the caller wants Y in the
pack, or
- include the object once, resulting in us packing more objects than
necessary.
This regression comes from ca0fd69e37 (pack-objects: prepare
`write_reused_pack_verbatim()` for multi-pack reuse, 2023-12-14), which
incorrectly assumed that there would be no gaps in reusable regions of
non-preferred packs.
Instead, we can only safely perform the whole-word reuse optimization on
the preferred pack, where we know with certainty that no gaps exist in
that region of the bitmap. We can still reuse objects from non-preferred
packs, but we have to inspect them individually in write_reused_pack()
to ensure that any gaps that may exist are accounted for.
This allows us to simplify the implementation of
write_reused_pack_verbatim() back to almost its pre-multi-pack reuse
form, since we can now assume that the beginning of the pack appears at
the beginning of the bitmap, meaning that we don't have to account for
any bits up to the first word boundary (like we had to special case in
ca0fd69e37).
The only significant changes from the pre-ca0fd69e37 implementation are:
- that we can no longer inspect words up to the end of
reuse_packfile_bitmap->word_alloc, since we only want to look at
words whose bits all correspond to objects in the given packfile, and
- that we return early when given a reuse_packfile which is not
preferred, making the call a noop.
In the future, it might be possible to restore this optimization if we
could guarantee that some reuse packs don't contain any gaps by
construction (similar to the "disjoint packs" idea in very early
versions of multi-pack reuse).
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
builtin/pack-objects.c | 101 +++++++++++++++---------------------
t/t5332-multi-pack-reuse.sh | 2 +-
2 files changed, 44 insertions(+), 59 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 08007142671..f413344e90c 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1100,78 +1100,64 @@ static void write_reused_pack_one(struct packed_git *reuse_packfile,
static size_t write_reused_pack_verbatim(struct bitmapped_pack *reuse_packfile,
struct hashfile *out,
- off_t pack_start,
struct pack_window **w_curs)
{
- size_t pos = reuse_packfile->bitmap_pos;
+ size_t pos = 0;
size_t end;
- if (pos % BITS_IN_EWORD) {
- size_t word_pos = (pos / BITS_IN_EWORD);
- size_t offset = pos % BITS_IN_EWORD;
- size_t last;
- eword_t word = reuse_packfile_bitmap->words[word_pos];
-
- if (offset + reuse_packfile->bitmap_nr < BITS_IN_EWORD)
- last = offset + reuse_packfile->bitmap_nr;
- else
- last = BITS_IN_EWORD;
-
- for (; offset < last; offset++) {
- if (word >> offset == 0)
- return word_pos;
- if (!bitmap_get(reuse_packfile_bitmap,
- word_pos * BITS_IN_EWORD + offset))
- return word_pos;
- }
-
- pos += BITS_IN_EWORD - (pos % BITS_IN_EWORD);
- }
-
- /*
- * Now we're going to copy as many whole eword_t's as possible.
- * "end" is the index of the last whole eword_t we copy, but
- * there may be additional bits to process. Those are handled
- * individually by write_reused_pack().
- *
- * Begin by advancing to the first word boundary in range of the
- * bit positions occupied by objects in "reuse_packfile". Then
- * pick the last word boundary in the same range. If we have at
- * least one word's worth of bits to process, continue on.
- */
- end = reuse_packfile->bitmap_pos + reuse_packfile->bitmap_nr;
- if (end % BITS_IN_EWORD)
- end -= end % BITS_IN_EWORD;
- if (pos >= end)
+ if (reuse_packfile->bitmap_pos) {
+ /*
+ * We can't reuse whole chunks verbatim out of
+ * non-preferred packs since we can't guarantee that
+ * all duplicate objects were resolved in favor of
+ * that pack.
+ *
+ * Even if we have a whole eword_t worth of bits that
+ * could be reused, there may be objects between the
+ * objects corresponding to the first and last bit of
+ * that word which were selected from a different
+ * pack, causing us to send duplicate or unwanted
+ * objects.
+ *
+ * Handle non-preferred packs from within
+ * write_reused_pack(), which inspects and reuses
+ * individual bits.
+ */
return reuse_packfile->bitmap_pos / BITS_IN_EWORD;
+ }
- while (pos < end &&
- reuse_packfile_bitmap->words[pos / BITS_IN_EWORD] == (eword_t)~0)
- pos += BITS_IN_EWORD;
+ /*
+ * Only read through the last word whose bits all correspond
+ * to objects in the given packfile, since we must stop at a
+ * word boundary.
+ *
+ * If there is no whole word to read (i.e. the packfile
+ * contains fewer than BITS_IN_EWORD objects), then we'll
+ * inspect bits one-by-one in write_reused_pack().
+ */
+ end = reuse_packfile->bitmap_nr / BITS_IN_EWORD;
+ if (reuse_packfile_bitmap->word_alloc < end)
+ BUG("fewer words than expected in reuse_packfile_bitmap");
- if (pos > end)
- pos = end;
+ while (pos < end && reuse_packfile_bitmap->words[pos] == (eword_t)~0)
+ pos++;
- if (reuse_packfile->bitmap_pos < pos) {
- off_t pack_start_off = pack_pos_to_offset(reuse_packfile->p, 0);
- off_t pack_end_off = pack_pos_to_offset(reuse_packfile->p,
- pos - reuse_packfile->bitmap_pos);
+ if (pos) {
+ off_t to_write;
- written += pos - reuse_packfile->bitmap_pos;
+ written = (pos * BITS_IN_EWORD);
+ to_write = pack_pos_to_offset(reuse_packfile->p, written)
+ - sizeof(struct pack_header);
/* We're recording one chunk, not one object. */
- record_reused_object(pack_start_off,
- pack_start_off - (hashfile_total(out) - pack_start));
+ record_reused_object(sizeof(struct pack_header), 0);
hashflush(out);
copy_pack_data(out, reuse_packfile->p, w_curs,
- pack_start_off, pack_end_off - pack_start_off);
+ sizeof(struct pack_header), to_write);
display_progress(progress_state, written);
}
- if (pos % BITS_IN_EWORD)
- BUG("attempted to jump past a word boundary to %"PRIuMAX,
- (uintmax_t)pos);
- return pos / BITS_IN_EWORD;
+ return pos;
}
static void write_reused_pack(struct bitmapped_pack *reuse_packfile,
@@ -1183,8 +1169,7 @@ static void write_reused_pack(struct bitmapped_pack *reuse_packfile,
struct pack_window *w_curs = NULL;
if (allow_ofs_delta)
- i = write_reused_pack_verbatim(reuse_packfile, f, pack_start,
- &w_curs);
+ i = write_reused_pack_verbatim(reuse_packfile, f, &w_curs);
for (; i < reuse_packfile_bitmap->word_alloc; ++i) {
eword_t word = reuse_packfile_bitmap->words[i];
diff --git a/t/t5332-multi-pack-reuse.sh b/t/t5332-multi-pack-reuse.sh
index 8f403d9fdaa..06836a4206c 100755
--- a/t/t5332-multi-pack-reuse.sh
+++ b/t/t5332-multi-pack-reuse.sh
@@ -259,7 +259,7 @@ test_expect_success 'duplicate objects' '
)
'
-test_expect_failure 'duplicate objects with verbatim reuse' '
+test_expect_success 'duplicate objects with verbatim reuse' '
git init duplicate-objects-verbatim &&
(
cd duplicate-objects-verbatim &&
--
2.46.0.421.g159f2d50e75
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] pack-objects: only perform verbatim reuse on the preferred pack
2024-11-13 17:32 ` [PATCH 2/2] pack-objects: only perform verbatim reuse on the preferred pack Taylor Blau
@ 2024-11-14 0:25 ` Jeff King
2024-11-14 13:40 ` Taylor Blau
0 siblings, 1 reply; 14+ messages in thread
From: Jeff King @ 2024-11-14 0:25 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, Junio C Hamano, Elijah Newren
On Wed, Nov 13, 2024 at 12:32:58PM -0500, Taylor Blau wrote:
> Instead, we can only safely perform the whole-word reuse optimization on
> the preferred pack, where we know with certainty that no gaps exist in
> that region of the bitmap. We can still reuse objects from non-preferred
> packs, but we have to inspect them individually in write_reused_pack()
> to ensure that any gaps that may exist are accounted for.
Yep. With the disclaimer that I'm biased because I helped a little with
debugging, this change is obviously correct. Forgetting the bug you saw
in the real world, we know this function cannot work as-is because of
the potential for those gaps.
> This allows us to simplify the implementation of
> write_reused_pack_verbatim() back to almost its pre-multi-pack reuse
> form, since we can now assume that the beginning of the pack appears at
> the beginning of the bitmap, meaning that we don't have to account for
> any bits up to the first word boundary (like we had to special case in
> ca0fd69e37).
>
> The only significant changes from the pre-ca0fd69e37 implementation are:
> [...]
Thanks for this section. My first instinct was to go back and look at
the diff to the pre-midx version of the function, and this nicely
explains the hunks I see there.
So this patch looks good to me. I was able to follow your explanation in
the commit message, but that may not count for much since I'm probably
the only other person with deep knowledge of the verbatim-reuse code in
the first place. ;)
I do think the explanation in the message of the first commit would be a
lot simpler if it were simply combined into this patch. With them split
you effectively have to explain the problem twice. I don't feel that
strongly about changing it, though.
-Peff
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] t5332-multi-pack-reuse.sh: demonstrate duplicate packing failure
2024-11-13 17:32 ` [PATCH 1/2] t5332-multi-pack-reuse.sh: demonstrate duplicate packing failure Taylor Blau
@ 2024-11-14 1:12 ` Junio C Hamano
2024-11-14 13:37 ` Taylor Blau
0 siblings, 1 reply; 14+ messages in thread
From: Junio C Hamano @ 2024-11-14 1:12 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, Jeff King, Elijah Newren
Taylor Blau <me@ttaylorr.com> writes:
> +test_expect_failure 'duplicate objects with verbatim reuse' '
> + git init duplicate-objects-verbatim &&
> + (
> + cd duplicate-objects-verbatim &&
> +
> + git config pack.allowPackReuse multi &&
> +
> + test_commit_bulk 64 &&
> +
> + # take the first object from the main pack...
> + git show-index <$(ls $packdir/pack-*.idx) >obj.raw &&
> + sort -nk1 <obj.raw | head -n1 | cut -d" " -f2 >in &&
> +
> + # ...and create a separate pack containing just that object
> + p="$(git pack-objects $packdir/pack <in)" &&
> + git show-index <$packdir/pack-$p.idx &&
Is this done so that "git show-index" fails when the .idx file fed
is malformed? Or is it a leftover debugging aid, where a human
developer was helped by eyeballing the contents of the .idx file in
human readable form? If the latter, do we perhaps want to "parse"
the output the same way in this test to validate our expectation?
> + git multi-pack-index write --bitmap --preferred-pack=pack-$p.idx &&
> +
> + test_pack_objects_reused_all 192 2
> + )
> +'
> +
> test_done
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] t5332-multi-pack-reuse.sh: demonstrate duplicate packing failure
2024-11-14 1:12 ` Junio C Hamano
@ 2024-11-14 13:37 ` Taylor Blau
0 siblings, 0 replies; 14+ messages in thread
From: Taylor Blau @ 2024-11-14 13:37 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, Jeff King, Elijah Newren
On Thu, Nov 14, 2024 at 10:12:15AM +0900, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > +test_expect_failure 'duplicate objects with verbatim reuse' '
> > + git init duplicate-objects-verbatim &&
> > + (
> > + cd duplicate-objects-verbatim &&
> > +
> > + git config pack.allowPackReuse multi &&
> > +
> > + test_commit_bulk 64 &&
> > +
> > + # take the first object from the main pack...
> > + git show-index <$(ls $packdir/pack-*.idx) >obj.raw &&
> > + sort -nk1 <obj.raw | head -n1 | cut -d" " -f2 >in &&
> > +
> > + # ...and create a separate pack containing just that object
> > + p="$(git pack-objects $packdir/pack <in)" &&
> > + git show-index <$packdir/pack-$p.idx &&
>
> Is this done so that "git show-index" fails when the .idx file fed
> is malformed? Or is it a leftover debugging aid, where a human
> developer was helped by eyeballing the contents of the .idx file in
> human readable form? If the latter, do we perhaps want to "parse"
> the output the same way in this test to validate our expectation?
Oops. This is stray debugging left over that I forgot to take out before
committing. That makes it the latter of the two you mentioned, but I
don't think we need to validate the output of 'git show-index' in this
instance.
We're just relying on pack-objects to generate a pack containing a
single object, which feels like basic functionality not worth explicitly
making an assertion on.
There is a subtle assertion on the line below here:
> > + git multi-pack-index write --bitmap --preferred-pack=pack-$p.idx &&
that the pack (a) exists, and (b) has at least one object, since both
conditions must be met for a pack to become preferred in a MIDX bitmap.
But beyond that, I don't think we need to validate the output of
show-index here. I'll remove the stray debugging line and send a new
round. Sorry about that, and thanks for spotting!
Thanks,
Taylor
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] pack-objects: only perform verbatim reuse on the preferred pack
2024-11-14 0:25 ` Jeff King
@ 2024-11-14 13:40 ` Taylor Blau
2024-11-15 9:57 ` Jeff King
2024-11-22 9:16 ` Kristoffer Haugsbakk
0 siblings, 2 replies; 14+ messages in thread
From: Taylor Blau @ 2024-11-14 13:40 UTC (permalink / raw)
To: Jeff King; +Cc: git, Junio C Hamano, Elijah Newren
On Wed, Nov 13, 2024 at 07:25:04PM -0500, Jeff King wrote:
> On Wed, Nov 13, 2024 at 12:32:58PM -0500, Taylor Blau wrote:
>
> > Instead, we can only safely perform the whole-word reuse optimization on
> > the preferred pack, where we know with certainty that no gaps exist in
> > that region of the bitmap. We can still reuse objects from non-preferred
> > packs, but we have to inspect them individually in write_reused_pack()
> > to ensure that any gaps that may exist are accounted for.
>
> Yep. With the disclaimer that I'm biased because I helped a little with
> debugging, this change is obviously correct. Forgetting the bug you saw
> in the real world, we know this function cannot work as-is because of
> the potential for those gaps.
Yep, and thanks again for your help ;-).
> > This allows us to simplify the implementation of
> > write_reused_pack_verbatim() back to almost its pre-multi-pack reuse
> > form, since we can now assume that the beginning of the pack appears at
> > the beginning of the bitmap, meaning that we don't have to account for
> > any bits up to the first word boundary (like we had to special case in
> > ca0fd69e37).
> >
> > The only significant changes from the pre-ca0fd69e37 implementation are:
> > [...]
>
> Thanks for this section. My first instinct was to go back and look at
> the diff to the pre-midx version of the function, and this nicely
> explains the hunks I see there.
>
> So this patch looks good to me. I was able to follow your explanation in
> the commit message, but that may not count for much since I'm probably
> the only other person with deep knowledge of the verbatim-reuse code in
> the first place. ;)
Heh.
> I do think the explanation in the message of the first commit would be a
> lot simpler if it were simply combined into this patch. With them split
> you effectively have to explain the problem twice. I don't feel that
> strongly about changing it, though.
I always seem to go back and forth on that. I feel somewhat strongly
that for complicated regression fixes that we should demonstrate the
existing failure mode in a separate commit with a test_expect_failure.
That forces the author to ensure they really understand the bug and can
produce a minimal (or close to it) reproduction.
It also makes it easier to demonstrate that the fix actually does what
it says, instead of assuming that the test fails without the fix applied
(and passes with it applied).
That does force the author to potentially explain the bug twice. In my
experience, I tend to keep the explanation in the first patch relatively
brief, hinting at details that will appear in the subsequent patch
instead of explaining them in full detail.
So I dunno. It's a tradeoff for sure, but I think having an explicit
point in the log that demonstrates the existing bug is valuable.
Thanks,
Taylor
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH v2 0/2] pack-objects: more brown-paper-bag multi-pack reuse fixes
2024-11-13 17:32 [PATCH 0/2] pack-objects: more brown-paper-bag multi-pack reuse fixes Taylor Blau
2024-11-13 17:32 ` [PATCH 1/2] t5332-multi-pack-reuse.sh: demonstrate duplicate packing failure Taylor Blau
2024-11-13 17:32 ` [PATCH 2/2] pack-objects: only perform verbatim reuse on the preferred pack Taylor Blau
@ 2024-11-14 13:42 ` Taylor Blau
2024-11-14 13:42 ` [PATCH v2 1/2] t5332-multi-pack-reuse.sh: demonstrate duplicate packing failure Taylor Blau
2024-11-14 13:42 ` [PATCH v2 2/2] pack-objects: only perform verbatim reuse on the preferred pack Taylor Blau
2 siblings, 2 replies; 14+ messages in thread
From: Taylor Blau @ 2024-11-14 13:42 UTC (permalink / raw)
To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren
This is a small re-roll of my series to fix a regression in multi-pack
reuse based on a broken assumption in write_reused_pack_verbatim().
The only change since last time is removing a stray debugging aid, and
the series is otherwise identical as in v1. A range-diff is included
below for convenience.
Thanks in advance for your review!
Taylor Blau (2):
t5332-multi-pack-reuse.sh: demonstrate duplicate packing failure
pack-objects: only perform verbatim reuse on the preferred pack
builtin/pack-objects.c | 101 +++++++++++++++---------------------
t/t5332-multi-pack-reuse.sh | 22 ++++++++
2 files changed, 65 insertions(+), 58 deletions(-)
Range-diff against v1:
1: 7a69cf84ae5 ! 1: d791b7b20c9 t5332-multi-pack-reuse.sh: demonstrate duplicate packing failure
@@ t/t5332-multi-pack-reuse.sh: test_expect_success 'duplicate objects' '
+
+ # ...and create a separate pack containing just that object
+ p="$(git pack-objects $packdir/pack <in)" &&
-+ git show-index <$packdir/pack-$p.idx &&
+
+ git multi-pack-index write --bitmap --preferred-pack=pack-$p.idx &&
+
2: 2520abf24a8 = 2: d73b8fe2d63 pack-objects: only perform verbatim reuse on the preferred pack
base-commit: 25b0f41288718625b18495de23cc066394c09a92
--
2.46.0.421.g159f2d50e75
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH v2 1/2] t5332-multi-pack-reuse.sh: demonstrate duplicate packing failure
2024-11-14 13:42 ` [PATCH v2 0/2] pack-objects: more brown-paper-bag multi-pack reuse fixes Taylor Blau
@ 2024-11-14 13:42 ` Taylor Blau
2024-11-14 13:42 ` [PATCH v2 2/2] pack-objects: only perform verbatim reuse on the preferred pack Taylor Blau
1 sibling, 0 replies; 14+ messages in thread
From: Taylor Blau @ 2024-11-14 13:42 UTC (permalink / raw)
To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren
In the multi-pack reuse code, there are two paths for reusing the
on-disk representation of an object, handled by:
- builtin/pack-objects.c::write_reused_pack_one()
- builtin/pack-objects.c::write_reused_pack_verbatim()
The former is responsible for copying the bytes for a single object out
of an existing source pack. The latter does the same but for a region of
objects aligned at eword_t boundaries.
Demonstrate a bug whereby write_reused_pack_verbatim() can be tricked
into writing out objects from some source pack, even when those objects
were selected from a different source pack in the MIDX bitmap.
When the caller wants at least one of the objects in that region,
pack-objects will write the same object twice as a result of this bug.
In the other case where the caller doesn't want any of the objects in
the region of interest, we will write out objects that weren't
requested.
Demonstrate this bug by creating two packs, where the preferred one of
those packs contains a single object which also appears in the main
(non-preferred) pack. A separate bug[^1] prevents us from triggering the
main bug when the duplicated object is the last one in the main pack,
but any earlier object will suffice.
We could fix that separate bug, but the following commit will simplify
write_reused_pack_verbatim() and only call it on the preferred pack, so
doing so would have little point.
[^1]: Because write_reused_pack_verbatim() only reuses bits in the range
off_t pack_start_off = pack_pos_to_offset(reuse_packfile->p, 0);
off_t pack_end_off = pack_pos_to_offset(reuse_packfile->p,
pos - reuse_packfile->bitmap_pos);
written += pos - reuse_packfile->bitmap_pos;
/* We're recording one chunk, not one object. */
record_reused_object(pack_start_off,
pack_start_off - (hashfile_total(out) - pack_start));
, or in other words excluding the object beginning at position 'pos -
reuse_packfile->bitmap_pos' in the source pack. But since
reuse_packfile->bitmap_pos is '1' in the non-preferred pack
(accounting for the single-object pack which is preferred), we don't
actually copy the bytes from the last object.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
t/t5332-multi-pack-reuse.sh | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/t/t5332-multi-pack-reuse.sh b/t/t5332-multi-pack-reuse.sh
index 955ea42769b..d87ea0ae19b 100755
--- a/t/t5332-multi-pack-reuse.sh
+++ b/t/t5332-multi-pack-reuse.sh
@@ -259,4 +259,26 @@ test_expect_success 'duplicate objects' '
)
'
+test_expect_failure 'duplicate objects with verbatim reuse' '
+ git init duplicate-objects-verbatim &&
+ (
+ cd duplicate-objects-verbatim &&
+
+ git config pack.allowPackReuse multi &&
+
+ test_commit_bulk 64 &&
+
+ # take the first object from the main pack...
+ git show-index <$(ls $packdir/pack-*.idx) >obj.raw &&
+ sort -nk1 <obj.raw | head -n1 | cut -d" " -f2 >in &&
+
+ # ...and create a separate pack containing just that object
+ p="$(git pack-objects $packdir/pack <in)" &&
+
+ git multi-pack-index write --bitmap --preferred-pack=pack-$p.idx &&
+
+ test_pack_objects_reused_all 192 2
+ )
+'
+
test_done
--
2.46.0.421.g159f2d50e75
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH v2 2/2] pack-objects: only perform verbatim reuse on the preferred pack
2024-11-14 13:42 ` [PATCH v2 0/2] pack-objects: more brown-paper-bag multi-pack reuse fixes Taylor Blau
2024-11-14 13:42 ` [PATCH v2 1/2] t5332-multi-pack-reuse.sh: demonstrate duplicate packing failure Taylor Blau
@ 2024-11-14 13:42 ` Taylor Blau
2024-11-22 4:44 ` Junio C Hamano
1 sibling, 1 reply; 14+ messages in thread
From: Taylor Blau @ 2024-11-14 13:42 UTC (permalink / raw)
To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren
When reusing objects from source pack(s), write_reused_pack_verbatim()
is responsible for reusing objects whole eword_t's at a time. It works
by taking the longest continuous run of objects from the beginning of
each source pack that the caller wants, and reuses the entirety of that
section from each pack.
This is based on the assumption that we don't have any gaps within the
region. This assumption relieves us from having to patch any
OFS_DELTAs, since we know that there aren't any gaps between any delta
and its base in that region.
To illustrate why this assumption is necessary, suppose we have some
pack P, which has objects X, Y, and Z. If the MIDX's copy of Y was
selected from a pack other than P, then the bit corresponding to object
Y will appear earlier in the bitmap than the bits corresponding to X and
Z.
If pack-objects already has or will use the copy of Y from the pack it
was selected from in the MIDX, then it is an error to reuse all objects
between X and Z in the source pack. Doing so will cause us to reuse Y
from a different pack than the one which represents Y in the MIDX,
causing us to either:
- include the object twice, assuming that the caller wants Y in the
pack, or
- include the object once, resulting in us packing more objects than
necessary.
This regression comes from ca0fd69e37 (pack-objects: prepare
`write_reused_pack_verbatim()` for multi-pack reuse, 2023-12-14), which
incorrectly assumed that there would be no gaps in reusable regions of
non-preferred packs.
Instead, we can only safely perform the whole-word reuse optimization on
the preferred pack, where we know with certainty that no gaps exist in
that region of the bitmap. We can still reuse objects from non-preferred
packs, but we have to inspect them individually in write_reused_pack()
to ensure that any gaps that may exist are accounted for.
This allows us to simplify the implementation of
write_reused_pack_verbatim() back to almost its pre-multi-pack reuse
form, since we can now assume that the beginning of the pack appears at
the beginning of the bitmap, meaning that we don't have to account for
any bits up to the first word boundary (like we had to special case in
ca0fd69e37).
The only significant changes from the pre-ca0fd69e37 implementation are:
- that we can no longer inspect words up to the end of
reuse_packfile_bitmap->word_alloc, since we only want to look at
words whose bits all correspond to objects in the given packfile, and
- that we return early when given a reuse_packfile which is not
preferred, making the call a noop.
In the future, it might be possible to restore this optimization if we
could guarantee that some reuse packs don't contain any gaps by
construction (similar to the "disjoint packs" idea in very early
versions of multi-pack reuse).
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
builtin/pack-objects.c | 101 +++++++++++++++---------------------
t/t5332-multi-pack-reuse.sh | 2 +-
2 files changed, 44 insertions(+), 59 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 08007142671..f413344e90c 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1100,78 +1100,64 @@ static void write_reused_pack_one(struct packed_git *reuse_packfile,
static size_t write_reused_pack_verbatim(struct bitmapped_pack *reuse_packfile,
struct hashfile *out,
- off_t pack_start,
struct pack_window **w_curs)
{
- size_t pos = reuse_packfile->bitmap_pos;
+ size_t pos = 0;
size_t end;
- if (pos % BITS_IN_EWORD) {
- size_t word_pos = (pos / BITS_IN_EWORD);
- size_t offset = pos % BITS_IN_EWORD;
- size_t last;
- eword_t word = reuse_packfile_bitmap->words[word_pos];
-
- if (offset + reuse_packfile->bitmap_nr < BITS_IN_EWORD)
- last = offset + reuse_packfile->bitmap_nr;
- else
- last = BITS_IN_EWORD;
-
- for (; offset < last; offset++) {
- if (word >> offset == 0)
- return word_pos;
- if (!bitmap_get(reuse_packfile_bitmap,
- word_pos * BITS_IN_EWORD + offset))
- return word_pos;
- }
-
- pos += BITS_IN_EWORD - (pos % BITS_IN_EWORD);
- }
-
- /*
- * Now we're going to copy as many whole eword_t's as possible.
- * "end" is the index of the last whole eword_t we copy, but
- * there may be additional bits to process. Those are handled
- * individually by write_reused_pack().
- *
- * Begin by advancing to the first word boundary in range of the
- * bit positions occupied by objects in "reuse_packfile". Then
- * pick the last word boundary in the same range. If we have at
- * least one word's worth of bits to process, continue on.
- */
- end = reuse_packfile->bitmap_pos + reuse_packfile->bitmap_nr;
- if (end % BITS_IN_EWORD)
- end -= end % BITS_IN_EWORD;
- if (pos >= end)
+ if (reuse_packfile->bitmap_pos) {
+ /*
+ * We can't reuse whole chunks verbatim out of
+ * non-preferred packs since we can't guarantee that
+ * all duplicate objects were resolved in favor of
+ * that pack.
+ *
+ * Even if we have a whole eword_t worth of bits that
+ * could be reused, there may be objects between the
+ * objects corresponding to the first and last bit of
+ * that word which were selected from a different
+ * pack, causing us to send duplicate or unwanted
+ * objects.
+ *
+ * Handle non-preferred packs from within
+ * write_reused_pack(), which inspects and reuses
+ * individual bits.
+ */
return reuse_packfile->bitmap_pos / BITS_IN_EWORD;
+ }
- while (pos < end &&
- reuse_packfile_bitmap->words[pos / BITS_IN_EWORD] == (eword_t)~0)
- pos += BITS_IN_EWORD;
+ /*
+ * Only read through the last word whose bits all correspond
+ * to objects in the given packfile, since we must stop at a
+ * word boundary.
+ *
+ * If there is no whole word to read (i.e. the packfile
+ * contains fewer than BITS_IN_EWORD objects), then we'll
+ * inspect bits one-by-one in write_reused_pack().
+ */
+ end = reuse_packfile->bitmap_nr / BITS_IN_EWORD;
+ if (reuse_packfile_bitmap->word_alloc < end)
+ BUG("fewer words than expected in reuse_packfile_bitmap");
- if (pos > end)
- pos = end;
+ while (pos < end && reuse_packfile_bitmap->words[pos] == (eword_t)~0)
+ pos++;
- if (reuse_packfile->bitmap_pos < pos) {
- off_t pack_start_off = pack_pos_to_offset(reuse_packfile->p, 0);
- off_t pack_end_off = pack_pos_to_offset(reuse_packfile->p,
- pos - reuse_packfile->bitmap_pos);
+ if (pos) {
+ off_t to_write;
- written += pos - reuse_packfile->bitmap_pos;
+ written = (pos * BITS_IN_EWORD);
+ to_write = pack_pos_to_offset(reuse_packfile->p, written)
+ - sizeof(struct pack_header);
/* We're recording one chunk, not one object. */
- record_reused_object(pack_start_off,
- pack_start_off - (hashfile_total(out) - pack_start));
+ record_reused_object(sizeof(struct pack_header), 0);
hashflush(out);
copy_pack_data(out, reuse_packfile->p, w_curs,
- pack_start_off, pack_end_off - pack_start_off);
+ sizeof(struct pack_header), to_write);
display_progress(progress_state, written);
}
- if (pos % BITS_IN_EWORD)
- BUG("attempted to jump past a word boundary to %"PRIuMAX,
- (uintmax_t)pos);
- return pos / BITS_IN_EWORD;
+ return pos;
}
static void write_reused_pack(struct bitmapped_pack *reuse_packfile,
@@ -1183,8 +1169,7 @@ static void write_reused_pack(struct bitmapped_pack *reuse_packfile,
struct pack_window *w_curs = NULL;
if (allow_ofs_delta)
- i = write_reused_pack_verbatim(reuse_packfile, f, pack_start,
- &w_curs);
+ i = write_reused_pack_verbatim(reuse_packfile, f, &w_curs);
for (; i < reuse_packfile_bitmap->word_alloc; ++i) {
eword_t word = reuse_packfile_bitmap->words[i];
diff --git a/t/t5332-multi-pack-reuse.sh b/t/t5332-multi-pack-reuse.sh
index d87ea0ae19b..e9d80186ec6 100755
--- a/t/t5332-multi-pack-reuse.sh
+++ b/t/t5332-multi-pack-reuse.sh
@@ -259,7 +259,7 @@ test_expect_success 'duplicate objects' '
)
'
-test_expect_failure 'duplicate objects with verbatim reuse' '
+test_expect_success 'duplicate objects with verbatim reuse' '
git init duplicate-objects-verbatim &&
(
cd duplicate-objects-verbatim &&
--
2.46.0.421.g159f2d50e75
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] pack-objects: only perform verbatim reuse on the preferred pack
2024-11-14 13:40 ` Taylor Blau
@ 2024-11-15 9:57 ` Jeff King
2024-11-22 9:16 ` Kristoffer Haugsbakk
1 sibling, 0 replies; 14+ messages in thread
From: Jeff King @ 2024-11-15 9:57 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, Junio C Hamano, Elijah Newren
On Thu, Nov 14, 2024 at 08:40:13AM -0500, Taylor Blau wrote:
> > I do think the explanation in the message of the first commit would be a
> > lot simpler if it were simply combined into this patch. With them split
> > you effectively have to explain the problem twice. I don't feel that
> > strongly about changing it, though.
>
> I always seem to go back and forth on that. I feel somewhat strongly
> that for complicated regression fixes that we should demonstrate the
> existing failure mode in a separate commit with a test_expect_failure.
> That forces the author to ensure they really understand the bug and can
> produce a minimal (or close to it) reproduction.
>
> It also makes it easier to demonstrate that the fix actually does what
> it says, instead of assuming that the test fails without the fix applied
> (and passes with it applied).
>
> That does force the author to potentially explain the bug twice. In my
> experience, I tend to keep the explanation in the first patch relatively
> brief, hinting at details that will appear in the subsequent patch
> instead of explaining them in full detail.
>
> So I dunno. It's a tradeoff for sure, but I think having an explicit
> point in the log that demonstrates the existing bug is valuable.
I do think it's important to make sure your tests fail in the way you
expect. I can't count the number of times I have _thought_ I had a
solution and then after seeing that the test didn't fail without my
patch, had to go back and study it more. :)
But I generally do it with:
# build the old version
git checkout HEAD^
make
# now go back to the new and test _without_ building
git checkout -
cd t && ./t1234-whatever.sh
(or more likely with a "break" in "git rebase -i" for a longer series).
That is, I'm running a mis-matched test/build combo. I guess in one
sense that is horrible and I'm a bad person. It is easy to make a
mistake and test the wrong thing. But it does make the resulting patches
easier to read/review, IMHO.
As an aside, I think this kind of mismatch gets awkward with the
proposed meson out-of-tree build stuff. Running the tests through
meson would presumably trigger a build. And running without means
pointing to the built git binaries manually.
-Peff
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v2 2/2] pack-objects: only perform verbatim reuse on the preferred pack
2024-11-14 13:42 ` [PATCH v2 2/2] pack-objects: only perform verbatim reuse on the preferred pack Taylor Blau
@ 2024-11-22 4:44 ` Junio C Hamano
2024-11-22 8:33 ` Jeff King
0 siblings, 1 reply; 14+ messages in thread
From: Junio C Hamano @ 2024-11-22 4:44 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, Jeff King, Elijah Newren
Taylor Blau <me@ttaylorr.com> writes:
> This regression comes from ca0fd69e37 (pack-objects: prepare
> `write_reused_pack_verbatim()` for multi-pack reuse, 2023-12-14), which
> incorrectly assumed that there would be no gaps in reusable regions of
> non-preferred packs.
I was reviewing the past release notes from this year, and saw this
snippet in the 2.44 release notes:
* Streaming spans of packfile data used to be done only from a
single, primary, pack in a repository with multiple packfiles. It
has been extended to allow reuse from other packfiles, too.
Essentially this two-patch series the change completely so we only
reuse pack data from the primary pack?
Thanks.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v2 2/2] pack-objects: only perform verbatim reuse on the preferred pack
2024-11-22 4:44 ` Junio C Hamano
@ 2024-11-22 8:33 ` Jeff King
0 siblings, 0 replies; 14+ messages in thread
From: Jeff King @ 2024-11-22 8:33 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Taylor Blau, git, Elijah Newren
On Fri, Nov 22, 2024 at 01:44:37PM +0900, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > This regression comes from ca0fd69e37 (pack-objects: prepare
> > `write_reused_pack_verbatim()` for multi-pack reuse, 2023-12-14), which
> > incorrectly assumed that there would be no gaps in reusable regions of
> > non-preferred packs.
>
> I was reviewing the past release notes from this year, and saw this
> snippet in the 2.44 release notes:
>
> * Streaming spans of packfile data used to be done only from a
> single, primary, pack in a repository with multiple packfiles. It
> has been extended to allow reuse from other packfiles, too.
>
> Essentially this two-patch series the change completely so we only
> reuse pack data from the primary pack?
No. There are really three levels in increasing order of cost:
1. we directly copy out the bytes from the start of a pack completely,
up until the first object that is not needed to be sent
2. from there we process objects one by one, with a few checks and
adjustments to make sure they can be copied verbatim (like ofs
delta adjustment to account for earlier missing objects)
3. we add the objects to the big object_list struct, they become
candidates for delta compression, etc
We call the first two generically "pack reuse". And this two-patch
series is fixing a case where we were doing (1), but needed to do (2).
So I think the changelog entry still applies (there are other cases
where we might jump from 1 to 2 as well, like if the output needs to
convert to REF_DELTA).
-Peff
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] pack-objects: only perform verbatim reuse on the preferred pack
2024-11-14 13:40 ` Taylor Blau
2024-11-15 9:57 ` Jeff King
@ 2024-11-22 9:16 ` Kristoffer Haugsbakk
1 sibling, 0 replies; 14+ messages in thread
From: Kristoffer Haugsbakk @ 2024-11-22 9:16 UTC (permalink / raw)
To: Taylor Blau, Jeff King; +Cc: git, Junio C Hamano, Elijah Newren
On Thu, Nov 14, 2024, at 14:40, Taylor Blau wrote:
> On Wed, Nov 13, 2024 at 07:25:04PM -0500, Jeff King wrote:
>> I do think the explanation in the message of the first commit would be a
>> lot simpler if it were simply combined into this patch. With them split
>> you effectively have to explain the problem twice. I don't feel that
>> strongly about changing it, though.
>
> I always seem to go back and forth on that. I feel somewhat strongly
> that for complicated regression fixes that we should demonstrate the
> existing failure mode in a separate commit with a test_expect_failure.
> That forces the author to ensure they really understand the bug and can
> produce a minimal (or close to it) reproduction.
>
> It also makes it easier to demonstrate that the fix actually does what
> it says, instead of assuming that the test fails without the fix applied
> (and passes with it applied).
I recently made a parallel branch for a topic where the parallel branch had
`test_expect_failure` for each commit (i.e. the commits had only `t/` changes).
That ended up catching a bug I introduced when I tried to simplify the test: the
test was OK on my topic branch but didn’t fail (`test_expect_failure`) on the
parallel branch.
I use a worktree for `master` so at least I didn’t have to build specifically
for that branch.
--
Kristoffer Haugsbakk
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2024-11-22 9:17 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-13 17:32 [PATCH 0/2] pack-objects: more brown-paper-bag multi-pack reuse fixes Taylor Blau
2024-11-13 17:32 ` [PATCH 1/2] t5332-multi-pack-reuse.sh: demonstrate duplicate packing failure Taylor Blau
2024-11-14 1:12 ` Junio C Hamano
2024-11-14 13:37 ` Taylor Blau
2024-11-13 17:32 ` [PATCH 2/2] pack-objects: only perform verbatim reuse on the preferred pack Taylor Blau
2024-11-14 0:25 ` Jeff King
2024-11-14 13:40 ` Taylor Blau
2024-11-15 9:57 ` Jeff King
2024-11-22 9:16 ` Kristoffer Haugsbakk
2024-11-14 13:42 ` [PATCH v2 0/2] pack-objects: more brown-paper-bag multi-pack reuse fixes Taylor Blau
2024-11-14 13:42 ` [PATCH v2 1/2] t5332-multi-pack-reuse.sh: demonstrate duplicate packing failure Taylor Blau
2024-11-14 13:42 ` [PATCH v2 2/2] pack-objects: only perform verbatim reuse on the preferred pack Taylor Blau
2024-11-22 4:44 ` Junio C Hamano
2024-11-22 8:33 ` Jeff King
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).