Git development
 help / color / mirror / Atom feed
* Re: [PATCH v2 00/26] pack-objects: multi-pack verbatim reuse
From: Taylor Blau @ 2023-12-15  1:25 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King, Patrick Steinhardt
In-Reply-To: <xmqq7clgnvqv.fsf@gitster.g>

On Thu, Dec 14, 2023 at 04:40:40PM -0800, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>
> > I haven't looked into the details yet, but it seems that
> > t5309-pack-delta-cycles.sh fails under
> >
> >     $ SANITIZE=leak GIT_TEST_PASSING_SANITIZE_LEAK=true make -j16 test
>
> Hmph, this seems to be elusive.  I tried it again and then it did
> not fail this time.

Indeed, but I was able to reproduce the failure both on my branch and on
'master' under --stress, yielding the following failure in t5309.6:

    + git index-pack --fix-thin --stdin
    fatal: REF_DELTA at offset 46 already resolved (duplicate base 01d7713666f4de822776c7622c10f1b07de280dc?)

    =================================================================
    ==3904583==ERROR: LeakSanitizer: detected memory leaks

    Direct leak of 32 byte(s) in 1 object(s) allocated from:
        #0 0x7fa790d01986 in __interceptor_realloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:98
        #1 0x7fa790add769 in __pthread_getattr_np nptl/pthread_getattr_np.c:180
        #2 0x7fa790d117c5 in __sanitizer::GetThreadStackTopAndBottom(bool, unsigned long*, unsigned long*) ../../../../src/libsanitizer/sanitizer_common/sanitizer_linux_libcdep.cpp:150
        #3 0x7fa790d11957 in __sanitizer::GetThreadStackAndTls(bool, unsigned long*, unsigned long*, unsigned long*, unsigned long*) ../../../../src/libsanitizer/sanitizer_common/sanitizer_linux_libcdep.cpp:598
        #4 0x7fa790d03fe8 in __lsan::ThreadStart(unsigned int, unsigned long long, __sanitizer::ThreadType) ../../../../src/libsanitizer/lsan/lsan_posix.cpp:51
        #5 0x7fa790d013fd in __lsan_thread_start_func ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:440
        #6 0x7fa790adc3eb in start_thread nptl/pthread_create.c:444
        #7 0x7fa790b5ca5b in clone3 ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

    SUMMARY: LeakSanitizer: 32 byte(s) leaked in 1 allocation(s).
    Aborted

The duplicate base thing is a red-herring, and is an expected result of
the test. But the leak is definitely not, and I'm not sure what's going
on here since the frames listed above are in the LSan runtime, not Git.

I'll try to dig into this a bit more, but I'm not quite sure what's
going on yet.

Thanks,
Taylor

^ permalink raw reply

* Re: [PATCH v2 00/26] pack-objects: multi-pack verbatim reuse
From: Junio C Hamano @ 2023-12-15  0:40 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Patrick Steinhardt
In-Reply-To: <xmqqbkasnxba.fsf@gitster.g>

Junio C Hamano <gitster@pobox.com> writes:

> I haven't looked into the details yet, but it seems that
> t5309-pack-delta-cycles.sh fails under
>
>     $ SANITIZE=leak GIT_TEST_PASSING_SANITIZE_LEAK=true make -j16 test

Hmph, this seems to be elusive.  I tried it again and then it did
not fail this time.

^ permalink raw reply

* Re: [PATCH v2 00/26] pack-objects: multi-pack verbatim reuse
From: Taylor Blau @ 2023-12-15  0:38 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King, Patrick Steinhardt
In-Reply-To: <xmqqbkasnxba.fsf@gitster.g>

On Thu, Dec 14, 2023 at 04:06:49PM -0800, Junio C Hamano wrote:
> I haven't looked into the details yet, but it seems that
> t5309-pack-delta-cycles.sh fails under
>
>     $ SANITIZE=leak GIT_TEST_PASSING_SANITIZE_LEAK=true make -j16 test

Hrm. I tried to reproduce this, but I'm not seeing it. I have:

$ make SANITIZE=leak GIT_TEST_PASSING_SANITIZE_LEAK=true test
[ ... ]
All tests successful.
Files=1001, Tests=14558, 48 wallclock secs ( 8.34 usr  2.10 sys + 391.60 cusr 319.67 csys = 721.71 CPU)
Result: PASS

With this series applied on top of 1a87c842ec (Start the 2.44 cycle,
2023-12-09). The tree I get at the end is d148e16f5cfba405a9823cb68540a8c83004f98f.

Did we apply onto different bases?

Thanks,
Taylor

^ permalink raw reply

* Re: [PATCH v2 00/26] pack-objects: multi-pack verbatim reuse
From: Junio C Hamano @ 2023-12-15  0:06 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Patrick Steinhardt
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>

I haven't looked into the details yet, but it seems that
t5309-pack-delta-cycles.sh fails under

    $ SANITIZE=leak GIT_TEST_PASSING_SANITIZE_LEAK=true make -j16 test


^ permalink raw reply

* [PATCH v2 26/26] t/perf: add performance tests for multi-pack reuse
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>

To ensure that we don't regress either the size or runtime performance
of multi-pack reuse, add a performance test to measure both of these.

The test partitions the objects in GIT_TEST_PERF_LARGE_REPO into 1, 10,
and 100 packs, and then tries to perform a "clone" at each stage with
both single- and multi-pack reuse enabled.

Note that the `repack_into_n_chunks()` function in this new test script
differs from the existing `repack_into_n()`. The former partitions the
repository into N equal-sized chunks, while the latter produces N packs
of five commits each (plus their objects), and then another pack with
the remainder.

On git.git, I can produce the following results on my machine:

    Test                                                            this tree
    --------------------------------------------------------------------------------
    5332.3: clone for 1-pack scenario (single-pack reuse)           1.57(2.99+0.15)
    5332.4: clone size for 1-pack scenario (single-pack reuse)               231.8M
    5332.5: clone for 1-pack scenario (multi-pack reuse)            1.79(2.96+0.21)
    5332.6: clone size for 1-pack scenario (multi-pack reuse)                231.7M
    5332.9: clone for 10-pack scenario (single-pack reuse)          3.89(16.75+0.35)
    5332.10: clone size for 10-pack scenario (single-pack reuse)             209.9M
    5332.11: clone for 10-pack scenario (multi-pack reuse)          1.56(2.99+0.17)
    5332.12: clone size for 10-pack scenario (multi-pack reuse)              224.4M
    5332.15: clone for 100-pack scenario (single-pack reuse)        8.24(54.31+0.59)
    5332.16: clone size for 100-pack scenario (single-pack reuse)            278.3M
    5332.17: clone for 100-pack scenario (multi-pack reuse)         2.13(2.44+0.33)
    5332.18: clone size for 100-pack scenario (multi-pack reuse)             357.9M

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/perf/p5332-multi-pack-reuse.sh | 81 ++++++++++++++++++++++++++++++++
 1 file changed, 81 insertions(+)
 create mode 100755 t/perf/p5332-multi-pack-reuse.sh

diff --git a/t/perf/p5332-multi-pack-reuse.sh b/t/perf/p5332-multi-pack-reuse.sh
new file mode 100755
index 0000000000..5c6c575d62
--- /dev/null
+++ b/t/perf/p5332-multi-pack-reuse.sh
@@ -0,0 +1,81 @@
+#!/bin/sh
+
+test_description='tests pack performance with multi-pack reuse'
+
+. ./perf-lib.sh
+. "${TEST_DIRECTORY}/perf/lib-pack.sh"
+
+packdir=.git/objects/pack
+
+test_perf_large_repo
+
+find_pack () {
+	for idx in $packdir/pack-*.idx
+	do
+		if git show-index <$idx | grep -q "$1"
+		then
+			basename $idx
+		fi || return 1
+	done
+}
+
+repack_into_n_chunks () {
+	git repack -adk &&
+
+	test "$1" -eq 1 && return ||
+
+	find $packdir -type f | sort >packs.before &&
+
+	# partition the repository into $1 chunks of consecutive commits, and
+	# then create $1 packs with the objects reachable from each chunk
+	# (excluding any objects reachable from the previous chunks)
+	sz="$(($(git rev-list --count --all) / $1))"
+	for rev in $(git rev-list --all | awk "NR % $sz == 0" | tac)
+	do
+		pack="$(echo "$rev" | git pack-objects --revs \
+			--honor-pack-keep --delta-base-offset $packdir/pack)" &&
+		touch $packdir/pack-$pack.keep || return 1
+	done
+
+	# grab any remaining objects not packed by the previous step(s)
+	git pack-objects --revs --all --honor-pack-keep --delta-base-offset \
+		$packdir/pack &&
+
+	find $packdir -type f | sort >packs.after &&
+
+	# and install the whole thing
+	for f in $(comm -12 packs.before packs.after)
+	do
+		rm -f "$f" || return 1
+	done
+	rm -fr $packdir/*.keep
+}
+
+for nr_packs in 1 10 100
+do
+	test_expect_success "create $nr_packs-pack scenario" '
+		repack_into_n_chunks $nr_packs
+	'
+
+	test_expect_success "setup bitmaps for $nr_packs-pack scenario" '
+		find $packdir -type f -name "*.idx" | sed -e "s/.*\/\(.*\)$/+\1/g" |
+		git multi-pack-index write --stdin-packs --bitmap \
+			--preferred-pack="$(find_pack $(git rev-parse HEAD))"
+	'
+
+	for reuse in single multi
+	do
+		test_perf "clone for $nr_packs-pack scenario ($reuse-pack reuse)" "
+			git for-each-ref --format='%(objectname)' refs/heads refs/tags >in &&
+			git -c pack.allowPackReuse=$reuse pack-objects \
+				--revs --delta-base-offset --use-bitmap-index \
+				--stdout <in >result
+		"
+
+		test_size "clone size for $nr_packs-pack scenario ($reuse-pack reuse)" '
+			wc -c <result
+		'
+	done
+done
+
+test_done
-- 
2.43.0.102.ga31d690331.dirty

^ permalink raw reply related

* [PATCH v2 25/26] pack-bitmap: enable reuse from all bitmapped packs
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>

Now that both the pack-bitmap and pack-objects code are prepared to
handle marking and using objects from multiple bitmapped packs for
verbatim reuse, allow marking objects from all bitmapped packs as
eligible for reuse.

Within the `reuse_partial_packfile_from_bitmap()` function, we no longer
only mark the pack whose first object is at bit position zero for reuse,
and instead mark any pack contained in the MIDX as a reuse candidate.

Provide a handful of test cases in a new script (t5332) exercising
interesting behavior for multi-pack reuse to ensure that we performed
all of the previous steps correctly.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/pack.txt |  16 ++-
 builtin/pack-objects.c        |   6 +-
 pack-bitmap.c                 |  34 ++++--
 pack-bitmap.h                 |   3 +-
 t/t5332-multi-pack-reuse.sh   | 203 ++++++++++++++++++++++++++++++++++
 5 files changed, 245 insertions(+), 17 deletions(-)
 create mode 100755 t/t5332-multi-pack-reuse.sh

diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index fe100d0fb7..9c630863e6 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -28,11 +28,17 @@ all existing objects. You can force recompression by passing the -F option
 to linkgit:git-repack[1].
 
 pack.allowPackReuse::
-	When true or "single", and when reachability bitmaps are enabled,
-	pack-objects will try to send parts of the bitmapped packfile
-	verbatim. This can reduce memory and CPU usage to serve fetches,
-	but might result in sending a slightly larger pack. Defaults to
-	true.
+	When true or "single", and when reachability bitmaps are
+	enabled, pack-objects will try to send parts of the bitmapped
+	packfile verbatim. When "multi", and when a multi-pack
+	reachability bitmap is available, pack-objects will try to send
+	parts of all packs in the MIDX.
++
+	If only a single pack bitmap is available, and
+	`pack.allowPackReuse` is set to "multi", reuse parts of just the
+	bitmapped packfile. This can reduce memory and CPU usage to
+	serve fetches, but might result in sending a slightly larger
+	pack. Defaults to true.
 
 pack.island::
 	An extended regular expression configuring a set of delta
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 684698f679..5d3c42035b 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -232,6 +232,7 @@ static int use_bitmap_index = -1;
 static enum {
 	NO_PACK_REUSE = 0,
 	SINGLE_PACK_REUSE,
+	MULTI_PACK_REUSE,
 } allow_pack_reuse = SINGLE_PACK_REUSE;
 static enum {
 	WRITE_BITMAP_FALSE = 0,
@@ -3251,6 +3252,8 @@ static int git_pack_config(const char *k, const char *v,
 		if (res < 0) {
 			if (!strcasecmp(v, "single"))
 				allow_pack_reuse = SINGLE_PACK_REUSE;
+			else if (!strcasecmp(v, "multi"))
+				allow_pack_reuse = MULTI_PACK_REUSE;
 			else
 				die(_("invalid pack.allowPackReuse value: '%s'"), v);
 		} else if (res) {
@@ -4029,7 +4032,8 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
 		reuse_partial_packfile_from_bitmap(bitmap_git,
 						   &reuse_packfiles,
 						   &reuse_packfiles_nr,
-						   &reuse_packfile_bitmap);
+						   &reuse_packfile_bitmap,
+						   allow_pack_reuse == MULTI_PACK_REUSE);
 
 	if (reuse_packfiles) {
 		reuse_packfile_objects = bitmap_popcount(reuse_packfile_bitmap);
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 242a5908f7..229a11fb00 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -2040,7 +2040,8 @@ static int bitmapped_pack_cmp(const void *va, const void *vb)
 void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 					struct bitmapped_pack **packs_out,
 					size_t *packs_nr_out,
-					struct bitmap **reuse_out)
+					struct bitmap **reuse_out,
+					int multi_pack_reuse)
 {
 	struct repository *r = the_repository;
 	struct bitmapped_pack *packs = NULL;
@@ -2064,15 +2065,30 @@ void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 				free(packs);
 				return;
 			}
+
 			if (!pack.bitmap_nr)
-				continue; /* no objects from this pack */
-			if (pack.bitmap_pos)
-				continue; /* not preferred pack */
+				continue;
+
+			if (!multi_pack_reuse && pack.bitmap_pos) {
+				/*
+				 * If we're only reusing a single pack, skip
+				 * over any packs which are not positioned at
+				 * the beginning of the MIDX bitmap.
+				 *
+				 * This is consistent with the existing
+				 * single-pack reuse behavior, which only reuses
+				 * parts of the MIDX's preferred pack.
+				 */
+				continue;
+			}
 
 			ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
 			memcpy(&packs[packs_nr++], &pack, sizeof(pack));
 
 			objects_nr += pack.p->num_objects;
+
+			if (!multi_pack_reuse)
+				break;
 		}
 
 		QSORT(packs, packs_nr, bitmapped_pack_cmp);
@@ -2080,10 +2096,10 @@ void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 		ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
 
 		packs[packs_nr].p = bitmap_git->pack;
-		packs[packs_nr].bitmap_pos = 0;
 		packs[packs_nr].bitmap_nr = bitmap_git->pack->num_objects;
+		packs[packs_nr].bitmap_pos = 0;
 
-		objects_nr = packs[packs_nr++].p->num_objects;
+		objects_nr = packs[packs_nr++].bitmap_nr;
 	}
 
 	word_alloc = objects_nr / BITS_IN_EWORD;
@@ -2091,10 +2107,8 @@ void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 		word_alloc++;
 	reuse = bitmap_word_alloc(word_alloc);
 
-	if (packs_nr != 1)
-		BUG("pack reuse not yet implemented for multiple packs");
-
-	reuse_partial_packfile_from_bitmap_1(bitmap_git, packs, reuse);
+	for (i = 0; i < packs_nr; i++)
+		reuse_partial_packfile_from_bitmap_1(bitmap_git, &packs[i], reuse);
 
 	if (bitmap_is_empty(reuse)) {
 		free(packs);
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 179b343912..c7dea13217 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -80,7 +80,8 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 					struct bitmapped_pack **packs_out,
 					size_t *packs_nr_out,
-					struct bitmap **reuse_out);
+					struct bitmap **reuse_out,
+					int multi_pack_reuse);
 int rebuild_existing_bitmaps(struct bitmap_index *, struct packing_data *mapping,
 			     kh_oid_map_t *reused_bitmaps, int show_progress);
 void free_bitmap_index(struct bitmap_index *);
diff --git a/t/t5332-multi-pack-reuse.sh b/t/t5332-multi-pack-reuse.sh
new file mode 100755
index 0000000000..2ba788b042
--- /dev/null
+++ b/t/t5332-multi-pack-reuse.sh
@@ -0,0 +1,203 @@
+#!/bin/sh
+
+test_description='pack-objects multi-pack reuse'
+
+. ./test-lib.sh
+. "$TEST_DIRECTORY"/lib-bitmap.sh
+
+objdir=.git/objects
+packdir=$objdir/pack
+
+test_pack_reused () {
+	test_trace2_data pack-objects pack-reused "$1"
+}
+
+test_packs_reused () {
+	test_trace2_data pack-objects packs-reused "$1"
+}
+
+
+# pack_position <object> </path/to/pack.idx
+pack_position () {
+	git show-index >objects &&
+	grep "$1" objects | cut -d" " -f1
+}
+
+test_expect_success 'preferred pack is reused for single-pack reuse' '
+	test_config pack.allowPackReuse single &&
+
+	for i in A B
+	do
+		test_commit "$i" &&
+		git repack -d || return 1
+	done &&
+
+	git multi-pack-index write --bitmap &&
+
+	: >trace2.txt &&
+	GIT_TRACE2_EVENT="$PWD/trace2.txt" \
+		git pack-objects --stdout --revs --all >/dev/null &&
+
+	test_pack_reused 3 <trace2.txt &&
+	test_packs_reused 1 <trace2.txt
+'
+
+test_expect_success 'enable multi-pack reuse' '
+	git config pack.allowPackReuse multi
+'
+
+test_expect_success 'reuse all objects from subset of bitmapped packs' '
+	test_commit C &&
+	git repack -d &&
+
+	git multi-pack-index write --bitmap &&
+
+	cat >in <<-EOF &&
+	$(git rev-parse C)
+	^$(git rev-parse A)
+	EOF
+
+	: >trace2.txt &&
+	GIT_TRACE2_EVENT="$PWD/trace2.txt" \
+		git pack-objects --stdout --revs <in >/dev/null &&
+
+	test_pack_reused 6 <trace2.txt &&
+	test_packs_reused 2 <trace2.txt
+'
+
+test_expect_success 'reuse all objects from all packs' '
+	: >trace2.txt &&
+	GIT_TRACE2_EVENT="$PWD/trace2.txt" \
+		git pack-objects --stdout --revs --all >/dev/null &&
+
+	test_pack_reused 9 <trace2.txt &&
+	test_packs_reused 3 <trace2.txt
+'
+
+test_expect_success 'reuse objects from first pack with middle gap' '
+	for i in D E F
+	do
+		test_commit "$i" || return 1
+	done &&
+
+	# Set "pack.window" to zero to ensure that we do not create any
+	# deltas, which could alter the amount of pack reuse we perform
+	# (if, for e.g., we are not sending one or more bases).
+	D="$(git -c pack.window=0 pack-objects --all --unpacked $packdir/pack)" &&
+
+	d_pos="$(pack_position $(git rev-parse D) <$packdir/pack-$D.idx)" &&
+	e_pos="$(pack_position $(git rev-parse E) <$packdir/pack-$D.idx)" &&
+	f_pos="$(pack_position $(git rev-parse F) <$packdir/pack-$D.idx)" &&
+
+	# commits F, E, and D, should appear in that order at the
+	# beginning of the pack
+	test $f_pos -lt $e_pos &&
+	test $e_pos -lt $d_pos &&
+
+	# Ensure that the pack we are constructing sorts ahead of any
+	# other packs in lexical/bitmap order by choosing it as the
+	# preferred pack.
+	git multi-pack-index write --bitmap --preferred-pack="pack-$D.idx" &&
+
+	cat >in <<-EOF &&
+	$(git rev-parse E)
+	^$(git rev-parse D)
+	EOF
+
+	: >trace2.txt &&
+	GIT_TRACE2_EVENT="$PWD/trace2.txt" \
+		git pack-objects --stdout --delta-base-offset --revs <in >/dev/null &&
+
+	test_pack_reused 3 <trace2.txt &&
+	test_packs_reused 1 <trace2.txt
+'
+
+test_expect_success 'reuse objects from middle pack with middle gap' '
+	rm -fr $packdir/multi-pack-index* &&
+
+	# Ensure that the pack we are constructing sort into any
+	# position *but* the first one, by choosing a different pack as
+	# the preferred one.
+	git multi-pack-index write --bitmap --preferred-pack="pack-$A.idx" &&
+
+	cat >in <<-EOF &&
+	$(git rev-parse E)
+	^$(git rev-parse D)
+	EOF
+
+	: >trace2.txt &&
+	GIT_TRACE2_EVENT="$PWD/trace2.txt" \
+		git pack-objects --stdout --delta-base-offset --revs <in >/dev/null &&
+
+	test_pack_reused 3 <trace2.txt &&
+	test_packs_reused 1 <trace2.txt
+'
+
+test_expect_success 'omit delta with uninteresting base (same pack)' '
+	git repack -adk &&
+
+	test_seq 32 >f &&
+	git add f &&
+	test_tick &&
+	git commit -m "delta" &&
+	delta="$(git rev-parse HEAD)" &&
+
+	test_seq 64 >f &&
+	test_tick &&
+	git commit -a -m "base" &&
+	base="$(git rev-parse HEAD)" &&
+
+	test_commit other &&
+
+	git repack -d &&
+
+	have_delta "$(git rev-parse $delta:f)" "$(git rev-parse $base:f)" &&
+
+	git multi-pack-index write --bitmap &&
+
+	cat >in <<-EOF &&
+	$(git rev-parse other)
+	^$base
+	EOF
+
+	: >trace2.txt &&
+	GIT_TRACE2_EVENT="$PWD/trace2.txt" \
+		git pack-objects --stdout --delta-base-offset --revs <in >/dev/null &&
+
+	# We can only reuse the 3 objects corresponding to "other" from
+	# the latest pack.
+	#
+	# This is because even though we want "delta", we do not want
+	# "base", meaning that we have to inflate the delta/base-pair
+	# corresponding to the blob in commit "delta", which bypasses
+	# the pack-reuse mechanism.
+	#
+	# The remaining objects from the other pack are similarly not
+	# reused because their objects are on the uninteresting side of
+	# the query.
+	test_pack_reused 3 <trace2.txt &&
+	test_packs_reused 1 <trace2.txt
+'
+
+test_expect_success 'omit delta from uninteresting base (cross pack)' '
+	cat >in <<-EOF &&
+	$(git rev-parse $base)
+	^$(git rev-parse $delta)
+	EOF
+
+	P="$(git pack-objects --revs $packdir/pack <in)" &&
+
+	git multi-pack-index write --bitmap --preferred-pack="pack-$P.idx" &&
+
+	: >trace2.txt &&
+	GIT_TRACE2_EVENT="$PWD/trace2.txt" \
+		git pack-objects --stdout --delta-base-offset --all >/dev/null &&
+
+	packs_nr="$(find $packdir -type f -name "pack-*.pack" | wc -l)" &&
+	objects_nr="$(git rev-list --count --all --objects)" &&
+
+	test_pack_reused $(($objects_nr - 1)) <trace2.txt &&
+	test_packs_reused $packs_nr <trace2.txt
+'
+
+test_done
-- 
2.43.0.102.ga31d690331.dirty


^ permalink raw reply related

* [PATCH v2 24/26] pack-objects: allow setting `pack.allowPackReuse` to "single"
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>

In e704fc7978 (pack-objects: introduce pack.allowPackReuse, 2019-12-18),
the `pack.allowPackReuse` configuration option was introduced, allowing
users to disable the pack reuse mechanism.

To prepare for debugging multi-pack reuse, allow setting configuration
to "single" in addition to the usual bool-or-int values.

"single" implies the same behavior as "true", "1", "yes", and so on. But
it will complement a new "multi" value (to be introduced in a future
commit). When set to "single", we will only perform pack reuse on a
single pack, regardless of whether or not there are multiple MIDX'd
packs.

This requires no code changes (yet), since we only support single pack
reuse.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/pack.txt |  2 +-
 builtin/pack-objects.c        | 19 ++++++++++++++++---
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index f50df9dbce..fe100d0fb7 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -28,7 +28,7 @@ all existing objects. You can force recompression by passing the -F option
 to linkgit:git-repack[1].
 
 pack.allowPackReuse::
-	When true, and when reachability bitmaps are enabled,
+	When true or "single", and when reachability bitmaps are enabled,
 	pack-objects will try to send parts of the bitmapped packfile
 	verbatim. This can reduce memory and CPU usage to serve fetches,
 	but might result in sending a slightly larger pack. Defaults to
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 7aae9f104b..684698f679 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -229,7 +229,10 @@ static struct bitmap *reuse_packfile_bitmap;
 
 static int use_bitmap_index_default = 1;
 static int use_bitmap_index = -1;
-static int allow_pack_reuse = 1;
+static enum {
+	NO_PACK_REUSE = 0,
+	SINGLE_PACK_REUSE,
+} allow_pack_reuse = SINGLE_PACK_REUSE;
 static enum {
 	WRITE_BITMAP_FALSE = 0,
 	WRITE_BITMAP_QUIET,
@@ -3244,7 +3247,17 @@ static int git_pack_config(const char *k, const char *v,
 		return 0;
 	}
 	if (!strcmp(k, "pack.allowpackreuse")) {
-		allow_pack_reuse = git_config_bool(k, v);
+		int res = git_parse_maybe_bool_text(v);
+		if (res < 0) {
+			if (!strcasecmp(v, "single"))
+				allow_pack_reuse = SINGLE_PACK_REUSE;
+			else
+				die(_("invalid pack.allowPackReuse value: '%s'"), v);
+		} else if (res) {
+			allow_pack_reuse = SINGLE_PACK_REUSE;
+		} else {
+			allow_pack_reuse = NO_PACK_REUSE;
+		}
 		return 0;
 	}
 	if (!strcmp(k, "pack.threads")) {
@@ -3999,7 +4012,7 @@ static void loosen_unused_packed_objects(void)
  */
 static int pack_options_allow_reuse(void)
 {
-	return allow_pack_reuse &&
+	return allow_pack_reuse != NO_PACK_REUSE &&
 	       pack_to_stdout &&
 	       !ignore_packed_keep_on_disk &&
 	       !ignore_packed_keep_in_core &&
-- 
2.43.0.102.ga31d690331.dirty


^ permalink raw reply related

* [PATCH v2 23/26] t/test-lib-functions.sh: implement `test_trace2_data` helper
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>

Introduce a helper function which looks for a specific (category, key,
value) tuple in the output of a trace2 event stream.

We will use this function in a future patch to ensure that the expected
number of objects are reused from an expected number of packs.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/test-lib-functions.sh | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index 9c3cf12b26..93fe819b0a 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -1874,6 +1874,20 @@ test_region () {
 	return 0
 }
 
+# Check that the given data fragment was included as part of the
+# trace2-format trace on stdin.
+#
+#	test_trace2_data <category> <key> <value>
+#
+# For example, to look for trace2_data_intmax("pack-objects", repo,
+# "reused", N) in an invocation of "git pack-objects", run:
+#
+#	GIT_TRACE2_EVENT="$(pwd)/trace.txt" git pack-objects ... &&
+#	test_trace2_data pack-objects reused N <trace2.txt
+test_trace2_data () {
+	grep -e '"category":"'"$1"'","key":"'"$2"'","value":"'"$3"'"'
+}
+
 # Given a GIT_TRACE2_EVENT log over stdin, writes to stdout a list of URLs
 # sent to git-remote-https child processes.
 test_remote_https_urls() {
-- 
2.43.0.102.ga31d690331.dirty


^ permalink raw reply related

* [PATCH v2 22/26] pack-objects: add tracing for various packfile metrics
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>

As part of the multi-pack reuse effort, we will want to add some tests
that assert that we reused a certain number of objects from a certain
number of packs.

We could do this by grepping through the stderr output of
`pack-objects`, but doing so would be brittle in case the output format
changed.

Instead, let's use the trace2 mechanism to log various pieces of
information about the generated packfile, which we can then use to
compare against desired values.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 7eb035eb7d..7aae9f104b 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4595,6 +4595,13 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			   reuse_packfile_objects,
 			   (uintmax_t)reuse_packfiles_used_nr);
 
+	trace2_data_intmax("pack-objects", the_repository, "written", written);
+	trace2_data_intmax("pack-objects", the_repository, "written/delta", written_delta);
+	trace2_data_intmax("pack-objects", the_repository, "reused", reused);
+	trace2_data_intmax("pack-objects", the_repository, "reused/delta", reused_delta);
+	trace2_data_intmax("pack-objects", the_repository, "pack-reused", reuse_packfile_objects);
+	trace2_data_intmax("pack-objects", the_repository, "packs-reused", reuse_packfiles_used_nr);
+
 cleanup:
 	clear_packing_data(&to_pack);
 	list_objects_filter_release(&filter_options);
-- 
2.43.0.102.ga31d690331.dirty


^ permalink raw reply related

* [PATCH v2 21/26] pack-bitmap: prepare to mark objects from multiple packs for reuse
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>

Now that the pack-objects code is equipped to handle reusing objects
from multiple packs, prepare the pack-bitmap code to mark objects from
multiple packs as reuse candidates.

In order to prepare the pack-bitmap code for this change, remove the
same set of assumptions we unwound in previous commits from the helper
function `reuse_partial_packfile_from_bitmap_1()`, in preparation for it
to be called in a loop over the set of bitmapped packs in a following
commit.

Most importantly, we can no longer assume that the bit position
corresponding to the first object in a given reuse pack candidate is at
the beginning of the bitmap itself.

For the single pack that this assumption is still true for (in MIDX
bitmaps, this is the preferred pack, in single-pack bitmaps it is the
pack the bitmap is tied to), we can still use our whole-words
optimization.

But for all subsequent packs, we can not make use of this optimization,
since it assumes that all delta bases are being sent from the same pack,
which would break if we are sending OFS_DELTAs down to the client. To
understand why, consider two packs, P1 and P2 where:

  - P1 has object A which is a delta on base B
  - P2 has its own copy of B, in addition to other objects

Suppose that the MIDX which covers P1 and P2 selected its copy of A from
P1, but selected its copy of B from P2. Since A is a delta of B, but the
base was selected from a different pack, sending the bytes corresponding
to A as an OFS_DELTA verbatim from P1 would be incorrect, since we don't
guarantee that B is in the same place relative to A in the generated
pack as in P1.

For now, we detect and reject these cross-pack deltas by searching for
the (pack_id, offset) pair for the delta's base object (using the same
pack_id as the pack containing the delta'd object) in the MIDX. If we
find a match, that means that the MIDX did indeed pick the base object
from the same pack, and we are OK to reuse the delta.

If we don't find a match, however, that means that the base object was
selected from a different pack in the MIDX, and we can let the slower
path handle re-delta'ing our candidate object.

In the future, there are a couple of other things we could do, namely:

  - Turn any cross-pack deltas (which are stored as OFS_DELTAs) into
    REF_DELTAs. We already do this today when reusing an OFS_DELTA
    without `--delta-base-offset` enabled, so it's not a huge stretch to
    do the same for cross-pack deltas even when `--delta-base-offset` is
    enabled.

    This would work, but would obviously result in larger-than-necessary
    packs, as we in theory *could* represent these cross-pack deltas by
    patching an existing OFS_DELTA. But it's not clear how much that
    would matter in practice. I suspect it would have a lot to do with
    how you pack your repository in the first place.

  - Finally, we could patch OFS_DELTAs across packs in a similar fashion
    as we do today for OFS_DELTAs within a single pack on either side of
    a gap. This would result in the smallest packs of the three options
    here, but implementing this would be more involved.

    At minimum, you'd have to keep the reusable chunks list for all
    reused packs, not just the one we're currently processing. And you'd
    have to ensure that any bases which are a part of cross-pack deltas
    appear before the delta. I think this is possible to do, but would
    require assembling the reusable chunks list potentially in a
    different order than they appear in the source packs.

For now, let's pursue the simplest approach and reject any cross-pack
deltas.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 172 +++++++++++++++++++++++++++++++-------------------
 1 file changed, 106 insertions(+), 66 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 1682f99596..242a5908f7 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1841,8 +1841,10 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
  * -1 means "stop trying further objects"; 0 means we may or may not have
  * reused, but you can keep feeding bits.
  */
-static int try_partial_reuse(struct bitmapped_pack *pack,
-			     size_t pos,
+static int try_partial_reuse(struct bitmap_index *bitmap_git,
+			     struct bitmapped_pack *pack,
+			     size_t bitmap_pos,
+			     uint32_t pack_pos,
 			     struct bitmap *reuse,
 			     struct pack_window **w_curs)
 {
@@ -1850,33 +1852,10 @@ static int try_partial_reuse(struct bitmapped_pack *pack,
 	enum object_type type;
 	unsigned long size;
 
-	/*
-	 * try_partial_reuse() is called either on (a) objects in the
-	 * bitmapped pack (in the case of a single-pack bitmap) or (b)
-	 * objects in the preferred pack of a multi-pack bitmap.
-	 * Importantly, the latter can pretend as if only a single pack
-	 * exists because:
-	 *
-	 *   - The first pack->num_objects bits of a MIDX bitmap are
-	 *     reserved for the preferred pack, and
-	 *
-	 *   - Ties due to duplicate objects are always resolved in
-	 *     favor of the preferred pack.
-	 *
-	 * Therefore we do not need to ever ask the MIDX for its copy of
-	 * an object by OID, since it will always select it from the
-	 * preferred pack. Likewise, the selected copy of the base
-	 * object for any deltas will reside in the same pack.
-	 *
-	 * This means that we can reuse pos when looking up the bit in
-	 * the reuse bitmap, too, since bits corresponding to the
-	 * preferred pack precede all bits from other packs.
-	 */
+	if (pack_pos >= pack->p->num_objects)
+		return -1; /* not actually in the pack */
 
-	if (pos >= pack->p->num_objects)
-		return -1; /* not actually in the pack or MIDX preferred pack */
-
-	offset = delta_obj_offset = pack_pos_to_offset(pack->p, pos);
+	offset = delta_obj_offset = pack_pos_to_offset(pack->p, pack_pos);
 	type = unpack_object_header(pack->p, w_curs, &offset, &size);
 	if (type < 0)
 		return -1; /* broken packfile, punt */
@@ -1884,6 +1863,7 @@ static int try_partial_reuse(struct bitmapped_pack *pack,
 	if (type == OBJ_REF_DELTA || type == OBJ_OFS_DELTA) {
 		off_t base_offset;
 		uint32_t base_pos;
+		uint32_t base_bitmap_pos;
 
 		/*
 		 * Find the position of the base object so we can look it up
@@ -1897,20 +1877,44 @@ static int try_partial_reuse(struct bitmapped_pack *pack,
 					     delta_obj_offset);
 		if (!base_offset)
 			return 0;
-		if (offset_to_pack_pos(pack->p, base_offset, &base_pos) < 0)
-			return 0;
 
-		/*
-		 * We assume delta dependencies always point backwards. This
-		 * lets us do a single pass, and is basically always true
-		 * due to the way OFS_DELTAs work. You would not typically
-		 * find REF_DELTA in a bitmapped pack, since we only bitmap
-		 * packs we write fresh, and OFS_DELTA is the default). But
-		 * let's double check to make sure the pack wasn't written with
-		 * odd parameters.
-		 */
-		if (base_pos >= pos)
-			return 0;
+		offset_to_pack_pos(pack->p, base_offset, &base_pos);
+
+		if (bitmap_is_midx(bitmap_git)) {
+			/*
+			 * Cross-pack deltas are rejected for now, but could
+			 * theoretically be supported in the future.
+			 *
+			 * We would need to ensure that we're sending both
+			 * halves of the delta/base pair, regardless of whether
+			 * or not the two cross a pack boundary. If they do,
+			 * then we must convert the delta to an REF_DELTA to
+			 * refer back to the base in the other pack.
+			 * */
+			if (midx_pair_to_pack_pos(bitmap_git->midx,
+						  pack->pack_int_id,
+						  base_offset,
+						  &base_bitmap_pos) < 0) {
+				return 0;
+			}
+		} else {
+			if (offset_to_pack_pos(pack->p, base_offset,
+					       &base_pos) < 0)
+				return 0;
+			/*
+			 * We assume delta dependencies always point backwards.
+			 * This lets us do a single pass, and is basically
+			 * always true due to the way OFS_DELTAs work. You would
+			 * not typically find REF_DELTA in a bitmapped pack,
+			 * since we only bitmap packs we write fresh, and
+			 * OFS_DELTA is the default). But let's double check to
+			 * make sure the pack wasn't written with odd
+			 * parameters.
+			 */
+			if (base_pos >= pack_pos)
+				return 0;
+			base_bitmap_pos = pack->bitmap_pos + base_pos;
+		}
 
 		/*
 		 * And finally, if we're not sending the base as part of our
@@ -1920,14 +1924,14 @@ static int try_partial_reuse(struct bitmapped_pack *pack,
 		 * to REF_DELTA on the fly. Better to just let the normal
 		 * object_entry code path handle it.
 		 */
-		if (!bitmap_get(reuse, pack->bitmap_pos + base_pos))
+		if (!bitmap_get(reuse, base_bitmap_pos))
 			return 0;
 	}
 
 	/*
 	 * If we got here, then the object is OK to reuse. Mark it.
 	 */
-	bitmap_set(reuse, pack->bitmap_pos + pos);
+	bitmap_set(reuse, bitmap_pos);
 	return 0;
 }
 
@@ -1937,36 +1941,72 @@ static void reuse_partial_packfile_from_bitmap_1(struct bitmap_index *bitmap_git
 {
 	struct bitmap *result = bitmap_git->result;
 	struct pack_window *w_curs = NULL;
-	size_t i = 0;
+	size_t pos = pack->bitmap_pos / BITS_IN_EWORD;
 
-	while (i < result->word_alloc && result->words[i] == (eword_t)~0)
-		i++;
+	if (!pack->bitmap_pos) {
+		/*
+		 * If we're processing the first (in the case of a MIDX, the
+		 * preferred pack) or the only (in the case of single-pack
+		 * bitmaps) pack, then we can reuse whole words at a time.
+		 *
+		 * This is because we know that any deltas in this range *must*
+		 * have their bases chosen from the same pack, since:
+		 *
+		 * - In the single pack case, there is no other pack to choose
+		 *   them from.
+		 *
+		 * - In the MIDX case, the first pack is the preferred pack, so
+		 *   all ties are broken in favor of that pack (i.e. the one
+		 *   we're currently processing). So any duplicate bases will be
+		 *   resolved in favor of the pack we're processing.
+		 */
+		while (pos < result->word_alloc &&
+		       pos < pack->bitmap_nr / BITS_IN_EWORD &&
+		       result->words[pos] == (eword_t)~0)
+			pos++;
+		memset(reuse->words, 0xFF, pos * sizeof(eword_t));
+	}
 
-	/*
-	 * Don't mark objects not in the packfile or preferred pack. This bitmap
-	 * marks objects eligible for reuse, but the pack-reuse code only
-	 * understands how to reuse a single pack. Since the preferred pack is
-	 * guaranteed to have all bases for its deltas (in a multi-pack bitmap),
-	 * we use it instead of another pack. In single-pack bitmaps, the choice
-	 * is made for us.
-	 */
-	if (i > pack->p->num_objects / BITS_IN_EWORD)
-		i = pack->p->num_objects / BITS_IN_EWORD;
-
-	memset(reuse->words, 0xFF, i * sizeof(eword_t));
-
-	for (; i < result->word_alloc; ++i) {
-		eword_t word = result->words[i];
-		size_t pos = (i * BITS_IN_EWORD);
+	for (; pos < result->word_alloc; pos++) {
+		eword_t word = result->words[pos];
 		size_t offset;
 
-		for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
-			if ((word >> offset) == 0)
+		for (offset = 0; offset < BITS_IN_EWORD; offset++) {
+			size_t bit_pos;
+			uint32_t pack_pos;
+
+			if (word >> offset == 0)
 				break;
 
 			offset += ewah_bit_ctz64(word >> offset);
-			if (try_partial_reuse(pack, pos + offset,
-					      reuse, &w_curs) < 0) {
+
+			bit_pos = pos * BITS_IN_EWORD + offset;
+			if (bit_pos < pack->bitmap_pos)
+				continue;
+			if (bit_pos >= pack->bitmap_pos + pack->bitmap_nr)
+				goto done;
+
+			if (bitmap_is_midx(bitmap_git)) {
+				uint32_t midx_pos;
+				off_t ofs;
+
+				midx_pos = pack_pos_to_midx(bitmap_git->midx, bit_pos);
+				ofs = nth_midxed_offset(bitmap_git->midx, midx_pos);
+
+				if (offset_to_pack_pos(pack->p, ofs, &pack_pos) < 0)
+					BUG("could not find object in pack %s "
+					    "at offset %"PRIuMAX" in MIDX",
+					    pack_basename(pack->p), (uintmax_t)ofs);
+			} else {
+				pack_pos = cast_size_t_to_uint32_t(st_sub(bit_pos, pack->bitmap_pos));
+				if (pack_pos >= pack->p->num_objects)
+					BUG("advanced beyond the end of pack %s (%"PRIuMAX" > %"PRIu32")",
+					    pack_basename(pack->p), (uintmax_t)pack_pos,
+					    pack->p->num_objects);
+			}
+
+			if (try_partial_reuse(bitmap_git, pack, bit_pos,
+					      pack_pos, reuse, &w_curs) < 0) {
 				/*
 				 * try_partial_reuse indicated we couldn't reuse
 				 * any bits, so there is no point in trying more
-- 
2.43.0.102.ga31d690331.dirty


^ permalink raw reply related

* [PATCH v2 20/26] pack-revindex: implement `midx_pair_to_pack_pos()`
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>

Now that we have extracted the `midx_key_to_pack_pos()` function, we can
implement the `midx_pair_to_pack_pos()` function which accepts (pack_id,
offset) tuples and returns an index into the psuedo-pack order.

This will be used in a following commit in order to figure out whether
or not the MIDX chose a given delta's base object from the same pack as
the delta resides in. It will do so by locating the base object's offset
in the pack, and then performing a binary search using the same pack ID
with the base object's offset.

If (and only if) it finds a match (at any position) we can guarantee
that the MIDX selected both halves of the delta/base pair from the same
pack.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-revindex.c | 11 +++++++++++
 pack-revindex.h |  3 +++
 2 files changed, 14 insertions(+)

diff --git a/pack-revindex.c b/pack-revindex.c
index baa4657ed3..a7624d8be8 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -564,3 +564,14 @@ int midx_to_pack_pos(struct multi_pack_index *m, uint32_t at, uint32_t *pos)
 
 	return midx_key_to_pack_pos(m, &key, pos);
 }
+
+int midx_pair_to_pack_pos(struct multi_pack_index *m, uint32_t pack_int_id,
+			  off_t ofs, uint32_t *pos)
+{
+	struct midx_pack_key key = {
+		.pack = pack_int_id,
+		.offset = ofs,
+		.midx = m,
+	};
+	return midx_key_to_pack_pos(m, &key, pos);
+}
diff --git a/pack-revindex.h b/pack-revindex.h
index 6dd47efea1..422c2487ae 100644
--- a/pack-revindex.h
+++ b/pack-revindex.h
@@ -142,4 +142,7 @@ uint32_t pack_pos_to_midx(struct multi_pack_index *m, uint32_t pos);
  */
 int midx_to_pack_pos(struct multi_pack_index *midx, uint32_t at, uint32_t *pos);
 
+int midx_pair_to_pack_pos(struct multi_pack_index *midx, uint32_t pack_id,
+			  off_t ofs, uint32_t *pos);
+
 #endif
-- 
2.43.0.102.ga31d690331.dirty


^ permalink raw reply related

* [PATCH v2 19/26] pack-revindex: factor out `midx_key_to_pack_pos()` helper
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>

The `midx_to_pack_pos()` function implements a binary search over
objects in the MIDX between lexical and pseudo-pack order. It does this
by taking in an index into the lexical order (i.e. the same argument
you'd use for `nth_midxed_object_id()` and similar) and spits out a
position in the pseudo-pack order.

This works for all callers, since they currently all are translating
from lexical order to pseudo-pack order. But future callers may want to
translate a known (offset, pack_id) tuple into an index into the
psuedo-pack order, without knowing where that (offset, pack_id) tuple
appears in lexical order.

Prepare for implementing a function that translates between a (offset,
pack_id) tuple into an index into the psuedo-pack order by extracting a
helper function which does just that, and then reimplementing
midx_to_pack_pos() in terms of it.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-revindex.c | 39 ++++++++++++++++++++++++---------------
 1 file changed, 24 insertions(+), 15 deletions(-)

diff --git a/pack-revindex.c b/pack-revindex.c
index 7dc6c776d5..baa4657ed3 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -520,19 +520,12 @@ static int midx_pack_order_cmp(const void *va, const void *vb)
 	return 0;
 }
 
-int midx_to_pack_pos(struct multi_pack_index *m, uint32_t at, uint32_t *pos)
+static int midx_key_to_pack_pos(struct multi_pack_index *m,
+				struct midx_pack_key *key,
+				uint32_t *pos)
 {
-	struct midx_pack_key key;
 	uint32_t *found;
 
-	if (!m->revindex_data)
-		BUG("midx_to_pack_pos: reverse index not yet loaded");
-	if (m->num_objects <= at)
-		BUG("midx_to_pack_pos: out-of-bounds object at %"PRIu32, at);
-
-	key.pack = nth_midxed_pack_int_id(m, at);
-	key.offset = nth_midxed_offset(m, at);
-	key.midx = m;
 	/*
 	 * The preferred pack sorts first, so determine its identifier by
 	 * looking at the first object in pseudo-pack order.
@@ -542,16 +535,32 @@ int midx_to_pack_pos(struct multi_pack_index *m, uint32_t at, uint32_t *pos)
 	 * implicitly is preferred (and includes all its objects, since ties are
 	 * broken first by pack identifier).
 	 */
-	if (midx_preferred_pack(key.midx, &key.preferred_pack) < 0)
+	if (midx_preferred_pack(key->midx, &key->preferred_pack) < 0)
 		return error(_("could not determine preferred pack"));
 
-
-	found = bsearch(&key, m->revindex_data, m->num_objects,
-			sizeof(*m->revindex_data), midx_pack_order_cmp);
+	found = bsearch(key, m->revindex_data, m->num_objects,
+			sizeof(*m->revindex_data),
+			midx_pack_order_cmp);
 
 	if (!found)
-		return error("bad offset for revindex");
+		return -1;
 
 	*pos = found - m->revindex_data;
 	return 0;
 }
+
+int midx_to_pack_pos(struct multi_pack_index *m, uint32_t at, uint32_t *pos)
+{
+	struct midx_pack_key key;
+
+	if (!m->revindex_data)
+		BUG("midx_to_pack_pos: reverse index not yet loaded");
+	if (m->num_objects <= at)
+		BUG("midx_to_pack_pos: out-of-bounds object at %"PRIu32, at);
+
+	key.pack = nth_midxed_pack_int_id(m, at);
+	key.offset = nth_midxed_offset(m, at);
+	key.midx = m;
+
+	return midx_key_to_pack_pos(m, &key, pos);
+}
-- 
2.43.0.102.ga31d690331.dirty


^ permalink raw reply related

* [PATCH v2 18/26] midx: implement `midx_preferred_pack()`
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>

When performing a binary search over the objects in a MIDX's bitmap
(i.e. in pseudo-pack order), the reader reconstructs the pseudo-pack
ordering using a combination of (a) the preferred pack, (b) the pack's
lexical position in the MIDX based on pack names, and (c) the object
offset within the pack.

In order to perform this binary search, the reader must know the
identity of the preferred pack. This could be stored in the MIDX, but
isn't for historical reasons, mostly because it can easily be inferred
at read-time by looking at the object in the first bit position and
finding out which pack it was selected from in the MIDX, like so:

    nth_midxed_pack_int_id(m, pack_pos_to_midx(m, 0));

In midx_to_pack_pos() which performs this binary search, we look up the
identity of the preferred pack before each search. This is relatively
quick, since it involves two table-driven lookups (one in the MIDX's
revindex for `pack_pos_to_midx()`, and another in the MIDX's object
table for `nth_midxed_pack_int_id()`).

But since the preferred pack does not change after the MIDX is written,
it is safe to cache this value on the MIDX itself.

Write a helper to do just that, and rewrite all of the existing
call-sites that care about the identity of the preferred pack in terms
of this new helper.

This will prepare us for a subsequent patch where we will need to binary
search through the MIDX's pseudo-pack order multiple times.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c                    | 20 ++++++++++++++++++++
 midx.h                    |  2 ++
 pack-bitmap.c             | 17 +++++++----------
 pack-bitmap.h             |  1 -
 pack-revindex.c           |  4 +++-
 t/helper/test-read-midx.c | 13 +++++--------
 6 files changed, 37 insertions(+), 20 deletions(-)

diff --git a/midx.c b/midx.c
index beaf0c0de4..85e1c2cd12 100644
--- a/midx.c
+++ b/midx.c
@@ -21,6 +21,7 @@
 #include "refs.h"
 #include "revision.h"
 #include "list-objects.h"
+#include "pack-revindex.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
@@ -177,6 +178,8 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 
 	m->num_packs = get_be32(m->data + MIDX_BYTE_NUM_PACKS);
 
+	m->preferred_pack_idx = -1;
+
 	cf = init_chunkfile(NULL);
 
 	if (read_table_of_contents(cf, m->data, midx_size,
@@ -460,6 +463,23 @@ int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name)
 	return midx_locate_pack(m, idx_or_pack_name, NULL);
 }
 
+int midx_preferred_pack(struct multi_pack_index *m, uint32_t *pack_int_id)
+{
+	if (m->preferred_pack_idx == -1) {
+		if (load_midx_revindex(m) < 0) {
+			m->preferred_pack_idx = -2;
+			return -1;
+		}
+
+		m->preferred_pack_idx =
+			nth_midxed_pack_int_id(m, pack_pos_to_midx(m, 0));
+	} else if (m->preferred_pack_idx == -2)
+		return -1; /* no revindex */
+
+	*pack_int_id = m->preferred_pack_idx;
+	return 0;
+}
+
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local)
 {
 	struct multi_pack_index *m;
diff --git a/midx.h b/midx.h
index 89c5aa637e..f87a8fff26 100644
--- a/midx.h
+++ b/midx.h
@@ -29,6 +29,7 @@ struct multi_pack_index {
 	unsigned char num_chunks;
 	uint32_t num_packs;
 	uint32_t num_objects;
+	int preferred_pack_idx;
 
 	int local;
 
@@ -74,6 +75,7 @@ int midx_contains_pack(struct multi_pack_index *m,
 		       const char *idx_or_pack_name);
 int midx_locate_pack(struct multi_pack_index *m, const char *idx_or_pack_name,
 		     uint32_t *pos);
+int midx_preferred_pack(struct multi_pack_index *m, uint32_t *pack_int_id);
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local);
 
 /*
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 4d5a484678..1682f99596 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -338,7 +338,7 @@ static int open_midx_bitmap_1(struct bitmap_index *bitmap_git,
 	struct stat st;
 	char *bitmap_name = midx_bitmap_filename(midx);
 	int fd = git_open(bitmap_name);
-	uint32_t i;
+	uint32_t i, preferred_pack;
 	struct packed_git *preferred;
 
 	if (fd < 0) {
@@ -393,7 +393,12 @@ static int open_midx_bitmap_1(struct bitmap_index *bitmap_git,
 		}
 	}
 
-	preferred = bitmap_git->midx->packs[midx_preferred_pack(bitmap_git)];
+	if (midx_preferred_pack(bitmap_git->midx, &preferred_pack) < 0) {
+		warning(_("could not determine MIDX preferred pack"));
+		goto cleanup;
+	}
+
+	preferred = bitmap_git->midx->packs[preferred_pack];
 	if (!is_pack_valid(preferred)) {
 		warning(_("preferred pack (%s) is invalid"),
 			preferred->pack_name);
@@ -1926,14 +1931,6 @@ static int try_partial_reuse(struct bitmapped_pack *pack,
 	return 0;
 }
 
-uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git)
-{
-	struct multi_pack_index *m = bitmap_git->midx;
-	if (!m)
-		BUG("midx_preferred_pack: requires non-empty MIDX");
-	return nth_midxed_pack_int_id(m, pack_pos_to_midx(bitmap_git->midx, 0));
-}
-
 static void reuse_partial_packfile_from_bitmap_1(struct bitmap_index *bitmap_git,
 						 struct bitmapped_pack *pack,
 						 struct bitmap *reuse)
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 7a12a2ce81..179b343912 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -77,7 +77,6 @@ int test_bitmap_hashes(struct repository *r);
 
 struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 					 int filter_provided_objects);
-uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git);
 void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 					struct bitmapped_pack **packs_out,
 					size_t *packs_nr_out,
diff --git a/pack-revindex.c b/pack-revindex.c
index acf1dd9786..7dc6c776d5 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -542,7 +542,9 @@ int midx_to_pack_pos(struct multi_pack_index *m, uint32_t at, uint32_t *pos)
 	 * implicitly is preferred (and includes all its objects, since ties are
 	 * broken first by pack identifier).
 	 */
-	key.preferred_pack = nth_midxed_pack_int_id(m, pack_pos_to_midx(m, 0));
+	if (midx_preferred_pack(key.midx, &key.preferred_pack) < 0)
+		return error(_("could not determine preferred pack"));
+
 
 	found = bsearch(&key, m->revindex_data, m->num_objects,
 			sizeof(*m->revindex_data), midx_pack_order_cmp);
diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c
index e48557aba1..4acae41bb9 100644
--- a/t/helper/test-read-midx.c
+++ b/t/helper/test-read-midx.c
@@ -6,6 +6,7 @@
 #include "pack-bitmap.h"
 #include "packfile.h"
 #include "setup.h"
+#include "gettext.h"
 
 static int read_midx_file(const char *object_dir, int show_objects)
 {
@@ -79,7 +80,7 @@ static int read_midx_checksum(const char *object_dir)
 static int read_midx_preferred_pack(const char *object_dir)
 {
 	struct multi_pack_index *midx = NULL;
-	struct bitmap_index *bitmap = NULL;
+	uint32_t preferred_pack;
 
 	setup_git_directory();
 
@@ -87,16 +88,12 @@ static int read_midx_preferred_pack(const char *object_dir)
 	if (!midx)
 		return 1;
 
-	bitmap = prepare_bitmap_git(the_repository);
-	if (!bitmap)
-		return 1;
-	if (!bitmap_is_midx(bitmap)) {
-		free_bitmap_index(bitmap);
+	if (midx_preferred_pack(midx, &preferred_pack) < 0) {
+		warning(_("could not determine MIDX preferred pack"));
 		return 1;
 	}
 
-	printf("%s\n", midx->pack_names[midx_preferred_pack(bitmap)]);
-	free_bitmap_index(bitmap);
+	printf("%s\n", midx->pack_names[preferred_pack]);
 	return 0;
 }
 
-- 
2.43.0.102.ga31d690331.dirty


^ permalink raw reply related

* [PATCH v2 17/26] git-compat-util.h: implement checked size_t to uint32_t conversion
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>

In a similar fashion as other checked cast functions in this header
(such as `cast_size_t_to_ulong()` and `cast_size_t_to_int()`), implement
a checked cast function for going from a size_t to a uint32_t value.

This function will be utilized in a future commit which needs to make
such a conversion.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 git-compat-util.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/git-compat-util.h b/git-compat-util.h
index 3e7a59b5ff..c3b6c2c226 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -1013,6 +1013,15 @@ static inline unsigned long cast_size_t_to_ulong(size_t a)
 	return (unsigned long)a;
 }
 
+static inline uint32_t cast_size_t_to_uint32_t(size_t a)
+{
+	if (a != (uint32_t)a)
+		die("object too large to read on this platform: %"
+		    PRIuMAX" is cut off to %u",
+		    (uintmax_t)a, (uint32_t)a);
+	return (uint32_t)a;
+}
+
 static inline int cast_size_t_to_int(size_t a)
 {
 	if (a > INT_MAX)
-- 
2.43.0.102.ga31d690331.dirty


^ permalink raw reply related

* [PATCH v2 16/26] pack-objects: include number of packs reused in output
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>

In addition to including the number of objects reused verbatim from a
reuse-pack, include the number of packs from which objects were reused.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 31053128fc..7eb035eb7d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -223,6 +223,7 @@ static struct progress *progress_state;
 
 static struct bitmapped_pack *reuse_packfiles;
 static size_t reuse_packfiles_nr;
+static size_t reuse_packfiles_used_nr;
 static uint32_t reuse_packfile_objects;
 static struct bitmap *reuse_packfile_bitmap;
 
@@ -1265,6 +1266,8 @@ static void write_pack_file(void)
 			for (j = 0; j < reuse_packfiles_nr; j++) {
 				reused_chunks_nr = 0;
 				write_reused_pack(&reuse_packfiles[j], f);
+				if (reused_chunks_nr)
+					reuse_packfiles_used_nr++;
 			}
 			offset = hashfile_total(f);
 		}
@@ -4587,9 +4590,10 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		fprintf_ln(stderr,
 			   _("Total %"PRIu32" (delta %"PRIu32"),"
 			     " reused %"PRIu32" (delta %"PRIu32"),"
-			     " pack-reused %"PRIu32),
+			     " pack-reused %"PRIu32" (from %"PRIuMAX")"),
 			   written, written_delta, reused, reused_delta,
-			   reuse_packfile_objects);
+			   reuse_packfile_objects,
+			   (uintmax_t)reuse_packfiles_used_nr);
 
 cleanup:
 	clear_packing_data(&to_pack);
-- 
2.43.0.102.ga31d690331.dirty


^ permalink raw reply related

* [PATCH v2 15/26] pack-objects: prepare `write_reused_pack_verbatim()` for multi-pack reuse
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>

The function `write_reused_pack_verbatim()` within
`builtin/pack-objects.c` is responsible for writing out a continuous
set of objects beginning at the start of the reuse packfile.

In the existing implementation, we did something like:

    while (pos < reuse_packfile_bitmap->word_alloc &&
           reuse_packfile_bitmap->words[pos] == (eword_t)~0)
      pos++;

    if (pos)
      /* write first `pos * BITS_IN_WORD` objects from pack */

as an optimization to record a single chunk for the longest continuous
prefix of objects wanted out of the reuse pack, instead of having a
chunk for each individual object. For more details, see bb514de356
(pack-objects: improve partial packfile reuse, 2019-12-18).

In order to retain this optimization in a multi-pack reuse world, we can
no longer assume that the first object in a pack is on a word boundary
in the bitmap storing the set of reusable objects.

Assuming that all objects from the beginning of the reuse packfile up to
the object corresponding to the first bit on a word boundary are part of
the result, consume whole words at a time until the last whole word
belonging to the reuse packfile. Copy those objects to the resulting
packfile, and track that we reused them by recording a single chunk.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 73 ++++++++++++++++++++++++++++++++++--------
 1 file changed, 60 insertions(+), 13 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 6ce52d88a9..31053128fc 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1097,31 +1097,78 @@ static void write_reused_pack_one(struct packed_git *reuse_packfile,
 
 static size_t write_reused_pack_verbatim(struct bitmapped_pack *reuse_packfile,
 					 struct hashfile *out,
-					 off_t pack_start UNUSED,
+					 off_t pack_start,
 					 struct pack_window **w_curs)
 {
-	size_t pos = 0;
+	size_t pos = reuse_packfile->bitmap_pos;
+	size_t end;
 
-	while (pos < reuse_packfile_bitmap->word_alloc &&
-			reuse_packfile_bitmap->words[pos] == (eword_t)~0)
-		pos++;
+	if (pos % BITS_IN_EWORD) {
+		size_t word_pos = (pos / BITS_IN_EWORD);
+		size_t offset = pos % BITS_IN_EWORD;
+		size_t last;
+		eword_t word = reuse_packfile_bitmap->words[word_pos];
 
-	if (pos) {
-		off_t to_write;
+		if (offset + reuse_packfile->bitmap_nr < BITS_IN_EWORD)
+			last = offset + reuse_packfile->bitmap_nr;
+		else
+			last = BITS_IN_EWORD;
 
-		written = (pos * BITS_IN_EWORD);
-		to_write = pack_pos_to_offset(reuse_packfile->p, written)
-			- sizeof(struct pack_header);
+		for (; offset < last; offset++) {
+			if (word >> offset == 0)
+				return word_pos;
+			if (!bitmap_get(reuse_packfile_bitmap,
+					word_pos * BITS_IN_EWORD + offset))
+				return word_pos;
+		}
+
+		pos += BITS_IN_EWORD - (pos % BITS_IN_EWORD);
+	}
+
+	/*
+	 * Now we're going to copy as many whole eword_t's as possible.
+	 * "end" is the index of the last whole eword_t we copy, but
+	 * there may be additional bits to process. Those are handled
+	 * individually by write_reused_pack().
+	 *
+	 * Begin by advancing to the first word boundary in range of the
+	 * bit positions occupied by objects in "reuse_packfile". Then
+	 * pick the last word boundary in the same range. If we have at
+	 * least one word's worth of bits to process, continue on.
+	 */
+	end = reuse_packfile->bitmap_pos + reuse_packfile->bitmap_nr;
+	if (end % BITS_IN_EWORD)
+		end -= end % BITS_IN_EWORD;
+	if (pos >= end)
+		return reuse_packfile->bitmap_pos / BITS_IN_EWORD;
+
+	while (pos < end &&
+	       reuse_packfile_bitmap->words[pos / BITS_IN_EWORD] == (eword_t)~0)
+		pos += BITS_IN_EWORD;
+
+	if (pos > end)
+		pos = end;
+
+	if (reuse_packfile->bitmap_pos < pos) {
+		off_t pack_start_off = pack_pos_to_offset(reuse_packfile->p, 0);
+		off_t pack_end_off = pack_pos_to_offset(reuse_packfile->p,
+							pos - reuse_packfile->bitmap_pos);
+
+		written += pos - reuse_packfile->bitmap_pos;
 
 		/* We're recording one chunk, not one object. */
-		record_reused_object(sizeof(struct pack_header), 0);
+		record_reused_object(pack_start_off,
+				     pack_start_off - (hashfile_total(out) - pack_start));
 		hashflush(out);
 		copy_pack_data(out, reuse_packfile->p, w_curs,
-			sizeof(struct pack_header), to_write);
+			pack_start_off, pack_end_off - pack_start_off);
 
 		display_progress(progress_state, written);
 	}
-	return pos;
+	if (pos % BITS_IN_EWORD)
+		BUG("attempted to jump past a word boundary to %"PRIuMAX,
+		    (uintmax_t)pos);
+	return pos / BITS_IN_EWORD;
 }
 
 static void write_reused_pack(struct bitmapped_pack *reuse_packfile,
-- 
2.43.0.102.ga31d690331.dirty


^ permalink raw reply related

* [PATCH v2 14/26] pack-objects: prepare `write_reused_pack()` for multi-pack reuse
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>

The function `write_reused_pack()` within `builtin/pack-objects.c` is
responsible for performing pack-reuse on a single pack, and has two main
functions:

  - it dispatches a call to `write_reused_pack_verbatim()` to see if we
    can reuse portions of the packfile in whole-word chunks

  - for any remaining objects (that is, any objects that appear after
    the first "gap" in the bitmap), call write_reused_pack_one() on that
    object to record it for reuse.

Prepare this function for multi-pack reuse by removing the assumption
that the bit position corresponding to the first object being reused
from a given pack must be at bit position zero.

The changes in this function are mostly straightforward. Initialize `i`
to the position of the first word to contain bits corresponding to that
reuse pack. In most situations, we throw the initialized value away,
since we end up replacing it with the return value from
write_reused_pack_verbatim(), moving us past the section of whole words
that we reused.

Likewise, modify the per-object loop to ignore any bits at the beginning
of the first word that do not belong to the pack currently being reused,
as well as skip to the "done" section once we have processed the last
bit corresponding to this pack.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 07c849b5d4..6ce52d88a9 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1127,7 +1127,7 @@ static size_t write_reused_pack_verbatim(struct bitmapped_pack *reuse_packfile,
 static void write_reused_pack(struct bitmapped_pack *reuse_packfile,
 			      struct hashfile *f)
 {
-	size_t i = 0;
+	size_t i = reuse_packfile->bitmap_pos / BITS_IN_EWORD;
 	uint32_t offset;
 	off_t pack_start = hashfile_total(f) - sizeof(struct pack_header);
 	struct pack_window *w_curs = NULL;
@@ -1145,17 +1145,23 @@ static void write_reused_pack(struct bitmapped_pack *reuse_packfile,
 				break;
 
 			offset += ewah_bit_ctz64(word >> offset);
+			if (pos + offset < reuse_packfile->bitmap_pos)
+				continue;
+			if (pos + offset >= reuse_packfile->bitmap_pos + reuse_packfile->bitmap_nr)
+				goto done;
 			/*
 			 * Can use bit positions directly, even for MIDX
 			 * bitmaps. See comment in try_partial_reuse()
 			 * for why.
 			 */
-			write_reused_pack_one(reuse_packfile->p, pos + offset,
+			write_reused_pack_one(reuse_packfile->p,
+					      pos + offset - reuse_packfile->bitmap_pos,
 					      f, pack_start, &w_curs);
 			display_progress(progress_state, ++written);
 		}
 	}
 
+done:
 	unuse_pack(&w_curs);
 }
 
-- 
2.43.0.102.ga31d690331.dirty


^ permalink raw reply related

* [PATCH v2 13/26] pack-objects: pass `bitmapped_pack`'s to pack-reuse functions
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>

Further prepare pack-objects to perform verbatim pack-reuse over
multiple packfiles by converting functions that take in a pointer to a
`struct packed_git` to instead take in a pointer to a `struct
bitmapped_pack`.

The additional information found in the bitmapped_pack struct (such as
the bit position corresponding to the beginning of the pack) will be
necessary in order to perform verbatim pack-reuse.

Note that we don't use any of the extra pieces of information contained
in the bitmapped_pack struct, so this step is merely preparatory and
does not introduce any functional changes.

Note further that we do not change the argument type to
write_reused_pack_one(). That function is responsible for copying
sections of the packfile directly and optionally patching any OFS_DELTAs
to account for not reusing sections of the packfile in between a delta
and its base.

As such, that function is (and should remain) oblivious to multi-pack
reuse, and does not require any of the extra pieces of information
stored in the bitmapped_pack struct.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index f51b86d99f..07c849b5d4 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -221,7 +221,8 @@ static int thin;
 static int num_preferred_base;
 static struct progress *progress_state;
 
-static struct packed_git *reuse_packfile;
+static struct bitmapped_pack *reuse_packfiles;
+static size_t reuse_packfiles_nr;
 static uint32_t reuse_packfile_objects;
 static struct bitmap *reuse_packfile_bitmap;
 
@@ -1094,7 +1095,7 @@ static void write_reused_pack_one(struct packed_git *reuse_packfile,
 	copy_pack_data(out, reuse_packfile, w_curs, offset, next - offset);
 }
 
-static size_t write_reused_pack_verbatim(struct packed_git *reuse_packfile,
+static size_t write_reused_pack_verbatim(struct bitmapped_pack *reuse_packfile,
 					 struct hashfile *out,
 					 off_t pack_start UNUSED,
 					 struct pack_window **w_curs)
@@ -1109,13 +1110,13 @@ static size_t write_reused_pack_verbatim(struct packed_git *reuse_packfile,
 		off_t to_write;
 
 		written = (pos * BITS_IN_EWORD);
-		to_write = pack_pos_to_offset(reuse_packfile, written)
+		to_write = pack_pos_to_offset(reuse_packfile->p, written)
 			- sizeof(struct pack_header);
 
 		/* We're recording one chunk, not one object. */
 		record_reused_object(sizeof(struct pack_header), 0);
 		hashflush(out);
-		copy_pack_data(out, reuse_packfile, w_curs,
+		copy_pack_data(out, reuse_packfile->p, w_curs,
 			sizeof(struct pack_header), to_write);
 
 		display_progress(progress_state, written);
@@ -1123,7 +1124,7 @@ static size_t write_reused_pack_verbatim(struct packed_git *reuse_packfile,
 	return pos;
 }
 
-static void write_reused_pack(struct packed_git *reuse_packfile,
+static void write_reused_pack(struct bitmapped_pack *reuse_packfile,
 			      struct hashfile *f)
 {
 	size_t i = 0;
@@ -1149,8 +1150,8 @@ static void write_reused_pack(struct packed_git *reuse_packfile,
 			 * bitmaps. See comment in try_partial_reuse()
 			 * for why.
 			 */
-			write_reused_pack_one(reuse_packfile, pos + offset, f,
-					      pack_start, &w_curs);
+			write_reused_pack_one(reuse_packfile->p, pos + offset,
+					      f, pack_start, &w_curs);
 			display_progress(progress_state, ++written);
 		}
 	}
@@ -1206,9 +1207,12 @@ static void write_pack_file(void)
 
 		offset = write_pack_header(f, nr_remaining);
 
-		if (reuse_packfile) {
+		if (reuse_packfiles_nr) {
 			assert(pack_to_stdout);
-			write_reused_pack(reuse_packfile, f);
+			for (j = 0; j < reuse_packfiles_nr; j++) {
+				reused_chunks_nr = 0;
+				write_reused_pack(&reuse_packfiles[j], f);
+			}
 			offset = hashfile_total(f);
 		}
 
@@ -3949,19 +3953,16 @@ static int pack_options_allow_reuse(void)
 
 static int get_object_list_from_bitmap(struct rev_info *revs)
 {
-	struct bitmapped_pack *packs = NULL;
-	size_t packs_nr = 0;
-
 	if (!(bitmap_git = prepare_bitmap_walk(revs, 0)))
 		return -1;
 
 	if (pack_options_allow_reuse())
-		reuse_partial_packfile_from_bitmap(bitmap_git, &packs,
-						   &packs_nr,
+		reuse_partial_packfile_from_bitmap(bitmap_git,
+						   &reuse_packfiles,
+						   &reuse_packfiles_nr,
 						   &reuse_packfile_bitmap);
 
-	if (packs) {
-		reuse_packfile = packs[0].p;
+	if (reuse_packfiles) {
 		reuse_packfile_objects = bitmap_popcount(reuse_packfile_bitmap);
 		if (!reuse_packfile_objects)
 			BUG("expected non-empty reuse bitmap");
-- 
2.43.0.102.ga31d690331.dirty


^ permalink raw reply related

* [PATCH v2 12/26] pack-objects: keep track of `pack_start` for each reuse pack
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>

When reusing objects from a pack, we keep track of a set of one or more
`reused_chunk`s, corresponding to sections of one or more object(s) from
a source pack that we are reusing. Each chunk contains two pieces of
information:

  - the offset of the first object in the source pack (relative to the
    beginning of the source pack)
  - the difference between that offset, and the corresponding offset in
    the pack we're generating

The purpose of keeping track of these is so that we can patch an
OFS_DELTAs that cross over a section of the reuse pack that we didn't
take.

For instance, consider a hypothetical pack as shown below:

                                                (chunk #2)
                                                __________...
                                               /
                                              /
      +--------+---------+-------------------+---------+
  ... | <base> | <other> |      (unused)     | <delta> | ...
      +--------+---------+-------------------+---------+
       \                /
        \______________/
           (chunk #1)

Suppose that we are sending objects "base", "other", and "delta", and
that the "delta" object is stored as an OFS_DELTA, and that its base is
"base". If we don't send any objects in the "(unused)" range, we can't
copy the delta'd object directly, since its delta offset includes a
range of the pack that we didn't copy, so we have to account for that
difference when patching and reassembling the delta.

In order to compute this value correctly, we need to know not only where
we are in the packfile we're assembling (with `hashfile_total(f)`) but
also the position of the first byte of the packfile that we are
currently reusing. Currently, this works just fine, since when reusing
only a single pack those two values are always identical (because
verbatim reuse is the first thing pack-objects does when enabled after
writing the pack header).

But when reusing multiple packs which have one or more gaps, we'll need
to account for these two values diverging.

Together, these two allow us to compute the reused chunk's offset
difference relative to the start of the reused pack, as desired.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 102fe9a4f8..f51b86d99f 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1015,6 +1015,7 @@ static off_t find_reused_offset(off_t where)
 
 static void write_reused_pack_one(struct packed_git *reuse_packfile,
 				  size_t pos, struct hashfile *out,
+				  off_t pack_start,
 				  struct pack_window **w_curs)
 {
 	off_t offset, next, cur;
@@ -1024,7 +1025,8 @@ static void write_reused_pack_one(struct packed_git *reuse_packfile,
 	offset = pack_pos_to_offset(reuse_packfile, pos);
 	next = pack_pos_to_offset(reuse_packfile, pos + 1);
 
-	record_reused_object(offset, offset - hashfile_total(out));
+	record_reused_object(offset,
+			     offset - (hashfile_total(out) - pack_start));
 
 	cur = offset;
 	type = unpack_object_header(reuse_packfile, w_curs, &cur, &size);
@@ -1094,6 +1096,7 @@ static void write_reused_pack_one(struct packed_git *reuse_packfile,
 
 static size_t write_reused_pack_verbatim(struct packed_git *reuse_packfile,
 					 struct hashfile *out,
+					 off_t pack_start UNUSED,
 					 struct pack_window **w_curs)
 {
 	size_t pos = 0;
@@ -1125,10 +1128,12 @@ static void write_reused_pack(struct packed_git *reuse_packfile,
 {
 	size_t i = 0;
 	uint32_t offset;
+	off_t pack_start = hashfile_total(f) - sizeof(struct pack_header);
 	struct pack_window *w_curs = NULL;
 
 	if (allow_ofs_delta)
-		i = write_reused_pack_verbatim(reuse_packfile, f, &w_curs);
+		i = write_reused_pack_verbatim(reuse_packfile, f, pack_start,
+					       &w_curs);
 
 	for (; i < reuse_packfile_bitmap->word_alloc; ++i) {
 		eword_t word = reuse_packfile_bitmap->words[i];
@@ -1145,7 +1150,7 @@ static void write_reused_pack(struct packed_git *reuse_packfile,
 			 * for why.
 			 */
 			write_reused_pack_one(reuse_packfile, pos + offset, f,
-					      &w_curs);
+					      pack_start, &w_curs);
 			display_progress(progress_state, ++written);
 		}
 	}
-- 
2.43.0.102.ga31d690331.dirty


^ permalink raw reply related

* [PATCH v2 11/26] pack-objects: parameterize pack-reuse routines over a single pack
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>

The routines pack-objects uses to perform verbatim pack-reuse are:

  - write_reused_pack_one()
  - write_reused_pack_verbatim()
  - write_reused_pack()

, all of which assume that there is exactly one packfile being reused:
the global constant `reuse_packfile`.

Prepare for reusing objects from multiple packs by making reuse packfile
a parameter of each of the above functions in preparation for calling
these functions in a loop with multiple packfiles.

Note that we still have the global "reuse_packfile", but pass it through
each of the above function's parameter lists, eliminating all but one
direct access (the top-level caller in `write_pack_file()`). Even after
this series, we will still have a global, but it will hold the array of
reusable packfiles, and we'll pass them one at a time to these functions
in a loop.

Note also that we will eventually need to pass a `bitmapped_pack`
instead of a `packed_git` in order to hold onto additional information
required for reuse (such as the bit position of the first object
belonging to that pack). But that change will be made in a future commit
so as to minimize the noise below as much as possible.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 87e16636a8..102fe9a4f8 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1013,7 +1013,8 @@ static off_t find_reused_offset(off_t where)
 	return reused_chunks[lo-1].difference;
 }
 
-static void write_reused_pack_one(size_t pos, struct hashfile *out,
+static void write_reused_pack_one(struct packed_git *reuse_packfile,
+				  size_t pos, struct hashfile *out,
 				  struct pack_window **w_curs)
 {
 	off_t offset, next, cur;
@@ -1091,7 +1092,8 @@ static void write_reused_pack_one(size_t pos, struct hashfile *out,
 	copy_pack_data(out, reuse_packfile, w_curs, offset, next - offset);
 }
 
-static size_t write_reused_pack_verbatim(struct hashfile *out,
+static size_t write_reused_pack_verbatim(struct packed_git *reuse_packfile,
+					 struct hashfile *out,
 					 struct pack_window **w_curs)
 {
 	size_t pos = 0;
@@ -1118,14 +1120,15 @@ static size_t write_reused_pack_verbatim(struct hashfile *out,
 	return pos;
 }
 
-static void write_reused_pack(struct hashfile *f)
+static void write_reused_pack(struct packed_git *reuse_packfile,
+			      struct hashfile *f)
 {
 	size_t i = 0;
 	uint32_t offset;
 	struct pack_window *w_curs = NULL;
 
 	if (allow_ofs_delta)
-		i = write_reused_pack_verbatim(f, &w_curs);
+		i = write_reused_pack_verbatim(reuse_packfile, f, &w_curs);
 
 	for (; i < reuse_packfile_bitmap->word_alloc; ++i) {
 		eword_t word = reuse_packfile_bitmap->words[i];
@@ -1141,7 +1144,8 @@ static void write_reused_pack(struct hashfile *f)
 			 * bitmaps. See comment in try_partial_reuse()
 			 * for why.
 			 */
-			write_reused_pack_one(pos + offset, f, &w_curs);
+			write_reused_pack_one(reuse_packfile, pos + offset, f,
+					      &w_curs);
 			display_progress(progress_state, ++written);
 		}
 	}
@@ -1199,7 +1203,7 @@ static void write_pack_file(void)
 
 		if (reuse_packfile) {
 			assert(pack_to_stdout);
-			write_reused_pack(f);
+			write_reused_pack(reuse_packfile, f);
 			offset = hashfile_total(f);
 		}
 
-- 
2.43.0.102.ga31d690331.dirty


^ permalink raw reply related

* [PATCH v2 10/26] pack-bitmap: return multiple packs via `reuse_partial_packfile_from_bitmap()`
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>

Further prepare for enabling verbatim pack-reuse over multiple packfiles
by changing the signature of reuse_partial_packfile_from_bitmap() to
populate an array of `struct bitmapped_pack *`'s instead of a pointer to
a single packfile.

Since the array we're filling out is sized dynamically[^1], add an
additional `size_t *` parameter which will hold the number of reusable
packs (equal to the number of elements in the array).

Note that since we still have not implemented true multi-pack reuse,
these changes aren't propagated out to the rest of the caller in
builtin/pack-objects.c.

In the interim state, we expect that the array has a single element, and
we use that element to fill out the static `reuse_packfile` variable
(which is a bog-standard `struct packed_git *`). Future commits will
continue to push this change further out through the pack-objects code.

[^1]: That is, even though we know the number of packs which are
  candidates for pack-reuse, we do not know how many of those
  candidates we can actually reuse.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 9 +++++++--
 pack-bitmap.c          | 6 ++++--
 pack-bitmap.h          | 5 +++--
 3 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index c3df6d9657..87e16636a8 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3940,14 +3940,19 @@ static int pack_options_allow_reuse(void)
 
 static int get_object_list_from_bitmap(struct rev_info *revs)
 {
+	struct bitmapped_pack *packs = NULL;
+	size_t packs_nr = 0;
+
 	if (!(bitmap_git = prepare_bitmap_walk(revs, 0)))
 		return -1;
 
 	if (pack_options_allow_reuse())
-		reuse_partial_packfile_from_bitmap(bitmap_git, &reuse_packfile,
+		reuse_partial_packfile_from_bitmap(bitmap_git, &packs,
+						   &packs_nr,
 						   &reuse_packfile_bitmap);
 
-	if (reuse_packfile) {
+	if (packs) {
+		reuse_packfile = packs[0].p;
 		reuse_packfile_objects = bitmap_popcount(reuse_packfile_bitmap);
 		if (!reuse_packfile_objects)
 			BUG("expected non-empty reuse bitmap");
diff --git a/pack-bitmap.c b/pack-bitmap.c
index c75a83e9cc..4d5a484678 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -2001,7 +2001,8 @@ static int bitmapped_pack_cmp(const void *va, const void *vb)
 }
 
 void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
-					struct packed_git **packfile_out,
+					struct bitmapped_pack **packs_out,
+					size_t *packs_nr_out,
 					struct bitmap **reuse_out)
 {
 	struct repository *r = the_repository;
@@ -2069,7 +2070,8 @@ void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 	 * need to be handled separately.
 	 */
 	bitmap_and_not(result, reuse);
-	*packfile_out = packs[0].p;
+	*packs_out = packs;
+	*packs_nr_out = packs_nr;
 	*reuse_out = reuse;
 }
 
diff --git a/pack-bitmap.h b/pack-bitmap.h
index ab3fdcde6b..7a12a2ce81 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -78,8 +78,9 @@ int test_bitmap_hashes(struct repository *r);
 struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 					 int filter_provided_objects);
 uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git);
-void reuse_partial_packfile_from_bitmap(struct bitmap_index *,
-					struct packed_git **packfile,
+void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
+					struct bitmapped_pack **packs_out,
+					size_t *packs_nr_out,
 					struct bitmap **reuse_out);
 int rebuild_existing_bitmaps(struct bitmap_index *, struct packing_data *mapping,
 			     kh_oid_map_t *reused_bitmaps, int show_progress);
-- 
2.43.0.102.ga31d690331.dirty


^ permalink raw reply related

* [PATCH v2 09/26] pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>

The signature of `reuse_partial_packfile_from_bitmap()` currently takes
in a bitmap, as well as three output parameters (filled through
pointers, and passed as arguments), and also returns an integer result.

The output parameters are filled out with: (a) the packfile used for
pack-reuse, (b) the number of objects from that pack that we can reuse,
and (c) a bitmap indicating which objects we can reuse. The return value
is either -1 (when there are no objects to reuse), or 0 (when there is
at least one object to reuse).

Some of these parameters are redundant. Notably, we can infer from the
bitmap how many objects are reused by calling bitmap_popcount(). And we
can similar compute the return value based on that number as well.

As such, clean up the signature of this function to drop the "*entries"
parameter, as well as the int return value, since the single caller of
this function can infer these values themself.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 16 +++++++++-------
 pack-bitmap.c          | 16 +++++++---------
 pack-bitmap.h          |  7 +++----
 3 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 321d7effb0..c3df6d9657 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3943,13 +3943,15 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
 	if (!(bitmap_git = prepare_bitmap_walk(revs, 0)))
 		return -1;
 
-	if (pack_options_allow_reuse() &&
-	    !reuse_partial_packfile_from_bitmap(
-			bitmap_git,
-			&reuse_packfile,
-			&reuse_packfile_objects,
-			&reuse_packfile_bitmap)) {
-		assert(reuse_packfile_objects);
+	if (pack_options_allow_reuse())
+		reuse_partial_packfile_from_bitmap(bitmap_git, &reuse_packfile,
+						   &reuse_packfile_bitmap);
+
+	if (reuse_packfile) {
+		reuse_packfile_objects = bitmap_popcount(reuse_packfile_bitmap);
+		if (!reuse_packfile_objects)
+			BUG("expected non-empty reuse bitmap");
+
 		nr_result += reuse_packfile_objects;
 		nr_seen += reuse_packfile_objects;
 		display_progress(progress_state, nr_seen);
diff --git a/pack-bitmap.c b/pack-bitmap.c
index d64a80c30c..c75a83e9cc 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -2000,10 +2000,9 @@ static int bitmapped_pack_cmp(const void *va, const void *vb)
 	return 0;
 }
 
-int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
-				       struct packed_git **packfile_out,
-				       uint32_t *entries,
-				       struct bitmap **reuse_out)
+void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
+					struct packed_git **packfile_out,
+					struct bitmap **reuse_out)
 {
 	struct repository *r = the_repository;
 	struct bitmapped_pack *packs = NULL;
@@ -2025,7 +2024,7 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 				warning(_("unable to load pack: '%s', disabling pack-reuse"),
 					bitmap_git->midx->pack_names[i]);
 				free(packs);
-				return -1;
+				return;
 			}
 			if (!pack.bitmap_nr)
 				continue; /* no objects from this pack */
@@ -2059,10 +2058,10 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 
 	reuse_partial_packfile_from_bitmap_1(bitmap_git, packs, reuse);
 
-	*entries = bitmap_popcount(reuse);
-	if (!*entries) {
+	if (bitmap_is_empty(reuse)) {
+		free(packs);
 		bitmap_free(reuse);
-		return -1;
+		return;
 	}
 
 	/*
@@ -2072,7 +2071,6 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 	bitmap_and_not(result, reuse);
 	*packfile_out = packs[0].p;
 	*reuse_out = reuse;
-	return 0;
 }
 
 int bitmap_walk_contains(struct bitmap_index *bitmap_git,
diff --git a/pack-bitmap.h b/pack-bitmap.h
index b68b213388..ab3fdcde6b 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -78,10 +78,9 @@ int test_bitmap_hashes(struct repository *r);
 struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 					 int filter_provided_objects);
 uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git);
-int reuse_partial_packfile_from_bitmap(struct bitmap_index *,
-				       struct packed_git **packfile,
-				       uint32_t *entries,
-				       struct bitmap **reuse_out);
+void reuse_partial_packfile_from_bitmap(struct bitmap_index *,
+					struct packed_git **packfile,
+					struct bitmap **reuse_out);
 int rebuild_existing_bitmaps(struct bitmap_index *, struct packing_data *mapping,
 			     kh_oid_map_t *reused_bitmaps, int show_progress);
 void free_bitmap_index(struct bitmap_index *);
-- 
2.43.0.102.ga31d690331.dirty


^ permalink raw reply related

* [PATCH v2 08/26] ewah: implement `bitmap_is_empty()`
From: Taylor Blau @ 2023-12-14 22:23 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>

In a future commit, we will want to check whether or not a bitmap has
any bits set in any of its words. The best way to do this (prior to the
existence of this patch) is to call `bitmap_popcount()` and check
whether the result is non-zero.

But this is semi-wasteful, since we do not need to know the exact number
of bits set, only whether or not there is at least one of them.

Implement a new helper function to check just that.

Suggested-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 ewah/bitmap.c | 9 +++++++++
 ewah/ewok.h   | 1 +
 2 files changed, 10 insertions(+)

diff --git a/ewah/bitmap.c b/ewah/bitmap.c
index 7b525b1ecd..ac7e0af622 100644
--- a/ewah/bitmap.c
+++ b/ewah/bitmap.c
@@ -169,6 +169,15 @@ size_t bitmap_popcount(struct bitmap *self)
 	return count;
 }
 
+int bitmap_is_empty(struct bitmap *self)
+{
+	size_t i;
+	for (i = 0; i < self->word_alloc; i++)
+		if (self->words[i])
+			return 0;
+	return 1;
+}
+
 int bitmap_equals(struct bitmap *self, struct bitmap *other)
 {
 	struct bitmap *big, *small;
diff --git a/ewah/ewok.h b/ewah/ewok.h
index 7eb8b9b630..c11d76c6f3 100644
--- a/ewah/ewok.h
+++ b/ewah/ewok.h
@@ -189,5 +189,6 @@ void bitmap_or_ewah(struct bitmap *self, struct ewah_bitmap *other);
 void bitmap_or(struct bitmap *self, const struct bitmap *other);
 
 size_t bitmap_popcount(struct bitmap *self);
+int bitmap_is_empty(struct bitmap *self);
 
 #endif
-- 
2.43.0.102.ga31d690331.dirty


^ permalink raw reply related

* [PATCH v2 07/26] pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions
From: Taylor Blau @ 2023-12-14 22:23 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>

When trying to assemble a pack with bitmaps using `--use-bitmap-index`,
`pack-objects` asks the pack-bitmap machinery for a bitmap which
indicates the set of objects we can "reuse" verbatim from on-disk.

This set is roughly comprised of: a prefix of objects in the bitmapped
pack (or preferred pack, in the case of a multi-pack reachability
bitmap), plus any other objects not included in the prefix, excluding
any deltas whose base we are not sending in the resulting pack.

The pack-bitmap machinery is responsible for computing this bitmap, and
does so with the following functions:

  - reuse_partial_packfile_from_bitmap()
  - try_partial_reuse()

In the existing implementation, the first function is responsible for
(a) marking the prefix of objects in the reusable pack, and then (b)
calling try_partial_reuse() on any remaining objects to ensure that they
are also reusable (and removing them from the bitmapped set if they are
not).

Likewise, the `try_partial_reuse()` function is responsible for checking
whether an isolated object (that is, an object from the bitmapped
pack/preferred pack not contained in the prefix from earlier) may be
reused, i.e. that it isn't a delta of an object that we are not sending
in the resulting pack.

These functions are based on two core assumptions, which we will unwind
in this and the following commits:

  1. There is only a single pack from the bitmap which is eligible for
     verbatim pack-reuse. For single-pack bitmaps, this is trivially the
     bitmapped pack. For multi-pack bitmaps, this is (currently) the
     MIDX's preferred pack.

  2. The pack eligible for reuse has its first object in bit position 0,
     and all objects from that pack follow in pack-order from that first
     bit position.

In order to perform verbatim pack reuse over multiple packs, we must
unwind these two assumptions. Most notably, in order to reuse bits from
a given packfile, we need to know the first bit position occupied by
an object form that packfile. To propagate this information around, pass
a `struct bitmapped_pack *` anywhere we previously passed a `struct
packed_git *`, since the former contains the bitmap position we're
interested in (as well as a pointer to the latter).

As an additional step, factor out a sub-routine from the main
`reuse_partial_packfile_from_bitmap()` function, called
`reuse_partial_packfile_from_bitmap_1()`. This new function will be
responsible for figuring out which objects may be reused from a single
pack, and the existing function will dispatch multiple calls to its new
helper function for each reusable pack.

Consequently, `reuse_partial_packfile_from_bitmap()` will now maintain
an array of reusable packs instead of a single such pack. We currently
expect that array to have only a single element, so this awkward state
is short-lived. It will serve as useful scaffolding in subsequent
commits as we begin to work towards enabling multi-pack reuse.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 118 +++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 87 insertions(+), 31 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index d2f1306960..d64a80c30c 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1836,7 +1836,7 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
  * -1 means "stop trying further objects"; 0 means we may or may not have
  * reused, but you can keep feeding bits.
  */
-static int try_partial_reuse(struct packed_git *pack,
+static int try_partial_reuse(struct bitmapped_pack *pack,
 			     size_t pos,
 			     struct bitmap *reuse,
 			     struct pack_window **w_curs)
@@ -1868,11 +1868,11 @@ static int try_partial_reuse(struct packed_git *pack,
 	 * preferred pack precede all bits from other packs.
 	 */
 
-	if (pos >= pack->num_objects)
+	if (pos >= pack->p->num_objects)
 		return -1; /* not actually in the pack or MIDX preferred pack */
 
-	offset = delta_obj_offset = pack_pos_to_offset(pack, pos);
-	type = unpack_object_header(pack, w_curs, &offset, &size);
+	offset = delta_obj_offset = pack_pos_to_offset(pack->p, pos);
+	type = unpack_object_header(pack->p, w_curs, &offset, &size);
 	if (type < 0)
 		return -1; /* broken packfile, punt */
 
@@ -1888,11 +1888,11 @@ static int try_partial_reuse(struct packed_git *pack,
 		 * and the normal slow path will complain about it in
 		 * more detail.
 		 */
-		base_offset = get_delta_base(pack, w_curs, &offset, type,
+		base_offset = get_delta_base(pack->p, w_curs, &offset, type,
 					     delta_obj_offset);
 		if (!base_offset)
 			return 0;
-		if (offset_to_pack_pos(pack, base_offset, &base_pos) < 0)
+		if (offset_to_pack_pos(pack->p, base_offset, &base_pos) < 0)
 			return 0;
 
 		/*
@@ -1915,14 +1915,14 @@ static int try_partial_reuse(struct packed_git *pack,
 		 * to REF_DELTA on the fly. Better to just let the normal
 		 * object_entry code path handle it.
 		 */
-		if (!bitmap_get(reuse, base_pos))
+		if (!bitmap_get(reuse, pack->bitmap_pos + base_pos))
 			return 0;
 	}
 
 	/*
 	 * If we got here, then the object is OK to reuse. Mark it.
 	 */
-	bitmap_set(reuse, pos);
+	bitmap_set(reuse, pack->bitmap_pos + pos);
 	return 0;
 }
 
@@ -1934,29 +1934,13 @@ uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git)
 	return nth_midxed_pack_int_id(m, pack_pos_to_midx(bitmap_git->midx, 0));
 }
 
-int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
-				       struct packed_git **packfile_out,
-				       uint32_t *entries,
-				       struct bitmap **reuse_out)
+static void reuse_partial_packfile_from_bitmap_1(struct bitmap_index *bitmap_git,
+						 struct bitmapped_pack *pack,
+						 struct bitmap *reuse)
 {
-	struct repository *r = the_repository;
-	struct packed_git *pack;
 	struct bitmap *result = bitmap_git->result;
-	struct bitmap *reuse;
 	struct pack_window *w_curs = NULL;
 	size_t i = 0;
-	uint32_t offset;
-	uint32_t objects_nr;
-
-	assert(result);
-
-	load_reverse_index(r, bitmap_git);
-
-	if (bitmap_is_midx(bitmap_git))
-		pack = bitmap_git->midx->packs[midx_preferred_pack(bitmap_git)];
-	else
-		pack = bitmap_git->pack;
-	objects_nr = pack->num_objects;
 
 	while (i < result->word_alloc && result->words[i] == (eword_t)~0)
 		i++;
@@ -1969,15 +1953,15 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 	 * we use it instead of another pack. In single-pack bitmaps, the choice
 	 * is made for us.
 	 */
-	if (i > objects_nr / BITS_IN_EWORD)
-		i = objects_nr / BITS_IN_EWORD;
+	if (i > pack->p->num_objects / BITS_IN_EWORD)
+		i = pack->p->num_objects / BITS_IN_EWORD;
 
-	reuse = bitmap_word_alloc(i);
 	memset(reuse->words, 0xFF, i * sizeof(eword_t));
 
 	for (; i < result->word_alloc; ++i) {
 		eword_t word = result->words[i];
 		size_t pos = (i * BITS_IN_EWORD);
+		size_t offset;
 
 		for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
 			if ((word >> offset) == 0)
@@ -2002,6 +1986,78 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 
 done:
 	unuse_pack(&w_curs);
+}
+
+static int bitmapped_pack_cmp(const void *va, const void *vb)
+{
+	const struct bitmapped_pack *a = va;
+	const struct bitmapped_pack *b = vb;
+
+	if (a->bitmap_pos < b->bitmap_pos)
+		return -1;
+	if (a->bitmap_pos > b->bitmap_pos)
+		return 1;
+	return 0;
+}
+
+int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
+				       struct packed_git **packfile_out,
+				       uint32_t *entries,
+				       struct bitmap **reuse_out)
+{
+	struct repository *r = the_repository;
+	struct bitmapped_pack *packs = NULL;
+	struct bitmap *result = bitmap_git->result;
+	struct bitmap *reuse;
+	size_t i;
+	size_t packs_nr = 0, packs_alloc = 0;
+	size_t word_alloc;
+	uint32_t objects_nr = 0;
+
+	assert(result);
+
+	load_reverse_index(r, bitmap_git);
+
+	if (bitmap_is_midx(bitmap_git)) {
+		for (i = 0; i < bitmap_git->midx->num_packs; i++) {
+			struct bitmapped_pack pack;
+			if (nth_bitmapped_pack(r, bitmap_git->midx, &pack, i) < 0) {
+				warning(_("unable to load pack: '%s', disabling pack-reuse"),
+					bitmap_git->midx->pack_names[i]);
+				free(packs);
+				return -1;
+			}
+			if (!pack.bitmap_nr)
+				continue; /* no objects from this pack */
+			if (pack.bitmap_pos)
+				continue; /* not preferred pack */
+
+			ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
+			memcpy(&packs[packs_nr++], &pack, sizeof(pack));
+
+			objects_nr += pack.p->num_objects;
+		}
+
+		QSORT(packs, packs_nr, bitmapped_pack_cmp);
+	} else {
+		ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
+
+		packs[packs_nr].p = bitmap_git->pack;
+		packs[packs_nr].bitmap_pos = 0;
+		packs[packs_nr].bitmap_nr = bitmap_git->pack->num_objects;
+
+		objects_nr = packs[packs_nr++].p->num_objects;
+	}
+
+	word_alloc = objects_nr / BITS_IN_EWORD;
+	if (objects_nr % BITS_IN_EWORD)
+		word_alloc++;
+	reuse = bitmap_word_alloc(word_alloc);
+
+	if (packs_nr != 1)
+		BUG("pack reuse not yet implemented for multiple packs");
+
+	reuse_partial_packfile_from_bitmap_1(bitmap_git, packs, reuse);
 
 	*entries = bitmap_popcount(reuse);
 	if (!*entries) {
@@ -2014,7 +2070,7 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 	 * need to be handled separately.
 	 */
 	bitmap_and_not(result, reuse);
-	*packfile_out = pack;
+	*packfile_out = packs[0].p;
 	*reuse_out = reuse;
 	return 0;
 }
-- 
2.43.0.102.ga31d690331.dirty


^ permalink raw reply related

* [PATCH v2 06/26] midx: implement `midx_locate_pack()`
From: Taylor Blau @ 2023-12-14 22:23 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>

The multi-pack index API exposes a `midx_contains_pack()` function that
takes in a string ending in either ".idx" or ".pack" and returns whether
or not the MIDX contains a given pack corresponding to that string.

There is no corresponding function to locate the position of a pack
within the MIDX's pack order (sorted lexically by pack filename).

We could add an optional out parameter to `midx_contains_pack()` that is
filled out with the pack's position when the parameter is non-NULL. To
minimize the amount of fallout from this change, instead introduce a new
function by renaming `midx_contains_pack()` to `midx_locate_pack()`,
adding that output parameter, and then reimplementing
`midx_contains_pack()` in terms of it.

Future patches will make use of this new function.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c | 13 +++++++++++--
 midx.h |  5 ++++-
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/midx.c b/midx.c
index de25612b0c..beaf0c0de4 100644
--- a/midx.c
+++ b/midx.c
@@ -428,7 +428,8 @@ static int cmp_idx_or_pack_name(const char *idx_or_pack_name,
 	return strcmp(idx_or_pack_name, idx_name);
 }
 
-int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name)
+int midx_locate_pack(struct multi_pack_index *m, const char *idx_or_pack_name,
+		     uint32_t *pos)
 {
 	uint32_t first = 0, last = m->num_packs;
 
@@ -439,8 +440,11 @@ int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name)
 
 		current = m->pack_names[mid];
 		cmp = cmp_idx_or_pack_name(idx_or_pack_name, current);
-		if (!cmp)
+		if (!cmp) {
+			if (pos)
+				*pos = mid;
 			return 1;
+		}
 		if (cmp > 0) {
 			first = mid + 1;
 			continue;
@@ -451,6 +455,11 @@ int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name)
 	return 0;
 }
 
+int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name)
+{
+	return midx_locate_pack(m, idx_or_pack_name, NULL);
+}
+
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local)
 {
 	struct multi_pack_index *m;
diff --git a/midx.h b/midx.h
index b404235db5..89c5aa637e 100644
--- a/midx.h
+++ b/midx.h
@@ -70,7 +70,10 @@ struct object_id *nth_midxed_object_oid(struct object_id *oid,
 					struct multi_pack_index *m,
 					uint32_t n);
 int fill_midx_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e, struct multi_pack_index *m);
-int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name);
+int midx_contains_pack(struct multi_pack_index *m,
+		       const char *idx_or_pack_name);
+int midx_locate_pack(struct multi_pack_index *m, const char *idx_or_pack_name,
+		     uint32_t *pos);
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local);
 
 /*
-- 
2.43.0.102.ga31d690331.dirty


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox